* add support for SwanLabTracker and update related documentation
* add emoji in FRAMEWORK
* apply the style corrections and quality control
* add support for SwanLabTracker in tests
* fix bug in test_tracking
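For context on the SwanLabTracker commits above, here is a minimal sketch of how such a tracker is typically driven through Accelerate's generic logging API. The `"swanlab"` identifier and the metric names are assumptions, not confirmed by the log.

```python
# Minimal sketch: driving the new SwanLabTracker through Accelerate's generic
# tracking API. The "swanlab" tracker name is an assumption based on how other
# trackers ("wandb", "tensorboard", ...) are selected via log_with.
from accelerate import Accelerator

accelerator = Accelerator(log_with="swanlab")  # assumed tracker identifier
accelerator.init_trackers(project_name="demo-project", config={"lr": 3e-4})

for step in range(10):
    loss = 1.0 / (step + 1)  # placeholder metric
    accelerator.log({"train_loss": loss}, step=step)

accelerator.end_training()
```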
* init
* style
* is_hpu_available
* fix
* import habana_frameworks.torch.distributed.hccl
* style
* test
* initialize dist proc group
* revert
* set backend to hccl only if hccl initialization sets a local rank
* force backend hccl and multi_hpu type when sure of distributed launch
* style
* pass accelerator tests
* pass big modeling tests with bigger atol/rtol for accelerators
* fix hpu device count and skip tests requiring hpu:x
* hpu autocast
* hpu rng_state
* hpu launch
* hpu special device placement
* hpu launch
* rng state
* distributed data loop tests
* enforce non-contiguity after device memory allocation
* pass fsdp tests
* enforce pt_hpu_lazy_mode=0 when fsdp testing
* pass cli tests
* pass and document grad sync tests
* pass kwargs handler and autocast tests
* memory utils
* found source of int64 errors
* skip some modeling utils tests
* enable int64
* skip optimizer tests
* pass checkpointing tests
* pass accelerator tests with safetensors main
* more hpu stuff
* style
* remove PT_HPU_LAZY_MODE and PT_ENABLE_INT64_SUPPORT as they should be in the testing environment
* start testing on gaudi2
* support fp16 on gaudi2
* add testing order
* custom hpu fsdp env dict
* fix torch trace malloc
* test ddp half precision comm hooks
* fix
* fix
* remove lower bound for hpu
* use 0.72 as lower bound
* lower the lower bound
* order deepspeed tests
* fix
* deepspeed_use_hpu
* assert non-lazy mode with offloaded optimizer
* make patching torch with habana frameworks the default
* less of require_non_hpu
* skip test_multi_device_merge_fsdp_weights for now as it halts
* skip another flaky test
* format
* use habana_visible_modules
* patch torch hpu device count
* avoid setting HABANA_VISIBLE_MODULES
* don't play with habana visible devices/modules
* only with hpu
* fixes and skips
* skip
* fix device ids and add some todos
* skip offloading with generate()
* fix
* reduced atol/rtol for hpu
* fix
* tag deepspeed tests that should run first
* enable a test path that was skipped
* revert a test that was customized for gaudi1
* some patching to enable HABANA_VISIBLE_MODULES
* fix zero3 test
* misc
* test DTensor TP
* remove gaudi1
* test
* style
* comment
* pass pad_across_processes
* require_fp16
* pass memory utils test
* test_ddp_comm_hook
* skip half precision comm hooks on hpu
* fix
* is_fp16_available
* fp16
* tp as part of integration tests
* fix
* write_basic_config
* safetensors
* local sgd and masked_fill_fwd_i64
* fix num_processes in test_load_states_by_steps
* fp8 support
* test
* fix
* add a workflow
* Update src/accelerate/accelerator.py
* review comments
* ci
* style
* comments
* test
* habana_frameworks.torch
* patch device count
* fix
* fix
* require_fp8
* fix
* fix
* gaudi 1
* remove unnecessary
* fixed masked_fill error in transformers
* style
* balanced_memory pass on hpu
* remove for now
* run first
* Apply suggestions from code review
* style after merge
* Update src/accelerate/accelerator.py
Co-authored-by: Zach Mueller <muellerzr@gmail.com>
* Update src/accelerate/utils/transformer_engine.py
Co-authored-by: Zach Mueller <muellerzr@gmail.com>
* empty cache review comments
* test_script.py error messages
* AccelerateTestCase for accelerator state cleanup
* test
* add gaudi1 workflow
* fp8 availability
* fix
* reduce batch size
* concurrency
* check cuda as well
* nits and comments
* mark fsdp tests that require_fp16
* style
* mark deepspeed fp16 tests
* update image
* fix
* updated
* better msgs
* skip pippy
* test
* test on 2 devices
* support up to 1% relative error in test_accelerate
* skip hpu fp16
* allow for 1 byte difference
* revert torch_device change
* style
* skip memory release since it's flaky
* add accelerator state cleanup to fixture
* fix
* atol
* fix
* more rtol
* equal grad test
* revert
* pass pippy on gaudi2 and skip on gaudi1
* enable sd 1.5 test with require fp16
* added warning on memory release
* don't log warning in memory release as it requires PartialState to be initialized
* Apply suggestions from code review
---------
Co-authored-by: Zach Mueller <muellerzr@gmail.com>
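The HPU commits above mostly touch tests and device plumbing; as a rough sketch of what they enable on the user side, device selection stays implicit. The `is_hpu_available` import path is an assumption (the log only confirms the helper's name).

```python
# Hedged sketch of HPU support from the user side: Accelerate detects the device
# and places modules on it; the import path of is_hpu_available is an assumption
# (the commits above only confirm the helper's name).
import torch
from accelerate import Accelerator

try:
    from accelerate.utils import is_hpu_available  # assumed location
except ImportError:
    def is_hpu_available():
        return False

accelerator = Accelerator()
model = accelerator.prepare(torch.nn.Linear(8, 2))
print(f"HPU available: {is_hpu_available()}, model placed on: {accelerator.device}")
```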
* Add cross-entropy example in the gradient accumulation docs
* add example of logs
* correct skeleton code
* replace gather_for_metrics with gather
* batch_size -> per_device_batch_size
* remove main_process_only=True
* add autoregressive example in examples/
* Update docs/source/usage_guides/gradient_accumulation.md
Co-authored-by: Marc Sun <57196510+SunMarc@users.noreply.github.com>
* ruff format
* add grad accum test
* update docs
* Update examples/by_feature/gradient_accumulation_for_autoregressive_models.py
Co-authored-by: Zach Mueller <muellerzr@gmail.com>
* update tests
---------
Co-authored-by: Marc Sun <57196510+SunMarc@users.noreply.github.com>
Co-authored-by: Zach Mueller <muellerzr@gmail.com>
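The gradient-accumulation commits above revolve around the `accumulate()` pattern. A minimal sketch of that pattern follows; the toy model and data are illustrative, and the full by_feature example additionally normalizes the cross-entropy loss over tokens for autoregressive models.

```python
# Minimal sketch of the gradient-accumulation pattern the docs above describe.
# The toy model/data are illustrative; the real example also rescales the
# cross-entropy loss across accumulation steps for autoregressive training.
import torch
from accelerate import Accelerator

accelerator = Accelerator(gradient_accumulation_steps=4)
model = torch.nn.Linear(16, 4)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
dataset = torch.utils.data.TensorDataset(torch.randn(64, 16), torch.randint(0, 4, (64,)))
dataloader = torch.utils.data.DataLoader(dataset, batch_size=8)  # per-device batch size

model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

for inputs, targets in dataloader:
    with accelerator.accumulate(model):  # skips gradient sync until the 4th micro-step
        loss = torch.nn.functional.cross_entropy(model(inputs), targets)
        accelerator.backward(loss)
        optimizer.step()
        optimizer.zero_grad()
```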
* v1
* More testing, need to try on H100
* Bigger batch for h100 test
* test tweak
* Fixup all tests!
* Bookmark
* Fix issues, working now
* rm num samples
* Uncomment
* Give stateful dl end of dl
* Make skip DL stateful
* Migrate to update_state_dict
* try/finally
* Add comments to test
* rm comment
* Document
* refactor out for eventual override
* Doc nit
* Brute force it
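A rough sketch of the stateful-DataLoader behavior these commits iterate on, assuming it is switched on through `DataLoaderConfiguration(use_stateful_dataloader=True)` (which requires `torchdata`). The snapshot/resume calls below are how I read the `update_state_dict` commits, not a verbatim API reference.

```python
# Hedged sketch: with use_stateful_dataloader enabled, the prepared loader is
# expected to expose state_dict()/load_state_dict() so iteration can resume
# mid-epoch. Requires torchdata's StatefulDataLoader; toy data is illustrative.
import torch
from accelerate import Accelerator
from accelerate.utils import DataLoaderConfiguration

accelerator = Accelerator(dataloader_config=DataLoaderConfiguration(use_stateful_dataloader=True))
dataloader = accelerator.prepare(torch.utils.data.DataLoader(torch.arange(32).float(), batch_size=4))

saved_state = None
for i, batch in enumerate(dataloader):
    if i == 2:
        saved_state = dataloader.state_dict()  # snapshot mid-epoch
        break

dataloader.load_state_dict(saved_state)  # resume from the snapshot on restart
```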
* Add ddp comm hook
* Fix dataclass order
* Merge ddp grad hook to ddp kwargs handler
* Reset ddp kwargs key
* Add test
* Fix test case
* Split ddp grad test
* Fix test case
* Enhance docstring
* Minor
* Use naive BaseEnum for ddp comm hook type
* Add by feature example
* Add multi device deco
* Add user guide
* Update examples/by_feature/ddp_comm_hook.py
Co-authored-by: Zach Mueller <muellerzr@gmail.com>
* Update examples/by_feature/ddp_comm_hook.py
Co-authored-by: Zach Mueller <muellerzr@gmail.com>
* Add wrapper and state option details
* Update toctree
* Update docs/source/usage_guides/ddp_comm_hook.md
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
* Update docs/source/usage_guides/ddp_comm_hook.md
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
* Update docs/source/usage_guides/ddp_comm_hook.md
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
* Update docs/source/usage_guides/ddp_comm_hook.md
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
* Update docs/source/usage_guides/ddp_comm_hook.md
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
* Update docs/source/usage_guides/ddp_comm_hook.md
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
* Update docs/source/usage_guides/ddp_comm_hook.md
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
* Mv ddp comm hook index
* Fix ddp comm hook user guide
* Del empty line
---------
Co-authored-by: Zach Mueller <muellerzr@gmail.com>
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
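The ddp_comm_hook commits above merge hook selection into the DDP kwargs handler; a hedged sketch of that usage follows. The enum name `DDPCommunicationHookType` and the `comm_hook` field are inferred from the commit messages; the user guide added above is the authoritative reference.

```python
# Hedged sketch of selecting a DDP communication hook through the kwargs handler.
# DDPCommunicationHookType and the comm_hook field are assumptions drawn from the
# commits above; consult the ddp_comm_hook user guide for the exact names.
import torch
from accelerate import Accelerator, DistributedDataParallelKwargs
from accelerate.utils import DDPCommunicationHookType  # assumed import path

ddp_kwargs = DistributedDataParallelKwargs(comm_hook=DDPCommunicationHookType.FP16)
accelerator = Accelerator(kwargs_handlers=[ddp_kwargs])

model = accelerator.prepare(torch.nn.Linear(8, 2))  # hook registers when DDP wraps the model
```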
* Let's try it out
* Let's try this out
* Some more cases
* String
* Require hub online for estimator
* Add CI checker to alert on hub status
* Format
* Oops death by ctrl z
* Fix import
* Fix tests
* Fixup tests
* Fix test
* Actually cast to string!
* Fixup deepspeed
* fsdp and deepspeed fix
* Since we're doing this, may as well get it all
* Stragglers
* Split only if we require config_file
* Make list
* Only convert if it's a path
* type
* Other func
* rm parenth
* early stopping
* Fix tests
* Works on multi-gpu, uncomment
* Rm reset
* Check for >=1
* equal
* Trigger
* Fix test
* Update docs/source/concept_guides/deferring_execution.md
Co-authored-by: Benjamin Bossan <BenjaminBossan@users.noreply.github.com>
* Explicit example loop
* Set to zero, not None
* rename test
* Check again to ensure it's been reset
---------
Co-authored-by: Benjamin Bossan <BenjaminBossan@users.noreply.github.com>
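The early-stopping commits above are about keeping processes in sync when one of them decides to stop. Below is a minimal sketch of the trigger pattern from the deferring-execution guide, assuming `set_trigger`/`check_trigger` are the methods this work settles on.

```python
# Minimal sketch of cross-process early stopping: one rank sets a trigger, every
# rank checks it, so all processes leave the loop together. set_trigger/check_trigger
# are taken from the deferring-execution guide referenced above.
from accelerate import Accelerator

accelerator = Accelerator()

for step in range(100):
    loss = 1.0 / (step + 1)  # placeholder metric
    if accelerator.is_main_process and loss < 0.05:
        accelerator.set_trigger()      # only the monitoring rank flips the flag
    if accelerator.check_trigger():    # every rank checks, keeping them in sync
        break
```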
* Megatron-LM integration
* add code and resolve comment
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
* add code
* add code
* fix many 🐛
* add code
* add code and reverting tracker processes
* updating logging utilities, fixing Pipeline Parallelism and dataset/dataloader 🐛s
1. Fixing bugs related to Pipeline Parallelism.
2. Fixing bugs related to dataloaders/datasets.
3. Fixing logging utilities so that all logging and tracking happen on the last process when using Megatron.
* addressing comments
* resolving comments
* update code
* refactoring and adding code to support custom implementations of the `AbstractTrainStep` class
* minor change
* Many fixes for supporting custom TrainStep and Megatron Indexed Datasets
* Add code, 🐛 fixes and an initial doc file with headings
* fixing a big 🐛 related to loading checkpoints
* adding doc and an example
* example test CI
* docs
* more docs
* more doc changes
* more doc changes
* docs
* more docs
* doc fixing
* trying whether we can directly import megatronlm utils
* doc fixing and throwing error if megatron isn't available.
* resolving comments
* fixes to bert and t5 and more docs
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
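As a rough orientation for the Megatron-LM commits above: the training script keeps the standard Accelerate shape while tensor/pipeline parallel degrees come from a plugin, normally filled in through `accelerate config`. The plugin fields and keyword below are assumptions based on the commit messages, not a verified signature.

```python
# Hedged sketch of the Megatron-LM integration from the script side. The
# MegatronLMPlugin field names (tp_degree, pp_degree) and the Accelerator kwarg
# are assumptions; a custom AbstractTrainStep can reportedly be plugged in as well.
from accelerate import Accelerator
from accelerate.utils import MegatronLMPlugin  # assumed import path

megatron_plugin = MegatronLMPlugin(tp_degree=2, pp_degree=2)  # assumed fields
accelerator = Accelerator(megatron_lm_plugin=megatron_plugin)  # assumed kwarg

# model, optimizer, scheduler and dataloaders then go through accelerator.prepare()
# as usual; per the commits above, logging and tracking happen on the last process.
```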
* deepspeed revamp
* Update dataclasses.py
* Update deepspeed.py
* quality
* fixing code
* quality
* Fix imports
* saving 16bit model in zero stage 3
1. Saving 16bit model in zero stage 3
2. zero init in stage 3 support using HFDeepSpeedConfig
* quality
* adding test and fixing bugs
* update makefile for deepspeed tests
* Update test.yml
* adding `deepspeed` as requirement for tests
* Apply suggestions from code review
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
* quality
* addressing comments
* add example and minor updates
1. Add example to show the usage of a config file with revamped deepspeed support.
2. Update required deepspeed version to 0.6.5.
3. Revert `reinit` change as it is not required.
4. Raise an exception when using `clip_grad_value` with DeepSpeed/FSDP.
* Documentation and Zero-3 Inference Support
1. Changes to support ZeRO Stage-3 inference.
2. Minor bug fixes.
3. Documentation.
* doc fix
* Apply suggestions from code review
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
* addressing comments
* update doc to address comments and bug fixes
1. Update tests and add a new one testing the autofill functionality of the `prepare` method.
2. Fix a ZeRO-3 init bug related to HFDeepSpeedConfig.
3. Update documentation addressing comments.
* removing image and hosting it on `documentation-images` dataset
* check for hidden_size for zero_opt heuristics
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
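A hedged sketch of the revamped DeepSpeed path these commits describe, assuming DeepSpeed is installed and the script is started with `accelerate launch`; the plugin field names reflect my reading of the commits and should be checked against the docs.

```python
# Hedged sketch of the revamped DeepSpeed support: configure via a plugin (or a
# DeepSpeed JSON config through `accelerate config`) and let prepare() autofill
# the rest, as exercised by the autofill test mentioned above. Field names are
# assumptions; requires DeepSpeed and an `accelerate launch` start.
from accelerate import Accelerator
from accelerate.utils import DeepSpeedPlugin

ds_plugin = DeepSpeedPlugin(zero_stage=3, gradient_accumulation_steps=2)
accelerator = Accelerator(deepspeed_plugin=ds_plugin, mixed_precision="fp16")

# model/optimizer/dataloaders pass through accelerator.prepare(...) as usual; under
# ZeRO stage 3 the 16-bit weights are gathered via accelerator.get_state_dict(model)
# before saving, per the "saving 16bit model in zero stage 3" commit above.
```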
* Introduce nightly builds
* Fixup docker images slightly
* Make the device-count-specific test use `torch.cuda.device_count()` rather than `Accelerator.num_processes` to avoid a bug.
* Create peak_memory_uasge_tracker.py
Adding a by_feature example for tracking peak GPU memory usage. One use case is to track the peak memory reduction when using FSDP.
* fixing the typo in the file name
* reformatting
* exclude peak_memory_usage_tracker.py from tests
* renaming and highlighting proper usage
* Update test_examples.py
😅
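For the peak-memory example above, the repo's by_feature script defines its own `TorchTracemalloc` helper; the following is a simplified stand-in built only on standard `torch.cuda` counters, to show the idea.

```python
# Simplified stand-in for the example's TorchTracemalloc helper: track the peak
# GPU memory delta around a block using standard torch.cuda counters.
import contextlib
import torch


@contextlib.contextmanager
def track_peak_memory(device=0):
    torch.cuda.reset_peak_memory_stats(device)
    start = torch.cuda.memory_allocated(device)
    yield
    peak = torch.cuda.max_memory_allocated(device)
    print(f"Peak memory delta: {(peak - start) / 2**20:.1f} MiB")


if torch.cuda.is_available():
    with track_peak_memory():
        x = torch.randn(1024, 1024, device="cuda")
        y = x @ x  # some work whose peak usage we want to measure
```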