* Cleanup: context parallel
* Feat: cleanup
* Feat: concept guide
* Fix: rename + version check
* Style
* Fix: add to namespace in a test
* Fix: add skip_if on dataclass tests
* Fix: proper version for version check
* Feat: add tests and cleanup
* Fix: properly version check added tests
* Feat: address comments
* Fix: add both shift_labels and labels to make the model.forward calculate loss
* Fix: remove import, improve comment
* Fix: final checks
* Fix: style
* Fix: style
* add support for SwanLabTracker and update related documentation
* add emoji in FRAMEWORK
* apply the style corrections and quality control
* add support for SwanLabTracker in tests
* fix bug in test_tracking
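For context, a tracker added this way is normally enabled through `log_with`; a minimal sketch, assuming the tracker is registered under the name `"swanlab"` and the swanlab package is installed (project name and metrics below are placeholders, not values from these commits):

```python
from accelerate import Accelerator

# Minimal sketch: enable the SwanLab tracker by name (assumes it is
# registered as "swanlab" and that swanlab itself is installed).
accelerator = Accelerator(log_with="swanlab")
accelerator.init_trackers(project_name="my_project", config={"lr": 3e-4})

accelerator.log({"train_loss": 0.42}, step=1)
accelerator.end_training()
```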
* Fix double wrap
* Clocking off, ~equal to torch baseline
* works?
* Working version
* Partial rewrite
* FSDP2 path works
* Fix back prepare
* Almost done, proper AC left
* Feat: should work, cleanup + test more benchmarks left
* Style+quality
* Feat: fp8 example
* Feat: better example
* Feat: add readme
* Docs + should be done
* Fix: typos
* Fix: protect imports
* Feat: address comments
* Feat: add flops image
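For context, the fp8 example mentioned above builds on Accelerate's mixed-precision switch; a minimal sketch, assuming supported hardware and an fp8 backend such as TransformerEngine are available (the model, optimizer, and data below are placeholders, not the benchmark from this PR):

```python
import torch
from accelerate import Accelerator

# Minimal sketch: request fp8 mixed precision and run one training step.
accelerator = Accelerator(mixed_precision="fp8")

model = torch.nn.Linear(128, 128)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
model, optimizer = accelerator.prepare(model, optimizer)

x = torch.randn(8, 128, device=accelerator.device)
loss = model(x).mean()
accelerator.backward(loss)
optimizer.step()
optimizer.zero_grad()
```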
* init
* style
* is_hpu_available
* fix
* import habana_frameworks.torch.distributed.hccl
* style
* test
* initialize dist proc group
* revert
* set backend to hccl only if hccl initialization sets a local rank
* force backend hccl and multi_hpu type when sure of distributed launch
* style
* pass accelerator tests
* pass big modeling tests with bigger atol/rtol for accelerators
* fix hpu device count and skip tests requiring hpu:x
* hpu autocast
* hpu rng_state
* hpu launch
* hpu special device placement
* hpu launch
* rng state
* distributed data loop tests
* enforce non-contiguity after device memory allocation
* pass fsdp tests
* enforce pt_hpu_lazy_mode=0 when fsdp testing
* pass cli tests
* pass and document grad sync tests
* pass kwargs handler and autocast tests
* memory utils
* found source of int64 errors
* skip some modeling utils tests
* enable int64
* skip optimizer tests
* pass checkpointing tests
* pass accelerator tests with safetensors main
* more hpu stuff
* style
* remove PT_HPU_LAZY_MODE and PT_ENABLE_INT64_SUPPORT as they should be in the testing environment
* start testing on gaudi2
* support fp16 on gaudi2
* add testing order
* custom hpu fsdp env dict
* fix torch trace malloc
* test ddp half precision comm hooks
* fix
* fix
* remove lower bound for hpu
* use 0.72 as lower bound
* lower the lower bound
* order deepspeed tests
* fix
* deepspeed_use_hpu
* assert non lazy mode with offloaded optimizer
* make patching torch with habana frameworks the default
* less of require_non_hpu
* skip test_multi_device_merge_fsdp_weights for now as it halts
* skip another flaky test
* format
* use habana_visible_modules
* patch torch hpu device count
* avoid setting HABANA_VISIBLE_MODULES
* don't play with habana visible devices/modules
* only with hpu
* fixes and skips
* skip
* fix device ids and add some todos
* skip offloading with generate()
* fix
* reduced atol/rtol for hpu
* fix
* tag deepspeed tests that should run first
* enable a test path that was skipped
* revert a test that was customized for gaudi1
* some patching to enable HABANA_VISIBLE_MODULES
* fix zero3 test
* misc
* test DTensor TP
* remove gaudi1
* test
* style
* comment
* pass pad_across_processes
* require_fp16
* pass memory utils test
* test_ddp_comm_hook
* skip half precision comm hooks on hpu
* fix
* is_fp16_available
* fp16
* tp as part of integration tests
* fix
* write_basic_config
* safetensors
* local sgd and masked_fill_fwd_i64
* fix num_processes in test_load_states_by_steps
* fp8 support
* test
* fix
* add a workflow
* Update src/accelerate/accelerator.py
* review comments
* ci
* style
* comments
* test
* habana_frameworks.torch
* patch device count
* fix
* fix
* require_fp8
* fix
* fix
* gaudi 1
* remove unnecessary
* fixed masked fill error in transformers
* style
* balanced_memory pass on hpu
* remove for now
* run first
* Apply suggestions from code review
* style after merge
* Update src/accelerate/accelerator.py
Co-authored-by: Zach Mueller <muellerzr@gmail.com>
* Update src/accelerate/utils/transformer_engine.py
Co-authored-by: Zach Mueller <muellerzr@gmail.com>
* empty cache review comments
* test_script.py error messages
* AccelerateTestCase for accelerator state cleanup
* test
* add gaudi1 workflow
* fp8 availability
* fix
* reduce batch size
* concurrency
* check cuda as well
* nits and comments
* mark fsdp tests that require_fp16
* style
* mark deepspeed fp16 tests
* update image
* fix
* updated
* better msgs
* skip pippy
* test
* test on 2 devices
* support up to 1% relative error in test_accelerate
* skip hpu fp16
* allow for 1 byte difference
* revert torch_device change
* style
* skip memory release since it's flaky
* add accelerator state cleanup to fixture
* fix
* atol
* fix
* more rtol
* equal grad test
* revert
* pass pippy on gaudi2 and skip on gaudi1
* enable sd 1.5 test with require fp16
* added warning on memory release
* don't log warning in memory release as it requires PartialState to be initialized
* Apply suggestions from code review
---------
Co-authored-by: Zach Mueller <muellerzr@gmail.com>
* Add cross-entropy example in the gradient accumulation docs
* add example of logs
* correct skeleton code
* replace gather_for_metrics with gather
* batch_size -> per_device_batch_size
* remove main_process_only=True
* add autoregressive example in examples/
* Update docs/source/usage_guides/gradient_accumulation.md
Co-authored-by: Marc Sun <57196510+SunMarc@users.noreply.github.com>
* ruff format
* add grad accum test
* update docs
* Update examples/by_feature/gradient_accumulation_for_autoregressive_models.py
Co-authored-by: Zach Mueller <muellerzr@gmail.com>
* update tests
---------
Co-authored-by: Marc Sun <57196510+SunMarc@users.noreply.github.com>
Co-authored-by: Zach Mueller <muellerzr@gmail.com>
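For context, the gradient-accumulation pattern these docs and tests exercise generally looks like the following minimal sketch (the model, optimizer, and dataloader are placeholders, not the cross-entropy or autoregressive example added in the PR):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from accelerate import Accelerator

# Minimal sketch of gradient accumulation with Accelerate.
accelerator = Accelerator(gradient_accumulation_steps=4)

model = torch.nn.Linear(10, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
dataloader = DataLoader(
    TensorDataset(torch.randn(64, 10), torch.randint(0, 2, (64,))),
    batch_size=8,  # per-device batch size
)
model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

for inputs, targets in dataloader:
    # Gradients are synchronized and stepped only every
    # `gradient_accumulation_steps` batches.
    with accelerator.accumulate(model):
        loss = torch.nn.functional.cross_entropy(model(inputs), targets)
        accelerator.backward(loss)
        optimizer.step()
        optimizer.zero_grad()
```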
* skeleton code
* fix some errors for downloading the model
* fix some tqdm errors
* fix some errors
* fix some gpu errors with torch
* fix some gpu errors with torch
* testing simple way
* testing simple way
* testing simple way
* testing simple way
* actual code
* actual code
* final testing with serialization
* add multi_gpu speech generation
* fix some comments
* fix some style and quality
* MNT Upgrade ruff to 0.6.4
Currently used version, 0.2.1, is quite old at this point.
Not a lot needed to be changed:
- Change ruff version in setup.py
- Remove deprecated ignore-init-module-imports option for ruff
- Type comparison should use is and not ==
- Use f-string instead of % formatting
- Some line wrapping and empty lines
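A small illustration of the two lint rules called out above (hypothetical snippets, not code from this PR):

```python
# Type comparison: use `is` rather than `==` (ruff E721).
value = 3
if type(value) is int:  # before: `if type(value) == int:`
    print("int")

# Prefer f-strings over %-formatting (pyupgrade-style ruff rules).
name, count = "accelerate", 2
message = f"{name} found {count} issues"  # before: "%s found %d issues" % (name, count)
```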
* Oops
* rm warning
* Take 3
* Take 4
* Annotate
* Take 6
* Updated
* Spec
* Last fix
* Don't pad input
* Finished
* Continue refactor
* Rm comment
* Adjust the err
* Start adjustment
* GPT2 works, T5 does not
* llama too now I think
* Flag the t5 example
* v1
* More testing, need to try on H100
* Bigger batch for h100 test
* test tweak
* Fixup all tests!
* Bookmark
* Fix issues, working now
* rm num samples
* Uncomment
* Give stateful dl end of dl
* Make skip DL stateful
* Migrate to update_state_dict
* try/finally
* Add comments to test
* rm comment
* Document
* refactor out for eventual override
* Doc nit
* Brute force it
* Add ddp comm hook
* Fix dataclass order
* Merge ddp grad hook to ddp kwargs handler
* Reset ddp kwargs key
* Add test
* Fix test case
* Split ddp grad test
* Fix test case
* Enhance docstring
* Minor
* Use naive baseenum for ddp comm hook type
* Add by feature example
* Add multi device deco
* Add user guide
* Update examples/by_feature/ddp_comm_hook.py
Co-authored-by: Zach Mueller <muellerzr@gmail.com>
* Update examples/by_feature/ddp_comm_hook.py
Co-authored-by: Zach Mueller <muellerzr@gmail.com>
* Add wrapper and state option details
* Update toctree
* Update docs/source/usage_guides/ddp_comm_hook.md
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
* Update docs/source/usage_guides/ddp_comm_hook.md
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
* Update docs/source/usage_guides/ddp_comm_hook.md
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
* Update docs/source/usage_guides/ddp_comm_hook.md
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
* Update docs/source/usage_guides/ddp_comm_hook.md
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
* Update docs/source/usage_guides/ddp_comm_hook.md
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
* Update docs/source/usage_guides/ddp_comm_hook.md
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
* Mv ddp comm hook index
* Fix ddp comm hook user guide
* Del empty line
---------
Co-authored-by: Zach Mueller <muellerzr@gmail.com>
Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
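For context, the comm-hook option added here is driven through the DDP kwargs handler; a minimal sketch, with the enum member name taken as an assumption (see the ddp_comm_hook user guide added in this PR for the exact options):

```python
import torch
from accelerate import Accelerator, DistributedDataParallelKwargs
from accelerate.utils import DDPCommunicationHookType

# Minimal sketch: compress the DDP gradient all-reduce to fp16 via a comm hook.
ddp_kwargs = DistributedDataParallelKwargs(comm_hook=DDPCommunicationHookType.FP16)
accelerator = Accelerator(kwargs_handlers=[ddp_kwargs])

model = torch.nn.Linear(16, 16)
# The hook is attached when prepare() wraps the model in DDP
# (i.e. under a multi-process launch).
model = accelerator.prepare(model)
```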
* Update accelerate config and launch to abstract out mpirun
* Fix var
* Documentation updates, updating the launch script to work with other MPI programs, and fixing the nlp example when using IPEX
* Style fixes
* Add a test
* Style fixes
* Formatting fix
* Updates based on review feedback.
* Remove model.train()
* Doc update
* Update doc regarding the accelerate config with the old method of mpirun and accelerate
* Fix typo in comment
* Quality and test updates
* Updates based on review feedback
* Quality fix
* Fix mock patch path
* Updates based on review feedback
* Quality fixes
* Make torch xla available on GPU
* format code
* fix documentation build error
* update according to the comments
* Replace DistributedType.TPU with DistributedType.XLA
* make all unit tests pass
* format code
* update comments
* skip test
* format code
* skip FSDPPluginIntegration for torchxla
* bring back custom_sampler_check
* fix unit tests
* format code
* format code
---------
Co-authored-by: Zach Mueller <muellerzr@gmail.com>