* Changes for bfloat16 ZeRO-2
* ZeRO stage3 optimizations, with some bug fixes
optimizations for stage3:
- prefetching improvements
- batching allgather calls to amortize fixed overhead and improve bandwidth utilization (see the sketch after this list)
- batching reduce-scatter calls to amortize fixed overhead and improve bandwidth utilization
- using *_base variants of allgather and reduce-scatter to reduce memory allocations and data movement
- more fine-grained synchronization for communication, allowing blocking on less work
- precomputation of fetching code - using a fetch queue rather than deciding what to (pre)fetch at each iteration
- limiting queued coalesced communication ops to reduce memory pressure on the pytorch cuda caching allocator (not an elegant solution)
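The batching and *_base items above amount to something like the following minimal sketch (illustrative only, not the DeepSpeed code; it assumes flat 1-D shards and the newer all_gather_into_tensor API, which older torch exposes as _all_gather_base):

```python
# Minimal sketch: one flat *_base all-gather instead of many per-parameter
# list-based all_gather calls, to amortize launch overhead.
import torch
import torch.distributed as dist

def batched_allgather_params(shards, group=None):
    """All-gather a list of 1-D parameter shards with a single collective call."""
    world_size = dist.get_world_size(group)
    # Flatten all local shards into one contiguous send buffer.
    flat_in = torch.cat([s.view(-1) for s in shards])
    flat_out = torch.empty(flat_in.numel() * world_size,
                           dtype=flat_in.dtype, device=flat_in.device)
    # Single *_base-style collective; async so the caller can overlap compute.
    handle = dist.all_gather_into_tensor(flat_out, flat_in, group=group, async_op=True)
    return flat_out, handle  # wait() on the handle before un-flattening per parameter
```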
optimizations for stage3-offload:
- made some host-to-device tensor copies asynchronous to improve performance (see the sketch below)
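A minimal sketch of that kind of async copy, assuming pinned host buffers and a dedicated copy stream (illustrative only):

```python
# Illustrative only: overlapping a host-to-device copy with compute by using pinned
# (page-locked) host memory, a side CUDA stream, and non_blocking=True.
import torch

copy_stream = torch.cuda.Stream()

def async_h2d_copy(host_tensor, device_buffer):
    assert host_tensor.is_pinned(), "async H2D copies need pinned host memory"
    with torch.cuda.stream(copy_stream):
        device_buffer.copy_(host_tensor, non_blocking=True)
    done = torch.cuda.Event()
    done.record(copy_stream)   # consumers wait on this event only when the data is needed
    return done
```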
bug fixes and qol improvements:
- fix init context method when parent modules modify child weights
- speed up model initialization by moving the model to GPU before weight initialization
- fixed unit test imports so that unit tests can be run from any directory
- change performance logging to include memory consumption
- add logging w/ model size when done partitioning model
new features:
- bfloat16 support for ZeRO 3
* fix import in unit tests
* ran yapf
* improvements to cache flush warn log
* backwards compatibility with older versions of pytorch
* handle edge case where reduced tensor smaller than world size
* moved event synchronization to allgather handle wait() call
* removed unnecessary barrier call
* formatting fix after resolving merge conflict
* skip nvme prefetch when trace not complete
* opportunistically avoid memory allocation in allgather coalesced where possible
* fix indentation after merge
* fixes to account for parameter offload
* accounting for torch.cuda.memory_stats not being available
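A minimal sketch of that kind of guard (illustrative only, not the actual helper):

```python
# Guard for older torch builds where torch.cuda.memory_stats does not exist yet,
# falling back to the coarser counter that is available.
import torch

def cuda_memory_report():
    if hasattr(torch.cuda, "memory_stats"):
        return torch.cuda.memory_stats()
    # older releases: fall back to the single counter that has been around much longer
    return {"allocated_bytes.all.current": torch.cuda.memory_allocated()}
```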
* moved partition_all_params to optimizer step
* allgather params before item() gets called
* fix param status checks needed after moving the partition_all_parameters call to optimizer step
* fix grad accumulation with optimizer offload
* grad norm computation fix for optimizer offload
* change post divide in reduce-scatter to pre divide
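Pre-dividing by world size before the reduce-scatter keeps low-precision partial sums from overflowing, since the division happens before the summation rather than after it. A minimal sketch (illustrative; assumes a flat gradient whose length divides evenly by world size, and uses reduce_scatter_tensor, the newer name for the older _reduce_scatter_base collective):

```python
import torch
import torch.distributed as dist

def reduce_scatter_avg(flat_grad, group=None):
    world_size = dist.get_world_size(group)
    flat_grad.div_(world_size)          # pre-divide: partial sums stay small in fp16/bf16
    out = torch.empty(flat_grad.numel() // world_size,
                      dtype=flat_grad.dtype, device=flat_grad.device)
    dist.reduce_scatter_tensor(out, flat_grad, op=dist.ReduceOp.SUM, group=group)
    return out                          # already the averaged local partition
```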
* fix gradient race condition w/ optimizer offload
* improve inf/nan gradient tracking
* don't prefetch when not in training mode
* format fix after merging
* fix prefetching issue when using NVME offload
* improved defragmentation for fp16 parameters
* relative imports for bf16 tests
* changes for bwd compatibility with pytorch 1.2
* remove buffered_reduce_fallback
* removed unused parameter offset bookkeeping
* fixed tracking for multiple param groups
* unbroke bfloat16 config after merge conflict
* using base allgather params when only 1 param
* cleanup/fixes for fp16 partition defragmentation
* switch to CRLF
* convert to same new-line style as master
* align new line with master
* Fix merge issues
* switch to CRLF
* fix to LF line endings
* minor merge fixes
* remove extra bfloat16_enabled definition
* asserting params inflight for AllGatherHandle
* remove get_cuda_mem_allocated_str
* Format fixes
* fix bfloat16 zero stage check (broken after merge commit)
* +self.communication_data_type, -self.allreduce_always_fp32; delete dead code
* Add self.reduce_scatter
* Format fix
* Fix merge issues
* iterate over params_to_fetch rather than make another iterator
* add some TODOs
* remove unnecessary division by micro_step_id
* rename config keys "bfloat16" -> "bf16"
* rename stage3_gather_fp16_weights_on_model_save -> stage3_gather_16bit_weights_on_model_save
* add unit test to check backwards compatibility for gather_16bit_weights
* added test to confirm bf16 key bwd compatibility
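For reference, a config fragment using the renamed keys (illustrative values; per the tests above, the old spellings remain accepted for backwards compatibility):

```python
ds_config = {
    "bf16": {"enabled": True},  # formerly "bfloat16"
    "zero_optimization": {
        "stage": 3,
        # formerly stage3_gather_fp16_weights_on_model_save
        "stage3_gather_16bit_weights_on_model_save": True,
    },
}
```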
* Format fixes
Co-authored-by: Rana Ali Amjad <raamjad@amazon.com>
Co-authored-by: Justin Chiu <justchiu@amazon.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
* Fix typos in docs/
* Fix typos in code comments and output strings
* Fix typos in the code itself
* Fix typos in tests/
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
* add live zero checkpoint to fp32 consolidation version
* some more docs
* zero2 model states uses a different filename
* fix
* make debug mode cli configurable
* copy the script only on node 0 process 0
* validate that we have the right number of files
* revamp _get_zero_param_shapes, instrument with easier debug
* correct assertion
* rename API; add even simpler API
* style
* improve docs
* update the docs
* revert the unpartitioned_params detection and report as it's most likely persistent params
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Authors: @awan-10 @conglongli @samyam @jeffra
What's new:
NCCL-based implementation, which provides better performance and usability than the MPI-based implementation (see the config sketch below).
Add support for momentum masks for parameters with constant zero gradients during training.
Bug fixes (e.g., #813).
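A minimal config sketch selecting the NCCL backend for 1-bit Adam (field names taken from the 1-bit Adam docs; values are placeholders):

```python
ds_config = {
    "train_batch_size": 32,
    "optimizer": {
        "type": "OneBitAdam",
        "params": {
            "lr": 1e-4,
            "freeze_step": 1000,         # full-precision Adam warm-up before compression kicks in
            "comm_backend_name": "nccl", # the NCCL-based backend added here (vs. "mpi")
        },
    },
    "fp16": {"enabled": True},
}
```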
* NCCL-based 1-bit Adam + Code Refactor for Comm. Backends (#594)
* NCCL based 1-bit Implementation + Refactor to add communication backends (#593)
* add nccl 1-bit optim.
* temporary commit to save stuff.
* Use dist collectives instead of mpi routines.
* remove old code for comm.
* Fix bugs. still does not work.
* modify to test the nccl side code path
* Initial gather impl. Works intra-node.
* Updates to comm. phase 2; nccl comm. passed the tests.
* refactor code to introduce nccl/mpi as backends for onebit adam.
* Refactor updates to test/engine.
* Fix compile/runtime errors.
* simplify support for nccl/mpi backends.
* Add missing file
* Add compression backend in constructor. Revert later.
* modify test with some perf counting.
* Implement a true non-blocking gather for nccl side.
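In spirit, a non-blocking gather can be built from async torch.distributed collectives, e.g. (illustrative only; the actual backend code differs):

```python
import torch
import torch.distributed as dist

def nonblocking_gather(tensor, group=None):
    world_size = dist.get_world_size(group)
    gather_list = [torch.empty_like(tensor) for _ in range(world_size)]
    handle = dist.all_gather(gather_list, tensor, group=group, async_op=True)
    return gather_list, handle   # wait() on the handle right before the data is consumed
```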
* Revert "Add compression backend in constructor. Revert later."
This reverts commit df8c40d3105e9f2542a8aa6619e80d675a09753f.
* improve the 1-bit adam test.
* Refactor comm. and compression backend in 1-bit adam.
* Fix the test.
* Fix runtime errors and typos in nccl backend
* fix mpi backend. modify tests.
* modify nccl perf test.
* fix mpi side errors.
* Add an mpi perf test
* Sync DSE.
* Remove old collectives file.
* Fix a typo.
* Graceful failure for torch versions that don't support nccl pt2pt.
* Revert "Merge branch 'master' into staging-1bit-nccl-v2"
This reverts commit 78400850703b4b2d84f11b73c109f56919e748ea, reversing
changes made to a6dba72aeafad63661dfe566d3accd03d00be78c.
* Revert "Revert "Merge branch 'master' into staging-1bit-nccl-v2""
This reverts commit 6dbdd9858bafef4d340c089fdc0e3ddde3706f47.
* comm optimization + 1-bit lamb
* Saving/debugging commit.
* finalizing 1-bit lamb
* finalizing 1-bit lamb
* add momentum mask and chkpt handling for 1-bit adam
* Cleanup and modify nccl test to be runnable with deepspeed launcher.
* Fix format.
* fix formatting again.
* make test runnable without mpi4py
* Add dist.alltoall and dist.allgather instead of custom functions.
* remove debug prints.
* formatting and renaming
* renaming
* renaming
* add unit test, fix existing tests
* skip unit test when torch < 1.8
* revert 1-bit lamb
* flatten momentum when dimension is more than 1
* add warning message for 1-bit adam under fp32
* improve version check
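The version guards are along these lines (a sketch, not the actual helper):

```python
import torch
from packaging import version

# NCCL point-to-point (send/recv) landed in torch 1.8, so older builds fall back or skip.
TORCH_HAS_NCCL_P2P = version.parse(torch.__version__.split("+")[0]) >= version.parse("1.8")

def require_nccl_p2p():
    if not TORCH_HAS_NCCL_P2P:
        raise RuntimeError("the NCCL 1-bit backend needs torch >= 1.8 (NCCL point-to-point support)")
```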
* add fp32 test
* 1-bit adam doc
* fix file name
* doc fix
* torch 1.8 is released
* doc fix
* fix tests
* update news
* add doc for momentum mask
* fix checkpoint handling, add unit test
* checkpoint handling doc
* doc final cleanup
* bump dates
* update tests
* url change
* doc fix
* fix test
* doc update
Co-authored-by: Ammar Ahmad Awan <ammar.awan@microsoft.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
* add optimizers and schedules to rtd
* update ds website and fix links
* add flops profiler to rtd
* fix
Co-authored-by: Shaden Smith <Shaden.Smith@microsoft.com>
* add AdamW to the CPU-Adam implementation
* support the cpu-adam optimizer for ZeRO-Offload on the deepspeed side
* bump DSE to match cpu-adam updates
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
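The AdamW variant differs from classic Adam only in where weight decay is applied; roughly (a plain-Python sketch of the decoupled decay with an adamw_mode-style switch, not the C++ kernel itself):

```python
def cpu_adam_step(p, g, m, v, lr, beta1, beta2, eps, wd, step, adamw_mode=True):
    if not adamw_mode:
        g = g + wd * p                      # classic Adam folds (L2) decay into the gradient
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g * g
    m_hat = m / (1 - beta1 ** step)
    v_hat = v / (1 - beta2 ** step)
    p = p - lr * m_hat / (v_hat ** 0.5 + eps)
    if adamw_mode:
        p = p - lr * wd * p                 # AdamW decays the weights directly (decoupled)
    return p, m, v
```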