Commit Graph

48 Commits

Author SHA1 Message Date
e80ae08886 Empty ZeRO3 partition cache (#3060)
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
2023-03-23 17:15:34 -07:00
541e423ae6 Enable tensor fragments for zero 2 & 3 (#2727)
* Enable tensor fragments for zero 2

* Update deepspeed/utils/tensor_fragment.py

Co-authored-by: Stas Bekman <stas00@users.noreply.github.com>

* Update deepspeed/utils/tensor_fragment.py

Co-authored-by: Stas Bekman <stas00@users.noreply.github.com>

* Support offload

* Support multi-gpu

* Cleanup

* WIP

* Update deepspeed/runtime/zero/stage3.py

Co-authored-by: Stas Bekman <stas00@users.noreply.github.com>

* Support padding

* Update deepspeed/runtime/zero/stage3.py

Co-authored-by: Stas Bekman <stas00@users.noreply.github.com>

* z3 optimizer state support; aligned api

* Support frozen z3 params

* Unit tests

* Check NVMe offload capability

* Formatting

* Docs

* More docs

* More docs

* Update docs/code-docs/source/zero3.rst

Co-authored-by: Stas Bekman <stas00@users.noreply.github.com>

* More docs

* Update docs/code-docs/source/zero3.rst

Co-authored-by: Stas Bekman <stas00@users.noreply.github.com>

* More docs

* More docs

* Update docs/code-docs/source/zero3.rst

Co-authored-by: Stas Bekman <stas00@users.noreply.github.com>

* Update deepspeed/utils/tensor_fragment.py

Co-authored-by: Stas Bekman <stas00@users.noreply.github.com>

* More docs

* Support unsharded fp32 grad

* Remove debug prints

* Fix off-by-one detection of empty grads

* Update deepspeed/utils/tensor_fragment.py

Co-authored-by: Stas Bekman <stas00@users.noreply.github.com>

* Update deepspeed/utils/tensor_fragment.py

Co-authored-by: Stas Bekman <stas00@users.noreply.github.com>

* Update deepspeed/utils/tensor_fragment.py

Co-authored-by: Stas Bekman <stas00@users.noreply.github.com>

* Update deepspeed/runtime/zero/stage3.py

Co-authored-by: Stas Bekman <stas00@users.noreply.github.com>

* Fix off-by-one error

* Skip ranks with no gradient data

* Formatting

* Add license

* Fix license

---------

Co-authored-by: Stas Bekman <stas00@users.noreply.github.com>
Co-authored-by: Michael Wyatt <michaelwyatt@microsoft.com>
2023-02-27 23:40:49 -05:00
da84e60d98 add missing license info to top of all source code (#2889)
Co-authored-by: Michael Wyatt <michaelwyatt@microsoft.com>
Co-authored-by: Conglong Li <conglong.li@gmail.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
2023-02-27 11:20:41 -08:00
e4b3b610ba Refactor DS inference API. No longer need replace_method. (#2831)
Co-authored-by: Michael Wyatt <michaelwyatt@microsoft.com>
2023-02-15 23:17:02 +00:00
d923f7c895 Refactor/Pydantify monitoring config (#2640)
* pydantify monitoring configs

---------

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
2023-01-31 20:58:13 +00:00
fe6785447d Add missing Inference sub-configs (#2518) 2022-11-17 13:33:58 -08:00
43bf035cfc Update docs to autogenerate pydantic config model docs (#2509)
* update zero config docs
* add autogenerated docs for pydantic models used in ZeRO and Inference configs
2022-11-15 21:27:22 +00:00
99fde3b7a5 [memory estimators] new config args sync (#2431)
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
2022-10-19 20:31:27 -07:00
316c4a43e0 Add flake8 to pre-commit checks (#2051) 2022-07-25 16:48:08 -07:00
559fb8e515 [docs] fix broken read-the-docs build (#2075) 2022-07-06 14:23:18 -07:00
b80e5624e2 0/1 Adam optimizer (#1790)
Co-authored-by: Conglong Li <conglong.li@gmail.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
2022-03-10 21:31:18 -08:00
c0af6d90f7 Refactor MoE and Groups API to simplify model creation and management (#1798)
Co-authored-by: yaozhewei <zheweiy@berkeley.edu>
Co-authored-by: Reza Yazdani <reyazda@microsoft.com>
2022-02-28 11:46:40 -08:00
4912e0ad7e Various ZeRO Stage3 Optimizations + Improvements (including bfloat16 support) (#1453)
* Changes for bfloat16 Zero2

* ZeRO stage3 optimizations, with some bug fixes

optimizations for stage3:
- prefetching improvements
- batching allgather calls to amortize fixed overhead and improve
  bandwidth utilization
- batching reduce_scatter calls to amortize fixed overhead and
  improve bandwidth utilization
- using *_base variants of allgather and reduce scatter to reduce memory
  allocations and data movement
- more fine-grained synchronization for communication that allows
  blocking on less work
- precomputation of fetching code - using a fetch queue rather than
  deciding what to (pre)fetch at each iteration
- limiting queued coalesced communication ops to reduce memory pressure
  on the PyTorch CUDA caching allocator (not an elegant solution)

optimizations for stage3-offload:
- made some host-device tensor copies async to improve performance

bug fixes and qol improvements:
- fix init context method when parent modules modify child weights
- speed up model initialization by moving model to GPU before weight
  initialization
- fixed unit test imports so that unit tests can be run from any
  directory
- change performance logging to include memory consumption
- add logging w/ model size when done partitioning model

new features
- bfloat16 support for ZeRO 3

* fix import in ut

* ran yapf

* improvements to cache flush warn log

* backwards compatibility with older versions of pytorch

* handle edge case where reduced tensor is smaller than world size

* moved event synchronization to allgather handle wait() call

* removed unnecessary barrier call

* formatting fix after resolving merge conflict

* skip nvme prefetch when trace not complete

* opportunistically avoid memory allocation in allgather coalesced where possible

* fix indentation after merge

* fixes to account for parameter offload

* accounting for torch.cuda.memory_stats not being available

* moved partition_all_params to optimizer step

* allgathering on params before item gets called

* fix param status checks

needed after moving partition_all_parameters call to optimizer step

* fix grad accumulation with optimizer offload

* grad norm computation fix for optimizer offload

* change post divide in reduce-scatter to pre divide

* fix gradient race condition w/ optimizer offload

* improve inf/nan gradient tracking

* don't prefetch when not in training mode

* format fix after merging

* fix prefetching issue when using NVME offload

* improved defragmentation for fp16 parameters

* relative imports for bf16 tests

* changes for bwd compatibility with pytorch 1.2

* remove buffered_reduce_fallback

* removed unused parameter offset bookkeeping

* fixed tracking for multiple param groups

* unbroke bfloat16 config after merge conflict

* using base allgather params when only 1 param

* cleanup/fixes for fp16 partition defragmentation

* switch to CRLF

* convert to same new-line style as master

* align new line with master

* Fix merge issues

* switch to CRLF

* fix to LF line endings

* minor merge fixes

* remove extra bfloat16_enabled definition

* asserting params inflight for AllGatherHandle

* remove get_cuda_mem_allocated_str

* Format fixes

* fix bfloat16 zero stage check (broken after merge commit)

* +self.communication_data_type, -self.allreduce_always_fp32; delete dead code

* Add self.reduce_scatter

* Format fix

* Fix merge issues

* iterate over params_to_fetch rather than make another iterator

* add some TODOs

* remove unnecessary division by micro_step_id

* rename config keys "bfloat16" -> "bf16"

* rename stage3_gather_fp16_weights_on_model_save -> stage3_gather_16bit_weights_on_model_save

* add unit test to check backwards compatibility for gather_16bit_weights

* added test to confirm bf16 key bwd compatibility

* Format fixes

Co-authored-by: Rana Ali Amjad <raamjad@amazon.com>
Co-authored-by: Justin Chiu <justchiu@amazon.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
2022-01-20 18:14:13 -08:00
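The key renames in the commit above ("bfloat16" -> "bf16", stage3_gather_fp16_weights_on_model_save -> stage3_gather_16bit_weights_on_model_save) apply to the DeepSpeed config JSON. A minimal sketch of the post-rename shape (values illustrative, not from the commit):

```json
{
  "bf16": { "enabled": true },
  "zero_optimization": {
    "stage": 3,
    "stage3_gather_16bit_weights_on_model_save": true
  }
}
```

Per the compatibility tests added in the same commit, the older "bfloat16" and "stage3_gather_fp16_weights_on_model_save" keys remain accepted for backwards compatibility.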
fc2f378ece Improve pre-commit hooks (#1602)
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
2021-12-01 03:12:29 +00:00
a10e4811fe force set lf instead of crlf (https://github.com/pre-commit/pre-commit-hooks#mixed-line-ending) (#1598) 2021-11-29 15:41:18 -08:00
a8a17f234a Several fixes for our read-the-docs build (#1579) 2021-11-19 22:45:02 +00:00
fafc827d64 Render docs for pipe.ProcessTopology (#1505)
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
2021-11-17 20:23:11 -08:00
bda3d0e6b9 Add autotuning news post (#1565) 2021-11-15 20:08:38 -08:00
7567c76c05 Update offload parameter names (#1536)
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
2021-11-13 17:38:51 +00:00
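The commit title above does not list the renamed offload parameters; as an illustrative sketch, the current-style ZeRO offload sub-configs (which this rename is assumed to target) nest offload settings under "offload_optimizer" and "offload_param":

```json
{
  "zero_optimization": {
    "stage": 3,
    "offload_optimizer": { "device": "cpu", "pin_memory": true },
    "offload_param": { "device": "cpu", "pin_memory": true }
  }
}
```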
9caa74e577 Autotuning (#1554)
* [squash] Staging autotuning v4

Co-authored-by: Cheng Li <pistasable@gmail.com>
Co-authored-by: Minjia Zhang <minjiaz@microsoft.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

* add new extra, guard xgboost, cleanup dead files (#268)

* Fix autotuning docs (#1553)

* fix docs

* rewording the goal

* fix typos

* fix typos (#1556)

* fix typos

* fix format

* fix bug (#1557)

* fix bug

Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
Co-authored-by: Minjia Zhang <minjiaz@microsoft.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
2021-11-13 08:56:55 +00:00
e976accb57 Fix typo (#1501) 2021-10-29 09:10:19 -07:00
dd22428465 fix typos and add improvements (#1463) 2021-10-20 14:56:38 -07:00
be789b1665 Fix many typos (#1423)
* Fix typos in docs/

* Fix typos in code comments and output strings

* Fix typos in the code itself

* Fix typos in tests/

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
2021-10-01 19:56:32 -07:00
9cb64a1fc5 MoE read the docs update (#1312) 2021-08-17 10:50:08 -07:00
2a921069d7 [model weights] zero_to_fp32 multiple improvements (#1181)
* add live zero checkpoint to fp32 consolidation version

* some more docs

* zero2 model states uses a different filename

* fix

* make debug mode cli configurable

* copy the script only on node 0 process 0

* validate that we have the right number of files

* revamp _get_zero_param_shapes, instrument with easier debug

* correct assertion

* rename API; add even simpler API

* style

* docs improve

* update the docs

* revert the unpartitioned_params detection and report as it's most likely persistent params

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
2021-07-12 13:11:44 -07:00
0c1802cc8b ZeRO 2+3 memory estimators (#965)
Co-authored-by: Samyam Rajbhandari <samyamr@microsoft.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
2021-06-23 12:59:42 -07:00
0449cbd36d formatting fix 2021-05-24 23:10:59 +00:00
0d701ec4ab Add inference API to RTD (#1096)
* update with inference refs

* updates
2021-05-24 12:07:56 -07:00
67a48aaa89 1-bit LAMB optimizer (#970)
1-bit LAMB: Communication Efficient Large-Scale Large-Batch Training with LAMB's Convergence Speed.
Authors: @conglongli, @awan-10, @samyam, Hanlin Tang, Yuxiong He
Paper: https://arxiv.org/abs/2104.06069

Co-authored-by: sdtblck <46172032+sdtblck@users.noreply.github.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
2021-04-20 18:28:22 -07:00
fbece50b21 assert no Z2/Z3 with pipeline and fix some docs links (#980) 2021-04-19 11:26:17 -07:00
11279ae4d5 ZeRO-Infinity docs (#979)
* zinf tutorial

* more megatron integration docs

* ZInf + tiling docs
2021-04-19 09:19:12 -07:00
0d4a54a04d ZeRO-Infinity (#976)
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Samyam Rajbhandari <samyamr@microsoft.com>
Co-authored-by: Shaden Smith <Shaden.Smith@microsoft.com>
2021-04-18 23:45:37 -07:00
c83e49f9ed update lr scheduler doc for doing per step or epoch update (#913)
* update lr scheduler doc for doing per step or epoch update

* work

* trigger build

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
2021-04-14 14:31:39 -07:00
316992913d docs (#909)
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
2021-04-07 13:17:20 -07:00
68c8481bcf 1-bit Adam v2 (#817)
Authors: @awan-10 @conglongli @samyam @jeffra

What's new:

NCCL-based implementation which provides better performance and usability compared to the MPI-based implementation.
Add support for momentum masks for parameters with constant zero gradients during training.
Bug fixes (e.g., #813).

* NCCL-based 1-bit Adam + Code Refactor for Comm. Backends (#594)

* NCCL based 1-bit Implementation + Refactor to add communication backends (#593)

* add nccl 1-bit optim.

* temporary commit to save stuff.

* Use dist collectives instead of mpi routines.

* remove old code for comm.

* Fix bugs. still does not work.

* modify to test the nccl side code path

* Initial gather impl. Works intra-node.

* Updates to comm, phase 2; NCCL comm passed the tests.

* refactor code to introduce nccl/mpi as backends for onebit adam.

* Refactor updates to test/engine.

* Fix compile/runtime errors.

* simplify support for nccl/mpi backends.

* Add missing file

* Add compression backend in constructor. Revert later.

* modify test with some perf counting.

* Implement a true non-blocking gather for nccl side.

* Revert "Add compression backend in constructor. Revert later."

This reverts commit df8c40d3105e9f2542a8aa6619e80d675a09753f.

* improve the 1-bit adam test.

* Refactor comm. and compression backend in 1-bit adam.

* Fix the test.

* Fix runtime errors and typos in nccl backend

* fix mpi backend. modify tests.

* modify nccl perf test.

* fix mpi side errors.

* Add an mpi perf test

* Sync DSE.

* Remove old collectives file.

* Undo a typo.

* Graceful failure for torch versions that don't support nccl pt2pt.

* Revert "Merge branch 'master' into staging-1bit-nccl-v2"

This reverts commit 78400850703b4b2d84f11b73c109f56919e748ea, reversing
changes made to a6dba72aeafad63661dfe566d3accd03d00be78c.

* Revert "Revert "Merge branch 'master' into staging-1bit-nccl-v2""

This reverts commit 6dbdd9858bafef4d340c089fdc0e3ddde3706f47.

* comm optimization + 1-bit lamb

* Saving/debugging commit.

* finalizing 1-bit lamb

* finalizing 1-bit lamb

* add momentum mask and chkpt handling for 1-bit adam

* Cleanup and modify nccl test to be runnable with deepspeed launcher.

* Fix format.

* fix formatting again.

* make test runnable without mpi4py

* Add dist.alltoall and dist.allgather instead of custom functions.

* remove debug prints.

* formatting and renaming

* renaming

* renaming

* add unit test, fix existing tests

* skip unit test when torch < 1.8

* revert 1-bit lamb

* flatten momentum when dimension is more than 1

* add warning message for 1-bit adam under fp32

* improve version check

* add fp32 test

* 1-bit adam doc

* fix file name

* doc fix

* torch 1.8 is released

* doc fix

* fix tests

* update news

* add doc for momentum mask

* fix checkpoint handling, add unit test

* checkpoint handling doc

* doc final cleanup

* bump dates

* update tests

* url change

* doc fix

* fix test

* doc update

Co-authored-by: Ammar Ahmad Awan <ammar.awan@microsoft.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
2021-03-16 16:27:20 -07:00
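The NCCL-based 1-bit Adam above is selected through the DeepSpeed optimizer config. A minimal sketch (hyperparameter values illustrative; "comm_backend_name" assumed per the NCCL/MPI backend refactor described in the commit):

```json
{
  "optimizer": {
    "type": "OneBitAdam",
    "params": {
      "lr": 1e-4,
      "freeze_step": 1000,
      "comm_backend_name": "nccl"
    }
  }
}
```

As the fp32 warning added in the commit suggests, this optimizer is intended for fp16 training.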
7925d0c3f2 small tweaks (#839) 2021-03-11 10:36:34 -08:00
e0f36ed5a1 Add optimizers and schedules to RTD and update the corresponding part of the website (#799)
* add optimizers and schedules to rtd

* update ds website and fix links

* add optimizers and schedules to rtd

* update ds website and fix links

* add flops profiler to rtd

* fix

Co-authored-by: Shaden Smith <Shaden.Smith@microsoft.com>
2021-03-11 09:45:36 -08:00
599258f979 ZeRO 3 Offload (#834)
* Squash stage3 v1 (#146)

Co-authored-by: Samyam <samyamr@microsoft.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
Co-authored-by: Samyam Rajbhandari <samyamr@microsoft.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Shaden Smith <Shaden.Smith@microsoft.com>
Co-authored-by: Shaden Smith <ShadenTSmith@gmail.com>
Co-authored-by: eltonzheng <eltonz@microsoft.com>

* Fix correctness bug (#147)

* formatting fix (#150)

* stage3 bugfix (API) update and simplified FP16 Z3 tests (#151)

* fp16 Z3 API update and bugfix

* revert debug change

* ZeRO-3 detach and race condition bugfixes (#149)

* trying out ZeRO-3 race condition fix

* CUDA sync instead of stream

* reduction stream sync

* remove commented code

* Fix optimizer state_dict KeyError (#148)

Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

* fix for smaller SGS sizes, ensures each grad is backed by unique tensors (#152)

* Simplifying the logic for getting averaged gradients (#153)

* skip for now

* Z3 Docs redux (#154)

* removing some TODOs and commented code (#155)

* New Z3 defaults (#156)

Co-authored-by: Jeff Rasley <jerasley@microsoft.com>

* formatting

* megatron external params

Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Shaden Smith <Shaden.Smith@microsoft.com>
Co-authored-by: Shaden Smith <ShadenTSmith@gmail.com>
Co-authored-by: eltonzheng <eltonz@microsoft.com>
2021-03-08 12:54:54 -08:00
4e2dc4e4c0 Add deepspeed.init_distributed to RTD page (#645)
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
2021-01-07 15:52:23 -08:00
6380ee3511 Fixes for RTD build errors (#606)
Co-authored-by: Shaden Smith <Shaden.Smith@microsoft.com>
2020-12-15 15:29:21 -08:00
f5aa2547d8 Add CPUAdam optimizer for zero-offload in deepspeed engine (#484)
* add adamW to CPU-ADAM implementation

* supporting cpu-adam optimizer for zero-offload on deepspeed side

* bump DSE to match cpu-adam updates

Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
2020-10-30 09:01:04 -07:00
5812e84544 readthedocs yaml configuration (#410)
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
2020-09-16 18:57:43 -07:00
c82756cd15 readthedocs upgrade (#402) 2020-09-10 15:44:47 -07:00
65c2f974d8 Pipeline parallel training engine. (#392)
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
2020-09-09 23:14:55 -07:00
734d8991c8 Transformer kernel release (#242)
* Transformer kernels release

Co-authored-by: Shaden Smith <ShadenTSmith@gmail.com>
Co-authored-by: Elton Zheng <eltonz@microsoft.com>
Co-authored-by: Reza Yazdani <reyazda@microsoft.com>
Co-authored-by: RezaYazdaniAminabadi <44502768+RezaYazdaniAminabadi@users.noreply.github.com>
Co-authored-by: Tunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Shaden Smith <Shaden.Smith@microsoft.com>
Co-authored-by: Samyam Rajbhandari <samyamr@microsoft.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
2020-05-29 13:15:36 -07:00
f2ac7eafd5 ZeRO-2 (#217)
Updates for ZeRO stage 2 + ZeRO stage 1 with reduce-scatter

Co-authored-by: Tunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Samyam Rajbhandari <samyamr@microsoft.com>
Co-authored-by: Shaden Smith <ShadenTSmith@gmail.com>
Co-authored-by: Elton Zheng <eltonz@microsoft.com>
Co-authored-by: Shaden Smith <Shaden.Smith@microsoft.com>
Co-authored-by: yuxionghe <yuxhe@microsoft.com>
Co-authored-by: Arash Ashari <arashari@microsoft.com>
2020-05-19 01:00:53 -07:00
dd166ee6b6 README and RTD improvements. (#198) 2020-04-21 22:18:47 -07:00
5042dc0085 drafting Jekyll webpage (#143) 2020-03-17 13:49:48 -07:00