* Changes for bfloat16 ZeRO-2
* ZeRO stage3 optimizations, with some bug fixes
optimizations for stage3:
- prefetching improvements
- batching allgather calls to amortize fixed overhead and improve bandwidth utilization (see the sketch after this list)
- batching reduce-scatter calls to amortize fixed overhead and improve bandwidth utilization
- using *_base variants of allgather and reduce-scatter to reduce memory allocations and data movement
- more fine-grained synchronization for communication, allowing blocking on less work
- precomputation of fetching code - using a fetch queue rather than deciding what to (pre)fetch at each iteration
- limiting queued coalesced communication ops to reduce memory pressure on the pytorch cuda caching allocator (not an elegant solution)
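The batching and *_base items above amount to something like the following minimal sketch (illustrative only, not the DeepSpeed code; it assumes flat 1-D shards and the newer all_gather_into_tensor API, which older torch exposes as _all_gather_base):

```python
# Minimal sketch: one flat *_base all-gather instead of many per-parameter
# list-based all_gather calls, to amortize launch overhead.
import torch
import torch.distributed as dist

def batched_allgather_params(shards, group=None):
    """All-gather a list of 1-D parameter shards with a single collective call."""
    world_size = dist.get_world_size(group)
    # Flatten all local shards into one contiguous send buffer.
    flat_in = torch.cat([s.view(-1) for s in shards])
    flat_out = torch.empty(flat_in.numel() * world_size,
                           dtype=flat_in.dtype, device=flat_in.device)
    # Single *_base-style collective; async so the caller can overlap compute.
    handle = dist.all_gather_into_tensor(flat_out, flat_in, group=group, async_op=True)
    return flat_out, handle  # wait() on the handle before un-flattening per parameter
```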
optimizations for stage3-offload:
- made some host-to-device tensor copies asynchronous to improve performance (see the sketch below)
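A minimal sketch of that kind of async copy, assuming pinned host buffers and a dedicated copy stream (illustrative only):

```python
# Illustrative only: overlapping a host-to-device copy with compute by using pinned
# (page-locked) host memory, a side CUDA stream, and non_blocking=True.
import torch

copy_stream = torch.cuda.Stream()

def async_h2d_copy(host_tensor, device_buffer):
    assert host_tensor.is_pinned(), "async H2D copies need pinned host memory"
    with torch.cuda.stream(copy_stream):
        device_buffer.copy_(host_tensor, non_blocking=True)
    done = torch.cuda.Event()
    done.record(copy_stream)   # consumers wait on this event only when the data is needed
    return done
```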
bug fixes and qol improvements:
- fix init context method when parent modules modify child weights
- speed up model initialization by moving the model to GPU before weight initialization
- fixed unit test imports so that unit tests can be run from any directory
- change performance logging to include memory consumption
- add logging w/ model size when done partitioning model
new features:
- bfloat16 support for ZeRO 3
* fix import in unit tests
* ran yapf
* improvements to cache flush warn log
* backwards compatibility with older versions of pytorch
* handle edge case where reduced tensor smaller than world size
* moved event synchronization to allgather handle wait() call
* removed unnecessary barrier call
* formatting fix after resolving merge conflict
* skip nvme prefetch when trace not complete
* opportunistically avoid memory allocation in allgather coalesced where possible
* fix indentation after merge
* fixes to account for parameter offload
* accounting for torch.cuda.memory_stats not being available
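A minimal sketch of that kind of guard (illustrative only, not the actual helper):

```python
# Guard for older torch builds where torch.cuda.memory_stats does not exist yet,
# falling back to the coarser counter that is available.
import torch

def cuda_memory_report():
    if hasattr(torch.cuda, "memory_stats"):
        return torch.cuda.memory_stats()
    # older releases: fall back to the single counter that has been around much longer
    return {"allocated_bytes.all.current": torch.cuda.memory_allocated()}
```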
* moved partition_all_params to optimizer step
* allgather params before item() gets called
* fix param status checks needed after moving the partition_all_parameters call to optimizer step
* fix grad accumulation with optimizer offload
* grad norm computation fix for optimizer offload
* change post divide in reduce-scatter to pre divide
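Pre-dividing by world size before the reduce-scatter keeps low-precision partial sums from overflowing, since the division happens before the summation rather than after it. A minimal sketch (illustrative; assumes a flat gradient whose length divides evenly by world size, and uses reduce_scatter_tensor, the newer name for the older _reduce_scatter_base collective):

```python
import torch
import torch.distributed as dist

def reduce_scatter_avg(flat_grad, group=None):
    world_size = dist.get_world_size(group)
    flat_grad.div_(world_size)          # pre-divide: partial sums stay small in fp16/bf16
    out = torch.empty(flat_grad.numel() // world_size,
                      dtype=flat_grad.dtype, device=flat_grad.device)
    dist.reduce_scatter_tensor(out, flat_grad, op=dist.ReduceOp.SUM, group=group)
    return out                          # already the averaged local partition
```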
* fix gradient race condition w/ optimizer offload
* improve inf/nan gradient tracking
* don't prefetch when not in training mode
* format fix after merging
* fix prefetching issue when using NVME offload
* improved defragmentation for fp16 parameters
* relative imports for bf16 tests
* changes for bwd compatibility with pytorch 1.2
* remove buffered_reduce_fallback
* removed unused parameter offset bookkeeping
* fixed tracking for multiple param groups
* unbroke bfloat16 config after merge conflict
* using base allgather params when only 1 param
* cleanup/fixes for fp16 partition defragmentation
* switch to CRLF
* convert to same new-line style as master
* align new line with master
* Fix merge issues
* switch to CRLF
* fix to LF line endings
* minor merge fixes
* remove extra bfloat16_enabled definition
* asserting params inflight for AllGatherHandle
* remove get_cuda_mem_allocated_str
* Format fixes
* fix bfloat16 zero stage check (broken after merge commit)
* +self.communication_data_type, -self.allreduce_always_fp32; delete dead code
* Add self.reduce_scatter
* Format fix
* Fix merge issues
* iterate over params_to_fetch rather than make another iterator
* add some TODOs
* remove unnecessary division by micro_step_id
* rename config keys "bfloat16" -> "bf16"
* rename stage3_gather_fp16_weights_on_model_save -> stage3_gather_16bit_weights_on_model_save
* add unit test to check backwards compatibility for gather_16bit_weights
* added test to confirm bf16 key bwd compatibility
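For reference, a config fragment using the renamed keys (illustrative values; per the tests above, the old spellings remain accepted for backwards compatibility):

```python
ds_config = {
    "bf16": {"enabled": True},  # formerly "bfloat16"
    "zero_optimization": {
        "stage": 3,
        # formerly stage3_gather_fp16_weights_on_model_save
        "stage3_gather_16bit_weights_on_model_save": True,
    },
}
```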
* Format fixes
Co-authored-by: Rana Ali Amjad <raamjad@amazon.com>
Co-authored-by: Justin Chiu <justchiu@amazon.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
* Fix typos in docs/
* Fix typos in code comments and output strings
* Fix typos in the code itself
* Fix typos in tests/
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
* add live zero checkpoint to fp32 consolidation version
* some more docs
* zero2 model states uses a different filename
* fix
* make debug mode cli configurable
* copy the script only on node 0 process 0
* validate that we have the right number of files
* revamp _get_zero_param_shapes, instrument with easier debug
* correct assertion
* rename API; add even simpler API
* style
* improve docs
* update the docs
* revert the unpartitioned_params detection and report as it's most likely persistent params
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Authors: @awan-10 @conglongli @samyam @jeffra
What's new:
NCCL-based implementation, which provides better performance and usability than the MPI-based implementation (see the config sketch below).
Add support for momentum masks for parameters with constant zero gradients during training.
Bug fixes (e.g., #813).
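A minimal config sketch selecting the NCCL backend for 1-bit Adam (field names taken from the 1-bit Adam docs; values are placeholders):

```python
ds_config = {
    "train_batch_size": 32,
    "optimizer": {
        "type": "OneBitAdam",
        "params": {
            "lr": 1e-4,
            "freeze_step": 1000,         # full-precision Adam warm-up before compression kicks in
            "comm_backend_name": "nccl", # the NCCL-based backend added here (vs. "mpi")
        },
    },
    "fp16": {"enabled": True},
}
```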
* NCCL-based 1-bit Adam + Code Refactor for Comm. Backends (#594)
* NCCL based 1-bit Implementation + Refactor to add communication backends (#593)
* add nccl 1-bit optim.
* temporary commit to save stuff.
* Use dist collectives instead of mpi routines.
* remove old code for comm.
* Fix bugs. still does not work.
* modify to test the nccl side code path
* Initial gather impl. Works intra-node.
* Updates to comm. phase 2; nccl comm. passed the tests.
* refactor code to introduce nccl/mpi as backends for onebit adam.
* Refactor updates to test/engine.
* Fix compile/runtime errors.
* simplify support for nccl/mpi backends.
* Add missing file
* Add compression backend in constructor. Revert later.
* modify test with some perf counting.
* Implement a true non-blocking gather for nccl side.
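In spirit, a non-blocking gather can be built from async torch.distributed collectives, e.g. (illustrative only; the actual backend code differs):

```python
import torch
import torch.distributed as dist

def nonblocking_gather(tensor, group=None):
    world_size = dist.get_world_size(group)
    gather_list = [torch.empty_like(tensor) for _ in range(world_size)]
    handle = dist.all_gather(gather_list, tensor, group=group, async_op=True)
    return gather_list, handle   # wait() on the handle right before the data is consumed
```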
* Revert "Add compression backend in constructor. Revert later."
This reverts commit df8c40d3105e9f2542a8aa6619e80d675a09753f.
* improve the 1-bit adam test.
* Refactor comm. and compression backend in 1-bit adam.
* Fix the test.
* Fix runtime errors and typos in nccl backend
* fix mpi backend. modify tests.
* modify nccl perf test.
* fix mpi side errors.
* Add an mpi perf test
* Sync DSE.
* Remove old collectives file.
* Fix a typo.
* Graceful failure for torch versions that don't support nccl pt2pt.
* Revert "Merge branch 'master' into staging-1bit-nccl-v2"
This reverts commit 78400850703b4b2d84f11b73c109f56919e748ea, reversing
changes made to a6dba72aeafad63661dfe566d3accd03d00be78c.
* Revert "Revert "Merge branch 'master' into staging-1bit-nccl-v2""
This reverts commit 6dbdd9858bafef4d340c089fdc0e3ddde3706f47.
* comm optimization + 1-bit lamb
* Saving/debugging commit.
* finalizing 1-bit lamb
* finalizing 1-bit lamb
* add momentum mask and chkpt handling for 1-bit adam
* Cleanup and modify nccl test to be runnable with deepspeed launcher.
* Fix format.
* fix formatting again.
* make test runnable without mpi4py
* Add dist.alltoall and dist.allgather instead of custom functions.
* remove debug prints.
* formatting and renaming
* renaming
* renaming
* add unit test, fix existing tests
* skip unit test when torch < 1.8
* revert 1-bit lamb
* flatten momentum when dimension is more than 1
* add warning message for 1-bit adam under fp32
* improve version check
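The version guards are along these lines (a sketch, not the actual helper):

```python
import torch
from packaging import version

# NCCL point-to-point (send/recv) landed in torch 1.8, so older builds fall back or skip.
TORCH_HAS_NCCL_P2P = version.parse(torch.__version__.split("+")[0]) >= version.parse("1.8")

def require_nccl_p2p():
    if not TORCH_HAS_NCCL_P2P:
        raise RuntimeError("the NCCL 1-bit backend needs torch >= 1.8 (NCCL point-to-point support)")
```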
* add fp32 test
* 1-bit adam doc
* fix file name
* doc fix
* torch 1.8 is released
* doc fix
* fix tests
* update news
* add doc for momentum mask
* fix checkpoint handling, add unit test
* checkpoint handling doc
* doc final cleanup
* bump dates
* update tests
* url change
* doc fix
* fix test
* doc update
Co-authored-by: Ammar Ahmad Awan <ammar.awan@microsoft.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
* add optimizers and schedules to rtd
* update ds website and fix links
* add flops profiler to rtd
* fix
Co-authored-by: Shaden Smith <Shaden.Smith@microsoft.com>
* add AdamW to the CPU-Adam implementation
* support the cpu-adam optimizer for ZeRO-Offload on the deepspeed side
* bump DSE to match cpu-adam updates
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
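The AdamW variant differs from classic Adam only in where weight decay is applied; roughly (a plain-Python sketch of the decoupled decay with an adamw_mode-style switch, not the C++ kernel itself):

```python
def cpu_adam_step(p, g, m, v, lr, beta1, beta2, eps, wd, step, adamw_mode=True):
    if not adamw_mode:
        g = g + wd * p                      # classic Adam folds (L2) decay into the gradient
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g * g
    m_hat = m / (1 - beta1 ** step)
    v_hat = v / (1 - beta2 ** step)
    p = p - lr * m_hat / (v_hat ** 0.5 + eps)
    if adamw_mode:
        p = p - lr * wd * p                 # AdamW decays the weights directly (decoupled)
    return p, m, v
```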