This PR completes the ZenFlow integration for DeepSpeed ZeRO Stage 3.
Highlights:
- ZenFlowSelectiveAdamW_stage3: Optimizer with importance-aware
selective parameter updates for ZeRO Stage 3.
- ZenFlowZeroOptimizer_Stage3: Full Stage 3 optimizer integration with
partitioned parameters and CPU offload.
- Configurable via ZenFlowConfig, fully integrated with
DeepSpeedZeroConfig for Stage 3.
- Unit tests for Stage 3 cases ensuring correctness and compatibility.
Note: Integration with ZeRO Stages 1 and 2 was introduced in #7391
---------
Signed-off-by: Yusen Wu <xrn4ub@virginia.edu>
Co-authored-by: Ma, Guokai <guokai.ma@intel.com>
Co-authored-by: Olatunji Ruwase <tunji.ruwase@snowflake.com>
Co-authored-by: Tingfeng Lan <erc8gx@virginia.edu>
Currently, the DeepSpeed engine does not enable the grad scaler for the
ZeRO-0 and `torch.autocast` path, even when dtype is set to `fp16`. This
leads to errors in tests when we replace our hard-coded tolerances with
PyTorch’s [standard
tolerances](https://docs.pytorch.org/docs/stable/testing.html#torch.testing.assert_close)
(Thank you @stas00 for your suggestion regarding the previous PR).
This PR enables the grad scaler for this path to improve accuracy, and
refactors the tests to simplify validation by using
`torch.testing.assert_close`. The tests now rely on PyTorch’s standard
(and stricter) tolerances, and they still pass.
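For illustration, the kind of check the refactor converges on (a minimal sketch; the real tests compare engine losses, not constants):
```python
import torch

# torch.testing.assert_close picks default rtol/atol per dtype (e.g. 1e-3/1e-5
# for fp16), so no hand-coded tolerances are needed.
expected = torch.tensor([1.0, 2.0], dtype=torch.float16)
actual = expected.clone()
torch.testing.assert_close(actual, expected)
```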
---------
Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
Co-authored-by: Olatunji Ruwase <tunji.ruwase@snowflake.com>
This PR improves the usability of the leaf module feature.
Here are the changes:
- Allow enabling the leaf module via both the DeepSpeed config and APIs (see the sketch after this list).
- Relax matching criteria to support class-based matching.
- Support multiple ways of specifying the target module: class, class
name (with or without package name), module name, or suffix.
- Add documentation to the training guide, including config snippets and
explanations of default behavior.
- Add default classes (e.g., Mixtral, Qwen2/Qwen3) that automatically
enable the leaf module feature. (Requests to add more classes are welcome.)
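A minimal API-side sketch, assuming the relaxed matching described above (`set_z3_leaf_modules` is the existing helper; the string-based form relies on this PR's new matching, and the model below is only a stand-in):
```python
import torch
from deepspeed.utils import set_z3_leaf_modules

# Stand-in model; in practice you would target e.g. the Mixtral/Qwen MoE block classes.
model = torch.nn.Sequential(torch.nn.Linear(8, 8))

# Class-based matching (existing behavior).
set_z3_leaf_modules(model, [torch.nn.Linear])

# Name/suffix-based matching (relaxed criteria added in this PR; the exact
# accepted forms follow the list above).
set_z3_leaf_modules(model, ["Linear"])
```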
---------
Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
Co-authored-by: Olatunji Ruwase <tunji.ruwase@snowflake.com>
This PR improves the error message shown when a DeepCompile test fails.
Tests of DeepCompile occasionally fail
([example](https://github.com/deepspeedai/DeepSpeed/actions/runs/18160078309/job/51688736712?pr=7604))
because of mismatching loss values.
To make sure this is not a synchronization bug that causes `nan` loss
values, the change in this PR shows the mismatching values. We can
consider increasing the tolerances once we confirm the mismatch is
reasonable.
---------
Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
With autocast enabled, the majority of weights are downcast before being
used in calculations. Today zero3_compile gathers the FP32 weights
before they are downcast. That is sub-optimal because FP32 weights
consume more bandwidth to allgather and take more time to downcast.
To reduce communication and downcast time, this PR fuses allgather and
downcast in the dc ops. The target dtype is now passed to allgather_param() and
prefetch_params_fused(), which downcast the (partial) weights before
launching the allgathers.
This corresponds to issue 1 of #7577.
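A conceptual sketch of the fused path (plain PyTorch for illustration only; the actual change lives in the C++/CUDA dc ops): cast the local shard to the autocast dtype first, then allgather, so the collective moves fewer bytes and no post-gather downcast is needed.
```python
import torch
import torch.distributed as dist

def gather_param_downcast_first(shard_fp32: torch.Tensor, target_dtype=torch.bfloat16):
    # Downcast only the local partition, then gather the already-downcasted shards.
    shard_lp = shard_fp32.to(target_dtype)
    full = torch.empty(dist.get_world_size() * shard_lp.numel(),
                       dtype=target_dtype, device=shard_lp.device)
    dist.all_gather_into_tensor(full, shard_lp)
    return full
```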
Tested with
https://gist.github.com/eternalNight/3c2cf8c703f1e9e7742d3b7f9e1edae3
(run with `deepspeed --num_gpus=N this_file.py -c -p -m 23` to collect
torch and memory profiles, and with DINOV2_DEPTH = SIGLIP_DEPTH = 3,
LLAMA2_DEPTH = 4 for faster compilation) on a 5090 (which has limited
inter-GPU bandwidth): time per step decreases from 438 ms to 337 ms and
peak GPU memory usage from 9.5 GB to 8.5 GB.
Profiles of a single step before this PR:
<img width="1235" height="1029" alt="image"
src="https://github.com/user-attachments/assets/d9fe5296-7731-4542-924b-421ff7415054"
/>
<img width="1466" height="616" alt="image"
src="https://github.com/user-attachments/assets/aa192802-8633-4e36-b2c4-f28b1b432663"
/>
After this PR:
<img width="1218" height="1006" alt="image"
src="https://github.com/user-attachments/assets/18a0e09c-155b-4783-adb5-b4d36c5c3691"
/>
<img width="1537" height="559" alt="image"
src="https://github.com/user-attachments/assets/16a2ca74-8a89-4db9-9b68-81844295c61b"
/>
This PR also reduces peak memory usage because the
`fast_free_schedule()` today always arranges param allgathers and
downcasts at the beginning of the graph. While the original FP32 params
can be freed early, all FP16/BF16-casted params are kept in GPU memory
at the beginning of the backward graph, leading to a higher peak in
memory usage.
P.S. Probably due to organization branch rule settings, I can't find an
option to allow reviewers to modify the branch, so I'll update the
branch per reviewers' comments and rebase if needed.
Signed-off-by: Junjie Mao <junjie.mao@linux.alibaba.com>
**Describe the bug**
When the model is large and there are multiple subgroups, running
ds_to_universal.py fails with the error log below:
```
*** 1. Extracting ZeRO fragments
0%| | 0/1 [00:03<?, ?it/s]
Traceback (most recent call last):
File "/work/zhengchenyu/ai-project/qwen3/scripts/ds_to_universal_example.py", line 21, in <module>
main()
File "/work/zhengchenyu/ai-project/qwen3/scripts/ds_to_universal_example.py", line 18, in main
ds_to_universal_main(args)
File "/opt/conda/lib/python3.11/site-packages/deepspeed/checkpoint/ds_to_universal.py", line 523, in main
_extract_zero_shard_files_stage3(args, optim_files, param_shapes, dp_degree, temp_dir)
File "/opt/conda/lib/python3.11/site-packages/deepspeed/checkpoint/ds_to_universal.py", line 375, in _extract_zero_shard_files_stage3
_do_parallel_work(do_work, list(range(dp_degree)), args.num_extract_workers)
File "/opt/conda/lib/python3.11/site-packages/deepspeed/checkpoint/ds_to_universal.py", line 359, in _do_parallel_work
results.append(do_work(work))
^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/deepspeed/checkpoint/ds_to_universal.py", line 167, in extract_zero_shards_stage3
dump_param_fragment(temp_dir, 0, dp_index, state_key, flat_state[state_key], name, offset,
File "/opt/conda/lib/python3.11/site-packages/deepspeed/checkpoint/ds_to_universal.py", line 194, in dump_param_fragment
state_flat_tensor = state_flat_tensor.narrow(0, offset, numel).clone()
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: start (0) + length (155582464) exceeds dimension size (74499072).
```
**To Reproduce**
Steps to reproduce the behavior:
1. Use a large model, or set sub_group_size to a lower value; then
train and save the model.
2. Run ds_to_universal.py.
**The reason**
I found that the previous Stage 3 universal checkpoint implementation did
not take subgroups into account. I also found the following problems
during debugging:
* It cannot handle multiple sub-groups, which results in data loss.
* When load_checkpoint is True, all processes save to the same ZeRO
model checkpoint file. If multiple processes write at the same time, the
file gets corrupted; file corruption was occasionally observed during
testing.
Related issue: #7584
---------
Co-authored-by: Olatunji Ruwase <tunji.ruwase@snowflake.com>
This PR introduces **SuperOffload**—an optimizer designed for Superchips
(Nvidia GH200 & GB200, AMD MI300A) with high CPU–GPU bandwidth. It
enables **full fine-tuning** of **GPT-OSS-20B, Qwen3-14B, and Phi-4** on
a single GH200 GPU, achieving up to **~500 TFLOPS**, using Hugging Face
Transformers and DeepSpeed—no custom modeling code required.
SuperOffload extends ZeRO-Offload with fine-grained control and CPUAdam
rollback utilities, allowing GPU execution to overlap with CPUAdam. This
reduces GPU idle time and improves overall efficiency.
Key changes:
- New SuperOffloadOptimizer_Stage3 optimizer.
- C++/CUDA binding for adam_rollback to revert one optimization step.
- Config additions including super_offload and cpuadam_cores_perc.
A detailed blog and tutorial will be available soon.
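For orientation, a hypothetical config sketch showing where the new knobs could sit; the key names come from the list above, but their exact placement in the schema is an assumption until the tutorial lands.
```python
ds_config = {
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
        # New options introduced by this PR (placement assumed):
        "super_offload": True,
        "cpuadam_cores_perc": 0.8,
    },
}
```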
---------
Co-authored-by: Olatunji Ruwase <tunji.ruwase@snowflake.com>
In deepcompile free-activation mode, only activations larger than a
threshold are eagerly freed. The threshold is hardcoded today and thus
may not be suitable in all cases.
This PR first generalizes the dc.init() interface to take the whole
compile_config object, and then converts the threshold into a config
item.
This corresponds to issue 3 of #7577.
---------
Signed-off-by: Junjie Mao <junjie.mao@linux.alibaba.com>
This PR moves active tests under `tests/unit/v1` to clarify which tests
are run on modal.
---------
Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
This PR improves error logging and relaxes loss value checks in the
autocast test.
Previously, the test displayed error messages and mismatched loss values
on all ranks, even if the mismatch only occurred on some ranks. This was
confusing, since logs from other ranks could appear correct. This PR
changes the behavior so that error messages are shown only on the ranks
where the mismatch occurs.
Additionally, this PR skips loss value validation for
`test_lower_precision_model`, where we intentionally use a different
communication dtype from the baseline (standard PyTorch autocast).
---------
Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
Co-authored-by: Stas Bekman <stas00@users.noreply.github.com>
This PR relaxes two restrictions on torch.autocast in the DeepSpeed
engine:
1) Nesting torch.autocast
Currently, we do not expect `torch.autocast` to be used outside the
DeepSpeed engine. Here is the current behavior:
- If `torch.autocast` is enabled in the DeepSpeed config and the engine
detects it is also enabled outside, a warning is displayed.
- If it is disabled in the config, the engine raises an error.
This design prevents the following usage:
```python
with torch.autocast(...):
logits = deepspeed_model(...)
loss = criteria_fn(logits)
```
In this case, we also want to apply autocast to `criteria_fn`. With the
current behavior, we would need to move `deepspeed_model(...)` outside the
`torch.autocast` context, leading to inconsistent code between DeepSpeed
and non-DeepSpeed setups. (This cannot be handled with the `enabled` arg of
`torch.autocast`.)
Change in this PR:
`torch.autocast` outside the DeepSpeed engine is ignored, and
- If `torch_autocast` is enabled in the config, DeepSpeed will follow
that setting.
- If it is disabled, DeepSpeed falls back to its own mixed-precision
support (or FP32).
In these cases, the DeepSpeed engine shows a message explaining the
behavior.
2) Model’s dtype
Previously, DeepSpeed assumed the model’s dtype must be FP32 when
`torch.autocast` was enabled. However, models with lower-precision
parameters (e.g., BF16) can also be used with autocast. For example, if
both the model and `torch.autocast` use BF16, autocast will upcast
precision-sensitive ops as needed.
Change in this PR:
Removed the assertion that restricted the model’s dtype to FP32.
This PR also adds and updates tests to cover these new behaviors.
---------
Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
Authorship: @pengdurice and @PKUWZP
Related Issue: #7438
# Introduction
[Muon](https://arxiv.org/abs/2502.16982), a new optimizer that has
recently attracted the community’s attention, shows promising results in
training large language models. Adding the Muon optimizer to DeepSpeed,
a popular OSS framework for large-scale training and inference, is
critically important for DeepSpeed users and developers. There has been
a [PR](https://github.com/deepspeedai/DeepSpeed/pull/7454) attempting
the adoption (huge thanks to @qimcis), which is a good starting point,
but more substantial effort is required to make it fully compatible
and working within DeepSpeed. We are publishing this PR to fully enable
Muon optimizer capabilities for DeepSpeed.
# Issues and solutions
## Issues
1. With stage 1, 2 or 3, the optimizer states will be partitioned within
the same data parallel group. This means that each process is already
handling only parts of the model parameters and there is no need to use
the DP solution as in the
[code](https://github.com/KellerJordan/Muon/blob/master/muon.py#L195).
2. The parameters (and the gradients) will be flattened to a 1D vector
before being used in the optimizer, thus nullifying the major hypothesis
of the Muon optimizer: it works by orthogonalizing the updates for each
matrix (dim >= 2); see the sketch after this list.
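For context, a sketch of the orthogonalization step that requires 2-D shapes, mirroring the quintic Newton-Schulz iteration from the public reference Muon implementation (not DeepSpeed's exact code):
```python
import torch

@torch.no_grad()
def newton_schulz_orthogonalize(G: torch.Tensor, steps: int = 5, eps: float = 1e-7) -> torch.Tensor:
    # Approximately orthogonalize a 2-D update matrix; coefficients follow the
    # reference Muon implementation.
    assert G.ndim == 2, "Muon's update only makes sense for matrices (dim >= 2)"
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G.bfloat16()
    transpose = G.size(0) > G.size(1)
    if transpose:
        X = X.T
    X = X / (X.norm() + eps)  # normalize so the iteration converges
    for _ in range(steps):
        A = X @ X.T
        B = b * A + c * A @ A
        X = a * X + B @ X
    if transpose:
        X = X.T
    return X.to(G.dtype)
```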
## Solutions
To solve the issues, we propose this new PR in which:
1. We simplify the Muon code by
[removing](https://github.com/deepspeedai/DeepSpeed/compare/master...pengdurice:DeepSpeed:peng-add-muon-v1#diff-c9052994e41caee9ca88363749c10af08655f8019f08dc971c018663d25a3712R22)
the partitioning and Muon update logic.
2. We
[move](https://github.com/deepspeedai/DeepSpeed/compare/master...pengdurice:DeepSpeed:peng-add-muon-v1#diff-99dcf26ea2876ff5bbf05b5165c4133eaa0d0f36b170685643c2f7e2eb566addR1867)
the Muon update to the
[get_flat_partition](https://github.com/deepspeedai/DeepSpeed/compare/master...pengdurice:DeepSpeed:peng-add-muon-v1#diff-99dcf26ea2876ff5bbf05b5165c4133eaa0d0f36b170685643c2f7e2eb566addR1848)
function of the stage 1 and 2 DeepSpeedZeroOptimizer, in which per-parameter
gradients are collected before being flattened and used by the optimizer
to update the model parameters. Since each parameter is still in its
original shape, we can easily apply the Muon update.
3. We also save the momentum buffer into the optimizer's state so that
convergence remains smooth after resuming from saved checkpoints.
4. We added comprehensive unit tests to validate Muon Optimizer's
correctness and functionality.
# Future directions and roadmap
In the future, several follow-up works are of interest:
- [ ] Create a CPU offload version.
- [ ] Apply Muon to Stage 3
- [ ] Use the highly optimized version of Adam for the Adam part of
MuonWithAuxAdam optimizer.
- [ ] More efficient implementations, e.g. a) add specialized kernels for
the Newton-Schulz iteration and Muon updates; b) parallelize updates across
parameters (currently, each parameter is updated separately and
sequentially)
---------
Co-authored-by: Peng Du <pedu@linkedin.com>
Co-authored-by: pengdurice <pengduhit@gmail.com>
Co-authored-by: Zhipeng Wang <zhipengbayern@gmail.com>
Co-authored-by: Olatunji Ruwase <tunji.ruwase@snowflake.com>
This PR adds ZenFlow, an importance-aware offloaded training framework
for DeepSpeed ZeRO. ZenFlow enables multi-step overlap between
computation and communication during offloaded training, improving GPU
utilization and reducing stalls.
Highlights:
- New ZenFlow optimizers (ZenFlowCPUAdam, ZenFlowSelectiveAdamW)
- ZenFlowZeroOptimizer for ZeRO Stage 1/2 integration
- Configurable via ZenFlowConfig, integrated with DeepSpeedZeroConfig
- Unit tests and documentation included
Note: This PR focuses on Stage 1 and 2 integration. Stage 3 support will
be introduced in a follow-up PR.
---------
Signed-off-by: Tingfeng Lan <erc8gx@virginia.edu>
Signed-off-by: Yusen Wu <xrn4ub@virginia.edu>
Signed-off-by: Olatunji Ruwase <tunji.ruwase@snowflake.com>
Co-authored-by: Yusen Wu <xrn4ub@virginia.edu>
Co-authored-by: Masahiro Tanaka <81312776+tohtana@users.noreply.github.com>
Co-authored-by: Olatunji Ruwase <tunji.ruwase@snowflake.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Olatunji Ruwase <tjruwase@gmail.com>
Co-authored-by: Guokai Ma <guokai.ma@gmail.com>
This PR adds `TiledFusedLogitsLoss` for an efficient fused logits+loss
computation - this version pre-calculates grads in `forward`, avoiding
recomputation in the backward (similar to the Liger-Kernel
implementation).
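A minimal sketch of the precomputed-grad pattern (illustrative only; the actual `TiledFusedLogitsLoss` also tiles the computation over shards and handles more arguments):
```python
import torch

class FusedLogitsLossSketch(torch.autograd.Function):

    @staticmethod
    def forward(ctx, hidden, weight, labels):
        # Compute logits + loss and their grads right here, so backward can
        # simply replay them instead of rematerializing the logits.
        with torch.enable_grad():
            h = hidden.detach().requires_grad_(True)
            w = weight.detach().requires_grad_(True)
            logits = h @ w.t()
            loss = torch.nn.functional.cross_entropy(logits, labels)
            dh, dw = torch.autograd.grad(loss, (h, w))
        ctx.save_for_backward(dh, dw)
        return loss.detach()

    @staticmethod
    def backward(ctx, grad_output):
        dh, dw = ctx.saved_tensors
        return grad_output * dh, grad_output * dw, None

# Usage: loss = FusedLogitsLossSketch.apply(hidden_states, lm_head_weight, labels)
```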
---------
Signed-off-by: Stas Bekman <stas@stason.org>
Co-authored-by: Aurick Qiao <aurick.qiao@snowflake.com>
This PR fixes an `AttributeError` that occurs during
`deepspeed.init_inference` when using kernel injection
(`replace_with_kernel_inject=True`) with Llama models from recent
versions of `transformers`.
**The Bug:**
In newer `transformers` versions (e.g., `4.53.3`), configurations like
`num_heads` and `rope_theta` were moved from direct attributes of the
`LlamaAttention` module into a nested `config` object.
The current DeepSpeed injection policy tries to access these attributes
from their old, direct location, causing the initialization to fail with
an `AttributeError: 'LlamaAttention' object has no attribute
'num_heads'`.
**The Solution:**
This change updates the Llama injection logic to be more robust:
1. It first tries to read attributes like `num_heads` from the new
`config` object location.
2. If that fails, it falls back to the legacy direct attribute path.
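The fallback reduces to a pattern like this hypothetical helper (the helper and the attribute names shown are examples, not the exact code in this PR):
```python
def get_attn_attr(attn_module, names, default=None):
    # Prefer the nested config object used by newer transformers releases,
    # then fall back to the legacy direct attribute on the module.
    config = getattr(attn_module, "config", None)
    for owner in (config, attn_module):
        if owner is None:
            continue
        for name in names:
            if hasattr(owner, name):
                return getattr(owner, name)
    return default

# e.g. num_heads = get_attn_attr(attn, ("num_attention_heads", "num_heads"))
```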
---------
Signed-off-by: huanyuqu <yc37960@um.edu.mo>
Improved TiledMLP and SequenceTiledCompute for bs>1
This PR:
- extends the testing utils to add `CaptureStd*`, `CaptureLogger`
context managers
- extends the test to run both bs=1 and bs=2
- uses an uneven seqlen to test varlen shards
- flattens the bs and seqlen dims to avoid problems with grad tensor strides
when bs>1: the MLP doesn't care about the bs dimension, so a pretend
`bs*seqlen` seqlen is used instead and the shape is restored at the end for
the grad (see the sketch after this list).
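A tiny sketch of that flattening trick, assuming a position-wise `mlp` callable:
```python
import torch

def tiled_mlp_friendly(mlp, x: torch.Tensor) -> torch.Tensor:
    # The MLP is applied per position, so merge batch and sequence dims, compute,
    # then restore the original shape (keeps grad strides well-behaved for bs>1).
    bs, seqlen, hidden = x.shape
    y = mlp(x.reshape(bs * seqlen, hidden))
    return y.reshape(bs, seqlen, -1)
```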
---------
Signed-off-by: Stas Bekman <stas@stason.org>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
It looks like my TiledMLP was working correctly only for batch_size=1;
this fixes it to work with any bs.
Thanks to @winglian for detecting the problem and sending me an easy
repro.
---------
Signed-off-by: Stas Bekman <stas@stason.org>
`TestParamPartitioningSkipInit` throws the following error.
```
====================================== short test summary info ======================================
FAILED test_zero.py::TestParamPartitioningSkipInit::test[dtype1] - RuntimeError: mat1 and mat2 must have the same dtype, but got Half and BFloat16
========= 1 failed, 204 passed, 66 skipped, 15 deselected, 5 warnings in 2305.03s (0:38:25) =========
```
The test always sets the model's dtype to `torch.bfloat16` and ignores
the test parameter `dtype` when bfloat16 is supported. This causes a
dtype mismatch when `dtype=torch.float16` is given as the test parameter
because the data loader respects the test parameter dtype.
---------
Signed-off-by: Masahiro Tanaka <mtanaka@microsoft.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Relaxing the tolerance values to enable the below unit test, with FP16
data type on ROCm
`unit/runtime/half_precision/test_fp8.py::TestFp8ComposabilityAcrossZero::test[fp16]
`
```
# Relax tolerance only for ROCm + FP16
if is_rocm_pytorch() and model_dtype == torch.float16:
rtol, atol = 3e-07, 3e-05
```
cc: @jithunnair-amd
DeepSpeed supports mixed precision training, but the behavior is
different from `torch.autocast`. DeepSpeed maintains parameters and
gradients both in FP32 and a lower precision (FP16/BF16) (NVIDIA Apex
AMP style) and computes all modules in the lower precision while
`torch.autocast` maintains parameters in FP32 but computes only certain
operators in the lower precision.
This leads to differences in:
- performance: `torch.autocast` needs downcast in forward/backward
- memory usage: DeepSpeed needs more memory to keep copies of parameters
and gradients in lower precision
- accuracy: `torch.autocast` has a list of modules that can safely be
computed in lower precision. Some precision-sensitive operators (e.g.
softmax) are computed in FP32.
To align DeepSpeed's behavior with `torch.autocast` when necessary, this
PR adds integration of `torch.autocast` with ZeRO. Here is an
example of the configuration.
```json
"torch_autocast": {
"enabled": true,
"dtype": "bfloat16",
"lower_precision_safe_modules": ["torch.nn.Linear", "torch.nn.Conv2d"]
}
```
Each configuration works as follows:
- `enabled`: Enable the integration with `torch.autocast` if this is set
to `True`. You don't need to call `torch.autocast` in your code. The
grad scaler is also applied in the DeepSpeed optimizer.
- `dtype`: lower precision dtype passed to `torch.autocast`. Gradients
for allreduce (reduce-scatter) and parameters for allgather (only for
ZeRO3) of `lower_precision_safe_modules` are also downcasted to this
dtype.
- `lower_precision_safe_modules`: Downcast for allreduce
(reduce-scatter) and allgather (ZeRO3) are applied only to modules
specified in this list. (The precision for PyTorch operators in
forward/backward follows `torch.autocast`'s policy, not this list.) You
can set names of classes with their packages. If you don't set this
item, DeepSpeed uses the default list: `[torch.nn.Linear,
torch.nn.Conv1d, torch.nn.Conv2d, torch.nn.Conv3d]`.
Note that we only maintain FP32 parameters with this feature enabled.
For consistency, you cannot enable `fp16` or `bf16` in the DeepSpeed config.
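A minimal initialization sketch using the config above (the toy model, optimizer settings, and batch size here are placeholders):
```python
import torch
import deepspeed

ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "optimizer": {"type": "Adam", "params": {"lr": 1e-3}},
    "zero_optimization": {"stage": 1},
    "torch_autocast": {
        "enabled": True,
        "dtype": "bfloat16",
        "lower_precision_safe_modules": ["torch.nn.Linear", "torch.nn.Conv2d"],
    },
    # Note: fp16/bf16 sections must stay disabled when torch_autocast is enabled.
}

model = torch.nn.Linear(16, 16)
engine, optimizer, _, _ = deepspeed.initialize(
    model=model, model_parameters=model.parameters(), config=ds_config)
```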
---------
Signed-off-by: Masahiro Tanaka <mtanaka@microsoft.com>
Signed-off-by: Fabien Dupont <fdupont@redhat.com>
Signed-off-by: Olatunji Ruwase <olruwase@microsoft.com>
Signed-off-by: Logan Adams <loadams@microsoft.com>
Signed-off-by: inkcherry <mingzhi.liu@intel.com>
Signed-off-by: Omar Elayan <oelayan@habana.ai>
Signed-off-by: Roman Fitzjalen <romaactor@gmail.com>
Signed-off-by: Hongwei <hongweichen@microsoft.com>
Signed-off-by: shaomin <wukon1992@gmail.com>
Signed-off-by: Stas Bekman <stas@stason.org>
Signed-off-by: siqi <siqi@tecorigin.com>
Signed-off-by: Wei Wu <wuwei211x@gmail.com>
Signed-off-by: ShellyNR <shelly.nahir@live.biu.ac.il>
Signed-off-by: Lai, Yejing <yejing.lai@intel.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Fabien Dupont <fabiendupont@fabiendupont.fr>
Co-authored-by: Liangliang Ma <1906710196@qq.com>
Co-authored-by: inkcherry <mingzhi.liu@intel.com>
Co-authored-by: Omar Elayan <142979319+oelayan7@users.noreply.github.com>
Co-authored-by: Stas Bekman <stas00@users.noreply.github.com>
Co-authored-by: Roman Fitzjalen <romaactor@gmail.com>
Co-authored-by: Ramya Ramineni <62723901+rraminen@users.noreply.github.com>
Co-authored-by: Guanhua Wang <alexwgh333@gmail.com>
Co-authored-by: root <root@ftqtmec25000000.taxzvufipdhelhupulxcbvr15f.ux.internal.cloudapp.net>
Co-authored-by: Hongwei Chen <33092912+hwchen2017@users.noreply.github.com>
Co-authored-by: Joe Mayer <114769929+jomayeri@users.noreply.github.com>
Co-authored-by: wukong1992 <wukong1992@users.noreply.github.com>
Co-authored-by: shaomin <wukon1992@gmail.com>
Co-authored-by: loadams <loadams@users.noreply.github.com>
Co-authored-by: siqi654321 <siqi202311@163.com>
Co-authored-by: siqi <siqi@tecorigin.com>
Co-authored-by: Wei Wu <45323446+U-rara@users.noreply.github.com>
Co-authored-by: Shelly Nahir <73890534+ShellyNR@users.noreply.github.com>
Co-authored-by: snahir <snahir@habana.ai>
Co-authored-by: Yejing-Lai <yejing.lai@intel.com>
Co-authored-by: Siddharth Singh <siddharth9820@gmail.com>
Co-authored-by: Olatunji Ruwase <tjruwase@gmail.com>
The project has been renamed at the last moment, so this PR is adapting
to that change.
There are no code changes in this PR, just docs.
---------
Signed-off-by: Stas Bekman <stas@stason.org>
This is the Deepspeed counterpart of
https://github.com/snowflakedb/ArcticTraining/pull/45 - as the new
feature(s) require changes on both sides.
For PR reviewers:
Readiness status:
- [x] Code
- [x] Tests
- [ ] Docs - working on it
Features:
- [x] add support for delaying grad addition via
`param.ds_grad_is_ready` flag (used when performing tiled compute in an
autograd function)
- [x] add light sp-only mpu version (Jeff Rasley)
- [x] improved debug
- [x] added `all_gather_object` to `dist`
- [x] `UlyssesSPAttentionHF` (port of UlyssesAttention from
Megatron-Deepspeed plus modern MHA-variations)
- [x] `UlyssesSPDataLoaderAdapter` - DL adapter to shard the normal DL
batches to be used by `UlyssesSPAttentionHF`
- [x] `SequenceTiledCompute` - generic autograd function to perform
compute after tiling on the sequence dimension
- [x] `TiledMLP` - a specific autograd function to perform tiled MLP
(it's much easier to understand before trying to grok
`SequenceTiledCompute`)
- [x] added a differentiable `_DimZeroAllToAll` (Samyam Rajbhandari)
- [x] torch-dist-check now allows `torch.distributed.nn` (which is
needed since deepspeed's dist is not up to date with
`torch.distributed.nn`)
---------
Signed-off-by: Stas Bekman <stas.bekman@snowflake.com>
Signed-off-by: Stas Bekman <stas@stason.org>
Co-authored-by: Stas Bekman <stas.bekman@snowflake.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
Co-authored-by: Olatunji Ruwase <tunji.ruwase@snowflake.com>
This is a follow-up to https://github.com/deepspeedai/DeepSpeed/pull/923.
My original code was a copy from transformers, which has a different fs
layout, and I missed that, so this PR fixes it to actually do the
right thing.
Now you can have multiple clones of DeepSpeed, and the tests will
automatically use the local repo rather than the pre-installed DeepSpeed.
These days fp16 is barely ever used, so we should be testing bf16
instead of fp16 where possible.
I had to fix a bunch of tests to adapt to this change, and fixed a few
bugs along the way.
---------
Signed-off-by: Stas Bekman <stas.bekman@snowflake.com>
Co-authored-by: Olatunji Ruwase <tunji.ruwase@snowflake.com>
Co-authored-by: Stas Bekman <stas.bekman@snowflake.com>
## Description
This PR fixes an issue where gradient clipping modifications are not
reflected in the global gradient norm calculation when CPU offloading is
enabled. The issue occurs because the `averaged_gradients` are not being
updated with the clipped gradients when CPU offloading is active.
## Problem
When using CPU offloading with gradient clipping:
1. The gradients are successfully clipped using `safe_set_local_grad`
2. However, the `_global_grad_norm` calculation still uses the original
unclipped gradients.
3. This leads to incorrect gradient norm reporting and potential issues
with gradient clipping effectiveness
## Solution
The fix ensures that the `averaged_gradients` are properly updated with
the clipped gradients when CPU offloading is enabled, similar to how it
works when CPU offloading is disabled.
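A sketch of the clipping pattern this fix targets (assumes `engine` was already built via `deepspeed.initialize` with ZeRO and CPU offload; `safe_get_local_grad`/`safe_set_local_grad` are the accessors referenced above):
```python
from deepspeed.utils import safe_get_local_grad, safe_set_local_grad

def clip_local_grads(engine, clip_value: float = 1.0):
    # Clip each locally-held grad partition and write it back; with this fix the
    # clipped values also feed the _global_grad_norm computation under CPU offload.
    for param in engine.module.parameters():
        grad = safe_get_local_grad(param)
        if grad is not None:
            safe_set_local_grad(param, grad.clamp(-clip_value, clip_value))
```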
## Testing
The fix has been tested with:
- CPU offloading enabled and disabled
- Different gradient clipping values
- A simple model with linear layers
- Both FP16 and BF16
## Related Issues
Fixes #7292
---------
Signed-off-by: Naveenraj Kamalakannan <therealnaveenkamal@gmail.com>
Some params are one-dimensional; this PR adds support for these params.
Resolves #7249
```log
param.shape torch.Size([768, 1536])
param.shape torch.Size([768])
...
```
```log
with deepspeed.module_inject.layers.GatherReplacedLayerParams([param], model, enabled=True):
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "deepspeed/module_inject/layers.py", line 359, in __enter__
self.params[0].gather_params(self.params)
File "torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "deepspeed/module_inject/layers.py", line 473, in gather_params
param.shape[1],
~~~~~~~~~~~^^^
IndexError: tuple index out of range
```
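The fix boils down to guarding the 1-D case before indexing `param.shape[1]`; a hypothetical sketch of that guard (not the actual `gather_params` code):
```python
import torch

def gathered_shape(param: torch.Tensor, world_size: int) -> tuple:
    # Biases / LayerNorm weights are 1-D and have no param.shape[1], which is
    # what raised the IndexError above; branch on ndim before indexing.
    if param.dim() == 1:
        return (param.shape[0],)  # 1-D params: no second dim to touch
    # Partition dimension chosen arbitrarily here, purely for illustration.
    return (param.shape[0], param.shape[1] * world_size)
```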
---------
Signed-off-by: Hollow Man <hollowman@opensuse.org>
Signed-off-by: inkcherry <mingzhi.liu@intel.com>
Co-authored-by: Hongwei Chen <33092912+hwchen2017@users.noreply.github.com>
Co-authored-by: inkcherry <mingzhi.liu@intel.com>
This PR introduces *DeepCompile*, a new feature that efficiently
integrates compiler optimizations with other DeepSpeed features.
DeepCompile utilizes torch's dynamo to capture the computation graph and
modifies it to incorporate DeepSpeed’s optimizations seamlessly.
Currently, DeepCompile supports ZeRO-1 and ZeRO-3, with enhancements
such as proactive prefetching and selective unsharding to improve
performance.
(More details will be added later.)
---------
Signed-off-by: Masahiro Tanaka <mtanaka@microsoft.com>
Signed-off-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: zafarsadiq <zafarsadiq120@gmail.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
ZeRO3 requires explicit cleanup in tests when reusing the environment.
This PR adds `destroy` calls to the tests to free memory and avoid
potential errors due to memory leaks.
Signed-off-by: Masahiro Tanaka <mtanaka@microsoft.com>
Support training multiple models, such as in
[HF](https://huggingface.co/docs/accelerate/en/usage_guides/deepspeed_multiple_model)
Here is an update on supporting multiple DS engines with a single
loss.backward(). The main message is that I think we can support this.
First, some context. The backward pass in ZeRO is complicated because the
optimizations/features require special handling of gradients, such as:
1. Gradient partitioning
2. Overlapping backward and reduction
3. Upcasting for fp32 grad accumulation
So, we created engine.backward(loss) as a wrapper function to give us
fine-grained control over backward, as below:
```python
def backward(loss):
backward_prologue() # setup logic for special gradient handling
loss.backward()
backward_epilogue() # cleanup/teardown logic
```
As demonstrated by @muellerzr, this approach breaks down when the loss
originates from multiple DS engines. Our proposed solution is to use
backward hooks on the module to launch backward_prologue() and
backward_epilogue() (see the sketch below). Specifically,
1. A backward pre-hook on engine.module launches backward_prologue()
before any module gradient is created.
2. A backward post-hook on engine.module launches backward_epilogue()
after all module gradients are created.
We plan for this solution to preserve BC, i.e., engine.backward() will
remain correct for single engine scenarios.
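A sketch of that hook-based plan using standard PyTorch module hooks (`engine._backward_prologue()` is the temporary helper mentioned below; the epilogue call is an assumed counterpart, not a final API):
```python
import torch

def install_multi_engine_backward_hooks(engine):
    # Fires before any gradient of engine.module is produced.
    def _pre(module, grad_output):
        engine._backward_prologue()

    # Fires once grads w.r.t. the module's inputs exist, i.e. roughly when the
    # module's backward pass has finished.
    def _post(module, grad_input, grad_output):
        engine._backward_epilogue()  # assumed counterpart to the prologue

    engine.module.register_full_backward_pre_hook(_pre)
    engine.module.register_full_backward_hook(_post)
```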
The current status is that (1) is completed, while (2) is in progress.
To unblock e2e testing for multi-engine scenarios, since there are
probably other issues, we have temporarily added
engine._backward_prologue(). You can try this out via the following
artifacts.
1. Simple multi-engine test code:
https://gist.github.com/tjruwase/f1adccf087b8fa269ffce2ab91c4f1c6#file-multi_engine-py
2. DS branch:
https://github.com/microsoft/DeepSpeed/tree/olruwase/zero_multi_models
---------
Signed-off-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Stas Bekman <stas00@users.noreply.github.com>