559 Commits

7cb1b88ec4 Add ZenFlow code for Stage 3 (#7516)
This PR completes the ZenFlow integration for DeepSpeed ZeRO Stage 3. 

Highlights:

- ZenFlowSelectiveAdamW_stage3: Optimizer with importance-aware
selective parameter updates for ZeRO Stage 3.
- ZenFlowZeroOptimizer_Stage3: Full Stage 3 optimizer integration with
partitioned parameters and CPU offload.
- Configurable via ZenFlowConfig, fully integrated with
DeepSpeedZeroConfig for Stage 3.
- Unit tests for Stage 3 cases ensuring correctness and compatibility.

Note: Integration with ZeRO Stages 1 & 2 was introduced in #7391.

---------

Signed-off-by: Yusen Wu <xrn4ub@virginia.edu>
Co-authored-by: Ma, Guokai <guokai.ma@intel.com>
Co-authored-by: Olatunji Ruwase <tunji.ruwase@snowflake.com>
Co-authored-by: Tingfeng Lan <erc8gx@virginia.edu>
2025-10-13 12:19:18 -04:00
71d077da73 Enable grad scaler for ZeRO-0 + torch.autocast path (#7619)
Currently, the DeepSpeed engine does not enable the grad scaler for the
ZeRO-0 and `torch.autocast` path, even when dtype is set to `fp16`. This
leads to errors in tests when we replace our hard-coded tolerances with
PyTorch’s [standard
tolerances](https://docs.pytorch.org/docs/stable/testing.html#torch.testing.assert_close)
(Thank you @stas00 for your suggestion regarding the previous PR.)

This PR enables the grad scaler for this path to improve accuracy, and
refactors the tests to simplify validation by using
`torch.testing.assert_close`. The tests now rely on PyTorch’s standard
(and stricter) tolerances, and they still pass.
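
As a quick illustration of the simplified validation (a minimal, hypothetical sketch; the variable names are placeholders, not the test's actual code):

```python
import torch

# Compare DeepSpeed losses against baseline losses using PyTorch's standard,
# dtype-dependent tolerances instead of hard-coded ones.
baseline_losses = [torch.tensor(2.3079), torch.tensor(1.9871)]
ds_losses = [torch.tensor(2.3079), torch.tensor(1.9871)]
for expected, actual in zip(baseline_losses, ds_losses):
    torch.testing.assert_close(actual, expected)  # raises with a detailed mismatch report
```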

---------

Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
Co-authored-by: Olatunji Ruwase <tunji.ruwase@snowflake.com>
2025-10-04 13:21:08 +00:00
7d9a2f2bf3 Improve leaf module interface (enable via config, relax matching criteria, add document, etc.) (#7604)
This PR improves the usability of the leaf module feature.

Here are the changes:
- Allow enabling the leaf module via both the DeepSpeed config and APIs.
- Relax matching criteria to support class-based matching.
- Support multiple ways of specifying the target module: class, class
name (with or without package name), module name, or suffix.
- Add documentation to the training guide, including config snippets and
explanations of default behavior.
- Add default classes (e.g., Mixtral, Qwen2/Qwen3) for which the leaf module
feature is enabled automatically. (Requests to add more classes are welcome.)
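
For reference, a minimal sketch of the API-based path (assuming the `deepspeed.utils.set_z3_leaf_modules` helper; `model` and `ds_config` stand in for the user's model and DeepSpeed config, and the config-based keys added by this PR are not shown):

```python
import deepspeed
from deepspeed.utils import set_z3_leaf_modules
from transformers.models.mixtral.modeling_mixtral import MixtralSparseMoeBlock

# Mark the MoE blocks as ZeRO-3 "leaf" modules so their parameters are
# gathered as a unit instead of being hooked per submodule.
set_z3_leaf_modules(model, [MixtralSparseMoeBlock])
engine, *_ = deepspeed.initialize(model=model, config=ds_config)
```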

---------

Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
Co-authored-by: Olatunji Ruwase <tunji.ruwase@snowflake.com>
2025-10-03 09:45:28 +00:00
82a9db7eba Show mismatching values when DeepCompile test fails (#7618)
This PR improves the error message shown when a DeepCompile test fails.

Tests of DeepCompile occasionally fail
([example](https://github.com/deepspeedai/DeepSpeed/actions/runs/18160078309/job/51688736712?pr=7604))
because of mismatching loss values.
To make sure this is not a synchronization bug that causes `nan` loss
values, the change in this PR shows the mismatching values. We can
consider increasing the tolerances once we confirm the mismatch is
reasonable.

---------

Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
2025-10-03 05:23:13 -04:00
4efd7eca73 DeepCompile: Fuse allgather and downcast (#7588)
With autocast enabled, a majority of weights are downcast before being
used in calculations. Today zero3_compile gathers the FP32 weights
before they are downcast. That is sub-optimal because FP32 weights
consume more allgather bandwidth and take more time to downcast.

To reduce communication and downcast time, fuse the allgather and downcast
in the dc ops. The target type is now passed to allgather_param() and
prefetch_params_fused(), which downcast the (partial) weights before
launching the allgathers.

This corresponds to issue 1 of #7577.

Tested with
https://gist.github.com/eternalNight/3c2cf8c703f1e9e7742d3b7f9e1edae3
(run with `deepspeed --num_gpus=N this_file.py -c -p -m 23` to collect
torch and memory profiles, and with DINOV2_DEPTH = SIGLIP_DEPTH = 3,
LLAMA2_DEPTH = 4 for faster compilation) on a 5090 (which has limited
inter-GPU bandwidth): time per step decreases from 438ms to 337ms and
peak GPU memory usage from 9.5GB to 8.5GB.

Profiles of a single step before this PR (screenshots):
- https://github.com/user-attachments/assets/d9fe5296-7731-4542-924b-421ff7415054
- https://github.com/user-attachments/assets/aa192802-8633-4e36-b2c4-f28b1b432663

After this PR (screenshots):
- https://github.com/user-attachments/assets/18a0e09c-155b-4783-adb5-b4d36c5c3691
- https://github.com/user-attachments/assets/16a2ca74-8a89-4db9-9b68-81844295c61b

This PR also reduces peak memory usage because `fast_free_schedule()`
currently arranges param allgathers and downcasts at the beginning of the
graph. While the original FP32 params can be freed early, all
FP16/BF16-casted params are kept in GPU memory at the beginning of the
backward graph, leading to a higher peak in memory usage.

P.S. Probably due to organization branch rule settings, I can't find an
option to allow reviewers to modify the branch, so I'll update the branch
per reviewers' comments and rebase if needed.

Signed-off-by: Junjie Mao <junjie.mao@linux.alibaba.com>
2025-09-29 03:15:33 +00:00
91d14527b6 Fix the universal checkpoint issue for stage3 when there are multiple subgroups. (#7585)
**Describe the bug**

When the model is large and there are multiple subgroups, running
ds_to_universal.py fails with the error log below:

```
*** 1. Extracting ZeRO fragments
  0%|                                                     | 0/1 [00:03<?, ?it/s]
Traceback (most recent call last):
  File "/work/zhengchenyu/ai-project/qwen3/scripts/ds_to_universal_example.py", line 21, in <module>
    main()
  File "/work/zhengchenyu/ai-project/qwen3/scripts/ds_to_universal_example.py", line 18, in main
    ds_to_universal_main(args)
  File "/opt/conda/lib/python3.11/site-packages/deepspeed/checkpoint/ds_to_universal.py", line 523, in main
    _extract_zero_shard_files_stage3(args, optim_files, param_shapes, dp_degree, temp_dir)
  File "/opt/conda/lib/python3.11/site-packages/deepspeed/checkpoint/ds_to_universal.py", line 375, in _extract_zero_shard_files_stage3
    _do_parallel_work(do_work, list(range(dp_degree)), args.num_extract_workers)
  File "/opt/conda/lib/python3.11/site-packages/deepspeed/checkpoint/ds_to_universal.py", line 359, in _do_parallel_work
    results.append(do_work(work))
                   ^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/deepspeed/checkpoint/ds_to_universal.py", line 167, in extract_zero_shards_stage3
    dump_param_fragment(temp_dir, 0, dp_index, state_key, flat_state[state_key], name, offset,
  File "/opt/conda/lib/python3.11/site-packages/deepspeed/checkpoint/ds_to_universal.py", line 194, in dump_param_fragment
    state_flat_tensor = state_flat_tensor.narrow(0, offset, numel).clone()
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: start (0) + length (155582464) exceeds dimension size (74499072).

```

**To Reproduce**
Steps to reproduce the behavior:
1. Use a large model, or set sub_group_size to a lower value; then train
and save the model.
2. Run ds_to_universal.py

**The reason**

I found that the previous stage3 universal checkpoint implementation did
not take subgroups into account. I also found the following problems
during debugging.

* It cannot handle multiple sub-groups, which results in data loss.
* When load_checkpoint is True, all processes save to the same zero model
checkpoint file. If multiple processes write at the same time, the file
gets corrupted; file corruption was occasionally observed during testing.

Related issue: #7584

---------

Co-authored-by: Olatunji Ruwase <tunji.ruwase@snowflake.com>
2025-09-27 17:39:43 +00:00
af56ed4d37 SuperOffload Release (#7559)
This PR introduces **SuperOffload**—an optimizer designed for Superchips
(Nvidia GH200 & GB200, AMD MI300A) with high CPU–GPU bandwidth. It
enables **full fine-tuning** of **GPT-OSS-20B, Qwen3-14B, and Phi-4** on
a single GH200 GPU, achieving up to **~500 TFLOPS**, using Hugging Face
Transformers and DeepSpeed—no custom modeling code required.

SuperOffload extends ZeRO-Offload with fine-grained control and CPUAdam
rollback utilities, allowing GPU execution to overlap with CPUAdam. This
reduces GPU idle time and improves overall efficiency.

Key changes:
- New SuperOffloadOptimizer_Stage3 optimizer.
- C++/CUDA binding for adam_rollback to revert one optimization step.
- Config additions including super_offload and cpuadam_cores_perc.
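
A hedged sketch of how the new config additions might appear in a DeepSpeed config (the exact placement and values are assumptions based on the description above, not the documented schema):

```json
{
  "zero_optimization": {
    "stage": 3,
    "offload_optimizer": {
      "device": "cpu",
      "super_offload": true,
      "cpuadam_cores_perc": 0.8
    }
  }
}
```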

A detailed blog and tutorial will be available soon.

---------

Co-authored-by: Olatunji Ruwase <tunji.ruwase@snowflake.com>
2025-09-24 13:09:23 +00:00
17d80ce440 Deepcompile: Make size of activation to free configurable (#7582)
In deepcompile free-activation mode, only activations larger than a
threshold are eagerly freed. The threshold is hardcoded today and thus
may not be suitable in all cases.

This PR first generalizes the dc.init() interface to take the whole
compile_config object, and then converts the threshold into a config
item.

This corresponds to issue 3 of #7577.

---------

Signed-off-by: Junjie Mao <junjie.mao@linux.alibaba.com>
2025-09-24 01:37:46 +00:00
325c6c5e9c DeepCompile ZeRO-3: robust allgather for uneven shards; fix profiling meta key (max_mem) (#7489)

---------

Signed-off-by: Abhishek <dalakotiashu150@gmail.com>
Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
Co-authored-by: Abhishek <dalakotiashu150@gmail.com>
Co-authored-by: Masahiro Tanaka <mtanaka@anyscale.com>
2025-09-22 16:45:00 -07:00
2585881ae9 Make Muon optimizer easier to enable (#7555)
The original Muon optimizer PR
(https://github.com/deepspeedai/DeepSpeed/pull/7509) requires the user to
explicitly set the `use_muon` flag on `model.parameters()`, as shown in the
test
https://github.com/deepspeedai/DeepSpeed/blob/master/tests/unit/ops/muon/test_muon.py#L27.

This PR integrates the setting of `use_muon` into DeepSpeed before engine
initialization, which makes the Muon optimizer easier to use. The user only
needs to change the optimizer in `config.json` from `AdamW` to `Muon`; no
code changes are required. This resolves
https://github.com/deepspeedai/DeepSpeed/issues/7552.
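
A minimal sketch of the config-only switch described above (the optimizer params shown are illustrative placeholders):

```json
{
  "optimizer": {
    "type": "Muon",
    "params": {
      "lr": 0.001,
      "momentum": 0.95
    }
  }
}
```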

---------

Signed-off-by: Ma, Guokai <guokai.ma@intel.com>
Co-authored-by: Olatunji Ruwase <tunji.ruwase@snowflake.com>
Co-authored-by: Masahiro Tanaka <81312776+tohtana@users.noreply.github.com>
2025-09-17 09:52:11 -04:00
b9bd03a2ec Move modal tests to tests/v1 (#7557)
This PR moves active tests under `tests/unit/v1` to clarify which tests
are run on modal.

---------

Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
2025-09-12 17:28:47 -04:00
b82ef716c8 Improve error message and reduce validation in autocast test (#7547)
This PR improves error logging and relaxes loss value checks in the
autocast test.

Previously, the test displayed error messages and mismatched loss values
on all ranks, even if the mismatch only occurred on some ranks. This was
confusing, since logs from other ranks could appear correct. This PR
changes the behavior so that error messages are shown only on the ranks
where the mismatch occurs.

Additionally, this PR skips loss value validation for
`test_lower_precision_model`, where we intentionally use a different
communication dtype from the baseline (standard PyTorch autocast).

---------

Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
Co-authored-by: Stas Bekman <stas00@users.noreply.github.com>
2025-09-05 07:04:18 +00:00
66bf2a642d Relax restrictions of torch.autocast integration (#7543)
This PR relaxes two restrictions on torch.autocast in the DeepSpeed
engine:

1) Nesting torch.autocast
Currently, we do not expect `torch.autocast` to be used outside the
DeepSpeed engine. Here is the current behavior:
- If `torch.autocast` is enabled in the DeepSpeed config and the engine
detects it is also enabled outside, a warning is displayed.
- If it is disabled in the config, the engine raises an error.

This design prevents the following usage:
```python
with torch.autocast(...):
    logits = deepspeed_model(...)
    loss = criteria_fn(logits)
```
In this case, we also want to apply autocast to `criteria_fn`. With the
current behavior, we would need to move `deepspeed_model(...)` outside the
`torch.autocast` context, leading to inconsistent code between DeepSpeed
and non-DeepSpeed setups. (This cannot be handled with the `enabled` arg of
`torch.autocast`.)

Change in this PR:
`torch.autocast` outside the DeepSpeed engine is ignored, and
- If `torch_autocast` is enabled in the config, DeepSpeed will follow
that setting.
- If it is disabled, DeepSpeed falls back to its own mixed-precision
support (or FP32).

In these cases, the DeepSpeed engine shows a message explaining the
behavior.

2) Model’s dtype

Previously, DeepSpeed assumed the model’s dtype must be FP32 when
`torch.autocast` was enabled. However, models with lower-precision
parameters (e.g., BF16) can also be used with autocast. For example, if
both the model and `torch.autocast` use BF16, autocast will upcast
precision-sensitive ops as needed.

Change in this PR:
Removed the assertion that restricted the model’s dtype to FP32.

This PR also adds and updates tests to cover these new behaviors.

---------

Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
2025-09-03 12:15:10 -07:00
889f0ead27 Enable non-ZeRO mode (#7515)
- Enabled via `stage=0`, which corresponds to DDP.
- Remove the hardwired path to bf16_optimizer.
- Enable `torch.autocast` for DDP training.
- Enable native mixed-precision DDP for bfloat16.
- Update torch.autocast and native mixed-precision UTs.
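
An illustrative config for this mode (a minimal sketch assuming bf16 with ZeRO disabled; not taken from the PR itself):

```json
{
  "train_micro_batch_size_per_gpu": 1,
  "zero_optimization": {
    "stage": 0
  },
  "bf16": {
    "enabled": true
  }
}
```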

<img width="976" height="184" alt="image"
src="https://github.com/user-attachments/assets/92904cdc-e312-46a4-943f-011eb5ab146a"
/>

---------

Signed-off-by: Olatunji Ruwase <tunji.ruwase@snowflake.com>
Co-authored-by: Stas Bekman <stas00@users.noreply.github.com>
2025-08-27 14:07:29 -04:00
66ad278048 Enabling Muon Optimizer in DeepSpeed (#7509)
Authorship: @pengdurice and @PKUWZP 

Related Issue: #7438

# Introduction

[Muon](https://arxiv.org/abs/2502.16982), a new optimizer that has
recently attracted the community’s attention, shows promising results in
training large language models. Adding the Muon optimizer to DeepSpeed, a
popular OSS framework for large-scale training and inference, is
critically important for DeepSpeed users and developers. There has been a
[PR](https://github.com/deepspeedai/DeepSpeed/pull/7454) attempting the
adoption (huge thanks to @qimcis), which is a good starting point. It
still requires more substantial effort to make it fully compatible and
work within DeepSpeed. We are publishing this PR to fully enable Muon
optimizer capabilities for DeepSpeed.

# Issues and solutions
## Issues
1. With stage 1, 2 or 3, the optimizer states are partitioned within the
same data-parallel group. This means each process already handles only
part of the model parameters, so there is no need for the DP solution in
the
[code](https://github.com/KellerJordan/Muon/blob/master/muon.py#L195).
2. The parameters (and the gradients) are flattened to a 1D vector before
being used in the optimizer, thus nullifying the major hypothesis of the
Muon optimizer: it works by orthogonalizing the update for each matrix
(dim >= 2); a sketch of this orthogonalization follows below.
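
For context, here is a minimal sketch of the orthogonalized update at the heart of Muon (a Newton-Schulz iteration; the coefficients follow the commonly cited reference implementation and are not taken from this PR):

```python
import torch

def newton_schulz_orthogonalize(grad: torch.Tensor, steps: int = 5) -> torch.Tensor:
    """Approximately orthogonalize a 2D update matrix, as Muon does per parameter."""
    a, b, c = 3.4445, -4.7750, 2.0315   # quintic iteration coefficients
    x = grad.bfloat16()
    x = x / (x.norm() + 1e-7)           # normalize so the spectral norm is bounded
    transposed = grad.size(0) > grad.size(1)
    if transposed:
        x = x.T
    for _ in range(steps):
        A = x @ x.T
        B = b * A + c * A @ A
        x = a * x + B @ x
    if transposed:
        x = x.T
    return x.to(grad.dtype)
```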

## Solutions
To solve these issues, we propose this new PR in which:
1. We simplify the Muon code by
[removing](https://github.com/deepspeedai/DeepSpeed/compare/master...pengdurice:DeepSpeed:peng-add-muon-v1#diff-c9052994e41caee9ca88363749c10af08655f8019f08dc971c018663d25a3712R22)
the partitioning and Muon update logic.

2. We
[move](https://github.com/deepspeedai/DeepSpeed/compare/master...pengdurice:DeepSpeed:peng-add-muon-v1#diff-99dcf26ea2876ff5bbf05b5165c4133eaa0d0f36b170685643c2f7e2eb566addR1867)
the Muon update to the
[get_flat_partition](https://github.com/deepspeedai/DeepSpeed/compare/master...pengdurice:DeepSpeed:peng-add-muon-v1#diff-99dcf26ea2876ff5bbf05b5165c4133eaa0d0f36b170685643c2f7e2eb566addR1848)
function of the stage 1 and 2 DeepSpeedZeroOptimizer, in which
per-parameter gradients are collected before being flattened and used by
the optimizer to update the model parameters. Since each parameter is
still in its original shape, we can easily apply the Muon updates.
3. We also save the momentum buffer in the optimizer's state so that
convergence remains smooth after resuming from saved checkpoints.
4. We added comprehensive unit tests to validate the Muon optimizer's
correctness and functionality.

# Future directions and roadmap
In the future, several follow-up works are of interest:
- [ ] Create a CPU offload version.
- [ ] Apply Muon to Stage 3
- [ ] Use the highly optimized version of Adam for the Adam part of
MuonWithAuxAdam optimizer.
- [ ] More efficient implementations e.g. a) add specialized kernels for
Newton-Schulz iteration and muon updates; b) parallelize updates for the
parameters (currently, each parameter is updated separately and
sequentially)

---------

Co-authored-by: Peng Du <pedu@linkedin.com>
Co-authored-by: pengdurice <pengduhit@gmail.com>
Co-authored-by: Zhipeng Wang <zhipengbayern@gmail.com>
Co-authored-by: Olatunji Ruwase <tunji.ruwase@snowflake.com>
2025-08-26 18:34:35 -07:00
bc8c0db3b4 Support DeepSpeed offload and reload states with ZeRO1 and ZeRO2 (#7421)
Please refer to https://github.com/deepspeedai/DeepSpeed/issues/7251
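
A hedged usage sketch of the state offload/reload APIs this PR extends to ZeRO-1/2 (the argument names follow the existing ZeRO-3 API; treat them as assumptions here, and `engine` is a `deepspeed.initialize()`-returned engine):

```python
from deepspeed.runtime.zero.offload_config import OffloadDeviceEnum

# Move optimizer/gradient/parameter states off the GPU while it is needed
# for something else, then bring them back before resuming training.
engine.offload_states(device=OffloadDeviceEnum.cpu, pin_memory=True, non_blocking=False)
...  # run other GPU work here
engine.reload_states(non_blocking=False)
```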

---------

Signed-off-by: lym <letusgo126@126.com>
Signed-off-by: Max Kovalenko <mkovalenko@habana.ai>
Signed-off-by: Alex Kiefer <alexkiefer51@gmail.com>
Signed-off-by: Stas Bekman <stas@stason.org>
Signed-off-by: Sam Foreman <saforem2@gmail.com>
Signed-off-by: Stas Bekman <stas.bekman@snowflake.com>
Signed-off-by: huanyuqu <yc37960@um.edu.mo>
Signed-off-by: weeknan <zhounan0431@163.com>
Signed-off-by: WoosungMyung <dntjd517@naver.com>
Signed-off-by: Nir Sonnenschein <nsonnenschein@habana.ai>
Signed-off-by: Junjie Mao <banxing.mjj@alibaba-inc.com>
Signed-off-by: vinceliu <lpnpcs@gmail.com>
Signed-off-by: Tingfeng Lan <erc8gx@virginia.edu>
Signed-off-by: Olatunji Ruwase <tunji.ruwase@snowflake.com>
Signed-off-by: Olatunji Ruwase <tjruwase@gmail.com>
Signed-off-by: Tunji Ruwase <tunji.ruwase@snowflake.com>
Signed-off-by: Yao, Matrix <matrix.yao@intel.com>
Signed-off-by: Yusen Wu <xrn4ub@virginia.edu>
Signed-off-by: cyy <cyyever@outlook.com>
Co-authored-by: Max Kovalenko <mkovalenko@habana.ai>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Masahiro Tanaka <81312776+tohtana@users.noreply.github.com>
Co-authored-by: Alexander Kiefer <56556451+alexk101@users.noreply.github.com>
Co-authored-by: Olatunji Ruwase <tunji.ruwase@snowflake.com>
Co-authored-by: Hongwei Chen <33092912+hwchen2017@users.noreply.github.com>
Co-authored-by: Stas Bekman <stas00@users.noreply.github.com>
Co-authored-by: Sam Foreman <saforem2@gmail.com>
Co-authored-by: Stas Bekman <stas.bekman@snowflake.com>
Co-authored-by: huanyuqu <55744355+huanyuqu@users.noreply.github.com>
Co-authored-by: weeknan <57584045+weeknan@users.noreply.github.com>
Co-authored-by: Olatunji Ruwase <tjruwase@gmail.com>
Co-authored-by: Aurick Qiao <aurick.qiao@snowflake.com>
Co-authored-by: Logan Adams <loadams@microsoft.com>
Co-authored-by: Zhipeng Wang <zhipeng.rainbowserie@gmail.com>
Co-authored-by: WoosungMyung <115716986+WoosungMyung@users.noreply.github.com>
Co-authored-by: Nir Sonnenschein <nsonnenschein@habana.ai>
Co-authored-by: Junjie Mao <junjie.mao@hotmail.com>
Co-authored-by: Junjie Mao <banxing.mjj@alibaba-inc.com>
Co-authored-by: lpnpcs <lpnpcs@vip.qq.com>
Co-authored-by: Ma, Guokai <guokai.ma@gmail.com>
Co-authored-by: Tingfeng Lan <tafflann@outlook.com>
Co-authored-by: Rui Yan <49115835+yanrui27@users.noreply.github.com>
Co-authored-by: Feng Yunlong <20281571+AlongWY@users.noreply.github.com>
Co-authored-by: Yao Matrix <matrix.yao@intel.com>
Co-authored-by: Tingfeng Lan <erc8gx@virginia.edu>
Co-authored-by: Yusen Wu <xrn4ub@virginia.edu>
Co-authored-by: Yuanyuan Chen <cyyever@outlook.com>
Co-authored-by: Michael Wyatt <michael.wyatt@snowflake.com>
2025-08-20 22:03:26 +00:00
1c03d1b1bb Fix invalid f-strings (#7457)
Fix invalid f-strings detected by ruff.

---------

Signed-off-by: cyy <cyyever@outlook.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Olatunji Ruwase <tunji.ruwase@snowflake.com>
Co-authored-by: Michael Wyatt <michael.wyatt@snowflake.com>
2025-08-16 18:22:19 +00:00
1d7b90adc4 Add Zenflow code for Stage 1 & 2 (#7391)
This PR adds ZenFlow, an importance-aware offloaded training framework
for DeepSpeed ZeRO. ZenFlow enables multi-step overlap between
computation and communication during offloaded training, improving GPU
utilization and reducing stalls.

Highlights:
- New ZenFlow optimizers (ZenFlowCPUAdam, ZenFlowSelectiveAdamW)
- ZenFlowZeroOptimizer for ZeRO Stage 1/2 integration
- Configurable via ZenFlowConfig, integrated with DeepSpeedZeroConfig
- Unit tests and documentation included
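
A purely illustrative sketch of how the ZenFlowConfig integration could look in a DeepSpeed config; the key and field names below are hypothetical placeholders, not the schema defined by this PR:

```json
{
  "zero_optimization": {
    "stage": 2,
    "offload_optimizer": {
      "device": "cpu"
    },
    "zenflow": {
      "topk_ratio": 0.1,
      "update_interval": 4
    }
  }
}
```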

Note: This PR focuses on Stage 1 and 2 integration. Stage 3 support will
be introduced in a follow-up PR.

---------

Signed-off-by: Tingfeng Lan <erc8gx@virginia.edu>
Signed-off-by: Yusen Wu <xrn4ub@virginia.edu>
Signed-off-by: Olatunji Ruwase <tunji.ruwase@snowflake.com>
Co-authored-by: Yusen Wu <xrn4ub@virginia.edu>
Co-authored-by: Masahiro Tanaka <81312776+tohtana@users.noreply.github.com>
Co-authored-by: Olatunji Ruwase <tunji.ruwase@snowflake.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Olatunji Ruwase <tjruwase@gmail.com>
Co-authored-by: Guokai Ma <guokai.ma@gmail.com>
2025-08-15 17:32:22 +00:00
a12de38db6 Modal CI (#7289)
This is an initial effort to migrate CI onto Modal infra. This PR
creates two new workflows that run on Modal:
1. modal-torch-latest: a subset of nv-torch-latest-v100 that includes
`tests/unit/runtime/zero/test_zero.py`.
2. modal-accelerate: a full copy of nv-accelerate-v100. 

Follow up PRs will selectively migrate relevant workflows onto Modal.

---------

Signed-off-by: Olatunji Ruwase <tunji.ruwase@snowflake.com>
Signed-off-by: Olatunji Ruwase <tjruwase@gmail.com>
Signed-off-by: Tunji Ruwase <tunji.ruwase@snowflake.com>
Co-authored-by: Stas Bekman <stas00@users.noreply.github.com>
Co-authored-by: Logan Adams <loadams@microsoft.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Olatunji Ruwase <tjruwase@gmail.com>
Co-authored-by: Stas Bekman <stas.bekman@snowflake.com>
2025-08-11 20:13:39 +00:00
3292e07a92 adding TiledFusedLogitsLoss (#7437)
This PR adds `TiledFusedLogitsLoss` for an efficient fused logits+loss
computation - this version pre-calculates grads in `forward`, avoiding
recomputation in the backward (similar to the Liger-Kernel
implementation).

---------

Signed-off-by: Stas Bekman <stas@stason.org>
Co-authored-by: Aurick Qiao <aurick.qiao@snowflake.com>
2025-07-30 14:15:33 -04:00
092625c7eb Fix: Adapt Llama injection policy for newer transformers versions (#7443)
This PR fixes an `AttributeError` that occurs during
`deepspeed.init_inference` when using kernel injection
(`replace_with_kernel_inject=True`) with Llama models from recent
versions of `transformers`.

**The Bug:**

In newer `transformers` versions (e.g., `4.53.3`), configurations like
`num_heads` and `rope_theta` were moved from direct attributes of the
`LlamaAttention` module into a nested `config` object.

The current DeepSpeed injection policy tries to access these attributes
from their old, direct location, causing the initialization to fail with
an `AttributeError: 'LlamaAttention' object has no attribute
'num_heads'`.

**The Solution:**

This change updates the Llama injection logic to be more robust:
1. It first tries to read attributes like `num_heads` from the new
`config` object location.
2. If that fails, it falls back to the legacy direct attribute path.
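
A minimal sketch of the fallback pattern described above (illustrative only; the helper name is hypothetical and the attribute names follow the PR text, not the exact DeepSpeed code):

```python
def _get_attn_attr(attn_module, name):
    # Newer transformers (e.g., 4.53.x): the value lives on the nested config object.
    config = getattr(attn_module, "config", None)
    if config is not None and hasattr(config, name):
        return getattr(config, name)
    # Older transformers: fall back to the legacy direct attribute on the module.
    return getattr(attn_module, name)

# num_heads = _get_attn_attr(llama_attention, "num_heads")
# rope_theta = _get_attn_attr(llama_attention, "rope_theta")
```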

---------

Signed-off-by: huanyuqu <yc37960@um.edu.mo>
2025-07-26 14:27:33 -07:00
43f00ba31c Remove additional unused tests (human-eval) (#7445) 2025-07-24 13:16:57 -07:00
c2bb53f20f TiledMLP + SequenceTiledCompute: improve the bs>1 use-case (#7422)
Improved TiledMLP and SequenceTiledCompute for bs>1

This PR:
- extends the testing utils to add the `CaptureStd*` and `CaptureLogger`
context managers
- extends the test to run both bs=1 and bs=2
- uses an uneven seqlen to test varlen shards
- flattens the bs and seqlen dims to avoid problems with grad tensor
strides when bs>1 - the MLP doesn't care about the bs dimension, so a
pretend `bs*seqlen` seqlen is used instead and the shape is restored at
the end for the grad.

---------

Signed-off-by: Stas Bekman <stas@stason.org>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
2025-07-16 09:30:08 -07:00
2790220d31 [TiledMLP]: fix for bs>1 (#7412)
It looks like my TiledMLP was working correctly only for batch_size=1;
this fixes it to work with any bs.

Thanks to @winglian for detecting the problem and sending me an easy
repro.

---------

Signed-off-by: Stas Bekman <stas@stason.org>
2025-07-07 14:34:03 -04:00
e049bbfa1c Fix dtype mismatch in TestParamPartitioningSkipInit (#7377)
`TestParamPartitioningSkipInit` throws the following error.
```
====================================== short test summary info ======================================
FAILED test_zero.py::TestParamPartitioningSkipInit::test[dtype1] - RuntimeError: mat1 and mat2 must have the same dtype, but got Half and BFloat16
========= 1 failed, 204 passed, 66 skipped, 15 deselected, 5 warnings in 2305.03s (0:38:25) =========
```

The test always sets the model's dtype to `torch.bfloat16` and ignores
the test parameter `dtype` when bfloat16 is supported. This causes a
dtype mismatch when `dtype=torch.float16` is given as the test parameter
because the data loader respects the test parameter dtype.

---------

Signed-off-by: Masahiro Tanaka <mtanaka@microsoft.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
2025-06-23 16:44:32 +00:00
d33baf009b Relax tolerances for FP8 unit test only for ROCm + FP16 (#7373)
Relax the tolerance values to enable the unit test below with the FP16
data type on ROCm:


`unit/runtime/half_precision/test_fp8.py::TestFp8ComposabilityAcrossZero::test[fp16]
`

```
        # Relax tolerance only for ROCm + FP16
        if is_rocm_pytorch() and model_dtype == torch.float16:
            rtol, atol = 3e-07, 3e-05
```

cc: @jithunnair-amd
2025-06-20 15:24:43 +00:00
ed5f737554 Enable torch.autocast with ZeRO (#6993)
DeepSpeed supports mixed precision training, but the behavior is
different from `torch.autocast`. DeepSpeed maintains parameters and
gradients both in FP32 and a lower precision (FP16/BF16) (NVIDIA Apex
AMP style) and computes all modules in the lower precision while
`torch.autocast` maintains parameters in FP32 but computes only certain
operators in the lower precision.
This leads to differences in:
- performance: `torch.autocast` needs downcast in forward/backward
- memory usage: DeepSpeed needs more memory to keep copies of parameters
and gradients in lower precision
- accuracy: `torch.autocast` has a list of modules that can safely be
computed in lower precision. Some precision-sensitive operators (e.g.
softmax) are computed in FP32.

To align DeepSpeed's behavior with `torch.autocast` when necessary, this
PR adds integration with `torch.autocast` to ZeRO. Here is an example of
the configuration.

```json
"torch_autocast": {
  "enabled": true,
  "dtype": "bfloat16",
  "lower_precision_safe_modules": ["torch.nn.Linear", "torch.nn.Conv2d"]
}
```

Each configuration works as follows:
- `enabled`: Enable the integration with `torch.autocast` if this is set
to `True`. You don't need to call `torch.autocast` in your code. The
grad scaler is also applied in the DeepSpeed optimizer.
- `dtype`: lower precision dtype passed to `torch.autocast`. Gradients
for allreduce (reduce-scatter) and parameters for allgather (only for
ZeRO3) of `lower_precision_safe_modules` are also downcasted to this
dtype.
- `lower_precision_safe_modules`: Downcast for allreduce
(reduce-scatter) and allgather (ZeRO3) are applied only to modules
specified in this list. (The precision for PyTorch operators in
forward/backward follows `torch.autocast`'s policy, not this list.) You
can set names of classes with their packages. If you don't set this
item, DeepSpeed uses the default list: `[torch.nn.Linear,
torch.nn.Conv1d, torch.nn.Conv2d, torch.nn.Conv3d]`.

Note that we only maintain FP32 parameters with this feature enabled.
For consistency, you cannot enable `fp16` or `bf16` in DeepSpeed config.

---------

Signed-off-by: Masahiro Tanaka <mtanaka@microsoft.com>
Signed-off-by: Fabien Dupont <fdupont@redhat.com>
Signed-off-by: Olatunji Ruwase <olruwase@microsoft.com>
Signed-off-by: Logan Adams <loadams@microsoft.com>
Signed-off-by: inkcherry <mingzhi.liu@intel.com>
Signed-off-by: Omar Elayan <oelayan@habana.ai>
Signed-off-by: Roman Fitzjalen <romaactor@gmail.com>
Signed-off-by: Hongwei <hongweichen@microsoft.com>
Signed-off-by: shaomin <wukon1992@gmail.com>
Signed-off-by: Stas Bekman <stas@stason.org>
Signed-off-by: siqi <siqi@tecorigin.com>
Signed-off-by: Wei Wu <wuwei211x@gmail.com>
Signed-off-by: ShellyNR <shelly.nahir@live.biu.ac.il>
Signed-off-by: Lai, Yejing <yejing.lai@intel.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Fabien Dupont <fabiendupont@fabiendupont.fr>
Co-authored-by: Liangliang Ma <1906710196@qq.com>
Co-authored-by: inkcherry <mingzhi.liu@intel.com>
Co-authored-by: Omar Elayan <142979319+oelayan7@users.noreply.github.com>
Co-authored-by: Stas Bekman <stas00@users.noreply.github.com>
Co-authored-by: Roman Fitzjalen <romaactor@gmail.com>
Co-authored-by: Ramya Ramineni <62723901+rraminen@users.noreply.github.com>
Co-authored-by: Guanhua Wang <alexwgh333@gmail.com>
Co-authored-by: root <root@ftqtmec25000000.taxzvufipdhelhupulxcbvr15f.ux.internal.cloudapp.net>
Co-authored-by: Hongwei Chen <33092912+hwchen2017@users.noreply.github.com>
Co-authored-by: Joe Mayer <114769929+jomayeri@users.noreply.github.com>
Co-authored-by: wukong1992 <wukong1992@users.noreply.github.com>
Co-authored-by: shaomin <wukon1992@gmail.com>
Co-authored-by: loadams <loadams@users.noreply.github.com>
Co-authored-by: siqi654321 <siqi202311@163.com>
Co-authored-by: siqi <siqi@tecorigin.com>
Co-authored-by: Wei Wu <45323446+U-rara@users.noreply.github.com>
Co-authored-by: Shelly Nahir <73890534+ShellyNR@users.noreply.github.com>
Co-authored-by: snahir <snahir@habana.ai>
Co-authored-by: Yejing-Lai <yejing.lai@intel.com>
Co-authored-by: Siddharth Singh <siddharth9820@gmail.com>
Co-authored-by: Olatunji Ruwase <tjruwase@gmail.com>
2025-06-19 21:36:03 +00:00
10b106619a Don't break set_start_method (#7349)
Fix #7347

---------

Signed-off-by: Tunji Ruwase <tunji@ip-172-31-0-204.us-west-2.compute.internal>
Signed-off-by: Olatunji Ruwase <tjruwase@gmail.com>
Co-authored-by: Tunji Ruwase <tunji@ip-172-31-0-204.us-west-2.compute.internal>
2025-06-11 13:00:58 -04:00
d7e60fd0f6 s/UlyssesPlus/Arctic Long Sequence Training (ALST)/ (#7348)
The project was renamed at the last moment, so this PR adapts to that
change.

There are no code changes in this PR, just docs.

---------

Signed-off-by: Stas Bekman <stas@stason.org>
2025-06-10 17:10:54 -07:00
e440506bee Improve overflow handling in ZeRO (#6976)
Fix #5241: Improve overflow handling 
- [x] ZeRO 1
- [x] ZeRO 2
- [ ] ZeRO 3
- [ ] BF16Optimizer

Enable pydantic configuration for mixed precision
- [x] bf16
- [x] fp16

---------

Signed-off-by: Olatunji Ruwase <olruwase@microsoft.com>
Signed-off-by: Fabien Dupont <fdupont@redhat.com>
Signed-off-by: Logan Adams <loadams@microsoft.com>
Signed-off-by: inkcherry <mingzhi.liu@intel.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Xinyu Lian <lian7@illinois.edu>
Co-authored-by: loadams <loadams@users.noreply.github.com>
Co-authored-by: Omar Elayan <142979319+oelayan7@users.noreply.github.com>
Co-authored-by: Fabio Geraci <118277438+fabiosanger@users.noreply.github.com>
Co-authored-by: Sam Foreman <saforem2@gmail.com>
Co-authored-by: Fabien Dupont <fabiendupont@fabiendupont.fr>
Co-authored-by: Liangliang Ma <1906710196@qq.com>
Co-authored-by: inkcherry <mingzhi.liu@intel.com>
Co-authored-by: Logan Adams <loadams@microsoft.com>
Co-authored-by: Olatunji Ruwase <tunji.ruwase@snowflake.com>
2025-06-09 17:30:51 +00:00
770967f5f0 fixed: Modified the topkgating function and modified the test_moe file for testing (#7163)
Since the previous PR encountered the DCO problem, which could not be
resolved for some reason, I resubmitted an identical PR without the
problem.

---------

Signed-off-by: xiongjyu <xiongjyu@gmail.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Olatunji Ruwase <tjruwase@gmail.com>
Co-authored-by: Masahiro Tanaka <81312776+tohtana@users.noreply.github.com>
2025-06-06 16:42:41 -07:00
24a1d8f936 DeepNVMe update (#7215)
- FastPersist
- ZeRO-Inference+SGLang

---------

Signed-off-by: Olatunji Ruwase <olruwase@microsoft.com>
Signed-off-by: Olatunji Ruwase <tunji.ruwase@snowflake.com>
Co-authored-by: jerryyangli <jerryyangli@gmail.com>
Co-authored-by: Yang Li <yangli2@microsoft.com>
Co-authored-by: Guanhua Wang <alexwgh333@gmail.com>
Co-authored-by: Connor Holmes <connorholmes@microsoft.com>
Co-authored-by: Bing Xie <67908712+xiexbing@users.noreply.github.com>
Co-authored-by: cassieesvelt <73311224+cassieesvelt@users.noreply.github.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Michael Wyatt <michaelwyatt@microsoft.com>
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
Co-authored-by: swli <47371259+lucasleesw@users.noreply.github.com>
Co-authored-by: Cheng Li <pistasable@gmail.com>
Co-authored-by: Molly Smith <112220543+molly-smith@users.noreply.github.com>
Co-authored-by: Ubuntu <jomayeri@microsoft.com>
Co-authored-by: Olatunji Ruwase <tunji.ruwase@snowflake.com>
Co-authored-by: Zhipeng Wang <zhipeng.rainbowserie@gmail.com>
2025-06-06 18:49:41 -04:00
8b03a35646 Fix ci hang in torch2.7& improve ut (#7321)
Fix the CI hang and improve the unit test.

---------

Signed-off-by: inkcherry <mingzhi.liu@intel.com>
Co-authored-by: Olatunji Ruwase <tunji.ruwase@snowflake.com>
2025-06-02 15:41:10 -04:00
4d00b38ada Ulysses SP for HF Integration (#7268)
This is the Deepspeed counterpart of
https://github.com/snowflakedb/ArcticTraining/pull/45 - as the new
feature(s) require changes on both sides.


For PR reviewers: 

Readiness status:
- [x] Code
- [x] Tests
- [ ] Docs - working on it


Features:

- [x] add support for delaying grad addition via
`param.ds_grad_is_ready` flag (used when performing tiled compute in an
autograd function)
- [x] add light sp-only mpu version (Jeff Rasley)
- [x] improved debug
- [x] added `all_gather_object` to `dist`
- [x] `UlyssesSPAttentionHF` (port of UlyssesAttention from
Megatron-Deepspeed plus modern MHA-variations)
- [x] `UlyssesSPDataLoaderAdapter` - DL adapter to shard the normal DL
batches to be used by `UlyssesSPAttentionHF`
- [x] `SequenceTiledCompute` - generic autograd function to perform
compute after tiling on the sequence dimension
- [x] `TiledMLP` - a specific autograd function to perform tiled MLP
(it's much easier to understand before trying to grok
`SequenceTiledCompute`)
- [x] added a differentiable `_DimZeroAllToAll` (Samyam Rajbhandari)
- [x] torch-dist-check now allows `torch.distributed.nn` (which is
needed since deepspeed's dist is not up to date with
`torch.distributed.nn`)

---------

Signed-off-by: Stas Bekman <stas.bekman@snowflake.com>
Signed-off-by: Stas Bekman <stas@stason.org>
Co-authored-by: Stas Bekman <stas.bekman@snowflake.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
Co-authored-by: Olatunji Ruwase <tunji.ruwase@snowflake.com>
2025-05-31 07:25:23 +00:00
b66c81077c anchor transformers version (#7316)
Some features require minimum transformers versions, so let's start
anchoring, and fix tests that break with recent transformers.

I need this fixed to be able to merge
https://github.com/deepspeedai/DeepSpeed/pull/7268 which requires
`transformers>=4.51.3`

---------

Signed-off-by: Stas Bekman <stas.bekman@snowflake.com>
Co-authored-by: Stas Bekman <stas.bekman@snowflake.com>
2025-05-29 06:19:54 +00:00
e5afb88760 tests/conftest.py: automatically add local deepspeed repo when running tests (#7317)
This is a follow up to https://github.com/deepspeedai/DeepSpeed/pull/923

My original code was a copy from transformers, which has a different fs
layout, and I missed that. So this PR fixes it to actually do the right
thing.

Now you can have multiple clones of deepspeed and the tests will use the
local repo automatically and not the pre-installed deepspeed.
2025-05-28 23:32:49 +00:00
b4cc079eee CI: prefer bf16 over fp16 (#7304)
These days fp16 is barely ever used, so we should be testing bf16
instead of fp16 where possible.

This required fixing a bunch of tests to adapt to the change, plus a few
bugs along the way.

---------

Signed-off-by: Stas Bekman <stas.bekman@snowflake.com>
Co-authored-by: Olatunji Ruwase <tunji.ruwase@snowflake.com>
Co-authored-by: Stas Bekman <stas.bekman@snowflake.com>
2025-05-28 00:49:21 +00:00
b9af5d8d61 Fix: Update grad norm calculation for CPU offload (#7302)
## Description
This PR fixes an issue where gradient clipping modifications are not
reflected in the global gradient norm calculation when CPU offloading is
enabled. The issue occurs because the `averaged_gradients` are not being
updated with the clipped gradients when CPU offloading is active.

## Problem
When using CPU offloading with gradient clipping:
1. The gradients are successfully clipped using `safe_set_local_grad`
2. However, the `_global_grad_norm` calculation still uses the original
unclipped gradients.
3. This leads to incorrect gradient norm reporting and potential issues
with gradient clipping effectiveness

## Solution
The fix ensures that the `averaged_gradients` are properly updated with
the clipped gradients when CPU offloading is enabled, similar to how it
works when CPU offloading is disabled.
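
For context, a hedged sketch of the user-side clipping flow that surfaced the issue (assuming the documented `deepspeed.utils` local-gradient accessors; this is not code from this PR):

```python
from deepspeed.utils import safe_get_local_grad, safe_set_local_grad

def clip_local_grads(model, clip_value: float):
    # Clip each partitioned (local) gradient in place; with CPU offload enabled,
    # the fix ensures these clipped values also feed the global grad-norm calculation.
    for param in model.parameters():
        grad = safe_get_local_grad(param)
        if grad is not None:
            safe_set_local_grad(param, grad.clamp(-clip_value, clip_value))
```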

## Testing
The fix has been tested with:
- CPU offloading enabled and disabled
- Different gradient clipping values
- A simple model with linear layers
- Both FP16 and BF16

## Related Issues
Fixes #7292

---------

Signed-off-by: Naveenraj Kamalakannan <therealnaveenkamal@gmail.com>
2025-05-27 12:13:54 +00:00
b666844ffc Fix AutoTP gathering replaced layer params when bias is not None (#7257)
Some params are one-dimensional; this PR adds support for these params.

Resolve #7249

```log
param.shape torch.Size([768, 1536])
param.shape torch.Size([768])
...
```

```log
with deepspeed.module_inject.layers.GatherReplacedLayerParams([param], model, enabled=True):
     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "deepspeed/module_inject/layers.py", line 359, in __enter__
self.params[0].gather_params(self.params)
File "torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
       ^^^^^^^^^^^^^^^^^^^^^
File "deepspeed/module_inject/layers.py", line 473, in gather_params
param.shape[1],
~~~~~~~~~~~^^^
IndexError: tuple index out of range
```

---------

Signed-off-by: Hollow Man <hollowman@opensuse.org>
Signed-off-by: inkcherry <mingzhi.liu@intel.com>
Co-authored-by: Hongwei Chen <33092912+hwchen2017@users.noreply.github.com>
Co-authored-by: inkcherry <mingzhi.liu@intel.com>
2025-05-25 04:03:20 +00:00
0e741714f5 Enable ZeRO set/get APIs for NVMe offload (#7046)
- Extend APIs for
[debugging](https://deepspeed.readthedocs.io/en/latest/zero3.html#debugging)
and
[modifying](https://deepspeed.readthedocs.io/en/latest/zero3.html#modifying-partitioned-states)
ZeRO partitioned states to NVMe offload.
- Add vectorized update API. This is performance-critical for NVMe
offloading scenarios.

---------

Signed-off-by: Olatunji Ruwase <olruwase@microsoft.com>
Signed-off-by: Masahiro Tanaka <mtanaka@microsoft.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Logan Adams <loadams@microsoft.com>
Co-authored-by: Masahiro Tanaka <mtanaka@microsoft.com>
Co-authored-by: Masahiro Tanaka <81312776+tohtana@users.noreply.github.com>
Co-authored-by: Guanhua Wang <alexwgh333@gmail.com>
2025-05-20 00:11:17 +00:00
f45950258b rollback #6726 (#7258)
This PR rolls back #6726, which caused
https://github.com/deepspeedai/DeepSpeed/issues/7116.

---------

Signed-off-by: Guokai Ma <guokai.ma@gmail.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Olatunji Ruwase <tunji.ruwase@snowflake.com>
2025-05-19 04:54:16 +00:00
d46947db4a Temporarily skip AIO tests due to an issue with runners (#7288)
Signed-off-by: Logan Adams <loadams@microsoft.com>
Signed-off-by: Olatunji Ruwase <tunji.ruwase@snowflake.com>
Co-authored-by: Olatunji Ruwase <tunji.ruwase@snowflake.com>
2025-05-18 23:36:06 +00:00
227a60c0c4 DeepCompile for enhanced compiler integration (#7154)
This PR introduces *DeepCompile*, a new feature that efficiently
integrates compiler optimizations with other DeepSpeed features.
DeepCompile utilizes torch's dynamo to capture the computation graph and
modifies it to incorporate DeepSpeed’s optimizations seamlessly.

Currently, DeepCompile supports ZeRO-1 and ZeRO-3, with enhancements
such as proactive prefetching and selective unsharding to improve
performance.
(More details will be added later.)
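
As a rough illustration only (the config keys below are an assumption, not taken from this PR's description), enabling DeepCompile might look something like:

```json
{
  "zero_optimization": {
    "stage": 3
  },
  "bf16": {
    "enabled": true
  },
  "compile": {
    "deepcompile": true
  }
}
```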

---------

Signed-off-by: Masahiro Tanaka <mtanaka@microsoft.com>
Signed-off-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: zafarsadiq <zafarsadiq120@gmail.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
2025-04-16 04:33:53 +00:00
b8cc1eb078 async tp allreduce (#7115)
Signed-off-by: inkcherry <mingzhi.liu@intel.com>
Signed-off-by: Logan Adams <loadams@microsoft.com>
Signed-off-by: Olatunji Ruwase <olruwase@microsoft.com>
Signed-off-by: Shaik Raza Sikander <srsikander@habana.ai>
Signed-off-by: Max Kovalenko <mkovalenko@habana.ai>
Signed-off-by: shaomin <wukon1992@gmail.com>
Signed-off-by: Stas Bekman <stas@stason.org>
Signed-off-by: siqi <siqi@tecorigin.com>
Signed-off-by: Wei Wu <wuwei211x@gmail.com>
Signed-off-by: ShellyNR <shelly.nahir@live.biu.ac.il>
Signed-off-by: Lai, Yejing <yejing.lai@intel.com>
Signed-off-by: Hongwei <hongweichen@microsoft.com>
Signed-off-by: Liang Cheng <astarxp777@gmail.com>
Signed-off-by: A-transformer <astarxp777@gmail.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: A-transformer <cl5743590921@gmail.com>
Co-authored-by: Raza Sikander <srsikander@habana.ai>
Co-authored-by: Max Kovalenko <mkovalenko@habana.ai>
Co-authored-by: wukong1992 <wukong1992@users.noreply.github.com>
Co-authored-by: shaomin <wukon1992@gmail.com>
Co-authored-by: Hongwei Chen <33092912+hwchen2017@users.noreply.github.com>
Co-authored-by: loadams <loadams@users.noreply.github.com>
Co-authored-by: Stas Bekman <stas00@users.noreply.github.com>
Co-authored-by: siqi654321 <siqi202311@163.com>
Co-authored-by: siqi <siqi@tecorigin.com>
Co-authored-by: Wei Wu <45323446+U-rara@users.noreply.github.com>
Co-authored-by: Masahiro Tanaka <81312776+tohtana@users.noreply.github.com>
Co-authored-by: Shelly Nahir <73890534+ShellyNR@users.noreply.github.com>
Co-authored-by: snahir <snahir@habana.ai>
Co-authored-by: Yejing-Lai <yejing.lai@intel.com>
Co-authored-by: A-transformer <astarxp777@gmail.com>
Co-authored-by: Ma, Guokai <guokai.ma@gmail.com>
2025-03-28 22:48:17 +00:00
9ae010e629 Add destroy to tests to free memory (#7160)
ZeRO3 requires explicit cleaning in tests when reusing the environment.
This PR adds `destroy` calls to the tests to free memory and avoid
potential errors due to memory leaks.

Signed-off-by: Masahiro Tanaka <mtanaka@microsoft.com>
2025-03-24 21:49:09 +00:00
2e7f8e580e fix leak of z3 buffer
Signed-off-by: Masahiro Tanaka <mtanaka@microsoft.com>
2025-03-20 18:58:15 +00:00
591d4d4d54 Conditionally quote env vars (#7071)
Resolves #6997 

This PR conditionally quotes environment variable values—only wrapping
those containing special characters (like parentheses) that could
trigger bash errors. Safe values remain unquoted.
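
A small sketch of the general approach (illustrative; the launcher's actual character set and helper names may differ):

```python
import shlex

_SPECIAL_CHARS = set("()<>|&;$`\"' \t\\")

def maybe_quote(value: str) -> str:
    # Quote only values containing characters bash could misinterpret;
    # plain values stay unquoted, as before.
    if any(ch in _SPECIAL_CHARS for ch in value):
        return shlex.quote(value)
    return value

# maybe_quote("simple_value")      -> "simple_value"
# maybe_quote("has (parentheses)") -> "'has (parentheses)'"
```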

---------

Signed-off-by: Saurabh <saurabhkoshatwar1996@gmail.com>
Signed-off-by: Saurabh Koshatwar <saurabhkoshatwar1996@gmail.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
2025-03-17 14:51:27 +00:00
b418cf6c1b Training multiple models (#7018)
Support training multiple models, such as in
[HF](https://huggingface.co/docs/accelerate/en/usage_guides/deepspeed_multiple_model)

Here is an update on supporting multiple DS engines with a single
loss.backward(). The main message is that I think we can support this.
First, some context. Backward pass in ZeRO is complicated because the
optimizations/features require special handling of gradients, such as:

1. Gradient partitioning
2. Overlapping backward and reduction
3. Upcasting for fp32 grad accumulation

So, we created engine.backward(loss) as a wrapper function to provide us
fine-grained control over backward as below

```python
def backward(loss):
 backward_prologue() # setup logic for special gradient handling
 loss.backward()
 backward_epilogue() # cleanup/teardown logic
```

As demonstrated by @muellerzr, this approach breaks down when loss
originates from multiple DS engines. Our proposed solution is to use
backward hooks on the module to launch backward_prologue() and
backward_epilogue() . Specifically,

1. backward pre hook on engine.module to launch backward_prologue()
before any module gradient is created.
2. backward post hook on engine.module to launch backward_epilogue()
after all module gradients are created.

We plan for this solution to preserve BC, i.e., engine.backward() will
remain correct for single engine scenarios.
The current status is that (1) is completed, while (2) is in progress.
To unblock e2e testing for multi-engine scenarios, since there are
probably other issues, we have a temporarily added
engine._backward_prologue() . You can try this out via the following
artifacts.

1. Simple multi-engine test code:
https://gist.github.com/tjruwase/f1adccf087b8fa269ffce2ab91c4f1c6#file-multi_engine-py
2. DS branch:
https://github.com/microsoft/DeepSpeed/tree/olruwase/zero_multi_models

---------

Signed-off-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Stas Bekman <stas00@users.noreply.github.com>
2025-03-11 20:59:23 +00:00
436c126211 Add sequential pytest mark to TestNVMeCheckpointing to resolve pytest forked hangs (#7131)
Signed-off-by: Logan Adams <loadams@microsoft.com>
2025-03-11 17:20:32 +00:00
38327e07f6 Bug Fix for offload_states API (#7050)
@fukun07 and I discovered a bug when using the `offload_states` and
`reload_states` APIs of the Zero3 optimizer. When using grouped
parameters (for example, in weight decay or grouped lr scenarios), the
order of the parameters mapping in `reload_states`
([here](14b3cce4aa/deepspeed/runtime/zero/stage3.py (L2953)))
does not correspond with the initialization of `self.lp_param_buffer`
([here](14b3cce4aa/deepspeed/runtime/zero/stage3.py (L731))),
which leads to misaligned parameter loading. This issue was overlooked
by the corresponding unit tests
([here](https://github.com/deepspeedai/DeepSpeed/blob/master/tests/unit/runtime/zero/test_offload_states.py)),
so we fixed the bug in our PR and added the corresponding unit tests.

---------

Signed-off-by: Wei Wu <wuwei211x@gmail.com>
Co-authored-by: Masahiro Tanaka <81312776+tohtana@users.noreply.github.com>
2025-02-21 12:05:41 +00:00