This PR completes the ZenFlow integration for DeepSpeed ZeRO Stage 3.
Highlights:
- ZenFlowSelectiveAdamW_stage3: Optimizer with importance-aware
selective parameter updates for ZeRO Stage 3.
- ZenFlowZeroOptimizer_Stage3: Full Stage 3 optimizer integration with
partitioned parameters and CPU offload.
- Configurable via ZenFlowConfig, fully integrated with
DeepSpeedZeroConfig for Stage 3.
- Unit tests for Stage 3 cases ensuring correctness and compatibility.
Note: Integration with ZeRO Stages 1 and 2 was introduced in #7391
---------
Signed-off-by: Yusen Wu <xrn4ub@virginia.edu>
Co-authored-by: Ma, Guokai <guokai.ma@intel.com>
Co-authored-by: Olatunji Ruwase <tunji.ruwase@snowflake.com>
Co-authored-by: Tingfeng Lan <erc8gx@virginia.edu>
Currently, the DeepSpeed engine does not enable the grad scaler for the
ZeRO-0 and `torch.autocast` path, even when dtype is set to `fp16`. This
leads to errors in tests when we replace our hard-coded tolerances with
PyTorch’s [standard
tolerances](https://docs.pytorch.org/docs/stable/testing.html#torch.testing.assert_close)
(Thank you @stas00 for your suggestion regarding the previous PR).
This PR enables the grad scaler for this path to improve accuracy, and
refactors the tests to simplify validation by using
`torch.testing.assert_close`. The tests now rely on PyTorch’s standard
(and stricter) tolerances, and they still pass.
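For illustration, the kind of check the refactor converges on (a minimal sketch; the real tests compare engine losses, not constants):
```python
import torch

# torch.testing.assert_close picks default rtol/atol per dtype (e.g. 1e-3/1e-5
# for fp16), so no hand-coded tolerances are needed.
expected = torch.tensor([1.0, 2.0], dtype=torch.float16)
actual = expected.clone()
torch.testing.assert_close(actual, expected)
```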
---------
Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
Co-authored-by: Olatunji Ruwase <tunji.ruwase@snowflake.com>
This PR improves the usability of the leaf module feature.
Here are the changes:
- Allow enabling the leaf module via both the DeepSpeed config and APIs (see the sketch after this list).
- Relax matching criteria to support class-based matching.
- Support multiple ways of specifying the target module: class, class
name (with or without package name), module name, or suffix.
- Add documentation to the training guide, including config snippets and
explanations of default behavior.
- Add default classes (e.g., Mixtral, Qwen2/Qwen3) that automatically
enable the leaf module feature. (Requests to add more classes are welcome.)
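A minimal API-side sketch, assuming the relaxed matching described above (`set_z3_leaf_modules` is the existing helper; the string-based form relies on this PR's new matching, and the model below is only a stand-in):
```python
import torch
from deepspeed.utils import set_z3_leaf_modules

# Stand-in model; in practice you would target e.g. the Mixtral/Qwen MoE block classes.
model = torch.nn.Sequential(torch.nn.Linear(8, 8))

# Class-based matching (existing behavior).
set_z3_leaf_modules(model, [torch.nn.Linear])

# Name/suffix-based matching (relaxed criteria added in this PR; the exact
# accepted forms follow the list above).
set_z3_leaf_modules(model, ["Linear"])
```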
---------
Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
Co-authored-by: Olatunji Ruwase <tunji.ruwase@snowflake.com>
This PR improves the error message shown when a DeepCompile test fails.
Tests of DeepCompile occasionally fail
([example](https://github.com/deepspeedai/DeepSpeed/actions/runs/18160078309/job/51688736712?pr=7604))
because of mismatching loss values.
To make sure this is not a synchronization bug that causes `nan` loss
values, the change in this PR shows the mismatching values. We can
consider increasing the tolerances once we confirm the mismatch is
reasonable.
---------
Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
With autocast enabled, the majority of weights are downcast before being
used in calculations. Today zero3_compile gathers the FP32 weights
before they are downcast. That is sub-optimal because FP32 weights
consume more bandwidth to allgather and take more time to downcast.
To reduce communication and downcast time, this PR fuses allgather and
downcast in the dc ops. The target dtype is now passed to allgather_param() and
prefetch_params_fused(), which downcast the (partial) weights before
launching the allgathers.
This corresponds to issue 1 of #7577.
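A conceptual sketch of the fused path (plain PyTorch for illustration only; the actual change lives in the C++/CUDA dc ops): cast the local shard to the autocast dtype first, then allgather, so the collective moves fewer bytes and no post-gather downcast is needed.
```python
import torch
import torch.distributed as dist

def gather_param_downcast_first(shard_fp32: torch.Tensor, target_dtype=torch.bfloat16):
    # Downcast only the local partition, then gather the already-downcasted shards.
    shard_lp = shard_fp32.to(target_dtype)
    full = torch.empty(dist.get_world_size() * shard_lp.numel(),
                       dtype=target_dtype, device=shard_lp.device)
    dist.all_gather_into_tensor(full, shard_lp)
    return full
```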
Tested with
https://gist.github.com/eternalNight/3c2cf8c703f1e9e7742d3b7f9e1edae3
(run with `deepspeed --num_gpus=N this_file.py -c -p -m 23` to collect
torch and memory profiles, and with DINOV2_DEPTH = SIGLIP_DEPTH = 3,
LLAMA2_DEPTH = 4 for faster compilation) on a 5090 (which has limited
inter-GPU bandwidth): time per step decreases from 438 ms to 337 ms and
peak GPU memory usage from 9.5 GB to 8.5 GB.
Profiles of a single step before this PR:
<img width="1235" height="1029" alt="image"
src="https://github.com/user-attachments/assets/d9fe5296-7731-4542-924b-421ff7415054"
/>
<img width="1466" height="616" alt="image"
src="https://github.com/user-attachments/assets/aa192802-8633-4e36-b2c4-f28b1b432663"
/>
After this PR:
<img width="1218" height="1006" alt="image"
src="https://github.com/user-attachments/assets/18a0e09c-155b-4783-adb5-b4d36c5c3691"
/>
<img width="1537" height="559" alt="image"
src="https://github.com/user-attachments/assets/16a2ca74-8a89-4db9-9b68-81844295c61b"
/>
This PR also reduces peak memory usage because the
`fast_free_schedule()` today always arranges param allgathers and
downcasts at the beginning of the graph. While the original FP32 params
can be freed early, all FP16/BF16-casted params are kept in GPU memory
at the beginning of the backward graph, leading to a higher peak in
memory usage.
P.S. Probably due to organization branch rule settings, I can't find an
option to allow reviewers to modify the branch, so I'll update the
branch per reviewers' comments and rebase if needed.
Signed-off-by: Junjie Mao <junjie.mao@linux.alibaba.com>
**Describe the bug**
When the model is large and there are multiple subgroups, running
ds_to_universal.py fails with the error log below:
```
*** 1. Extracting ZeRO fragments
0%| | 0/1 [00:03<?, ?it/s]
Traceback (most recent call last):
File "/work/zhengchenyu/ai-project/qwen3/scripts/ds_to_universal_example.py", line 21, in <module>
main()
File "/work/zhengchenyu/ai-project/qwen3/scripts/ds_to_universal_example.py", line 18, in main
ds_to_universal_main(args)
File "/opt/conda/lib/python3.11/site-packages/deepspeed/checkpoint/ds_to_universal.py", line 523, in main
_extract_zero_shard_files_stage3(args, optim_files, param_shapes, dp_degree, temp_dir)
File "/opt/conda/lib/python3.11/site-packages/deepspeed/checkpoint/ds_to_universal.py", line 375, in _extract_zero_shard_files_stage3
_do_parallel_work(do_work, list(range(dp_degree)), args.num_extract_workers)
File "/opt/conda/lib/python3.11/site-packages/deepspeed/checkpoint/ds_to_universal.py", line 359, in _do_parallel_work
results.append(do_work(work))
^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/deepspeed/checkpoint/ds_to_universal.py", line 167, in extract_zero_shards_stage3
dump_param_fragment(temp_dir, 0, dp_index, state_key, flat_state[state_key], name, offset,
File "/opt/conda/lib/python3.11/site-packages/deepspeed/checkpoint/ds_to_universal.py", line 194, in dump_param_fragment
state_flat_tensor = state_flat_tensor.narrow(0, offset, numel).clone()
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: start (0) + length (155582464) exceeds dimension size (74499072).
```
**To Reproduce**
Steps to reproduce the behavior:
1. Use a large model, or set sub_group_size to a lower value; then
train and save the model.
2. Run ds_to_universal.py.
**The reason**
I found that the previous Stage 3 universal checkpoint implementation did
not take subgroups into account. I also found the following problems
during debugging:
* It cannot handle multiple sub-groups, which results in data loss.
* When load_checkpoint is True, all processes save to the same ZeRO
model checkpoint file. If multiple processes write at the same time, the
file gets corrupted; file corruption was occasionally observed during
testing.
Related issue: #7584
---------
Co-authored-by: Olatunji Ruwase <tunji.ruwase@snowflake.com>
This PR introduces **SuperOffload**—an optimizer designed for Superchips
(Nvidia GH200 & GB200, AMD MI300A) with high CPU–GPU bandwidth. It
enables **full fine-tuning** of **GPT-OSS-20B, Qwen3-14B, and Phi-4** on
a single GH200 GPU, achieving up to **~500 TFLOPS**, using Hugging Face
Transformers and DeepSpeed—no custom modeling code required.
SuperOffload extends ZeRO-Offload with fine-grained control and CPUAdam
rollback utilities, allowing GPU execution to overlap with CPUAdam. This
reduces GPU idle time and improves overall efficiency.
Key changes:
- New SuperOffloadOptimizer_Stage3 optimizer.
- C++/CUDA binding for adam_rollback to revert one optimization step.
- Config additions including super_offload and cpuadam_cores_perc.
A detailed blog and tutorial will be available soon.
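For orientation, a hypothetical config sketch showing where the new knobs could sit; the key names come from the list above, but their exact placement in the schema is an assumption until the tutorial lands.
```python
ds_config = {
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
        # New options introduced by this PR (placement assumed):
        "super_offload": True,
        "cpuadam_cores_perc": 0.8,
    },
}
```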
---------
Co-authored-by: Olatunji Ruwase <tunji.ruwase@snowflake.com>
In deepcompile free-activation mode, only activations larger than a
threshold are eagerly freed. The threshold is hardcoded today and thus
may not be suitable in all cases.
This PR first generalizes the dc.init() interface to take the whole
compile_config object, and then converts the threshold into a config
item.
This corresponds to issue 3 of #7577.
---------
Signed-off-by: Junjie Mao <junjie.mao@linux.alibaba.com>
This PR moves active tests under `tests/unit/v1` to clarify which tests
are run on modal.
---------
Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
This PR improves error logging and relaxes loss value checks in the
autocast test.
Previously, the test displayed error messages and mismatched loss values
on all ranks, even if the mismatch only occurred on some ranks. This was
confusing, since logs from other ranks could appear correct. This PR
changes the behavior so that error messages are shown only on the ranks
where the mismatch occurs.
Additionally, this PR skips loss value validation for
`test_lower_precision_model`, where we intentionally use a different
communication dtype from the baseline (standard PyTorch autocast).
---------
Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
Co-authored-by: Stas Bekman <stas00@users.noreply.github.com>
This PR relaxes two restrictions on torch.autocast in the DeepSpeed
engine:
1) Nesting torch.autocast
Currently, we do not expect `torch.autocast` to be used outside the
DeepSpeed engine. Here is the current behavior:
- If `torch.autocast` is enabled in the DeepSpeed config and the engine
detects it is also enabled outside, a warning is displayed.
- If it is disabled in the config, the engine raises an error.
This design prevents the following usage:
```python
with torch.autocast(...):
logits = deepspeed_model(...)
loss = criteria_fn(logits)
```
In this case, we also want to apply autocast to `criteria_fn`. With the
current behavior, we would need to move `deepspeed_model(...)` outside the
`torch.autocast` context, leading to inconsistent code between DeepSpeed
and non-DeepSpeed setups. (This cannot be handled with the `enabled` arg of
`torch.autocast`.)
Change in this PR:
`torch.autocast` outside the DeepSpeed engine is ignored, and
- If `torch_autocast` is enabled in the config, DeepSpeed will follow
that setting.
- If it is disabled, DeepSpeed falls back to its own mixed-precision
support (or FP32).
In these cases, the DeepSpeed engine shows a message explaining the
behavior.
2) Model’s dtype
Previously, DeepSpeed assumed the model’s dtype must be FP32 when
`torch.autocast` was enabled. However, models with lower-precision
parameters (e.g., BF16) can also be used with autocast. For example, if
both the model and `torch.autocast` use BF16, autocast will upcast
precision-sensitive ops as needed.
Change in this PR:
Removed the assertion that restricted the model’s dtype to FP32.
This PR also adds and updates tests to cover these new behaviors.
---------
Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
Authorship: @pengdurice and @PKUWZP
Related Issue: #7438
# Introduction
[Muon](https://arxiv.org/abs/2502.16982), a new optimizer that has
recently attracted the community’s attention, shows promising results in
training large language models. Adding the Muon optimizer to DeepSpeed,
a popular OSS framework for large-scale training and inference, is
critically important for DeepSpeed users and developers. There has been
a [PR](https://github.com/deepspeedai/DeepSpeed/pull/7454) attempting
the adoption (huge thanks to @qimcis), which is a good starting point,
but more substantial effort is required to make it fully compatible
and working within DeepSpeed. We are publishing this PR to fully enable
Muon optimizer capabilities for DeepSpeed.
# Issues and solutions
## Issues
1. With stage 1, 2 or 3, the optimizer states will be partitioned within
the same data parallel group. This means that each process is already
handling only parts of the model parameters and there is no need to use
the DP solution as in the
[code](https://github.com/KellerJordan/Muon/blob/master/muon.py#L195).
2. The parameters (and the gradients) will be flattened to a 1D vector
before being used in the optimizer, thus nullifying the major hypothesis
of the Muon optimizer: it works by orthogonalizing the updates for each
matrix (dim >= 2); see the sketch after this list.
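For context, a sketch of the orthogonalization step that requires 2-D shapes, mirroring the quintic Newton-Schulz iteration from the public reference Muon implementation (not DeepSpeed's exact code):
```python
import torch

@torch.no_grad()
def newton_schulz_orthogonalize(G: torch.Tensor, steps: int = 5, eps: float = 1e-7) -> torch.Tensor:
    # Approximately orthogonalize a 2-D update matrix; coefficients follow the
    # reference Muon implementation.
    assert G.ndim == 2, "Muon's update only makes sense for matrices (dim >= 2)"
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G.bfloat16()
    transpose = G.size(0) > G.size(1)
    if transpose:
        X = X.T
    X = X / (X.norm() + eps)  # normalize so the iteration converges
    for _ in range(steps):
        A = X @ X.T
        B = b * A + c * A @ A
        X = a * X + B @ X
    if transpose:
        X = X.T
    return X.to(G.dtype)
```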
## Solutions
To solve the issues, we propose this new PR in which:
1. We simplify the Muon code by
[removing](https://github.com/deepspeedai/DeepSpeed/compare/master...pengdurice:DeepSpeed:peng-add-muon-v1#diff-c9052994e41caee9ca88363749c10af08655f8019f08dc971c018663d25a3712R22)
the partitioning and Muon update logic.
2. We
[move](https://github.com/deepspeedai/DeepSpeed/compare/master...pengdurice:DeepSpeed:peng-add-muon-v1#diff-99dcf26ea2876ff5bbf05b5165c4133eaa0d0f36b170685643c2f7e2eb566addR1867)
the Muon update to the
[get_flat_partition](https://github.com/deepspeedai/DeepSpeed/compare/master...pengdurice:DeepSpeed:peng-add-muon-v1#diff-99dcf26ea2876ff5bbf05b5165c4133eaa0d0f36b170685643c2f7e2eb566addR1848)
function of the stage 1 and 2 DeepSpeedZeroOptimizer, in which per-parameter
gradients are collected before being flattened and used by the optimizer
to update the model parameters. Since each parameter is still in its
original shape, we can easily apply the Muon update.
3. We also save the momentum buffer into the optimizer's state so that
convergence remains smooth after resuming from saved checkpoints.
4. We added comprehensive unit tests to validate Muon Optimizer's
correctness and functionality.
# Future directions and roadmap
In the future, several follow-up works are of interest:
- [ ] Create a CPU offload version.
- [ ] Apply Muon to Stage 3
- [ ] Use the highly optimized version of Adam for the Adam part of
MuonWithAuxAdam optimizer.
- [ ] More efficient implementations, e.g. a) add specialized kernels for
the Newton-Schulz iteration and Muon updates; b) parallelize updates across
parameters (currently, each parameter is updated separately and
sequentially)
---------
Co-authored-by: Peng Du <pedu@linkedin.com>
Co-authored-by: pengdurice <pengduhit@gmail.com>
Co-authored-by: Zhipeng Wang <zhipengbayern@gmail.com>
Co-authored-by: Olatunji Ruwase <tunji.ruwase@snowflake.com>
This PR adds ZenFlow, an importance-aware offloaded training framework
for DeepSpeed ZeRO. ZenFlow enables multi-step overlap between
computation and communication during offloaded training, improving GPU
utilization and reducing stalls.
Highlights:
- New ZenFlow optimizers (ZenFlowCPUAdam, ZenFlowSelectiveAdamW)
- ZenFlowZeroOptimizer for ZeRO Stage 1/2 integration
- Configurable via ZenFlowConfig, integrated with DeepSpeedZeroConfig
- Unit tests and documentation included
Note: This PR focuses on Stage 1 and 2 integration. Stage 3 support will
be introduced in a follow-up PR.
---------
Signed-off-by: Tingfeng Lan <erc8gx@virginia.edu>
Signed-off-by: Yusen Wu <xrn4ub@virginia.edu>
Signed-off-by: Olatunji Ruwase <tunji.ruwase@snowflake.com>
Co-authored-by: Yusen Wu <xrn4ub@virginia.edu>
Co-authored-by: Masahiro Tanaka <81312776+tohtana@users.noreply.github.com>
Co-authored-by: Olatunji Ruwase <tunji.ruwase@snowflake.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Olatunji Ruwase <tjruwase@gmail.com>
Co-authored-by: Guokai Ma <guokai.ma@gmail.com>
This PR adds `TiledFusedLogitsLoss` for an efficient fused logits+loss
computation - this version pre-calculates grads in `forward`, avoiding
recomputation in the backward (similar to the Liger-Kernel
implementation).
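A minimal sketch of the precomputed-grad pattern (illustrative only; the actual `TiledFusedLogitsLoss` also tiles the computation over shards and handles more arguments):
```python
import torch

class FusedLogitsLossSketch(torch.autograd.Function):

    @staticmethod
    def forward(ctx, hidden, weight, labels):
        # Compute logits + loss and their grads right here, so backward can
        # simply replay them instead of rematerializing the logits.
        with torch.enable_grad():
            h = hidden.detach().requires_grad_(True)
            w = weight.detach().requires_grad_(True)
            logits = h @ w.t()
            loss = torch.nn.functional.cross_entropy(logits, labels)
            dh, dw = torch.autograd.grad(loss, (h, w))
        ctx.save_for_backward(dh, dw)
        return loss.detach()

    @staticmethod
    def backward(ctx, grad_output):
        dh, dw = ctx.saved_tensors
        return grad_output * dh, grad_output * dw, None

# Usage: loss = FusedLogitsLossSketch.apply(hidden_states, lm_head_weight, labels)
```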
---------
Signed-off-by: Stas Bekman <stas@stason.org>
Co-authored-by: Aurick Qiao <aurick.qiao@snowflake.com>
This PR fixes an `AttributeError` that occurs during
`deepspeed.init_inference` when using kernel injection
(`replace_with_kernel_inject=True`) with Llama models from recent
versions of `transformers`.
**The Bug:**
In newer `transformers` versions (e.g., `4.53.3`), configurations like
`num_heads` and `rope_theta` were moved from direct attributes of the
`LlamaAttention` module into a nested `config` object.
The current DeepSpeed injection policy tries to access these attributes
from their old, direct location, causing the initialization to fail with
an `AttributeError: 'LlamaAttention' object has no attribute
'num_heads'`.
**The Solution:**
This change updates the Llama injection logic to be more robust:
1. It first tries to read attributes like `num_heads` from the new
`config` object location.
2. If that fails, it falls back to the legacy direct attribute path.
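The fallback reduces to a pattern like this hypothetical helper (the helper and the attribute names shown are examples, not the exact code in this PR):
```python
def get_attn_attr(attn_module, names, default=None):
    # Prefer the nested config object used by newer transformers releases,
    # then fall back to the legacy direct attribute on the module.
    config = getattr(attn_module, "config", None)
    for owner in (config, attn_module):
        if owner is None:
            continue
        for name in names:
            if hasattr(owner, name):
                return getattr(owner, name)
    return default

# e.g. num_heads = get_attn_attr(attn, ("num_attention_heads", "num_heads"))
```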
---------
Signed-off-by: huanyuqu <yc37960@um.edu.mo>
Improved TiledMLP and SequenceTiledCompute for bs>1
This PR:
- extends the testing utils to add `CaptureStd*`, `CaptureLogger`
context managers
- extends the test to run both bs=1 and bs=2
- uses an uneven seqlen to test varlen shards
- flattens the bs and seqlen dims to avoid problems with grad tensor strides
when bs>1: the MLP doesn't care about the bs dimension, so a pretend
`bs*seqlen` seqlen is used instead and the shape is restored at the end for
the grad (see the sketch after this list).
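A tiny sketch of that flattening trick, assuming a position-wise `mlp` callable:
```python
import torch

def tiled_mlp_friendly(mlp, x: torch.Tensor) -> torch.Tensor:
    # The MLP is applied per position, so merge batch and sequence dims, compute,
    # then restore the original shape (keeps grad strides well-behaved for bs>1).
    bs, seqlen, hidden = x.shape
    y = mlp(x.reshape(bs * seqlen, hidden))
    return y.reshape(bs, seqlen, -1)
```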
---------
Signed-off-by: Stas Bekman <stas@stason.org>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
It looks like my TiledMLP was working correctly only for batch_size=1;
this fixes it to work with any bs.
Thanks to @winglian for detecting the problem and sending me an easy
repro.
---------
Signed-off-by: Stas Bekman <stas@stason.org>
`TestParamPartitioningSkipInit` throws the following error.
```
====================================== short test summary info ======================================
FAILED test_zero.py::TestParamPartitioningSkipInit::test[dtype1] - RuntimeError: mat1 and mat2 must have the same dtype, but got Half and BFloat16
========= 1 failed, 204 passed, 66 skipped, 15 deselected, 5 warnings in 2305.03s (0:38:25) =========
```
The test always sets the model's dtype to `torch.bfloat16` and ignores
the test parameter `dtype` when bfloat16 is supported. This causes a
dtype mismatch when `dtype=torch.float16` is given as the test parameter
because the data loader respects the test parameter dtype.
---------
Signed-off-by: Masahiro Tanaka <mtanaka@microsoft.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Relaxing the tolerance values to enable the below unit test, with FP16
data type on ROCm
`unit/runtime/half_precision/test_fp8.py::TestFp8ComposabilityAcrossZero::test[fp16]
`
```
# Relax tolerance only for ROCm + FP16
if is_rocm_pytorch() and model_dtype == torch.float16:
rtol, atol = 3e-07, 3e-05
```
cc: @jithunnair-amd
DeepSpeed supports mixed precision training, but the behavior is
different from `torch.autocast`. DeepSpeed maintains parameters and
gradients both in FP32 and a lower precision (FP16/BF16) (NVIDIA Apex
AMP style) and computes all modules in the lower precision while
`torch.autocast` maintains parameters in FP32 but computes only certain
operators in the lower precision.
This leads to differences in:
- performance: `torch.autocast` needs downcast in forward/backward
- memory usage: DeepSpeed needs more memory to keep copies of parameters
and gradients in lower precision
- accuracy: `torch.autocast` has a list of modules that can safely be
computed in lower precision. Some precision-sensitive operators (e.g.
softmax) are computed in FP32.
To align DeepSpeed's behavior with `torch.autocast` when necessary, this
PR adds integration of `torch.autocast` with ZeRO. Here is an
example of the configuration.
```json
"torch_autocast": {
"enabled": true,
"dtype": "bfloat16",
"lower_precision_safe_modules": ["torch.nn.Linear", "torch.nn.Conv2d"]
}
```
Each configuration works as follows:
- `enabled`: Enable the integration with `torch.autocast` if this is set
to `True`. You don't need to call `torch.autocast` in your code. The
grad scaler is also applied in the DeepSpeed optimizer.
- `dtype`: lower precision dtype passed to `torch.autocast`. Gradients
for allreduce (reduce-scatter) and parameters for allgather (only for
ZeRO3) of `lower_precision_safe_modules` are also downcasted to this
dtype.
- `lower_precision_safe_modules`: Downcast for allreduce
(reduce-scatter) and allgather (ZeRO3) are applied only to modules
specified in this list. (The precision for PyTorch operators in
forward/backward follows `torch.autocast`'s policy, not this list.) You
can set names of classes with their packages. If you don't set this
item, DeepSpeed uses the default list: `[torch.nn.Linear,
torch.nn.Conv1d, torch.nn.Conv2d, torch.nn.Conv3d]`.
Note that we only maintain FP32 parameters with this feature enabled.
For consistency, you cannot enable `fp16` or `bf16` in the DeepSpeed config.
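A minimal initialization sketch using the config above (the toy model, optimizer settings, and batch size here are placeholders):
```python
import torch
import deepspeed

ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "optimizer": {"type": "Adam", "params": {"lr": 1e-3}},
    "zero_optimization": {"stage": 1},
    "torch_autocast": {
        "enabled": True,
        "dtype": "bfloat16",
        "lower_precision_safe_modules": ["torch.nn.Linear", "torch.nn.Conv2d"],
    },
    # Note: fp16/bf16 sections must stay disabled when torch_autocast is enabled.
}

model = torch.nn.Linear(16, 16)
engine, optimizer, _, _ = deepspeed.initialize(
    model=model, model_parameters=model.parameters(), config=ds_config)
```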
---------
Signed-off-by: Masahiro Tanaka <mtanaka@microsoft.com>
Signed-off-by: Fabien Dupont <fdupont@redhat.com>
Signed-off-by: Olatunji Ruwase <olruwase@microsoft.com>
Signed-off-by: Logan Adams <loadams@microsoft.com>
Signed-off-by: inkcherry <mingzhi.liu@intel.com>
Signed-off-by: Omar Elayan <oelayan@habana.ai>
Signed-off-by: Roman Fitzjalen <romaactor@gmail.com>
Signed-off-by: Hongwei <hongweichen@microsoft.com>
Signed-off-by: shaomin <wukon1992@gmail.com>
Signed-off-by: Stas Bekman <stas@stason.org>
Signed-off-by: siqi <siqi@tecorigin.com>
Signed-off-by: Wei Wu <wuwei211x@gmail.com>
Signed-off-by: ShellyNR <shelly.nahir@live.biu.ac.il>
Signed-off-by: Lai, Yejing <yejing.lai@intel.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Fabien Dupont <fabiendupont@fabiendupont.fr>
Co-authored-by: Liangliang Ma <1906710196@qq.com>
Co-authored-by: inkcherry <mingzhi.liu@intel.com>
Co-authored-by: Omar Elayan <142979319+oelayan7@users.noreply.github.com>
Co-authored-by: Stas Bekman <stas00@users.noreply.github.com>
Co-authored-by: Roman Fitzjalen <romaactor@gmail.com>
Co-authored-by: Ramya Ramineni <62723901+rraminen@users.noreply.github.com>
Co-authored-by: Guanhua Wang <alexwgh333@gmail.com>
Co-authored-by: root <root@ftqtmec25000000.taxzvufipdhelhupulxcbvr15f.ux.internal.cloudapp.net>
Co-authored-by: Hongwei Chen <33092912+hwchen2017@users.noreply.github.com>
Co-authored-by: Joe Mayer <114769929+jomayeri@users.noreply.github.com>
Co-authored-by: wukong1992 <wukong1992@users.noreply.github.com>
Co-authored-by: shaomin <wukon1992@gmail.com>
Co-authored-by: loadams <loadams@users.noreply.github.com>
Co-authored-by: siqi654321 <siqi202311@163.com>
Co-authored-by: siqi <siqi@tecorigin.com>
Co-authored-by: Wei Wu <45323446+U-rara@users.noreply.github.com>
Co-authored-by: Shelly Nahir <73890534+ShellyNR@users.noreply.github.com>
Co-authored-by: snahir <snahir@habana.ai>
Co-authored-by: Yejing-Lai <yejing.lai@intel.com>
Co-authored-by: Siddharth Singh <siddharth9820@gmail.com>
Co-authored-by: Olatunji Ruwase <tjruwase@gmail.com>
The project has been renamed at the last moment, so this PR is adapting
to that change.
There are no code changes in this PR, just docs.
---------
Signed-off-by: Stas Bekman <stas@stason.org>
This is the Deepspeed counterpart of
https://github.com/snowflakedb/ArcticTraining/pull/45 - as the new
feature(s) require changes on both sides.
For PR reviewers:
Readiness status:
- [x] Code
- [x] Tests
- [ ] Docs - working on it
Features:
- [x] add support for delaying grad addition via
`param.ds_grad_is_ready` flag (used when performing tiled compute in an
autograd function)
- [x] add light sp-only mpu version (Jeff Rasley)
- [x] improved debug
- [x] added `all_gather_object` to `dist`
- [x] `UlyssesSPAttentionHF` (port of UlyssesAttention from
Megatron-Deepspeed plus modern MHA-variations)
- [x] `UlyssesSPDataLoaderAdapter` - DL adapter to shard the normal DL
batches to be used by `UlyssesSPAttentionHF`
- [x] `SequenceTiledCompute` - generic autograd function to perform
compute after tiling on the sequence dimension
- [x] `TiledMLP` - a specific autograd function to perform tiled MLP
(it's much easier to understand before trying to grok
`SequenceTiledCompute`)
- [x] added a differentiable `_DimZeroAllToAll` (Samyam Rajbhandari)
- [x] torch-dist-check now allows `torch.distributed.nn` (which is
needed since deepspeed's dist is not up to date with
`torch.distributed.nn`)
---------
Signed-off-by: Stas Bekman <stas.bekman@snowflake.com>
Signed-off-by: Stas Bekman <stas@stason.org>
Co-authored-by: Stas Bekman <stas.bekman@snowflake.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
Co-authored-by: Olatunji Ruwase <tunji.ruwase@snowflake.com>
This is a follow-up to https://github.com/deepspeedai/DeepSpeed/pull/923.
My original code was a copy from transformers, which has a different fs
layout, and I missed that, so this PR fixes it to actually do the
right thing.
Now you can have multiple clones of DeepSpeed, and the tests will
automatically use the local repo rather than the pre-installed DeepSpeed.
These days fp16 is barely ever used, so we should be testing bf16
instead of fp16 where possible.
I had to fix a bunch of tests to adapt to this change, and fixed a few
bugs along the way.
---------
Signed-off-by: Stas Bekman <stas.bekman@snowflake.com>
Co-authored-by: Olatunji Ruwase <tunji.ruwase@snowflake.com>
Co-authored-by: Stas Bekman <stas.bekman@snowflake.com>
## Description
This PR fixes an issue where gradient clipping modifications are not
reflected in the global gradient norm calculation when CPU offloading is
enabled. The issue occurs because the `averaged_gradients` are not being
updated with the clipped gradients when CPU offloading is active.
## Problem
When using CPU offloading with gradient clipping:
1. The gradients are successfully clipped using `safe_set_local_grad`
2. However, the `_global_grad_norm` calculation still uses the original
unclipped gradients.
3. This leads to incorrect gradient norm reporting and potential issues
with gradient clipping effectiveness
## Solution
The fix ensures that the `averaged_gradients` are properly updated with
the clipped gradients when CPU offloading is enabled, similar to how it
works when CPU offloading is disabled.
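A sketch of the clipping pattern this fix targets (assumes `engine` was already built via `deepspeed.initialize` with ZeRO and CPU offload; `safe_get_local_grad`/`safe_set_local_grad` are the accessors referenced above):
```python
from deepspeed.utils import safe_get_local_grad, safe_set_local_grad

def clip_local_grads(engine, clip_value: float = 1.0):
    # Clip each locally-held grad partition and write it back; with this fix the
    # clipped values also feed the _global_grad_norm computation under CPU offload.
    for param in engine.module.parameters():
        grad = safe_get_local_grad(param)
        if grad is not None:
            safe_set_local_grad(param, grad.clamp(-clip_value, clip_value))
```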
## Testing
The fix has been tested with:
- CPU offloading enabled and disabled
- Different gradient clipping values
- A simple model with linear layers
- Both FP16 and BF16
## Related Issues
Fixes #7292
---------
Signed-off-by: Naveenraj Kamalakannan <therealnaveenkamal@gmail.com>
Some params are one-dimensional; this PR adds support for these params.
Resolves #7249
```log
param.shape torch.Size([768, 1536])
param.shape torch.Size([768])
...
```
```log
with deepspeed.module_inject.layers.GatherReplacedLayerParams([param], model, enabled=True):
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "deepspeed/module_inject/layers.py", line 359, in __enter__
self.params[0].gather_params(self.params)
File "torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "deepspeed/module_inject/layers.py", line 473, in gather_params
param.shape[1],
~~~~~~~~~~~^^^
IndexError: tuple index out of range
```
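The fix boils down to guarding the 1-D case before indexing `param.shape[1]`; a hypothetical sketch of that guard (not the actual `gather_params` code):
```python
import torch

def gathered_shape(param: torch.Tensor, world_size: int) -> tuple:
    # Biases / LayerNorm weights are 1-D and have no param.shape[1], which is
    # what raised the IndexError above; branch on ndim before indexing.
    if param.dim() == 1:
        return (param.shape[0],)  # 1-D params: no second dim to touch
    # Partition dimension chosen arbitrarily here, purely for illustration.
    return (param.shape[0], param.shape[1] * world_size)
```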
---------
Signed-off-by: Hollow Man <hollowman@opensuse.org>
Signed-off-by: inkcherry <mingzhi.liu@intel.com>
Co-authored-by: Hongwei Chen <33092912+hwchen2017@users.noreply.github.com>
Co-authored-by: inkcherry <mingzhi.liu@intel.com>
This PR introduces *DeepCompile*, a new feature that efficiently
integrates compiler optimizations with other DeepSpeed features.
DeepCompile utilizes torch's dynamo to capture the computation graph and
modifies it to incorporate DeepSpeed’s optimizations seamlessly.
Currently, DeepCompile supports ZeRO-1 and ZeRO-3, with enhancements
such as proactive prefetching and selective unsharding to improve
performance.
(More details will be added later.)
---------
Signed-off-by: Masahiro Tanaka <mtanaka@microsoft.com>
Signed-off-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: zafarsadiq <zafarsadiq120@gmail.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
ZeRO3 requires explicit cleanup in tests when reusing the environment.
This PR adds `destroy` calls to the tests to free memory and avoid
potential errors due to memory leaks.
Signed-off-by: Masahiro Tanaka <mtanaka@microsoft.com>
Support training multiple models, such as in
[HF](https://huggingface.co/docs/accelerate/en/usage_guides/deepspeed_multiple_model)
Here is an update on supporting multiple DS engines with a single
loss.backward(). The main message is that I think we can support this.
First, some context. The backward pass in ZeRO is complicated because the
optimizations/features require special handling of gradients, such as:
1. Gradient partitioning
2. Overlapping backward and reduction
3. Upcasting for fp32 grad accumulation
So, we created engine.backward(loss) as a wrapper function to give us
fine-grained control over backward, as below:
```python
def backward(loss):
backward_prologue() # setup logic for special gradient handling
loss.backward()
backward_epilogue() # cleanup/teardown logic
```
As demonstrated by @muellerzr, this approach breaks down when the loss
originates from multiple DS engines. Our proposed solution is to use
backward hooks on the module to launch backward_prologue() and
backward_epilogue() (see the sketch below). Specifically,
1. A backward pre-hook on engine.module launches backward_prologue()
before any module gradient is created.
2. A backward post-hook on engine.module launches backward_epilogue()
after all module gradients are created.
We plan for this solution to preserve BC, i.e., engine.backward() will
remain correct for single engine scenarios.
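A sketch of that hook-based plan using standard PyTorch module hooks (`engine._backward_prologue()` is the temporary helper mentioned below; the epilogue call is an assumed counterpart, not a final API):
```python
import torch

def install_multi_engine_backward_hooks(engine):
    # Fires before any gradient of engine.module is produced.
    def _pre(module, grad_output):
        engine._backward_prologue()

    # Fires once grads w.r.t. the module's inputs exist, i.e. roughly when the
    # module's backward pass has finished.
    def _post(module, grad_input, grad_output):
        engine._backward_epilogue()  # assumed counterpart to the prologue

    engine.module.register_full_backward_pre_hook(_pre)
    engine.module.register_full_backward_hook(_post)
```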
The current status is that (1) is completed, while (2) is in progress.
To unblock e2e testing for multi-engine scenarios, since there are
probably other issues, we have temporarily added
engine._backward_prologue(). You can try this out via the following
artifacts.
1. Simple multi-engine test code:
https://gist.github.com/tjruwase/f1adccf087b8fa269ffce2ab91c4f1c6#file-multi_engine-py
2. DS branch:
https://github.com/microsoft/DeepSpeed/tree/olruwase/zero_multi_models
---------
Signed-off-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Stas Bekman <stas00@users.noreply.github.com>