This reverts commit 047a7599d24622dfb37fa5e5a32c671b1bb44233.
Unfortunately, the above required substantial redesign of existing HPU
stack, which is currently not feasible, so reverting.
Signed-off-by: Max Kovalenko <mkovalenko@habana.ai>
Co-authored-by: Olatunji Ruwase <tunji.ruwase@snowflake.com>
This PR improves error logging and relaxes loss value checks in the
autocast test.
Previously, the test displayed error messages and mismatched loss values
on all ranks, even if the mismatch only occurred on some ranks. This was
confusing, since logs from other ranks could appear correct. This PR
changes the behavior so that error messages are shown only on the ranks
where the mismatch occurs.
Additionally, this PR skips loss value validation for
`test_lower_precision_model`, where we intentionally use a different
communication dtype from the baseline (standard PyTorch autocast).
---------
Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
Co-authored-by: Stas Bekman <stas00@users.noreply.github.com>
command: python3 -c 'import
deepspeed;deepspeed.ops.adam.cpu_adam.CPUAdamBuilder().load()'
when running on the rocm platform, it encounter an error:
Traceback (most recent call last):
File "<string>", line 1, in <module>
File
"/opt/conda/envs/py_3.10/lib/python3.10/site-packages/deepspeed/ops/op_builder/builder.py",
line 538, in load
return self.jit_load(verbose)
File
"/opt/conda/envs/py_3.10/lib/python3.10/site-packages/deepspeed/ops/op_builder/builder.py",
line 570, in jit_load
cxx_args = self.strip_empty_entries(self.cxx_args())
File
"/opt/conda/envs/py_3.10/lib/python3.10/site-packages/deepspeed/ops/op_builder/builder.py",
line 401, in strip_empty_entries
return [x for x in args if len(x) > 0]
File
"/opt/conda/envs/py_3.10/lib/python3.10/site-packages/deepspeed/ops/op_builder/builder.py",
line 401, in <listcomp>
return [x for x in args if len(x) > 0]
TypeError: object of type 'NoneType' has no len()
Compare with version 0.16.5:
https://github.com/deepspeedai/DeepSpeed/blob/v0.16.5/op_builder/builder.py#L435
The current version of code is missing a return when
self.is_rocm_pytorch() is True. Just add return '-D__DISABLE_CUDA__' is
ok!
---------
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Olatunji Ruwase <tunji.ruwase@snowflake.com>
This PR address the following ZenFlow optimizer core binding issue.
https://github.com/deepspeedai/DeepSpeed/issues/7478
With this PR, ZenFlow optimizer worker would derive its core binding
from deepspeed core binding mechanism. The algorithm is as following:
1. Each DeepSpeed rank get its core binding by using DeepSpeed command
line `--bind_cores_to_rank`, this command would assign each CPU physical
cores to different workers
2. When spawing ZenFlow optimizer worker, DeepSpeed would split current
CPU affinity list into two sublist: pt_affinity and zf_affinity
3. zf_affinity would be used to set affinity of ZenFlow optimizer
worker. pt_affinity would be used to set current pytorch process.
4. By default, one cores is reserved by each pytorch process, the rest
is used by ZenFlow optimizer worker. The number of cores reserved for
pytorch process can be changed by ZenFlow config variable:
`pt_reserved_cores`
---------
Signed-off-by: Guokai Ma <guokai.ma@gmail.com>
Signed-off-by: Ma, Guokai <guokai.ma@intel.com>
Signed-off-by: aeeeeeep <aeeeeeep@proton.me>
Signed-off-by: Olatunji Ruwase <tunji.ruwase@snowflake.com>
Co-authored-by: Stas Bekman <stas00@users.noreply.github.com>
Co-authored-by: aeeeeeep <aeeeeeep@proton.me>
Co-authored-by: Zhipeng Wang <zhipeng.rainbowserie@gmail.com>
Co-authored-by: Zhipeng Wang <zwanga@wustl.edu>
Co-authored-by: Peng Du <pedu@linkedin.com>
Co-authored-by: pengdurice <pengduhit@gmail.com>
Co-authored-by: Zhipeng Wang <zhipengbayern@gmail.com>
Co-authored-by: Olatunji Ruwase <tunji.ruwase@snowflake.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
This PR relaxes two restrictions on torch.autocast in the DeepSpeed
engine:
1) Nesting torch.autocast
Currently, we do not expect `torch.autocast` to be used outside the
DeepSpeed engine. Here is the current behavior:
- If `torch.autocast` is enabled in the DeepSpeed config and the engine
detects it is also enabled outside, a warning is displayed.
- If it is disabled in the config, the engine raises an error.
This design prevents the following usage:
```python
with torch.autocast(...):
logits = deepspeed_model(...)
loss = criteria_fn(logits)
```
In this case, we also want to apply autocast to `criteria_fn`. With the
current behavior, we would need move `deepspeed_model(...)` outside the
`torch.autocast` context, leading to inconsistent code between DeepSpeed
and non-DeepSpeed setups. (cannot be handled with `enabled` arg of
`torch.autocast`)
Change in this PR:
`torch.autocast` outside the DeepSpeed engine is ignored, and
- If `torch_autocast` is enabled in the config, DeepSpeed will follow
that setting.
- If it is disabled, DeepSpeed falls back to its own mixed-precision
support (or FP32).
In these cases, DeepSpeed engine shows a message to explain the
behavior.
2) Model’s dtype
Previously, DeepSpeed assumed the model’s dtype must be FP32 when
`torch.autocast` was enabled. However, models with lower-precision
parameters (e.g., BF16) can also be used with autocast. For example, if
both the model and `torch.autocast` use BF16, autocast will upcast
precision-sensitive ops as needed.
Change in this PR:
Removed the assertion that restricted the model’s dtype to FP32.
This PR also adds and updates tests to cover these new behaviors.
---------
Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
`_disable_dynamo_if_unsupported` fallback wasn't getting created under
certain conditions. This PR is fixing this. Also removed debug print.
Fixes issue installing deepspeed on torch 2.4 and 2.1 that triggered
this:
```
#42 15.84 Traceback (most recent call last):
#42 15.84 File "<string>", line 2, in <module>
#42 15.84 File "<pip-setuptools-caller>", line 34, in <module>
#42 15.84 File "/tmp/pip-install-qgzd6ybt/deepspeed_b3b4858a062d49c7b8d6ef31332a96cf/setup.py", line 40, in <module>
#42 15.84 from op_builder import get_default_compute_capabilities, OpBuilder
#42 15.84 File "/tmp/pip-install-qgzd6ybt/deepspeed_b3b4858a062d49c7b8d6ef31332a96cf/op_builder/__init__.py", line 18, in <module>
#42 15.84 import deepspeed.ops.op_builder # noqa: F401 # type: ignore
#42 15.84 File "/tmp/pip-install-qgzd6ybt/deepspeed_b3b4858a062d49c7b8d6ef31332a96cf/deepspeed/__init__.py", line 25, in <module>
#42 15.84 from . import ops
#42 15.84 File "/tmp/pip-install-qgzd6ybt/deepspeed_b3b4858a062d49c7b8d6ef31332a96cf/deepspeed/ops/__init__.py", line 6, in <module>
#42 15.84 from . import adam
#42 15.84 File "/tmp/pip-install-qgzd6ybt/deepspeed_b3b4858a062d49c7b8d6ef31332a96cf/deepspeed/ops/adam/__init__.py", line 9, in <module>
#42 15.84 from .zenflow_torch_adam import ZenFlowSelectiveAdamW
#42 15.84 File "/tmp/pip-install-qgzd6ybt/deepspeed_b3b4858a062d49c7b8d6ef31332a96cf/deepspeed/ops/adam/zenflow_torch_adam.py", line 685, in <module>
#42 15.84 @_disable_dynamo_if_unsupported(single_tensor_fn=_single_tensor_adamw)
#42 15.84 NameError: name '_disable_dynamo_if_unsupported' is not defined
#42 15.84 [WARNING] ZenFlow disabled: torch internal optimizer symbols could not be imported: cannot import name '_disable_dynamo_if_unsupported' from 'torch.optim.optimizer' (/usr/local/lib/python3.10/dist-packages/torch/optim/optimizer.py)
```
---------
Signed-off-by: Stas Bekman <stas@stason.org>
MoE tutorial fixes:
1. cifar example has been moved - fix the url
2. fixing text and improving markup
---------
Signed-off-by: Stas Bekman <stas@stason.org>
This PR removes some and enables removing other startup noise -
especially when it's replicated rank-times and doesn't carry any
informative payload.
1. add `--log_level` flag which sets the launcher's logger to a desired
setting - defaulting to `logging.INFO` for now for BC, but will change
to `logging.WARNING` in v1
2. add `--quiet/-q` flag which sets the launcher's logger to
`logging.ERROR` which essentially disables startup info messages
3. change the logging defaults elsewhere to `logging.WARNING` (main
impact is the accelerator.py), once deepspeed started the frameworks
control its loglevel for each rank, so the tricky part is this pre-start
stage info logs. this part is breaking BC as there is no machinery to
set the logger level for `real_accelerator.py`)
4. builder is changed to non-verbose (BC breaking)
---------
Signed-off-by: Stas Bekman <stas@stason.org>
Co-authored-by: Olatunji Ruwase <tunji.ruwase@snowflake.com>
ZeRO3 tracks DDP (SPMD) behavior by matching values different training
states across ranks. Some of these states are represented as lists, and
mismatches sometimes manifests as hangs during error detection. This PR
improves error detection by first validating the list lengths across
ranks before validating the list contents.
Motivated by
https://github.com/deepspeedai/DeepSpeed/issues/7461#issuecomment-3235146207
---------
Signed-off-by: Olatunji Ruwase <tunji.ruwase@snowflake.com>
Co-authored-by: Stas Bekman <stas00@users.noreply.github.com>
This patch adds riscv64 support for the deepspeed_shm_comm
operator,enabling DeepSpeed to perform CPU training/inference on RISCV64
hosts, for research purposes. Based on the discussion in pull #7387 ,
this patch refactors some original code to support multiple CPU
architectures.
Related tests have passed on x86 and RISC-V CPU, and I successfully ran
Qwen2.5 on a RISC-V CPU,
```bash
(myenv) [root@openeuler-riscv64 DeepSpeed ]$ pytest tests/unit/comm/test_dist.py::TestDistInferenceAllReduce -vv
====================================================================== test session starts =======================================================================
platform linux -- Python 3.11.4, pytest-7.2.0, pluggy-1.6.0 -- /root/myenv/bin/python3
cachedir: .pytest_cache
hypothesis profile 'default'
rootdir: /root/ecosystem/DeepSpeed/tests, configfile: pytest.ini
plugins: mock-3.14.1, hypothesis-6.135.14, forked-1.6.0
collected 3 items
tests/unit/comm/test_dist.py::TestDistInferenceAllReduce::test[dtype0] PASSED [ 33%]
tests/unit/comm/test_dist.py::TestDistInferenceAllReduce::test[dtype1] PASSED [ 66%]
tests/unit/comm/test_dist.py::TestDistInferenceAllReduce::test[dtype2] PASSED [100%]
(myenv) root@ubuntu-2204:~/soft-working-dir/DeepSpeed# pytest tests/unit/comm/test_dist.py::TestDistInferenceAllReduce -vv
====================================================================== test session starts =======================================================================
platform linux -- Python 3.12.3, pytest-7.2.0, pluggy-1.6.0 -- /root/soft-working-dir/myenv/bin/python3
cachedir: .pytest_cache
rootdir: /root/soft-working-dir/DeepSpeed/tests, configfile: pytest.ini
plugins: forked-1.6.0
collected 3 items
tests/unit/comm/test_dist.py::TestDistInferenceAllReduce::test[dtype0] PASSED [ 33%]
tests/unit/comm/test_dist.py::TestDistInferenceAllReduce::test[dtype1] PASSED [ 66%]
tests/unit/comm/test_dist.py::TestDistInferenceAllReduce::test[dtype2] PASSED [100%]
```
---------
Signed-off-by: heyujiao99 <he.yujiao@sanechips.com.cn>
Co-authored-by: Olatunji Ruwase <tunji.ruwase@snowflake.com>
Co-authored-by: Ma, Guokai <guokai.ma@gmail.com>
Authorship: @pengdurice and @PKUWZP
Related Issue: #7438
# Introduction
[Muon](https://arxiv.org/abs/2502.16982), a new optimizer that has
attracted the community’s attention recently shows promising results in
training large language models. Adding the Muon Optimizer to DeepSpeed,
a popular OSS framework for large scale training and inference is
critically important for DeepSpeed users and developers. There has been
a [PR](https://github.com/deepspeedai/DeepSpeed/pull/7454) attempting
the adoption. (Huge Thanks to @qimcis), which is a good starting point.
It still requires more substantial effort to make it fully compatible
and work within DeepSpeed. We are publishing this PR to fully enable
Muon Optimizer capabilities for DeepSpeed.
# Issues and solutions
## Issues
1. With stage 1, 2 or 3, the optimizer states will be partitioned within
the same data parallel group. This means that each process is already
handling only parts of the model parameters and there is no need to use
the DP solution as in the
[code](https://github.com/KellerJordan/Muon/blob/master/muon.py#L195).
2. The parameters (and the gradients) will be flattened to 1D vector
before being used in the optimizer, thus nullifying the major hypothesis
of the muon optimizer: it works by orthogonalizing the updates for each
matrix (dim >=2)
## Solutions
To solve the issues, we propose this new PR in which:
1. We simplify the Muon code by
[removing](https://github.com/deepspeedai/DeepSpeed/compare/master...pengdurice:DeepSpeed:peng-add-muon-v1#diff-c9052994e41caee9ca88363749c10af08655f8019f08dc971c018663d25a3712R22)
the partitioning and muon updates logics.
2. We
[move](https://github.com/deepspeedai/DeepSpeed/compare/master...pengdurice:DeepSpeed:peng-add-muon-v1#diff-99dcf26ea2876ff5bbf05b5165c4133eaa0d0f36b170685643c2f7e2eb566addR1867)
the muon update to the
[get_flat_partition](https://github.com/deepspeedai/DeepSpeed/compare/master...pengdurice:DeepSpeed:peng-add-muon-v1#diff-99dcf26ea2876ff5bbf05b5165c4133eaa0d0f36b170685643c2f7e2eb566addR1848)
function of stage 1 and 2 DeepSpeedZeroOptimizer in which per parameter
gradients are collected before being flattened and used by the optimizer
to update the model parameters. Since each parameter is still in its
original shape, we can easily apply the muon updates.
3. We also save the momentum buffer into the optimizer’ state so that we
have a smooth convergence after applying the saved checkpoints.
4. We added comprehensive unit tests to validate Muon Optimizer's
correctness and functionality.
# Future directions and roadmap
In the future, several follow up works are of interests:
- [ ] Create a CPU offload version.
- [ ] Apply Muon to Stage 3
- [ ] Use the highly optimized version of Adam for the Adam part of
MuonWithAuxAdam optimizer.
- [ ] More efficient implementations e.g. a) add specialized kernels for
Newton-Schulz iteration and muon updates; b) parallelize updates for the
parameters (currently, each parameter is updated separately and
sequentially)
---------
Co-authored-by: Peng Du <pedu@linkedin.com>
Co-authored-by: pengdurice <pengduhit@gmail.com>
Co-authored-by: Zhipeng Wang <zhipengbayern@gmail.com>
Co-authored-by: Olatunji Ruwase <tunji.ruwase@snowflake.com>
Fix assert `'pp_int' object has no attribute 'custom_print_str'` when
tracking deepspeed module with some track debug tools like
[objwatch](https://github.com/aeeeeeep/objwatch)
```python3
import objwatch
objwatch.watch(targets=[deepspeed], framework="torch.distributed", indexes=[0,], with_locals=True)
```
Signed-off-by: aeeeeeep <aeeeeeep@proton.me>
For some accelerators (such as HPU) running in a non-compile scenarios,
the `compiler.enable` decorator can cause significant performance drops
up to 8-12%.
We can easily avoid the performance hit in non-compile scenarios, by
detecting the ongoing compilation and returning immediately.
Signed-off-by: Max Kovalenko <mkovalenko@habana.ai>
Co-authored-by: Olatunji Ruwase <tunji.ruwase@snowflake.com>
This PR updates the kernel generation function arguments in Inductor to
ensure DeepCompile is compatible with PyTorch v2.8.
It also fixes the logging output of DeepCompile.
This PR adds ZenFlow, a importance-aware offloaded training framework
for DeepSpeed ZeRO. ZenFlow enables multi-step overlap between
computation and communication during offloaded training, improving GPU
utilization and reducing stalls.
Highlights:
- New ZenFlow optimizers (ZenFlowCPUAdam, ZenFlowSelectiveAdamW)
- ZenFlowZeroOptimizer for ZeRO Stage 1/2 integration
- Configurable via ZenFlowConfig, integrated with DeepSpeedZeroConfig
- Unit tests and documentation included
Note: This PR focuses on Stage 1 and 2 integration. Stage 3 support will
be introduced in a follow-up PR.
---------
Signed-off-by: Tingfeng Lan <erc8gx@virginia.edu>
Signed-off-by: Yusen Wu <xrn4ub@virginia.edu>
Signed-off-by: Olatunji Ruwase <tunji.ruwase@snowflake.com>
Co-authored-by: Yusen Wu <xrn4ub@virginia.edu>
Co-authored-by: Masahiro Tanaka <81312776+tohtana@users.noreply.github.com>
Co-authored-by: Olatunji Ruwase <tunji.ruwase@snowflake.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Olatunji Ruwase <tjruwase@gmail.com>
Co-authored-by: Guokai Ma <guokai.ma@gmail.com>
currently passing `deepspeed ... --venv_script foo.sh` ends up with a
pdsh cmd like:
```
pdsh -S -f 1024 -w 10.4.11.15,10.4.10.1 source foo.sh export NCCL_NET_PLUGIN=blah; ...
```
you can see, `;` is missing before exports start, so the first export is
ignored.
It should be:
```
pdsh -S -f 1024 -w 10.4.11.15,10.4.10.1 source foo.sh; export NCCL_NET_PLUGIN=blah; ...
```
This PR is fixing it.
In ZeRO offload, significant time is spent on CPUAdam, which is CPU
code. Thus use `--bind_cores_to_rank` in deepspeed launch command would
help improve the performance of ZeRO offload. This PR add this command
to ZeRO offload tutorial to increase user awareness.
For Qwen2.5-3B finetuning on 2 A100-40B cards, running on CPU host with
128 CPU cores, the average step time is as follow, near 1.3x performance
improvement:
without `--bind_cores_to_rank`: 3084.44ms per step
with `--bind_cores_to_rank`: 2383.16ms per step
---------
Co-authored-by: Olatunji Ruwase <tjruwase@gmail.com>
The following assertion error arises when torch autocast is enabled.
[rank3]: File
"/opt/deepspeed/deepspeed/runtime/zero/partitioned_param_coordinator.py",
line 337, in fetch_sub_module
[rank3]:
self.__inflight_param_registry.pop(param).wait(handle_dependency=not
fast_fetch)
[rank3]: File
"/opt/deepspeed/deepspeed/runtime/zero/partition_parameters.py", line
787, in wait
[rank3]: handle.wait(handle_dependency)
[rank3]: File "/opt/deepspeed/deepspeed/utils/nvtx.py", line 20, in
wrapped_fn
[rank3]: ret_val = func(*args, **kwargs)
[rank3]: File
"/opt/deepspeed/deepspeed/runtime/zero/partition_parameters.py", line
750, in wait
[rank3]: assert param.ds_status == ZeroParamStatus.INFLIGHT, f"expected
param {param.ds_summary()} to be inflight"
[rank3]: AssertionError: expected param {'id': 685, 'status':
'AVAILABLE', 'numel': 131334144, 'ds_numel': 131334144, 'shape': (32064,
4096), 'ds_shape': (32064, 4096), 'requires_grad': True, 'grad_shape':
None, 'persist': False, 'active_sub_modules': set(), 'ds_tensor.shape':
torch.Size([16416768])} to be inflight
This is due to multiple all-gather ops in the same coalesced all-gather
sharing the same list of params (of mixed dtypes).
Make each all-gather exchange only params of a certain dtype. Also pass
the allgather dtype that matches the params.
Signed-off-by: Junjie Mao <banxing.mjj@alibaba-inc.com>
Co-authored-by: Junjie Mao <banxing.mjj@alibaba-inc.com>
Co-authored-by: Olatunji Ruwase <tjruwase@gmail.com>
This commit combines fixes for 37 potential code issues found in
Coverity scans.
the issues include but are not limited to potential access to
uninitialized variables, dead and redundant code.
We understand that reviewing such a commit can be difficult and will be
happy to help with any questions or changes required.
---------
Signed-off-by: Nir Sonnenschein <nsonnenschein@habana.ai>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Olatunji Ruwase <tunji.ruwase@snowflake.com>
Thanks again for giving opportunity for improving this Community!
This PR is from Issue #7423.
1) Motivation
To improve compatibility with low-level profiling tools (e.g., NVIDIA
CUPTI or DCGM), it can be useful to expose parallelism-specific rank
(tensor/pipeline/data) at the engine level.
2) Changes
I Added three getter methods to DeepSpeedEngine:
- get_tensor_parallel_rank()
- get_pipeline_parallel_rank()
- get_data_parallel_rank()
Thank you for reviewing this contribution!
---------
Signed-off-by: WoosungMyung <dntjd517@naver.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
This PR adds `TiledFusedLogitsLoss` for an efficient fused logits+loss
computation - this version pre-calculates grads in `forward`, avoiding
recomputation in the backward (similar to the Liger-Kernel
implementation).
---------
Signed-off-by: Stas Bekman <stas@stason.org>
Co-authored-by: Aurick Qiao <aurick.qiao@snowflake.com>