Commit Graph

2923 Commits

Author SHA1 Message Date
3870f5c7e0 clarify
Signed-off-by: Stas Bekman <stas@stason.org>
2025-09-22 16:27:57 +00:00
6d418802a7 logging: Also set log level of logger handlers
After #7526, the default logger passes logs to a StreamHandler, which has
its own log level. Changing the log level of the logger alone has no
effect in that case.

Update the log level of all handlers when changing the parent logger's.
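
A minimal sketch of the idea using only the standard `logging` module. The helper name `set_level_with_handlers` and the logger name are hypothetical; this is not the function changed by this commit.

```python
import logging


def set_level_with_handlers(logger: logging.Logger, level: int) -> None:
    """Set the level on the logger and on every handler attached to it."""
    logger.setLevel(level)
    for handler in logger.handlers:
        handler.setLevel(level)


# Illustrative setup: a handler whose own level is stricter than the logger's.
demo_logger = logging.getLogger("handler_level_demo")
demo_handler = logging.StreamHandler()
demo_handler.setLevel(logging.WARNING)
demo_logger.addHandler(demo_handler)

# Without updating the handler, DEBUG records would still be dropped by it.
set_level_with_handlers(demo_logger, logging.DEBUG)
```

A record is emitted only if it passes both the logger's level and the handler's level, which is why updating the logger alone is not enough.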

Signed-off-by: Junjie Mao <junjie.mao@linux.alibaba.com>
2025-09-22 12:16:31 +08:00
80033a8293 Update version.txt post 0.17.6 release (#7572) 2025-09-19 14:33:22 -07:00
e4f6da9685 [bugfix] fix partition context unpatch (#7566)
## Fix asymmetric patching/unpatching in
InsertPostInitMethodToModuleSubClasses

### Problem Description

The `InsertPostInitMethodToModuleSubClasses` context manager patches
`__init__` methods of model classes during entry and unpatches them
during exit.

However, asymmetric condition checks between patching and unpatching can
introduce subtle inheritance bugs.

### Root Cause Analysis

The issue occurs with classes that have multiple inheritance where:
1. **Child class A** does not override `__init__`
2. **Parent class B** does not inherit from `nn.Module`
3. **Parent class C** inherits from `nn.Module`

**Current asymmetric logic:**
```python
# Patching (entry): Only patch classes with explicit __init__
def _enable_class(cls):
    if '__init__' in cls.__dict__:  #  Strict check
        cls._old_init = cls.__init__
        cls.__init__ = partition_after(cls.__init__)

# Unpatching (exit): Restore any class with _old_init
def _disable_class(cls):
    if hasattr(cls, '_old_init'):  #  Permissive check
        cls.__init__ = cls._old_init
```

**Execution flow:**
1. **During entry**: Child A is skipped (no explicit `__init__`), Parent
C is patched
2. **During exit**: Child A inherits `_old_init` from Parent C and gets
incorrectly "restored"

**Result**: Child A's `__init__` points to Parent C's original
`__init__`, bypassing Parent B and breaking the inheritance chain.

### Reproduction Case

This pattern is common in Hugging Face models:
```python
class Qwen3ForSequenceClassification(GenericForSequenceClassification, Qwen3PreTrainedModel):
    pass  # No explicit __init__

# GenericForSequenceClassification - not a nn.Module subclass
# Qwen3PreTrainedModel - inherits from nn.Module
```

### Solution

Apply symmetric condition checking in both patch and unpatch operations:

```python
def _disable_class(cls):
    # Match the patching condition: only restore classes we explicitly patched
    if '__init__' in cls.__dict__ and hasattr(cls, '_old_init'):
        cls.__init__ = cls._old_init
        delattr(cls, '_old_init')  # Optional cleanup
```

This ensures that only classes that were explicitly patched during entry
get restored during exit.
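
The failure mode can be reproduced without PyTorch. The sketch below is an assumption-laden stand-in: `Base` plays the role of `nn.Module`, `wrap` plays the role of `partition_after`, and all class and function names are hypothetical.

```python
import functools


def make_classes():
    """Build a fresh hierarchy for each experiment; Base stands in for nn.Module."""

    class Base:
        def __init__(self):
            self.calls = ["Base"]

    class Mixin:  # parent B: does not inherit from Base
        def __init__(self):
            super().__init__()
            self.calls.append("Mixin")

    class Model(Base):  # parent C: inherits from Base, defines __init__
        def __init__(self):
            super().__init__()
            self.calls.append("Model")

    class Child(Mixin, Model):  # child A: no explicit __init__
        pass

    return Model, Child


def wrap(init):
    @functools.wraps(init)
    def wrapper(self, *args, **kwargs):
        init(self, *args, **kwargs)  # a real hook would do extra work here
    return wrapper


def enable_class(cls):
    if "__init__" in cls.__dict__:  # strict check
        cls._old_init = cls.__init__
        cls.__init__ = wrap(cls.__init__)


def disable_buggy(cls):
    if hasattr(cls, "_old_init"):  # permissive check, also true via inheritance
        cls.__init__ = cls._old_init


def disable_fixed(cls):
    if "__init__" in cls.__dict__ and hasattr(cls, "_old_init"):
        cls.__init__ = cls._old_init
        delattr(cls, "_old_init")


def init_chain_after_cycle(disable):
    """Patch and unpatch both classes, then report which __init__s ran for Child."""
    model_cls, child_cls = make_classes()
    for cls in (model_cls, child_cls):
        enable_class(cls)
    for cls in (model_cls, child_cls):
        disable(cls)
    return child_cls().calls


print(init_chain_after_cycle(disable_buggy))  # ['Base', 'Model']
print(init_chain_after_cycle(disable_fixed))  # ['Base', 'Model', 'Mixin']
```

With the permissive check, `Child` ends up with `Model`'s original `__init__` assigned directly to it, so `Mixin.__init__` is never called.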

### Testing

The fix has been validated against the Qwen3ForSequenceClassification
reproduction case and resolves the inheritance chain corruption.

### Related Issues
- External issue: https://github.com/modelscope/ms-swift/pull/5820

Co-authored-by: Masahiro Tanaka <81312776+tohtana@users.noreply.github.com>
v0.17.6
2025-09-19 07:24:33 +00:00
6b731c5c96 scripts: Check .is_cuda only in non-C++ files (#7561)
Today, check-torchcuda.py searches the whole repository for occurrences
of .is_cuda even when a commit only modifies C++ headers and sources,
which I believe is not intended.

Check usage of .is_cuda only when a commit modifies any non-C++ file.
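
A rough sketch of the decision described above, not the actual script. The function name `needs_is_cuda_check` and the suffix set are assumptions made for illustration.

```python
from pathlib import PurePath
from typing import Iterable

# Assumed set of C++/CUDA suffixes; the real script may use a different list.
CPP_SUFFIXES = {".h", ".hpp", ".c", ".cc", ".cpp", ".cu", ".cuh"}


def needs_is_cuda_check(changed_files: Iterable[str]) -> bool:
    """Return True if at least one changed file is not a C++ source or header."""
    return any(PurePath(name).suffix not in CPP_SUFFIXES for name in changed_files)
```

For a commit that touches only `csrc/a.cpp` and `csrc/a.h`, this returns `False`, so the `.is_cuda` search would be skipped.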

Signed-off-by: Junjie Mao <junjie.mao@linux.alibaba.com>
2025-09-19 05:01:50 +00:00
2585881ae9 Make Muon optimizer easier to enable (#7555)
The original Muon optimizer PR
(https://github.com/deepspeedai/DeepSpeed/pull/7509) requires users to
explicitly set `use_muon` flags in `model.parameters()`, as shown in
this test:
https://github.com/deepspeedai/DeepSpeed/blob/master/tests/unit/ops/muon/test_muon.py#L27

This PR integrates the setting of `use_muon` into DeepSpeed before engine
initialization, which makes the Muon optimizer easier to use. Users only
need to change the optimizer in `config.json` from `AdamW` to `Muon`; no
code changes are needed. This resolves the following issue:
https://github.com/deepspeedai/DeepSpeed/issues/7552
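
As a sketch of what the config change amounts to, assuming the usual DeepSpeed `optimizer` section with `type` and `params` keys. The `lr` value and the helper `switch_optimizer` are placeholders; whether Muon accepts exactly the same `params` as AdamW is not stated in the PR text.

```python
import json


def switch_optimizer(config: dict, new_type: str) -> dict:
    """Return a copy of the config with only the optimizer type changed."""
    updated = json.loads(json.dumps(config))  # cheap deep copy for plain JSON data
    updated["optimizer"]["type"] = new_type
    return updated


# Placeholder config; values are illustrative only.
adamw_config = {
    "train_batch_size": 8,
    "optimizer": {"type": "AdamW", "params": {"lr": 1e-3}},
}
muon_config = switch_optimizer(adamw_config, "Muon")
print(json.dumps(muon_config, indent=2))
```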

---------

Signed-off-by: Ma, Guokai <guokai.ma@intel.com>
Co-authored-by: Olatunji Ruwase <tunji.ruwase@snowflake.com>
Co-authored-by: Masahiro Tanaka <81312776+tohtana@users.noreply.github.com>
2025-09-17 09:52:11 -04:00
aa539c6dd5 fix npu device_id AttributeError issue (#7560)
## Environment
```
torch        2.7.1
torch_npu    2.7.1rc1
deepspeed    0.17.3
```
## Issue
An `AttributeError` is raised when calling `init_process_group` on an NPU
device since deepspeed v0.17.3.
The issue is similar to
https://github.com/deepspeedai/DeepSpeed/pull/7488.

Trace:
```
Traceback (most recent call last):
  File "/home/welsper/.local/lib/python3.10/site-packages/swift/cli/sft.py", line 10, in <module>
    sft_main()
  File "/home/welsper/.local/lib/python3.10/site-packages/swift/llm/train/sft.py", line 331, in sft_main
    return SwiftSft(args).main()
  File "/home/welsper/.local/lib/python3.10/site-packages/swift/llm/train/sft.py", line 27, in __init__
    super().__init__(args)
  File "/home/welsper/.local/lib/python3.10/site-packages/swift/llm/base.py", line 19, in __init__
    self.args = self._parse_args(args)
  File "/home/welsper/.local/lib/python3.10/site-packages/swift/llm/base.py", line 31, in _parse_args
    args, remaining_argv = parse_args(self.args_class, args)
  File "/home/welsper/.local/lib/python3.10/site-packages/swift/utils/utils.py", line 152, in parse_args
    args, remaining_args = parser.parse_args_into_dataclasses(argv, return_remaining_strings=True)
  File "/home/welsper/.local/lib/python3.10/site-packages/transformers/hf_argparser.py", line 358, in parse_args_into_dataclasses
    obj = dtype(**inputs)
  File "<string>", line 325, in __init__
  File "/home/welsper/.local/lib/python3.10/site-packages/swift/llm/argument/train_args.py", line 175, in __post_init__
    self.training_args = TrainerFactory.get_training_args(self)
  File "/home/welsper/.local/lib/python3.10/site-packages/swift/trainers/trainer_factory.py", line 70, in get_training_args
    return training_args_cls(**args_dict)
  File "<string>", line 167, in __init__
  File "/home/welsper/.local/lib/python3.10/site-packages/swift/trainers/arguments.py", line 152, in __post_init__
    super().__post_init__()
  File "/home/welsper/.local/lib/python3.10/site-packages/swift/trainers/arguments.py", line 133, in __post_init__
    super().__post_init__()
  File "/home/welsper/.local/lib/python3.10/site-packages/transformers/training_args.py", line 1803, in __post_init__
    self.device
  File "/home/welsper/.local/lib/python3.10/site-packages/transformers/training_args.py", line 2332, in device
    return self._setup_devices
  File "/home/welsper/.local/lib/python3.10/site-packages/transformers/utils/generic.py", line 74, in __get__
    cached = self.fget(obj)
  File "/home/welsper/.local/lib/python3.10/site-packages/transformers/training_args.py", line 2259, in _setup_devices
    self.distributed_state = PartialState(**accelerator_state_kwargs)
  File "/home/welsper/.local/lib/python3.10/site-packages/accelerate/state.py", line 216, in __init__
    dist.init_distributed(dist_backend=self.backend, auto_mpi_discovery=False, **kwargs)
  File "/home/welsper/.local/lib/python3.10/site-packages/deepspeed/comm/comm.py", line 854, in init_distributed
    cdb = TorchBackend(dist_backend, timeout, init_method, rank, world_size)
  File "/home/welsper/.local/lib/python3.10/site-packages/deepspeed/comm/torch.py", line 120, in __init__
    self.init_process_group(backend, timeout, init_method, rank, world_size)
  File "/home/welsper/.local/lib/python3.10/site-packages/deepspeed/comm/torch.py", line 163, in init_process_group
    torch.distributed.init_process_group(backend, **kwargs)
  File "/home/welsper/.local/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 81, in wrapper
    return func(*args, **kwargs)
  File "/home/welsper/.local/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 95, in wrapper
    func_return = func(*args, **kwargs)
  File "/home/welsper/.local/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1717, in init_process_group
    default_pg, _ = _new_process_group_helper(
  File "/home/welsper/.local/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1831, in _new_process_group_helper
    if device_id is not None and (device_id.index is None or device_id.type == "cpu"):
AttributeError: 'device' object has no attribute 'index'
```

## Fix
Switch `torch.npu.device(device_index)` to `torch.device('npu',
device_index)`.

Now:

d40a0f5de8/accelerator/npu_accelerator.py (L47-L48)

After fix:
```python
def device(self, device_index=None):
    return torch.device('npu', device_index)
```

Signed-off-by: welsper <welsper@qq.com>
Co-authored-by: welsper <xinyuyang@cmbchina.com>
Co-authored-by: Olatunji Ruwase <tunji.ruwase@snowflake.com>
Co-authored-by: Ma, Guokai <guokai.ma@gmail.com>
2025-09-17 15:46:33 +08:00
2d84be8159 deepcompile: Create a full list of no-copy ops (#7562)
The list of torch no-copy ops is hard-coded and does not include all
operations that may alias their inputs in their outputs.

Instead of using a fixed list, iterate over all ops under torch.ops.aten
and identify those with aliasing behavior by inspecting their schema.

With PyTorch 2.7.1, the default overloads of the ops identified by the
updated logic include:

  - _nested_view_from_buffer
  - _reshape_alias
  - alias
  - as_strided
  - conj
  - detach
  - diagonal
  - expand
  - imag
  - lift_fresh
  - narrow
  - permute
  - pin_memory
  - positive
  - real
  - reshape
  - squeeze
  - t
  - unfold
  - unsqueeze
  - view
  - view_as_complex
  - view_as_real
  - most operations whose name ends with an underscore
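
To illustrate what "aliasing behavior in the schema" means, here is a simplified sketch that works on schema strings rather than on the schema objects the commit inspects. The helper `returns_alias` is hypothetical, and the example schema strings are written from memory of the ATen schema notation, so treat them as illustrative.

```python
import re

# In ATen schema notation an alias annotation is attached to a type,
# e.g. "Tensor(a)" for a view or "Tensor(a!)" for an in-place write.
_ALIAS_ANNOTATION = re.compile(r"\([a-z]+!?\)")


def returns_alias(schema: str) -> bool:
    """Return True if any return type in the schema carries an alias annotation."""
    _, _, returns = schema.rpartition("->")
    return bool(_ALIAS_ANNOTATION.search(returns))


examples = [
    "aten::view(Tensor(a) self, SymInt[] size) -> Tensor(a)",
    "aten::add.Tensor(Tensor self, Tensor other, *, Scalar alpha=1) -> Tensor",
    "aten::add_.Tensor(Tensor(a!) self, Tensor other, *, Scalar alpha=1) -> Tensor(a!)",
]
for schema in examples:
    print(returns_alias(schema), schema)
```

This also shows why ops whose names end with an underscore are picked up: in-place ops return their (mutated) input.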

Signed-off-by: Junjie Mao <junjie.mao@linux.alibaba.com>
2025-09-16 09:05:11 -07:00
e9d5d416cc deepcompile: Record graph order using OrderedDict (#7563)
On clear, GraphOrder does not clear ordered_frames, which may confuse
subsequent passes after the first iteration.

Use an OrderedDict to record the mapping from frame IDs to other
graph-related information.

Also fix the type annotation of graph_order, which is a list of (int,
bool) tuples rather than a list of int.
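
A minimal sketch of the approach, assuming a simplified `GraphOrder` that maps each frame ID to a single boolean. The method names and the meaning of the boolean are assumptions, not the actual DeepCompile class.

```python
from collections import OrderedDict
from typing import List, Tuple


class GraphOrder:
    """Records frames in first-seen order using one OrderedDict."""

    def __init__(self) -> None:
        self.frames: "OrderedDict[int, bool]" = OrderedDict()

    def add_graph(self, frame_id: int, needs_backward: bool) -> None:
        # Keep the position of the first occurrence of each frame ID.
        if frame_id not in self.frames:
            self.frames[frame_id] = needs_backward

    def get_graph_order(self) -> List[Tuple[int, bool]]:
        return list(self.frames.items())

    def clear(self) -> None:
        # A single container holds both order and data, so nothing is left behind.
        self.frames.clear()
```

Because ordering and per-frame data live in the same container, `clear()` cannot leave a stale ordering list behind.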

Signed-off-by: Junjie Mao <junjie.mao@linux.alibaba.com>
2025-09-16 05:25:32 +00:00
660ee89529 deepcompile: Create dummy inputs using empty_strided (#7564)
CUDA tensors may have a larger storage than numel() * dtype.itemsize due
to alignment considerations. Creating dummy tensors with
torch.zeros().as_strided() leads to out-of-bounds errors in such cases.

Create dummy inputs by empty_strided().zero_() instead.

Signed-off-by: Junjie Mao <junjie.mao@linux.alibaba.com>
2025-09-15 14:19:06 -07:00
d40a0f5de8 Add dependency for deepcompile test (#7558)
This PR adds a dependency to the CI tests for DeepCompile.

---------

Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
2025-09-13 00:45:08 -07:00
b9bd03a2ec Move modal tests to tests/v1 (#7557)
This PR moves active tests under `tests/unit/v1` to clarify which tests
are run on modal.

---------

Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
2025-09-12 17:28:47 -04:00
0e859aa0d3 Fix gradient buffer access for DeepCompile Z1/2 (#7548)
The initialization of DeepCompile+Z1/2 now fails due to the change
introduced in #7509.

This PR resolves the issue by:
- Adding an argument to optimizer.get_flat_partition
- Skipping the entire allreduce function in the engine

---------

Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
2025-09-10 18:12:02 +00:00
0012ff6ea8 Limit random seed range in tests (#7553)
`pytest-randomly` often passes a large seed value to `set_random_seed`
and causes an error
([example](https://github.com/deepspeedai/DeepSpeed/actions/runs/17620450004/job/50064585974))
```
E ValueError: Seed must be between 0 and 2**32 - 1
```

This PR limits the range of seed values by taking a modulo.
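
A minimal sketch of the modulo approach; the helper name `limit_seed` is hypothetical and the bound is taken from the error message above.

```python
MAX_SEED_EXCLUSIVE = 2**32  # valid seeds are 0 .. 2**32 - 1


def limit_seed(seed: int) -> int:
    """Map an arbitrary non-negative integer seed into the accepted range."""
    return seed % MAX_SEED_EXCLUSIVE
```

Seeds that are already in range are unchanged, so existing reproducible runs keep their behavior.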

Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
2025-09-10 17:45:37 +00:00
8cbbbb539d [MoE] Fix misuse of num_experts as expert parallel group size (ep_size) (#7551)
Fixes #7535 

## Description
This PR fixes a bug in inference/engine.py where num_experts
(moe_experts) was incorrectly passed as the expert parallel group size
(ep_size) when creating expert parallel groups.

Currently:
```
if moe and dist.get_world_size() > 1:
    self._create_ep_parallel_group(config.moe.moe_experts)
```
This causes **invalid** behavior whenever `num_experts > world_size`,
because `_create_ep_parallel_group` expects a group size, not the total
number of experts, as pointed out by @Arnoochka.

## Root Cause

num_experts = number of experts inside the MoE layer.

ep_size = how many GPUs to group together for expert parallelism.

These were mixed up in the code.

## Fix

Replaced the incorrect call with the proper ep_size argument:
```
if moe and dist.get_world_size() > 1:
    self._create_ep_parallel_group(config.moe.ep_size)
```


Additionally, added a safety check in _create_ep_parallel_group to catch
invalid configurations:

```
num_ep_groups = dist.get_world_size() // moe_ep_size
if num_ep_groups == 0:
    raise ValueError(
        f"Invalid ep_size={moe_ep_size} for world_size={dist.get_world_size()}"
    )
```
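
To make the difference between the two quantities concrete, here is a standalone sketch of how ranks could be split into expert parallel groups. The helper `ep_group_ranks` and the contiguous-rank layout are assumptions for illustration; only the `world_size // ep_size` check mirrors the snippet above.

```python
from typing import List


def ep_group_ranks(world_size: int, ep_size: int) -> List[List[int]]:
    """Split ranks 0..world_size-1 into contiguous groups of ep_size ranks."""
    if ep_size <= 0:
        raise ValueError(f"ep_size must be positive, got {ep_size}")
    num_ep_groups = world_size // ep_size
    if num_ep_groups == 0:
        raise ValueError(f"Invalid ep_size={ep_size} for world_size={world_size}")
    return [list(range(g * ep_size, (g + 1) * ep_size)) for g in range(num_ep_groups)]


print(ep_group_ranks(world_size=8, ep_size=4))  # [[0, 1, 2, 3], [4, 5, 6, 7]]
```

Passing the number of experts (say 16) where the group size is expected would hit the `ValueError` on an 8-rank job, which is the previously silent misconfiguration.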
## Backward compatibility
- If a user was already running with ep_size >= num_experts, the old
code worked and continues to work.
- Only the previously broken case (num_experts > world_size) changes: it
now works correctly.

Signed-off-by: Flakes342 <ayushtanwar1729@gmail.com>
2025-09-09 22:31:44 -07:00
533e834b0a [alstn tutorial] support bs>1 (#7550)
Edit tutorial's demo code to support bs>1 and prevent div by zero
2025-09-09 12:51:42 -07:00
450b965efb Revert "Add index to HPU devices (#7497)" (#7545)
This reverts commit 047a7599d24622dfb37fa5e5a32c671b1bb44233.

Unfortunately, the above required a substantial redesign of the existing
HPU stack, which is currently not feasible, so we are reverting it.

Signed-off-by: Max Kovalenko <mkovalenko@habana.ai>
Co-authored-by: Olatunji Ruwase <tunji.ruwase@snowflake.com>
2025-09-08 18:07:55 -04:00
b82ef716c8 Improve error message and reduce validation in autocast test (#7547)
This PR improves error logging and relaxes loss value checks in the
autocast test.

Previously, the test displayed error messages and mismatched loss values
on all ranks, even if the mismatch only occurred on some ranks. This was
confusing, since logs from other ranks could appear correct. This PR
changes the behavior so that error messages are shown only on the ranks
where the mismatch occurs.

Additionally, this PR skips loss value validation for
`test_lower_precision_model`, where we intentionally use a different
communication dtype from the baseline (standard PyTorch autocast).

---------

Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
Co-authored-by: Stas Bekman <stas00@users.noreply.github.com>
2025-09-05 07:04:18 +00:00
08879a3916 avoid setting device_id to init_process_group (#7542)
In some use cases such as vLLM, we need to create new distributed groups
not only on GPU but also on CPU. Setting `device_id` here prevents us
from creating a new distributed group on CPU:
[L230](https://github.com/vllm-project/vllm/blob/main/vllm/distributed/parallel_state.py#L230).
This PR fixes this bug.

---------

Signed-off-by: Liu, Kaixuan <kaixuan.liu@intel.com>
Co-authored-by: Olatunji Ruwase <tunji.ruwase@snowflake.com>
Co-authored-by: Stas Bekman <stas00@users.noreply.github.com>
2025-09-05 06:06:26 +00:00
78a74874b2 fix get_cuda_compile_flag (#7521)
Command: python3 -c 'import deepspeed;deepspeed.ops.adam.cpu_adam.CPUAdamBuilder().load()'
When running on the ROCm platform, it encounters an error:

Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/deepspeed/ops/op_builder/builder.py", line 538, in load
    return self.jit_load(verbose)
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/deepspeed/ops/op_builder/builder.py", line 570, in jit_load
    cxx_args = self.strip_empty_entries(self.cxx_args())
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/deepspeed/ops/op_builder/builder.py", line 401, in strip_empty_entries
    return [x for x in args if len(x) > 0]
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/deepspeed/ops/op_builder/builder.py", line 401, in <listcomp>
    return [x for x in args if len(x) > 0]
TypeError: object of type 'NoneType' has no len()

Compare with version 0.16.5:
https://github.com/deepspeedai/DeepSpeed/blob/v0.16.5/op_builder/builder.py#L435
The current version of the code is missing a return when
self.is_rocm_pytorch() is True. Adding return '-D__DISABLE_CUDA__'
fixes the error.
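
The mechanism can be reproduced in isolation. In the sketch below, `strip_empty_entries` copies the filter shown in the traceback, while the two `compile_flag_*` functions and the `-DOTHER_FLAG` value are hypothetical stand-ins, not the real builder code.

```python
def strip_empty_entries(args):
    # Same filter as in the traceback: len() fails on a None entry.
    return [x for x in args if len(x) > 0]


def compile_flag_missing_return(is_rocm: bool):
    if is_rocm:
        pass  # bug: this branch falls through and implicitly returns None
    else:
        return "-DOTHER_FLAG"  # placeholder for the non-ROCm branch


def compile_flag_fixed(is_rocm: bool) -> str:
    if is_rocm:
        return "-D__DISABLE_CUDA__"
    return "-DOTHER_FLAG"  # placeholder for the non-ROCm branch


print(strip_empty_entries(["-O3", "", compile_flag_fixed(True)]))
try:
    strip_empty_entries(["-O3", compile_flag_missing_return(True)])
except TypeError as exc:
    print("TypeError:", exc)
```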

---------

Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Olatunji Ruwase <tunji.ruwase@snowflake.com>
2025-09-04 12:34:17 -04:00
43537d0a60 Autotune ZenFlow affinity (#7506)
This PR addresses the following ZenFlow optimizer core binding issue:
https://github.com/deepspeedai/DeepSpeed/issues/7478

With this PR, the ZenFlow optimizer worker derives its core binding
from the DeepSpeed core binding mechanism. The algorithm is as follows:
1. Each DeepSpeed rank gets its core binding from the DeepSpeed command
line flag `--bind_cores_to_rank`, which assigns physical CPU cores to the
different workers.
2. When spawning the ZenFlow optimizer worker, DeepSpeed splits the
current CPU affinity list into two sublists: pt_affinity and zf_affinity.
3. zf_affinity is used to set the affinity of the ZenFlow optimizer
worker, and pt_affinity is used to set the affinity of the current
PyTorch process.
4. By default, one core is reserved for each PyTorch process and the rest
are used by the ZenFlow optimizer worker. The number of cores reserved
for the PyTorch process can be changed with the ZenFlow config variable
`pt_reserved_cores`.
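
The split in steps 2 to 4 can be sketched as follows. The helper `split_affinity` is hypothetical, and giving the lowest-numbered cores to the PyTorch process is an assumption; the PR text does not say which cores are reserved.

```python
from typing import Iterable, List, Tuple


def split_affinity(cores: Iterable[int], pt_reserved_cores: int = 1) -> Tuple[List[int], List[int]]:
    """Split a CPU affinity list into (pt_affinity, zf_affinity)."""
    ordered = sorted(cores)
    if not 0 < pt_reserved_cores < len(ordered):
        raise ValueError(
            f"pt_reserved_cores={pt_reserved_cores} must leave at least one core "
            f"for each side of {len(ordered)} available cores"
        )
    # Assumption: the first cores go to PyTorch, the remainder to the worker.
    return ordered[:pt_reserved_cores], ordered[pt_reserved_cores:]


pt_affinity, zf_affinity = split_affinity({4, 5, 6, 7})
print(pt_affinity, zf_affinity)  # [4] [5, 6, 7]
```

On Linux, each list could then be applied to the corresponding process with `os.sched_setaffinity`.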

---------

Signed-off-by: Guokai Ma <guokai.ma@gmail.com>
Signed-off-by: Ma, Guokai <guokai.ma@intel.com>
Signed-off-by: aeeeeeep <aeeeeeep@proton.me>
Signed-off-by: Olatunji Ruwase <tunji.ruwase@snowflake.com>
Co-authored-by: Stas Bekman <stas00@users.noreply.github.com>
Co-authored-by: aeeeeeep <aeeeeeep@proton.me>
Co-authored-by: Zhipeng Wang <zhipeng.rainbowserie@gmail.com>
Co-authored-by: Zhipeng Wang <zwanga@wustl.edu>
Co-authored-by: Peng Du <pedu@linkedin.com>
Co-authored-by: pengdurice <pengduhit@gmail.com>
Co-authored-by: Zhipeng Wang <zhipengbayern@gmail.com>
Co-authored-by: Olatunji Ruwase <tunji.ruwase@snowflake.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
2025-09-04 07:10:39 -04:00
66bf2a642d Relax restrictions of torch.autocast integration (#7543)
This PR relaxes two restrictions on torch.autocast in the DeepSpeed
engine:

1) Nesting torch.autocast
Currently, we do not expect `torch.autocast` to be used outside the
DeepSpeed engine. Here is the current behavior:
- If `torch.autocast` is enabled in the DeepSpeed config and the engine
detects it is also enabled outside, a warning is displayed.
- If it is disabled in the config, the engine raises an error.

This design prevents the following usage:
```python
with torch.autocast(...):
    logits = deepspeed_model(...)
    loss = criteria_fn(logits)
```
In this case, we also want to apply autocast to `criteria_fn`. With the
current behavior, we would need to move `deepspeed_model(...)` outside the
`torch.autocast` context, leading to inconsistent code between DeepSpeed
and non-DeepSpeed setups. (This cannot be handled with the `enabled`
argument of `torch.autocast`.)

Change in this PR:
`torch.autocast` outside the DeepSpeed engine is ignored, and
- If `torch_autocast` is enabled in the config, DeepSpeed will follow
that setting.
- If it is disabled, DeepSpeed falls back to its own mixed-precision
support (or FP32).

In these cases, DeepSpeed engine shows a message to explain the
behavior.

2) Model’s dtype

Previously, DeepSpeed assumed the model’s dtype must be FP32 when
`torch.autocast` was enabled. However, models with lower-precision
parameters (e.g., BF16) can also be used with autocast. For example, if
both the model and `torch.autocast` use BF16, autocast will upcast
precision-sensitive ops as needed.

Change in this PR:
Removed the assertion that restricted the model’s dtype to FP32.

This PR also adds and updates tests to cover these new behaviors.

---------

Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
2025-09-03 12:15:10 -07:00
8af75487f4 Fix zenflow_torch_adam.py (#7544)
The `_disable_dynamo_if_unsupported` fallback wasn't being created under
certain conditions. This PR fixes that and also removes a debug print.

Fixes an issue installing deepspeed on torch 2.4 and 2.1 that triggered
this:
```
#42 15.84       Traceback (most recent call last):
#42 15.84         File "<string>", line 2, in <module>
#42 15.84         File "<pip-setuptools-caller>", line 34, in <module>
#42 15.84         File "/tmp/pip-install-qgzd6ybt/deepspeed_b3b4858a062d49c7b8d6ef31332a96cf/setup.py", line 40, in <module>
#42 15.84           from op_builder import get_default_compute_capabilities, OpBuilder
#42 15.84         File "/tmp/pip-install-qgzd6ybt/deepspeed_b3b4858a062d49c7b8d6ef31332a96cf/op_builder/__init__.py", line 18, in <module>
#42 15.84           import deepspeed.ops.op_builder  # noqa: F401 # type: ignore
#42 15.84         File "/tmp/pip-install-qgzd6ybt/deepspeed_b3b4858a062d49c7b8d6ef31332a96cf/deepspeed/__init__.py", line 25, in <module>
#42 15.84           from . import ops
#42 15.84         File "/tmp/pip-install-qgzd6ybt/deepspeed_b3b4858a062d49c7b8d6ef31332a96cf/deepspeed/ops/__init__.py", line 6, in <module>
#42 15.84           from . import adam
#42 15.84         File "/tmp/pip-install-qgzd6ybt/deepspeed_b3b4858a062d49c7b8d6ef31332a96cf/deepspeed/ops/adam/__init__.py", line 9, in <module>
#42 15.84           from .zenflow_torch_adam import ZenFlowSelectiveAdamW
#42 15.84         File "/tmp/pip-install-qgzd6ybt/deepspeed_b3b4858a062d49c7b8d6ef31332a96cf/deepspeed/ops/adam/zenflow_torch_adam.py", line 685, in <module>
#42 15.84           @_disable_dynamo_if_unsupported(single_tensor_fn=_single_tensor_adamw)
#42 15.84       NameError: name '_disable_dynamo_if_unsupported' is not defined
#42 15.84       [WARNING] ZenFlow disabled: torch internal optimizer symbols could not be imported: cannot import name '_disable_dynamo_if_unsupported' from 'torch.optim.optimizer' (/usr/local/lib/python3.10/dist-packages/torch/optim/optimizer.py)
```

---------

Signed-off-by: Stas Bekman <stas@stason.org>
2025-09-03 18:14:18 +00:00
1e183a6a9d Fix scaling and allgather with torch.autocast (#7534)
This PR includes these two fixes:
- Use GradScaler only for FP16 (not for BF16)
- Fix dtype conversion for ZeRO3 allgather
  - The reduce hook should be called only once, even when a parameter is
    shared across multiple layers (tied parameters).
  - Currently, the hook is triggered at each tied layer because we
    temporarily set `.data` with a different dtype.
  - The fix ensures that the parameter consistently retains the same
    dtype.

---------

Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
Signed-off-by: Olatunji Ruwase <tunji.ruwase@snowflake.com>
Signed-off-by: Stas Bekman <stas@stason.org>
Signed-off-by: jakehemmerle <jakehemmerle@protonmail.com>
Signed-off-by: Qi Bin <qibin0506@users.noreply.github.com>
Co-authored-by: Olatunji Ruwase <tunji.ruwase@snowflake.com>
Co-authored-by: Stas Bekman <stas00@users.noreply.github.com>
Co-authored-by: digger yu <digger-yu@outlook.com>
Co-authored-by: Jake Hemmerle <jakehemmerle@gmail.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Qi Bin <qibin0506@users.noreply.github.com>
2025-09-03 01:22:19 +00:00
c07b3abf9a fixed DeepSpeedCPULion with ZeRO-Offload bug (#7531)
Fixed a bug in DeepSpeedCPULion with ZeRO-Offload:
[issues/7524](https://github.com/deepspeedai/DeepSpeed/issues/7524)

Signed-off-by: Qi Bin <qibin0506@users.noreply.github.com>
Co-authored-by: Olatunji Ruwase <tunji.ruwase@snowflake.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
2025-09-02 21:40:14 +00:00
4d83f3fe13 docs typo: lrrt.md, reference to cycle_min_lr should be cycle_max_lr (#7530)
Signed-off-by: Olatunji Ruwase <tunji.ruwase@snowflake.com>
Signed-off-by: jakehemmerle <jakehemmerle@protonmail.com>
Co-authored-by: Olatunji Ruwase <tunji.ruwase@snowflake.com>
Co-authored-by: Stas Bekman <stas00@users.noreply.github.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
2025-09-02 21:17:22 +00:00
9e4957eb30 [doc] fixing moe tutorial (#7538)
MoE tutorial fixes:
1. cifar example has been moved - fix the url
2. fixing text and improving markup

---------

Signed-off-by: Stas Bekman <stas@stason.org>
2025-09-02 16:53:15 -04:00
066d912052 [logging] less startup noise (#7526)
This PR removes some startup noise and makes it possible to remove the
rest, especially when it's replicated rank-times and doesn't carry any
informative payload.

1. Add a `--log_level` flag which sets the launcher's logger to the
desired setting. It defaults to `logging.INFO` for now for BC, but will
change to `logging.WARNING` in v1.
2. Add a `--quiet/-q` flag which sets the launcher's logger to
`logging.ERROR`, which essentially disables startup info messages.
3. Change the logging defaults elsewhere to `logging.WARNING` (the main
impact is in accelerator.py). Once deepspeed has started, the frameworks
control its log level for each rank, so the tricky part is the info logs
of this pre-start stage. This part breaks BC, as there is no machinery to
set the logger level for `real_accelerator.py`.
4. The builder is changed to non-verbose (BC breaking).
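
A rough sketch of how the two flags could map to a logger level, using only `argparse` and `logging`. The flag names come from the list above; the accepted level names, the helper `resolve_log_level`, and the precedence of `--quiet` over `--log_level` are assumptions, not the launcher's actual implementation.

```python
import argparse
import logging
from typing import List

# Assumed set of accepted level names.
LEVELS = {
    "debug": logging.DEBUG,
    "info": logging.INFO,
    "warning": logging.WARNING,
    "error": logging.ERROR,
}


def resolve_log_level(argv: List[str]) -> int:
    """Parse the two flags and return the logging level they select."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--log_level", choices=sorted(LEVELS), default="info")
    parser.add_argument("-q", "--quiet", action="store_true")
    args = parser.parse_args(argv)
    # Assumption: --quiet wins when both flags are given.
    return logging.ERROR if args.quiet else LEVELS[args.log_level]


logging.getLogger("launcher_demo").setLevel(resolve_log_level(["--log_level", "warning"]))
```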

---------

Signed-off-by: Stas Bekman <stas@stason.org>
Co-authored-by: Olatunji Ruwase <tunji.ruwase@snowflake.com>
2025-09-02 19:14:57 +00:00
411e20a3f7 undo the revert (#7536)
replay https://github.com/deepspeedai/DeepSpeed/pull/3019 as it got
reverted
2025-09-02 14:24:48 -04:00
902e78c989 fix typo s/1014 /1024 (#7528)
fix typo s/1014 /1024  
         s/was_interruptted /was_interrupted

detail info 
        modified:   deepspeed/autotuning/scheduler.py
        modified:   deepspeed/autotuning/utils.py

Co-authored-by: Olatunji Ruwase <tunji.ruwase@snowflake.com>
2025-09-01 01:12:40 +00:00
eabb687ac1 ZeRO3: Improve mismatch detection (#7525)
ZeRO3 tracks DDP (SPMD) behavior by matching the values of different
training states across ranks. Some of these states are represented as
lists, and mismatches sometimes manifest as hangs during error detection.
This PR improves error detection by first validating the list lengths
across ranks before validating the list contents.
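
The two-phase check can be sketched without any distributed backend by simulating the per-rank lists in a dictionary. The helper `find_mismatch` is hypothetical; a real implementation would gather the lengths and contents with collective operations.

```python
from typing import Dict, List, Optional


def find_mismatch(states_by_rank: Dict[int, List[int]]) -> Optional[str]:
    """Return a description of the first mismatch, or None if all ranks agree."""
    # Phase 1: compare lengths, which are cheap fixed-size values.
    lengths = {rank: len(state) for rank, state in states_by_rank.items()}
    if len(set(lengths.values())) > 1:
        return f"list length mismatch across ranks: {lengths}"

    # Phase 2: lengths agree, so comparing contents element-wise is safe.
    reference = next(iter(states_by_rank.values()), [])
    for rank, state in states_by_rank.items():
        if state != reference:
            return f"list content mismatch on rank {rank}"
    return None
```

Checking lengths first matters in the distributed setting because a collective over lists of different sizes is one way ranks can end up waiting on each other instead of reporting an error.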

Motivated by
https://github.com/deepspeedai/DeepSpeed/issues/7461#issuecomment-3235146207

---------

Signed-off-by: Olatunji Ruwase <tunji.ruwase@snowflake.com>
Co-authored-by: Stas Bekman <stas00@users.noreply.github.com>
2025-08-31 17:57:10 -04:00
9bf215d213 Add riscv64 cpu support in deepspeed_shm_comm op (#7519)
This patch adds riscv64 support for the deepspeed_shm_comm operator,
enabling DeepSpeed to perform CPU training/inference on RISCV64 hosts
for research purposes. Based on the discussion in pull #7387, this patch
refactors some of the original code to support multiple CPU
architectures.

Related tests have passed on x86 and RISC-V CPUs, and I successfully ran
Qwen2.5 on a RISC-V CPU:
```bash
(myenv) [root@openeuler-riscv64 DeepSpeed ]$ pytest tests/unit/comm/test_dist.py::TestDistInferenceAllReduce -vv
====================================================================== test session starts =======================================================================
platform linux -- Python 3.11.4, pytest-7.2.0, pluggy-1.6.0 -- /root/myenv/bin/python3
cachedir: .pytest_cache
hypothesis profile 'default'
rootdir: /root/ecosystem/DeepSpeed/tests, configfile: pytest.ini
plugins: mock-3.14.1, hypothesis-6.135.14, forked-1.6.0
collected 3 items

tests/unit/comm/test_dist.py::TestDistInferenceAllReduce::test[dtype0] PASSED                                                                              [ 33%]
tests/unit/comm/test_dist.py::TestDistInferenceAllReduce::test[dtype1] PASSED                                                                              [ 66%]
tests/unit/comm/test_dist.py::TestDistInferenceAllReduce::test[dtype2] PASSED                                                                              [100%]

(myenv) root@ubuntu-2204:~/soft-working-dir/DeepSpeed# pytest tests/unit/comm/test_dist.py::TestDistInferenceAllReduce -vv
====================================================================== test session starts =======================================================================
platform linux -- Python 3.12.3, pytest-7.2.0, pluggy-1.6.0 -- /root/soft-working-dir/myenv/bin/python3
cachedir: .pytest_cache
rootdir: /root/soft-working-dir/DeepSpeed/tests, configfile: pytest.ini
plugins: forked-1.6.0
collected 3 items

tests/unit/comm/test_dist.py::TestDistInferenceAllReduce::test[dtype0] PASSED                                                                              [ 33%]
tests/unit/comm/test_dist.py::TestDistInferenceAllReduce::test[dtype1] PASSED                                                                              [ 66%]
tests/unit/comm/test_dist.py::TestDistInferenceAllReduce::test[dtype2] PASSED                                                                              [100%]

```

---------

Signed-off-by: heyujiao99 <he.yujiao@sanechips.com.cn>
Co-authored-by: Olatunji Ruwase <tunji.ruwase@snowflake.com>
Co-authored-by: Ma, Guokai <guokai.ma@gmail.com>
2025-08-29 23:41:25 +08:00
e04fa3e679 Update README with ZenFlow release blog featured by PyTorch. (#7520)
**Main change:**
Add post bullet and link to ZenFlow release blog on latest news.

**Blog link:**

https://pytorch.org/blog/zenflow-stall-free-offloading-engine-for-llm-training/

---------

Signed-off-by: Tingfeng Lan <erc8gx@virginia.edu>
2025-08-28 13:28:08 -04:00
889f0ead27 Enable non-ZeRO mode (#7515)
Enabled via `stage=0` which corresponds to DDP. 
Remove hardwired path to b16_optimizer.
Enable `torch.autocast` for DDP training
Enable native mixed precision DDP for bfloat16
Update torch.autocast and native mixed precision UTs

---------

Signed-off-by: Olatunji Ruwase <tunji.ruwase@snowflake.com>
Co-authored-by: Stas Bekman <stas00@users.noreply.github.com>
2025-08-27 14:07:29 -04:00
66ad278048 Enabling Muon Optimizer in DeepSpeed (#7509)
Authorship: @pengdurice and @PKUWZP 

Related Issue: #7438

# Introduction

[Muon](https://arxiv.org/abs/2502.16982), a new optimizer that has
recently attracted the community’s attention, shows promising results in
training large language models. Adding the Muon optimizer to DeepSpeed,
a popular OSS framework for large-scale training and inference, is
critically important for DeepSpeed users and developers. There has been
a [PR](https://github.com/deepspeedai/DeepSpeed/pull/7454) attempting
the adoption (huge thanks to @qimcis), which is a good starting point,
but it still requires more substantial effort to make Muon fully
compatible with DeepSpeed. We are publishing this PR to fully enable
Muon optimizer capabilities in DeepSpeed.

# Issues and solutions
## Issues
1. With stage 1, 2 or 3, the optimizer states are partitioned within
the same data parallel group. This means that each process already
handles only part of the model parameters, so there is no need to use
the DP solution as in the
[code](https://github.com/KellerJordan/Muon/blob/master/muon.py#L195).
2. The parameters (and the gradients) are flattened into a 1D vector
before being used in the optimizer, which nullifies the central premise
of the Muon optimizer: it works by orthogonalizing the updates for each
matrix (dim >= 2).

## Solutions
To solve these issues, this PR makes the following changes:
1. We simplify the Muon code by
[removing](https://github.com/deepspeedai/DeepSpeed/compare/master...pengdurice:DeepSpeed:peng-add-muon-v1#diff-c9052994e41caee9ca88363749c10af08655f8019f08dc971c018663d25a3712R22)
the partitioning and Muon update logic.

2. We
[move](https://github.com/deepspeedai/DeepSpeed/compare/master...pengdurice:DeepSpeed:peng-add-muon-v1#diff-99dcf26ea2876ff5bbf05b5165c4133eaa0d0f36b170685643c2f7e2eb566addR1867)
the Muon update to the
[get_flat_partition](https://github.com/deepspeedai/DeepSpeed/compare/master...pengdurice:DeepSpeed:peng-add-muon-v1#diff-99dcf26ea2876ff5bbf05b5165c4133eaa0d0f36b170685643c2f7e2eb566addR1848)
function of the stage 1 and 2 DeepSpeedZeroOptimizer, in which
per-parameter gradients are collected before being flattened and used by
the optimizer to update the model parameters. Since each parameter is
still in its original shape at that point, we can easily apply the Muon
update.
3. We also save the momentum buffer in the optimizer’s state so that
convergence continues smoothly after resuming from a saved checkpoint.
4. We added comprehensive unit tests to validate the Muon optimizer's
correctness and functionality.
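
To illustrate why the update has to be applied while each gradient still
has its matrix shape, here is a minimal NumPy sketch of Newton-Schulz
orthogonalization. It is not the code from this PR: it uses the textbook
cubic iteration rather than the tuned quintic of the reference Muon
implementation, and the function name, step count and `eps` are
assumptions made for the example.

```python
import numpy as np


def newton_schulz_orthogonalize(grad, steps=20, eps=1e-7):
    """Approximate the nearest semi-orthogonal matrix to a 2-D gradient.

    Illustrative only: classical cubic Newton-Schulz iteration, not the
    tuned quintic used by the reference Muon implementation.
    """
    if grad.ndim != 2:
        # A flattened 1-D gradient has no row/column structure to orthogonalize.
        raise ValueError("orthogonalization needs a 2-D matrix")
    # Scale by the Frobenius norm so every singular value is at most 1,
    # which keeps the iteration inside its convergence region.
    x = grad / (np.linalg.norm(grad) + eps)
    transposed = x.shape[0] > x.shape[1]
    if transposed:
        x = x.T  # iterate on the wide orientation so x @ x.T is the small product
    for _ in range(steps):
        # Each step pushes every nonzero singular value towards 1.
        x = 1.5 * x - 0.5 * (x @ x.T) @ x
    return x.T if transposed else x


grad = np.array([[3.0, 1.0, 0.0], [1.0, 2.0, 1.0]])
update = newton_schulz_orthogonalize(grad)
```

For a full-rank input such as `grad`, the rows of `update` come out
(numerically) orthonormal, while passing `grad.ravel()` raises an error.
That is the reason the update has to happen before flattening.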

# Future directions and roadmap
Several follow-up work items are of interest:
- [ ] Create a CPU offload version.
- [ ] Apply Muon to Stage 3
- [ ] Use the highly optimized version of Adam for the Adam part of
MuonWithAuxAdam optimizer.
- [ ] More efficient implementations e.g. a) add specialized kernels for
Newton-Schulz iteration and muon updates; b) parallelize updates for the
parameters (currently, each parameter is updated separately and
sequentially)

---------

Co-authored-by: Peng Du <pedu@linkedin.com>
Co-authored-by: pengdurice <pengduhit@gmail.com>
Co-authored-by: Zhipeng Wang <zhipengbayern@gmail.com>
Co-authored-by: Olatunji Ruwase <tunji.ruwase@snowflake.com>
2025-08-26 18:34:35 -07:00
e4662faffd Update TSC Committers (#7517)
Update the affiliations and the TSC Committers.

Co-authored-by: Zhipeng Wang <zwanga@wustl.edu>
2025-08-26 07:24:12 -04:00
38d1a9eb64 Fix assert when 'pp_int' object has no attribute 'custom_print_str' (#7507)
Fix the `'pp_int' object has no attribute 'custom_print_str'` error
raised when tracing the deepspeed module with debugging tools such as
[objwatch](https://github.com/aeeeeeep/objwatch):

```python3
import deepspeed
import objwatch

# Trace the deepspeed module with objwatch, capturing local variables.
objwatch.watch(targets=[deepspeed], framework="torch.distributed", indexes=[0], with_locals=True)
```

Signed-off-by: aeeeeeep <aeeeeeep@proton.me>
2025-08-25 10:57:08 -04:00
d9cb78683e CI funding shout out to modal.com (#7503)
modal.com has been sponsoring our CI - thank you, Modal! Add a shout
out.
2025-08-21 10:03:49 -07:00
bc8c0db3b4 Support DeepSpeed offload and reload states with ZeRO1 and ZeRO2 (#7421)
Please refer to https://github.com/deepspeedai/DeepSpeed/issues/7251

---------

Signed-off-by: lym <letusgo126@126.com>
Signed-off-by: Max Kovalenko <mkovalenko@habana.ai>
Signed-off-by: Alex Kiefer <alexkiefer51@gmail.com>
Signed-off-by: Stas Bekman <stas@stason.org>
Signed-off-by: Sam Foreman <saforem2@gmail.com>
Signed-off-by: Stas Bekman <stas.bekman@snowflake.com>
Signed-off-by: huanyuqu <yc37960@um.edu.mo>
Signed-off-by: weeknan <zhounan0431@163.com>
Signed-off-by: WoosungMyung <dntjd517@naver.com>
Signed-off-by: Nir Sonnenschein <nsonnenschein@habana.ai>
Signed-off-by: Junjie Mao <banxing.mjj@alibaba-inc.com>
Signed-off-by: vinceliu <lpnpcs@gmail.com>
Signed-off-by: Tingfeng Lan <erc8gx@virginia.edu>
Signed-off-by: Olatunji Ruwase <tunji.ruwase@snowflake.com>
Signed-off-by: Olatunji Ruwase <tjruwase@gmail.com>
Signed-off-by: Tunji Ruwase <tunji.ruwase@snowflake.com>
Signed-off-by: Yao, Matrix <matrix.yao@intel.com>
Signed-off-by: Yusen Wu <xrn4ub@virginia.edu>
Signed-off-by: cyy <cyyever@outlook.com>
Co-authored-by: Max Kovalenko <mkovalenko@habana.ai>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Masahiro Tanaka <81312776+tohtana@users.noreply.github.com>
Co-authored-by: Alexander Kiefer <56556451+alexk101@users.noreply.github.com>
Co-authored-by: Olatunji Ruwase <tunji.ruwase@snowflake.com>
Co-authored-by: Hongwei Chen <33092912+hwchen2017@users.noreply.github.com>
Co-authored-by: Stas Bekman <stas00@users.noreply.github.com>
Co-authored-by: Sam Foreman <saforem2@gmail.com>
Co-authored-by: Stas Bekman <stas.bekman@snowflake.com>
Co-authored-by: huanyuqu <55744355+huanyuqu@users.noreply.github.com>
Co-authored-by: weeknan <57584045+weeknan@users.noreply.github.com>
Co-authored-by: Olatunji Ruwase <tjruwase@gmail.com>
Co-authored-by: Aurick Qiao <aurick.qiao@snowflake.com>
Co-authored-by: Logan Adams <loadams@microsoft.com>
Co-authored-by: Zhipeng Wang <zhipeng.rainbowserie@gmail.com>
Co-authored-by: WoosungMyung <115716986+WoosungMyung@users.noreply.github.com>
Co-authored-by: Nir Sonnenschein <nsonnenschein@habana.ai>
Co-authored-by: Junjie Mao <junjie.mao@hotmail.com>
Co-authored-by: Junjie Mao <banxing.mjj@alibaba-inc.com>
Co-authored-by: lpnpcs <lpnpcs@vip.qq.com>
Co-authored-by: Ma, Guokai <guokai.ma@gmail.com>
Co-authored-by: Tingfeng Lan <tafflann@outlook.com>
Co-authored-by: Rui Yan <49115835+yanrui27@users.noreply.github.com>
Co-authored-by: Feng Yunlong <20281571+AlongWY@users.noreply.github.com>
Co-authored-by: Yao Matrix <matrix.yao@intel.com>
Co-authored-by: Tingfeng Lan <erc8gx@virginia.edu>
Co-authored-by: Yusen Wu <xrn4ub@virginia.edu>
Co-authored-by: Yuanyuan Chen <cyyever@outlook.com>
Co-authored-by: Michael Wyatt <michael.wyatt@snowflake.com>
2025-08-20 22:03:26 +00:00
f45159e415 Update version.txt after 0.17.5 release (#7502) 2025-08-20 21:41:57 +00:00
047a7599d2 Add index to HPU devices (#7497)
[PR #7266](https://github.com/deepspeedai/DeepSpeed/pull/7266)
requires devices to have explicit device indices (e.g. 'hpu:0',
'cuda:0', etc).

This PR aligns HPU devices with that requirement.

Signed-off-by: Max Kovalenko <mkovalenko@habana.ai>
v0.17.5
2025-08-19 00:30:56 +00:00
8cf5fc5787 Reduce performance impact of compiler.enable decorator (#7498)
For some accelerators (such as HPU) running in non-compile scenarios,
the `compiler.enable` decorator can cause a significant performance
drop of up to 8-12%.

We can easily avoid the performance hit in non-compile scenarios by
detecting whether compilation is in progress and returning immediately.
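
The early-return idea can be sketched as follows. This is one reading
of the description above, not the actual DeepSpeed code: `is_compiling`,
the `_compiling` flag and the `slow_path_calls` counter are hypothetical
stand-ins, and the decorator is simplified to take no arguments.

```python
import functools

_compiling = False   # hypothetical stand-in for a real "is a compile in progress?" query
slow_path_calls = 0  # counts how often the compile-specific branch runs


def is_compiling():
    return _compiling


def enable(func):
    """Simplified decorator: do extra work only while compiling."""

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        global slow_path_calls
        # Fast path: nothing is being compiled, so call through immediately.
        if not is_compiling():
            return func(*args, **kwargs)
        # Slow path: placeholder for compile-specific checks and handling.
        slow_path_calls += 1
        return func(*args, **kwargs)

    return wrapper


@enable
def add(a, b):
    return a + b


result = add(1, 2)
```

With the flag left at `False`, the call goes straight to `add` and the
slow-path counter stays at zero, so eager-mode callers pay only for one
cheap check per call.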

Signed-off-by: Max Kovalenko <mkovalenko@habana.ai>
Co-authored-by: Olatunji Ruwase <tunji.ruwase@snowflake.com>
2025-08-18 22:04:10 +00:00
12b4dc19a7 Fix DeepCompile for PyTorch v2.8 (#7496)
This PR updates the kernel generation function arguments in Inductor to
ensure DeepCompile is compatible with PyTorch v2.8.
It also fixes the logging output of DeepCompile.
2025-08-18 12:12:59 -04:00
1c03d1b1bb Fix invalid f-strings (#7457)
Fix invalid f-strings detected by ruff.

---------

Signed-off-by: cyy <cyyever@outlook.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Olatunji Ruwase <tunji.ruwase@snowflake.com>
Co-authored-by: Michael Wyatt <michael.wyatt@snowflake.com>
2025-08-16 18:22:19 +00:00
1d7b90adc4 Add Zenflow code for Stage 1 & 2 (#7391)
This PR adds ZenFlow, an importance-aware offloaded training framework
for DeepSpeed ZeRO. ZenFlow enables multi-step overlap between
computation and communication during offloaded training, improving GPU
utilization and reducing stalls.

Highlights:
- New ZenFlow optimizers (ZenFlowCPUAdam, ZenFlowSelectiveAdamW)
- ZenFlowZeroOptimizer for ZeRO Stage 1/2 integration
- Configurable via ZenFlowConfig, integrated with DeepSpeedZeroConfig
- Unit tests and documentation included

Note: This PR focuses on Stage 1 and 2 integration. Stage 3 support will
be introduced in a follow-up PR.

---------

Signed-off-by: Tingfeng Lan <erc8gx@virginia.edu>
Signed-off-by: Yusen Wu <xrn4ub@virginia.edu>
Signed-off-by: Olatunji Ruwase <tunji.ruwase@snowflake.com>
Co-authored-by: Yusen Wu <xrn4ub@virginia.edu>
Co-authored-by: Masahiro Tanaka <81312776+tohtana@users.noreply.github.com>
Co-authored-by: Olatunji Ruwase <tunji.ruwase@snowflake.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Olatunji Ruwase <tjruwase@gmail.com>
Co-authored-by: Guokai Ma <guokai.ma@gmail.com>
2025-08-15 17:32:22 +00:00
33cd94500e fix xpu device_id AttributeError issue (#7488)
# Reproduce
w/ PyTorch 2.8
```
$ git clone https://github.com/huggingface/trl.git
$ cd ./trl
$ accelerate launch     --config_file examples/accelerate_configs/deepspeed_zero3.yaml     examples/scripts/sft_gpt_oss.py     --torch_dtype bfloat16     --model_name_or_path openai/gpt-oss-20b     --packing true packing_strategy wrapped     --run_name 20b-full-eager     --attn_implementation sdpa     --dataset_num_proc 6     --dataset_name HuggingFaceH4/Multilingual-Thinking     --gradient_checkpointing     --max_length 4096     --per_device_train_batch_size 1     --num_train_epochs 1     --logging_steps 1     --warmup_ratio 0.03     --lr_scheduler_type cosine_with_min_lr     --lr_scheduler_kwargs '{"min_lr_rate": 0.1}'     --output_dir gpt-oss-20b-multilingual-reasoner     --report_to trackio     --seed 42
```

# Issue

> File "/workspace/accelerate/src/accelerate/state.py", line 216, in __init__
>     dist.init_distributed(dist_backend=self.backend, auto_mpi_discovery=False, **kwargs)
> File "/usr/local/lib/python3.12/dist-packages/deepspeed/comm/comm.py", line 854, in init_distributed
>     cdb = TorchBackend(dist_backend, timeout, init_method, rank, world_size)
>           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> File "/usr/local/lib/python3.12/dist-packages/deepspeed/comm/torch.py", line 120, in __init__
>     self.init_process_group(backend, timeout, init_method, rank, world_size)
> File "/usr/local/lib/python3.12/dist-packages/deepspeed/comm/torch.py", line 164, in init_process_group
>     torch.distributed.init_process_group(backend, **kwargs)
> File "/usr/local/lib/python3.12/dist-packages/torch/distributed/c10d_logger.py", line 81, in wrapper
>     return func(*args, **kwargs)
>            ^^^^^^^^^^^^^^^^^^^^^
> File "/usr/local/lib/python3.12/dist-packages/torch/distributed/c10d_logger.py", line 95, in wrapper
>     func_return = func(*args, **kwargs)
>                   ^^^^^^^^^^^^^^^^^^^^^
> File "/usr/local/lib/python3.12/dist-packages/torch/distributed/distributed_c10d.py", line 1685, in init_process_group
>     if device_id is not None and device_id.type != "cpu":
> AttributeError: 'device' object has no attribute 'type'

# Root Cause
`torch.xpu.device` in PyTorch is a context manager rather than a device
class, so it doesn't have a `type` attribute.

# Fix
Switch to `torch.device`.

Signed-off-by: Yao, Matrix <matrix.yao@intel.com>
Co-authored-by: Olatunji Ruwase <tunji.ruwase@snowflake.com>
2025-08-15 16:40:47 +00:00
64ac13f72e Enable forked PRs (#7486)
Enable forked PRs

---------

Signed-off-by: Olatunji Ruwase <tunji.ruwase@snowflake.com>
2025-08-14 17:43:08 -04:00
8aadf6cbe4 Fix pre-compile on cpu-only machines (#7168)
+ Fix pre-compile on cpu-only machines

---------

Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Olatunji Ruwase <tunji.ruwase@snowflake.com>
2025-08-12 10:24:30 -04:00
a54c394392 [TiledFusedLogitsLoss] support inference (#7477)
Adding inference support for `TiledFusedLogitsLoss` by skipping
`backward` inside `forward` if the incoming tensor doesn't require grad.

xref: https://github.com/snowflakedb/ArcticTraining/pull/259

---------

Signed-off-by: Stas Bekman <stas@stason.org>
Co-authored-by: Rui Yan <49115835+yanrui27@users.noreply.github.com>
Co-authored-by: Olatunji Ruwase <tunji.ruwase@snowflake.com>
2025-08-11 17:44:32 -04:00
d75196a098 [UlyssesSPDataLoaderAdapter] fix iterator reset (#7472)
Fixes https://github.com/snowflakedb/ArcticTraining/issues/254 - to
support multi-epoch training with `UlyssesSPDataLoaderAdapter`.

Thanks to @yanrui27 for the fix
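
As a general illustration of why an iterator reset matters for
multi-epoch training, here is a minimal pure-Python sketch. The class
below is hypothetical and is not the `UlyssesSPDataLoaderAdapter`
implementation; it only shows the pattern of creating a fresh iterator
at the start of every epoch.

```python
class ResettableLoaderAdapter:
    """Hypothetical adapter that can be iterated once per epoch."""

    def __init__(self, loader):
        self.loader = loader
        self._iter = None

    def __iter__(self):
        # Create a fresh iterator for every epoch. Creating it once in
        # __init__ instead would leave the second epoch with no batches.
        self._iter = iter(self.loader)
        return self

    def __next__(self):
        return next(self._iter)


adapter = ResettableLoaderAdapter([1, 2, 3])
epochs = [list(adapter) for _ in range(2)]
```

Both epochs see the full data because `__iter__` rebuilds the underlying
iterator each time a new `for` loop (or `list(...)`) starts.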

Signed-off-by: Stas Bekman <stas@stason.org>
Co-authored-by: Rui Yan <49115835+yanrui27@users.noreply.github.com>
2025-08-11 20:45:10 +00:00