## Fix `UnboundLocalError` in `ZeroLinear.backward()` when training only
bias parameters, as mentioned in #7435
This PR addresses an issue in the `ZeroLinear.backward()` method, where
the local variable `dim` could be referenced before assignment. This
happens specifically when:
- Only the bias parameters are set to `requires_grad=True`, and
- The training setup uses **ZeRO Stage 3**, **AMP**, and **gradient
checkpointing**.
### Problem
When only the bias requires gradients, the condition that sets `dim =
grad_output.dim()` is skipped, but the value of `dim` is still used
later in the computation, leading to an `UnboundLocalError`.
### Fix
Move the assignment `dim = grad_output.dim()` to occur unconditionally,
so that `dim` is always defined before being used in any branch of the
gradient computation logic.
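A minimal sketch of the failure mode and the fix, using a hypothetical autograd function (not the actual `ZeroLinear` implementation):
```python
import torch

class _LinearFn(torch.autograd.Function):
    """Hypothetical stand-in illustrating the fix, not the real ZeroLinear."""

    @staticmethod
    def forward(ctx, inp, weight, bias):
        ctx.save_for_backward(inp, weight, bias)
        return inp @ weight.t() + bias

    @staticmethod
    def backward(ctx, grad_output):
        inp, weight, bias = ctx.saved_tensors
        # The fix: assign unconditionally so every branch below can use it.
        dim = grad_output.dim()
        grad_input = grad_weight = grad_bias = None
        if ctx.needs_input_grad[0]:
            grad_input = grad_output @ weight
        if ctx.needs_input_grad[1]:
            grad_weight = grad_output.reshape(-1, grad_output.shape[-1]).t() \
                @ inp.reshape(-1, inp.shape[-1])
        if ctx.needs_input_grad[2]:
            # Before the fix, `dim` was only set inside the branches above, so
            # training only the bias raised UnboundLocalError here.
            grad_bias = grad_output.sum(dim=list(range(dim - 1)))
        return grad_input, grad_weight, grad_bias
```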
### Impact
This makes the backward pass more robust across different training
setups.
Signed-off-by: weeknan <zhounan0431@163.com>
Co-authored-by: Olatunji Ruwase <tjruwase@gmail.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
This PR fixes an `AttributeError` that occurs during
`deepspeed.init_inference` when using kernel injection
(`replace_with_kernel_inject=True`) with Llama models from recent
versions of `transformers`.
**The Bug:**
In newer `transformers` versions (e.g., `4.53.3`), configurations like
`num_heads` and `rope_theta` were moved from direct attributes of the
`LlamaAttention` module into a nested `config` object.
The current DeepSpeed injection policy tries to access these attributes
from their old, direct location, causing the initialization to fail with
an `AttributeError: 'LlamaAttention' object has no attribute
'num_heads'`.
**The Solution:**
This change updates the Llama injection logic to be more robust:
1. It first tries to read attributes like `num_heads` from the new
`config` object location.
2. If that fails, it falls back to the legacy direct attribute path.
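Roughly, the fallback looks like this (an illustrative helper; the real injection code and the config-side attribute names, e.g. `num_attention_heads` vs. `num_heads`, may differ):
```python
def _get_attn_attr(attention_module, name, legacy_name=None, default=None):
    """Read an attention attribute from the nested `config` first (newer
    transformers), then fall back to the legacy direct attribute."""
    config = getattr(attention_module, "config", None)
    if config is not None and hasattr(config, name):
        return getattr(config, name)
    return getattr(attention_module, legacy_name or name, default)

# e.g. num_heads = _get_attn_attr(attn, "num_attention_heads", legacy_name="num_heads")
```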
---------
Signed-off-by: huanyuqu <yc37960@um.edu.mo>
Improved TiledMLP and SequenceTiledCompute for bs>1
This PR:
- extends the testing utils to add `CaptureStd*`, `CaptureLogger`
context managers
- extends the test to run both bs=1 and bs=2
- uses an uneven seqlen to test varlen shards
- flattens the bs and seqlen dims to avoid problems with grad tensor strides
when bs>1: the MLP doesn't care about the bs dimension, so a pretend
`bs*seqlen` seqlen is used instead and the shape is restored at the end for
the grad (see the sketch below)
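A minimal sketch of the flatten-and-restore idea, with illustrative shapes and names (not the actual TiledMLP code):
```python
import torch

def tiled_mlp_like(x, mlp, num_shards=4):
    """Run a position-wise `mlp` over sequence shards; bs and seqlen are fused
    into one pretend seqlen and the shape is restored at the end."""
    bs, seqlen, hidden = x.shape
    flat = x.reshape(1, bs * seqlen, hidden)        # fuse bs + seqlen
    shards = torch.chunk(flat, num_shards, dim=1)   # shards may be uneven
    out = torch.cat([mlp(s) for s in shards], dim=1)
    return out.reshape(bs, seqlen, -1)              # restore for the grad
```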
---------
Signed-off-by: Stas Bekman <stas@stason.org>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
This PR overcomes the following warning when using any `torch.distributed`
calls with deepspeed:
```
[W404 00:15:21.693690333 ProcessGroupNCCL.cpp:4561] [PG ID 0 PG GUID 0 Rank 0] using GPU 0
to perform barrier as devices used by this process are currently unknown. This can
potentially cause a hang if this rank to GPU mapping is incorrect. Specify device_ids in
barrier() to force use of a particular device, or call init_process_group() with a device_id.
```
by setting `device_id` to the correct device corresponding to
`LOCAL_RANK` env var.
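The change amounts to something along these lines (a sketch assuming a CUDA/NCCL setup and a launcher that sets `LOCAL_RANK`):
```python
import os
import torch
import torch.distributed as dist

local_rank = int(os.environ.get("LOCAL_RANK", 0))
device = torch.device("cuda", local_rank)

# Passing device_id lets NCCL bind the rank to its GPU eagerly, which
# silences the "devices used by this process are currently unknown" warning.
dist.init_process_group(backend="nccl", device_id=device)
```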
-------------------
Update: discovered `torch.dist` deadlocks with `torch>=2.7.0` when using the
`device_id` arg; switching to draft for now as we can't commit this until we
know how to work around it.
---------
Signed-off-by: Stas Bekman <stas@stason.org>
Signed-off-by: Stas Bekman <stas.bekman@snowflake.com>
Co-authored-by: Olatunji Ruwase <tunji.ruwase@snowflake.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Stas Bekman <stas.bekman@snowflake.com>
The KV cache can be passed via `layer_past` or `past_key_value`
arguments. Previously, `past_key_value` was ignored, causing workload
incompatibilities.
This PR fixes the issue while preserving the original logic.
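Conceptually, the handling becomes something like this (an illustrative helper; the exact precedence in the kernel-injection code may differ):
```python
def _resolve_kv_cache(layer_past=None, past_key_value=None):
    # Accept the KV cache from either argument; previously only layer_past
    # was honored and past_key_value was silently dropped.
    return layer_past if layer_past is not None else past_key_value
```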
---------
Signed-off-by: Max Kovalenko <mkovalenko@habana.ai>
Trying to use `DeepSpeed/deepspeed/checkpoint/ds_to_universal.py`, I
encountered:
```python
Traceback (most recent call last):
File "/opt/aurora/24.347.0/frameworks/aurora_nre_models_frameworks-2025.0.0/lib/python3.10/concurrent/futures/process.py", line 246, in _process_worker
r = call_item.fn(*call_item.args, **call_item.kwargs)
File "/lus/flare/projects/AuroraGPT/CPT-AuroraGPT-v0/foremans/projects/argonne-lcf/Megatron-DeepSpeed/deps/DeepSpeed/deepspeed/checkpoint/ds_to_universal.py", line 114, in extract_zero_shards
sd = ds_checkpoint.get_zero_checkpoint_state(pp_index=pp_index, tp_index=tp_index, dp_index=dp_index)
File "/lus/flare/projects/AuroraGPT/CPT-AuroraGPT-v0/foremans/projects/argonne-lcf/Megatron-DeepSpeed/venvs/aurora/aurora_nre_models_frameworks-2025.0.0/lib/python3.10/site-packages/deepspeed/checkpoint/deepspeed_checkpoint.py", line 124, in get_zero_checkpoint_state
return self.zero_checkpoint.get_state_for_rank(pp_index=pp_index,
File "/lus/flare/projects/AuroraGPT/CPT-AuroraGPT-v0/foremans/projects/argonne-lcf/Megatron-DeepSpeed/venvs/aurora/aurora_nre_models_frameworks-2025.0.0/lib/python3.10/site-packages/deepspeed/checkpoint/zero_checkpoint.py", line 62, in get_state_for_rank
self._strip_tensor_paddings(sd)
File "/lus/flare/projects/AuroraGPT/CPT-AuroraGPT-v0/foremans/projects/argonne-lcf/Megatron-DeepSpeed/venvs/aurora/aurora_nre_models_frameworks-2025.0.0/lib/python3.10/site-packages/deepspeed/checkpoint/zero_checkpoint.py", line 110, in _strip_tensor_paddings
group_state[state_name] = torch.narrow(state_value, 0, 0, raw_length).clone()
RuntimeError: narrow(): length must be non-negative.
```
(see full traceback[^traceback] below)
The issue is that there is no way to propagate a `strip_tensor_paddings`
argument through the
[`DeepSpeedCheckpoint.get_zero_checkpoint_state(...)`](affee605e4/deepspeed/checkpoint/deepspeed_checkpoint.py (L123))
method to the
[`ZeroCheckpoint.get_state_for_rank(...)`](affee605e4/deepspeed/checkpoint/zero_checkpoint.py (L53))
method: the latter accepts it as an argument, but the former does not.
This PR adds a `strip_tensor_paddings` argument (default `True`) to the
`DeepSpeedCheckpoint.get_zero_checkpoint_state` method and passes it through
to `self.zero_checkpoint.get_state_for_rank(...,
strip_tensor_paddings=strip_tensor_paddings)`, as shown below:
```diff
- def get_zero_checkpoint_state(self, pp_index, tp_index, dp_index) -> dict:
+ def get_zero_checkpoint_state(self, pp_index, tp_index, dp_index, strip_tensor_paddings: bool = True) -> dict:
return self.zero_checkpoint.get_state_for_rank(pp_index=pp_index,
tp_index=tp_index,
dp_index=dp_index,
- keys_to_ignore=[PARAM_SHAPES])
+ keys_to_ignore=[PARAM_SHAPES],
+ strip_tensor_paddings=strip_tensor_paddings)
```
[^traceback]: Full traceback:
<details closed><summary>[Full Traceback]:</summary>
```bash
#[🐍 aurora_nre_models_frameworks-2025.0.0](👻
aurora_nre_models_frameworks-2025.0.0)
#[/f/A/C/f/p/a/Megatron-DeepSpeed][🌱 saforem2/fix-formatting][✓]
#[07/12/25 @ 16:07:12][x4209c2s4b0n0]
;
ckpt_dir=checkpoints/ws768_ds_stage1_nl32_hs4096_mb1_seq4096_gb3072_sp1_pp1_tp1_bf16_optadamw_lr_lwf_flash
; gs=$(cat "${ckpt_dir}/latest_checkpointed_iteration.txt") && echo
"global step: ${gs}" && python3
deps/DeepSpeed/deepspeed/checkpoint/ds_to_universal.py
--input_folder"${ckpt_dir}/global_step${gs}" --output_folder
"${ckpt_dir}/global_step${gs}_universal" --keep_temp_folder
global step: 158945
[W712 16:07:17.966425018 OperatorEntry.cpp:155] Warning: Warning only
once for all operators, other operators may also be overridden.
Overriding a previously registered kernel for the same operator and the
same dispatch key
operator: aten::_cummax_helper(Tensor self, Tensor(a!) values,
Tensor(b!) indices, int dim) -> ()
registered at /build/pytorch/build/aten/src/ATen/RegisterSchema.cpp:6
dispatch key: XPU
previous kernel: registered at
/build/pytorch/build/aten/src/ATen/RegisterCPU.cpp:30476
new kernel: registered at
/build/intel-pytorch-extension/build/Release/csrc/gpu/csrc/aten/generated/ATen/RegisterXPU.cpp:2971
(function operator())
/opt/aurora/24.347.0/frameworks/aurora_nre_models_frameworks-2025.0.0/lib/python3.10/site-packages/intel_extension_for_pytorch/nn/utils/_weight_prepack.py:6:
UserWarning: pkg_resources is deprecated as an API. See
https://setuptools.pypa.io/en/latest/pkg_resources.html. The
pkg_resources package is slated for removal as early as 2025-11-30.
Refrain from using this package or pin to Setuptools<81.
import pkg_resources
AttributeError: 'MessageFactory' object has no attribute 'GetPrototype'
AttributeError: 'MessageFactory' object has no attribute 'GetPrototype'
AttributeError: 'MessageFactory' object has no attribute 'GetPrototype'
AttributeError: 'MessageFactory' object has no attribute 'GetPrototype'
AttributeError: 'MessageFactory' object has no attribute 'GetPrototype'
[2025-07-12 16:07:27,740] [INFO]
[real_accelerator.py:254:get_accelerator] Setting ds_accelerator to xpu
(auto detect)
[2025-07-12 16:07:29,078] [INFO] [logging.py:107:log_dist] [Rank -1]
[TorchCheckpointEngine] Initialized with serialization = False
args =
Namespace(input_folder='checkpoints/ws768_ds_stage1_nl32_hs4096_mb1_seq4096_gb3072_sp1_pp1_tp1_bf16_optadamw_lr_lwf_flash/global_step158945',
output_folder='checkpoints/ws768_ds_stage1_nl32_hs4096_mb1_seq4096_gb3072_sp1_pp1_tp1_bf16_optadamw_lr_lwf_flash/global_step158945_universal',
num_extract_workers=4, num_merge_workers=2, keep_temp_folder=True,
strict=True, inject_missing_state=False)
Convert DeepSpeed Checkpoint to Universal Checkpoint
Converting DeepSpeed checkpoint in
checkpoints/ws768_ds_stage1_nl32_hs4096_mb1_seq4096_gb3072_sp1_pp1_tp1_bf16_optadamw_lr_lwf_flash/global_step158945
to Universal checkpoint in
checkpoints/ws768_ds_stage1_nl32_hs4096_mb1_seq4096_gb3072_sp1_pp1_tp1_bf16_optadamw_lr_lwf_flash/global_step158945_universal
/lus/flare/projects/AuroraGPT/CPT-AuroraGPT-v0/foremans/projects/argonne-lcf/Megatron-DeepSpeed/megatron/core/tensor_parallel/layers.py:290:
FutureWarning: `torch.cuda.amp.custom_fwd(args...)` is deprecated.
Please use `torch.amp.custom_fwd(args..., device_type='cuda')` instead.
def forward(
/lus/flare/projects/AuroraGPT/CPT-AuroraGPT-v0/foremans/projects/argonne-lcf/Megatron-DeepSpeed/megatron/core/tensor_parallel/layers.py:334:
FutureWarning: `torch.cuda.amp.custom_bwd(args...)` is deprecated.
Please use `torch.amp.custom_bwd(args..., device_type='cuda')` instead.
def backward(ctx, grad_output):
[2025-07-12 16:07:39,134079][I][ezpz/__init__:264:ezpz] Setting logging
level to 'INFO' on 'RANK == 0'
[2025-07-12 16:07:39,136376][I][ezpz/__init__:265:ezpz] Setting logging
level to 'CRITICAL' on all others 'RANK != 0'
*** 1. Extracting ZeRO fragments
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▋|
767/768 [01:29<00:00, 8.53it/s]
concurrent.futures.process._RemoteTraceback:
"""
Traceback (most recent call last):
File
"/opt/aurora/24.347.0/frameworks/aurora_nre_models_frameworks-2025.0.0/lib/python3.10/concurrent/futures/process.py",
line 246, in _process_worker
r = call_item.fn(*call_item.args, **call_item.kwargs)
File
"/lus/flare/projects/AuroraGPT/CPT-AuroraGPT-v0/foremans/projects/argonne-lcf/Megatron-DeepSpeed/deps/DeepSpeed/deepspeed/checkpoint/ds_to_universal.py",
line 114, in extract_zero_shards
sd = ds_checkpoint.get_zero_checkpoint_state(pp_index=pp_index,
tp_index=tp_index, dp_index=dp_index)
File
"/lus/flare/projects/AuroraGPT/CPT-AuroraGPT-v0/foremans/projects/argonne-lcf/Megatron-DeepSpeed/venvs/aurora/aurora_nre_models_frameworks-2025.0.0/lib/python3.10/site-packages/deepspeed/checkpoint/deepspeed_checkpoint.py",
line 124, in get_zero_checkpoint_state
return self.zero_checkpoint.get_state_for_rank(pp_index=pp_index,
File
"/lus/flare/projects/AuroraGPT/CPT-AuroraGPT-v0/foremans/projects/argonne-lcf/Megatron-DeepSpeed/venvs/aurora/aurora_nre_models_frameworks-2025.0.0/lib/python3.10/site-packages/deepspeed/checkpoint/zero_checkpoint.py",
line 62, in get_state_for_rank
self._strip_tensor_paddings(sd)
File
"/lus/flare/projects/AuroraGPT/CPT-AuroraGPT-v0/foremans/projects/argonne-lcf/Megatron-DeepSpeed/venvs/aurora/aurora_nre_models_frameworks-2025.0.0/lib/python3.10/site-packages/deepspeed/checkpoint/zero_checkpoint.py",
line 110, in _strip_tensor_paddings
group_state[state_name] = torch.narrow(state_value, 0, 0,
raw_length).clone()
RuntimeError: narrow(): length must be non-negative.
"""
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File
"/lus/flare/projects/AuroraGPT/CPT-AuroraGPT-v0/foremans/projects/argonne-lcf/Megatron-DeepSpeed/deps/DeepSpeed/deepspeed/checkpoint/ds_to_universal.py",
line 549, in <module>
main(args)
File
"/lus/flare/projects/AuroraGPT/CPT-AuroraGPT-v0/foremans/projects/argonne-lcf/Megatron-DeepSpeed/deps/DeepSpeed/deepspeed/checkpoint/ds_to_universal.py",
line 499, in main
_extract_zero_shard_files(args, ds_checkpoint, temp_dir)
File
"/lus/flare/projects/AuroraGPT/CPT-AuroraGPT-v0/foremans/projects/argonne-lcf/Megatron-DeepSpeed/deps/DeepSpeed/deepspeed/checkpoint/ds_to_universal.py",
line 370, in _extract_zero_shard_files
_do_parallel_work(do_work, _3d_range_list, args.num_extract_workers)
File
"/lus/flare/projects/AuroraGPT/CPT-AuroraGPT-v0/foremans/projects/argonne-lcf/Megatron-DeepSpeed/deps/DeepSpeed/deepspeed/checkpoint/ds_to_universal.py",
line 354, in _do_parallel_work
results.append(f.result())
File
"/opt/aurora/24.347.0/frameworks/aurora_nre_models_frameworks-2025.0.0/lib/python3.10/concurrent/futures/_base.py",
line 451, in result
return self.__get_result()
File
"/opt/aurora/24.347.0/frameworks/aurora_nre_models_frameworks-2025.0.0/lib/python3.10/concurrent/futures/_base.py",
line 403, in __get_result
raise self._exception
RuntimeError: narrow(): length must be non-negative.
[1] 144664 exit 1 python3
deps/DeepSpeed/deepspeed/checkpoint/ds_to_universal.py --input_folder
took: 0h:02m:08s
```
</details>
Signed-off-by: Sam Foreman <saforem2@gmail.com>
Closes #7415
# Description
Resets `bucket.elements` after reduction in ZeRO Stage 3.
Without this, the bucket grows indefinitely, reducing only one param at
a time.
Added `bucket.elements = 0` after `params_in_bucket.clear()`.
Dynamo currently breaks graphs because compilation is disabled for a number
of functions such as `iter_params` and `record_module`.
These functions compile successfully on at least PyTorch 2.7.0.
We enable compilation based on the user's PyTorch version using a new
`compiler.enable(min_version=None)` decorator, as sketched below.
This should avoid the corresponding graph breaks and improve performance.
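A sketch of what such a version-gated decorator can look like (illustrative; the real `compiler.enable` implementation may differ):
```python
import torch

def enable(min_version=None):
    """Illustrative sketch: leave a function traceable by Dynamo when the
    installed PyTorch is new enough, otherwise keep compilation disabled."""
    def decorator(fn):
        if min_version is None or torch.__version__ >= min_version:
            return fn                        # traceable: no graph break
        return torch._dynamo.disable(fn)     # older torch: previous behavior
    return decorator

@enable(min_version="2.7.0")
def iter_params(module):
    return list(module.parameters())
```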
---------
Signed-off-by: Max Kovalenko <mkovalenko@habana.ai>
Co-authored-by: Masahiro Tanaka <81312776+tohtana@users.noreply.github.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
It looks like my TiledMLP was working correctly only for batch_size=1;
fixing it to work with any bs.
Thanks to @winglian for detecting the problem and sending me an easy repro.
---------
Signed-off-by: Stas Bekman <stas@stason.org>
Since `destroy` accesses `self.optimizer`, but the error that triggers the
call to `destroy` can happen in `__init__` even before the optimizer and
scheduler are configured, we need to move the `self.optimizer` assignment to
the top to avoid triggering another exception.
e.g.:
```logs
File "deepspeed/runtime/engine.py", line 453, in _configure_tensor_parallel_states
assert self.zero_optimization_stage(
AssertionError: Currently, the compatibility between 'autotp' and 'zero_stage = 3' has not been validated
Exception ignored in: <function DeepSpeedEngine.__del__ at 0x1516c0610820>
Traceback (most recent call last):
File "deepspeed/runtime/engine.py", line 509, in __del__
self.destroy()
File "deepspeed/runtime/engine.py", line 512, in destroy
if self.optimizer is not None and hasattr(self.optimizer, 'destroy'):
File "deepspeed/runtime/engine.py", line 621, in __getattr__
raise AttributeError(f"'{type(self).__name__}' object has no attribute '{name}'")
AttributeError: 'DeepSpeedEngine' object has no attribute 'optimizer'
```
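The general pattern, as a hedged sketch (illustrative class, not the actual `DeepSpeedEngine` code):
```python
class EngineSketch:
    """Illustrative sketch of the initialization-order fix, not DeepSpeedEngine."""

    def __init__(self, zero_stage=1, autotp=False):
        # Assign early, so __del__/destroy can run even if validation below raises.
        self.optimizer = None
        self.lr_scheduler = None
        assert not (autotp and zero_stage == 3), \
            "compatibility between 'autotp' and 'zero_stage = 3' has not been validated"
        self.optimizer = "configured later, only if validation passes"

    def destroy(self):
        if self.optimizer is not None and hasattr(self.optimizer, "destroy"):
            self.optimizer.destroy()

    def __del__(self):
        self.destroy()
```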
Signed-off-by: Hollow Man <hollowman@opensuse.org>
This PR fixes an omission in the `deepspeed.comm` API where `GradBucket`
was not exposed, despite the package aiming for full compatibility with
`torch.distributed`.
## The Problem
As reported in issue #7393, when a user replaces `torch.distributed`
with `deepspeed.comm`, they expect all public APIs to be available.
However, attempting to access `deepspeed.comm.GradBucket` (for example,
when using it as a type hint for DDP communication hooks) results in an
`AttributeError`.
## The Solution
This change resolves the issue by importing `GradBucket` directly from
`torch.distributed` into the `deepspeed/comm/comm.py` file, making it
part of the public `deepspeed.comm` namespace.
A `# noqa: F401` comment has been added to the import line. This is
necessary to bypass the `flake8` linter's "imported but unused" check,
as the specific purpose of this import is to expose the symbol to the
end-user, not for it to be used within the `comm.py` file itself.
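In essence, the change boils down to a single import in `deepspeed/comm/comm.py`:
```python
from torch.distributed import GradBucket  # noqa: F401  (re-exported for API parity)
```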
## How This Was Tested
The fix was verified with a local test script that confirms
`deepspeed.comm.GradBucket` can now be accessed correctly and is
identical to `torch.distributed.GradBucket`. The pre-commit hooks now
pass successfully.
## Related Test Run Screenshot
<img width="1250" alt="Screenshot 2025-06-30 at 22 41 10"
src="https://github.com/user-attachments/assets/cadf18e1-9d1a-4164-a5ff-0b3e6804ac48"
/>
## Related Issue
Fixes #7393
Signed-off-by: Vensenmu <vensenmu@gmail.com>
Co-authored-by: Masahiro Tanaka <81312776+tohtana@users.noreply.github.com>
In torch v2.8.0, all the symm mem code was moved into a dedicated folder
ffc6cbfaf7
So this PR addresses that change by checking whether the header is located
under `torch/csrc/distributed/c10d/symm_mem/SymmetricMemory.hpp` (the new
location). If not, we fall back to the original place for backward
compatibility.
This PR also cleans up some includes in `z1/2/3.cpp` that are already
included in `deepcompile.h`.
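A sketch of how such a header check can be done from Python (illustrative; the actual op-builder code may differ):
```python
import os
import torch

def symm_mem_header():
    """Return the include path for SymmetricMemory.hpp: prefer the torch>=2.8
    location (symm_mem/ subfolder), fall back to the pre-2.8 one."""
    torch_include = os.path.join(os.path.dirname(torch.__file__), "include")
    new_rel = "torch/csrc/distributed/c10d/symm_mem/SymmetricMemory.hpp"
    old_rel = "torch/csrc/distributed/c10d/SymmetricMemory.hpp"
    return new_rel if os.path.exists(os.path.join(torch_include, new_rel)) else old_rel
```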
Signed-off-by: Hollow Man <hollowman@opensuse.org>
Since `z1.h`, `z2.h` and `z3.h` are currently located under `csrc/compile`,
without this patch torch hipify fails to identify these hipified headers on
AMD platforms:
```log
In file included from torch/include/ATen/cuda/CUDAEvent.h:3,
from deepspeed/ops/csrc/includes/deepcompile.h:16,
from deepspeed/ops/csrc/compile/z1.h:6,
from deepspeed/ops/csrc/compile/z1_hip.cpp:7:
torch/include/ATen/cuda/ATenCUDAGeneral.h:3:10: fatal error: cuda.h: No such file or directory
3 | #include <cuda.h>
| ^~~~~~~~
compilation terminated.
```
Signed-off-by: Hollow Man <hollowman@opensuse.org>
In `comms_logging.py`, when `log_all` is called with the `show_straggler`
option enabled, an `all_reduce` is performed across all nodes to compute the
minimum latency and find stragglers. However, the tensors on which this is
performed are not moved to the configured devices. This commit adds that
capability using DeepSpeed's abstract accelerator API.
Resolves #7397
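Conceptually, the change looks like this (a sketch using the abstract accelerator API; names in the real `comms_logging.py` may differ):
```python
import torch
import deepspeed.comm as dist
from deepspeed.accelerator import get_accelerator

def min_latency_across_ranks(latency_usec):
    # Place the tensor on the accelerator selected for this rank so the
    # backend (e.g. NCCL) can all_reduce it; a CPU tensor would fail or
    # silently pick the wrong device.
    t = torch.tensor([latency_usec],
                     device=get_accelerator().current_device_name())
    dist.all_reduce(t, op=dist.ReduceOp.MIN)
    return t.item()
```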
Signed-off-by: Alex Kiefer <alexkiefer51@gmail.com>
Co-authored-by: Masahiro Tanaka <81312776+tohtana@users.noreply.github.com>
This PR fixes the behavior of DeepCompile's ZeRO stage 1 and adds stage
2 support.
DeepCompile's ZeRO1 currently performs allreduce at every iteration even
when it is not a gradient accumulation boundary. This significantly
slows down the performance when gradient accumulation is enabled. This
PR fixes this issue by performing allreduce only at the gradient
accumulation boundary.
As the current behavior is similar to ZeRO2, this PR also adds
DeepCompile's ZeRO2 support. We can now set zero stage to 2 with
DeepCompile.
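With this change, a configuration along these lines is expected to work (a hedged sketch; the `compile.deepcompile` flag is assumed from the DeepCompile docs, so consult them for the exact settings):
```json
{
  "zero_optimization": { "stage": 2 },
  "gradient_accumulation_steps": 4,
  "bf16": { "enabled": true },
  "compile": { "deepcompile": true }
}
```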
The loss values, performance, and memory usage were verified using this
[verification tool](https://github.com/tohtana/ds_verify_loss)
([results](https://github.com/tohtana/ds_verify_loss/blob/main/results/results_20250617_035117/report.md)).
---------
Signed-off-by: Masahiro Tanaka <mtanaka@microsoft.com>
Co-authored-by: Stas Bekman <stas00@users.noreply.github.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
```
......
  File "torch/_dynamo/backends/common.py", line 72, in _wrapped_bw_compiler
    return disable(disable(bw_compiler_fn)(*args, **kwargs))
  File "torch/_dynamo/eval_frame.py", line 838, in _fn
    return fn(*args, **kwargs)
  File "deepspeed/compile/inductor.py", line 27, in wrapped_compiler
    mod_graph = dc_compiler(gm, fake_inputs)
  File "deepspeed/compile/backend.py", line 330, in make_bw_graph
    run_opt_passes(
  File "deepspeed/compile/backend.py", line 206, in run_opt_passes
    mem_prof.run(*create_inputs_fn())
  File "deepspeed/compile/profilers/graph_profile.py", line 261, in run
    return return_val
UnboundLocalError: local variable 'return_val' referenced before assignment
```
Signed-off-by: Hollow Man <hollowman@opensuse.org>
Co-authored-by: Masahiro Tanaka <81312776+tohtana@users.noreply.github.com>
CUDA Toolkit 12.9 has been out for a while. The build currently fails when it
is installed, as the builder checks against hardcoded values.
This PR adds the value 12.9. A better mechanism would be to check dynamically
that the major version number is the same... maybe next time when CUDA 13
comes out :)
Signed-off-by: LosCrossos <165311345+loscrossos@users.noreply.github.com>
### Description
This PR fixes an `AttributeError: 'UnembedParameter' object has no
attribute 'dtype'` that occurs in the Inference V2 engine. The issue is
triggered when using a high-level interface like
[DeepSpeed-MII](https://github.com/deepspeedai/DeepSpeed-MII) to run
inference on models with tied input/output embeddings, such as Llama 2.
**Resolves: #7260**
### Root Cause Analysis
The root cause is that while the `ParameterBase` metaclass correctly
creates property setters for parameter tensors, the setter function
(`param_setter`) only assigns the tensor value itself. It does not
propagate the tensor's `dtype` to the container instance.
Downstream functions, such as `flatten_inference_model`, expect every
parameter container to have a `.dtype` attribute. When they encounter a
custom container like `UnembedParameter` that lacks this attribute, an
`AttributeError` is raised.
### The Fix
The solution is to modify the `param_setter` function within
`make_param_setter` located in
`deepspeed/inference/v2/model_implementations/parameter_base.py`.
I have added the line `self.dtype = value.dtype` immediately after the
parameter tensor is assigned. This simple change ensures that any object
inheriting from `ParameterBase` will now correctly expose the `dtype` of
the tensor it wraps, resolving the error.
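A simplified sketch of the described change (illustrative; the real `make_param_setter` in `parameter_base.py` does more bookkeeping):
```python
def make_param_setter(name):
    """Simplified stand-in for the setter created by the ParameterBase metaclass."""

    def param_setter(self, value):
        setattr(self, f"_{name}", value)   # store the tensor (simplified)
        # The fix: propagate the tensor's dtype to the container so that
        # downstream code such as flatten_inference_model can read `.dtype`.
        self.dtype = value.dtype

    return param_setter
```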
### Verification
This fix has been thoroughly verified in a containerized GPU environment
(RunPod with PyTorch 2.1). The verification process involved:
1. Cloning both the `deepspeed` and `DeepSpeed-MII` repositories from
source.
2. Installing the modified `deepspeed` library from this branch.
3. Installing the `DeepSpeed-MII` library (with a packaging fix) to
trigger the bug.
4. Running an end-to-end inference script with `mii.pipeline` and a
standard language model.
The logs confirm that with this fix, the program successfully executes
past the original point of failure. The `AttributeError` is completely
resolved, and the DeepSpeed engine proceeds correctly to the model
loading phase.
*(Note: A full end-to-end run in the test environment was ultimately
blocked by a separate, pre-existing build issue in DeepSpeed's op
builder (`ModuleNotFoundError: dskernels`), which is unrelated to this
logic fix. The successful progression past the original error point
serves as definitive proof of this fix's effectiveness.)*
### Related Context
This bug is primarily triggered via the
[**DeepSpeed-MII**](https://github.com/deepspeedai/DeepSpeed-MII)
project. A companion PR,
**[deepspeedai/DeepSpeed-MII#567](https://github.com/deepspeedai/DeepSpeed-MII/pull/567)**,
has been submitted to fix a packaging issue in that repository that was
a prerequisite for this verification.
output:
<img width="1014" alt="Screenshot 2025-06-22 at 14 16 15"
src="https://github.com/user-attachments/assets/1a658f98-a98b-4584-ae11-59e9edfd0b7e"
/>
<img width="1012" alt="Screenshot 2025-06-22 at 14 16 26"
src="https://github.com/user-attachments/assets/3959d0e5-d6dc-4ed4-adbc-6919e00da172"
/>
<img width="1728" alt="Screenshot 2025-06-22 at 14 17 40"
src="https://github.com/user-attachments/assets/537fd354-b840-4af2-98ab-d243c6902412"
/>
Signed-off-by: Vensenmu <vensenmu@gmail.com>
Co-authored-by: Masahiro Tanaka <81312776+tohtana@users.noreply.github.com>
`TestParamPartitioningSkipInit` throws the following error.
```
====================================== short test summary info ======================================
FAILED test_zero.py::TestParamPartitioningSkipInit::test[dtype1] - RuntimeError: mat1 and mat2 must have the same dtype, but got Half and BFloat16
========= 1 failed, 204 passed, 66 skipped, 15 deselected, 5 warnings in 2305.03s (0:38:25) =========
```
The test always sets the model's dtype to `torch.bfloat16` and ignores
the test parameter `dtype` when bfloat16 is supported. This causes a
dtype mismatch when `dtype=torch.float16` is given as the test parameter
because the data loader respects the test parameter dtype.
---------
Signed-off-by: Masahiro Tanaka <mtanaka@microsoft.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
#6993 broke many paths in the ZeRO1/2 optimizer. This PR fixes most of the
issues that PR caused. Currently we still have one failing test in
`unit/runtime/zero`:
```
====================================== short test summary info ======================================
FAILED test_zero.py::TestParamPartitioningSkipInit::test[dtype1] - RuntimeError: mat1 and mat2 must have the same dtype, but got Half and BFloat16
========= 1 failed, 204 passed, 66 skipped, 15 deselected, 5 warnings in 2305.03s (0:38:25) =========
```
---------
Signed-off-by: Masahiro Tanaka <mtanaka@microsoft.com>
Relaxing the tolerance values to enable the unit test below with the FP16
data type on ROCm:
`unit/runtime/half_precision/test_fp8.py::TestFp8ComposabilityAcrossZero::test[fp16]`
```
# Relax tolerance only for ROCm + FP16
if is_rocm_pytorch() and model_dtype == torch.float16:
    rtol, atol = 3e-07, 3e-05
```
cc: @jithunnair-amd
DeepSpeed supports mixed precision training, but the behavior is
different from `torch.autocast`. DeepSpeed maintains parameters and
gradients both in FP32 and a lower precision (FP16/BF16) (NVIDIA Apex
AMP style) and computes all modules in the lower precision while
`torch.autocast` maintains parameters in FP32 but computes only certain
operators in the lower precision.
This leads to differences in:
- performance: `torch.autocast` needs downcast in forward/backward
- memory usage: DeepSpeed needs more memory to keep copies of parameters
and gradients in lower precision
- accuracy: `torch.autocast` has a list of modules that can safely be
computed in lower precision. Some precision-sensitive operators (e.g.
softmax) are computed in FP32.
To align DeepSpeed's behavior with `torch.autocast` when necessary, this PR
adds an integration of `torch.autocast` with ZeRO. Here is an example of the
configuration.
```json
"torch_autocast": {
"enabled": true,
"dtype": "bfloat16",
"lower_precision_safe_modules": ["torch.nn.Linear", "torch.nn.Conv2d"]
}
```
Each configuration works as follows:
- `enabled`: Enable the integration with `torch.autocast` if this is set
to `True`. You don't need to call `torch.autocast` in your code. The
grad scaler is also applied in the DeepSpeed optimizer.
- `dtype`: lower precision dtype passed to `torch.autocast`. Gradients
for allreduce (reduce-scatter) and parameters for allgather (only for
ZeRO3) of `lower_precision_safe_modules` are also downcasted to this
dtype.
- `lower_precision_safe_modules`: Downcasting for allreduce (reduce-scatter)
and allgather (ZeRO3) is applied only to modules specified in this list. (The
precision for PyTorch operators in forward/backward follows
`torch.autocast`'s policy, not this list.) You can set class names with their
packages. If you don't set this item, DeepSpeed uses the default list:
`[torch.nn.Linear, torch.nn.Conv1d, torch.nn.Conv2d, torch.nn.Conv3d]`.
Note that only FP32 parameters are maintained when this feature is enabled.
For consistency, you cannot also enable `fp16` or `bf16` in the DeepSpeed config.
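A minimal usage sketch (assuming a single-GPU setup; the `torch_autocast` section is taken from the example above, the rest of the config is illustrative):
```python
import torch
import deepspeed

model = torch.nn.Sequential(torch.nn.Linear(8, 8), torch.nn.Linear(8, 2))

ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "optimizer": {"type": "AdamW", "params": {"lr": 1e-4}},
    "zero_optimization": {"stage": 3},
    # do not also enable "fp16"/"bf16" here; only FP32 params are maintained
    "torch_autocast": {
        "enabled": True,
        "dtype": "bfloat16",
        "lower_precision_safe_modules": ["torch.nn.Linear"]
    }
}

engine, _, _, _ = deepspeed.initialize(model=model,
                                       model_parameters=model.parameters(),
                                       config=ds_config)

# No torch.autocast context manager is needed; grad scaling is handled internally.
x = torch.randn(1, 8, device=engine.device)
loss = engine(x).float().mean()
engine.backward(loss)
engine.step()
```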
---------
Signed-off-by: Masahiro Tanaka <mtanaka@microsoft.com>
Signed-off-by: Fabien Dupont <fdupont@redhat.com>
Signed-off-by: Olatunji Ruwase <olruwase@microsoft.com>
Signed-off-by: Logan Adams <loadams@microsoft.com>
Signed-off-by: inkcherry <mingzhi.liu@intel.com>
Signed-off-by: Omar Elayan <oelayan@habana.ai>
Signed-off-by: Roman Fitzjalen <romaactor@gmail.com>
Signed-off-by: Hongwei <hongweichen@microsoft.com>
Signed-off-by: shaomin <wukon1992@gmail.com>
Signed-off-by: Stas Bekman <stas@stason.org>
Signed-off-by: siqi <siqi@tecorigin.com>
Signed-off-by: Wei Wu <wuwei211x@gmail.com>
Signed-off-by: ShellyNR <shelly.nahir@live.biu.ac.il>
Signed-off-by: Lai, Yejing <yejing.lai@intel.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Fabien Dupont <fabiendupont@fabiendupont.fr>
Co-authored-by: Liangliang Ma <1906710196@qq.com>
Co-authored-by: inkcherry <mingzhi.liu@intel.com>
Co-authored-by: Omar Elayan <142979319+oelayan7@users.noreply.github.com>
Co-authored-by: Stas Bekman <stas00@users.noreply.github.com>
Co-authored-by: Roman Fitzjalen <romaactor@gmail.com>
Co-authored-by: Ramya Ramineni <62723901+rraminen@users.noreply.github.com>
Co-authored-by: Guanhua Wang <alexwgh333@gmail.com>
Co-authored-by: root <root@ftqtmec25000000.taxzvufipdhelhupulxcbvr15f.ux.internal.cloudapp.net>
Co-authored-by: Hongwei Chen <33092912+hwchen2017@users.noreply.github.com>
Co-authored-by: Joe Mayer <114769929+jomayeri@users.noreply.github.com>
Co-authored-by: wukong1992 <wukong1992@users.noreply.github.com>
Co-authored-by: shaomin <wukon1992@gmail.com>
Co-authored-by: loadams <loadams@users.noreply.github.com>
Co-authored-by: siqi654321 <siqi202311@163.com>
Co-authored-by: siqi <siqi@tecorigin.com>
Co-authored-by: Wei Wu <45323446+U-rara@users.noreply.github.com>
Co-authored-by: Shelly Nahir <73890534+ShellyNR@users.noreply.github.com>
Co-authored-by: snahir <snahir@habana.ai>
Co-authored-by: Yejing-Lai <yejing.lai@intel.com>
Co-authored-by: Siddharth Singh <siddharth9820@gmail.com>
Co-authored-by: Olatunji Ruwase <tjruwase@gmail.com>
The newly released NCCL finally started to use fp32 accumulation for
reduction ops!
* Floating point summation is always done in fp32 accumulators (with the
exception of fp8 on NVLS, where it uses fp16 inside the switch). Thus,
the accuracy with fp8 and fp16 data types should be much improved.
72d2432094
So we should change the fp32 comms default for SP to the same dtype as the
inputs if `nccl>=2.27.3`; the user can still override the default.
---------
Signed-off-by: Stas Bekman <stas@stason.org>
Co-authored-by: Olatunji Ruwase <tjruwase@gmail.com>
This PR fixes issue #7303.
### 1. Description of the Bug
Currently, when using the `WarmupLR` scheduler, if `warmup_max_lr` is
not explicitly set in the scheduler's parameters, it incorrectly falls
back to its internal default value (`0.001`), ignoring the learning rate
set in the optimizer's parameters. This can lead to unexpected training
behavior and diverges from user expectations.
### 2. Description of the Fix
This fix modifies the `__init__` method of the `WarmupLR` scheduler in
`deepspeed/runtime/lr_schedules.py`.
- The default value for the `warmup_max_lr` argument in the function
signature is changed from `0.001` to `None`.
- Logic is added to check if `warmup_max_lr` is `None` upon
initialization. If it is, the scheduler now correctly inherits the
learning rate from the optimizer's parameter groups.
This change ensures that the optimizer's learning rate is respected as
the default `warmup_max_lr`, aligning the scheduler's behavior with the
user's configuration intent.
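A sketch of the described change (an illustrative stand-in with a trimmed argument list, not the full `WarmupLR` implementation):
```python
class WarmupLRSketch:
    """Illustrative stand-in for deepspeed.runtime.lr_schedules.WarmupLR
    (argument list trimmed)."""

    def __init__(self, optimizer, warmup_min_lr=0.0, warmup_max_lr=None,
                 warmup_num_steps=1000):
        self.optimizer = optimizer
        # Fix: default warmup_max_lr to None and, when unset, inherit the
        # learning rate(s) already configured on the optimizer instead of
        # silently falling back to a hardcoded 0.001.
        if warmup_max_lr is None:
            warmup_max_lr = [group["lr"] for group in optimizer.param_groups]
        elif not isinstance(warmup_max_lr, (list, tuple)):
            warmup_max_lr = [warmup_max_lr] * len(optimizer.param_groups)
        self.warmup_max_lrs = list(warmup_max_lr)
        self.warmup_min_lr = warmup_min_lr
        self.warmup_num_steps = warmup_num_steps
```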
### 3. Verification
The fix has been verified using a minimal reproduction script that
clearly demonstrates the behavioral change.
**Before Fix:**
Without `warmup_max_lr` in the scheduler config, the learning rate
incorrectly defaults to `0.001`.
<img width="1711" alt="Screenshot 2025-06-16 at 18 34 31"
src="https://github.com/user-attachments/assets/fe68f39e-2bbc-4f94-b322-546d9ce43bb0"
/>
**After Workaround (Demonstrating the Mechanism):**
By explicitly adding `warmup_max_lr` to the scheduler config, the
learning rate behaves as expected. My code change makes this the default
behavior.
<img width="1195" alt="Screenshot 2025-06-16 at 20 17 11"
src="https://github.com/user-attachments/assets/cc170246-fdac-4a56-8b9c-f204ebb47895"
/>
Signed-off-by: Vensenmu <vensenmu@gmail.com>
Co-authored-by: Olatunji Ruwase <tjruwase@gmail.com>
This PR keeps some of the real inputs given to the custom backend for
DeepCompile.
DeepCompile expects that the custom backend at TorchFX graph level is
always called when recompilation happens. In some cases, however, only
the Aten-level backend is called. As the Aten-level backend uses the real
inputs saved by the TorchFX-level backend, we need to keep the real inputs
for recompilation.
Currently we discard the real inputs after the Aten-level backend uses them,
as the real inputs are often too large to keep in GPU memory. This causes an
error in cases where recompilation only calls Aten-level backends, because we
don't get a chance to record new real inputs in the TorchFX-level backend.
This PR now always keeps only tensor metadata and non-tensor data on CPU and
materializes the tensors when needed (i.e. when recompilation happens and
only Aten-level backends are called without real inputs). As we use dummy
data to materialize the tensors, this solution might still not work in every
case, but it improves the coverage.
The new module `InputStorage` keeps the tensor metadata and non-tensor data
for this purpose and materializes the tensors.
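A sketch of the idea behind `InputStorage` (illustrative only; the real module lives in the DeepCompile sources and handles more cases):
```python
import torch

class InputStorageSketch:
    """Keep only metadata for tensors (and small non-tensor args), then
    materialize dummy tensors when recompilation needs real inputs."""

    def __init__(self, args):
        self._stored = []
        for a in args:
            if isinstance(a, torch.Tensor):
                self._stored.append(("tensor", a.shape, a.dtype, a.device))
            else:
                self._stored.append(("value", a))

    def materialize(self):
        out = []
        for item in self._stored:
            if item[0] == "tensor":
                _, shape, dtype, device = item
                out.append(torch.zeros(shape, dtype=dtype, device=device))
            else:
                out.append(item[1])
        return out
```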
---------
Signed-off-by: Masahiro Tanaka <mtanaka@microsoft.com>
This PR improves `pad_tensors` in `deepspeed/compile/util.py`, which
pads tensors so that all ranks have tensors with the same shape.
Previously, this function only adjusted tensor shapes, but tensor strides
could still differ across ranks, leading to recompilation on only some ranks.
As DeepCompile inserts communication operators into the graph, the
communication collective then easily gets stuck.
To address this issue, this PR replaces the use of
`torch.nn.functional.pad` with a new approach that ensures consistent
strides and avoids communication issues during distributed operations.
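A sketch of a stride-consistent padding approach (illustrative; the actual `pad_tensors` implementation may differ):
```python
import torch

def pad_to_shape(t, target_shape):
    """Allocate a fresh contiguous tensor of the target shape and copy the
    original into its leading slice. Every rank then ends up with identical
    shapes *and* strides, whereas with F.pad a rank that needs no padding can
    keep its original tensor (and strides), causing rank-dependent behavior."""
    padded = t.new_zeros(target_shape)
    padded[tuple(slice(0, s) for s in t.shape)] = t
    return padded
```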
Signed-off-by: Masahiro Tanaka <mtanaka@microsoft.com>
The project has been renamed at the last moment, so this PR is adapting
to that change.
There are no code changes in this PR, just docs.
---------
Signed-off-by: Stas Bekman <stas@stason.org>