## Fix `UnboundLocalError` in `ZeroLinear.backward()` when training only
bias parameters, as mentioned in #7435
This PR addresses an issue in the `ZeroLinear.backward()` method, where
the local variable `dim` could be referenced before assignment. This
happens specifically when:
- Only the bias parameters are set to `requires_grad=True`, and
- The training setup uses **ZeRO Stage 3**, **AMP**, and **gradient
checkpointing**.
### Problem
When only the bias requires gradients, the condition that sets `dim =
grad_output.dim()` is skipped, but the value of `dim` is still used
later in the computation, leading to an `UnboundLocalError`.
### Fix
Move the assignment `dim = grad_output.dim()` to occur unconditionally,
so that `dim` is always defined before being used in any branch of the
gradient computation logic.
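A minimal sketch of the failure mode and the fix, using a hypothetical autograd function (not the actual `ZeroLinear` implementation):
```python
import torch

class _LinearFn(torch.autograd.Function):
    """Hypothetical stand-in illustrating the fix, not the real ZeroLinear."""

    @staticmethod
    def forward(ctx, inp, weight, bias):
        ctx.save_for_backward(inp, weight, bias)
        return inp @ weight.t() + bias

    @staticmethod
    def backward(ctx, grad_output):
        inp, weight, bias = ctx.saved_tensors
        # The fix: assign unconditionally so every branch below can use it.
        dim = grad_output.dim()
        grad_input = grad_weight = grad_bias = None
        if ctx.needs_input_grad[0]:
            grad_input = grad_output @ weight
        if ctx.needs_input_grad[1]:
            grad_weight = grad_output.reshape(-1, grad_output.shape[-1]).t() \
                @ inp.reshape(-1, inp.shape[-1])
        if ctx.needs_input_grad[2]:
            # Before the fix, `dim` was only set inside the branches above, so
            # training only the bias raised UnboundLocalError here.
            grad_bias = grad_output.sum(dim=list(range(dim - 1)))
        return grad_input, grad_weight, grad_bias
```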
### Impact
This makes the backward pass more robust across different training
setups.
Signed-off-by: weeknan <zhounan0431@163.com>
Co-authored-by: Olatunji Ruwase <tjruwase@gmail.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
This PR fixes an `AttributeError` that occurs during
`deepspeed.init_inference` when using kernel injection
(`replace_with_kernel_inject=True`) with Llama models from recent
versions of `transformers`.
**The Bug:**
In newer `transformers` versions (e.g., `4.53.3`), configurations like
`num_heads` and `rope_theta` were moved from direct attributes of the
`LlamaAttention` module into a nested `config` object.
The current DeepSpeed injection policy tries to access these attributes
from their old, direct location, causing the initialization to fail with
an `AttributeError: 'LlamaAttention' object has no attribute
'num_heads'`.
**The Solution:**
This change updates the Llama injection logic to be more robust:
1. It first tries to read attributes like `num_heads` from the new
`config` object location.
2. If that fails, it falls back to the legacy direct attribute path.
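Roughly, the fallback looks like this (an illustrative helper; the real injection code and the config-side attribute names, e.g. `num_attention_heads` vs. `num_heads`, may differ):
```python
def _get_attn_attr(attention_module, name, legacy_name=None, default=None):
    """Read an attention attribute from the nested `config` first (newer
    transformers), then fall back to the legacy direct attribute."""
    config = getattr(attention_module, "config", None)
    if config is not None and hasattr(config, name):
        return getattr(config, name)
    return getattr(attention_module, legacy_name or name, default)

# e.g. num_heads = _get_attn_attr(attn, "num_attention_heads", legacy_name="num_heads")
```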
---------
Signed-off-by: huanyuqu <yc37960@um.edu.mo>
Improved TiledMLP and SequenceTiledCompute for bs>1
This PR:
- extends the testing utils to add `CaptureStd*`, `CaptureLogger`
context managers
- extends the test to run both bs=1 and bs=2
- uses an uneven seqlen to test varlen shards
- flattens the bs and seqlen dims to avoid problems with grad tensor strides
when bs>1: the MLP doesn't care about the bs dimension, so a pretend
`bs*seqlen` seqlen is used instead and the shape is restored at the end for
the grad (see the sketch below)
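A minimal sketch of the flatten-and-restore idea, with illustrative shapes and names (not the actual TiledMLP code):
```python
import torch

def tiled_mlp_like(x, mlp, num_shards=4):
    """Run a position-wise `mlp` over sequence shards; bs and seqlen are fused
    into one pretend seqlen and the shape is restored at the end."""
    bs, seqlen, hidden = x.shape
    flat = x.reshape(1, bs * seqlen, hidden)        # fuse bs + seqlen
    shards = torch.chunk(flat, num_shards, dim=1)   # shards may be uneven
    out = torch.cat([mlp(s) for s in shards], dim=1)
    return out.reshape(bs, seqlen, -1)              # restore for the grad
```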
---------
Signed-off-by: Stas Bekman <stas@stason.org>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
This PR overcomes the following warning when using any `torch.distributed`
calls with deepspeed:
```
[W404 00:15:21.693690333 ProcessGroupNCCL.cpp:4561] [PG ID 0 PG GUID 0 Rank 0] using GPU 0
to perform barrier as devices used by this process are currently unknown. This can
potentially cause a hang if this rank to GPU mapping is incorrect. Specify device_ids in
barrier() to force use of a particular device, or call init_process_group() with a device_id.
```
by setting `device_id` to the correct device corresponding to
`LOCAL_RANK` env var.
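The change amounts to something along these lines (a sketch assuming a CUDA/NCCL setup and a launcher that sets `LOCAL_RANK`):
```python
import os
import torch
import torch.distributed as dist

local_rank = int(os.environ.get("LOCAL_RANK", 0))
device = torch.device("cuda", local_rank)

# Passing device_id lets NCCL bind the rank to its GPU eagerly, which
# silences the "devices used by this process are currently unknown" warning.
dist.init_process_group(backend="nccl", device_id=device)
```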
-------------------
Update: discovered `torch.dist` deadlocks with `torch>=2.7.0` when using the
`device_id` arg; switching to draft for now as we can't commit this until we
know how to work around it.
---------
Signed-off-by: Stas Bekman <stas@stason.org>
Signed-off-by: Stas Bekman <stas.bekman@snowflake.com>
Co-authored-by: Olatunji Ruwase <tunji.ruwase@snowflake.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Stas Bekman <stas.bekman@snowflake.com>
The KV cache can be passed via `layer_past` or `past_key_value`
arguments. Previously, `past_key_value` was ignored, causing workload
incompatibilities.
This PR fixes the issue while preserving the original logic.
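Conceptually, the handling becomes something like this (an illustrative helper; the exact precedence in the kernel-injection code may differ):
```python
def _resolve_kv_cache(layer_past=None, past_key_value=None):
    # Accept the KV cache from either argument; previously only layer_past
    # was honored and past_key_value was silently dropped.
    return layer_past if layer_past is not None else past_key_value
```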
---------
Signed-off-by: Max Kovalenko <mkovalenko@habana.ai>
Trying to use `DeepSpeed/deepspeed/checkpoint/ds_to_universal.py`, I
encountered:
```python
Traceback (most recent call last):
File "/opt/aurora/24.347.0/frameworks/aurora_nre_models_frameworks-2025.0.0/lib/python3.10/concurrent/futures/process.py", line 246, in _process_worker
r = call_item.fn(*call_item.args, **call_item.kwargs)
File "/lus/flare/projects/AuroraGPT/CPT-AuroraGPT-v0/foremans/projects/argonne-lcf/Megatron-DeepSpeed/deps/DeepSpeed/deepspeed/checkpoint/ds_to_universal.py", line 114, in extract_zero_shards
sd = ds_checkpoint.get_zero_checkpoint_state(pp_index=pp_index, tp_index=tp_index, dp_index=dp_index)
File "/lus/flare/projects/AuroraGPT/CPT-AuroraGPT-v0/foremans/projects/argonne-lcf/Megatron-DeepSpeed/venvs/aurora/aurora_nre_models_frameworks-2025.0.0/lib/python3.10/site-packages/deepspeed/checkpoint/deepspeed_checkpoint.py", line 124, in get_zero_checkpoint_state
return self.zero_checkpoint.get_state_for_rank(pp_index=pp_index,
File "/lus/flare/projects/AuroraGPT/CPT-AuroraGPT-v0/foremans/projects/argonne-lcf/Megatron-DeepSpeed/venvs/aurora/aurora_nre_models_frameworks-2025.0.0/lib/python3.10/site-packages/deepspeed/checkpoint/zero_checkpoint.py", line 62, in get_state_for_rank
self._strip_tensor_paddings(sd)
File "/lus/flare/projects/AuroraGPT/CPT-AuroraGPT-v0/foremans/projects/argonne-lcf/Megatron-DeepSpeed/venvs/aurora/aurora_nre_models_frameworks-2025.0.0/lib/python3.10/site-packages/deepspeed/checkpoint/zero_checkpoint.py", line 110, in _strip_tensor_paddings
group_state[state_name] = torch.narrow(state_value, 0, 0, raw_length).clone()
RuntimeError: narrow(): length must be non-negative.
```
(see full traceback[^traceback] below)
The issue is that there is no way to propagate a `strip_tensor_paddings`
argument through the
[`DeepSpeedCheckpoint.get_zero_checkpoint_state(...)`](affee605e4/deepspeed/checkpoint/deepspeed_checkpoint.py (L123))
method to the
[`ZeroCheckpoint.get_state_for_rank(...)`](affee605e4/deepspeed/checkpoint/zero_checkpoint.py (L53))
method: the latter accepts it as an argument, but the former does not.
This PR adds a `strip_tensor_paddings` argument (default `True`) to the
`DeepSpeedCheckpoint.get_zero_checkpoint_state` method and passes it through
to `self.zero_checkpoint.get_state_for_rank(...,
strip_tensor_paddings=strip_tensor_paddings)`, as shown below:
```diff
- def get_zero_checkpoint_state(self, pp_index, tp_index, dp_index) -> dict:
+ def get_zero_checkpoint_state(self, pp_index, tp_index, dp_index, strip_tensor_paddings: bool = True) -> dict:
return self.zero_checkpoint.get_state_for_rank(pp_index=pp_index,
tp_index=tp_index,
dp_index=dp_index,
- keys_to_ignore=[PARAM_SHAPES])
+ keys_to_ignore=[PARAM_SHAPES],
+ strip_tensor_paddings=strip_tensor_paddings)
```
[^traceback]: Full traceback:
<details closed><summary>[Full Traceback]:</summary>
```bash
#[🐍 aurora_nre_models_frameworks-2025.0.0](👻
aurora_nre_models_frameworks-2025.0.0)
#[/f/A/C/f/p/a/Megatron-DeepSpeed][🌱 saforem2/fix-formatting][✓]
#[07/12/25 @ 16:07:12][x4209c2s4b0n0]
;
ckpt_dir=checkpoints/ws768_ds_stage1_nl32_hs4096_mb1_seq4096_gb3072_sp1_pp1_tp1_bf16_optadamw_lr_lwf_flash
; gs=$(cat "${ckpt_dir}/latest_checkpointed_iteration.txt") && echo
"global step: ${gs}" && python3
deps/DeepSpeed/deepspeed/checkpoint/ds_to_universal.py
--input_folder"${ckpt_dir}/global_step${gs}" --output_folder
"${ckpt_dir}/global_step${gs}_universal" --keep_temp_folder
global step: 158945
[W712 16:07:17.966425018 OperatorEntry.cpp:155] Warning: Warning only
once for all operators, other operators may also be overridden.
Overriding a previously registered kernel for the same operator and the
same dispatch key
operator: aten::_cummax_helper(Tensor self, Tensor(a!) values,
Tensor(b!) indices, int dim) -> ()
registered at /build/pytorch/build/aten/src/ATen/RegisterSchema.cpp:6
dispatch key: XPU
previous kernel: registered at
/build/pytorch/build/aten/src/ATen/RegisterCPU.cpp:30476
new kernel: registered at
/build/intel-pytorch-extension/build/Release/csrc/gpu/csrc/aten/generated/ATen/RegisterXPU.cpp:2971
(function operator())
/opt/aurora/24.347.0/frameworks/aurora_nre_models_frameworks-2025.0.0/lib/python3.10/site-packages/intel_extension_for_pytorch/nn/utils/_weight_prepack.py:6:
UserWarning: pkg_resources is deprecated as an API. See
https://setuptools.pypa.io/en/latest/pkg_resources.html. The
pkg_resources package is slated for removal as early as 2025-11-30.
Refrain from using this package or pin to Setuptools<81.
import pkg_resources
AttributeError: 'MessageFactory' object has no attribute 'GetPrototype'
AttributeError: 'MessageFactory' object has no attribute 'GetPrototype'
AttributeError: 'MessageFactory' object has no attribute 'GetPrototype'
AttributeError: 'MessageFactory' object has no attribute 'GetPrototype'
AttributeError: 'MessageFactory' object has no attribute 'GetPrototype'
[2025-07-12 16:07:27,740] [INFO]
[real_accelerator.py:254:get_accelerator] Setting ds_accelerator to xpu
(auto detect)
[2025-07-12 16:07:29,078] [INFO] [logging.py:107:log_dist] [Rank -1]
[TorchCheckpointEngine] Initialized with serialization = False
args =
Namespace(input_folder='checkpoints/ws768_ds_stage1_nl32_hs4096_mb1_seq4096_gb3072_sp1_pp1_tp1_bf16_optadamw_lr_lwf_flash/global_step158945',
output_folder='checkpoints/ws768_ds_stage1_nl32_hs4096_mb1_seq4096_gb3072_sp1_pp1_tp1_bf16_optadamw_lr_lwf_flash/global_step158945_universal',
num_extract_workers=4, num_merge_workers=2, keep_temp_folder=True,
strict=True, inject_missing_state=False)
Convert DeepSpeed Checkpoint to Universal Checkpoint
Converting DeepSpeed checkpoint in
checkpoints/ws768_ds_stage1_nl32_hs4096_mb1_seq4096_gb3072_sp1_pp1_tp1_bf16_optadamw_lr_lwf_flash/global_step158945
to Universal checkpoint in
checkpoints/ws768_ds_stage1_nl32_hs4096_mb1_seq4096_gb3072_sp1_pp1_tp1_bf16_optadamw_lr_lwf_flash/global_step158945_universal
/lus/flare/projects/AuroraGPT/CPT-AuroraGPT-v0/foremans/projects/argonne-lcf/Megatron-DeepSpeed/megatron/core/tensor_parallel/layers.py:290:
FutureWarning: `torch.cuda.amp.custom_fwd(args...)` is deprecated.
Please use `torch.amp.custom_fwd(args..., device_type='cuda')` instead.
def forward(
/lus/flare/projects/AuroraGPT/CPT-AuroraGPT-v0/foremans/projects/argonne-lcf/Megatron-DeepSpeed/megatron/core/tensor_parallel/layers.py:334:
FutureWarning: `torch.cuda.amp.custom_bwd(args...)` is deprecated.
Please use `torch.amp.custom_bwd(args..., device_type='cuda')` instead.
def backward(ctx, grad_output):
[2025-07-12 16:07:39,134079][I][ezpz/__init__:264:ezpz] Setting logging
level to 'INFO' on 'RANK == 0'
[2025-07-12 16:07:39,136376][I][ezpz/__init__:265:ezpz] Setting logging
level to 'CRITICAL' on all others 'RANK != 0'
*** 1. Extracting ZeRO fragments
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▋|
767/768 [01:29<00:00, 8.53it/s]
concurrent.futures.process._RemoteTraceback:
"""
Traceback (most recent call last):
File
"/opt/aurora/24.347.0/frameworks/aurora_nre_models_frameworks-2025.0.0/lib/python3.10/concurrent/futures/process.py",
line 246, in _process_worker
r = call_item.fn(*call_item.args, **call_item.kwargs)
File
"/lus/flare/projects/AuroraGPT/CPT-AuroraGPT-v0/foremans/projects/argonne-lcf/Megatron-DeepSpeed/deps/DeepSpeed/deepspeed/checkpoint/ds_to_universal.py",
line 114, in extract_zero_shards
sd = ds_checkpoint.get_zero_checkpoint_state(pp_index=pp_index,
tp_index=tp_index, dp_index=dp_index)
File
"/lus/flare/projects/AuroraGPT/CPT-AuroraGPT-v0/foremans/projects/argonne-lcf/Megatron-DeepSpeed/venvs/aurora/aurora_nre_models_frameworks-2025.0.0/lib/python3.10/site-packages/deepspeed/checkpoint/deepspeed_checkpoint.py",
line 124, in get_zero_checkpoint_state
return self.zero_checkpoint.get_state_for_rank(pp_index=pp_index,
File
"/lus/flare/projects/AuroraGPT/CPT-AuroraGPT-v0/foremans/projects/argonne-lcf/Megatron-DeepSpeed/venvs/aurora/aurora_nre_models_frameworks-2025.0.0/lib/python3.10/site-packages/deepspeed/checkpoint/zero_checkpoint.py",
line 62, in get_state_for_rank
self._strip_tensor_paddings(sd)
File
"/lus/flare/projects/AuroraGPT/CPT-AuroraGPT-v0/foremans/projects/argonne-lcf/Megatron-DeepSpeed/venvs/aurora/aurora_nre_models_frameworks-2025.0.0/lib/python3.10/site-packages/deepspeed/checkpoint/zero_checkpoint.py",
line 110, in _strip_tensor_paddings
group_state[state_name] = torch.narrow(state_value, 0, 0,
raw_length).clone()
RuntimeError: narrow(): length must be non-negative.
"""
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File
"/lus/flare/projects/AuroraGPT/CPT-AuroraGPT-v0/foremans/projects/argonne-lcf/Megatron-DeepSpeed/deps/DeepSpeed/deepspeed/checkpoint/ds_to_universal.py",
line 549, in <module>
main(args)
File
"/lus/flare/projects/AuroraGPT/CPT-AuroraGPT-v0/foremans/projects/argonne-lcf/Megatron-DeepSpeed/deps/DeepSpeed/deepspeed/checkpoint/ds_to_universal.py",
line 499, in main
_extract_zero_shard_files(args, ds_checkpoint, temp_dir)
File
"/lus/flare/projects/AuroraGPT/CPT-AuroraGPT-v0/foremans/projects/argonne-lcf/Megatron-DeepSpeed/deps/DeepSpeed/deepspeed/checkpoint/ds_to_universal.py",
line 370, in _extract_zero_shard_files
_do_parallel_work(do_work, _3d_range_list, args.num_extract_workers)
File
"/lus/flare/projects/AuroraGPT/CPT-AuroraGPT-v0/foremans/projects/argonne-lcf/Megatron-DeepSpeed/deps/DeepSpeed/deepspeed/checkpoint/ds_to_universal.py",
line 354, in _do_parallel_work
results.append(f.result())
File
"/opt/aurora/24.347.0/frameworks/aurora_nre_models_frameworks-2025.0.0/lib/python3.10/concurrent/futures/_base.py",
line 451, in result
return self.__get_result()
File
"/opt/aurora/24.347.0/frameworks/aurora_nre_models_frameworks-2025.0.0/lib/python3.10/concurrent/futures/_base.py",
line 403, in __get_result
raise self._exception
RuntimeError: narrow(): length must be non-negative.
[1] 144664 exit 1 python3
deps/DeepSpeed/deepspeed/checkpoint/ds_to_universal.py --input_folder
took: 0h:02m:08s
```
</details>
Signed-off-by: Sam Foreman <saforem2@gmail.com>
Closes #7415
# Description
Resets `bucket.elements` after reduction in ZeRO Stage 3.
Without this, the bucket grows indefinitely, reducing only one param at
a time.
Added `bucket.elements = 0` after `params_in_bucket.clear()`.
Dynamo currently breaks graphs because compilation is disabled for a number
of functions such as `iter_params` and `record_module`.
These functions compile successfully on at least PyTorch 2.7.0.
We enable compilation based on the user's PyTorch version using a new
`compiler.enable(min_version=None)` decorator, as sketched below.
This should avoid the corresponding graph breaks and improve performance.
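A sketch of what such a version-gated decorator can look like (illustrative; the real `compiler.enable` implementation may differ):
```python
import torch

def enable(min_version=None):
    """Illustrative sketch: leave a function traceable by Dynamo when the
    installed PyTorch is new enough, otherwise keep compilation disabled."""
    def decorator(fn):
        if min_version is None or torch.__version__ >= min_version:
            return fn                        # traceable: no graph break
        return torch._dynamo.disable(fn)     # older torch: previous behavior
    return decorator

@enable(min_version="2.7.0")
def iter_params(module):
    return list(module.parameters())
```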
---------
Signed-off-by: Max Kovalenko <mkovalenko@habana.ai>
Co-authored-by: Masahiro Tanaka <81312776+tohtana@users.noreply.github.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
It looks like my TiledMLP was working correctly only for batch_size=1;
fixing it to work with any bs.
Thanks to @winglian for detecting the problem and sending me an easy repro.
---------
Signed-off-by: Stas Bekman <stas@stason.org>
Since `destroy` accesses `self.optimizer`, but the error that triggers the
call to `destroy` can happen in `__init__` even before the optimizer and
scheduler are configured, we need to move the `self.optimizer` assignment to
the top to avoid triggering another exception.
e.g.:
```logs
File "deepspeed/runtime/engine.py", line 453, in _configure_tensor_parallel_states
assert self.zero_optimization_stage(
AssertionError: Currently, the compatibility between 'autotp' and 'zero_stage = 3' has not been validated
Exception ignored in: <function DeepSpeedEngine.__del__ at 0x1516c0610820>
Traceback (most recent call last):
File "deepspeed/runtime/engine.py", line 509, in __del__
self.destroy()
File "deepspeed/runtime/engine.py", line 512, in destroy
if self.optimizer is not None and hasattr(self.optimizer, 'destroy'):
File "deepspeed/runtime/engine.py", line 621, in __getattr__
raise AttributeError(f"'{type(self).__name__}' object has no attribute '{name}'")
AttributeError: 'DeepSpeedEngine' object has no attribute 'optimizer'
```
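The general pattern, as a hedged sketch (illustrative class, not the actual `DeepSpeedEngine` code):
```python
class EngineSketch:
    """Illustrative sketch of the initialization-order fix, not DeepSpeedEngine."""

    def __init__(self, zero_stage=1, autotp=False):
        # Assign early, so __del__/destroy can run even if validation below raises.
        self.optimizer = None
        self.lr_scheduler = None
        assert not (autotp and zero_stage == 3), \
            "compatibility between 'autotp' and 'zero_stage = 3' has not been validated"
        self.optimizer = "configured later, only if validation passes"

    def destroy(self):
        if self.optimizer is not None and hasattr(self.optimizer, "destroy"):
            self.optimizer.destroy()

    def __del__(self):
        self.destroy()
```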
Signed-off-by: Hollow Man <hollowman@opensuse.org>
This PR fixes an omission in the `deepspeed.comm` API where `GradBucket`
was not exposed, despite the package aiming for full compatibility with
`torch.distributed`.
## The Problem
As reported in issue #7393, when a user replaces `torch.distributed`
with `deepspeed.comm`, they expect all public APIs to be available.
However, attempting to access `deepspeed.comm.GradBucket` (for example,
when using it as a type hint for DDP communication hooks) results in an
`AttributeError`.
## The Solution
This change resolves the issue by importing `GradBucket` directly from
`torch.distributed` into the `deepspeed/comm/comm.py` file, making it
part of the public `deepspeed.comm` namespace.
A `# noqa: F401` comment has been added to the import line. This is
necessary to bypass the `flake8` linter's "imported but unused" check,
as the specific purpose of this import is to expose the symbol to the
end-user, not for it to be used within the `comm.py` file itself.
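In essence, the change boils down to a single import in `deepspeed/comm/comm.py`:
```python
from torch.distributed import GradBucket  # noqa: F401  (re-exported for API parity)
```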
## How This Was Tested
The fix was verified with a local test script that confirms
`deepspeed.comm.GradBucket` can now be accessed correctly and is
identical to `torch.distributed.GradBucket`. The pre-commit hooks now
pass successfully.
## Related Test Run Screenshot
<img width="1250" alt="Screenshot 2025-06-30 at 22 41 10"
src="https://github.com/user-attachments/assets/cadf18e1-9d1a-4164-a5ff-0b3e6804ac48"
/>
## Related Issue
Fixes #7393
Signed-off-by: Vensenmu <vensenmu@gmail.com>
Co-authored-by: Masahiro Tanaka <81312776+tohtana@users.noreply.github.com>
In torch v2.8.0, all the symm mem code was moved into a dedicated folder
ffc6cbfaf7
So this PR addresses that change by checking whether the header is located
under `torch/csrc/distributed/c10d/symm_mem/SymmetricMemory.hpp` (the new
location). If not, we fall back to the original place for backward
compatibility.
This PR also cleans up some includes in `z1/2/3.cpp` that are already
included in `deepcompile.h`.
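A sketch of how such a header check can be done from Python (illustrative; the actual op-builder code may differ):
```python
import os
import torch

def symm_mem_header():
    """Return the include path for SymmetricMemory.hpp: prefer the torch>=2.8
    location (symm_mem/ subfolder), fall back to the pre-2.8 one."""
    torch_include = os.path.join(os.path.dirname(torch.__file__), "include")
    new_rel = "torch/csrc/distributed/c10d/symm_mem/SymmetricMemory.hpp"
    old_rel = "torch/csrc/distributed/c10d/SymmetricMemory.hpp"
    return new_rel if os.path.exists(os.path.join(torch_include, new_rel)) else old_rel
```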
Signed-off-by: Hollow Man <hollowman@opensuse.org>
Since `z1.h`, `z2.h` and `z3.h` are currently located under `csrc/compile`,
without this patch torch hipify fails to identify these hipified headers on
AMD platforms:
```log
In file included from torch/include/ATen/cuda/CUDAEvent.h:3,
from deepspeed/ops/csrc/includes/deepcompile.h:16,
from deepspeed/ops/csrc/compile/z1.h:6,
from deepspeed/ops/csrc/compile/z1_hip.cpp:7:
torch/include/ATen/cuda/ATenCUDAGeneral.h:3:10: fatal error: cuda.h: No such file or directory
3 | #include <cuda.h>
| ^~~~~~~~
compilation terminated.
```
Signed-off-by: Hollow Man <hollowman@opensuse.org>
In `comms_logging.py`, when `log_all` is called with the `show_straggler`
option enabled, an `all_reduce` is performed across all nodes to compute the
minimum latency and find stragglers. However, the tensors on which this is
performed are not moved to the configured devices. This commit adds that
capability using DeepSpeed's abstract accelerator API.
Resolves #7397
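Conceptually, the change looks like this (a sketch using the abstract accelerator API; names in the real `comms_logging.py` may differ):
```python
import torch
import deepspeed.comm as dist
from deepspeed.accelerator import get_accelerator

def min_latency_across_ranks(latency_usec):
    # Place the tensor on the accelerator selected for this rank so the
    # backend (e.g. NCCL) can all_reduce it; a CPU tensor would fail or
    # silently pick the wrong device.
    t = torch.tensor([latency_usec],
                     device=get_accelerator().current_device_name())
    dist.all_reduce(t, op=dist.ReduceOp.MIN)
    return t.item()
```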
Signed-off-by: Alex Kiefer <alexkiefer51@gmail.com>
Co-authored-by: Masahiro Tanaka <81312776+tohtana@users.noreply.github.com>
This PR fixes the behavior of DeepCompile's ZeRO stage 1 and adds stage
2 support.
DeepCompile's ZeRO1 currently performs allreduce at every iteration even
when it is not a gradient accumulation boundary. This significantly
slows down the performance when gradient accumulation is enabled. This
PR fixes this issue by performing allreduce only at the gradient
accumulation boundary.
As the current behavior is similar to ZeRO2, this PR also adds
DeepCompile's ZeRO2 support. We can now set zero stage to 2 with
DeepCompile.
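With this change, a configuration along these lines is expected to work (a hedged sketch; the `compile.deepcompile` flag is assumed from the DeepCompile docs, so consult them for the exact settings):
```json
{
  "zero_optimization": { "stage": 2 },
  "gradient_accumulation_steps": 4,
  "bf16": { "enabled": true },
  "compile": { "deepcompile": true }
}
```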
The loss values, performance, and memory usage were verified using this
[verification tool](https://github.com/tohtana/ds_verify_loss)
([results](https://github.com/tohtana/ds_verify_loss/blob/main/results/results_20250617_035117/report.md)).
---------
Signed-off-by: Masahiro Tanaka <mtanaka@microsoft.com>
Co-authored-by: Stas Bekman <stas00@users.noreply.github.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
```
......
  File "torch/_dynamo/backends/common.py", line 72, in _wrapped_bw_compiler
    return disable(disable(bw_compiler_fn)(*args, **kwargs))
  File "torch/_dynamo/eval_frame.py", line 838, in _fn
    return fn(*args, **kwargs)
  File "deepspeed/compile/inductor.py", line 27, in wrapped_compiler
    mod_graph = dc_compiler(gm, fake_inputs)
  File "deepspeed/compile/backend.py", line 330, in make_bw_graph
    run_opt_passes(
  File "deepspeed/compile/backend.py", line 206, in run_opt_passes
    mem_prof.run(*create_inputs_fn())
  File "deepspeed/compile/profilers/graph_profile.py", line 261, in run
    return return_val
UnboundLocalError: local variable 'return_val' referenced before assignment
```
Signed-off-by: Hollow Man <hollowman@opensuse.org>
Co-authored-by: Masahiro Tanaka <81312776+tohtana@users.noreply.github.com>
CUDA Toolkit 12.9 has been out for a while. The build currently fails when it
is installed, as the builder checks against hardcoded values.
This PR adds the value 12.9. A better mechanism would be to check dynamically
that the major version number is the same... maybe next time when CUDA 13
comes out :)
Signed-off-by: LosCrossos <165311345+loscrossos@users.noreply.github.com>
### Description
This PR fixes an `AttributeError: 'UnembedParameter' object has no
attribute 'dtype'` that occurs in the Inference V2 engine. The issue is
triggered when using a high-level interface like
[DeepSpeed-MII](https://github.com/deepspeedai/DeepSpeed-MII) to run
inference on models with tied input/output embeddings, such as Llama 2.
**Resolves: #7260**
### Root Cause Analysis
The root cause is that while the `ParameterBase` metaclass correctly
creates property setters for parameter tensors, the setter function
(`param_setter`) only assigns the tensor value itself. It does not
propagate the tensor's `dtype` to the container instance.
Downstream functions, such as `flatten_inference_model`, expect every
parameter container to have a `.dtype` attribute. When they encounter a
custom container like `UnembedParameter` that lacks this attribute, an
`AttributeError` is raised.
### The Fix
The solution is to modify the `param_setter` function within
`make_param_setter` located in
`deepspeed/inference/v2/model_implementations/parameter_base.py`.
I have added the line `self.dtype = value.dtype` immediately after the
parameter tensor is assigned. This simple change ensures that any object
inheriting from `ParameterBase` will now correctly expose the `dtype` of
the tensor it wraps, resolving the error.
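A simplified sketch of the described change (illustrative; the real `make_param_setter` in `parameter_base.py` does more bookkeeping):
```python
def make_param_setter(name):
    """Simplified stand-in for the setter created by the ParameterBase metaclass."""

    def param_setter(self, value):
        setattr(self, f"_{name}", value)   # store the tensor (simplified)
        # The fix: propagate the tensor's dtype to the container so that
        # downstream code such as flatten_inference_model can read `.dtype`.
        self.dtype = value.dtype

    return param_setter
```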
### Verification
This fix has been thoroughly verified in a containerized GPU environment
(RunPod with PyTorch 2.1). The verification process involved:
1. Cloning both the `deepspeed` and `DeepSpeed-MII` repositories from
source.
2. Installing the modified `deepspeed` library from this branch.
3. Installing the `DeepSpeed-MII` library (with a packaging fix) to
trigger the bug.
4. Running an end-to-end inference script with `mii.pipeline` and a
standard language model.
The logs confirm that with this fix, the program successfully executes
past the original point of failure. The `AttributeError` is completely
resolved, and the DeepSpeed engine proceeds correctly to the model
loading phase.
*(Note: A full end-to-end run in the test environment was ultimately
blocked by a separate, pre-existing build issue in DeepSpeed's op
builder (`ModuleNotFoundError: dskernels`), which is unrelated to this
logic fix. The successful progression past the original error point
serves as definitive proof of this fix's effectiveness.)*
### Related Context
This bug is primarily triggered via the
[**DeepSpeed-MII**](https://github.com/deepspeedai/DeepSpeed-MII)
project. A companion PR,
**[deepspeedai/DeepSpeed-MII#567](https://github.com/deepspeedai/DeepSpeed-MII/pull/567)**,
has been submitted to fix a packaging issue in that repository that was
a prerequisite for this verification.
output:
<img width="1014" alt="Screenshot 2025-06-22 at 14 16 15"
src="https://github.com/user-attachments/assets/1a658f98-a98b-4584-ae11-59e9edfd0b7e"
/>
<img width="1012" alt="Screenshot 2025-06-22 at 14 16 26"
src="https://github.com/user-attachments/assets/3959d0e5-d6dc-4ed4-adbc-6919e00da172"
/>
<img width="1728" alt="Screenshot 2025-06-22 at 14 17 40"
src="https://github.com/user-attachments/assets/537fd354-b840-4af2-98ab-d243c6902412"
/>
Signed-off-by: Vensenmu <vensenmu@gmail.com>
Co-authored-by: Masahiro Tanaka <81312776+tohtana@users.noreply.github.com>
`TestParamPartitioningSkipInit` throws the following error.
```
====================================== short test summary info ======================================
FAILED test_zero.py::TestParamPartitioningSkipInit::test[dtype1] - RuntimeError: mat1 and mat2 must have the same dtype, but got Half and BFloat16
========= 1 failed, 204 passed, 66 skipped, 15 deselected, 5 warnings in 2305.03s (0:38:25) =========
```
The test always sets the model's dtype to `torch.bfloat16` and ignores
the test parameter `dtype` when bfloat16 is supported. This causes a
dtype mismatch when `dtype=torch.float16` is given as the test parameter
because the data loader respects the test parameter dtype.
---------
Signed-off-by: Masahiro Tanaka <mtanaka@microsoft.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
#6993 broke many paths in the ZeRO1/2 optimizer. This PR fixes most of the
issues that PR caused. Currently we still have one failing test in
`unit/runtime/zero`:
```
====================================== short test summary info ======================================
FAILED test_zero.py::TestParamPartitioningSkipInit::test[dtype1] - RuntimeError: mat1 and mat2 must have the same dtype, but got Half and BFloat16
========= 1 failed, 204 passed, 66 skipped, 15 deselected, 5 warnings in 2305.03s (0:38:25) =========
```
---------
Signed-off-by: Masahiro Tanaka <mtanaka@microsoft.com>
Relaxing the tolerance values to enable the unit test below with the FP16
data type on ROCm:
`unit/runtime/half_precision/test_fp8.py::TestFp8ComposabilityAcrossZero::test[fp16]`
```
# Relax tolerance only for ROCm + FP16
if is_rocm_pytorch() and model_dtype == torch.float16:
    rtol, atol = 3e-07, 3e-05
```
cc: @jithunnair-amd
DeepSpeed supports mixed precision training, but the behavior is
different from `torch.autocast`. DeepSpeed maintains parameters and
gradients both in FP32 and a lower precision (FP16/BF16) (NVIDIA Apex
AMP style) and computes all modules in the lower precision while
`torch.autocast` maintains parameters in FP32 but computes only certain
operators in the lower precision.
This leads to differences in:
- performance: `torch.autocast` needs downcast in forward/backward
- memory usage: DeepSpeed needs more memory to keep copies of parameters
and gradients in lower precision
- accuracy: `torch.autocast` has a list of modules that can safely be
computed in lower precision. Some precision-sensitive operators (e.g.
softmax) are computed in FP32.
To align DeepSpeed's behavior with `torch.autocast` when necessary, this PR
adds an integration of `torch.autocast` with ZeRO. Here is an example of the
configuration.
```json
"torch_autocast": {
"enabled": true,
"dtype": "bfloat16",
"lower_precision_safe_modules": ["torch.nn.Linear", "torch.nn.Conv2d"]
}
```
Each configuration works as follows:
- `enabled`: Enable the integration with `torch.autocast` if this is set
to `True`. You don't need to call `torch.autocast` in your code. The
grad scaler is also applied in the DeepSpeed optimizer.
- `dtype`: lower precision dtype passed to `torch.autocast`. Gradients
for allreduce (reduce-scatter) and parameters for allgather (only for
ZeRO3) of `lower_precision_safe_modules` are also downcasted to this
dtype.
- `lower_precision_safe_modules`: Downcasting for allreduce (reduce-scatter)
and allgather (ZeRO3) is applied only to modules specified in this list. (The
precision for PyTorch operators in forward/backward follows
`torch.autocast`'s policy, not this list.) You can set class names with their
packages. If you don't set this item, DeepSpeed uses the default list:
`[torch.nn.Linear, torch.nn.Conv1d, torch.nn.Conv2d, torch.nn.Conv3d]`.
Note that only FP32 parameters are maintained when this feature is enabled.
For consistency, you cannot also enable `fp16` or `bf16` in the DeepSpeed config.
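A minimal usage sketch (assuming a single-GPU setup; the `torch_autocast` section is taken from the example above, the rest of the config is illustrative):
```python
import torch
import deepspeed

model = torch.nn.Sequential(torch.nn.Linear(8, 8), torch.nn.Linear(8, 2))

ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "optimizer": {"type": "AdamW", "params": {"lr": 1e-4}},
    "zero_optimization": {"stage": 3},
    # do not also enable "fp16"/"bf16" here; only FP32 params are maintained
    "torch_autocast": {
        "enabled": True,
        "dtype": "bfloat16",
        "lower_precision_safe_modules": ["torch.nn.Linear"]
    }
}

engine, _, _, _ = deepspeed.initialize(model=model,
                                       model_parameters=model.parameters(),
                                       config=ds_config)

# No torch.autocast context manager is needed; grad scaling is handled internally.
x = torch.randn(1, 8, device=engine.device)
loss = engine(x).float().mean()
engine.backward(loss)
engine.step()
```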
---------
Signed-off-by: Masahiro Tanaka <mtanaka@microsoft.com>
Signed-off-by: Fabien Dupont <fdupont@redhat.com>
Signed-off-by: Olatunji Ruwase <olruwase@microsoft.com>
Signed-off-by: Logan Adams <loadams@microsoft.com>
Signed-off-by: inkcherry <mingzhi.liu@intel.com>
Signed-off-by: Omar Elayan <oelayan@habana.ai>
Signed-off-by: Roman Fitzjalen <romaactor@gmail.com>
Signed-off-by: Hongwei <hongweichen@microsoft.com>
Signed-off-by: shaomin <wukon1992@gmail.com>
Signed-off-by: Stas Bekman <stas@stason.org>
Signed-off-by: siqi <siqi@tecorigin.com>
Signed-off-by: Wei Wu <wuwei211x@gmail.com>
Signed-off-by: ShellyNR <shelly.nahir@live.biu.ac.il>
Signed-off-by: Lai, Yejing <yejing.lai@intel.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Fabien Dupont <fabiendupont@fabiendupont.fr>
Co-authored-by: Liangliang Ma <1906710196@qq.com>
Co-authored-by: inkcherry <mingzhi.liu@intel.com>
Co-authored-by: Omar Elayan <142979319+oelayan7@users.noreply.github.com>
Co-authored-by: Stas Bekman <stas00@users.noreply.github.com>
Co-authored-by: Roman Fitzjalen <romaactor@gmail.com>
Co-authored-by: Ramya Ramineni <62723901+rraminen@users.noreply.github.com>
Co-authored-by: Guanhua Wang <alexwgh333@gmail.com>
Co-authored-by: root <root@ftqtmec25000000.taxzvufipdhelhupulxcbvr15f.ux.internal.cloudapp.net>
Co-authored-by: Hongwei Chen <33092912+hwchen2017@users.noreply.github.com>
Co-authored-by: Joe Mayer <114769929+jomayeri@users.noreply.github.com>
Co-authored-by: wukong1992 <wukong1992@users.noreply.github.com>
Co-authored-by: shaomin <wukon1992@gmail.com>
Co-authored-by: loadams <loadams@users.noreply.github.com>
Co-authored-by: siqi654321 <siqi202311@163.com>
Co-authored-by: siqi <siqi@tecorigin.com>
Co-authored-by: Wei Wu <45323446+U-rara@users.noreply.github.com>
Co-authored-by: Shelly Nahir <73890534+ShellyNR@users.noreply.github.com>
Co-authored-by: snahir <snahir@habana.ai>
Co-authored-by: Yejing-Lai <yejing.lai@intel.com>
Co-authored-by: Siddharth Singh <siddharth9820@gmail.com>
Co-authored-by: Olatunji Ruwase <tjruwase@gmail.com>
The newly released NCCL finally started to use fp32 accumulation for
reduction ops!
* Floating point summation is always done in fp32 accumulators (with the
exception of fp8 on NVLS, where it uses fp16 inside the switch). Thus,
the accuracy with fp8 and fp16 data types should be much improved.
72d2432094
So we should change the fp32 comms default for SP to the same dtype as the
inputs if `nccl>=2.27.3`; the user can still override the default.
---------
Signed-off-by: Stas Bekman <stas@stason.org>
Co-authored-by: Olatunji Ruwase <tjruwase@gmail.com>
This PR fixes issue #7303.
### 1. Description of the Bug
Currently, when using the `WarmupLR` scheduler, if `warmup_max_lr` is
not explicitly set in the scheduler's parameters, it incorrectly falls
back to its internal default value (`0.001`), ignoring the learning rate
set in the optimizer's parameters. This can lead to unexpected training
behavior and diverges from user expectations.
### 2. Description of the Fix
This fix modifies the `__init__` method of the `WarmupLR` scheduler in
`deepspeed/runtime/lr_schedules.py`.
- The default value for the `warmup_max_lr` argument in the function
signature is changed from `0.001` to `None`.
- Logic is added to check if `warmup_max_lr` is `None` upon
initialization. If it is, the scheduler now correctly inherits the
learning rate from the optimizer's parameter groups.
This change ensures that the optimizer's learning rate is respected as
the default `warmup_max_lr`, aligning the scheduler's behavior with the
user's configuration intent.
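A sketch of the described change (an illustrative stand-in with a trimmed argument list, not the full `WarmupLR` implementation):
```python
class WarmupLRSketch:
    """Illustrative stand-in for deepspeed.runtime.lr_schedules.WarmupLR
    (argument list trimmed)."""

    def __init__(self, optimizer, warmup_min_lr=0.0, warmup_max_lr=None,
                 warmup_num_steps=1000):
        self.optimizer = optimizer
        # Fix: default warmup_max_lr to None and, when unset, inherit the
        # learning rate(s) already configured on the optimizer instead of
        # silently falling back to a hardcoded 0.001.
        if warmup_max_lr is None:
            warmup_max_lr = [group["lr"] for group in optimizer.param_groups]
        elif not isinstance(warmup_max_lr, (list, tuple)):
            warmup_max_lr = [warmup_max_lr] * len(optimizer.param_groups)
        self.warmup_max_lrs = list(warmup_max_lr)
        self.warmup_min_lr = warmup_min_lr
        self.warmup_num_steps = warmup_num_steps
```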
### 3. Verification
The fix has been verified using a minimal reproduction script that
clearly demonstrates the behavioral change.
**Before Fix:**
Without `warmup_max_lr` in the scheduler config, the learning rate
incorrectly defaults to `0.001`.
<img width="1711" alt="Screenshot 2025-06-16 at 18 34 31"
src="https://github.com/user-attachments/assets/fe68f39e-2bbc-4f94-b322-546d9ce43bb0"
/>
**After Workaround (Demonstrating the Mechanism):**
By explicitly adding `warmup_max_lr` to the scheduler config, the
learning rate behaves as expected. My code change makes this the default
behavior.
<img width="1195" alt="Screenshot 2025-06-16 at 20 17 11"
src="https://github.com/user-attachments/assets/cc170246-fdac-4a56-8b9c-f204ebb47895"
/>
Signed-off-by: Vensenmu <vensenmu@gmail.com>
Co-authored-by: Olatunji Ruwase <tjruwase@gmail.com>
This PR keeps some of the real inputs given to the custom backend for
DeepCompile.
DeepCompile expects that the custom backend at TorchFX graph level is
always called when recompilation happens. In some cases, however, only
the Aten-level backend is called. As the Aten-level backend uses the real
inputs saved by the TorchFX-level backend, we need to keep the real inputs
for recompilation.
Currently we discard the real inputs after the Aten-level backend uses them,
as the real inputs are often too large to keep in GPU memory. This causes an
error in cases where recompilation only calls Aten-level backends, because we
don't get a chance to record new real inputs in the TorchFX-level backend.
This PR now always keeps only tensor metadata and non-tensor data on CPU and
materializes the tensors when needed (i.e. when recompilation happens and
only Aten-level backends are called without real inputs). As we use dummy
data to materialize the tensors, this solution might still not work in every
case, but it improves the coverage.
The new module `InputStorage` keeps the tensor metadata and non-tensor data
for this purpose and materializes the tensors.
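A sketch of the idea behind `InputStorage` (illustrative only; the real module lives in the DeepCompile sources and handles more cases):
```python
import torch

class InputStorageSketch:
    """Keep only metadata for tensors (and small non-tensor args), then
    materialize dummy tensors when recompilation needs real inputs."""

    def __init__(self, args):
        self._stored = []
        for a in args:
            if isinstance(a, torch.Tensor):
                self._stored.append(("tensor", a.shape, a.dtype, a.device))
            else:
                self._stored.append(("value", a))

    def materialize(self):
        out = []
        for item in self._stored:
            if item[0] == "tensor":
                _, shape, dtype, device = item
                out.append(torch.zeros(shape, dtype=dtype, device=device))
            else:
                out.append(item[1])
        return out
```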
---------
Signed-off-by: Masahiro Tanaka <mtanaka@microsoft.com>
This PR improves `pad_tensors` in `deepspeed/compile/util.py`, which
pads tensors so that all ranks have tensors with the same shape.
Previously, this function only adjusted tensor shapes, but tensor strides
could still differ across ranks, leading to recompilation on only some ranks.
As DeepCompile inserts communication operators into the graph, the
communication collective then easily gets stuck.
To address this issue, this PR replaces the use of
`torch.nn.functional.pad` with a new approach that ensures consistent
strides and avoids communication issues during distributed operations.
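A sketch of a stride-consistent padding approach (illustrative; the actual `pad_tensors` implementation may differ):
```python
import torch

def pad_to_shape(t, target_shape):
    """Allocate a fresh contiguous tensor of the target shape and copy the
    original into its leading slice. Every rank then ends up with identical
    shapes *and* strides, whereas with F.pad a rank that needs no padding can
    keep its original tensor (and strides), causing rank-dependent behavior."""
    padded = t.new_zeros(target_shape)
    padded[tuple(slice(0, s) for s in t.shape)] = t
    return padded
```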
Signed-off-by: Masahiro Tanaka <mtanaka@microsoft.com>
The project has been renamed at the last moment, so this PR is adapting
to that change.
There are no code changes in this PR, just docs.
---------
Signed-off-by: Stas Bekman <stas@stason.org>