2956 Commits

Author SHA1 Message Date
69e03e52d0 [XPU][CI] recover xpu-max1100 workflow (#7630)
Reduce some test scope to recover CI workflow.

Signed-off-by: Ma, Liangliang <liangliang.ma@intel.com>
Co-authored-by: Olatunji Ruwase <tunji.ruwase@snowflake.com>
2025-10-13 16:43:17 +00:00
7cb1b88ec4 Add ZenFlow code for Stage 3 (#7516)
This PR completes the ZenFlow integration for DeepSpeed ZeRO Stage 3. 

Highlights:

- ZenFlowSelectiveAdamW_stage3: Optimizer with importance-aware
selective parameter updates for ZeRO Stage 3.
- ZenFlowZeroOptimizer_Stage3: Full Stage 3 optimizer integration with
partitioned parameters and CPU offload.
- Configurable via ZenFlowConfig, fully integrated with
DeepSpeedZeroConfig for Stage 3.
- Unit tests for Stage 3 cases ensuring correctness and compatibility.

Note: Integration with ZeRO Stage 1&2 was introduced in #7391

---------

Signed-off-by: Yusen Wu <xrn4ub@virginia.edu>
Co-authored-by: Ma, Guokai <guokai.ma@intel.com>
Co-authored-by: Olatunji Ruwase <tunji.ruwase@snowflake.com>
Co-authored-by: Tingfeng Lan <erc8gx@virginia.edu>
2025-10-13 12:19:18 -04:00
b7cd78f096 Bump version
Signed-off-by: Olatunji Ruwase <tunji.ruwase@snowflake.com>
2025-10-07 19:36:10 -04:00
79caae1c04 Update email address (#7624)
Update contact address

Signed-off-by: Olatunji Ruwase <tunji.ruwase@snowflake.com>
v0.18.0
2025-10-07 17:15:49 +00:00
1b08325da3 [TiledMLP] moe support (#7622)
MoE routers seem to drop the `bs` dimension in `x`, so the `[bs, seqlen,
hidden_size]` shape can no longer be expected. Support that use case.

Signed-off-by: Stas Bekman <stas@stason.org>
2025-10-07 13:33:34 +00:00
1ae1cdd8e4 Clarify document of leaf module config (#7623)
Update document of leaf module config as suggested
[here](https://github.com/deepspeedai/DeepSpeed/pull/7604#discussion_r2407483616).

Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
2025-10-06 20:10:32 -07:00
2b68bbc594 Blog of zenflow binding study (#7614)
This PR adds a blog/lab studying ZenFlow and ZeRO-Offload performance
with DeepSpeed CPU core binding.

---------

Signed-off-by: Guokai Ma <guokai.ma@gmail.com>
Signed-off-by: Olatunji Ruwase <tunji.ruwase@snowflake.com>
Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
Signed-off-by: Junjie Mao <junjie.mao@linux.alibaba.com>
Co-authored-by: Xinyu Lian <lian7@illinois.edu>
Co-authored-by: Olatunji Ruwase <tunji.ruwase@snowflake.com>
Co-authored-by: zhengchenyu <zhengchenyu16@163.com>
Co-authored-by: Masahiro Tanaka <81312776+tohtana@users.noreply.github.com>
Co-authored-by: Junjie Mao <junjie.mao@linux.alibaba.com>
2025-10-06 11:38:44 -04:00
71d077da73 Enable grad scaler for ZeRO-0 + torch.autocast path (#7619)
Currently, the DeepSpeed engine does not enable the grad scaler for the
ZeRO-0 and `torch.autocast` path, even when dtype is set to `fp16`. This
leads to errors in tests when we replace our hard-coded tolerances with
PyTorch’s [standard
tolerances](https://docs.pytorch.org/docs/stable/testing.html#torch.testing.assert_close)
(Thank you @stas00 for your suggestion regarding the previous PR).

This PR enables the grad scaler for this path to improve accuracy, and
refactors the tests to simplify validation by using
`torch.testing.assert_close`. The tests now rely on PyTorch’s standard
(and stricter) tolerances, and they still pass.
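
As an illustration of that validation style (a sketch with toy tensors, not the repo's actual tests):

```python
# torch.testing.assert_close picks rtol/atol from the dtype (fp16 gets looser
# defaults than fp32) instead of hand-tuned tolerances, and reports the
# mismatching values on failure.
import torch

expected = torch.randn(64, dtype=torch.float16)
actual = expected.clone()
torch.testing.assert_close(actual, expected)  # raises with a value diff on mismatch
```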

---------

Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
Co-authored-by: Olatunji Ruwase <tunji.ruwase@snowflake.com>
2025-10-04 13:21:08 +00:00
65322e103c Super offload blog Chinese version (#7620)
This is the Chinese version of the SuperOffload blog.

---------

Signed-off-by: Guokai Ma <guokai.ma@gmail.com>
Co-authored-by: Olatunji Ruwase <tunji.ruwase@snowflake.com>
2025-10-04 12:58:51 +00:00
4eb37729de add print_dist util (#7621)
A refactoring follow-up to
https://github.com/deepspeedai/DeepSpeed/pull/7617, as suggested by
@tohtana: create two independent utils that share the main logic via
another util (a sketch of the idea follows).
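
A rough sketch of the split (the shared helper name `_emit_dist` is assumed, not the repo's actual code):

```python
# print_dist and log_dist share one rank-filtering helper; only the emit
# function differs, so wall-clock stats can bypass the logger's level.
import torch.distributed as dist

def _emit_dist(emit_fn, message, ranks=None):
    my_rank = dist.get_rank() if dist.is_available() and dist.is_initialized() else 0
    if ranks is None or my_rank in ranks or -1 in ranks:
        emit_fn(f"[Rank {my_rank}] {message}")

def print_dist(message, ranks=None):
    _emit_dist(print, message, ranks)
```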

Signed-off-by: Stas Bekman <stas@stason.org>
2025-10-03 19:30:26 -07:00
7d9a2f2bf3 Improve leaf module interface (enable via config, relax matching criteria, add document, etc.) (#7604)
This PR improves the usability of the leaf module feature.

Here are the changes:
- Allow enabling the leaf module via both the DeepSpeed config and APIs (see
the API sketch after this list).
- Relax matching criteria to support class-based matching.
- Support multiple ways of specifying the target module: class, class
name (with or without package name), module name, or suffix.
- Add documentation to the training guide, including config snippets and
explanations of default behavior.
- Add default classes (e.g., Mixtral, Qwen2/Qwen3) that automatically
enable the leaf module feature. (Welcoming requests to add more classes)
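
A minimal sketch of the API path (class-based matching); the Mixtral block here is just an example target, and config-based enabling is covered in the training guide:

```python
from deepspeed.utils import set_z3_leaf_modules
from transformers.models.mixtral.modeling_mixtral import MixtralSparseMoeBlock

def mark_moe_leaves(model):
    # ZeRO-3 then treats each matched block as a single unit instead of
    # hooking its individual submodules.
    return set_z3_leaf_modules(model, [MixtralSparseMoeBlock])
```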

---------

Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
Co-authored-by: Olatunji Ruwase <tunji.ruwase@snowflake.com>
2025-10-03 09:45:28 +00:00
82a9db7eba Show mismatching values when DeepCompile test fails (#7618)
This PR improves the error message shown when a DeepCompile test fails.

Tests of DeepCompile occasionally fail
([example](https://github.com/deepspeedai/DeepSpeed/actions/runs/18160078309/job/51688736712?pr=7604))
because of mismatching loss values.
To make sure this is not a synchronization bug that causes `nan` loss
values, the change in this PR shows the mismatching values. We can
consider increasing the tolerances once we confirm the mismatch is
reasonable.

---------

Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
2025-10-03 05:23:13 -04:00
2a76988958 DeepCompile: Use min_cut_rematerialization for partitioning joint graphs (#7609)
# Motivation

PyTorch provides `min_cut_rematerialization_partition()` to partition a
joint graph while respecting recomputation annotations. The algorithm
forms a data-flow-like graph from the joint graph, adds edge weights
derived from recomputation-cost heuristics, and applies the min-cut
algorithm to determine which nodes to recompute. Users can force
recomputation of a node by setting its `node.meta["recompute"]` to
MUST_RECOMPUTE or PREFER_RECOMPUTE, as implemented in [1].

While originally designed for activation checkpointing,
min_cut_rematerialization can also be used to recompute param aliases.
When partitioning a joint graph, we don't want to save for backward the
gathered parameters and values computed from them via aliasing ops, as
that essentially means the gathered parameter will be saved. Instead of
customizing the partitioner or patching `choose_saved_values_set`, we
can achieve that by annotating such nodes to be MUST_RECOMPUTE.

Both the eager and inductor backends can use min_cut_rematerialization
easily. The eager backend can use min-cut by customizing the
partition_fn for `aot_module_simplified`, and already does so for
graphs with activation checkpointing enabled. The inductor backend has
used that algorithm since torch 2.0.0 [2], and it is still the default
after the inductor partitioner was made configurable a few weeks ago [3].

That approach also helps DeepCompile + torch autocast nicely. When
autocast is enabled, it is preferable to recompute downcasted
parameters, and it suffices to mark such casting nodes as must-recompute.

[1]
https://github.com/pytorch/pytorch/blob/main/torch/_functorch/partitioners.py#L1813
[2]
https://github.com/pytorch/pytorch/blob/v2.0.0/torch/_inductor/compile_fx.py#L459
[3] https://github.com/pytorch/pytorch/pull/157580

# Proposal

Motivated by the flexibility and the requirement of optimizing
DeepCompile + autocast, I propose switching to the min-cut-based
partitioner for both backends. This PR implements that switch, cleans up
dead code, and also recomputes downcasted parameters in the backward
pass (a sketch of the idea follows).
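
A rough sketch of the direction (not DeepCompile's actual code; it assumes torch >= 2.4 internals and simplifies the cast-node matching):

```python
import torch
from torch._functorch.partitioners import min_cut_rematerialization_partition
from torch.utils.checkpoint import CheckpointPolicy

def partition_with_recomputed_casts(joint_gm, joint_inputs, **kwargs):
    # Annotate dtype-cast nodes so min-cut recomputes them in backward instead
    # of saving the downcasted (i.e. gathered) parameters for the backward pass.
    for node in joint_gm.graph.nodes:
        if node.target is torch.ops.prims.convert_element_type.default:
            node.meta["recompute"] = CheckpointPolicy.MUST_RECOMPUTE
    return min_cut_rematerialization_partition(joint_gm, joint_inputs, **kwargs)

# The eager backend would pass this as `partition_fn=` to aot_module_simplified;
# the inductor backend already uses the min-cut partitioner by default.
```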

# Preliminary Evaluation

Here's a summary of the tests using
https://gist.github.com/eternalNight/3c2cf8c703f1e9e7742d3b7f9e1edae3 on
a 8x RTX 5090 node.

| Configuration | Base Time (ms) | Base Mem (GB) | Time with this PR (ms) | Mem with this PR (GB) |
|---------------------|----------------|---------------|------------------------|-----------------------|
| eager + autocast | 551.92 | 12.07 | 571.24 | 9.96 |
| eager + bf16 | 419.87 | 9.47 | 445.76 | 7.30 |
| inductor + autocast | 546.97 | 12.84 | 570.09 | 13.04 |
| inductor + bf16 | 444.03 | 10.01 | 444.70 | 10.19 |

## Reduced memory with eager backend

The initial goal of this PR is to reduce peak memory usage when torch
autocast is enabled. That is achieved according to the first row of the
table, but in two different ways simultaneously.

1. Downcasted parameters produced during forward are thrown away and
recomputed (by the fused cast + allgather) in the backward pass.
2. Without this PR, `fast_free_schedule` arranges most allgathers at the
beginning of the graph. That leads to an even higher peak during
forward, which is no longer seen with this PR.
3. By diffing the graphs passed to `add_z3_gather_release`, I noticed
that the recomputations selected by min-cut are slightly different (that
test script has activation checkpointing enabled for the LLM module).
That can also impact computation time and memory usage.

Here's the shape of memory usage before this PR with the eager backend +
torch autocast. eager + BF16 shows similar shapes. Numbers reported in
the table are the peaks during forward. Peak memory usage during
backward decreases by ~0.7GB in both cases.

<img width="1482" height="629" alt="image"
src="https://github.com/user-attachments/assets/7e7ec859-9a04-4ddd-ba37-c2d475a81058"
/>

After this PR:

<img width="1482" height="453" alt="image"
src="https://github.com/user-attachments/assets/f15c71b8-f823-4aa5-801a-a36188c5e866"
/>

## Similar memory with inductor backend

Unlike the eager backend, the inductor backend uses similar memory with or
without this PR. The memory usage pattern is as follows and requires
further analysis.

Before this PR:

<img width="1070" height="613" alt="image"
src="https://github.com/user-attachments/assets/317b9a58-d4ef-459f-ac7b-67ef2318a9de"
/>

After this PR:

<img width="911" height="536" alt="image"
src="https://github.com/user-attachments/assets/7e737a81-cf27-402c-aeea-dfe661043fc1"
/>

Signed-off-by: Junjie Mao <junjie.mao@linux.alibaba.com>
2025-10-03 03:39:38 +00:00
9cbd3edd0d [wall_clock_breakdown] always log stats when enabled (#7617)
Currently, when the main logger is at WARN level, `wall_clock_breakdown: true`
never logs anything, which is wrong, as it disables this at-times-crucial
functionality. Plus, I think we have a disconnect somewhere, since the
recently added `--log_level` flag doesn't seem to change this logger's
level.

The future plan is to be able to have different log levels for different
modules, but for now just use `print` if `wall_clock_breakdown` is
`True`, so this functionality is not log-level dependent.

`print` is also less noisy than the logger because of the long prefix
generated by the latter, which is of no value to the user since we print
stats rather than code-related logs, so the printed results are easier to
digest.

Signed-off-by: Stas Bekman <stas@stason.org>
2025-10-02 19:08:39 -04:00
e37c37acdd Fixed save_checkpoint race when consolidating NVMe offloaded tensors (#7613)
Past Discussion: #7549

Signed-off-by: H1manshu21 <himanshuwindows8.1@gmail.com>
Co-authored-by: Olatunji Ruwase <tunji.ruwase@snowflake.com>
2025-10-01 18:08:20 +00:00
07e76bd45f Fixed the issue that universal checkpoint cannot be loaded for stage3 when world size expansion. (#7599)
When the world size expands from 2 to 4, we convert to a universal
checkpoint and then load from it.
A new rank, for example rank 3, will try to load the model file
`zero_pp_rank_3_mp_rank_00_model_states.pt`, but this file was not
produced during the previous run.
For stage 3, just load the first file, that is,
`zero_pp_rank_0_mp_rank_00_model_states`.
The existing unit test
TestZeROUniversalCheckpointDP::test_dp_world_size_2to4 can verify this
problem.

---------

Co-authored-by: Olatunji Ruwase <tunji.ruwase@snowflake.com>
2025-10-01 15:37:19 +00:00
330f738cd7 Minor fix in the SuperOffload blog (#7612)
Polish SuperOffload blog post; minor grammar and style fixes

---------

Co-authored-by: Olatunji Ruwase <tunji.ruwase@snowflake.com>
2025-10-01 11:02:31 +00:00
aa90f544e3 DeepCompile: Fix IPG bucket clearing (#7610)
PR #6993 replaces the flat IPG buffers with a dict maintaining
type-indexed buckets. The member is also renamed from
`_ipg_bucket_flat_buffer` to `ipg_buckets`.

Update the bucket clearing logic in `init_z3` accordingly.

Signed-off-by: Junjie Mao <junjie.mao@linux.alibaba.com>
2025-10-01 03:42:51 +00:00
e32e817306 Handle the case of DeepCompile's enabled but not activated (#7603)
This PR improves state management for DeepCompile in the engine.

Previously, the system relied only on the config flag indicating whether
DeepCompile was enabled. However, DeepCompile is actually activated only
when `compile()` is called. This meant that if DeepCompile was enabled
in the config but `compile()` was never called, it could lead to invalid
internal states (as shown in #7598).

Since `enabled == True` should be interpreted as an option that modifies
the behavior of `compile()`, this PR introduces clearer state
management:
- If .compile() is not called, the DeepCompile config has no effect on
behavior. A one-time message is shown instead.
- A new state, DeepCompile activated, is introduced. This represents the
condition where DeepCompile is both enabled in the config and .compile()
has been called.

---------

Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
Co-authored-by: Olatunji Ruwase <tunji.ruwase@snowflake.com>
2025-09-30 17:21:55 -07:00
177c25c9d7 Add venv to .gitignore (#7605)
Since `make format` generates a `venv` directory, we should add it to
`.gitignore`.

Co-authored-by: Olatunji Ruwase <tunji.ruwase@snowflake.com>
2025-09-30 20:44:07 +00:00
462d28c5e6 Add blog for SuperOffload (#7594)
This PR adds a blog post for SuperOffload. More specifically, the blog
covers the design and motivation behind SuperOffload, comparisons with
previous approaches, key experiences and insights, and guidance on
enabling and using SuperOffload.

See also:
[PR#7559](https://github.com/deepspeedai/DeepSpeed/pull/7559) -
SuperOffload implementation.
[PR#990](https://github.com/deepspeedai/DeepSpeedExamples/pull/990) -
Examples.

---------

Signed-off-by: Olatunji Ruwase <tunji.ruwase@snowflake.com>
Co-authored-by: Olatunji Ruwase <tunji.ruwase@snowflake.com>
2025-09-30 13:59:34 -04:00
4efd7eca73 DeepCompile: Fuse allgather and downcast (#7588)
With autocast enabled, a majority of the weights are downcasted before being
used in calculations. Today zero3_compile gathers the FP32 weights
before they are downcasted. That is sub-optimal because FP32 weights
consume more bandwidth to allgather and take more time to downcast.

To reduce communication and downcast time, fuse allgather and downcast
in the dc ops. The target type is now passed to allgather_param() and
prefetch_params_fused(), which downcast the (partial) weights before
launching allgathers.

This corresponds to issue 1 of #7577.
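
Conceptually, the fused path looks like this (a sketch only, not the dc op implementation):

```python
# Cast the local shard first, then allgather the smaller buffer, instead of
# allgathering FP32 shards and casting afterwards.
import torch
import torch.distributed as dist

def gather_downcast(shard_fp32: torch.Tensor, world_size: int, dtype=torch.bfloat16):
    shard = shard_fp32.to(dtype)  # downcast the local partition first
    out = torch.empty(world_size * shard.numel(), dtype=dtype, device=shard.device)
    dist.all_gather_into_tensor(out, shard)  # halves the allgather volume vs. FP32
    return out
```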

Tested with
https://gist.github.com/eternalNight/3c2cf8c703f1e9e7742d3b7f9e1edae3
(run with `deepspeed --num_gpus=N this_file.py -c -p -m 23` to collect
torch and memory profiles, and with DINOV2_DEPTH = SIGLIP_DEPTH = 3,
LLAMA2_DEPTH = 4 for faster compilation) on a 5090 (which has limited
inter-GPU bandwidth), time per step decreases from 438ms to 337ms and
peak GPU memory usage from 9.5GB to 8.5GB.

Profiles of a single step before this PR:

<img width="1235" height="1029" alt="image"
src="https://github.com/user-attachments/assets/d9fe5296-7731-4542-924b-421ff7415054"
/>

<img width="1466" height="616" alt="image"
src="https://github.com/user-attachments/assets/aa192802-8633-4e36-b2c4-f28b1b432663"
/>

After this PR:

<img width="1218" height="1006" alt="image"
src="https://github.com/user-attachments/assets/18a0e09c-155b-4783-adb5-b4d36c5c3691"
/>

<img width="1537" height="559" alt="image"
src="https://github.com/user-attachments/assets/16a2ca74-8a89-4db9-9b68-81844295c61b"
/>

This PR also reduces peak memory usage because the
`fast_free_schedule()` today always arranges param allgathers and
downcasts at the beginning of the graph. While the original FP32 params
can be freed early, all FP16/BF16-casted params are kept in GPU memory
at the beginning of the backward graph, leading to a higher peak in
memory usage.

P.S. Probably due to organization branch-rule settings, I can't find an
option to allow reviewers to modify the branch, so I'll update the
branch per reviewers' comments and rebase if needed.

Signed-off-by: Junjie Mao <junjie.mao@linux.alibaba.com>
2025-09-29 03:15:33 +00:00
6fcccfa2c9 DeepCompile: Specify tensor aliasing in C++ op schema (#7597)
PyTorch C++ op schema [1] allows specifying tensor storage aliasing by
annotating `(a)` after input/output types. Torch inductor takes this
information to determine where to insert explicit `del` statements for
tensors that are no longer needed.

If what an op schema specifies disagrees with the op implementation,
inductor-generated code is likely to release tensors earlier than
expected, leading to wrong results.

`wait_allgather` and `release_param` return the first argument unchanged
and that aliasing should be annotated in the schema.

Also remove the code related to `clone_custom_op_output` as it is solely
a workaround of the aforementioned issue.

[1]
https://github.com/pytorch/pytorch/blob/main/aten/src/ATen/native/README.md

Fixes: #7596

Signed-off-by: Junjie Mao <junjie.mao@linux.alibaba.com>
2025-09-29 02:40:09 +00:00
47b3fb5e7f Fixed the problem of loading universal checkpoint error in multi-machine mode. (#7601)
In a multi-machine environment, loading the stage3 universal checkpoint
will produce incorrect results, causing the loss to increase abnormally.
2025-09-28 20:26:11 +00:00
66c70312f2 Change current_device() to current_device_name() (#7600)
This PR fixes a bug where get_accelerator().current_device() is used in
some places instead of get_accelerator().current_device_name(). This is
mostly fine, but it won't work on CPU:

`torch.empty(3, device=get_accelerator().current_device())` <-- won't
work on anything other than a CUDA device
`torch.empty(3,
device=torch.device(get_accelerator().current_device()))` <-- works for
GPU devices, but won't work for CPU
`torch.empty(3,
device=torch.device(get_accelerator().current_device_name()))` <-- works
for both GPU devices and CPU
`torch.empty(3, device=get_accelerator().current_device_name())` <--
this also works, but is not as formal as the previous one.
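
For reference, the portable pattern in runnable form (a sketch; `get_accelerator` is DeepSpeed's accelerator accessor):

```python
import torch
from deepspeed.accelerator import get_accelerator

# current_device_name() returns a string such as "cuda:0", "xpu:0" or "cpu",
# which torch accepts as a device on every backend.
x = torch.empty(3, device=get_accelerator().current_device_name())
```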

This bug was exposed when I tried to run AutoTP training on a Xeon server
for debugging purposes.

---------

Signed-off-by: Guokai Ma <guokai.ma@gmail.com>
2025-09-28 10:19:49 -07:00
91d14527b6 Fix the universal checkpoint issue for stage3 when there are multiple subgroups. (#7585)
**Describe the bug**

When the model is large and there are multiple subgroups,
ds_to_universal.py will fail; the error log is below:

```
*** 1. Extracting ZeRO fragments
  0%|                                                     | 0/1 [00:03<?, ?it/s]
Traceback (most recent call last):
  File "/work/zhengchenyu/ai-project/qwen3/scripts/ds_to_universal_example.py", line 21, in <module>
    main()
  File "/work/zhengchenyu/ai-project/qwen3/scripts/ds_to_universal_example.py", line 18, in main
    ds_to_universal_main(args)
  File "/opt/conda/lib/python3.11/site-packages/deepspeed/checkpoint/ds_to_universal.py", line 523, in main
    _extract_zero_shard_files_stage3(args, optim_files, param_shapes, dp_degree, temp_dir)
  File "/opt/conda/lib/python3.11/site-packages/deepspeed/checkpoint/ds_to_universal.py", line 375, in _extract_zero_shard_files_stage3
    _do_parallel_work(do_work, list(range(dp_degree)), args.num_extract_workers)
  File "/opt/conda/lib/python3.11/site-packages/deepspeed/checkpoint/ds_to_universal.py", line 359, in _do_parallel_work
    results.append(do_work(work))
                   ^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/deepspeed/checkpoint/ds_to_universal.py", line 167, in extract_zero_shards_stage3
    dump_param_fragment(temp_dir, 0, dp_index, state_key, flat_state[state_key], name, offset,
  File "/opt/conda/lib/python3.11/site-packages/deepspeed/checkpoint/ds_to_universal.py", line 194, in dump_param_fragment
    state_flat_tensor = state_flat_tensor.narrow(0, offset, numel).clone()
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: start (0) + length (155582464) exceeds dimension size (74499072).

```

**To Reproduce**
Steps to reproduce the behavior:
1. Use large model to run, or set sub_group_size to a lower value. Then
train and save model
2. Run ds_to_universal.py

**The reason**

I found that the previous stage3 universal checkpoint implementation did
not take subgroups into account. I also found the following problems
during debugging.

* Unable to handle multiple sub-groups, which will result in data loss.
* When load_checkpoint is True, all processes will save to the same ZeRO
model checkpoint file. If multiple processes write at the same time, the
file will be corrupted; file corruption was occasionally observed during
testing.

Related issue: #7584

---------

Co-authored-by: Olatunji Ruwase <tunji.ruwase@snowflake.com>
2025-09-27 17:39:43 +00:00
6ea345ae27 Simplify leaf module hook (#7592)
This PR simplifies hooks for leaf module using PyTorch's API.

---------

Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
Co-authored-by: Olatunji Ruwase <tunji.ruwase@snowflake.com>
2025-09-27 13:12:15 -04:00
b75654001a disables ZeRO checkpoint loading path when stage=0 (#7586)
Fixes #7571 

When ZeRO is disabled (stage 0) and bf16 is enabled, the current guard
sets `load_zero_checkpoint=True`, which leads to `_load_zero_checkpoint`
and `_restore_from_bit16_weights()` being called even though no ZeRO
state exists.

This PR removes the `self.bfloat16_enabled()` condition so that
load_zero_checkpoint is tied strictly to `self.zero_optimization()`.

Stage 0 (BF16/FP16/FP32): cleanly skips ZeRO checkpoint path.

Stage ≥ 1: loads ZeRO partitioned optimizer state as before.

cc @sfc-gh-truwase

Signed-off-by: Naveenraj Kamalakannan <therealnaveenkamal@gmail.com>
Co-authored-by: Olatunji Ruwase <tunji.ruwase@snowflake.com>
2025-09-25 20:31:14 +00:00
16c1bf429f Include init file for superoffload folder (#7591)
This PR just fixes a tiny error from PR
[7559](https://github.com/deepspeedai/DeepSpeed/pull/7559), reported in a comment
[here](https://github.com/deepspeedai/DeepSpeed/pull/7559#issuecomment-3329036699).

```
[rank1]:   File "/usr/local/lib/python3.11/dist-packages/deepspeed/runtime/engine.py", line 1462, in _configure_optimizer
[rank1]:     self.optimizer = self._configure_zero_optimizer(basic_optimizer)
[rank1]:                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/usr/local/lib/python3.11/dist-packages/deepspeed/runtime/engine.py", line 1835, in _configure_zero_optimizer
[rank1]:     from deepspeed.runtime.superoffload.superoffload_stage3 import SuperOffloadOptimizer_Stage3
[rank1]: ModuleNotFoundError: No module named 'deepspeed.runtime.superoffload'
```

Create `__init__.py` for the superoffload folder to avoid an import error
when the folder is ignored by pip installation.

---------

Signed-off-by: nguyen599 <pnvmanh2123@gmail.com>
2025-09-24 16:50:17 +00:00
af56ed4d37 SuperOffload Release (#7559)
This PR introduces **SuperOffload**—an optimizer designed for Superchips
(Nvidia GH200 & GB200, AMD MI300A) with high CPU–GPU bandwidth. It
enables **full fine-tuning** of **GPT-OSS-20B, Qwen3-14B, and Phi-4** on
a single GH200 GPU, achieving up to **~500 TFLOPS**, using Hugging Face
Transformers and DeepSpeed—no custom modeling code required.

SuperOffload extends ZeRO-Offload with fine-grained control and CPUAdam
rollback utilities, allowing GPU execution to overlap with CPUAdam. This
reduces GPU idle time and improves overall efficiency.

Key changes:
- New SuperOffloadOptimizer_Stage3 optimizer.
- C++/CUDA binding for adam_rollback to revert one optimization step.
- Config additions including super_offload and cpuadam_cores_perc.

A detailed blog and tutorial will be available soon.

---------

Co-authored-by: Olatunji Ruwase <tunji.ruwase@snowflake.com>
2025-09-24 13:09:23 +00:00
17d80ce440 Deepcompile: Make size of activation to free configurable (#7582)
In deepcompile free-activation mode, only activations larger than a
threshold are eagerly freed. The threshold is hardcoded today and thus
may not be suitable in all cases.

This PR first generalizes the dc.init() interface to take the whole
compile_config object, and then converts the threshold into a config
item.

This corresponds to issue 3 of #7577.

---------

Signed-off-by: Junjie Mao <junjie.mao@linux.alibaba.com>
2025-09-24 01:37:46 +00:00
bc9ed477e9 Broadcast fp16 overflow in Z1 (#7580)
Fix #7568

Signed-off-by: Olatunji Ruwase <tunji.ruwase@snowflake.com>
2025-09-23 15:51:43 +00:00
8c7c56a932 Deepcompile: Fix bugs when applying deepcompile to VLA-like models (#7569)
**Describe the bug**

When applying deepcompile to the OpenVLA model (which is composed of two
vision transformers and a llama-7B), I met the following issues:

a. Not all parameters are trained, which leads to compile-time
exceptions as well as incorrect invocation of `endBackward()`.
b. `release_param()` can be passed a tuple, not a tensor.
c. A use-before-define error in `fast_free_schedule()`.

This PR attempts to fix all of those issues: patches 1-2 resolve (a),
patch 3 resolves (b), and patch 4 resolves (c).

**To Reproduce the issues**
Use this script:
https://gist.github.com/eternalNight/3c2cf8c703f1e9e7742d3b7f9e1edae3

1. `deepspeed --num_gpus=N openvla-like.py -c`

---------

Signed-off-by: Junjie Mao <junjie.mao@linux.alibaba.com>
2025-09-23 07:27:15 +00:00
35de2030be logging: Also set log level of logger handlers (#7576)
After #7526 the default logger passes logs to a StreamHandler, which has
its own log level. Changing the log level of the logger alone does not
take effect in such case.

Update the log level of all handlers when changing the parent logger's.
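
The idea in standard-library terms (a sketch, not the exact patch):

```python
import logging

def set_level_with_handlers(logger: logging.Logger, level: int) -> None:
    # Keep handler levels in sync with the logger, otherwise a handler added
    # earlier (e.g. a StreamHandler) keeps filtering at its old level.
    logger.setLevel(level)
    for handler in logger.handlers:
        handler.setLevel(level)
```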

Signed-off-by: Junjie Mao <junjie.mao@linux.alibaba.com>
2025-09-23 03:32:37 +00:00
325c6c5e9c DeepCompile ZeRO-3: robust allgather for uneven shards; fix profiling meta key (max_mem) (#7489)

---------

Signed-off-by: Abhishek <dalakotiashu150@gmail.com>
Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
Co-authored-by: Abhishek <dalakotiashu150@gmail.com>
Co-authored-by: Masahiro Tanaka <mtanaka@anyscale.com>
2025-09-22 16:45:00 -07:00
80033a8293 Update version.txt post 0.17.6 release (#7572) 2025-09-19 14:33:22 -07:00
e4f6da9685 [bugfix] fix partition context unpatch (#7566)
## Fix asymmetric patching/unpatching in
InsertPostInitMethodToModuleSubClasses

### Problem Description

The `InsertPostInitMethodToModuleSubClasses` context manager patches
`__init__` methods of model classes during entry and unpatches them
during exit.

However, asymmetric condition checks between patching and unpatching can
introduce subtle inheritance bugs.

### Root Cause Analysis

The issue occurs with classes that have multiple inheritance where:
1. **Child class A** does not override `__init__`
2. **Parent class B** does not inherit from `nn.Module`
3. **Parent class C** inherits from `nn.Module`

**Current asymmetric logic:**
```python
# Patching (entry): Only patch classes with explicit __init__
def _enable_class(cls):
    if '__init__' in cls.__dict__:  #  Strict check
        cls._old_init = cls.__init__
        cls.__init__ = partition_after(cls.__init__)

# Unpatching (exit): Restore any class with _old_init
def _disable_class(cls):
    if hasattr(cls, '_old_init'):  #  Permissive check
        cls.__init__ = cls._old_init
```

**Execution flow:**
1. **During entry**: Child A is skipped (no explicit `__init__`), Parent
C is patched
2. **During exit**: Child A inherits `_old_init` from Parent C and gets
incorrectly "restored"

**Result**: Child A's `__init__` points to Parent C's original
`__init__`, bypassing Parent B and breaking the inheritance chain.

### Reproduction Case

This pattern is common in Hugging Face models:
```python
class Qwen3ForSequenceClassification(GenericForSequenceClassification, Qwen3PreTrainedModel):
    pass  # No explicit __init__

# GenericForSequenceClassification - not a nn.Module subclass
# Qwen3PreTrainedModel - inherits from nn.Module
```

### Solution

Apply symmetric condition checking in both patch and unpatch operations:

```python
def _disable_class(cls):
    # Match the patching condition: only restore classes we explicitly patched
    if '__init__' in cls.__dict__ and hasattr(cls, '_old_init'):
        cls.__init__ = cls._old_init
        delattr(cls, '_old_init')  # Optional cleanup
```

This ensures that only classes that were explicitly patched during entry
get restored during exit.

### Testing

The fix has been validated against the Qwen3ForSequenceClassification
reproduction case and resolves the inheritance chain corruption.

### Related Issues
- External issue: https://github.com/modelscope/ms-swift/pull/5820

Co-authored-by: Masahiro Tanaka <81312776+tohtana@users.noreply.github.com>
v0.17.6
2025-09-19 07:24:33 +00:00
6b731c5c96 scripts: Check .is_cuda only in non-C++ files (#7561)
Today, check-torchcuda.py searches for all occurrences of .is_cuda in the
repository even when a commit only modifies C++ headers and sources,
which I believe is not intended.

Check usage of .is_cuda only when a commit modifies any non-C++ file.

Signed-off-by: Junjie Mao <junjie.mao@linux.alibaba.com>
2025-09-19 05:01:50 +00:00
2585881ae9 Make Muon optimizer easier to enable (#7555)
The original Muon optimizer PR
(https://github.com/deepspeedai/DeepSpeed/pull/7509) requires the user to
explicitly set the `use_muon` flag on `model.parameters()`, as shown in the
test
https://github.com/deepspeedai/DeepSpeed/blob/master/tests/unit/ops/muon/test_muon.py#L27
.

This PR integrates the setting of `use_muon` into DeepSpeed before engine
initialization, which makes the Muon optimizer easier to use: the user only
needs to change the optimizer in `config.json` from `AdamW` to `Muon`, with
no code changes (see the config sketch below). It resolves
https://github.com/deepspeedai/DeepSpeed/issues/7552
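
A config sketch of the simplified usage; hyperparameter values are illustrative only:

```python
ds_config = {
    "train_batch_size": 8,
    "optimizer": {
        "type": "Muon",            # previously "AdamW"; no model-code changes needed
        "params": {"lr": 2e-2},
    },
}
# engine, optimizer, _, _ = deepspeed.initialize(model=model, config=ds_config,
#                                                model_parameters=model.parameters())
```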

---------

Signed-off-by: Ma, Guokai <guokai.ma@intel.com>
Co-authored-by: Olatunji Ruwase <tunji.ruwase@snowflake.com>
Co-authored-by: Masahiro Tanaka <81312776+tohtana@users.noreply.github.com>
2025-09-17 09:52:11 -04:00
aa539c6dd5 fix npu device_id AttributeError issue (#7560)
## Environment
```
torch        2.7.1
torch_npu    2.7.1rc1
deepspeed    0.17.3
```
## Issue
An `AttributeError` has been raised when calling `init_process_group` on an
NPU device since deepspeed v0.17.3.
The issue is similar to
https://github.com/deepspeedai/DeepSpeed/pull/7488.

Trace:
```
Traceback (most recent call last):
  File "/home/welsper/.local/lib/python3.10/site-packages/swift/cli/sft.py", line 10, in <module>
    sft_main()
  File "/home/welsper/.local/lib/python3.10/site-packages/swift/llm/train/sft.py", line 331, in sft_main
    return SwiftSft(args).main()
  File "/home/welsper/.local/lib/python3.10/site-packages/swift/llm/train/sft.py", line 27, in __init__
    super().__init__(args)
  File "/home/welsper/.local/lib/python3.10/site-packages/swift/llm/base.py", line 19, in __init__
    self.args = self._parse_args(args)
  File "/home/welsper/.local/lib/python3.10/site-packages/swift/llm/base.py", line 31, in _parse_args
    args, remaining_argv = parse_args(self.args_class, args)
  File "/home/welsper/.local/lib/python3.10/site-packages/swift/utils/utils.py", line 152, in parse_args
    args, remaining_args = parser.parse_args_into_dataclasses(argv, return_remaining_strings=True)
  File "/home/welsper/.local/lib/python3.10/site-packages/transformers/hf_argparser.py", line 358, in parse_args_into_dataclasses
    obj = dtype(**inputs)
  File "<string>", line 325, in __init__
  File "/home/welsper/.local/lib/python3.10/site-packages/swift/llm/argument/train_args.py", line 175, in __post_init__
    self.training_args = TrainerFactory.get_training_args(self)
  File "/home/welsper/.local/lib/python3.10/site-packages/swift/trainers/trainer_factory.py", line 70, in get_training_args
    return training_args_cls(**args_dict)
  File "<string>", line 167, in __init__
  File "/home/welsper/.local/lib/python3.10/site-packages/swift/trainers/arguments.py", line 152, in __post_init__
    super().__post_init__()
  File "/home/welsper/.local/lib/python3.10/site-packages/swift/trainers/arguments.py", line 133, in __post_init__
    super().__post_init__()
  File "/home/welsper/.local/lib/python3.10/site-packages/transformers/training_args.py", line 1803, in __post_init__
    self.device
  File "/home/welsper/.local/lib/python3.10/site-packages/transformers/training_args.py", line 2332, in device
    return self._setup_devices
  File "/home/welsper/.local/lib/python3.10/site-packages/transformers/utils/generic.py", line 74, in __get__
    cached = self.fget(obj)
  File "/home/welsper/.local/lib/python3.10/site-packages/transformers/training_args.py", line 2259, in _setup_devices
    self.distributed_state = PartialState(**accelerator_state_kwargs)
  File "/home/welsper/.local/lib/python3.10/site-packages/accelerate/state.py", line 216, in __init__
    dist.init_distributed(dist_backend=self.backend, auto_mpi_discovery=False, **kwargs)
  File "/home/welsper/.local/lib/python3.10/site-packages/deepspeed/comm/comm.py", line 854, in init_distributed
    cdb = TorchBackend(dist_backend, timeout, init_method, rank, world_size)
  File "/home/welsper/.local/lib/python3.10/site-packages/deepspeed/comm/torch.py", line 120, in __init__
    self.init_process_group(backend, timeout, init_method, rank, world_size)
  File "/home/welsper/.local/lib/python3.10/site-packages/deepspeed/comm/torch.py", line 163, in init_process_group
    torch.distributed.init_process_group(backend, **kwargs)
  File "/home/welsper/.local/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 81, in wrapper
    return func(*args, **kwargs)
  File "/home/welsper/.local/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 95, in wrapper
    func_return = func(*args, **kwargs)
  File "/home/welsper/.local/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1717, in init_process_group
    default_pg, _ = _new_process_group_helper(
  File "/home/welsper/.local/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1831, in _new_process_group_helper
    if device_id is not None and (device_id.index is None or device_id.type == "cpu"):
AttributeError: 'device' object has no attribute 'index'
```

## Fix
Switch `torch.npu.device(device_index)` to `torch.device('npu',
device_index)`.

Now:

d40a0f5de8/accelerator/npu_accelerator.py (L47-L48)

After fix:
```python
 def device(self, device_index=None): 
     return torch.device('npu', device_index) 
```

Signed-off-by: welsper <welsper@qq.com>
Co-authored-by: welsper <xinyuyang@cmbchina.com>
Co-authored-by: Olatunji Ruwase <tunji.ruwase@snowflake.com>
Co-authored-by: Ma, Guokai <guokai.ma@gmail.com>
2025-09-17 15:46:33 +08:00
2d84be8159 deepcompile: Create a full list of no-copy ops (#7562)
The list of torch no-copy ops is hard-coded and does not include all
operations that may alias their inputs in their outputs.

Instead of using a fixed list, iterate over all ops under torch.ops.aten
and identify those with aliasing behavior by inspecting their schemas.
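
One way to do that discovery (an illustrative sketch; not necessarily the PR's exact code):

```python
import torch

def find_aliasing_aten_ops():
    # An op counts as "no-copy" here if its schema declares that some output
    # aliases an input, i.e. alias_info is set on a return value.
    aliasing = set()
    for schema in torch._C._jit_get_all_schemas():
        if not schema.name.startswith("aten::"):
            continue
        if any(ret.alias_info is not None for ret in schema.returns):
            aliasing.add(schema.name)
    return aliasing
```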

With PyTorch 2.7.1, the default overloads of the ops identified by the
updated logic include:

  - _nested_view_from_buffer
  - _reshape_alias
  - alias
  - as_strided
  - conj
  - detach
  - diagonal
  - expand
  - imag
  - lift_fresh
  - narrow
  - permute
  - pin_memory
  - positive
  - real
  - reshape
  - squeeze
  - t
  - unfold
  - unsqueeze
  - view
  - view_as_complex
  - view_as_real
  - most operations whose name ends with an underscore

Signed-off-by: Junjie Mao <junjie.mao@linux.alibaba.com>
2025-09-16 09:05:11 -07:00
e9d5d416cc deepcompile: Record graph order using OrderedDict (#7563)
On clear, GraphOrder does not clear ordered_frames. That may confuse
subsequent passes after the first iteration.

Use an OrderedDict to record the mapping from frame IDs to other
graph-related information.

Also fix the type annotation of graph_order, which is a list of
(int, bool) tuples rather than a list of ints.

Signed-off-by: Junjie Mao <junjie.mao@linux.alibaba.com>
2025-09-16 05:25:32 +00:00
660ee89529 deepcompile: Create dummy inputs using empty_strided (#7564)
CUDA tensors may have a larger storage than numel() * dtype.itemsize due
to alignment considerations. Creating dummy tensors via
torch.zeros().as_strided() leads to out-of-bounds errors in such cases.

Create dummy inputs with empty_strided().zero_() instead.
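
A sketch of the allocation pattern (the helper name is hypothetical):

```python
import torch

def make_dummy(size, stride, dtype, device):
    # empty_strided() allocates storage that matches the original (possibly
    # padded) strides, so zero_() never writes out of bounds.
    return torch.empty_strided(size, stride, dtype=dtype, device=device).zero_()
```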

Signed-off-by: Junjie Mao <junjie.mao@linux.alibaba.com>
2025-09-15 14:19:06 -07:00
d40a0f5de8 Add dependency for deepcompile test (#7558)
This PR adds dependency to CI tests for DeepCompile.

---------

Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
2025-09-13 00:45:08 -07:00
b9bd03a2ec Move modal tests to tests/v1 (#7557)
This PR moves active tests under `tests/unit/v1` to clarify which tests
are run on modal.

---------

Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
2025-09-12 17:28:47 -04:00
0e859aa0d3 Fix gradient buffer access for DeepCompile Z1/2 (#7548)
The initialization of DeepCompile+Z1/2 now fails due to the change
introduced in #7509.

This PR resolves the issue by:
- Adding an argument to optimizer.get_flat_partition
- Skipping the entire allreduce function in the engine

---------

Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
2025-09-10 18:12:02 +00:00
0012ff6ea8 Limit random seed range in tests (#7553)
`pytest-randomly` often passes a large seed value to `set_random_seed`
and causes an error
([example](https://github.com/deepspeedai/DeepSpeed/actions/runs/17620450004/job/50064585974))
```
E ValueError: Seed must be between 0 and 2**32 - 1
```

This PR limits the range of seed values by taking a modulo.
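
The clamping is just a modulo into the accepted range (a sketch; the helper name is hypothetical):

```python
def clamp_seed(seed: int) -> int:
    # torch/numpy require 0 <= seed < 2**32
    return seed % (2**32)
```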

Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
2025-09-10 17:45:37 +00:00
8cbbbb539d [MoE] Fix misuse of num_experts as expert parallel group size (ep_size) (#7551)
Fixes #7535 

## Description
This PR fixes a bug in inference/engine.py where num_experts
(moe_experts) was incorrectly passed as the expert parallel group size
(ep_size) when creating expert parallel groups.

Currently:
```
if moe and dist.get_world_size() > 1:
    self._create_ep_parallel_group(config.moe.moe_experts)
```
This causes **invalid** behavior whenever `num_experts > world_size`,
because `_create_ep_parallel_group` expects a group size, not the total
number of experts, as pointed out by @Arnoochka.

## Root Cause

num_experts = number of experts inside the MoE layer.

ep_size = how many GPUs to group together for expert parallelism.

These were mixed up in the code.

## Fix

Replaced the incorrect call with the proper ep_size argument:
```
if moe and dist.get_world_size() > 1:
    self._create_ep_parallel_group(config.moe.ep_size)
```


Additionally, added a safety check in _create_ep_parallel_group to catch
invalid configurations:

```
num_ep_groups = dist.get_world_size() // moe_ep_size
if num_ep_groups == 0:
    raise ValueError(
        f"Invalid ep_size={moe_ep_size} for world_size={dist.get_world_size()}"
    )
```
## Backward compatibility
- If a user was already running with ep_size >= num_experts, the old
code worked and continues to work correctly.
- Only the previously broken case (num_experts > world_size) now works
correctly.

Signed-off-by: Flakes342 <ayushtanwar1729@gmail.com>
2025-09-09 22:31:44 -07:00
533e834b0a [alstn tutorial] support bs>1 (#7550)
Edit the tutorial's demo code to support bs>1 and prevent division by zero.
2025-09-09 12:51:42 -07:00
450b965efb Revert "Add index to HPU devices (#7497)" (#7545)
This reverts commit 047a7599d24622dfb37fa5e5a32c671b1bb44233.

Unfortunately, the above required substantial redesign of existing HPU
stack, which is currently not feasible, so reverting.

Signed-off-by: Max Kovalenko <mkovalenko@habana.ai>
Co-authored-by: Olatunji Ruwase <tunji.ruwase@snowflake.com>
2025-09-08 18:07:55 -04:00