2958 Commits

533e834b0a [alstn tutorial] support bs>1 (#7550)
Edit the tutorial's demo code to support bs>1 and prevent division by zero
2025-09-09 12:51:42 -07:00
450b965efb Revert "Add index to HPU devices (#7497)" (#7545)
This reverts commit 047a7599d24622dfb37fa5e5a32c671b1bb44233.

Unfortunately, the above required a substantial redesign of the existing HPU
stack, which is currently not feasible, so it is being reverted.

Signed-off-by: Max Kovalenko <mkovalenko@habana.ai>
Co-authored-by: Olatunji Ruwase <tunji.ruwase@snowflake.com>
2025-09-08 18:07:55 -04:00
b82ef716c8 Improve error message and reduce validation in autocast test (#7547)
This PR improves error logging and relaxes loss value checks in the
autocast test.

Previously, the test displayed error messages and mismatched loss values
on all ranks, even if the mismatch only occurred on some ranks. This was
confusing, since logs from other ranks could appear correct. This PR
changes the behavior so that error messages are shown only on the ranks
where the mismatch occurs.

Additionally, this PR skips loss value validation for
`test_lower_precision_model`, where we intentionally use a different
communication dtype from the baseline (standard PyTorch autocast).

---------

Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
Co-authored-by: Stas Bekman <stas00@users.noreply.github.com>
2025-09-05 07:04:18 +00:00
08879a3916 avoid setting device_id to init_process_group (#7542)
In some use cases, such as vLLM, we need to create new distributed groups not
only on GPU but also on CPU. If we set `device_id` here, it prevents us from
creating a new distributed group on CPU:
[L230](https://github.com/vllm-project/vllm/blob/main/vllm/distributed/parallel_state.py#L230).
This PR fixes this bug.
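
A minimal sketch of the pattern this unblocks (illustrative only; it assumes a distributed launch such as torchrun and is not vLLM's actual code):

```python
import torch
import torch.distributed as dist

# Per the PR description: if the default group is created with device_id, later
# creation of a CPU (gloo) subgroup can be blocked. Leaving device_id unset keeps
# the pattern below working.
dist.init_process_group(backend="nccl")        # no device_id passed
cpu_group = dist.new_group(backend="gloo")     # CPU group, as vLLM needs (see L230 link above)
```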

---------

Signed-off-by: Liu, Kaixuan <kaixuan.liu@intel.com>
Co-authored-by: Olatunji Ruwase <tunji.ruwase@snowflake.com>
Co-authored-by: Stas Bekman <stas00@users.noreply.github.com>
2025-09-05 06:06:26 +00:00
78a74874b2 fix get_cuda_compile_flag (#7521)
Running the command `python3 -c 'import deepspeed;deepspeed.ops.adam.cpu_adam.CPUAdamBuilder().load()'`
on the ROCm platform encounters an error:

```
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/deepspeed/ops/op_builder/builder.py", line 538, in load
    return self.jit_load(verbose)
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/deepspeed/ops/op_builder/builder.py", line 570, in jit_load
    cxx_args = self.strip_empty_entries(self.cxx_args())
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/deepspeed/ops/op_builder/builder.py", line 401, in strip_empty_entries
    return [x for x in args if len(x) > 0]
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/deepspeed/ops/op_builder/builder.py", line 401, in <listcomp>
    return [x for x in args if len(x) > 0]
TypeError: object of type 'NoneType' has no len()
```

Compared with version 0.16.5
(https://github.com/deepspeedai/DeepSpeed/blob/v0.16.5/op_builder/builder.py#L435),
the current code is missing a `return` when `self.is_rocm_pytorch()` is True.
Adding `return '-D__DISABLE_CUDA__'` fixes it.
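
A hedged sketch of the fixed method; the non-ROCm branch is an assumption inferred from the linked v0.16.5 code, and only the missing ROCm `return` is what this fix adds:

```python
def get_cuda_compile_flag(self):
    # ROCm builds must disable CUDA-specific code paths; the bug was that this
    # branch computed the flag but fell through without returning it.
    if self.is_rocm_pytorch():
        return '-D__DISABLE_CUDA__'
    # Assumed non-ROCm counterpart, per the v0.16.5 reference linked above.
    return '-D__ENABLE_CUDA__'
```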

---------

Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Olatunji Ruwase <tunji.ruwase@snowflake.com>
2025-09-04 12:34:17 -04:00
43537d0a60 Autotune ZenFlow affinity (#7506)
This PR addresses the following ZenFlow optimizer core-binding issue:
https://github.com/deepspeedai/DeepSpeed/issues/7478

With this PR, the ZenFlow optimizer worker derives its core binding from
DeepSpeed's core-binding mechanism. The algorithm is as follows (a sketch
follows the list):
1. Each DeepSpeed rank gets its core binding via the DeepSpeed command-line
flag `--bind_cores_to_rank`, which assigns distinct physical CPU cores to each
worker.
2. When spawning the ZenFlow optimizer worker, DeepSpeed splits the current
CPU affinity list into two sublists: pt_affinity and zf_affinity.
3. zf_affinity is used to set the affinity of the ZenFlow optimizer worker;
pt_affinity is used for the current PyTorch process.
4. By default, one core is reserved for each PyTorch process and the rest are
used by the ZenFlow optimizer worker. The number of cores reserved for the
PyTorch process can be changed via the ZenFlow config variable
`pt_reserved_cores`.
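
A minimal sketch of the split described above (an illustrative helper, not the actual DeepSpeed code; `pt_reserved_cores` matches the config knob named in the description):

```python
import os

def split_affinity(pt_reserved_cores=1):
    # Cores assigned to this rank, e.g. by `deepspeed --bind_cores_to_rank`.
    cores = sorted(os.sched_getaffinity(0))
    pt_affinity = cores[:pt_reserved_cores]            # kept by the PyTorch process
    zf_affinity = cores[pt_reserved_cores:] or cores   # handed to the ZenFlow optimizer worker
    return pt_affinity, zf_affinity

# The spawned optimizer worker would then call os.sched_setaffinity(0, zf_affinity),
# and the training process os.sched_setaffinity(0, pt_affinity).
```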

---------

Signed-off-by: Guokai Ma <guokai.ma@gmail.com>
Signed-off-by: Ma, Guokai <guokai.ma@intel.com>
Signed-off-by: aeeeeeep <aeeeeeep@proton.me>
Signed-off-by: Olatunji Ruwase <tunji.ruwase@snowflake.com>
Co-authored-by: Stas Bekman <stas00@users.noreply.github.com>
Co-authored-by: aeeeeeep <aeeeeeep@proton.me>
Co-authored-by: Zhipeng Wang <zhipeng.rainbowserie@gmail.com>
Co-authored-by: Zhipeng Wang <zwanga@wustl.edu>
Co-authored-by: Peng Du <pedu@linkedin.com>
Co-authored-by: pengdurice <pengduhit@gmail.com>
Co-authored-by: Zhipeng Wang <zhipengbayern@gmail.com>
Co-authored-by: Olatunji Ruwase <tunji.ruwase@snowflake.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
2025-09-04 07:10:39 -04:00
66bf2a642d Relax restrictions of torch.autocast integration (#7543)
This PR relaxes two restrictions on torch.autocast in the DeepSpeed
engine:

1) Nesting torch.autocast
Currently, we do not expect `torch.autocast` to be used outside the
DeepSpeed engine. Here is the current behavior:
- If `torch.autocast` is enabled in the DeepSpeed config and the engine
detects it is also enabled outside, a warning is displayed.
- If it is disabled in the config, the engine raises an error.

This design prevents the following usage:
```python
with torch.autocast(...):
    logits = deepspeed_model(...)
    loss = criteria_fn(logits)
```
In this case, we also want to apply autocast to `criteria_fn`. With the
current behavior, we would need to move `deepspeed_model(...)` outside the
`torch.autocast` context, leading to inconsistent code between DeepSpeed
and non-DeepSpeed setups. (This cannot be handled with the `enabled` arg of
`torch.autocast`.)

Change in this PR:
`torch.autocast` outside the DeepSpeed engine is ignored, and
- If `torch_autocast` is enabled in the config, DeepSpeed will follow
that setting.
- If it is disabled, DeepSpeed falls back to its own mixed-precision
support (or FP32).

In these cases, the DeepSpeed engine shows a message explaining the behavior.

2) Model’s dtype

Previously, DeepSpeed assumed the model’s dtype must be FP32 when
`torch.autocast` was enabled. However, models with lower-precision
parameters (e.g., BF16) can also be used with autocast. For example, if
both the model and `torch.autocast` use BF16, autocast will upcast
precision-sensitive ops as needed.

Change in this PR:
Removed the assertion that restricted the model’s dtype to FP32.

This PR also adds and updates tests to cover these new behaviors.
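
A hedged sketch of the newly allowed pattern; the `torch_autocast` section name comes from the description above, while its sub-keys and the placeholders `model`, `criteria_fn`, `inputs`, and `labels` are illustrative assumptions:

```python
import torch
import deepspeed

ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "torch_autocast": {"enabled": True, "dtype": "bfloat16"},  # engine follows this setting
}
engine, _, _, _ = deepspeed.initialize(model=model, config=ds_config)

with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
    logits = engine(inputs)             # the outer autocast is ignored by the engine
    loss = criteria_fn(logits, labels)  # but still covers the loss computation
```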

---------

Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
2025-09-03 12:15:10 -07:00
8af75487f4 Fix zenflow_torch_adam.py (#7544)
The `_disable_dynamo_if_unsupported` fallback wasn't getting created under
certain conditions. This PR fixes that and also removes a debug print.

Fixes an issue installing DeepSpeed on torch 2.4 and 2.1 that triggered
this:
```
#42 15.84       Traceback (most recent call last):
#42 15.84         File "<string>", line 2, in <module>
#42 15.84         File "<pip-setuptools-caller>", line 34, in <module>
#42 15.84         File "/tmp/pip-install-qgzd6ybt/deepspeed_b3b4858a062d49c7b8d6ef31332a96cf/setup.py", line 40, in <module>
#42 15.84           from op_builder import get_default_compute_capabilities, OpBuilder
#42 15.84         File "/tmp/pip-install-qgzd6ybt/deepspeed_b3b4858a062d49c7b8d6ef31332a96cf/op_builder/__init__.py", line 18, in <module>
#42 15.84           import deepspeed.ops.op_builder  # noqa: F401 # type: ignore
#42 15.84         File "/tmp/pip-install-qgzd6ybt/deepspeed_b3b4858a062d49c7b8d6ef31332a96cf/deepspeed/__init__.py", line 25, in <module>
#42 15.84           from . import ops
#42 15.84         File "/tmp/pip-install-qgzd6ybt/deepspeed_b3b4858a062d49c7b8d6ef31332a96cf/deepspeed/ops/__init__.py", line 6, in <module>
#42 15.84           from . import adam
#42 15.84         File "/tmp/pip-install-qgzd6ybt/deepspeed_b3b4858a062d49c7b8d6ef31332a96cf/deepspeed/ops/adam/__init__.py", line 9, in <module>
#42 15.84           from .zenflow_torch_adam import ZenFlowSelectiveAdamW
#42 15.84         File "/tmp/pip-install-qgzd6ybt/deepspeed_b3b4858a062d49c7b8d6ef31332a96cf/deepspeed/ops/adam/zenflow_torch_adam.py", line 685, in <module>
#42 15.84           @_disable_dynamo_if_unsupported(single_tensor_fn=_single_tensor_adamw)
#42 15.84       NameError: name '_disable_dynamo_if_unsupported' is not defined
#42 15.84       [WARNING] ZenFlow disabled: torch internal optimizer symbols could not be imported: cannot import name '_disable_dynamo_if_unsupported' from 'torch.optim.optimizer' (/usr/local/lib/python3.10/dist-packages/torch/optim/optimizer.py)
```
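
A hedged sketch of the guarded-fallback pattern the fix relies on (illustrative; not the exact code in zenflow_torch_adam.py):

```python
try:
    # Present on newer torch; absent on e.g. the torch 2.1/2.4 builds that hit the error above.
    from torch.optim.optimizer import _disable_dynamo_if_unsupported
except ImportError:
    def _disable_dynamo_if_unsupported(single_tensor_fn=None):
        # Fallback: a no-op decorator so the module still imports on older torch.
        def wrapper(fn):
            return fn
        return wrapper
```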

---------

Signed-off-by: Stas Bekman <stas@stason.org>
2025-09-03 18:14:18 +00:00
1e183a6a9d Fix scaling and allgather with torch.autocast (#7534)
This PR includes two fixes (see the sketch below):
- Use GradScaler only for FP16 (not for BF16).
- Fix dtype conversion for the ZeRO3 allgather:
  - The reduce hook should be called only once, even when a parameter is
    shared across multiple layers (tied parameters).
  - Currently, the hook is triggered at each tied layer because we
    temporarily set `.data` with a different dtype.
  - The fix ensures that the parameter consistently retains the same dtype.
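
A minimal sketch of the scaler rule (illustrative, not the engine's exact code):

```python
import torch

autocast_dtype = torch.bfloat16  # e.g. taken from the torch_autocast config
# Loss scaling is only needed for FP16; for BF16 the scaler stays disabled.
scaler = torch.cuda.amp.GradScaler(enabled=(autocast_dtype == torch.float16))
```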

---------

Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
Signed-off-by: Olatunji Ruwase <tunji.ruwase@snowflake.com>
Signed-off-by: Stas Bekman <stas@stason.org>
Signed-off-by: jakehemmerle <jakehemmerle@protonmail.com>
Signed-off-by: Qi Bin <qibin0506@users.noreply.github.com>
Co-authored-by: Olatunji Ruwase <tunji.ruwase@snowflake.com>
Co-authored-by: Stas Bekman <stas00@users.noreply.github.com>
Co-authored-by: digger yu <digger-yu@outlook.com>
Co-authored-by: Jake Hemmerle <jakehemmerle@gmail.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Qi Bin <qibin0506@users.noreply.github.com>
2025-09-03 01:22:19 +00:00
c07b3abf9a fixed DeepSpeedCPULion with ZeRO-Offload bug (#7531)
Fixed a DeepSpeedCPULion bug with ZeRO-Offload; see
[issues/7524](https://github.com/deepspeedai/DeepSpeed/issues/7524).

Signed-off-by: Qi Bin <qibin0506@users.noreply.github.com>
Co-authored-by: Olatunji Ruwase <tunji.ruwase@snowflake.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
2025-09-02 21:40:14 +00:00
4d83f3fe13 docs typo: lrrt.md, reference to cycle_min_lr should be cycle_max_lr (#7530)
Signed-off-by: Olatunji Ruwase <tunji.ruwase@snowflake.com>
Signed-off-by: jakehemmerle <jakehemmerle@protonmail.com>
Co-authored-by: Olatunji Ruwase <tunji.ruwase@snowflake.com>
Co-authored-by: Stas Bekman <stas00@users.noreply.github.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
2025-09-02 21:17:22 +00:00
9e4957eb30 [doc] fixing moe tutorial (#7538)
MoE tutorial fixes:
1. The CIFAR example has been moved - fix the URL.
2. Fix text and improve markup.

---------

Signed-off-by: Stas Bekman <stas@stason.org>
2025-09-02 16:53:15 -04:00
066d912052 [logging] less startup noise (#7526)
This PR removes some startup noise and enables removing the rest - especially
messages that are replicated rank-times and don't carry any informative
payload.

1. Add a `--log_level` flag which sets the launcher's logger to a desired
level - defaulting to `logging.INFO` for now for BC, but this will change
to `logging.WARNING` in v1.
2. Add a `--quiet/-q` flag which sets the launcher's logger to
`logging.ERROR`, which essentially disables startup info messages.
3. Change the logging defaults elsewhere to `logging.WARNING` (the main
impact is accelerator.py). Once DeepSpeed has started, the frameworks control
the log level for each rank, so the tricky part is these pre-start info logs.
This part breaks BC, as there is no machinery to set the logger level for
`real_accelerator.py`.
4. The builder is changed to non-verbose (BC breaking).

---------

Signed-off-by: Stas Bekman <stas@stason.org>
Co-authored-by: Olatunji Ruwase <tunji.ruwase@snowflake.com>
2025-09-02 19:14:57 +00:00
411e20a3f7 undo the revert (#7536)
Replay https://github.com/deepspeedai/DeepSpeed/pull/3019 as it got
reverted.
2025-09-02 14:24:48 -04:00
902e78c989 fix typo s/1014 /1024 (#7528)
Fix typos: s/1014/1024/ and s/was_interruptted/was_interrupted/.

Files modified:
- deepspeed/autotuning/scheduler.py
- deepspeed/autotuning/utils.py

Co-authored-by: Olatunji Ruwase <tunji.ruwase@snowflake.com>
2025-09-01 01:12:40 +00:00
eabb687ac1 ZeRO3: Improve mismatch detection (#7525)
ZeRO3 tracks DDP (SPMD) behavior by matching the values of various training
states across ranks. Some of these states are represented as lists, and
mismatches sometimes manifest as hangs during error detection. This PR
improves error detection by first validating the list lengths across
ranks before validating the list contents.

Motivated by
https://github.com/deepspeedai/DeepSpeed/issues/7461#issuecomment-3235146207
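
A hedged sketch of the length-first validation described above (an illustrative helper; the names are not the actual ZeRO3 internals):

```python
import torch
import torch.distributed as dist

def assert_list_lengths_match(local_list, device):
    # Compare list lengths across ranks before comparing contents, so a length
    # mismatch fails fast with a clear error instead of hanging in the content check.
    length = torch.tensor([len(local_list)], device=device)
    lengths = [torch.empty_like(length) for _ in range(dist.get_world_size())]
    dist.all_gather(lengths, length)
    observed = [int(t.item()) for t in lengths]
    if len(set(observed)) != 1:
        raise RuntimeError(f"ZeRO3 state list length mismatch across ranks: {observed}")
    # Only once lengths agree is it safe to validate the list contents element-wise.
```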

---------

Signed-off-by: Olatunji Ruwase <tunji.ruwase@snowflake.com>
Co-authored-by: Stas Bekman <stas00@users.noreply.github.com>
2025-08-31 17:57:10 -04:00
9bf215d213 Add riscv64 cpu support in deepspeed_shm_comm op (#7519)
This patch adds riscv64 support for the deepspeed_shm_comm
operator, enabling DeepSpeed to perform CPU training/inference on RISC-V
64-bit hosts for research purposes. Based on the discussion in pull #7387,
this patch refactors some of the original code to support multiple CPU
architectures.

The related tests have passed on x86 and RISC-V CPUs, and I successfully ran
Qwen2.5 on a RISC-V CPU:
```bash
(myenv) [root@openeuler-riscv64 DeepSpeed ]$ pytest tests/unit/comm/test_dist.py::TestDistInferenceAllReduce -vv
====================================================================== test session starts =======================================================================
platform linux -- Python 3.11.4, pytest-7.2.0, pluggy-1.6.0 -- /root/myenv/bin/python3
cachedir: .pytest_cache
hypothesis profile 'default'
rootdir: /root/ecosystem/DeepSpeed/tests, configfile: pytest.ini
plugins: mock-3.14.1, hypothesis-6.135.14, forked-1.6.0
collected 3 items

tests/unit/comm/test_dist.py::TestDistInferenceAllReduce::test[dtype0] PASSED                                                                              [ 33%]
tests/unit/comm/test_dist.py::TestDistInferenceAllReduce::test[dtype1] PASSED                                                                              [ 66%]
tests/unit/comm/test_dist.py::TestDistInferenceAllReduce::test[dtype2] PASSED                                                                              [100%]

(myenv) root@ubuntu-2204:~/soft-working-dir/DeepSpeed# pytest tests/unit/comm/test_dist.py::TestDistInferenceAllReduce -vv
====================================================================== test session starts =======================================================================
platform linux -- Python 3.12.3, pytest-7.2.0, pluggy-1.6.0 -- /root/soft-working-dir/myenv/bin/python3
cachedir: .pytest_cache
rootdir: /root/soft-working-dir/DeepSpeed/tests, configfile: pytest.ini
plugins: forked-1.6.0
collected 3 items

tests/unit/comm/test_dist.py::TestDistInferenceAllReduce::test[dtype0] PASSED                                                                              [ 33%]
tests/unit/comm/test_dist.py::TestDistInferenceAllReduce::test[dtype1] PASSED                                                                              [ 66%]
tests/unit/comm/test_dist.py::TestDistInferenceAllReduce::test[dtype2] PASSED                                                                              [100%]

```

---------

Signed-off-by: heyujiao99 <he.yujiao@sanechips.com.cn>
Co-authored-by: Olatunji Ruwase <tunji.ruwase@snowflake.com>
Co-authored-by: Ma, Guokai <guokai.ma@gmail.com>
2025-08-29 23:41:25 +08:00
e04fa3e679 Update README with ZenFlow release blog featured by PyTorch. (#7520)
**Main change:**
Add a bullet and link to the ZenFlow release blog in the latest-news section.

**Blog link:**

https://pytorch.org/blog/zenflow-stall-free-offloading-engine-for-llm-training/

---------

Signed-off-by: Tingfeng Lan <erc8gx@virginia.edu>
2025-08-28 13:28:08 -04:00
889f0ead27 Enable non-ZeRO mode (#7515)
Enabled via `stage=0`, which corresponds to DDP.
- Remove the hardwired path to bf16_optimizer.
- Enable `torch.autocast` for DDP training.
- Enable native mixed-precision DDP for bfloat16.
- Update the torch.autocast and native mixed-precision UTs.

Screenshot: https://github.com/user-attachments/assets/92904cdc-e312-46a4-943f-011eb5ab146a

---------

Signed-off-by: Olatunji Ruwase <tunji.ruwase@snowflake.com>
Co-authored-by: Stas Bekman <stas00@users.noreply.github.com>
2025-08-27 14:07:29 -04:00
66ad278048 Enabling Muon Optimizer in DeepSpeed (#7509)
Authorship: @pengdurice and @PKUWZP 

Related Issue: #7438

# Introduction

[Muon](https://arxiv.org/abs/2502.16982), a new optimizer that has
attracted the community's attention recently, shows promising results in
training large language models. Adding the Muon optimizer to DeepSpeed,
a popular OSS framework for large-scale training and inference, is
critically important for DeepSpeed users and developers. There has been
a [PR](https://github.com/deepspeedai/DeepSpeed/pull/7454) attempting
the adoption (huge thanks to @qimcis), which is a good starting point.
It still requires more substantial effort to make it fully compatible
and working within DeepSpeed. We are publishing this PR to fully enable
Muon optimizer capabilities for DeepSpeed.

# Issues and solutions
## Issues
1. With stage 1, 2 or 3, the optimizer states will be partitioned within
the same data parallel group. This means that each process is already
handling only parts of the model parameters and there is no need to use
the DP solution as in the
[code](https://github.com/KellerJordan/Muon/blob/master/muon.py#L195).
2. The parameters (and the gradients) will be flattened to a 1D vector
before being used in the optimizer, thus nullifying the major hypothesis
of the Muon optimizer: it works by orthogonalizing the update for each
matrix (dim >= 2). (A sketch of this orthogonalization follows below.)
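
For context, a hedged sketch of the orthogonalization step that issue 2 refers to; the coefficients follow the public Muon reference implementation and are not necessarily the code added in this PR:

```python
import torch

def newton_schulz_orthogonalize(G, steps=5, eps=1e-7):
    # Approximately orthogonalize a 2-D update matrix via Newton-Schulz iteration;
    # this is why Muon needs per-matrix (dim >= 2) updates rather than a flat 1-D buffer.
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G.float() / (G.norm() + eps)
    transposed = G.size(0) > G.size(1)
    if transposed:
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return (X.T if transposed else X).to(G.dtype)
```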

## Solutions
To solve the issues, we propose this new PR in which: 
1. We simplify the Muon code by
[removing](https://github.com/deepspeedai/DeepSpeed/compare/master...pengdurice:DeepSpeed:peng-add-muon-v1#diff-c9052994e41caee9ca88363749c10af08655f8019f08dc971c018663d25a3712R22)
the partitioning and Muon update logic.

2. We
[move](https://github.com/deepspeedai/DeepSpeed/compare/master...pengdurice:DeepSpeed:peng-add-muon-v1#diff-99dcf26ea2876ff5bbf05b5165c4133eaa0d0f36b170685643c2f7e2eb566addR1867)
the muon update to the
[get_flat_partition](https://github.com/deepspeedai/DeepSpeed/compare/master...pengdurice:DeepSpeed:peng-add-muon-v1#diff-99dcf26ea2876ff5bbf05b5165c4133eaa0d0f36b170685643c2f7e2eb566addR1848)
function of stage 1 and 2 DeepSpeedZeroOptimizer in which per parameter
gradients are collected before being flattened and used by the optimizer
to update the model parameters. Since each parameter is still in its
original shape, we can easily apply the muon updates.
3. We also save the momentum buffer into the optimizer's state so that we
get smooth convergence after resuming from saved checkpoints.
4. We added comprehensive unit tests to validate Muon Optimizer's
correctness and functionality.

# Future directions and roadmap
In the future, several follow-up works are of interest:
- [ ] Create a CPU offload version.
- [ ] Apply Muon to Stage 3
- [ ] Use the highly optimized version of Adam for the Adam part of
MuonWithAuxAdam optimizer.
- [ ] More efficient implementations, e.g.: a) add specialized kernels for the
Newton-Schulz iteration and Muon updates; b) parallelize updates across the
parameters (currently, each parameter is updated separately and
sequentially).

---------

Co-authored-by: Peng Du <pedu@linkedin.com>
Co-authored-by: pengdurice <pengduhit@gmail.com>
Co-authored-by: Zhipeng Wang <zhipengbayern@gmail.com>
Co-authored-by: Olatunji Ruwase <tunji.ruwase@snowflake.com>
2025-08-26 18:34:35 -07:00
e4662faffd Update TSC Committers (#7517)
Update the affiliations and the TSC Committers.

Co-authored-by: Zhipeng Wang <zwanga@wustl.edu>
2025-08-26 07:24:12 -04:00
38d1a9eb64 Fix assert when 'pp_int' object has no attribute 'custom_print_str' (#7507)
Fix the assert `'pp_int' object has no attribute 'custom_print_str'` raised
when tracking the deepspeed module with tracking/debug tools like
[objwatch](https://github.com/aeeeeeep/objwatch):

```python3
    import objwatch
    objwatch.watch(targets=[deepspeed], framework="torch.distributed", indexes=[0,], with_locals=True)
```

Signed-off-by: aeeeeeep <aeeeeeep@proton.me>
2025-08-25 10:57:08 -04:00
d9cb78683e CI funding shout out to modal.com (#7503)
modal.com has been sponsoring our CI - thank you, Modal! Add a shout
out.
2025-08-21 10:03:49 -07:00
bc8c0db3b4 Support DeepSpeed offload and reload states with ZeRO1 and ZeRO2 (#7421)
Please refer to https://github.com/deepspeedai/DeepSpeed/issues/7251

---------

Signed-off-by: lym <letusgo126@126.com>
Signed-off-by: Max Kovalenko <mkovalenko@habana.ai>
Signed-off-by: Alex Kiefer <alexkiefer51@gmail.com>
Signed-off-by: Stas Bekman <stas@stason.org>
Signed-off-by: Sam Foreman <saforem2@gmail.com>
Signed-off-by: Stas Bekman <stas.bekman@snowflake.com>
Signed-off-by: huanyuqu <yc37960@um.edu.mo>
Signed-off-by: weeknan <zhounan0431@163.com>
Signed-off-by: WoosungMyung <dntjd517@naver.com>
Signed-off-by: Nir Sonnenschein <nsonnenschein@habana.ai>
Signed-off-by: Junjie Mao <banxing.mjj@alibaba-inc.com>
Signed-off-by: vinceliu <lpnpcs@gmail.com>
Signed-off-by: Tingfeng Lan <erc8gx@virginia.edu>
Signed-off-by: Olatunji Ruwase <tunji.ruwase@snowflake.com>
Signed-off-by: Olatunji Ruwase <tjruwase@gmail.com>
Signed-off-by: Tunji Ruwase <tunji.ruwase@snowflake.com>
Signed-off-by: Yao, Matrix <matrix.yao@intel.com>
Signed-off-by: Yusen Wu <xrn4ub@virginia.edu>
Signed-off-by: cyy <cyyever@outlook.com>
Co-authored-by: Max Kovalenko <mkovalenko@habana.ai>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Masahiro Tanaka <81312776+tohtana@users.noreply.github.com>
Co-authored-by: Alexander Kiefer <56556451+alexk101@users.noreply.github.com>
Co-authored-by: Olatunji Ruwase <tunji.ruwase@snowflake.com>
Co-authored-by: Hongwei Chen <33092912+hwchen2017@users.noreply.github.com>
Co-authored-by: Stas Bekman <stas00@users.noreply.github.com>
Co-authored-by: Sam Foreman <saforem2@gmail.com>
Co-authored-by: Stas Bekman <stas.bekman@snowflake.com>
Co-authored-by: huanyuqu <55744355+huanyuqu@users.noreply.github.com>
Co-authored-by: weeknan <57584045+weeknan@users.noreply.github.com>
Co-authored-by: Olatunji Ruwase <tjruwase@gmail.com>
Co-authored-by: Aurick Qiao <aurick.qiao@snowflake.com>
Co-authored-by: Logan Adams <loadams@microsoft.com>
Co-authored-by: Zhipeng Wang <zhipeng.rainbowserie@gmail.com>
Co-authored-by: WoosungMyung <115716986+WoosungMyung@users.noreply.github.com>
Co-authored-by: Nir Sonnenschein <nsonnenschein@habana.ai>
Co-authored-by: Junjie Mao <junjie.mao@hotmail.com>
Co-authored-by: Junjie Mao <banxing.mjj@alibaba-inc.com>
Co-authored-by: lpnpcs <lpnpcs@vip.qq.com>
Co-authored-by: Ma, Guokai <guokai.ma@gmail.com>
Co-authored-by: Tingfeng Lan <tafflann@outlook.com>
Co-authored-by: Rui Yan <49115835+yanrui27@users.noreply.github.com>
Co-authored-by: Feng Yunlong <20281571+AlongWY@users.noreply.github.com>
Co-authored-by: Yao Matrix <matrix.yao@intel.com>
Co-authored-by: Tingfeng Lan <erc8gx@virginia.edu>
Co-authored-by: Yusen Wu <xrn4ub@virginia.edu>
Co-authored-by: Yuanyuan Chen <cyyever@outlook.com>
Co-authored-by: Michael Wyatt <michael.wyatt@snowflake.com>
2025-08-20 22:03:26 +00:00
f45159e415 Update version.txt after 0.17.5 release (#7502) 2025-08-20 21:41:57 +00:00
047a7599d2 Add index to HPU devices (#7497)
The [PR #7266](https://github.com/deepspeedai/DeepSpeed/pull/7266)
enforces that devices have explicit device indices (e.g., 'hpu:0',
'cuda:0', etc.).

This PR aligns HPU devices with that requirement.

Signed-off-by: Max Kovalenko <mkovalenko@habana.ai>
v0.17.5
2025-08-19 00:30:56 +00:00
8cf5fc5787 Reduce performance impact of compiler.enable decorator (#7498)
For some accelerators (such as HPU) running in non-compile scenarios,
the `compiler.enable` decorator can cause significant performance drops
of up to 8-12%.

We can easily avoid the performance hit in non-compile scenarios by
detecting whether compilation is in progress and returning immediately
(see the sketch below).
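
A minimal sketch of the early-return idea (illustrative, not the actual `deepspeed.runtime.compiler` code; `torch.compiler.is_compiling()` is assumed available, while older torch exposes `torch._dynamo.is_compiling()`):

```python
import functools
import torch

def enable(fn):
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        # Outside of a torch.compile trace, skip all decorator bookkeeping and
        # call straight through, avoiding the overhead mentioned above.
        if not torch.compiler.is_compiling():
            return fn(*args, **kwargs)
        # ... compile-path handling would go here ...
        return fn(*args, **kwargs)
    return wrapper
```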

Signed-off-by: Max Kovalenko <mkovalenko@habana.ai>
Co-authored-by: Olatunji Ruwase <tunji.ruwase@snowflake.com>
2025-08-18 22:04:10 +00:00
12b4dc19a7 Fix DeepCompile for PyTorch v2.8 (#7496)
This PR updates the kernel generation function arguments in Inductor to
ensure DeepCompile is compatible with PyTorch v2.8.
It also fixes the logging output of DeepCompile.
2025-08-18 12:12:59 -04:00
1c03d1b1bb Fix invalid f-strings (#7457)
Fix invalid f-strings detected by ruff.

---------

Signed-off-by: cyy <cyyever@outlook.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Olatunji Ruwase <tunji.ruwase@snowflake.com>
Co-authored-by: Michael Wyatt <michael.wyatt@snowflake.com>
2025-08-16 18:22:19 +00:00
1d7b90adc4 Add Zenflow code for Stage 1 & 2 (#7391)
This PR adds ZenFlow, an importance-aware offloaded training framework
for DeepSpeed ZeRO. ZenFlow enables multi-step overlap between
computation and communication during offloaded training, improving GPU
utilization and reducing stalls.

Highlights:
- New ZenFlow optimizers (ZenFlowCPUAdam, ZenFlowSelectiveAdamW)
- ZenFlowZeroOptimizer for ZeRO Stage 1/2 integration
- Configurable via ZenFlowConfig, integrated with DeepSpeedZeroConfig
- Unit tests and documentation included

Note: This PR focuses on Stage 1 and 2 integration. Stage 3 support will
be introduced in a follow-up PR.

---------

Signed-off-by: Tingfeng Lan <erc8gx@virginia.edu>
Signed-off-by: Yusen Wu <xrn4ub@virginia.edu>
Signed-off-by: Olatunji Ruwase <tunji.ruwase@snowflake.com>
Co-authored-by: Yusen Wu <xrn4ub@virginia.edu>
Co-authored-by: Masahiro Tanaka <81312776+tohtana@users.noreply.github.com>
Co-authored-by: Olatunji Ruwase <tunji.ruwase@snowflake.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Olatunji Ruwase <tjruwase@gmail.com>
Co-authored-by: Guokai Ma <guokai.ma@gmail.com>
2025-08-15 17:32:22 +00:00
33cd94500e fix xpu device_id AttributeError issue (#7488)
# Reproduce
w/ PyTorch 2.8
```
$ git clone https://github.com/huggingface/trl.git
$ cd ./trl
$ accelerate launch     --config_file examples/accelerate_configs/deepspeed_zero3.yaml     examples/scripts/sft_gpt_oss.py     --torch_dtype bfloat16     --model_name_or_path openai/gpt-oss-20b     --packing true packing_strategy wrapped     --run_name 20b-full-eager     --attn_implementation sdpa     --dataset_num_proc 6     --dataset_name HuggingFaceH4/Multilingual-Thinking     --gradient_checkpointing     --max_length 4096     --per_device_train_batch_size 1     --num_train_epochs 1     --logging_steps 1     --warmup_ratio 0.03     --lr_scheduler_type cosine_with_min_lr     --lr_scheduler_kwargs '{"min_lr_rate": 0.1}'     --output_dir gpt-oss-20b-multilingual-reasoner     --report_to trackio     --seed 42
```

# Issue

```
File "/workspace/accelerate/src/accelerate/state.py", line 216, in __init__
    dist.init_distributed(dist_backend=self.backend, auto_mpi_discovery=False, **kwargs)
File "/usr/local/lib/python3.12/dist-packages/deepspeed/comm/comm.py", line 854, in init_distributed
    cdb = TorchBackend(dist_backend, timeout, init_method, rank, world_size)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/deepspeed/comm/torch.py", line 120, in __init__
    self.init_process_group(backend, timeout, init_method, rank, world_size)
File "/usr/local/lib/python3.12/dist-packages/deepspeed/comm/torch.py", line 164, in init_process_group
    torch.distributed.init_process_group(backend, **kwargs)
File "/usr/local/lib/python3.12/dist-packages/torch/distributed/c10d_logger.py", line 81, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/distributed/c10d_logger.py", line 95, in wrapper
    func_return = func(*args, **kwargs)
                  ^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/distributed/distributed_c10d.py", line 1685, in init_process_group
    if device_id is not None and device_id.type != "cpu":
AttributeError: 'device' object has no attribute 'type'
```

# Root Cause
`torch.xpu.device` is a context manager in PyTorch rather than a device
class, so it doesn't have a `type` attribute.

# Fix
Switch to `torch.device` (see the sketch below).
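
A minimal sketch of the difference (illustrative; `local_rank` is a placeholder):

```python
import torch

local_rank = 0
# Broken: torch.xpu.device(local_rank) is a context manager and has no `.type`,
# which is what torch.distributed.init_process_group inspects.
# Fixed: a real torch.device carries the `.type` attribute.
device_id = torch.device("xpu", local_rank)
assert device_id.type == "xpu"
```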

Signed-off-by: Yao, Matrix <matrix.yao@intel.com>
Co-authored-by: Olatunji Ruwase <tunji.ruwase@snowflake.com>
2025-08-15 16:40:47 +00:00
64ac13f72e Enable forked PRs (#7486)
Enable forked PRs

---------

Signed-off-by: Olatunji Ruwase <tunji.ruwase@snowflake.com>
2025-08-14 17:43:08 -04:00
8aadf6cbe4 Fix pre-compile on cpu-only machines (#7168)
+ Fix pre-compile on cpu-only machines

---------

Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Olatunji Ruwase <tunji.ruwase@snowflake.com>
2025-08-12 10:24:30 -04:00
a54c394392 [TiledFusedLogitsLoss] support inference (#7477)
Adding inference support for `TiledFusedLogitsLoss` by skipping
`backward` inside `forward` if the incoming tensor doesn't require grad.

xref: https://github.com/snowflakedb/ArcticTraining/pull/259

---------

Signed-off-by: Stas Bekman <stas@stason.org>
Co-authored-by: Rui Yan <49115835+yanrui27@users.noreply.github.com>
Co-authored-by: Olatunji Ruwase <tunji.ruwase@snowflake.com>
2025-08-11 17:44:32 -04:00
d75196a098 [UlyssesSPDataLoaderAdapter] fix iterator reset (#7472)
Fixes https://github.com/snowflakedb/ArcticTraining/issues/254 - to
support multi-epoch training with `UlyssesSPDataLoaderAdapter`.

Thanks to @yanrui27 for the fix

Signed-off-by: Stas Bekman <stas@stason.org>
Co-authored-by: Rui Yan <49115835+yanrui27@users.noreply.github.com>
2025-08-11 20:45:10 +00:00
a12de38db6 Modal CI (#7289)
This is an initial effort to migrate CI onto Modal infra. This PR
creates two new workflows that run on Modal:
1. modal-torch-latest: a subset of nv-torch-latest-v100 that includes
`tests/unit/runtime/zero/test_zero.py`.
2. modal-accelerate: a full copy of nv-accelerate-v100. 

Follow-up PRs will selectively migrate relevant workflows onto Modal.

---------

Signed-off-by: Olatunji Ruwase <tunji.ruwase@snowflake.com>
Signed-off-by: Olatunji Ruwase <tjruwase@gmail.com>
Signed-off-by: Tunji Ruwase <tunji.ruwase@snowflake.com>
Co-authored-by: Stas Bekman <stas00@users.noreply.github.com>
Co-authored-by: Logan Adams <loadams@microsoft.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Olatunji Ruwase <tjruwase@gmail.com>
Co-authored-by: Stas Bekman <stas.bekman@snowflake.com>
2025-08-11 20:13:39 +00:00
8e02992332 fix deepspeed --venv_script (#7469)
Currently, passing `deepspeed ... --venv_script foo.sh` ends up with a
pdsh cmd like:

```
pdsh -S -f 1024 -w 10.4.11.15,10.4.10.1 source foo.sh export NCCL_NET_PLUGIN=blah; ...
```
As you can see, the `;` is missing before the exports start, so the first
export is ignored.

It should be:


```
pdsh -S -f 1024 -w 10.4.11.15,10.4.10.1 source foo.sh; export NCCL_NET_PLUGIN=blah; ...
```

This PR fixes it.
2025-08-11 19:12:54 +00:00
8c83e42ba1 Fix cpu CI (#7481)
Fix torch version

---------

Signed-off-by: Olatunji Ruwase <tunji.ruwase@snowflake.com>
2025-08-11 11:53:09 -07:00
cda3f9628c Add blog for ZenFlow (#7463)
This PR adds a blog post and images for ZenFlow, introducing its design,
benefits, and usage. The blog explains how ZenFlow improves GPU
utilization by overlapping computation and communication during
offloaded training.

See also:
- #7391 - core ZenFlow implementation.
- [#982](https://github.com/deepspeedai/DeepSpeedExamples/pull/982) -
benchmarking and fine-tuning example.

---------

Signed-off-by: Tingfeng Lan <erc8gx@virginia.edu>
Co-authored-by: Olatunji Ruwase <tunji.ruwase@snowflake.com>
Co-authored-by: Hongwei Chen <33092912+hwchen2017@users.noreply.github.com>
2025-08-10 08:50:34 -04:00
f03d416eae add --bind_cores_to_rank to zero offload tutorial (#7474)
In ZeRO offload, significant time is spent in CPUAdam, which is CPU
code. Thus, using `--bind_cores_to_rank` in the deepspeed launch command
helps improve ZeRO offload performance. This PR adds this flag to the
ZeRO offload tutorial to increase user awareness.

For Qwen2.5-3B finetuning on 2 A100-40GB cards, running on a CPU host with
128 CPU cores, the average step time is as follows (nearly a 1.3x performance
improvement):
without `--bind_cores_to_rank`: 3084.44 ms per step
with `--bind_cores_to_rank`: 2383.16 ms per step

---------

Co-authored-by: Olatunji Ruwase <tjruwase@gmail.com>
2025-08-08 10:34:29 -07:00
f897b67394 fix #7188 (#7371)
I found that when using DeepSpeed ZeRO-2 for my training task, the loss
becomes 0 at the third step with a grad_norm of 1.414. This issue
doesn't occur when using ZeRO-3. I found the same issue reported in #7188.
After conducting a series of experiments, I identified the cause: there's a
synchronization problem when using double ipg_buffer swapping. The issue
was resolved after making the modifications in this PR.

before 

![image](https://github.com/user-attachments/assets/981d0829-e15f-4899-ae2c-4eca16ef138d)

after

![image](https://github.com/user-attachments/assets/8b6b8403-d5df-4aa8-b573-195b9ee1fdfb)

Signed-off-by: vinceliu <lpnpcs@gmail.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Olatunji Ruwase <tunji.ruwase@snowflake.com>
Co-authored-by: Hongwei Chen <33092912+hwchen2017@users.noreply.github.com>
2025-08-04 18:36:40 +00:00
0aff6b2c20 Fix all-gather duplicate params and wrong dtype (#7462)
The following assertion error arises when torch autocast is enabled.

```
[rank3]: File "/opt/deepspeed/deepspeed/runtime/zero/partitioned_param_coordinator.py", line 337, in fetch_sub_module
[rank3]:     self.__inflight_param_registry.pop(param).wait(handle_dependency=not fast_fetch)
[rank3]: File "/opt/deepspeed/deepspeed/runtime/zero/partition_parameters.py", line 787, in wait
[rank3]:     handle.wait(handle_dependency)
[rank3]: File "/opt/deepspeed/deepspeed/utils/nvtx.py", line 20, in wrapped_fn
[rank3]:     ret_val = func(*args, **kwargs)
[rank3]: File "/opt/deepspeed/deepspeed/runtime/zero/partition_parameters.py", line 750, in wait
[rank3]:     assert param.ds_status == ZeroParamStatus.INFLIGHT, f"expected param {param.ds_summary()} to be inflight"
[rank3]: AssertionError: expected param {'id': 685, 'status': 'AVAILABLE', 'numel': 131334144, 'ds_numel': 131334144, 'shape': (32064, 4096), 'ds_shape': (32064, 4096), 'requires_grad': True, 'grad_shape': None, 'persist': False, 'active_sub_modules': set(), 'ds_tensor.shape': torch.Size([16416768])} to be inflight
```

This is due to multiple all-gather ops in the same coalesced all-gather
sharing the same list of params (of mixed dtypes).

The fix makes each all-gather exchange only params of a single dtype, and also
passes an allgather dtype that matches the params.
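
A hedged sketch of the grouping this fix implies (an illustrative helper, not the actual coordinator code):

```python
from collections import defaultdict

def group_params_by_dtype(params):
    # Coalesced all-gathers should only carry parameters of a single dtype;
    # mixing dtypes in one call is what produced the INFLIGHT assertion above.
    buckets = defaultdict(list)
    for p in params:
        buckets[p.dtype].append(p)
    return buckets  # launch one coalesced all-gather per dtype bucket, with a matching dtype
```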

Signed-off-by: Junjie Mao <banxing.mjj@alibaba-inc.com>
Co-authored-by: Junjie Mao <banxing.mjj@alibaba-inc.com>
Co-authored-by: Olatunji Ruwase <tjruwase@gmail.com>
2025-08-02 17:01:57 -07:00
1a8ad24f0d fix issues raised by Coverity scans (#7431)
This commit combines fixes for 37 potential code issues found in
Coverity scans.
the issues include but are not limited to potential access to
uninitialized variables, dead and redundant code.
We understand that reviewing such a commit can be difficult and will be
happy to help with any questions or changes required.

---------

Signed-off-by: Nir Sonnenschein <nsonnenschein@habana.ai>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Olatunji Ruwase <tunji.ruwase@snowflake.com>
2025-08-02 12:16:10 -04:00
0e51e09396 Add getter APIs for TP/PP/DP ranks in DeepSpeedEngine (#7427)
Thanks again for the opportunity to contribute to this community!
This PR is from Issue #7423.

1) Motivation

To improve compatibility with low-level profiling tools (e.g., NVIDIA
CUPTI or DCGM), it can be useful to expose the parallelism-specific ranks
(tensor/pipeline/data) at the engine level.

2) Changes

I added three getter methods to DeepSpeedEngine (usage sketch below):
  - get_tensor_parallel_rank()
  - get_pipeline_parallel_rank()
  - get_data_parallel_rank()
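
A hedged usage sketch (the method names come from the list above; `model` and `ds_config` are placeholders and engine setup is elided):

```python
import deepspeed

engine, _, _, _ = deepspeed.initialize(model=model, config=ds_config)
# Tag low-level profiler records (e.g. CUPTI/DCGM exports) with parallel ranks.
print("TP rank:", engine.get_tensor_parallel_rank())
print("PP rank:", engine.get_pipeline_parallel_rank())
print("DP rank:", engine.get_data_parallel_rank())
```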


Thank you for reviewing this contribution!

---------

Signed-off-by: WoosungMyung <dntjd517@naver.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
2025-08-01 22:23:11 +00:00
e1560d8499 Update README.md (#7465)
Add a Slack channel link to README.md.
2025-08-01 11:33:58 -07:00
1ae39b7d09 Update version.txt after v0.17.4 release (#7460) 2025-07-31 14:31:57 -07:00
5a202469fd Revert "Update version.txt after v0.17.4 release"
This reverts commit 2ff4b41b7982454faa4a1d11d61153b525630e0b.
2025-07-31 14:09:58 -07:00
2ff4b41b79 Update version.txt after v0.17.4 release 2025-07-31 13:49:58 -07:00
c4b1a8cb8f TiledFusedLogitsLoss bug fix (#7459)
Bug fix: a tuple and a list were mixed up.
v0.17.4
2025-07-31 08:59:09 -07:00
3292e07a92 adding TiledFusedLogitsLoss (#7437)
This PR adds `TiledFusedLogitsLoss` for efficient fused logits+loss
computation - this version pre-calculates grads in `forward`, avoiding
recomputation in the backward pass (similar to the Liger-Kernel
implementation). A sketch of the pattern follows.
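
A hedged sketch of the pattern (grads pre-computed in `forward`, replayed in `backward`); it is illustrative, omits the tiling, and is not the actual `TiledFusedLogitsLoss` implementation:

```python
import torch
import torch.nn.functional as F

class FusedLogitsLossSketch(torch.autograd.Function):

    @staticmethod
    def forward(ctx, hidden, weight, labels):
        # Compute the loss and the input grads now, so backward is a cheap replay.
        with torch.enable_grad():
            h = hidden.detach().requires_grad_(True)
            w = weight.detach().requires_grad_(True)
            loss = F.cross_entropy(h @ w.t(), labels)
            grad_h, grad_w = torch.autograd.grad(loss, (h, w))
        ctx.save_for_backward(grad_h, grad_w)
        return loss.detach()

    @staticmethod
    def backward(ctx, grad_out):
        grad_h, grad_w = ctx.saved_tensors
        return grad_out * grad_h, grad_out * grad_w, None

# usage: loss = FusedLogitsLossSketch.apply(hidden_states, lm_head_weight, labels)
```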

---------

Signed-off-by: Stas Bekman <stas@stason.org>
Co-authored-by: Aurick Qiao <aurick.qiao@snowflake.com>
2025-07-30 14:15:33 -04:00