2958 Commits

4686d5ef0b Update version after 0.17.1 release (#7345) 2025-06-09 21:03:16 -07:00
2ce5505799 Move pytest pinning from individual tests to requirements-dev.txt until fixed. (#7327)
pytest 8.4.0 seems to break a number of our tests. Rather than pinning
it in each test individually, we should just put the pin in the
requirements file until we resolve the issue.
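
For reference, a pin of this kind in `requirements-dev.txt` would look
something like the following (exact version bound assumed):

```
# keep pytest below the breaking 8.4.0 release until the tests are fixed
pytest<8.4.0
```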

---------

Co-authored-by: Olatunji Ruwase <tjruwase@gmail.com>
v0.17.1
2025-06-09 22:42:55 +00:00
4d0c159630 Fix docs that are rendering incorrectly (#7344)
Fixes #6747 

### Changes

- Added missing imports required for the documentation to render
correctly.
- Changed `autoclass_content` from `auto` to `both`; the value `auto` is
**not valid** according to the [Sphinx
documentation](https://www.sphinx-doc.org/en/master/usage/extensions/autodoc.html#confval-autoclass_content).
See the sketch after this list.
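
For illustration, the corrected `conf.py` setting (per the Sphinx docs,
the valid values are `class`, `both`, and `init`):

```python
# docs/conf.py: "both" concatenates the class docstring and the __init__
# docstring; "auto" is not a recognized value.
autoclass_content = "both"
```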


### Preview

Sample fixed page:
https://deepspeedfelixgondwefork.readthedocs.io/en/latest/model-checkpointing.html

Current broken page:
https://deepspeed.readthedocs.io/en/latest/model-checkpointing.html

---------

Signed-off-by: felixgondwe <zungwala@gmail.com>
Signed-off-by: Shaik Raza Sikander <srsikander@habana.ai>
Signed-off-by: Masahiro Tanaka <mtanaka@microsoft.com>
Signed-off-by: Olatunji Ruwase <olruwase@microsoft.com>
Signed-off-by: Olatunji Ruwase <tunji.ruwase@snowflake.com>
Signed-off-by: xiongjyu <xiongjyu@gmail.com>
Signed-off-by: Emmanuel Ferdman <emmanuelferdman@gmail.com>
Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Co-authored-by: Olatunji Ruwase <tjruwase@gmail.com>
Co-authored-by: Raza Sikander <srsikander@habana.ai>
Co-authored-by: Masahiro Tanaka <81312776+tohtana@users.noreply.github.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Ramya Ramineni <62723901+rraminen@users.noreply.github.com>
Co-authored-by: Olatunji Ruwase <tunji.ruwase@snowflake.com>
Co-authored-by: jerryyangli <jerryyangli@gmail.com>
Co-authored-by: Yang Li <yangli2@microsoft.com>
Co-authored-by: Guanhua Wang <alexwgh333@gmail.com>
Co-authored-by: Connor Holmes <connorholmes@microsoft.com>
Co-authored-by: Bing Xie <67908712+xiexbing@users.noreply.github.com>
Co-authored-by: cassieesvelt <73311224+cassieesvelt@users.noreply.github.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
Co-authored-by: Michael Wyatt <michaelwyatt@microsoft.com>
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
Co-authored-by: swli <47371259+lucasleesw@users.noreply.github.com>
Co-authored-by: Cheng Li <pistasable@gmail.com>
Co-authored-by: Molly Smith <112220543+molly-smith@users.noreply.github.com>
Co-authored-by: Ubuntu <jomayeri@microsoft.com>
Co-authored-by: Zhipeng Wang <zhipeng.rainbowserie@gmail.com>
Co-authored-by: xiongjyu <xiongjyu@gmail.com>
Co-authored-by: Emmanuel Ferdman <emmanuelferdman@gmail.com>
2025-06-09 13:15:44 -07:00
e440506bee Improve overflow handling in ZeRO (#6976)
Fix #5241: Improve overflow handling 
- [x] ZeRO 1
- [x] ZeRO 2
- [ ] ZeRO 3
- [ ] BF16Optimizer

Enable pydantic configuration for mixed precision (sketch after this list)
- [x] bf16
- [x] fp16
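
For illustration, a generic sketch of what pydantic-validated
mixed-precision settings look like; the field names here are
assumptions, not DeepSpeed's actual config schema:

```python
from pydantic import BaseModel

class FP16Config(BaseModel):
    # Hypothetical fields for illustration only.
    enabled: bool = False
    initial_scale_power: int = 16

class BF16Config(BaseModel):
    enabled: bool = False

# Invalid types or values now fail fast with a ValidationError.
cfg = FP16Config(enabled=True, initial_scale_power=12)
```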

---------

Signed-off-by: Olatunji Ruwase <olruwase@microsoft.com>
Signed-off-by: Fabien Dupont <fdupont@redhat.com>
Signed-off-by: Logan Adams <loadams@microsoft.com>
Signed-off-by: inkcherry <mingzhi.liu@intel.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Xinyu Lian <lian7@illinois.edu>
Co-authored-by: loadams <loadams@users.noreply.github.com>
Co-authored-by: Omar Elayan <142979319+oelayan7@users.noreply.github.com>
Co-authored-by: Fabio Geraci <118277438+fabiosanger@users.noreply.github.com>
Co-authored-by: Sam Foreman <saforem2@gmail.com>
Co-authored-by: Fabien Dupont <fabiendupont@fabiendupont.fr>
Co-authored-by: Liangliang Ma <1906710196@qq.com>
Co-authored-by: inkcherry <mingzhi.liu@intel.com>
Co-authored-by: Logan Adams <loadams@microsoft.com>
Co-authored-by: Olatunji Ruwase <tunji.ruwase@snowflake.com>
2025-06-09 17:30:51 +00:00
bb293aea5d Update folder name (#7343)
Sync folder name with release date

---------

Signed-off-by: Olatunji Ruwase <tunji.ruwase@snowflake.com>
2025-06-09 09:15:18 -07:00
05818e90d9 Fix LoRA arxiv reference (#7340)
## PR Summary
This small PR fixes the LoRA arxiv reference in
`mixed_precision_zeropp.md`. Relevant docs page:
https://www.deepspeed.ai/tutorials/mixed_precision_zeropp/

Signed-off-by: Emmanuel Ferdman <emmanuelferdman@gmail.com>
2025-06-07 14:09:01 -04:00
770967f5f0 fixed: Modified the topkgating function and updated the test_moe file for testing (#7163)
Since the previous PR ran into a DCO problem that could not be resolved,
I resubmitted a completely identical PR without that problem.

---------

Signed-off-by: xiongjyu <xiongjyu@gmail.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Olatunji Ruwase <tjruwase@gmail.com>
Co-authored-by: Masahiro Tanaka <81312776+tohtana@users.noreply.github.com>
2025-06-06 16:42:41 -07:00
24a1d8f936 DeepNVMe update (#7215)
- FastPersist
- ZeRO-Inference+SGLang

---------

Signed-off-by: Olatunji Ruwase <olruwase@microsoft.com>
Signed-off-by: Olatunji Ruwase <tunji.ruwase@snowflake.com>
Co-authored-by: jerryyangli <jerryyangli@gmail.com>
Co-authored-by: Yang Li <yangli2@microsoft.com>
Co-authored-by: Guanhua Wang <alexwgh333@gmail.com>
Co-authored-by: Connor Holmes <connorholmes@microsoft.com>
Co-authored-by: Bing Xie <67908712+xiexbing@users.noreply.github.com>
Co-authored-by: cassieesvelt <73311224+cassieesvelt@users.noreply.github.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Michael Wyatt <michaelwyatt@microsoft.com>
Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>
Co-authored-by: swli <47371259+lucasleesw@users.noreply.github.com>
Co-authored-by: Cheng Li <pistasable@gmail.com>
Co-authored-by: Molly Smith <112220543+molly-smith@users.noreply.github.com>
Co-authored-by: Ubuntu <jomayeri@microsoft.com>
Co-authored-by: Olatunji Ruwase <tunji.ruwase@snowflake.com>
Co-authored-by: Zhipeng Wang <zhipeng.rainbowserie@gmail.com>
2025-06-06 18:49:41 -04:00
cb3ad0c176 fp16 optimizer timers fix - TypeError: 'NoneType' object is not callable (#7330)
This fix is required to prevent the below error:

```log
=================================== FAILURES ===================================
__________________ TestFp8ComposabilityAcrossZero.test[fp16] __________________
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
  File "/opt/conda/envs/py_3.10/lib/python3.10/multiprocessing/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "/opt/conda/envs/py_3.10/lib/python3.10/multiprocessing/pool.py", line 51, in starmapstar
    return list(itertools.starmap(args[0], args[1]))
  File "/root/PR/test/DeepSpeed/tests/unit/common.py", line 322, in _dist_run
    raise e
  File "/root/PR/test/DeepSpeed/tests/unit/common.py", line 314, in _dist_run
    self.run(**self._fixture_kwargs)
  File "/root/PR/test/DeepSpeed/tests/unit/common.py", line 470, in run
    self._current_test(**fixture_kwargs)
  File "/root/PR/test/DeepSpeed/tests/unit/runtime/half_precision/test_fp8.py", line 88, in test
    loss = run_zero(stage, model_dtype)
  File "/root/PR/test/DeepSpeed/tests/unit/runtime/half_precision/test_fp8.py", line 74, in run_zero
    model.step()
  File "/root/PR/test/DeepSpeed/deepspeed/runtime/engine.py", line 2387, in step
    self._take_model_step(lr_kwargs)
  File "/root/PR/test/DeepSpeed/deepspeed/runtime/engine.py", line 2290, in _take_model_step
    self.optimizer.step()
  File "/root/PR/test/DeepSpeed/deepspeed/runtime/fp16/fused_optimizer.py", line 255, in step
    self.timers(OVERFLOW_CHECK_TIMER).start()
TypeError: 'NoneType' object is not callable
"""
```

Co-authored-by: Olatunji Ruwase <tjruwase@gmail.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Olatunji Ruwase <tunji.ruwase@snowflake.com>
2025-06-06 14:02:22 +00:00
7d0c3f782e Fix issue with symint input (#7243)
This PR fixes an issue with symint input in the backend (see #7229).

---------

Signed-off-by: Masahiro Tanaka <mtanaka@microsoft.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
2025-06-06 00:08:41 +00:00
2ad2011cc9 Fix pytest version to 8.3.5 in hpu-gaudi actions (#7337)
This is needed to avoid the CI failure in PR #7330.

Signed-off-by: Shaik Raza Sikander <srsikander@habana.ai>
Co-authored-by: Olatunji Ruwase <tjruwase@gmail.com>
2025-06-05 23:10:19 +00:00
d0f7091aa4 Update config_utils.py (#7333)
Fixes this warning:

```
 /fsx/qgallouedec/miniconda3/envs/trl/lib/python3.12/site-packages/deepspeed/runtime/config_utils.py:100: PydanticDeprecatedSince211: Accessing the 'model_fields' attribute on the instance is deprecated. Instead, you should access this attribute from the model class. Deprecated in Pydantic V2.11 to be removed in V3.0.
    fields = self.model_fields
```
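
The usual remedy, sketched here, is to read `model_fields` from the
model class rather than the instance:

```python
# Access the attribute on the class to avoid PydanticDeprecatedSince211.
fields = type(self).model_fields  # instead of: fields = self.model_fields
```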

Co-authored-by: Olatunji Ruwase <tjruwase@gmail.com>
2025-06-05 09:51:14 -07:00
b8d4b84260 Improve Ulysses Plus Docs (#7335)
Fix some minor indentation, typo, and list-numbering issues in the
Ulysses Plus tutorial.

---------

Co-authored-by: Stas Bekman <stas00@users.noreply.github.com>
2025-06-05 15:26:30 +00:00
097f0637d5 UlyssesPlus Docs take 2 (#7332)
Bare markdown URLs don't get automatically linked, so this fixes that.
2025-06-03 12:14:06 -07:00
81a47408c3 Ulysses Plus Docs (#7331)
The docs/tutorials for
https://github.com/deepspeedai/DeepSpeed/pull/7268

I also updated the previous Ulysses to clarify that it's for
Megatron-Deepspeed.

---------

Signed-off-by: Stas Bekman <stas@stason.org>
2025-06-03 11:20:41 -07:00
8f3c3e78ab Update version.txt after v0.17.0 release (#7326) 2025-06-02 16:22:32 -07:00
720787e79b Bump to v0.17.0 (#7324)
Co-authored-by: Logan Adams <loadams@microsoft.com>
v0.17.0
2025-06-02 16:01:44 -07:00
8b03a35646 Fix CI hang in torch 2.7 & improve UT (#7321)
Fix the CI hang and improve the unit test.

---------

Signed-off-by: inkcherry <mingzhi.liu@intel.com>
Co-authored-by: Olatunji Ruwase <tunji.ruwase@snowflake.com>
2025-06-02 15:41:10 -04:00
4d00b38ada Ulysses SP for HF Integration (#7268)
This is the Deepspeed counterpart of
https://github.com/snowflakedb/ArcticTraining/pull/45 - as the new
feature(s) require changes on both sides.


For PR reviewers: 

Readiness status:
- [x] Code
- [x] Tests
- [ ] Docs - working on it


Features:

- [x] add support for delaying grad addition via
`param.ds_grad_is_ready` flag (used when performing tiled compute in an
autograd function)
- [x] add light sp-only mpu version (Jeff Rasley)
- [x] improved debug
- [x] added `all_gather_object` to `dist` (usage sketch after this list)
- [x] `UlyssesSPAttentionHF` (port of UlyssesAttention from
Megatron-Deepspeed plus modern MHA-variations)
- [x] `UlyssesSPDataLoaderAdapter` - DL adapter to shard the normal DL
batches to be used by `UlyssesSPAttentionHF`
- [x] `SequenceTiledCompute` - generic autograd function to perform
compute after tiling on the sequence dimension
- [x] `TiledMLP` - a specific autograd function to perform tiled MLP
(it's much easier to understand before trying to grok
`SequenceTiledCompute`)
- [x] added a differentiable `_DimZeroAllToAll` (Samyam Rajbhandari)
- [x] torch-dist-check now allows `torch.distributed.nn` (which is
needed since deepspeed's dist is not up to date with
`torch.distributed.nn`)
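
For illustration, usage of the new `all_gather_object` (a sketch that
assumes it mirrors `torch.distributed.all_gather_object` and that a
distributed group is already initialized):

```python
import deepspeed.comm as dist

# Gather one picklable object per rank onto every rank.
gathered = [None] * dist.get_world_size()
dist.all_gather_object(gathered, {"rank": dist.get_rank()})
```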

---------

Signed-off-by: Stas Bekman <stas.bekman@snowflake.com>
Signed-off-by: Stas Bekman <stas@stason.org>
Co-authored-by: Stas Bekman <stas.bekman@snowflake.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
Co-authored-by: Olatunji Ruwase <tunji.ruwase@snowflake.com>
2025-05-31 07:25:23 +00:00
0baf79ead0 fix asymmetric in dequantize (#7283)
The framework's dequantize operator, when producing results under
asymmetric quantization, only converts the float type to half and misses
the conversion for the float type itself, so precision errors occur when
the output type is float.
(Screenshot: https://github.com/user-attachments/assets/3be19f06-89fe-404c-bc32-efcacc31bb1d)
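
A minimal sketch of the corrected behavior (a hypothetical Python
function; the real operator is a kernel): cast to the requested output
dtype instead of unconditionally to half.

```python
import torch

def dequantize(q, scale, zero_point, out_dtype=torch.float32):
    # Asymmetric dequantize: compute in fp32, then cast to the requested
    # output dtype rather than always returning half.
    return ((q.to(torch.float32) - zero_point) * scale).to(out_dtype)
```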

---------

Co-authored-by: 潘俊涵 <sp.junhan.pan@enflame-tech.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Olatunji Ruwase <tunji.ruwase@snowflake.com>
Co-authored-by: Zhipeng Wang <zhipeng.rainbowserie@gmail.com>
2025-05-29 20:05:58 +00:00
b66c81077c anchor transformers version (#7316)
Some features require minimal transformers versions, so let's start
anchoring.

This also fixes tests that break with recent transformers.

I need this fixed to be able to merge
https://github.com/deepspeedai/DeepSpeed/pull/7268 which requires
`transformers>=4.51.3`

---------

Signed-off-by: Stas Bekman <stas.bekman@snowflake.com>
Co-authored-by: Stas Bekman <stas.bekman@snowflake.com>
2025-05-29 06:19:54 +00:00
ec6b254dce Update gaudi2 nightly,ci to latest 1.21.0 build (#7313)
Signed-off-by: Shaik Raza Sikander <srsikander@habana.ai>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
2025-05-29 02:58:52 +00:00
e5afb88760 tests/conftest.py: automatically add local deepspeed repo when running tests (#7317)
This is a follow up to https://github.com/deepspeedai/DeepSpeed/pull/923

My original code was a copy from transformers, which has a different
filesystem layout, and I missed that. This PR fixes it to actually do
the right thing.

Now you can have multiple clones of deepspeed and the tests will use the
local repo automatically and not the pre-installed deepspeed.
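
A sketch of what such a `tests/conftest.py` arrangement looks like
(illustrative, not the exact code): put the local clone ahead of any
installed deepspeed on `sys.path`.

```python
import sys
from pathlib import Path

# tests/conftest.py sits at <repo>/tests/, so the repo root is one level up.
repo_root = Path(__file__).resolve().parents[1]
# Prefer the local clone over any pip-installed deepspeed.
sys.path.insert(0, str(repo_root))
```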
2025-05-28 23:32:49 +00:00
b4cc079eee CI: prefer bf16 over fp16 (#7304)
These days fp16 is barely ever used, so we should be testing bf16
instead of fp16 where possible.

I had to fix a bunch of tests to adapt to this change, plus a few bugs
along the way.

---------

Signed-off-by: Stas Bekman <stas.bekman@snowflake.com>
Co-authored-by: Olatunji Ruwase <tunji.ruwase@snowflake.com>
Co-authored-by: Stas Bekman <stas.bekman@snowflake.com>
2025-05-28 00:49:21 +00:00
b9af5d8d61 Fix: Update grad norm calculation for CPU offload (#7302)
## Description
This PR fixes an issue where gradient clipping modifications are not
reflected in the global gradient norm calculation when CPU offloading is
enabled. The issue occurs because the `averaged_gradients` are not being
updated with the clipped gradients when CPU offloading is active.

## Problem
When using CPU offloading with gradient clipping:
1. The gradients are successfully clipped using `safe_set_local_grad`
2. However, the `_global_grad_norm` calculation still uses the original
unclipped gradients.
3. This leads to incorrect gradient norm reporting and potential issues
with gradient clipping effectiveness.

## Solution
The fix ensures that the `averaged_gradients` are properly updated with
the clipped gradients when CPU offloading is enabled, similar to how it
works when CPU offloading is disabled.
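
For context, a usage sketch of clipping through the ZeRO-safe accessors
(`safe_get_local_grad` is assumed here as the read counterpart of the
`safe_set_local_grad` named above):

```python
from deepspeed.utils import safe_get_local_grad, safe_set_local_grad

for p in model.parameters():
    grad = safe_get_local_grad(p)  # local ZeRO-partitioned gradient
    if grad is not None:
        # Write the clipped gradient back so norm computations see it.
        safe_set_local_grad(p, grad.clamp(-1.0, 1.0))
```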

## Testing
The fix has been tested with:
- CPU offloading enabled and disabled
- Different gradient clipping values
- A simple model with linear layers
- Both FP16 and BF16

## Related Issues
Fixes #7292

---------

Signed-off-by: Naveenraj Kamalakannan <therealnaveenkamal@gmail.com>
2025-05-27 12:13:54 +00:00
17c8be0706 Fix the GPU memory usage of ZeRO-Offload (only update stage_1_and_2.py) (#7309)
Signed-off-by: Armin Zhu <mingzhengzhu1998@gmail.com>

Fix the memory usage of ZeRO-Offload with stages 1 and 2. Before the
fix, GPU memory usage was about 3x that of params_FP16, caused by the
H2D data copy using a different data type. Now GPU memory usage is about
1x params_FP16, and the H2D copy uses a 16-bit pinned memory buffer.
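
A rough sketch of the corrected copy path (standalone example with
hypothetical tensors, not the actual patch): stage through a 16-bit
pinned buffer so the device never holds an fp32 copy.

```python
import torch

cpu_fp32_param = torch.randn(1024, 1024)  # fp32 master copy on host
gpu_fp16_param = torch.empty(1024, 1024, dtype=torch.float16, device="cuda")

# 16-bit pinned staging buffer; the dtype cast happens on the host.
pinned = torch.empty(cpu_fp32_param.shape, dtype=torch.float16, pin_memory=True)
pinned.copy_(cpu_fp32_param)
gpu_fp16_param.copy_(pinned, non_blocking=True)  # H2D copy at 16-bit width
```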
2025-05-27 12:13:24 +00:00
b666844ffc Fix AutoTP gathering replaced layer params when bias is not None (#7257)
Some params are one-dimensional; this PR adds support for them.

Resolve #7249

```log
param.shape torch.Size([768, 1536])
param.shape torch.Size([768])
...
```

```log
with deepspeed.module_inject.layers.GatherReplacedLayerParams([param], model, enabled=True):
     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "deepspeed/module_inject/layers.py", line 359, in __enter__
self.params[0].gather_params(self.params)
File "torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
       ^^^^^^^^^^^^^^^^^^^^^
File "deepspeed/module_inject/layers.py", line 473, in gather_params
param.shape[1],
~~~~~~~~~~~^^^
IndexError: tuple index out of range
```
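
A minimal sketch of the guard such a fix needs (hypothetical code, not
the actual patch): don't index `param.shape[1]` on one-dimensional bias
params.

```python
import torch

def describe(param: torch.Tensor):
    # Weights are 2-D (out, in); biases are 1-D (out,), so guard dim access.
    if param.dim() > 1:
        out_features, in_features = param.shape[0], param.shape[1]
    else:
        out_features, in_features = param.shape[0], None
    return out_features, in_features
```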

---------

Signed-off-by: Hollow Man <hollowman@opensuse.org>
Signed-off-by: inkcherry <mingzhi.liu@intel.com>
Co-authored-by: Hongwei Chen <33092912+hwchen2017@users.noreply.github.com>
Co-authored-by: inkcherry <mingzhi.liu@intel.com>
2025-05-25 04:03:20 +00:00
d4032ec7d1 Update COMMITTERS.md (#7305)
Adding Zhipeng Wang to the TSC Committers.

---------

Co-authored-by: Hongwei Chen <33092912+hwchen2017@users.noreply.github.com>
2025-05-23 14:52:19 +00:00
e3bd16fdbd Update next version in version.txt after 0.16.9 release. (#7306) 2025-05-22 16:31:05 -07:00
bdba8231bc [XPU] Support XCCL on deepspeed side (#7299)
XCCL will be used for the XPU device on PyTorch 2.8. With this support
we will remove torch-ccl on the XPU device, while also reserving the old
path for enabling torch-CCL.

---------

Signed-off-by: yisheng <yi.sheng@intel.com>
Co-authored-by: Ma, Guokai <guokai.ma@gmail.com>
v0.16.9
2025-05-22 16:31:26 +00:00
0e3209a16b Fix extra_repr_str when weight is None / in zero-3 (#7254)
With the current code, `extra_repr_str` will be undefined if
`self.weight` is None.

In addition, under ZeRO-3 the shape is stored in `ds_shape`, so we also
need this check (although AutoTP doesn't currently support ZeRO-3).

```logs
  File "deepspeed/__init__.py", line 394, in tp_model_init
    model = TpTrainingManager(model=model, tp_size=tp_size, dtype=dtype).module
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "deepspeed/runtime/tensor_parallel/tp_manager.py", line 35, in __init__
    self._apply_policies(parser_dict)
  File "deepspeed/runtime/tensor_parallel/tp_manager.py", line 47, in _apply_policies
    self._apply_injection_policy(self.config, client_module)
  File "deepspeed/runtime/tensor_parallel/tp_manager.py", line 53, in _apply_injection_policy
    replace_transformer_layer(client_module, self.module, None, self.config, self.model_config)
  File "deepspeed/module_inject/replace_module.py", line 400, in replace_transformer_layer
    replaced_module = replace_module(model=model,
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "deepspeed/module_inject/replace_module.py", line 653, in replace_module
    replaced_module, _ = _replace_module(model, policy, state_dict=sd)
                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "deepspeed/module_inject/replace_module.py", line 713, in _replace_module
    _, layer_id = _replace_module(child,
                  ^^^^^^^^^^^^^^^^^^^^^^
  File "deepspeed/module_inject/replace_module.py", line 713, in _replace_module
    _, layer_id = _replace_module(child,
                  ^^^^^^^^^^^^^^^^^^^^^^
  File "deepspeed/module_inject/replace_module.py", line 689, in _replace_module
    replaced_module = policies[child.__class__][0](child,
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "deepspeed/module_inject/replace_module.py", line 333, in replace_fn
    new_module = replace_wo_policy(child, _policy, prefix=prefix, state_dict=state_dict)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "deepspeed/module_inject/replace_module.py", line 316, in replace_wo_policy
    return _autotp._replace_module(module)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "deepspeed/module_inject/auto_tp.py", line 481, in _replace_module
    self._replace_module(child, name, class_name)
  File "deepspeed/module_inject/auto_tp.py", line 466, in _replace_module
    setattr(r_module, name, self.linear_policies[child.__class__](child, prev_name + '.' + name,
                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "deepspeed/module_inject/auto_tp.py", line 361, in _replace
    if 'Yuan' in str(self.module):
                 ^^^^^^^^^^^^^^^^
  File "torch/nn/modules/module.py", line 2940, in __repr__
    mod_str = repr(module)
              ^^^^^^^^^^^^
  File "torch/nn/modules/module.py", line 2940, in __repr__
    mod_str = repr(module)
              ^^^^^^^^^^^^
  File "torch/nn/modules/module.py", line 2934, in __repr__
    extra_repr = self.extra_repr()
                 ^^^^^^^^^^^^^^^^^
  File "deepspeed/module_inject/layers.py", line 267, in extra_repr
    out_features, in_features = self.weight.shape[-2:] if self.weight is not None else (None, None)
    ^^^^^^^^^^^^^^^^^^^^^^^^^
ValueError: not enough values to unpack (expected 2, got 1)
```
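
A sketch of a robust `extra_repr` covering both cases (illustrative
only, not the merged code):

```python
def extra_repr(self):
    # Handle weight=None, and prefer ds_shape under ZeRO-3, where .shape
    # is the partitioned (often 1-D) shape rather than the full one.
    if self.weight is None:
        out_features, in_features = None, None
    else:
        shape = getattr(self.weight, "ds_shape", self.weight.shape)
        out_features, in_features = (shape[-2:] if len(shape) >= 2
                                     else (shape[0], None))
    return f"in_features={in_features}, out_features={out_features}"
```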

Signed-off-by: Hollow Man <hollowman@opensuse.org>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
2025-05-22 04:01:12 +00:00
e290bf580d disable license check until the new license situation has been sorted… (#7301)
Until we sort out the new license situation, disable this check so that
new code not owned by MSFT can be added.

---------

Signed-off-by: Stas Bekman <stas@stason.org>
2025-05-22 00:27:39 +00:00
41fceadeeb Add qwen3moe meta loading for AutoTP (#7297)
Enable Qwen3-Moe meta loading for AutoTP, for issue
https://github.com/deepspeedai/DeepSpeed/issues/7275

Signed-off-by: ranzhejiang <zhejiang.ran@intel.com>
2025-05-20 20:06:51 +00:00
0e741714f5 Enable ZeRO set/get APIs for NVMe offload (#7046)
- Extend APIs for
[debugging](https://deepspeed.readthedocs.io/en/latest/zero3.html#debugging)
and
[modifying](https://deepspeed.readthedocs.io/en/latest/zero3.html#modifying-partitioned-states)
ZeRO partitioned states to NVMe offload.
- Add vectorized update API. This is performance-critical for NVMe
offloading scenarios.
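
A usage sketch of the extended get/set APIs (function names from the
linked ZeRO-3 docs; the exact NVMe-offload behavior is an assumption):

```python
from deepspeed.utils import safe_get_full_fp32_param, safe_set_full_fp32_param

for p in model.parameters():
    full = safe_get_full_fp32_param(p)  # gathers even when offloaded
    if full is not None:
        safe_set_full_fp32_param(p, full.mul(0.99))  # write back a modification
```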

---------

Signed-off-by: Olatunji Ruwase <olruwase@microsoft.com>
Signed-off-by: Masahiro Tanaka <mtanaka@microsoft.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Logan Adams <loadams@microsoft.com>
Co-authored-by: Masahiro Tanaka <mtanaka@microsoft.com>
Co-authored-by: Masahiro Tanaka <81312776+tohtana@users.noreply.github.com>
Co-authored-by: Guanhua Wang <alexwgh333@gmail.com>
2025-05-20 00:11:17 +00:00
b048cc2b46 Modernize system executable detection across components (#7290)
# PR Summary
This small PR resolves deprecation warnings caused by the use of
`distutils.spawn.find_executable`:
```python
DeprecationWarning: Use shutil.which instead of find_executable
```
Please note that `find_executable` is deprecated as of Python 3.10 and
removed in 3.12; `shutil.which` has been available since Python 3.3.
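
For illustration, the modern pattern (the executable name here is just
an example):

```python
import shutil

# shutil.which replaces distutils.spawn.find_executable (removed in 3.12);
# it returns the full path, or None if the program is not on PATH.
compiler = shutil.which("gcc")
if compiler is None:
    raise RuntimeError("gcc not found on PATH")
```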

Signed-off-by: Emmanuel Ferdman <emmanuelferdman@gmail.com>
Co-authored-by: Olatunji Ruwase <tunji.ruwase@snowflake.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
2025-05-19 21:42:42 +00:00
d0ef6501b8 Avoid graph break by removing another redundant requires grad false (#7263)
This PR is a follow-up to [PR
#7158](https://github.com/deepspeedai/DeepSpeed/pull/7158), handling the
same issue in another place.
See [PR #7158](https://github.com/deepspeedai/DeepSpeed/pull/7158) for
details.

---------

Signed-off-by: Max Kovalenko <mkovalenko@habana.ai>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Hongwei Chen <33092912+hwchen2017@users.noreply.github.com>
2025-05-19 16:38:12 +00:00
80bc7b76da Add qwen3 meta loading for AutoTP (#7293)
This PR fixes https://github.com/deepspeedai/DeepSpeed/issues/7275 to
enable Qwen3 meta loading for AutoTP

Signed-off-by: Ma, Guokai <guokai.ma@intel.com>
2025-05-19 15:36:42 +00:00
88a1b5c057 Update patch version after 0.16.8 release (#7296) 2025-05-19 09:31:28 -07:00
f45950258b rollback #6726 (#7258)
This PR rolls back #6726, which caused
https://github.com/deepspeedai/DeepSpeed/issues/7116.

---------

Signed-off-by: Guokai Ma <guokai.ma@gmail.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Olatunji Ruwase <tunji.ruwase@snowflake.com>
v0.16.8
2025-05-19 04:54:16 +00:00
d46947db4a Temporarily skip AIO tests due to an issue with runners (#7288)
Signed-off-by: Logan Adams <loadams@microsoft.com>
Signed-off-by: Olatunji Ruwase <tunji.ruwase@snowflake.com>
Co-authored-by: Olatunji Ruwase <tunji.ruwase@snowflake.com>
2025-05-18 23:36:06 +00:00
930ab46e63 Fix issues XPU tests hit with extra-index-url (#7291)
cc: @Liangliang-Ma

---------

Signed-off-by: Logan Adams <loadams@microsoft.com>
2025-05-16 19:07:35 -07:00
5a4e7a08ec [XPU] update xpu-max1100 CI workflow to torch 2.7 (#7284)
Signed-off-by: Ma, Liangliang <liangliang.ma@intel.com>
Signed-off-by: Logan Adams <loadams@microsoft.com>
Co-authored-by: Logan Adams <loadams@microsoft.com>
2025-05-15 10:02:53 -07:00
069ec31c59 Fix fp8 gemm (#7265)
This PR addresses this issue
https://github.com/deepspeedai/DeepSpeed/issues/7236.
I might have reverted some of the recent changes introduced in this
[PR](https://github.com/deepspeedai/DeepSpeed/pull/6932), which was
necessary to remove a misaligned address issue in the CUDA kernel. I
will get back to this and try to make the necessary changes for the
other pass.

cc: @mrwyattii @jeffra

---------

Co-authored-by: Reza Yazdani <reza.yazdani@snowflake.com>
Co-authored-by: Reza Yazdani <rezay@microsoft.com>
Co-authored-by: Jeff Rasley <jeffra45@gmail.com>
Co-authored-by: Michael Wyatt <michael.wyatt@snowflake.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
2025-05-08 15:21:52 -07:00
e1ba9e614f add Makefile to ease maintenance (#7267)
Adding a `Makefile` with `make format` and `make test` targets to make
things easier to maintain.

---------

Signed-off-by: Stas Bekman <stas@stason.org>
2025-05-07 13:09:56 -07:00
ee492c30a7 Fix compile error for nv_bloat162 (#7248)
Some systems seem not to have the `__nv_bfloat162` definition, so a
placeholder was introduced. Newer CUDA libs have that definition, which
breaks the compile process. This patch adds the official `cuda_bf16.h`
guard while keeping the old code, plus a safety assert in case the
definition should change in the future. See #7190 for reference.

---------

Signed-off-by: LosCrossos <165311345+loscrossos@users.noreply.github.com>
Signed-off-by: LosCrossos <165311345+mytait@users.noreply.github.com>
Co-authored-by: LosCrossos <165311345+mytait@users.noreply.github.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
2025-04-27 05:16:34 +00:00
fff77bd293 Update README.md (#7246)
Make the sentence read more naturally, less robotic.
2025-04-25 15:15:16 +00:00
9926879b59 Update CPU torch version to 2.7 (#7241)
Signed-off-by: Logan Adams <loadams@microsoft.com>
2025-04-23 21:58:01 +00:00
8d2865e014 Revert "Update torch cpu test version"
This reverts commit 00b5678bbf10c12b97a5f80d4b89247dcd837a95.
2025-04-23 13:26:40 -07:00
00b5678bbf Update torch cpu test version
Signed-off-by: Logan Adams <loadams@microsoft.com>
2025-04-23 13:26:02 -07:00
d79bd930d6 Add cpu accelerator fp16 dtype support (#7207)
Add cpu accelerator fp16 dtype support
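
A quick check of the new capability (a sketch; the reported value on the
CPU accelerator is an assumption based on this change):

```python
from deepspeed.accelerator import get_accelerator

acc = get_accelerator()
# On the CPU accelerator this should now report fp16 as supported.
print(acc.device_name(), acc.is_fp16_supported())
```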

---------

Signed-off-by: Lai, Yejing <yejing.lai@intel.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
2025-04-21 19:21:37 +00:00