310 Commits

Author SHA1 Message Date
69e03e52d0 [XPU][CI] recover xpu-max1100 workflow (#7630)
Reduce the scope of some tests to recover the CI workflow.

Signed-off-by: Ma, Liangliang <liangliang.ma@intel.com>
Co-authored-by: Olatunji Ruwase <tunji.ruwase@snowflake.com>
2025-10-13 16:43:17 +00:00
64ac13f72e Enable forked PRs (#7486)
Enable forked PRs

---------

Signed-off-by: Olatunji Ruwase <tunji.ruwase@snowflake.com>
2025-08-14 17:43:08 -04:00
a12de38db6 Modal CI (#7289)
This is an initial effort to migrate CI onto Modal infra. This PR
creates two new workflows that run on Modal:
1. modal-torch-latest: a subset of nv-torch-latest-v100 that includes
`tests/unit/runtime/zero/test_zero.py`.
2. modal-accelerate: a full copy of nv-accelerate-v100. 

Follow up PRs will selectively migrate relevant workflows onto Modal.

---------

Signed-off-by: Olatunji Ruwase <tunji.ruwase@snowflake.com>
Signed-off-by: Olatunji Ruwase <tjruwase@gmail.com>
Signed-off-by: Tunji Ruwase <tunji.ruwase@snowflake.com>
Co-authored-by: Stas Bekman <stas00@users.noreply.github.com>
Co-authored-by: Logan Adams <loadams@microsoft.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Olatunji Ruwase <tjruwase@gmail.com>
Co-authored-by: Stas Bekman <stas.bekman@snowflake.com>
2025-08-11 20:13:39 +00:00
8c83e42ba1 Fix cpu CI (#7481)
Fix torch version

---------

Signed-off-by: Olatunji Ruwase <tunji.ruwase@snowflake.com>
2025-08-11 11:53:09 -07:00
43f00ba31c Remove additional unused tests (human-eval) (#7445) 2025-07-24 13:16:57 -07:00
3bf53451e5 Remove tests from README that are already removed. (#7441) 2025-07-21 20:56:11 -07:00
affee605e4 trying to fix nv-accelerate-v100.yml CI job (#7424)
Trying an accelerate commit from the day before:
1ac8643df7

---------

Signed-off-by: Stas Bekman <stas@stason.org>
2025-07-11 10:07:27 -04:00
d3b9cb8c4e sequence parallel default dtype (#7364)
The newly released NCCL finally started to use fp32 accumulation for
reduction ops!

* Floating point summation is always done in fp32 accumulators (with the
  exception of fp8 on NVLS, where it uses fp16 inside the switch). Thus,
  the accuracy with fp8 and fp16 data types should be much improved.

72d2432094

So we should change the fp32 comms default for SP to the same dtype as
inputs if `nccl>=2.27.3` - the user can still override the default.
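
As a rough illustration (the helper name and parameters below are assumptions, not the actual DeepSpeed code), the version-gated default could look like:

```python
# Hedged sketch of the version-gated default: keep fp32 comms on older NCCL,
# otherwise default to the input dtype, while still honoring a user override.
# Illustrative only, not the actual DeepSpeed implementation.
from typing import Optional

import torch
from packaging.version import Version


def default_sp_comm_dtype(input_dtype: torch.dtype,
                          nccl_version: str,
                          override: Optional[torch.dtype] = None) -> torch.dtype:
    if override is not None:
        return override                          # explicit user choice always wins
    if Version(nccl_version) >= Version("2.27.3"):
        return input_dtype                       # NCCL now accumulates reductions in fp32
    return torch.float32                         # older NCCL: upcast comms for accuracy
```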

---------

Signed-off-by: Stas Bekman <stas@stason.org>
Co-authored-by: Olatunji Ruwase <tjruwase@gmail.com>
2025-06-19 18:32:14 +00:00
10b106619a Don't break set_start_method (#7349)
Fix #7347

---------

Signed-off-by: Tunji Ruwase <tunji@ip-172-31-0-204.us-west-2.compute.internal>
Signed-off-by: Olatunji Ruwase <tjruwase@gmail.com>
Co-authored-by: Tunji Ruwase <tunji@ip-172-31-0-204.us-west-2.compute.internal>
2025-06-11 13:00:58 -04:00
2ce5505799 Move pytest pinning from individual tests to requirements-dev.txt until fixed. (#7327)
pytest 8.4.0 seems to break a number of our tests. Rather than pinning
it in each test individually, we should just put the pin in the
requirements file until we resolve the issue.

---------

Co-authored-by: Olatunji Ruwase <tjruwase@gmail.com>
2025-06-09 22:42:55 +00:00
2ad2011cc9 Fix pytest version to 8.3.5 in hpu-gaudi actions (#7337)
This is needed to avoid the CI failure seen in PR #7330.

Signed-off-by: Shaik Raza Sikander <srsikander@habana.ai>
Co-authored-by: Olatunji Ruwase <tjruwase@gmail.com>
2025-06-05 23:10:19 +00:00
720787e79b Bump to v0.17.0 (#7324)
Co-authored-by: Logan Adams <loadams@microsoft.com>
2025-06-02 16:01:44 -07:00
4d00b38ada Ulysses SP for HF Integration (#7268)
This is the DeepSpeed counterpart of
https://github.com/snowflakedb/ArcticTraining/pull/45, as the new
feature(s) require changes on both sides.


For PR reviewers: 

Readiness status:
- [x] Code
- [x] Tests
- [ ] Docs - working on it


Features:

- [x] add support for delaying grad addition via
`param.ds_grad_is_ready` flag (used when performing tiled compute in an
autograd function)
- [x] add light sp-only mpu version (Jeff Rasley)
- [x] improved debug
- [x] added `all_gather_object` to `dist`
- [x] `UlyssesSPAttentionHF` (port of UlyssesAttention from
Megatron-Deepspeed plus modern MHA-variations)
- [x] `UlyssesSPDataLoaderAdapter` - DL adapter to shard the normal DL
batches to be used by `UlyssesSPAttentionHF`
- [x] `SequenceTiledCompute` - generic autograd function to perform
compute after tiling on the sequence dimension
- [x] `TiledMLP` - a specific autograd function to perform tiled MLP
(it's much easier to understand before trying to grok
`SequenceTiledCompute`); a rough sketch of the tiling idea follows this list
- [x] added a differentiable `_DimZeroAllToAll` (Samyam Rajbhandari)
- [x] torch-dist-check now allows `torch.distributed.nn` (which is
needed since deepspeed's dist is not up to date with
`torch.distributed.nn`)
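
As a rough orientation (an illustrative helper, not the actual `TiledMLP`/`SequenceTiledCompute` code), the sequence-tiling idea in its simplest form:

```python
# Minimal sketch of sequence-dimension tiling for a position-wise MLP.
# Each shard is recomputed in backward via checkpoint(), so peak activation
# memory scales with the shard length rather than the full sequence length.
# Illustrative only, not the DeepSpeed implementation.
import torch
from torch.utils.checkpoint import checkpoint


def tiled_mlp_forward(mlp: torch.nn.Module, x: torch.Tensor, num_shards: int = 4) -> torch.Tensor:
    # x: [batch, seq, hidden]; the MLP acts per position, so sequence shards
    # can be processed independently and concatenated back together.
    shards = x.chunk(num_shards, dim=1)
    outs = [checkpoint(mlp, shard, use_reentrant=False) for shard in shards]
    return torch.cat(outs, dim=1)
```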

---------

Signed-off-by: Stas Bekman <stas.bekman@snowflake.com>
Signed-off-by: Stas Bekman <stas@stason.org>
Co-authored-by: Stas Bekman <stas.bekman@snowflake.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
Co-authored-by: Olatunji Ruwase <tunji.ruwase@snowflake.com>
2025-05-31 07:25:23 +00:00
b66c81077c anchor transformers version (#7316)
Some features require minimum transformers versions, so let's start
anchoring.

This also fixes tests that break with recent transformers.

I need this fixed to be able to merge
https://github.com/deepspeedai/DeepSpeed/pull/7268 which requires
`transformers>=4.51.3`

---------

Signed-off-by: Stas Bekman <stas.bekman@snowflake.com>
Co-authored-by: Stas Bekman <stas.bekman@snowflake.com>
2025-05-29 06:19:54 +00:00
ec6b254dce Update gaudi2 nightly,ci to latest 1.21.0 build (#7313)
Signed-off-by: Shaik Raza Sikander <srsikander@habana.ai>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
2025-05-29 02:58:52 +00:00
b4cc079eee CI: prefer bf16 over fp16 (#7304)
These days fp16 is barely ever used, so we should be testing bf16
instead of fp16 where possible.

I had to fix a number of tests to adapt to this change, plus a few bugs
found along the way.
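
The "where possible" part boils down to a capability check along these lines (a hedged sketch, not the actual test helper):

```python
# Hedged sketch: pick bf16 for tests when the accelerator supports it,
# otherwise fall back to fp16. Not the actual DeepSpeed test utility.
import torch


def preferred_test_dtype() -> torch.dtype:
    if torch.cuda.is_available() and torch.cuda.is_bf16_supported():
        return torch.bfloat16
    return torch.float16
```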

---------

Signed-off-by: Stas Bekman <stas.bekman@snowflake.com>
Co-authored-by: Olatunji Ruwase <tunji.ruwase@snowflake.com>
Co-authored-by: Stas Bekman <stas.bekman@snowflake.com>
2025-05-28 00:49:21 +00:00
0e741714f5 Enable ZeRO set/get APIs for NVMe offload (#7046)
- Extend APIs for
[debugging](https://deepspeed.readthedocs.io/en/latest/zero3.html#debugging)
and
[modifying](https://deepspeed.readthedocs.io/en/latest/zero3.html#modifying-partitioned-states)
ZeRO partitioned states to NVMe offload (usage is sketched after this list).
- Add vectorized update API. This is performance-critical for NVMe
offloading scenarios.
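
For orientation, the documented safe get/set helpers are used roughly like this (a minimal sketch assuming `engine` comes from `deepspeed.initialize()` with ZeRO-3 and NVMe offload configured; whether this exact snippet exercises the new offload path is an assumption):

```python
# Minimal sketch of reading and modifying ZeRO-partitioned parameter state with
# the documented safe_get/safe_set helpers. Not the new vectorized update API.
from deepspeed.utils import safe_get_full_fp32_param, safe_set_full_fp32_param


def clamp_all_master_weights(engine, lo: float = -1.0, hi: float = 1.0) -> None:
    # `engine` is assumed to be the object returned by deepspeed.initialize()
    # with ZeRO-3 and NVMe offload configured (setup not shown here).
    for _, param in engine.module.named_parameters():
        full_fp32 = safe_get_full_fp32_param(param)   # full fp32 master copy
        if full_fp32 is not None:                      # defensive: may be None on some ranks
            safe_set_full_fp32_param(param, full_fp32.clamp(lo, hi))  # example update
```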

---------

Signed-off-by: Olatunji Ruwase <olruwase@microsoft.com>
Signed-off-by: Masahiro Tanaka <mtanaka@microsoft.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Logan Adams <loadams@microsoft.com>
Co-authored-by: Masahiro Tanaka <mtanaka@microsoft.com>
Co-authored-by: Masahiro Tanaka <81312776+tohtana@users.noreply.github.com>
Co-authored-by: Guanhua Wang <alexwgh333@gmail.com>
2025-05-20 00:11:17 +00:00
d46947db4a Temporarily skip AIO tests due to an issue with runners (#7288)
Signed-off-by: Logan Adams <loadams@microsoft.com>
Signed-off-by: Olatunji Ruwase <tunji.ruwase@snowflake.com>
Co-authored-by: Olatunji Ruwase <tunji.ruwase@snowflake.com>
2025-05-18 23:36:06 +00:00
930ab46e63 Fix issues XPU tests hit with extra-index-url (#7291)
cc: @Liangliang-Ma

---------

Signed-off-by: Logan Adams <loadams@microsoft.com>
2025-05-16 19:07:35 -07:00
5a4e7a08ec [XPU] update xpu-max1100 CI workflow to torch 2.7 (#7284)
Signed-off-by: Ma, Liangliang <liangliang.ma@intel.com>
Signed-off-by: Logan Adams <loadams@microsoft.com>
Co-authored-by: Logan Adams <loadams@microsoft.com>
2025-05-15 10:02:53 -07:00
9926879b59 Update CPU torch version to 2.7 (#7241)
Signed-off-by: Logan Adams <loadams@microsoft.com>
2025-04-23 21:58:01 +00:00
8d2865e014 Revert "Update torch cpu test version"
This reverts commit 00b5678bbf10c12b97a5f80d4b89247dcd837a95.
2025-04-23 13:26:40 -07:00
00b5678bbf Update torch cpu test version
Signed-off-by: Logan Adams <loadams@microsoft.com>
2025-04-23 13:26:02 -07:00
227a60c0c4 DeepCompile for enhanced compiler integration (#7154)
This PR introduces *DeepCompile*, a new feature that efficiently
integrates compiler optimizations with other DeepSpeed features.
DeepCompile utilizes torch's dynamo to capture the computation graph and
modifies it to incorporate DeepSpeed’s optimizations seamlessly.

Currently, DeepCompile supports ZeRO-1 and ZeRO-3, with enhancements
such as proactive prefetching and selective unsharding to improve
performance.
(More details will be added later.)
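
At a high level, the capture-and-rewrite mechanism builds on Dynamo's custom-backend hook, roughly as in the sketch below (a generic illustration, not the actual DeepCompile implementation or its configuration):

```python
# Generic illustration of graph capture and rewriting via a custom torch.compile
# backend. A real pass would insert or reorder ops (e.g. prefetch/unshard) here;
# this sketch only walks the captured FX graph. Not the DeepCompile code itself.
import torch


def rewriting_backend(gm: torch.fx.GraphModule, example_inputs):
    for node in gm.graph.nodes:
        pass  # inspect or rewrite nodes of the captured graph here
    gm.recompile()        # regenerate code in case the graph was modified
    return gm.forward     # must return a callable matching the original signature


model = torch.nn.Linear(8, 8)
compiled = torch.compile(model, backend=rewriting_backend)
out = compiled(torch.randn(2, 8))
```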

---------

Signed-off-by: Masahiro Tanaka <mtanaka@microsoft.com>
Signed-off-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: zafarsadiq <zafarsadiq120@gmail.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
2025-04-16 04:33:53 +00:00
3388f8331b Update container version that runs on A6000 tests. (#7153)
Changes from https://github.com/huggingface/transformers/pull/36654 in
transformers caused issues with the torch 2.5 version we were using. This
just updates us to use a newer version.

---------

Signed-off-by: Logan Adams <loadams@microsoft.com>
2025-03-19 23:42:38 +00:00
29e9fd53b5 Enhance Gaudi2 CI/Nightly Coverage with Model Parallelism and Linear Tests (#7146)
Enhance CI/nightly coverage for the Gaudi2 device.
Tests added:
        test_autotp_training.py
        test_ulysses.py
        test_linear::TestLoRALinear and test_linear::TestBasicLinear
        test_ctx::TestEngine
These provide coverage for the model parallelism and linear features.
The tests are stable: 10/10 runs pass.
The new tests are expected to increase CI time by 3-4 minutes and
nightly job time by 15 minutes.

Signed-off-by: Shaik Raza Sikander <srsikander@habana.ai>
2025-03-18 23:49:01 +00:00
d095b18185 Unpin transformers version for most workflows (#7139)
Unpin transformers version for all workflows except
`nv-torch-latest-v100` as this still has a tolerance issue with some
quantization tests.

Signed-off-by: Logan Adams <loadams@microsoft.com>
2025-03-14 13:52:44 -07:00
c1acd49cdf Update gaudi2 nightly,ci to latest 1.20.0 build (#7093)
Signed-off-by: Shaik Raza Sikander <srsikander@habana.ai>
Signed-off-by: Olatunji Ruwase <olruwase@microsoft.com>
Signed-off-by: Max Kovalenko <mkovalenko@habana.ai>
Signed-off-by: inkcherry <mingzhi.liu@intel.com>
Signed-off-by: shaomin <wukon1992@gmail.com>
Signed-off-by: Stas Bekman <stas@stason.org>
Signed-off-by: siqi <siqi@tecorigin.com>
Signed-off-by: Logan Adams <loadams@microsoft.com>
Signed-off-by: Wei Wu <wuwei211x@gmail.com>
Signed-off-by: ShellyNR <shelly.nahir@live.biu.ac.il>
Signed-off-by: Lai, Yejing <yejing.lai@intel.com>
Signed-off-by: Hongwei <hongweichen@microsoft.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Max Kovalenko <mkovalenko@habana.ai>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: inkcherry <mingzhi.liu@intel.com>
Co-authored-by: wukong1992 <wukong1992@users.noreply.github.com>
Co-authored-by: shaomin <wukon1992@gmail.com>
Co-authored-by: Hongwei Chen <33092912+hwchen2017@users.noreply.github.com>
Co-authored-by: loadams <loadams@users.noreply.github.com>
Co-authored-by: Stas Bekman <stas00@users.noreply.github.com>
Co-authored-by: siqi654321 <siqi202311@163.com>
Co-authored-by: siqi <siqi@tecorigin.com>
Co-authored-by: Wei Wu <45323446+U-rara@users.noreply.github.com>
Co-authored-by: Masahiro Tanaka <81312776+tohtana@users.noreply.github.com>
Co-authored-by: Shelly Nahir <73890534+ShellyNR@users.noreply.github.com>
Co-authored-by: snahir <snahir@habana.ai>
Co-authored-by: Yejing-Lai <yejing.lai@intel.com>
2025-03-07 22:46:47 +00:00
02bbf50109 Remove workflows for very old torch versions (#7090)
These jobs haven't been run in a long time and were originally used when
compatibility with torch <2 was more important.

Signed-off-by: Logan Adams <loadams@microsoft.com>
2025-02-28 01:33:01 +00:00
f2ed2531a7 Update parallelism for nv-torch-latest/nightly tests due to more GPUs/runner (#7086)
Signed-off-by: Logan Adams <loadams@microsoft.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
2025-02-27 10:47:49 -08:00
f8d34295d0 Pin transformers version on tests that use latest. (#7085)
Latest transformers causes failures in the cpu-torch-latest test, so we
pin it for now to unblock other PRs.

---------

Signed-off-by: Logan Adams <loadams@microsoft.com>
2025-02-27 08:15:11 -08:00
1d30b58cba Replace calls to python setup.py sdist with python -m build --sdist (#7069)
With future changes coming to pip/python/etc., we need to stop calling
`python setup.py ...` and replace it, per:
https://packaging.python.org/en/latest/guides/modernize-setup-py-project/#should-setup-py-be-deleted


![image](https://github.com/user-attachments/assets/ea39ef7b-3cbe-4916-86f0-bc46a5fce96d)

This means we need to install the `build` package, which is added here as
well.

Additionally, we pass the `--sdist` flag to build only the sdist rather
than the wheel.

---------

Signed-off-by: Logan Adams <loadams@microsoft.com>
2025-02-24 20:40:24 +00:00
33dd2e2165 nv-ds-chat breaks with latest transformers (#7052)
Signed-off-by: Logan Adams <loadams@microsoft.com>
2025-02-19 15:48:41 +00:00
079de6bdff Update workflows to cuda 12.4 (#7000)
- Update existing workflows that use cu121 to cu124. Note, this means
that where we download the latest torch, we will now be getting torch 2.6
rather than torch 2.5, the latest provided with cuda 12.1.
- Note, nv-nightly is currently failing in master due to unrelated
errors, so it can be ignored in this PR (nv-nightly was tested locally,
where it passes with both 12.1 and 12.4).

---------

Signed-off-by: Fabien Dupont <fdupont@redhat.com>
Signed-off-by: Logan Adams <loadams@microsoft.com>
Signed-off-by: Olatunji Ruwase <olruwase@microsoft.com>
Signed-off-by: inkcherry <mingzhi.liu@intel.com>
Signed-off-by: Omar Elayan <oelayan@habana.ai>
Co-authored-by: Fabien Dupont <fabiendupont@fabiendupont.fr>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Liangliang Ma <1906710196@qq.com>
Co-authored-by: inkcherry <mingzhi.liu@intel.com>
Co-authored-by: Omar Elayan <142979319+oelayan7@users.noreply.github.com>
2025-02-12 15:25:41 -08:00
a83ab17d3d Update A6000 tests transformers version (#7016)
Signed-off-by: Logan Adams <loadams@microsoft.com>
Co-authored-by: Stas Bekman <stas00@users.noreply.github.com>
2025-02-08 00:26:02 +00:00
e7fc598652 [XPU] max1100 workflow update for docker and softwares (#7003)
1. update intel oneAPI basekit to 2025.0
2. update torch/ipex/oneccl to 2.5
2025-02-05 12:17:56 -08:00
fd40516923 Update GH org references (#6998)
Signed-off-by: Olatunji Ruwase <olruwase@microsoft.com>
Signed-off-by: Logan Adams <loadams@microsoft.com>
Signed-off-by: Fabien Dupont <fdupont@redhat.com>
Co-authored-by: Fabien Dupont <fabiendupont@fabiendupont.fr>
2025-02-05 00:56:50 +00:00
241bffded3 Update A6000 workflows to use newer docker container - 24.09 vs 24.03 (#6967)
- There are issues with the nv-sd updates; will follow up in a subsequent PR
2025-01-31 23:07:12 +00:00
c963c21f5d Specify torchvision in nv-ds-chat workflow (prevents errors with torch 2.6) (#6982)
Fixes #6984.

The workflow was pulling the updated torch 2.6, which caused CI
failures. This keeps us on torch 2.5 for now, since installing
torchvision as a dependency later on was unintentionally pulling
torch 2.6 with it.

This PR also unsets NCCL_DEBUG to avoid a large print out in the case of
any errors.
2025-01-30 20:03:14 +00:00
8ad487254c Update torch versions to support 2.6 (#6977) 2025-01-29 00:12:58 +00:00
66d3d3e94d Pin nv-a6000 workflow (#6938)
The breaking change in transformers is
https://github.com/huggingface/transformers/pull/35235. Changes are
needed before we can unpin the nv-a6000 workflow.
2025-01-13 10:34:15 -08:00
0fc3daade7 Add position_ids arg to OPTEmbedding forward function (#6939)
This PR updates the DeepSpeed `OPTEmbedding` forward function to include
a new `position_ids` argument.
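
For context, the general pattern in OPT-style learned positional embeddings looks roughly like the sketch below (illustrative class and signature, not the exact DeepSpeed/transformers code):

```python
# Illustrative pattern only: an OPT-style learned positional embedding whose
# forward accepts an optional position_ids tensor and otherwise derives the
# positions from the attention mask. Not the exact DeepSpeed signature.
from typing import Optional

import torch


class LearnedPositionalEmbedding(torch.nn.Embedding):
    OFFSET = 2  # OPT reserves the first two embedding slots

    def forward(self, attention_mask: torch.Tensor,
                position_ids: Optional[torch.Tensor] = None) -> torch.Tensor:
        if position_ids is None:
            # cumulative sum of the mask gives 0-based positions of real tokens
            position_ids = attention_mask.long().cumsum(dim=-1) - 1
            position_ids = position_ids.masked_fill(attention_mask == 0, 0)
        return super().forward(position_ids + self.OFFSET)
```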

---------

Co-authored-by: Logan Adams <loadams@microsoft.com>
2025-01-09 20:11:35 +00:00
cc03c76d57 Update Gaudi2 jobs to latest 1.19 build (#6905)
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
2024-12-26 12:07:28 -08:00
4cd1d97460 Don't error out when cpu accelerator doesn't have torch (as default for whl building) (#6886)
This fixes a bug introduced in #6845 that breaks the `no-torch`
workflow we rely on for releases, where torch is not required to be in
the environment when building an sdist. This adds to the CPU accelerator
the same logic the CUDA accelerator already had: torch does not need to
be installed to build the whl.
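
The guard amounts to something like the sketch below (illustrative, not the exact accelerator code):

```python
# Hedged sketch: tolerate a missing torch so sdist/whl builds can run in
# environments without torch installed; runtime use still requires it.
try:
    import torch  # noqa: F401
    HAS_TORCH = True
except ImportError:
    HAS_TORCH = False
```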
2024-12-17 17:30:52 -08:00
87c650681e Remove pin from transformers version and fix Processing/Threading issues in tests (#6822)
Changes from https://github.com/huggingface/transformers/pull/34966
caused the `nv-torch-latest-v100` tests to fail with the following
error:

```
  File "/tmp/azureml/cr/j/e4bfd57a509846d6bbc4914639ad248d/exe/wd/actions-runner/_work/DeepSpeed/DeepSpeed/unit-test-venv/lib/python3.10/site-packages/transformers/modeling_utils.py", line 3941, in from_pretrained
    raise EnvironmentError(
OSError: Can't load the model for 'hf-internal-testing/tiny-random-VisionEncoderDecoderModel-vit-gpt2'. If you were trying to load it from 'https://huggingface.co/models', make sure you don't have a local directory with the same name. Otherwise, make sure 'hf-internal-testing/tiny-random-VisionEncoderDecoderModel-vit-gpt2' is the correct path to a directory containing a file named pytorch_model.bin, tf_model.h5, model.ckpt or flax_model.msgpack.
```

Sample failure here:
https://github.com/microsoft/DeepSpeed/actions/runs/12169422174/job/33942348835?pr=6794#step:8:3506

This was resolved on the Transformers side here:
https://github.com/huggingface/transformers/pull/35236
2024-12-16 11:21:51 -08:00
853a97648b Fix xpu tests workflow failure by changing pip index url (#6864)
Update xpu-max1100.yml and xpu-compile.yml
2024-12-13 11:29:48 -08:00
074d5c69c3 Fix nv-torch-nightly test by pinning transformers (#6849) 2024-12-11 10:34:31 -08:00
06f1d3609e Unpin pytest-subtests now that 0.14.1 is released (#6844)
The issue we encountered was covered here:
https://github.com/pytest-dev/pytest-subtests/issues/173

It is resolved by the changes tracked in
https://github.com/pytest-dev/pytest-subtests/issues/174, which are
published in the latest version, 0.14.1.
2024-12-09 22:14:59 -08:00
08b907a226 Pin pytest-subtests version for accelerate tests (#6842) 2024-12-09 12:24:33 -08:00
9ca6016017 Pin HPU tests (#6831)
HPU tests are impacted by the same issue as other tests that use the
latest transformers. This PR pins to a version of transformers from
before the fix.
2024-12-06 14:29:00 -08:00