The newly released NCCL finally started to use fp32 accumulation for
reduction ops!
* Floating point summation is always done in fp32 accumulators (with the
exception of fp8 on NVLS, where it uses fp16 inside the switch). Thus,
the accuracy with fp8 and fp16 data types should be much improved.
NCCL commit: 72d2432094
So we change the fp32 comms default for SP to the same dtype as the
inputs when `nccl>=2.27.3`; the user can still override the default.
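A minimal sketch of such a version-gated default (the helper name is hypothetical, not DeepSpeed's actual code):

```python
import torch

def default_sp_comm_dtype(input_dtype: torch.dtype) -> torch.dtype:
    # NCCL >= 2.27.3 accumulates floating-point reductions in fp32
    # internally, so communicating in the input dtype no longer costs
    # accuracy; older NCCL still needs the fp32 upcast workaround.
    if torch.cuda.nccl.version() >= (2, 27, 3):
        return input_dtype
    return torch.float32
```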
---------
Signed-off-by: Stas Bekman <stas@stason.org>
Co-authored-by: Olatunji Ruwase <tjruwase@gmail.com>
pytest 8.4.0 seems to break a number of our tests. Rather than pinning
it in each place individually, we just put the pin in the requirements
file until we resolve the issue.
---------
Co-authored-by: Olatunji Ruwase <tjruwase@gmail.com>
This is the DeepSpeed counterpart of
https://github.com/snowflakedb/ArcticTraining/pull/45, as the new
feature(s) require changes on both sides.
For PR reviewers:
Readiness status:
- [x] Code
- [x] Tests
- [ ] Docs - working on it
Features:
- [x] added support for delaying grad addition via the
`param.ds_grad_is_ready` flag (used when performing tiled compute in an
autograd function)
- [x] added a light SP-only MPU version (Jeff Rasley)
- [x] improved debug
- [x] added `all_gather_object` to `dist`
- [x] `UlyssesSPAttentionHF` (a port of UlyssesAttention from
Megatron-DeepSpeed plus modern MHA variations)
- [x] `UlyssesSPDataLoaderAdapter` - DL adapter to shard the normal DL
batches to be used by `UlyssesSPAttentionHF`
- [x] `SequenceTiledCompute` - generic autograd function to perform
compute after tiling on the sequence dimension
- [x] `TiledMLP` - a specific autograd function to perform a tiled MLP
(it's much easier to understand before trying to grok
`SequenceTiledCompute`; see the sketch after this list)
- [x] added a differentiable `_DimZeroAllToAll` (Samyam Rajbhandari)
- [x] torch-dist-check now allows `torch.distributed.nn` (which is
needed since DeepSpeed's dist is not up to date with
`torch.distributed.nn`)
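To illustrate the tiling idea, here is a minimal sketch of a sequence-tiled MLP autograd function (assumptions: batch-first `[batch, seq, hidden]` inputs and a seq-length-preserving callable; this is not DeepSpeed's actual `TiledMLP` code). Forward runs tile by tile without saving intermediate activations; backward recomputes each tile and accumulates grads:

```python
import torch

class NaiveTiledMLP(torch.autograd.Function):
    """Hypothetical illustration of tiled compute on the sequence dim."""

    @staticmethod
    def forward(ctx, fn, x, num_shards):
        ctx.fn = fn
        ctx.num_shards = num_shards
        ctx.save_for_backward(x)
        # grad mode is off inside forward, so fn's intermediate
        # activations are not kept alive - only the output is.
        shards = x.chunk(num_shards, dim=1)
        return torch.cat([fn(s) for s in shards], dim=1)

    @staticmethod
    def backward(ctx, grad_out):
        (x,) = ctx.saved_tensors
        grad_x = torch.empty_like(x)
        offset = 0
        for xs, gs in zip(x.chunk(ctx.num_shards, dim=1),
                          grad_out.chunk(ctx.num_shards, dim=1)):
            xs = xs.detach().requires_grad_(True)
            with torch.enable_grad():
                ys = ctx.fn(xs)   # recompute only this tile's activations
            ys.backward(gs)       # weight grads accumulate tile by tile
            grad_x[:, offset:offset + xs.shape[1]] = xs.grad
            offset += xs.shape[1]
        return None, grad_x, None
```

Because the weight grads accumulate across tiles, grad reduction has to be held off until the last tile is done - which is exactly what the `param.ds_grad_is_ready` flag above allows.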
---------
Signed-off-by: Stas Bekman <stas.bekman@snowflake.com>
Signed-off-by: Stas Bekman <stas@stason.org>
Co-authored-by: Stas Bekman <stas.bekman@snowflake.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
Co-authored-by: Olatunji Ruwase <tunji.ruwase@snowflake.com>
These days fp16 is barely ever used, so we should be testing bf16
instead of fp16 where possible.
Had to fix a bunch of tests to adapt to this change, and fixed a few
bugs along the way.
---------
Signed-off-by: Stas Bekman <stas.bekman@snowflake.com>
Co-authored-by: Olatunji Ruwase <tunji.ruwase@snowflake.com>
Co-authored-by: Stas Bekman <stas.bekman@snowflake.com>
This PR introduces *DeepCompile*, a new feature that efficiently
integrates compiler optimizations with other DeepSpeed features.
DeepCompile uses PyTorch's TorchDynamo to capture the computation graph
and modifies it to seamlessly incorporate DeepSpeed's optimizations.
Currently, DeepCompile supports ZeRO-1 and ZeRO-3, with enhancements
such as proactive prefetching and selective unsharding to improve
performance.
(More details will be added later.)
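As a rough sketch of the mechanism this builds on - a custom TorchDynamo backend receives the captured FX graph and can rewrite it before execution (the backend below is purely illustrative and performs no DeepSpeed rewrites):

```python
import torch

def inspect_backend(gm: torch.fx.GraphModule, example_inputs):
    # The captured FX graph arrives here; this is where ZeRO-aware
    # rewrites (e.g. prefetch or unshard scheduling) could be inserted.
    print(gm.graph)
    return gm.forward  # run the (unmodified) graph

@torch.compile(backend=inspect_backend)
def layer(x, w):
    return torch.relu(x @ w)

layer(torch.randn(4, 8), torch.randn(8, 8))
```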
---------
Signed-off-by: Masahiro Tanaka <mtanaka@microsoft.com>
Signed-off-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: zafarsadiq <zafarsadiq120@gmail.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Enhancing CI/nightly coverage for the Gaudi2 device.
Tests added:
- test_autotp_training.py
- test_ulysses.py
- test_linear::TestLoRALinear and test_linear::TestBasicLinear
- test_ctx::TestEngine

These provide coverage for model parallelism and the linear feature.
The tests are stable: 10/10 runs pass.
The new tests are expected to increase CI time by 3-4 minutes and the
nightly job time by 15 minutes.
Signed-off-by: Shaik Raza Sikander <srsikander@habana.ai>
Unpin transformers version for all workflows except
`nv-torch-latest-v100`, as it still has a tolerance issue with some
quantization tests.
Signed-off-by: Logan Adams <loadams@microsoft.com>
These jobs haven't been run in a long time and were originally used when
compatibility with torch <2 was more important.
Signed-off-by: Logan Adams <loadams@microsoft.com>
The latest transformers causes failures in the cpu-torch-latest test,
so we pin it for now to unblock other PRs.
---------
Signed-off-by: Logan Adams <loadams@microsoft.com>
- Update existing workflows that use cu121 to cu124. Note, this means
that where we download the latest torch, we will now be getting torch
2.6 rather than torch 2.5, the latest available with cuda 12.1.
- Note, nv-nightly is currently failing in master due to unrelated
errors, so it can be ignored in this PR (nv-nightly was tested locally,
where it passes with both 12.1 and 12.4).
---------
Signed-off-by: Fabien Dupont <fdupont@redhat.com>
Signed-off-by: Logan Adams <loadams@microsoft.com>
Signed-off-by: Olatunji Ruwase <olruwase@microsoft.com>
Signed-off-by: inkcherry <mingzhi.liu@intel.com>
Signed-off-by: Omar Elayan <oelayan@habana.ai>
Co-authored-by: Fabien Dupont <fabiendupont@fabiendupont.fr>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Liangliang Ma <1906710196@qq.com>
Co-authored-by: inkcherry <mingzhi.liu@intel.com>
Co-authored-by: Omar Elayan <142979319+oelayan7@users.noreply.github.com>
Fixes #6984.
The workflow was pulling the updated torch 2.6, which caused CI
failures. This keeps us on torch 2.5 for now, since installing
torchvision as a dependency later on was unintentionally pulling in
torch 2.6.
This PR also unsets NCCL_DEBUG to avoid a large printout in the case of
any errors.
This PR updates the DeepSpeed `OPTEmbedding` forward function to include
a new `position_ids` argument.
---------
Co-authored-by: Logan Adams <loadams@microsoft.com>
This fixes a bug introduced in #6845 that broke the `no-torch` workflow,
which we require in order to do releases, since torch must not be needed
in the environment when building an sdist. This adds the same logic to
the CPU accelerator that the CUDA accelerator already had, so we don't
require torch to be installed to build the whl.
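A minimal sketch of that pattern (illustrative, not the accelerator's actual code) - guard the import so that merely importing the module during a build does not hard-require torch:

```python
try:
    import torch
except ImportError:
    # torch may be absent during sdist/whl builds; defer the failure to
    # the first real use of a torch-backed API instead of import time.
    torch = None
```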
Changes from https://github.com/huggingface/transformers/pull/34966
caused the `nv-torch-latest-v100` tests to fail with the following
error:
```
File "/tmp/azureml/cr/j/e4bfd57a509846d6bbc4914639ad248d/exe/wd/actions-runner/_work/DeepSpeed/DeepSpeed/unit-test-venv/lib/python3.10/site-packages/transformers/modeling_utils.py", line 3941, in from_pretrained
raise EnvironmentError(
OSError: Can't load the model for 'hf-internal-testing/tiny-random-VisionEncoderDecoderModel-vit-gpt2'. If you were trying to load it from 'https://huggingface.co/models', make sure you don't have a local directory with the same name. Otherwise, make sure 'hf-internal-testing/tiny-random-VisionEncoderDecoderModel-vit-gpt2' is the correct path to a directory containing a file named pytorch_model.bin, tf_model.h5, model.ckpt or flax_model.msgpack.
```
Sample failure here:
https://github.com/microsoft/DeepSpeed/actions/runs/12169422174/job/33942348835?pr=6794#step:8:3506
This was resolved on the Transformers side here:
https://github.com/huggingface/transformers/pull/35236