2958 Commits

Author SHA1 Message Date
1640f6df4f Update build_win.bat script to exclude GDS op as it lacks Windows support. (#6971)
Nvidia GDS [does not support
Windows](https://developer.nvidia.com/gpudirect-storage).
2025-01-24 21:58:43 +00:00
470dd6dceb Precisely track nvme optimizer offload (#6963)
Fix #4998
2025-01-23 16:42:06 +00:00
de4596bedc Update version.txt after 0.16.3 release (#6965)
**Auto-generated PR to update version.txt after a DeepSpeed release**
Released version - 0.16.3
Author           - @loadams

Co-authored-by: loadams <loadams@users.noreply.github.com>
2025-01-21 14:34:26 -08:00
c17dc33c04 Using explicit GPU upcast for ZeRO-Offload (#6962)
Following the discussion in
[PR-6670](https://github.com/microsoft/DeepSpeed/pull/6670), an explicit
upcast is much more efficient than an implicit one, so this PR replaces
the implicit upcast with an explicit one (a minimal sketch follows the
results table below).

The results on 3B model are shown below:

| Option | BWD (ms) | Speedup |
|----------------|----------|------|
| Before PR-6670 | 25603.30 | 1x |
| After PR-6670 | 1174.31 | 21.8x |
| After this PR | 309.2 | 82.8x |
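
As a rough illustration of the change described above (not the DeepSpeed code path itself), an explicit, one-shot upcast on the GPU replaces relying on per-op implicit type promotion:

```python
import torch

# Minimal sketch, assuming a bf16 tensor that downstream fp32 ops will consume.
grad_bf16 = torch.randn(4096, 4096, dtype=torch.bfloat16, device="cuda")

# Implicit upcast: feeding the bf16 tensor into fp32 ops lets each op promote
# on the fly. Explicit upcast: one .to(torch.float32) on the device, after
# which all downstream ops see an fp32 tensor directly.
grad_fp32 = grad_bf16.to(torch.float32)
```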
v0.16.3
2025-01-21 18:48:38 +00:00
8d1bc0a042 Update torch.norm to torch.linalg.norm and torch.linalg.vector_norm (#6931)
- [x] Update PR since `torch.norm` and `torch.linalg.norm` have
[different function
signatures](https://pytorch.org/docs/stable/generated/torch.linalg.norm.html#torch.linalg.norm).
- [x] Check if there are any numeric differences between the functions.
- [x] Determine why there appear to be performance differences compared to
those reported [here](https://github.com/pytorch/pytorch/issues/136360).
- [x] Update to `torch.linalg.vector_norm`
Follow up PR handles these in the comm folder: #6960
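
A minimal before/after of the migration, keeping in mind that `torch.norm` and `torch.linalg.vector_norm` take different keyword arguments (`p` vs. `ord`):

```python
import torch

t = torch.randn(1024)

old = torch.norm(t, p=2)                  # legacy API
new = torch.linalg.vector_norm(t, ord=2)  # preferred replacement

assert torch.allclose(old, new)
```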
2025-01-21 16:49:58 +00:00
bc76b04e28 Add the missing view operations from sequence parallel(async). (#6750)
FYI @loadams 

A view operation present in the original version was dropped in later updates:
17ed7c77c5/deepspeed/sequence/layer.py (L56)

This PR adds the missing view operation back. The shape required for the view
cannot be easily obtained in the current function, so the layout-params code is
refactored as well.

---------

Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Masahiro Tanaka <81312776+tohtana@users.noreply.github.com>
2025-01-21 16:49:06 +00:00
7f3d669b40 Remove Duplicate Declaration of pandas in Dockerfile (#6959)
### Description

This pull request removes the redundant installation of `pandas` from
the `Dockerfile`.
It was previously declared twice, and this update eliminates the
duplicate entry, improving the clarity and maintainability of the
`Dockerfile`.


018ece5af2/docker/Dockerfile (L124)


018ece5af2/docker/Dockerfile (L135)

### Changes

Removed the duplicate pandas installation line from the `RUN pip
install` command.
2025-01-17 17:44:49 +00:00
f97f0885cf Update import for torchvision.transformers (#6958)
Fixes import - found via
[torchfix](https://github.com/pytorch-labs/torchfix).
2025-01-17 09:43:51 -08:00
018ece5af2 Add extra_repr to Linear classes for debugging purpose (#6954)
**Summary**
This PR adds an `extra_repr` method to some Linear classes so that
additional info is printed when such modules are printed, which is useful
for debugging.
Affected modules:
- LinearLayer
- LinearAllreduce
- LmHeadLinearAllreduce

The `extra_repr` method gives the following info:
- in_features
- out_features
- bias (true or false)
- dtype
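
A hedged sketch of the idea (the actual DeepSpeed classes wrap pre-sharded weights, so the details differ):

```python
import torch

class LinearLayer(torch.nn.Module):
    """Simplified stand-in showing only the extra_repr addition."""

    def __init__(self, weight: torch.Tensor, bias: torch.Tensor = None):
        super().__init__()
        self.weight = weight
        self.bias = bias

    def extra_repr(self) -> str:
        # Report shape, bias presence, and dtype when the module is printed.
        out_features, in_features = self.weight.shape
        return "in_features={}, out_features={}, bias={}, dtype={}".format(
            in_features, out_features, self.bias is not None, self.weight.dtype)
```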

**Example**
Printing the llama-2-7b model on rank 0 after `init_inference` with world size
= 2.
Previously we only got the class names of these modules:
```
InferenceEngine(
  (module): LlamaForCausalLM(
    (model): LlamaModel(
      (embed_tokens): Embedding(32000, 4096)
      (layers): ModuleList(
        (0-31): 32 x LlamaDecoderLayer(
          (self_attn): LlamaSdpaAttention(
            (q_proj): LinearLayer()
            (k_proj): LinearLayer()
            (v_proj): LinearLayer()
            (o_proj): LinearAllreduce()
            (rotary_emb): LlamaRotaryEmbedding()
          )
          (mlp): LlamaMLP(
            (gate_proj): LinearLayer()
            (up_proj): LinearLayer()
            (down_proj): LinearAllreduce()
            (act_fn): SiLU()
          )
          (input_layernorm): LlamaRMSNorm((4096,), eps=1e-05)
          (post_attention_layernorm): LlamaRMSNorm((4096,), eps=1e-05)
        )
      )
      (norm): LlamaRMSNorm((4096,), eps=1e-05)
      (rotary_emb): LlamaRotaryEmbedding()
    )
    (lm_head): LmHeadLinearAllreduce()
  )
)
```
Now we get more useful info:
```
InferenceEngine(
  (module): LlamaForCausalLM(
    (model): LlamaModel(
      (embed_tokens): Embedding(32000, 4096)
      (layers): ModuleList(
        (0-31): 32 x LlamaDecoderLayer(
          (self_attn): LlamaSdpaAttention(
            (q_proj): LinearLayer(in_features=4096, out_features=2048, bias=False, dtype=torch.bfloat16)
            (k_proj): LinearLayer(in_features=4096, out_features=2048, bias=False, dtype=torch.bfloat16)
            (v_proj): LinearLayer(in_features=4096, out_features=2048, bias=False, dtype=torch.bfloat16)
            (o_proj): LinearAllreduce(in_features=2048, out_features=4096, bias=False, dtype=torch.bfloat16)
            (rotary_emb): LlamaRotaryEmbedding()
          )
          (mlp): LlamaMLP(
            (gate_proj): LinearLayer(in_features=4096, out_features=5504, bias=False, dtype=torch.bfloat16)
            (up_proj): LinearLayer(in_features=4096, out_features=5504, bias=False, dtype=torch.bfloat16)
            (down_proj): LinearAllreduce(in_features=5504, out_features=4096, bias=False, dtype=torch.bfloat16)
            (act_fn): SiLU()
          )
          (input_layernorm): LlamaRMSNorm((4096,), eps=1e-05)
          (post_attention_layernorm): LlamaRMSNorm((4096,), eps=1e-05)
        )
      )
      (norm): LlamaRMSNorm((4096,), eps=1e-05)
      (rotary_emb): LlamaRotaryEmbedding()
    )
    (lm_head): LmHeadLinearAllreduce(in_features=2048, out_features=32000, bias=False, dtype=torch.bfloat16)
  )
)
```
2025-01-16 18:11:07 +00:00
05eaf3d1ca warn to warning (#6952)
`warn` is deprecated, see
https://docs.python.org/3/library/logging.html#logging.Logger.warning


```DeprecationWarning: The 'warn' method is deprecated, use 'warning' instead```
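
The corresponding one-line change, for reference:

```python
import logging

logger = logging.getLogger(__name__)
logger.warning("message")  # preferred; logger.warn("message") is a deprecated alias
```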
2025-01-15 22:08:56 +00:00
fae714d6bd [inf] Add config var to enable keeping module on host (#6846)
The new `keep_module_on_host` config var controls whether checkpoint weights
loaded into model parameters are moved to the device or stay on the host.
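
A hedged usage sketch (assuming `model` is an already-loaded HF model; the exact plumbing of the flag may differ):

```python
import torch
import deepspeed

# keep_module_on_host=True keeps the loaded checkpoint weights in host memory
# instead of moving them to the accelerator during init_inference.
engine = deepspeed.init_inference(
    model,
    tensor_parallel={"tp_size": 2},
    dtype=torch.bfloat16,
    keep_module_on_host=True,
)
```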

---------

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
2025-01-15 19:25:29 +00:00
66d3d3e94d Pin nv-a6000 workflow (#6938)
The breaking change in transformers is
https://github.com/huggingface/transformers/pull/35235. Changes are needed
before the nv-a6000 workflow can be unpinned.
2025-01-13 10:34:15 -08:00
396f8db793 Remove op compilation flags due to perf issue (#6944)
In some scenarios, some of the optimization flags for the HPU op compiler can
cause a significant performance degradation.
Remove the flags until the issue is resolved.
2025-01-13 16:50:22 +00:00
fa8db5cf2f Support pure meta model lm_head tp (#6812)
Add lm_head tensor-parallel support when no checkpoint is provided to
deepspeed.init_inference().

---------

Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Ma, Guokai <guokai.ma@gmail.com>
2025-01-10 22:18:01 +00:00
1d15ef0acf Add information on security expectations with this software (#6941)
Inspired by the SECURITY.md that vLLM
[includes](https://github.com/vllm-project/vllm/blob/main/SECURITY.md),
this starts to give users insight into the security expectations they
should have when using DeepSpeed.
2025-01-09 15:56:54 -08:00
0fc3daade7 Add position_ids arg to OPTEmbedding forward function (#6939)
This PR updates the DeepSpeed `OPTEmbedding` forward function to include
a new `position_ids` argument.

---------

Co-authored-by: Logan Adams <loadams@microsoft.com>
2025-01-09 20:11:35 +00:00
45fce45c95 Add deepseek autotp (#6937)
Add AutoTP support for DeepSeek, including Multi-Head Latent Attention (MLA)
and MoE.

For MLA TP, we need to skip the two low-rank layers ("q_a_proj" and
"kv_a_proj_with_mqa").
For DeepSeek MoE, tp_parse sees the MoE layer name as layer_idx.down_proj,
which makes it hard to add a policy, so we put the down_proj layer into
all_reduce_linears by default.
2025-01-09 18:11:32 +00:00
53fb5795a1 Fix windows blog examples (#6934) 2025-01-08 12:54:19 -08:00
b62c84d88d Fix building on Windows with presence of Triton (#6749)
This fixes some errors when installing DeepSpeed on Windows with Triton
present.

I assume we don't need the warning about NFS on Windows for now. I did not
investigate how to detect an NFS path on Windows, but we can detect a UNC
path starting with `\\` if needed.

`os.rename` does not allow overwriting an existing file on Windows, while
`os.replace` is more cross-platform.
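
For reference, the portable overwrite pattern looks like this:

```python
import os

with open("output.tmp", "w") as f:
    f.write("payload")

# os.rename raises FileExistsError on Windows when the destination exists;
# os.replace overwrites it on both POSIX and Windows.
os.replace("output.tmp", "output.txt")
```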

Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
2025-01-08 18:59:41 +00:00
6628127a37 Update python version classifiers (#6933)
Update python version classifiers in setup.py to reflect python versions
currently supported.
2025-01-08 18:43:06 +00:00
c41b0c2855 Use torch.log1p (#6930)
This function provides greater precision than `log(1 + x)` for small
values of `x`.

Found with TorchFix https://github.com/pytorch-labs/torchfix/
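
A quick illustration of the precision difference:

```python
import torch

x = torch.tensor([1e-8], dtype=torch.float32)

naive = torch.log(1 + x)  # 1 + 1e-8 rounds to 1.0 in fp32, so this yields 0.0
better = torch.log1p(x)   # ~1e-8, as expected
```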
2025-01-08 01:27:30 +00:00
c7f30322fd inference: remove unused _validate_args function (#5505)
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
2025-01-07 18:10:24 +00:00
f2cc80909b Check transformers version in BLOOM for inference v1 (#6766)
This PR checks that the `transformers` version is `<= 4.43.4` in the
BLOOM container for inference v1, due to breaking changes in
`transformers > 4.43.4`.

---------

Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
2025-01-06 17:13:35 -08:00
c348c5b11a Cleanup ops/transformer/inference tests (#6925) 2025-01-06 14:35:50 -08:00
b0040b6ca4 Reduce the device bubble introduced by heavy loop synchronization in coalesced fetch/release(z3_leaf_module) (#6694)
Depends on https://github.com/microsoft/DeepSpeed/pull/6649

When performing fetch/release operations on Z3 leaf modules, the loop time is
excessively long for fine-grained modules. Compared to non-leaf modules, Z3
leaf modules may include a larger number of parameters. Although each loop
iteration does not take much time, the overall loop length can be significant.

![image](https://github.com/user-attachments/assets/9891835a-2620-47f3-aba6-ea22b8905d1c)

**The fetch time is impacted by:**
- Post-allgather operations (narrow, slice, cat; difficult to avoid)
- Memory pressure (record_stream / fetch event create & sync)

**The release time is impacted by:**
- slice
- Free-parameter record_stream

Considering that the parameters of fine-grained leaf modules are relatively
small, we can treat the parameters within each leaf module as a unified entity
when handling memory pressure. This approach can approximately halve the CPU
time required for fetch/release operations.

---------

Co-authored-by: Ma, Guokai <guokai.ma@gmail.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
2025-01-06 20:06:06 +00:00
c5e48f49d8 Add fp8_gemm fallback for non-triton systems (#6916)
- Removed try/except from __init__ file in fp_quantizer and added a
single entry point instead
- Renamed file fp8_gemm to fp8_gemm_triton, and the function matmul_fp8
to matmul_fp8_triton
- Added a new entry point fp8_gemm containing matmul_fp8; if the system
supports Triton it calls the Triton implementation, otherwise it calls the
fallback (see the sketch below)
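
A hedged sketch of that dispatch pattern (placeholder function bodies; the real kernels live in deepspeed.ops.fp_quantizer):

```python
import torch

def _matmul_fp8_fallback(a, b):
    # Stand-in for the non-Triton fp8 GEMM path.
    return torch.matmul(a, b)

def matmul_fp8(a, b):
    # Single entry point: prefer the Triton kernel when Triton is importable,
    # otherwise use the fallback.
    try:
        import triton  # noqa: F401
    except ImportError:
        return _matmul_fp8_fallback(a, b)
    # With Triton available, DeepSpeed would call the Triton-backed kernel
    # (renamed matmul_fp8_triton in this PR); the fallback stands in here so
    # the sketch stays runnable.
    return _matmul_fp8_fallback(a, b)
```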

Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
2025-01-06 18:54:57 +00:00
f8c9f314ff [BUG FIX]:fix get torch.version.cuda error when cuda is None in rocm (#6909)
Hi, I found an error when using DeepSpeed with ROCm torch:
```
torch_cuda_version = ".".join(torch.version.cuda.split('.')[:2])
```
raises an AttributeError when `torch.version.cuda` is None. This occurs
because the CUDA version in the ROCm torch version.py is always set to None,
leading to runtime errors in environments where ROCm is used.
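
One way to guard against this (a sketch, not necessarily the exact fix applied):

```python
import torch

if torch.version.cuda is not None:
    torch_cuda_version = ".".join(torch.version.cuda.split(".")[:2])
elif getattr(torch.version, "hip", None) is not None:
    # ROCm builds expose torch.version.hip instead of torch.version.cuda.
    torch_hip_version = ".".join(torch.version.hip.split(".")[:2])
```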

---------

Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
2025-01-06 17:38:19 +00:00
0dbbb70b99 Fix checkpointable_layers Logic (#6881)
**Problem**

There's an edge-case in DeepSpeed, where if all three of the following
are true:
1. Deepspeed activation checkpointing is applied 
2. The user passes `checkpointable_layers` (e.g.
f532580567/megatron/model/gpt2_model.py (L175))
3. The user's model class contains `GPT2ModelPipe` or `GPTModelPipe`

Then the `checkpointable_layers` will not be activation checkpointed. 

**Reason**

This is because in the current logic, `_is_checkpointable` will
short-circuit to just return layers matching
`ParallelTransformerLayerPipe` in the case of `self.__class__.__name__
in ('GPTModelPipe', 'GPT2ModelPipe')`. See
da771ed42e/deepspeed/runtime/pipe/module.py (L653)

**Proposed Fixes**

I think that `checkpointable_layers` should always be checked for, and
added logic to this effect. I also found the documentation for
`checkpointable_layers` confusing and contradictory, so I updated the
docstring. Lastly, I added a unit test for `checkpointable_layers`.
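
A hedged sketch of the proposed ordering inside `_is_checkpointable` (method body only, assuming the surrounding `PipelineModule` context; the actual change may differ in detail):

```python
def _is_checkpointable(self, funcs):
    # Honor user-supplied checkpointable_layers before any class-name-based
    # short-circuit for GPTModelPipe / GPT2ModelPipe.
    if self.checkpointable_layers is not None:
        return all(f.__class__.__name__ in self.checkpointable_layers for f in funcs)
    if self.__class__.__name__ in ('GPTModelPipe', 'GPT2ModelPipe'):
        return all('ParallelTransformerLayerPipe' in f.__class__.__name__ for f in funcs)
    # Fallback: checkpoint any block that owns parameters (torch is imported
    # at module level in deepspeed/runtime/pipe/module.py).
    params = [f.parameters() for f in funcs if isinstance(f, torch.nn.Module)]
    return any(len(list(p)) > 0 for p in params)
```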

---------

Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
2025-01-04 05:57:49 +00:00
a8ede3a9df Cleanup ops/transformer/inference tests (#6830) 2025-01-03 08:25:50 -08:00
456c9ac679 Stage3: Use new torch grad accumulation hooks API (#6773)
* This commit addresses DeepSpeed issue
[#6718](https://github.com/microsoft/DeepSpeed/issues/6718)
* The existing code has been using the grad_acc node hook to reduce param
grads.
Constructs such as `param.data = replicated_tensor.data` used in
`allgather_params(..)`
are compiled into `param.set()`, causing the hook assigned to the
grad_acc node to never be called.
* Starting from PyTorch 2.1 there is a new, robust hook API on the
param itself: `param.register_post_accumulate_grad_hook(..)`
* This commit makes use of the proper API depending on the PyTorch
version (see the sketch below)
* It also disables compile for PyTorch versions < 2.1
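
A hedged sketch of the version-dependent hook registration (simplified; `reduce_grad` stands in for the ZeRO-3 reduction logic):

```python
import torch
from packaging import version

param = torch.nn.Parameter(torch.randn(8))

def reduce_grad(p):
    # Stand-in for the per-parameter gradient reduction DeepSpeed performs.
    pass

if version.parse(torch.__version__) >= version.parse("2.1"):
    # New API: fires after the grad has been accumulated into param.grad.
    param.register_post_accumulate_grad_hook(reduce_grad)
else:
    # Legacy approach: hook the grad-accumulation node in the autograd graph.
    # (Real code must keep a reference to grad_acc alive.)
    grad_acc = param.expand_as(param).grad_fn.next_functions[0][0]
    grad_acc.register_hook(lambda *_: reduce_grad(param))
```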

---------

Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Masahiro Tanaka <81312776+tohtana@users.noreply.github.com>
2025-01-03 07:48:24 -08:00
3573858e7c Change compile for pipeline module torch.compile (#6478)
We have encountered an issue with torch.compile and the pipeline module:
modifying a member of the module (micro_offset) during the forward
function causes torch.compile to restart its analysis and treat the
module as dynamic.
To bypass this issue without significantly changing the way the
pipeline module works, we propose to compile only the layers in the
pipeline module instead of its forward function. This should still give
most of the benefit of torch compiling the pipeline module while avoiding
the issue (see the sketch below).
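
A hedged sketch of compiling the stage's layers individually (assuming the `forward_funcs` list that a DeepSpeed `PipelineModule` executes for its stage):

```python
import torch

def compile_stage_layers(pipeline_module):
    # Wrap each nn.Module layer owned by this stage with torch.compile instead
    # of compiling PipelineModule.forward, which mutates self.micro_offset and
    # would otherwise force recompilation.
    for i, layer in enumerate(pipeline_module.forward_funcs):
        if isinstance(layer, torch.nn.Module):
            pipeline_module.forward_funcs[i] = torch.compile(layer)
```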

---------

Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
2024-12-30 10:53:41 -08:00
cc03c76d57 Update Gaudi2 jobs to latest 1.19 build (#6905)
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
2024-12-26 12:07:28 -08:00
85cc5f9bb3 Fix error caused by all_reduce call in domino (#6880)
Fix #6851 
Initialize communication backend to fix error caused by all_reduce call
in the Domino transformer layer.
Verified correctness in local test.

---------

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
2024-12-26 09:12:04 -08:00
eea5304807 hpu_accelerator: use torch.use_deterministic_algorithms (#6897)
Use the formal torch.use_deterministic_algorithms API instead of hpu.setDeterministic.
2024-12-19 21:13:46 -08:00
00ea0c46c2 Zero2: avoid graph breaks in torch.compile by using param_idx (#6803)
Inside reduce_independent_p_g_buckets_and_remove_grads and
reduce_ipg_grads, which are executed during the BWD hook in ZeRO-2,
the model param is stored inside params_in_ipg_bucket.
torch.compile has a hard time tracing parameters.
By using the param's static index inside the group, the same logic can be
maintained with less complexity (see the sketch below).
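
A hedged sketch of the pattern (attribute names mirror the ZeRO stage 1/2 internals and may differ):

```python
def add_param_to_ipg_bucket(self, group_idx, param_idx):
    # Store plain ints; Parameters held in Python containers tend to cause
    # graph breaks under torch.compile.
    self.params_in_ipg_bucket.append((group_idx, param_idx))

def reduce_ipg_grads(self):
    for group_idx, param_idx in self.params_in_ipg_bucket:
        param = self.bit16_groups[group_idx][param_idx]  # look up lazily
        ...  # reduce param.grad exactly as before
```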

---------

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Logan Adams <loadams@microsoft.com>
2024-12-19 16:54:45 -08:00
4fd79205c6 Allow to compile collective for PT>2.3 (#6899)
Allow compiling collectives for PT > 2.3.
Commit re-uploaded due to a GitHub CI issue; originally uploaded by @nelyahu.
2024-12-19 09:26:50 -08:00
f9e158a0f5 Update version.txt after 0.16.2 release (#6893)
**Auto-generated PR to update version.txt after a DeepSpeed release**
Released version - 0.16.2
Author           - @loadams

Co-authored-by: loadams <loadams@users.noreply.github.com>
2024-12-18 09:53:17 -08:00
b344c04df0 Update code owners (#6890)
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
v0.16.2
2024-12-18 08:49:28 -08:00
0b25630abe Add arctic model support by adding w2 to all_reduce (#6856)
As title says. 

The default behavior of the Arctic model produces shape issues with AutoTP
because the MLP layer performs `w2 * act(w1*w3)`. However, the method provided
to fix Mixtral-8x7B in #5257 does not work, since the MLP for Arctic is
also used within a ModuleList for the MoE. This results in MLP weights
hiding behind individual experts as layers `#.w#`, which is not caught
by the fix in #5257. This PR adds the check directly within replace, where
it can check actual layer names for the `w2` key in the model to
patch with `all_reduce`.

---------

Signed-off-by: Daniel Huang <daniel1.huang@intel.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
2024-12-18 08:09:31 -08:00
4cd1d97460 Don't error out when cpu accelerator doesn't have torch (as default for whl building) (#6886)
This fixes a bug introduced in #6845 that broke the `no-torch` workflow we
rely on for releases, where torch is not required to be in the environment
when building an sdist. It adds to the CPU accelerator the same logic the
CUDA accelerator already had, so torch does not need to be installed to
build the whl.
2024-12-17 17:30:52 -08:00
2f32966b1c Update transformers ops unit tests to use requried_torch_version (#6884) 2024-12-17 11:53:47 -08:00
a964e43553 Fix --enable_each_rank_log when used with PDSH multi-node runner (#6863)
This PR fixes
https://github.com/microsoft/DeepSpeed/issues/6859 by threading this
argument into the deepspeed launcher command built by PDSHRunner.

---------

Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
2024-12-17 09:33:09 -08:00
da771ed42e Add MLP/lm_head tp grain size setting. (#6828)
This PR adds an MLP/lm_head TP grain-size setting to the
deepspeed.init_inference() API, making it more flexible to set the
MLP/lm_head sharding grain size.

DNN libraries favor tensor sizes with a granularity that is a power of 2,
so we pick 64 as the default size.

We aim to be able to set the MLP/lm_head TP grain size flexibly. This is
a preliminary solution; if there is a better one, we can discuss it
together. Thanks~

---------

Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
2024-12-16 14:14:53 -08:00
87c650681e Remove pin from transformers version and fix Processing/Threading issues in tests (#6822)
Changes from https://github.com/huggingface/transformers/pull/34966
caused the `nv-torch-latest-v100` tests to fail with the following
error:

```
  File "/tmp/azureml/cr/j/e4bfd57a509846d6bbc4914639ad248d/exe/wd/actions-runner/_work/DeepSpeed/DeepSpeed/unit-test-venv/lib/python3.10/site-packages/transformers/modeling_utils.py", line 3941, in from_pretrained
    raise EnvironmentError(
OSError: Can't load the model for 'hf-internal-testing/tiny-random-VisionEncoderDecoderModel-vit-gpt2'. If you were trying to load it from 'https://huggingface.co/models', make sure you don't have a local directory with the same name. Otherwise, make sure 'hf-internal-testing/tiny-random-VisionEncoderDecoderModel-vit-gpt2' is the correct path to a directory containing a file named pytorch_model.bin, tf_model.h5, model.ckpt or flax_model.msgpack.
```

Sample failure here:
https://github.com/microsoft/DeepSpeed/actions/runs/12169422174/job/33942348835?pr=6794#step:8:3506

This was resolved on the Transformers side here:
https://github.com/huggingface/transformers/pull/35236
2024-12-16 11:21:51 -08:00
db98cc3ad1 Fix assertion for offloading states (#6855)
This PR fixes the assertions in `offload_states` method mentioned in
#6833.

Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
2024-12-16 11:05:55 -08:00
fc7c07007f Update real_accelerator.py (#6845)
### Comment out or delete `accelerate_name="cpu"` when `xpu` is not detected

When `xpu` is not detected and `DS_ACCELERATOR` is set, the code simply passes
(lines 68 to 74). However, when `DS_ACCELERATOR` is not set, `cpu` is assigned
to `accelerate_name` as soon as `intel_extension_for_pytorch` cannot be
imported or `xpu` cannot be found (lines 125 to 133).

I found this problem yesterday and spent a whole afternoon figuring it out. I
had `intel_extension_for_pytorch` installed as a dependency of another package
that I do not actually use, so I was unaware of it. I then found that `cpu` is
assigned to `accelerate_name` directly when `xpu` cannot be found, which breaks
`cuda` detection. In fact, `cpu` should only be assigned at the end, if `cuda`
is also not detected (lines 170 to 177). A simplified sketch of the intended
detection order follows.
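
A simplified sketch of that detection order (not the literal real_accelerator.py code):

```python
def detect_accelerator():
    import torch
    try:
        import intel_extension_for_pytorch  # noqa: F401
        if hasattr(torch, "xpu") and torch.xpu.is_available():
            return "xpu"
    except ImportError:
        pass
    # Failing to find XPU falls through to the CUDA check instead of
    # locking in "cpu" prematurely.
    if torch.cuda.is_available():
        return "cuda"
    return "cpu"
```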

---------

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Logan Adams <loadams@microsoft.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
2024-12-13 16:41:43 -08:00
6e3e13cb28 Remove warnings from autodoc and sphinx (#6788)
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
2024-12-13 15:35:12 -08:00
8efbcc495c Update TSC (#6867) 2024-12-13 13:49:08 -08:00
b5e3fac6a5 add domino navigation (#6866)
Add a Domino item to the navigation list.
2024-12-13 12:59:08 -08:00
d7750c3429 Domino updates (#6861)
Updating our website for Domino

---------

Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
2024-12-13 11:40:41 -08:00