**Auto-generated PR to update version.txt after a DeepSpeed release**
Released version - 0.16.3
Author - @loadams
Co-authored-by: loadams <loadams@users.noreply.github.com>
Following the discussion in
[PR-6670](https://github.com/microsoft/DeepSpeed/pull/6670), an explicit
upcast is much more efficient than an implicit one, so this PR replaces
the implicit upcast with an explicit one (a minimal sketch of the
difference follows the table below).
The results on a 3B model are shown below:
| Option | BWD (ms) | Speedup |
|----------------|----------|------|
| Before PR-6670 | 25603.30 | 1x |
| After PR-6670 | 1174.31 | 21.8x |
| After this PR | 309.2 | 82.8x |
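As a rough, hypothetical illustration of the two patterns (tensor names are invented; this is not the code touched by the PR), the implicit form relies on dtype conversion inside the op, while the explicit form converts up front:

```python
import torch

lp_grad = torch.randn(1024, dtype=torch.bfloat16)  # low-precision gradient
hp_acc = torch.zeros(1024, dtype=torch.float32)    # high-precision accumulator

# Implicit upcast: copy_ converts bf16 -> fp32 inside the op.
hp_acc.copy_(lp_grad)

# Explicit upcast: convert first, then copy fp32 -> fp32
# (the measurements above are for the explicit form).
hp_acc.copy_(lp_grad.to(torch.float32))
```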
### Description
This pull request removes the redundant installation of `pandas` from
the `Dockerfile`.
It was previously declared twice, and this update eliminates the
duplicate entry, improving the clarity and maintainability of the
`Dockerfile`.
See `docker/Dockerfile` at commit `018ece5af2`, lines L124 and L135.
### Changes
Removed the duplicate pandas installation line from the `RUN pip
install` command.
Using the `keep_module_on_host` config var lets us control whether the
checkpoint weights loaded into model parameters are moved to the device
or stay on the host.
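A minimal sketch of how the option might be passed (assuming it is exposed as a keyword argument of `deepspeed.init_inference()`; everything other than `keep_module_on_host` is illustrative):

```python
import torch
import deepspeed

# model = ...  # a torch.nn.Module whose checkpoint weights have been loaded

engine = deepspeed.init_inference(
    model,
    dtype=torch.bfloat16,
    tensor_parallel={"tp_size": 2},
    keep_module_on_host=True,  # keep the loaded weights on the host instead of moving them to the device
)
```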
---------
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
In some scenarios, some of the optimization flags for the ops compiler
for HPU can cause a significant performance degradation. Remove the
flags until the issue is resolved.
This PR updates the DeepSpeed `OPTEmbedding` forward function to include
a new `positions_ids` argument.
---------
Co-authored-by: Logan Adams <loadams@microsoft.com>
This adds tensor parallelism (TP) support for Deepseek, including
Multi-Head Latent Attention (MLA) and MoE.
For MLA TP, we need to skip two low-rank layers ("q_a_proj" and
"kv_a_proj_with_mqa").
For Deepseek MoE, tp_parse sees the MoE layer name as
`layer_idx.down_proj`, which makes it hard to add a policy, so we add the
`down_proj` layer to `all_reduce_linears` by default.
This fixes some errors when installing DeepSpeed on Windows in the
presence of Triton.
I think we can assume we don't need the warning about NFS on Windows for
now. I did not look into how to detect an NFS path on Windows, but we
could detect a UNC path starting with `\\` if needed.
`os.rename` does not allow overwriting an existing file on Windows, while
`os.replace` is more cross-platform.
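As a quick illustration of the difference (standard library behavior, not DeepSpeed-specific code):

```python
import os

# Write to a temp file, then atomically move it into place.
with open("cache.lock.tmp", "w") as f:
    f.write("pid=1234\n")

# os.rename("cache.lock.tmp", "cache.lock") raises FileExistsError on Windows
# if the destination already exists; os.replace overwrites it on all platforms.
os.replace("cache.lock.tmp", "cache.lock")
```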
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
This PR checks that the `transformers` version is `<= 4.43.4` in the
BLOOM container for inference v1, due to breaking changes in
`transformers > 4.43.4`.
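A hypothetical sketch of such a version guard (the actual check may live elsewhere in the container/test setup):

```python
import transformers
from packaging import version

if version.parse(transformers.__version__) > version.parse("4.43.4"):
    raise RuntimeError("BLOOM inference v1 requires transformers <= 4.43.4")
```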
---------
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Depends on https://github.com/microsoft/DeepSpeed/pull/6649.
When performing fetch/release operations on Z3 leaf modules, the loop
time is excessively long for fine-grained modules. Compared to non-leaf
modules, Z3 leaf modules may include a much larger number of parameters.
Although each loop iteration does not take much time, the overall loop
length can be significant.

**The fetch time is impacted by:**
- Post-allgather operations (narrow, slice, cat; difficult to avoid)
- Memory pressure (record_stream / fetch event creation & sync)

**The release time is impacted by:**
- slice
- Free-parameter record_stream
Since each parameter in a fine-grained leaf module is relatively small,
we can treat the parameters within each leaf module as a single entity
when handling memory pressure (see the sketch below). This approach can
approximately halve the CPU time required for fetch/release operations.
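A rough, hypothetical illustration of the idea (function and variable names are invented; this is not the actual DeepSpeed code): record the stream once on the leaf module's coalesced buffer instead of once per parameter.

```python
import torch

def release_leaf_module(params, coalesced_buffer, stream):
    """Release the parameters of a Z3 leaf module as one unit.

    Instead of calling record_stream() for every small parameter:
        for p in params:
            p.data.record_stream(stream)
    record the stream once on the coalesced buffer the parameter
    views were sliced from.
    """
    coalesced_buffer.record_stream(stream)
    for p in params:
        # Drop the view into the coalesced buffer; the buffer itself is
        # freed safely once the recorded stream finishes using it.
        p.data = torch.empty(0, dtype=p.dtype, device=p.device)
```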
---------
Co-authored-by: Ma, Guokai <guokai.ma@gmail.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
- Removed the try/except from the `__init__` file in fp_quantizer and added a
single entry point instead
- Renamed the file fp8_gemm to fp8_gemm_triton, and the function matmul_fp8
to matmul_fp8_triton
- Added a new entry point fp8_gemm containing matmul_fp8; if the system
supports Triton it calls the Triton implementation, otherwise it calls
the fallback (see the sketch below)
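A hypothetical sketch of the dispatch pattern described above (module paths, the fallback name, and the argument list are assumptions, not the actual DeepSpeed code):

```python
def matmul_fp8(inp, weight, scale, quantization_group_size):
    """Single entry point: use the Triton kernel when available, else a fallback."""
    try:
        # Assumed import paths; the real package layout may differ.
        from fp8_gemm_triton import matmul_fp8_triton
        return matmul_fp8_triton(inp, weight, scale, quantization_group_size)
    except ImportError:
        from fp8_gemm_fallback import matmul_fp8_fallback
        return matmul_fp8_fallback(inp, weight, scale, quantization_group_size)
```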
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Hi, I found an error when using DeepSpeed with ROCm torch:
```
torch_cuda_version = ".".join(torch.version.cuda.split('.')[:2])
```
This raises an `AttributeError` when `torch.version.cuda` is `None`. It
occurs because the CUDA version in rocm-torch's `version.py` is always
set to `None`, leading to potential runtime errors in environments where
ROCm is being used.
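A minimal sketch of the kind of guard that avoids the crash (illustrative; the actual fix in the PR may differ):

```python
import torch

# torch.version.cuda is None on ROCm builds (torch.version.hip is set instead),
# so guard before splitting the version string.
if torch.version.cuda is not None:
    torch_cuda_version = ".".join(torch.version.cuda.split('.')[:2])
else:
    torch_cuda_version = "0.0"
```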
---------
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
**Problem**
There's an edge-case in DeepSpeed, where if all three of the following
are true:
1. Deepspeed activation checkpointing is applied
2. The user passes `checkpointable_layers` (e.g.
`megatron/model/gpt2_model.py#L175` at commit `f532580567`)
3. The user's model class contains `GPT2ModelPipe` or `GPTModelPipe`
Then the `checkpointable_layers` will not be activation checkpointed.
**Reason**
This is because in the current logic, `_is_checkpointable` will
short-circuit to just return layers matching
`ParallelTransformerLayerPipe` in the case of `self.__class__.__name__
in ('GPTModelPipe', 'GPT2ModelPipe')`. See
`deepspeed/runtime/pipe/module.py#L653` at commit `da771ed42e`.
**Proposed Fixes**
I think that `checkpointable_layers` should always be checked for, and I
added logic to this effect (see the sketch below). I also found the
documentation for `checkpointable_layers` confusing and contradictory,
so I updated the docstring. Lastly, I added a unit test for
`checkpointable_layers`.
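An illustrative sketch of the proposed check order (not necessarily the exact code added by the PR):

```python
import torch

def _is_checkpointable(self, funcs):
    # An explicitly provided checkpointable_layers list should always win.
    if self.checkpointable_layers is not None:
        return all(f.__class__.__name__ in self.checkpointable_layers for f in funcs)

    # Only fall back to the GPT-specific heuristic when no list was given.
    if self.__class__.__name__ in ('GPTModelPipe', 'GPT2ModelPipe'):
        return all('ParallelTransformerLayerPipe' in f.__class__.__name__ for f in funcs)

    # Otherwise, checkpoint any stage whose layers own parameters.
    params = [f.parameters() for f in funcs if isinstance(f, torch.nn.Module)]
    return any(len(list(p)) > 0 for p in params)
```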
---------
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
* This commit addresses DeepSpeed issue
[#6718](https://github.com/microsoft/DeepSpeed/issues/6718)
* The existing code has been using the grad_acc node hook to reduce
param grads.
Constructs such as `param.data = replicated_tensor.data` used in
`allgather_params(..)`
are compiled into `param.set()`, which prevents the hook assigned to the
grad_acc node from being called.
* Starting from PyTorch 2.1 there is a new and robust hook API on the
param itself: `param.register_post_accumulate_grad_hook(..)`
* This commit makes use of the proper API depending on the PyTorch
version (see the sketch below)
* It also disables compile for PyTorch versions < 2.1
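A hypothetical sketch of the version-dependent registration (the helper name and overall structure are invented, not the exact DeepSpeed code):

```python
import torch
from packaging import version

def register_grad_ready_hook(param: torch.nn.Parameter, reduce_fn):
    if version.parse(torch.__version__) >= version.parse("2.1"):
        # Hook attached to the parameter itself; fires after the gradient
        # has been accumulated into param.grad. Robust under torch.compile.
        param.register_post_accumulate_grad_hook(lambda p: reduce_fn(p))
    else:
        # Legacy approach: hook on the grad accumulator node of the autograd
        # graph, obtained via a temporary expanded view of the parameter.
        param_tmp = param.expand_as(param)
        grad_acc = param_tmp.grad_fn.next_functions[0][0]
        grad_acc.register_hook(lambda *_: reduce_fn(param))
```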
---------
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Masahiro Tanaka <81312776+tohtana@users.noreply.github.com>
We have encountered an issue with torch.compile and the pipeline
module.
Modifying a member of the module (`micro_offset`) during the forward
function causes torch.compile to restart its analysis and treat the
module as dynamic.
To bypass this issue without significantly changing the way the
pipeline module works, we propose to compile only the layers in the
pipeline module instead of the forward function of the pipeline module
(see the sketch below). This should still give most of the benefit of
torch.compiling the pipeline module while avoiding the issue.
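An illustrative sketch of the idea (attribute names are assumptions about `PipelineModule` internals, not a quote of the actual change):

```python
import torch

def compile_pipeline_layers(pipeline_module):
    # Compile each layer individually instead of pipeline_module.forward,
    # so mutating self.micro_offset inside forward() does not invalidate
    # the compiled graph or mark the module as dynamic.
    for idx, layer in enumerate(pipeline_module.forward_funcs):
        if isinstance(layer, torch.nn.Module):
            pipeline_module.forward_funcs[idx] = torch.compile(layer)
```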
---------
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Fix #6851.
Initialize the communication backend to fix an error caused by an
all_reduce call in the Domino transformer layer (see the sketch below).
Verified correctness in a local test.
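A minimal sketch of the kind of initialization involved (placement is an assumption; the actual PR may initialize it elsewhere):

```python
import deepspeed.comm as dist

# Ensure the communication backend is up before the Domino transformer
# layer issues its first all_reduce.
if not dist.is_initialized():
    dist.init_distributed()
```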
---------
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Inside reduce_independent_p_g_buckets_and_remove_grads and
reduce_ipg_grads, which are executed during the BWD hook in ZeRO-2,
the model param is stored inside params_in_ipg_bucket.
torch.compile has a hard time tracing parameters.
By using the param's static index inside the group, the same logic can be
maintained with less complexity (see the sketch below).
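An illustrative toy sketch of the bookkeeping change (class and attribute names are invented; not the actual ZeRO-2 code):

```python
class IpgBucketSketch:
    """Toy stand-in showing index-based bookkeeping instead of storing Parameters."""

    def __init__(self, param_groups):
        self.param_groups = param_groups   # e.g. the bit16 parameter groups in ZeRO-2
        self.params_in_ipg_bucket = []     # holds (group_idx, param_idx) pairs

    def add(self, group_idx, param_idx):
        # Store plain ints, which torch.compile traces without trouble.
        self.params_in_ipg_bucket.append((group_idx, param_idx))

    def params(self):
        # Recover the actual parameters only when they are needed.
        for g, p in self.params_in_ipg_bucket:
            yield self.param_groups[g][p]
```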
---------
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Logan Adams <loadams@microsoft.com>
**Auto-generated PR to update version.txt after a DeepSpeed release**
Released version - 0.16.2
Author - @loadams
Co-authored-by: loadams <loadams@users.noreply.github.com>
As title says.
Default behavior of the Arctic model produces shape issues with AutoTP due
to the MLP layer performing `w2 * act(w1*w3)`. However, the method provided
to fix Mixtral-7x8b in #5257 does not work, since the MLP for Arctic is
also used within a ModuleList for the MoE. This results in MLP weights
hiding behind individual experts as layers `#.w#`, which is not caught
by the fix in #5257. This adds the check directly within replace, where
it can check the actual layer names for the `w2` key in the model to
patch with `all_reduce`.
---------
Signed-off-by: Daniel Huang <daniel1.huang@intel.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
This fixes a bug introduced in #6845, which breaks the `no-torch`
workflow that we require in order to do releases, where we do not require
torch to be in the environment when building an sdist. This adds the
same logic to the CPU accelerator that the CUDA accelerator already had,
where we don't require torch to be installed to build the whl.
This PR aims to add an MLP/lm_head tp size granularity setting to the
deepspeed.init_inference() API, making it more flexible to set the
MLP/lm_head sharding grain size.
DNN libraries favor tensor sizes with power-of-2 granularity, so we pick
64 as the default size.
We aim to be able to set the MLP/lm_head tp grain size flexibly. This is
a preliminary solution; if there is a better solution, we can discuss it
together. Thanks~
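A hypothetical sketch of how such a setting might be passed (the keyword name `tp_grain_size` and its placement are assumptions for illustration only):

```python
import torch
import deepspeed

# model = ...  # the model to shard

engine = deepspeed.init_inference(
    model,
    dtype=torch.float16,
    tensor_parallel={
        "tp_size": 4,
        "tp_grain_size": 64,  # shard MLP/lm_head in multiples of 64 columns
    },
)
```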
---------
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Changes from https://github.com/huggingface/transformers/pull/34966
caused the `nv-torch-latest-v100` tests to fail with the following
error:
```
File "/tmp/azureml/cr/j/e4bfd57a509846d6bbc4914639ad248d/exe/wd/actions-runner/_work/DeepSpeed/DeepSpeed/unit-test-venv/lib/python3.10/site-packages/transformers/modeling_utils.py", line 3941, in from_pretrained
raise EnvironmentError(
OSError: Can't load the model for 'hf-internal-testing/tiny-random-VisionEncoderDecoderModel-vit-gpt2'. If you were trying to load it from 'https://huggingface.co/models', make sure you don't have a local directory with the same name. Otherwise, make sure 'hf-internal-testing/tiny-random-VisionEncoderDecoderModel-vit-gpt2' is the correct path to a directory containing a file named pytorch_model.bin, tf_model.h5, model.ckpt or flax_model.msgpack.
```
Sample failure here:
https://github.com/microsoft/DeepSpeed/actions/runs/12169422174/job/33942348835?pr=6794#step:8:3506
This was resolved on the Transformers side here:
https://github.com/huggingface/transformers/pull/35236
### Comment out or delete `accelerate_name="cpu"` when `xpu` is not detected.
When `xpu` is not detected, the code simply passes at lines 68 to 74 if
`DS_ACCELERATOR` is set. However, when `DS_ACCELERATOR` is not set, `cpu`
is assigned to `accelerate_name` if it cannot import
`intel_extension_for_pytorch` or find `xpu`, namely at lines 125 to 133.
I found this problem yesterday and spent a whole afternoon figuring it
out. I had `intel_extension_for_pytorch` installed as a dependency of
another package that I do not actually use and was not aware of. I then
found that `cpu` is assigned to `accelerate_name` directly if `xpu`
cannot be found, which interferes with `cuda` detection (see the sketch
below). In fact, `cpu` will be assigned in the end anyway if `cuda` is
not detected either, at lines 170 to 177.
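An illustrative sketch of the detection order being argued for (heavily simplified; not the actual `real_accelerator.py` code):

```python
import importlib.util
import torch

def detect_accelerator_name():
    # If intel_extension_for_pytorch is importable but no XPU device is
    # actually present, do NOT settle for "cpu" here...
    if importlib.util.find_spec("intel_extension_for_pytorch") is not None:
        import intel_extension_for_pytorch  # noqa: F401  (registers torch.xpu)
        if hasattr(torch, "xpu") and torch.xpu.is_available():
            return "xpu"
    # ...so that the CUDA check still gets a chance to run.
    if torch.cuda.is_available():
        return "cuda"
    return "cpu"  # final fallback only after all device checks fail
```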
---------
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Logan Adams <loadams@microsoft.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>