pytest 8.4.0 seems to break a number of our tests. Rather than pinning it
in each workflow individually, we should just pin it in the requirements
file until we resolve the issue.
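A minimal illustration of the intended pin (the exact requirements file
and version bound are assumptions, not taken from this PR):
```
# requirements/requirements-dev.txt (illustrative)
pytest<8.4.0
```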
---------
Co-authored-by: Olatunji Ruwase <tjruwase@gmail.com>
This fix is required to prevent the error below:
```
=================================== FAILURES ===================================
__________________ TestFp8ComposabilityAcrossZero.test[fp16] ___________________
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
  File "/opt/conda/envs/py_3.10/lib/python3.10/multiprocessing/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "/opt/conda/envs/py_3.10/lib/python3.10/multiprocessing/pool.py", line 51, in starmapstar
    return list(itertools.starmap(args[0], args[1]))
  File "/root/PR/test/DeepSpeed/tests/unit/common.py", line 322, in _dist_run
    raise e
  File "/root/PR/test/DeepSpeed/tests/unit/common.py", line 314, in _dist_run
    self.run(**self._fixture_kwargs)
  File "/root/PR/test/DeepSpeed/tests/unit/common.py", line 470, in run
    self._current_test(**fixture_kwargs)
  File "/root/PR/test/DeepSpeed/tests/unit/runtime/half_precision/test_fp8.py", line 88, in test
    loss = run_zero(stage, model_dtype)
  File "/root/PR/test/DeepSpeed/tests/unit/runtime/half_precision/test_fp8.py", line 74, in run_zero
    model.step()
  File "/root/PR/test/DeepSpeed/deepspeed/runtime/engine.py", line 2387, in step
    self._take_model_step(lr_kwargs)
  File "/root/PR/test/DeepSpeed/deepspeed/runtime/engine.py", line 2290, in _take_model_step
    self.optimizer.step()
  File "/root/PR/test/DeepSpeed/deepspeed/runtime/fp16/fused_optimizer.py", line 255, in step
    self.timers(OVERFLOW_CHECK_TIMER).start()
TypeError: 'NoneType' object is not callable
"""
```
Co-authored-by: Olatunji Ruwase <tjruwase@gmail.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Olatunji Ruwase <tunji.ruwase@snowflake.com>
Fixes this warning:
```
/fsx/qgallouedec/miniconda3/envs/trl/lib/python3.12/site-packages/deepspeed/runtime/config_utils.py:100: PydanticDeprecatedSince211: Accessing the 'model_fields' attribute on the instance is deprecated. Instead, you should access this attribute from the model class. Deprecated in Pydantic V2.11 to be removed in V3.0.
fields = self.model_fields
```
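The warning is about instance-level access; the fix is to read the
attribute from the model class instead. A minimal sketch of the change
(the surrounding method is elided):
```python
# Pydantic >= 2.11 deprecates reading model_fields off an instance;
# reading it off the class returns the same mapping without the warning.
fields = type(self).model_fields   # was: fields = self.model_fields
```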
Co-authored-by: Olatunji Ruwase <tjruwase@gmail.com>
Fix minor indentation, typo, and list-numbering issues in the Ulysses
Plus tutorial.
---------
Co-authored-by: Stas Bekman <stas00@users.noreply.github.com>
This is the DeepSpeed counterpart of
https://github.com/snowflakedb/ArcticTraining/pull/45, as the new
feature(s) require changes on both sides.
For PR reviewers:
Readiness status:
- [x] Code
- [x] Tests
- [ ] Docs - working on it
Features:
- [x] add support for delaying grad addition via
`param.ds_grad_is_ready` flag (used when performing tiled compute in an
autograd function)
- [x] add light sp-only mpu version (Jeff Rasley)
- [x] improved debug
- [x] added `all_gather_object` to `dist`
- [x] `UlyssesSPAttentionHF` (port of UlyssesAttention from
Megatron-Deepspeed plus modern MHA-variations)
- [x] `UlyssesSPDataLoaderAdapter` - DL adapter to shard the normal DL
batches to be used by `UlyssesSPAttentionHF`
- [x] `SequenceTiledCompute` - generic autograd function to perform
compute after tiling on the sequence dimension
- [x] `TiledMLP` - a specific autograd function to perform tiled MLP
(it's much easier to understand before trying to grok
`SequenceTiledCompute`; a minimal sketch of the tiling idea follows this list)
- [x] added a differentiable `_DimZeroAllToAll` (Samyam Rajbhandari)
- [x] torch-dist-check now allows `torch.distributed.nn` (which is
needed since deepspeed's dist is not up to date with
`torch.distributed.nn`)
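To make the tiled-compute idea concrete, here is a minimal, hypothetical
sketch (not the DeepSpeed implementation; `TiledMLPSketch` and
`num_shards` are illustrative names): the forward runs the MLP shard by
shard under `no_grad`, and the backward recomputes one shard at a time,
so only one shard's intermediate activations are ever live.
```python
import torch

class TiledMLPSketch(torch.autograd.Function):

    @staticmethod
    def forward(ctx, x, mlp, num_shards):
        ctx.mlp = mlp
        ctx.num_shards = num_shards
        ctx.save_for_backward(x)
        with torch.no_grad():
            # Forward each sequence shard independently so a shard's
            # intermediate activations are freed before the next shard runs.
            out = torch.cat([mlp(s) for s in x.chunk(num_shards, dim=1)], dim=1)
        return out

    @staticmethod
    def backward(ctx, grad_out):
        (x,) = ctx.saved_tensors
        grad_x = torch.empty_like(x)
        offset = 0
        for xs, gs in zip(x.chunk(ctx.num_shards, dim=1),
                          grad_out.chunk(ctx.num_shards, dim=1)):
            xs = xs.detach().requires_grad_(True)
            with torch.enable_grad():
                ys = ctx.mlp(xs)  # recompute this shard's activations
            torch.autograd.backward(ys, gs)  # grads accumulate into mlp params
            grad_x[:, offset:offset + xs.shape[1]] = xs.grad
            offset += xs.shape[1]
        return grad_x, None, None
```
Used as `out = TiledMLPSketch.apply(hidden_states, mlp, 4)`, this trades
one extra MLP recompute for a roughly shard-count reduction in peak
activation memory.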
---------
Signed-off-by: Stas Bekman <stas.bekman@snowflake.com>
Signed-off-by: Stas Bekman <stas@stason.org>
Co-authored-by: Stas Bekman <stas.bekman@snowflake.com>
Co-authored-by: Jeff Rasley <jerasley@microsoft.com>
Co-authored-by: Olatunji Ruwase <tunji.ruwase@snowflake.com>
This is a follow-up to https://github.com/deepspeedai/DeepSpeed/pull/923 -
my original code was copied from transformers, which has a different
filesystem layout, and I missed that. This PR fixes it to actually do the
right thing.
Now you can have multiple clones of deepspeed and the tests will use the
local repo automatically and not the pre-installed deepspeed.
These days fp16 is barely ever used, so we should be testing bf16
instead of fp16 where possible.
I had to fix a bunch of tests to adapt to this change, and fixed a few
bugs along the way.
---------
Signed-off-by: Stas Bekman <stas.bekman@snowflake.com>
Co-authored-by: Olatunji Ruwase <tunji.ruwase@snowflake.com>
Co-authored-by: Stas Bekman <stas.bekman@snowflake.com>
## Description
This PR fixes an issue where gradient clipping modifications are not
reflected in the global gradient norm calculation when CPU offloading is
enabled. The issue occurs because the `averaged_gradients` are not being
updated with the clipped gradients when CPU offloading is active.
## Problem
When using CPU offloading with gradient clipping:
1. The gradients are successfully clipped using `safe_set_local_grad`.
2. However, the `_global_grad_norm` calculation still uses the original,
unclipped gradients.
3. This leads to incorrect gradient norm reporting and can undermine the
effectiveness of gradient clipping.
## Solution
The fix ensures that the `averaged_gradients` are properly updated with
the clipped gradients when CPU offloading is enabled, similar to how it
works when CPU offloading is disabled.
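For context, a hedged repro sketch of the pattern this fixes, using
DeepSpeed's gradient access API (`model` and `clip_value` are assumed
names):
```python
from deepspeed.utils import safe_get_local_grad, safe_set_local_grad

# Clip each local gradient partition in place; after the fix, the engine's
# reported global grad norm reflects these clipped values even when CPU
# offloading is enabled.
for p in model.parameters():
    grad = safe_get_local_grad(p)
    if grad is not None:
        safe_set_local_grad(p, grad.clamp(-clip_value, clip_value))
```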
## Testing
The fix has been tested with:
- CPU offloading enabled and disabled
- Different gradient clipping values
- A simple model with linear layers
- Both FP16 and BF16
## Related Issues
Fixes #7292
---------
Signed-off-by: Naveenraj Kamalakannan <therealnaveenkamal@gmail.com>
Signed-off-by: Armin Zhu <mingzhengzhu1998@gmail.com>
Fix the memory usage of ZeRO-Offload with stages 1 and 2. Before the fix,
GPU memory usage was about 3x the size of the FP16 params. This was
caused by the H2D data copy using a different data type. Now GPU memory
usage is about 1x the FP16 params, and the H2D copy needs a 16-bit pinned
memory buffer.
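Illustratively (this is a standalone sketch, not the DeepSpeed
internals), copying through a pinned staging buffer of the same 16-bit
dtype avoids materializing an intermediate tensor of a wider dtype on
the device:
```python
import torch

params_fp16 = torch.randn(1 << 20, dtype=torch.float16)  # host-side FP16 params
pinned = torch.empty_like(params_fp16).pin_memory()      # 16-bit pinned staging buffer
pinned.copy_(params_fp16)
device_buf = torch.empty_like(pinned, device='cuda')
device_buf.copy_(pinned, non_blocking=True)              # async H2D, no dtype conversion
```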
Some params are one-dimensional; this PR adds support for them.
Resolves #7249
```log
param.shape torch.Size([768, 1536])
param.shape torch.Size([768])
...
```
```log
with deepspeed.module_inject.layers.GatherReplacedLayerParams([param], model, enabled=True):
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "deepspeed/module_inject/layers.py", line 359, in __enter__
self.params[0].gather_params(self.params)
File "torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "deepspeed/module_inject/layers.py", line 473, in gather_params
param.shape[1],
~~~~~~~~~~~^^^
IndexError: tuple index out of range
```
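A minimal sketch of the guard this implies (illustrative, not the exact
patch): only index `param.shape[1]` when the param actually has two
dimensions, since biases and norm weights are 1-D.
```python
# 1-D params (e.g. biases, LayerNorm weights) have no shape[1]; guard the
# second-dim access instead of indexing past the end of the shape tuple.
second_dim = param.shape[1] if param.dim() > 1 else 1
```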
---------
Signed-off-by: Hollow Man <hollowman@opensuse.org>
Signed-off-by: inkcherry <mingzhi.liu@intel.com>
Co-authored-by: Hongwei Chen <33092912+hwchen2017@users.noreply.github.com>
Co-authored-by: inkcherry <mingzhi.liu@intel.com>
XCCL will be used for the XPU device on PyTorch 2.8. With this support we
can remove torch-ccl on the XPU device, while also preserving the old
path for enabling torch-CCL.
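For illustration, a sketch of the two paths, under the assumption that
PyTorch 2.8 exposes the built-in XCCL process-group backend for XPU under
the backend name `xccl`:
```python
import torch
import torch.distributed as dist

# On PyTorch >= 2.8 with an XPU device, the native XCCL backend can be used
# directly, without importing the external torch-ccl package.
if torch.xpu.is_available() and hasattr(dist, 'is_xccl_available') \
        and dist.is_xccl_available():
    dist.init_process_group(backend='xccl')
else:
    import oneccl_bindings_for_pytorch  # noqa: F401  # legacy torch-ccl path
    dist.init_process_group(backend='ccl')
```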
---------
Signed-off-by: yisheng <yi.sheng@intel.com>
Co-authored-by: Ma, Guokai <guokai.ma@gmail.com>
With the current code, `extra_repr_str` will be undefined if
`self.weight` is None.
In addition, under ZeRO-3 the shape is stored in `ds_shape`, so we also
need to check for that (although AutoTP currently doesn't support
ZeRO-3).
```logs
File "deepspeed/__init__.py", line 394, in tp_model_init
model = TpTrainingManager(model=model, tp_size=tp_size, dtype=dtype).module
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "deepspeed/runtime/tensor_parallel/tp_manager.py", line 35, in __init__
self._apply_policies(parser_dict)
File "deepspeed/runtime/tensor_parallel/tp_manager.py", line 47, in _apply_policies
self._apply_injection_policy(self.config, client_module)
File "deepspeed/runtime/tensor_parallel/tp_manager.py", line 53, in _apply_injection_policy
replace_transformer_layer(client_module, self.module, None, self.config, self.model_config)
File "deepspeed/module_inject/replace_module.py", line 400, in replace_transformer_layer
replaced_module = replace_module(model=model,
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "deepspeed/module_inject/replace_module.py", line 653, in replace_module
replaced_module, _ = _replace_module(model, policy, state_dict=sd)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "deepspeed/module_inject/replace_module.py", line 713, in _replace_module
_, layer_id = _replace_module(child,
^^^^^^^^^^^^^^^^^^^^^^
File "deepspeed/module_inject/replace_module.py", line 713, in _replace_module
_, layer_id = _replace_module(child,
^^^^^^^^^^^^^^^^^^^^^^
File "deepspeed/module_inject/replace_module.py", line 689, in _replace_module
replaced_module = policies[child.__class__][0](child,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "deepspeed/module_inject/replace_module.py", line 333, in replace_fn
new_module = replace_wo_policy(child, _policy, prefix=prefix, state_dict=state_dict)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "deepspeed/module_inject/replace_module.py", line 316, in replace_wo_policy
return _autotp._replace_module(module)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "deepspeed/module_inject/auto_tp.py", line 481, in _replace_module
self._replace_module(child, name, class_name)
File "deepspeed/module_inject/auto_tp.py", line 466, in _replace_module
setattr(r_module, name, self.linear_policies[child.__class__](child, prev_name + '.' + name,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "deepspeed/module_inject/auto_tp.py", line 361, in _replace
if 'Yuan' in str(self.module):
^^^^^^^^^^^^^^^^
File "torch/nn/modules/module.py", line 2940, in __repr__
mod_str = repr(module)
^^^^^^^^^^^^
File "torch/nn/modules/module.py", line 2940, in __repr__
mod_str = repr(module)
^^^^^^^^^^^^
File "torch/nn/modules/module.py", line 2934, in __repr__
extra_repr = self.extra_repr()
^^^^^^^^^^^^^^^^^
File "deepspeed/module_inject/layers.py", line 267, in extra_repr
out_features, in_features = self.weight.shape[-2:] if self.weight is not None else (None, None)
^^^^^^^^^^^^^^^^^^^^^^^^^
ValueError: not enough values to unpack (expected 2, got 1)
```
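A hedged sketch of shape handling that avoids both failure modes
(illustrative, not the exact patch): always return a string even when the
weight is None, prefer `ds_shape` when the param is ZeRO-3 partitioned,
and tolerate 1-D shapes.
```python
def extra_repr(self):
    # Always produce a repr string, even when self.weight is None.
    if self.weight is None:
        return 'in_features=None, out_features=None, bias=None'
    # Under ZeRO-3 the real shape lives in ds_shape, not weight.shape.
    shape = getattr(self.weight, 'ds_shape', self.weight.shape)
    if len(shape) >= 2:
        out_features, in_features = shape[-2:]
    else:
        out_features, in_features = shape[0], None
    return f'in_features={in_features}, out_features={out_features}, bias={self.bias is not None}'
```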
Signed-off-by: Hollow Man <hollowman@opensuse.org>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Until we sort out the new license situation, disable this check so that
new code not owned by MSFT can be added.
---------
Signed-off-by: Stas Bekman <stas@stason.org>
# PR Summary
This small PR resolves deprecation warnings caused by the use of
`distutils.spawn.find_executable`:
```python
DeprecationWarning: Use shutil.which instead of find_executable
```
Please note that `find_executable` has been deprecated since Python 3.10
and was removed in Python 3.12, while `shutil.which` has been available
since Python 3.3.
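The migration is a one-line, drop-in swap (the `nvcc` lookup is just an
illustrative example):
```python
import shutil

# was: from distutils.spawn import find_executable; find_executable('nvcc')
nvcc_path = shutil.which('nvcc')  # returns the path or None, same contract
```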
Signed-off-by: Emmanuel Ferdman <emmanuelferdman@gmail.com>
Co-authored-by: Olatunji Ruwase <tunji.ruwase@snowflake.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>