**Auto-generated PR to update version.txt after a DeepSpeed release**
Released version - 0.16.7
Author - @loadams
Co-authored-by: loadams <loadams@users.noreply.github.com>
When `with_cuda` is left as the default `None`, Torch decides whether to
add CUDA headers and libraries to the build (and whether to hipify the
JIT C++ extension) based on the presence of `.cu` or `.cuh` files in
`sources`.
2a909cab16/torch/utils/cpp_extension.py (L1623-L1627)
However, some ops, such as DeepCompile, have no `.cu` or `.cuh` files in
their sources yet still need to be hipified on AMD, because their C++
code includes several CUDA headers. It is therefore better to control
this behavior explicitly whenever the build is not `build_for_cpu`;
otherwise the hipify step will get skipped.
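As a rough sketch of the idea (hypothetical loader code, not the exact `op_builder` change), the builder can set `with_cuda` explicitly instead of relying on the `.cu`/`.cuh` detection:
```python
from torch.utils.cpp_extension import load

def load_deepcompile_op(sources, build_for_cpu=False):
    # `sources` may contain only .cpp files that include CUDA headers,
    # so don't let torch infer CUDA/HIP support from the file extensions.
    return load(
        name="deepcompile_op",
        sources=sources,
        with_cuda=not build_for_cpu,  # force hipify on ROCm builds
        verbose=True,
    )
```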
Signed-off-by: Hollow Man <hollowman@opensuse.org>
Similar to #7211
When the optimizer is not specified (e.g. for ZeRO-3 pure inference),
the optimizer will be of type `DeepSpeedZeRoOffload` instead of
`DeepSpeedZeroOptimizer_Stage3`, and `DeepSpeedZeRoOffload` has no
`parameter_offload` attribute.
56005d2b25/deepspeed/runtime/engine.py (L1684-L1707)
```log
File "deepspeed/runtime/engine.py", line 3919, in compile
backend = init_z3(self, backend, compile_config, compile_kwargs, schedule)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "deepspeed/compile/init_z3.py", line 36, in init_z3
optimizer.parameter_offload._remove_module_hooks()
^^^^^^^^^^^^^^^^^^^^^^^^^^^
AttributeError: 'DeepSpeedZeRoOffload' object has no attribute 'parameter_offload'
```
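A minimal sketch of the kind of guard that avoids this (illustrative only, not necessarily the exact change in this PR), assuming `DeepSpeedZeRoOffload` itself owns the module hooks while `DeepSpeedZeroOptimizer_Stage3` nests them under `parameter_offload`:
```python
# In ZeRO-3 training, `optimizer` is a DeepSpeedZeroOptimizer_Stage3 whose
# `parameter_offload` attribute is a DeepSpeedZeRoOffload; in pure inference
# the engine hands us the DeepSpeedZeRoOffload directly.
offload = getattr(optimizer, "parameter_offload", optimizer)
offload._remove_module_hooks()
```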
---------
Signed-off-by: Hollow Man <hollowman@opensuse.org>
Signed-off-by: Logan Adams <loadams@microsoft.com>
Co-authored-by: Masahiro Tanaka <81312776+tohtana@users.noreply.github.com>
Co-authored-by: Logan Adams <loadams@microsoft.com>
We should use `torch.utils.cpp_extension.ROCM_HOME` for ROCm pytorch.
```log
Traceback (most recent call last):
File "<string>", line 2, in <module>
File "<pip-setuptools-caller>", line 34, in <module>
File "DeepSpeed/setup.py", line 195, in <module>
builder.hipify_extension()
File "DeepSpeed/op_builder/builder.py", line 750, in hipify_extension
header_include_dirs=self.include_paths(),
^^^^^^^^^^^^^^^^^^^^
File "DeepSpeed/op_builder/dc.py", line 32, in include_paths
return ['csrc/includes', os.path.join(torch.utils.cpp_extension.CUDA_HOME, "include")]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "<frozen posixpath>", line 76, in join
TypeError: expected str, bytes or os.PathLike object, not NoneType
```
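A minimal sketch of the intended selection (assuming the builder already knows whether it targets ROCm):
```python
import os
from torch.utils.cpp_extension import CUDA_HOME, ROCM_HOME

def include_paths(is_rocm_pytorch: bool):
    # CUDA_HOME is None on ROCm builds, so fall back to ROCM_HOME there.
    gpu_home = ROCM_HOME if is_rocm_pytorch else CUDA_HOME
    return ['csrc/includes', os.path.join(gpu_home, "include")]
```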
Signed-off-by: Hollow Man <hollowman@opensuse.org>
This PR adds a missing line for scheduling in the Z3 pass and fixes
attribute names in the profiler.
Signed-off-by: Masahiro Tanaka <mtanaka@microsoft.com>
**Auto-generated PR to update version.txt after a DeepSpeed release**
Released version - 0.16.6
Author - @tohtana
Co-authored-by: tohtana <tohtana@users.noreply.github.com>
This PR introduces *DeepCompile*, a new feature that efficiently
integrates compiler optimizations with other DeepSpeed features.
DeepCompile utilizes torch's dynamo to capture the computation graph and
modifies it to incorporate DeepSpeed’s optimizations seamlessly.
Currently, DeepCompile supports ZeRO-1 and ZeRO-3, with enhancements
such as proactive prefetching and selective unsharding to improve
performance.
(More details will be added later.)
---------
Signed-off-by: Masahiro Tanaka <mtanaka@microsoft.com>
Signed-off-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: zafarsadiq <zafarsadiq120@gmail.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
When the optimizer is not specified (e.g. for ZeRO-3 pure inference),
the optimizer will be of type `DeepSpeedZeRoOffload` instead of
`DeepSpeedZeroOptimizer_Stage3`, and `DeepSpeedZeRoOffload` does not
implement the methods `reload_states` and `offload_states`.
56005d2b25/deepspeed/runtime/engine.py (L1684-L1707)
```log
File "deepspeed/runtime/engine.py", line 3904, in offload_states
self.optimizer.offload_states(include=include, device=device, pin_memory=pin_memory, non_blocking=non_blocking)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AttributeError: 'DeepSpeedZeRoOffload' object has no attribute 'offload_states'
```
In addition, https://github.com/deepspeedai/DeepSpeed/pull/6855 appears
to have forgotten to remove the `assert not self.zero_offload_param()`
check. As suggested in
https://github.com/deepspeedai/DeepSpeed/issues/6833#issuecomment-2537295310,
that call returns None when `offload_param` is not given, and the newly
added assertions already cover these cases, so this PR also removes the
old check.
Signed-off-by: Hollow Man <hollowman@opensuse.org>
I want to reuse a composed module in the pipeline. For example, the
following `MyModule` has a member `linear`, which is also a module.
```python
class MyModule(torch.nn.Module):
    def __init__(self, n_in: int, n_out: int):
        super().__init__()
        self.linear = torch.nn.Linear(n_in, n_out)
        self.layer_norm = torch.nn.LayerNorm(n_out)

    def forward(self, data: torch.Tensor) -> torch.Tensor:
        hidden = self.linear(data)
        hidden = self.layer_norm(hidden)
        return hidden
```
`MyModule.linear.weight` should be synchronized among related ranks. As
a result, I add `linear.weight` to `TiedLayerSpec.tied_weight_attr`.
BTW, I generate the whole `tied_weight_attr` by the following
instruction.
```python
tied_weight_attr = [name for name, p in layer.named_parameters() if p.numel() > 1]
```
However, the builtin `getattr` used by `PipelineModule` fails to find a
nested attribute like `linear.weight`.
Hence, this PR extends the builtin `getattr` to a recursive version,
`PipelineModule._recursive_getattr`, which resolves each attribute
segment one by one.
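The idea can be sketched as follows (simplified; the actual method lives on `PipelineModule` and may differ in detail):
```python
def _recursive_getattr(module, attr_path: str):
    # Resolve a dotted attribute path such as "linear.weight"
    # by walking one segment at a time.
    obj = module
    for segment in attr_path.split("."):
        obj = getattr(obj, segment)
    return obj

# _recursive_getattr(my_module, "linear.weight") returns the nested Parameter
```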
Meanwhile, the order of tied weights matters for synchronization, so
this PR also sorts the tie keys in `PipelineModule._index_tied_modules`
to avoid hangs.
Signed-off-by: Mingjie Li <limingjie@chinamobile.com>
Co-authored-by: Mingjie Li <limingjie@chinamobile.com>
Co-authored-by: Masahiro Tanaka <81312776+tohtana@users.noreply.github.com>
Warning:
```
site-packages/deepspeed/runtime/config_utils.py:64: PydanticDeprecatedSince211: Accessing this attribute on the instance is deprecated, and will be removed in Pydantic V3. Instead, you should access this attribute from the model class. Deprecated in Pydantic V2.11 to be removed in V3.0.
kwargs = pydantic_config.model_fields[dep_field].json_schema_extra
```
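The fix follows Pydantic's guidance: read `model_fields` from the model class rather than from the instance. A self-contained illustration (the model below is hypothetical, not DeepSpeed's actual config class):
```python
from pydantic import BaseModel, Field

class ExampleConfig(BaseModel):
    old_field: int = Field(0, json_schema_extra={"new_param": "new_field"})

cfg = ExampleConfig()

# Deprecated since Pydantic V2.11: accessing model_fields on the instance
# kwargs = cfg.model_fields["old_field"].json_schema_extra

# Preferred: access it on the model class
kwargs = type(cfg).model_fields["old_field"].json_schema_extra
```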
Signed-off-by: Logan Adams <loadams@microsoft.com>
1. Add an implementation of cross-layer communication overlapping so
that communication becomes effectively "free".
2. Optimize the implementation of communication overlapping within a
transformer layer.
Signed-off-by: Hongwei Chen <hongweichen@microsoft.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
**Auto-generated PR to update version.txt after a DeepSpeed release**
Released version - 0.16.5
Author - @loadams
Co-authored-by: loadams <loadams@users.noreply.github.com>
# Background and rationale
In many use cases, particularly LLMs, one is faced with inputs
(sentences) of variable lengths. A common practice is to pack batches by
token count rather than by a fixed batch size, i.e. by putting together
sentences whose given metric (e.g. sequence length) adds up to a
user-provided value. As an example, in [Attention is all you
need](https://arxiv.org/abs/1706.03762), section 5.1:
> Sentence pairs were batched together by approximate sequence length. Each training
> batch contained a set of sentence pairs containing approximately 25000 source tokens
> and 25000 target tokens.
Dynamic batch sizes have been requested in [DeepSpeed issue
1051](https://github.com/microsoft/DeepSpeed/issues/1051), [DeepSpeed
issue 3455](https://github.com/microsoft/DeepSpeed/issues/3455),
[Pytorch Lightning issue
16914](https://github.com/Lightning-AI/pytorch-lightning/issues/16914),
[huggingface issue
2647](https://github.com/huggingface/accelerate/issues/2647) and is
available already in many libraries e.g. [NVIDIA
Triton](https://github.com/triton-inference-server/server/blob/main/docs/user_guide/model_configuration.md#dynamic-batcher)
and [Meta FairSeq](https://github.com/facebookresearch/fairseq)
(implementation
[here](34973a94d0/fairseq/data/fairseq_dataset.py (L104))
).
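As a rough illustration of the packing idea (not DeepSpeed's actual implementation), a greedy packer by token count looks like this:
```python
def pack_by_token_count(seqlens, max_tokens):
    """Greedily group sample indices so each batch stays under max_tokens."""
    batches, current, current_tokens = [], [], 0
    for idx, seqlen in enumerate(seqlens):
        if current and current_tokens + seqlen > max_tokens:
            batches.append(current)
            current, current_tokens = [], 0
        current.append(idx)
        current_tokens += seqlen
    if current:
        batches.append(current)
    return batches

# pack_by_token_count([5, 7, 20, 3], max_tokens=30) -> [[0, 1], [2, 3]]
```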
The immediate use case for this is when one needs to maximize GPU
utilization. Moreover, this is particularly relevant for curriculum
learning where a `BxTxE` (Batch x Time x Embedding) -shaped input should
ideally have high `B` and low `T` at the early curriculum steps (many
short sentences packed together as a batch), and low `B` and high `T` at
the late steps (few long sentences in the batch). A dynamic size `T` is
already supported by DeepSpeed, e.g. in the documentation for pipeline
parallelism's
[reset_activation_shape()](https://deepspeed.readthedocs.io/en/stable/pipeline.html#deepspeed.runtime.pipe.engine.PipelineEngine.reset_activation_shape):
> For curriculum learning that changes the seqlen of each sample, we
> need to call this whenever the seqlen is going to change.
However, a dynamic `B` is not supported, and it would require a
corresponding increase/decrease of the learning rate. This technique has
been applied before, and the two most common LR scaling algorithms have
been described as:
1. Linear Scaling Rule: "When the minibatch size is multiplied by k,
multiply the learning rate by k", as in [Accurate, Large Minibatch SGD:
Training ImageNet in 1 Hour, Goyal et
al.](https://arxiv.org/abs/1706.02677)
2. Square Root scaling: "when multiplying the batch size by k, multiply
the learning rate by √k, to keep the variance in the gradient
expectation constant" by [One weird trick for parallelizing
convolutional neural networks, A. Krizhevsky et
al.](https://arxiv.org/abs/1404.5997)
In practice, the user picks the total token count per batch as the
metric that drives batching, instead of batching by sentence count.
At runtime, the variable batch size is computed and the LR is adjusted
accordingly, based on the reference LR and batch size provided in the
config.
# Illustration of dynamic batch size, sequence length and LR
Imagine we picked a limit of `30` tokens per batch, and have set a
reference `lr=1e-3` for a `train_batch_size=2` (in the DeepSpeed
config). The batching algorithm for curriculum learning may pack the
data into batches of short sentences (left) at the early stages, and
batches of long sentences (right) at later stages, e.g.:

Above, we collected samples until we filled up the batch with at most 30
tokens. The batch sizes (number of samples) then became `10` and `4` for
the left and right examples, respectively. Using the linear scaling
rule, the LRs for those batches become `5e-3` and `2e-3`.
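The arithmetic above can be written out as a small sketch (illustrative only, not the signature of DeepSpeed's `scale_lr`):
```python
def scale_lr(ref_lr, ref_batch_size, batch_size, method="linear"):
    # Scale the reference LR to the effective batch size of each batch.
    if method == "linear":
        return ref_lr * batch_size / ref_batch_size
    if method == "sqrt":
        return ref_lr * (batch_size / ref_batch_size) ** 0.5
    return ref_lr  # no scaling

print(scale_lr(1e-3, 2, 10))  # 0.005 -> the batch of 10 short sentences
print(scale_lr(1e-3, 2, 4))   # 0.002 -> the batch of 4 long sentences
```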
# Pipeline parallelism
Pipeline parallelism requires the same batch size and same sequence
length across all micro-batches in a batch, as the activation sizes must
be fixed between gradient accumulation steps. Between batches, these may
change, as long as `engine.reset_activation_shape()` is called so that
the new shapes are communicated on the first gradient accumulation step
of the batch. Enforcing the same `BxTxE` across the micro-batches of a
batch may lead to smaller micro-batches. As an example, below is an
illustration of a 2-node, 2-gradient-accumulation-step (i.e. 4 micro-batches) batching
for the same dataset, when preparing data for the regular DDP (left) and
for the pipeline parallelism use cases (right):

We can see that the pipeline use case (right) has the same `BxTxE` shape
across all 4 micro-batches of the same batch and, in order to respect
that, it packs fewer samples per batch than the standard use case
(left-hand side).
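In code, the shape reset between batches looks roughly like this (a sketch assuming a pipeline `engine` and a dataloader yielding one variable-shape batch of micro-batches per step):
```python
# Shapes are fixed across the micro-batches of one batch, but may change
# between batches, so the activation shapes must be re-communicated.
for step, batch in enumerate(dataloader):
    engine.reset_activation_shape()   # allow a new B and T for this batch
    loss = engine.train_batch(data_iter=iter(batch))
```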
# Attention Head
For an input of size `BxTxE`, the attention mask has shape `TxT` when a
single fixed-size mask is shared across samples of the same length, or
`BxTxT` when each sample needs its own mask (samples of different
lengths, as in the dataset above). This 3D attention matrix can be
illustrated for DDP micro-batch 1 (top-left in the picture above, 4
sentences) as:

Note the memory savings: the attention head has a size of `BxTxT`, i.e.
a linear memory dependency on the batch size `B` and quadratic memory
dependency on the largest sequence length `T` in the (micro-) batch.
Thus, supporting a dynamic size `T` allows for an increase of `B`.
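For reference, such a per-sample `BxTxT` mask can be built from the individual sequence lengths along these lines (a standalone sketch, independent of the example code in this PR):
```python
import torch

def build_attention_mask(seqlens, max_len):
    # valid[b, t] is True for real tokens and False for padding
    valid = torch.arange(max_len)[None, :] < torch.tensor(seqlens)[:, None]
    # position (b, i, j) is attendable iff both token i and token j are real
    return valid[:, :, None] & valid[:, None, :]

mask = build_attention_mask([3, 5], max_len=5)  # shape (2, 5, 5)
```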
# PR overview
This PR implements dynamic batching and LR scaling. The required
dataloader and LR scheduler can be retrieved by calling
`get_dataloader_and_lr_scheduler_for_variable_batch_size`. A brief
explanation of that function follows:
- The logic behind the algorithms for LR scaling is in `scale_lr`;
- The partitioning of samples into batches is done by `batch_by_seqlen`.
- For pipeline parallelism, it is required that all micro-batches in a
pipeline pass have the same activation shapes. This is enabled by
setting the following parameters to `True`:
- `required_microbatches_of_same_sizes` that will force the `B`
dimension to be the same across all gradient accumulation steps of all
dataloaders on a batch;
- `required_microbatches_of_same_lengths` that will force the `T`
dimension to be the same across all gradient accumulation steps. Works
by calling the user-provided `sample_padding_fn(sentence, len)` that
pads a given sentence to the argument length;
- `batch_by_seqlen` returns `microbatch_sample_ids` (the list of sample
ids per micro-batch), `batch_sizes` (the effective batch sizes), and
`batch_max_seqlens` (the longest sequence across all micro-batches in a
batch);
- `dataloader_for_variable_batch_size` relies on `microbatch_sample_ids`
and will iterate/collate/pad samples for every batch and return a
dataloader that iterates the final (variable-size) batches;
- `lr_scheduler_for_variable_batch_size` relies on `batch_sizes` to
compute the learning rate for each effective batch: it takes the
reference batch size and LR from the config file and scales the LR of
each effective batch according to its size, using one of the scaling
rules mentioned above (linear, square root, etc.).
- A special note on the returned `lr_scheduler`, which accepts either:
1. a user-provided `Optimizer`, in which case it scales the learning
rates (in the param groups) at every batch, or
2. a user-defined `LRScheduler`, in which case it first gets the
learning rate from the scheduler and then scales it accordingly.
# Example
An example for the use case with and without pipelining is provided in
file
[`DeepSpeedExamples/training/data_efficiency/variable_batch_size_and_lr/variable_batch_size_and_lr_example.py`](https://github.com/deepspeedai/DeepSpeedExamples/tree/master/training/data_efficiency/variable_batch_size_and_lr).
The example shows an attention head with a variable-sized `BxTxT`
attention per batch, followed by a fixed-size feed-forward network.
These are the main blocks in a Large Language Model. The feed-forward (or
linear layer) that follows the attention head requires a constant input
size, equivalent to the largest sentence in the whole dataset, so the
output of the attention must be padded (see `feedforward: needs to
convert BxTxE to BxMxE by padding extra tokens` in the code).
# Config
The example file also documents the relevant DeepSpeed config entries
with inline comments:
```python
config = {
    "train_batch_size": 16,
    # `train_micro_batch_size_per_gpu` tells how many sequence packs of `max_tokens` each will be collated together.
    # I.e. the number of tokens per micro-batch (i.e. per gpu iteration) is `train_micro_batch_size_per_gpu`*`max_tokens`.
    "train_micro_batch_size_per_gpu": 2,
    "data_efficiency": {
        "enabled": True,
        # seed to be applied to all data efficiency modules, including dynamic batching
        "seed": 42,
        "data_sampling": {
            "num_workers": 0,  # dataloader num_workers argument
            "pin_memory": False,  # dataloader pin_memory argument
            "dynamic_batching": {
                # enables or disables dynamic batching
                "enabled": True,
                # how many tokens we need to fill a pack of sequences (that will be collated together as a sample)
                "max_tokens": 100,
                # path to read the length of every sequence from, or to write it to.
                # Sequence lengths will be loaded from: {metrics_path}/seqlen/seqlen_sample_to_metric.bin and *.idx
                # If the files don't exist, they'll be computed and saved on the first run, and loaded on subsequent runs.
                "metrics_path": "./curriculum_output/",
                # As the batch size increases/decreases, which method should scale the LR accordingly?
                # Options: linear, sqrt (square root), or None to disable
                "lr_scaling_method": "linear",
                # how to pick sentences to be packed into samples:
                # - dataloader: in the same order as they come in with the dataloader
                # - seqlen: by sequence length (shortest to longest)
                # - random: random order using the seed in config['data_efficiency']['seed']
                "sentence_picking_order": "dataloader",  # "random" / "seqlen" / "dataloader"
                # minimum number of sequences required to reach `max_tokens`. If a sentence pack is smaller, it's discarded.
                "min_batch_size": 1,
                # maximum number of sequences allowed in a pack of `max_tokens`. If a sentence pack is larger, it's discarded.
                "max_batch_size": 10,
                # enable the output of micro-batching information about sentence packing
                "verbose": True,
            },
        },
    },
}
```
# Future work
A follow-up PR will enable dynamic batching when calling
`deepspeed.initialize`. I.e. instead of this:
```python
engine, _, _, _ = deepspeed.initialize(config=config, model=model)
dataloader, lr_scheduler, _ = get_dataloader_and_lr_scheduler_for_variable_batch_size_deepspeed(...)
engine.lr_scheduler = lr_scheduler
```
we'd ideally have this:
```python
engine, _, dataloader, lr_scheduler = deepspeed.initialize(config=config, model=model)
```
where `initialize` will internally call
`get_dataloader_and_lr_scheduler_for_variable_batch_size_deepspeed`.
---------
Signed-off-by: Bruno Magalhaes <bruno.magalhaes@synthesia.io>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
ZeRO3 requires explicit cleaning in tests when reusing the environment.
This PR adds `destroy` calls to the tests to free memory and avoid
potential errors due to memory leaks.
Signed-off-by: Masahiro Tanaka <mtanaka@microsoft.com>
This PR is a continuation of the efforts to improve DeepSpeed
performance when using PyTorch compile.
Dynamo breaks the graph because `flat_tensor.requires_grad = False`:
* Is a side-effecting operation on tensor metadata
* Occurs in a context where Dynamo expects static tensor properties for
tracing
The `flat_tensor.requires_grad = False` assignment is redundant and can
be safely removed because:
* the `_allgather_params()` function is already decorated with
`@torch.no_grad()`, which ensures the desired property;
* `flat_tensor` is created with `torch.empty()`, which sets
`requires_grad=False` by default.
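Both points are easy to check in isolation (a standalone illustration, not the DeepSpeed code itself):
```python
import torch

@torch.no_grad()
def make_flat_buffer(numel):
    flat_tensor = torch.empty(numel, dtype=torch.float16)
    # torch.empty() already yields requires_grad=False, and @torch.no_grad()
    # keeps subsequent ops out of autograd, so an explicit
    # `flat_tensor.requires_grad = False` would be a no-op.
    assert flat_tensor.requires_grad is False
    return flat_tensor

make_flat_buffer(16)
```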
---------
Signed-off-by: Max Kovalenko <mkovalenko@habana.ai>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Hongwei Chen <33092912+hwchen2017@users.noreply.github.com>
Enhance CI/nightly coverage for the Gaudi2 device.
Tests added:
- test_autotp_training.py
- test_ulysses.py
- test_linear::TestLoRALinear and test_linear::TestBasicLinear
- test_ctx::TestEngine
These provide coverage for the model parallelism and linear features.
The tests are stable: 10/10 runs pass.
The new tests are expected to increase CI time by 3-4 minutes and the
nightly job time by 15 minutes.
Signed-off-by: Shaik Raza Sikander <srsikander@habana.ai>
Unpin transformers version for all workflows except
`nv-torch-latest-v100` as this still has a tolerance issue with some
quantization tests.
Signed-off-by: Logan Adams <loadams@microsoft.com>
Support training multiple models, such as in
[HF](https://huggingface.co/docs/accelerate/en/usage_guides/deepspeed_multiple_model)
Here is an update on supporting multiple DS engines with a single
`loss.backward()`. The main message is that I think we can support this.
First, some context. Backward pass in ZeRO is complicated because the
optimizations/features require special handling of gradients, such as:
1. Gradient partitioning
2. Overlapping backward and reduction
3. Upcasting for fp32 grad accumulation
So, we created `engine.backward(loss)` as a wrapper function to give us
fine-grained control over backward, as below:
```python
def backward(loss):
    backward_prologue()  # setup logic for special gradient handling
    loss.backward()
    backward_epilogue()  # cleanup/teardown logic
```
As demonstrated by @muellerzr, this approach breaks down when the loss
originates from multiple DS engines. Our proposed solution is to use
backward hooks on the module to launch `backward_prologue()` and
`backward_epilogue()`, as sketched below. Specifically:
1. a backward pre-hook on `engine.module` launches `backward_prologue()`
before any module gradient is created;
2. a backward post-hook on `engine.module` launches `backward_epilogue()`
after all module gradients are created.
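A minimal sketch of that hook-based approach (hypothetical and simplified; `_backward_epilogue` is a placeholder name, and only `engine._backward_prologue()` exists in the branch linked below):
```python
import torch

def attach_backward_hooks(engine, module: torch.nn.Module):
    # Run gradient-handling setup before any gradient of `module` is
    # computed, and the cleanup after its input gradients are produced.
    def pre_hook(mod, grad_output):
        engine._backward_prologue()

    def post_hook(mod, grad_input, grad_output):
        engine._backward_epilogue()  # placeholder for the epilogue logic

    module.register_full_backward_pre_hook(pre_hook)
    module.register_full_backward_hook(post_hook)
```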
We plan for this solution to preserve BC, i.e., engine.backward() will
remain correct for single engine scenarios.
The current status is that (1) is completed, while (2) is in progress.
To unblock e2e testing for multi-engine scenarios, since there are
probably other issues, we have temporarily added
`engine._backward_prologue()`. You can try this out via the following
artifacts.
1. Simple multi-engine test code:
https://gist.github.com/tjruwase/f1adccf087b8fa269ffce2ab91c4f1c6#file-multi_engine-py
2. DS branch:
https://github.com/microsoft/DeepSpeed/tree/olruwase/zero_multi_models
---------
Signed-off-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Stas Bekman <stas00@users.noreply.github.com>
As a part of joining the Linux Foundation AI&Data it makes sense to
rename the X/Twitter accounts associated with DeepSpeed.
---------
Signed-off-by: Logan Adams <loadams@microsoft.com>
Suppose `qkv_linear_weight_shape = [in_features, out_features]`. With
the fused_qkv gemm optimization, the QKV linear weight shape becomes
`[3, in_features, out_features]`, which causes a "ValueError: too many
values to unpack (expected 2)" when printing the model.
Solution: take the last two weight dimensions as `in_features` and
`out_features`.
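The fix amounts to reading only the trailing two dimensions, e.g. (a sketch with a made-up weight tensor):
```python
import torch

# Fused-QKV layout adds a leading dimension of 3 to the weight
weight = torch.empty(3, 4096, 4096)

# Take the last two dimensions regardless of whether the weight is fused
in_features, out_features = weight.shape[-2:]
```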
Signed-off-by: Lai, Yejing <yejing.lai@intel.com>
Co-authored-by: Hongwei Chen <33092912+hwchen2017@users.noreply.github.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
This PR is a continuation of the efforts to improve DeepSpeed
performance when using PyTorch compile.
The `fetch_sub_module()` routine makes use of a `frozenset`, which is
problematic because:
1. `iter_params` returns an iterable over model parameters;
2. `frozenset` wraps this iterable, making it unmodifiable;
3. PyTorch's compilation process cannot infer how `frozenset` interacts
with tensors, leading to a graph break.
Replacing the `frozenset` with a modifiable `set` removes this graph
break.
Signed-off-by: Max Kovalenko <mkovalenko@habana.ai>
Co-authored-by: Masahiro Tanaka <81312776+tohtana@users.noreply.github.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Running pre-commit on all files may change files that are not part of
the current branch.
The updated script runs pre-commit only on the files that have changed
in the current branch.
Signed-off-by: Hongwei <hongweichen@microsoft.com>
This PR is a continuation of the efforts to improve DeepSpeed
performance when using PyTorch compile.
The `instrument_w_nvtx` decorator is used to instrument code with NVIDIA
Tools Extension (NVTX) markers for profiling and visualizing code
execution on GPUs.
Along with executing the function itself, `instrument_w_nvtx` makes
calls to `nvtx.range_push` and `nvtx.range_pop` which can't be traced by
Dynamo.
That's why this decorator causes a graph break.
The impact on performance can be significant due to numerous uses of the
decorator throughout the code.
We propose a simple solution: Don't invoke the sourceless functions when
torch is compiling.
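A sketch of that idea (simplified, assuming a recent torch with `torch.compiler.is_compiling()`; the real decorator lives in DeepSpeed's utils and may differ):
```python
from functools import wraps

import torch


def instrument_w_nvtx(func):
    """Wrap `func` in an NVTX range, but skip the untraceable
    range_push/range_pop calls while torch.compile is tracing."""

    @wraps(func)
    def wrapped(*args, **kwargs):
        if torch.compiler.is_compiling():
            return func(*args, **kwargs)
        torch.cuda.nvtx.range_push(func.__qualname__)
        try:
            return func(*args, **kwargs)
        finally:
            torch.cuda.nvtx.range_pop()

    return wrapped
```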
---------
Signed-off-by: Max Kovalenko <mkovalenko@habana.ai>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>