**Auto-generated PR to update version.txt after a DeepSpeed release**
Released version - 0.16.7
Author - @loadams
Co-authored-by: loadams <loadams@users.noreply.github.com>
When `with_cuda` is left as the default `None`, Torch decides whether to
add CUDA headers and libraries to the build (and whether to hipify the
JIT C++ extension) based on the presence of `.cu` or `.cuh` files in
`sources`.
2a909cab16/torch/utils/cpp_extension.py (L1623-L1627)
However, some ops, such as DeepCompile, have no `.cu` or `.cuh` files in
their sources yet still need to be hipified on AMD, because their C++
code includes several CUDA headers. It is therefore better to control
this behavior explicitly whenever the build is not `build_for_cpu`;
otherwise the hipify step will get skipped.
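As a rough sketch of the idea (hypothetical loader code, not the exact `op_builder` change), the builder can set `with_cuda` explicitly instead of relying on the `.cu`/`.cuh` detection:
```python
from torch.utils.cpp_extension import load

def load_deepcompile_op(sources, build_for_cpu=False):
    # `sources` may contain only .cpp files that include CUDA headers,
    # so don't let torch infer CUDA/HIP support from the file extensions.
    return load(
        name="deepcompile_op",
        sources=sources,
        with_cuda=not build_for_cpu,  # force hipify on ROCm builds
        verbose=True,
    )
```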
Signed-off-by: Hollow Man <hollowman@opensuse.org>
Similar to #7211
When the optimizer is not specified (e.g. for ZeRO-3 pure inference),
the optimizer will be of type `DeepSpeedZeRoOffload` instead of
`DeepSpeedZeroOptimizer_Stage3`, and `DeepSpeedZeRoOffload` has no
`parameter_offload` attribute.
56005d2b25/deepspeed/runtime/engine.py (L1684-L1707)
```log
File "deepspeed/runtime/engine.py", line 3919, in compile
backend = init_z3(self, backend, compile_config, compile_kwargs, schedule)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "deepspeed/compile/init_z3.py", line 36, in init_z3
optimizer.parameter_offload._remove_module_hooks()
^^^^^^^^^^^^^^^^^^^^^^^^^^^
AttributeError: 'DeepSpeedZeRoOffload' object has no attribute 'parameter_offload'
```
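A minimal sketch of the kind of guard that avoids this (illustrative only, not necessarily the exact change in this PR), assuming `DeepSpeedZeRoOffload` itself owns the module hooks while `DeepSpeedZeroOptimizer_Stage3` nests them under `parameter_offload`:
```python
# In ZeRO-3 training, `optimizer` is a DeepSpeedZeroOptimizer_Stage3 whose
# `parameter_offload` attribute is a DeepSpeedZeRoOffload; in pure inference
# the engine hands us the DeepSpeedZeRoOffload directly.
offload = getattr(optimizer, "parameter_offload", optimizer)
offload._remove_module_hooks()
```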
---------
Signed-off-by: Hollow Man <hollowman@opensuse.org>
Signed-off-by: Logan Adams <loadams@microsoft.com>
Co-authored-by: Masahiro Tanaka <81312776+tohtana@users.noreply.github.com>
Co-authored-by: Logan Adams <loadams@microsoft.com>
We should use `torch.utils.cpp_extension.ROCM_HOME` for ROCm pytorch.
```log
Traceback (most recent call last):
File "<string>", line 2, in <module>
File "<pip-setuptools-caller>", line 34, in <module>
File "DeepSpeed/setup.py", line 195, in <module>
builder.hipify_extension()
File "DeepSpeed/op_builder/builder.py", line 750, in hipify_extension
header_include_dirs=self.include_paths(),
^^^^^^^^^^^^^^^^^^^^
File "DeepSpeed/op_builder/dc.py", line 32, in include_paths
return ['csrc/includes', os.path.join(torch.utils.cpp_extension.CUDA_HOME, "include")]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "<frozen posixpath>", line 76, in join
TypeError: expected str, bytes or os.PathLike object, not NoneType
```
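A minimal sketch of the intended selection (assuming the builder already knows whether it targets ROCm):
```python
import os
from torch.utils.cpp_extension import CUDA_HOME, ROCM_HOME

def include_paths(is_rocm_pytorch: bool):
    # CUDA_HOME is None on ROCm builds, so fall back to ROCM_HOME there.
    gpu_home = ROCM_HOME if is_rocm_pytorch else CUDA_HOME
    return ['csrc/includes', os.path.join(gpu_home, "include")]
```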
Signed-off-by: Hollow Man <hollowman@opensuse.org>
This PR adds a missing line for scheduling in the Z3 pass and fixes
attribute names in the profiler.
Signed-off-by: Masahiro Tanaka <mtanaka@microsoft.com>
**Auto-generated PR to update version.txt after a DeepSpeed release**
Released version - 0.16.6
Author - @tohtana
Co-authored-by: tohtana <tohtana@users.noreply.github.com>
This PR introduces *DeepCompile*, a new feature that efficiently
integrates compiler optimizations with other DeepSpeed features.
DeepCompile utilizes torch's dynamo to capture the computation graph and
modifies it to incorporate DeepSpeed’s optimizations seamlessly.
Currently, DeepCompile supports ZeRO-1 and ZeRO-3, with enhancements
such as proactive prefetching and selective unsharding to improve
performance.
(More details will be added later.)
---------
Signed-off-by: Masahiro Tanaka <mtanaka@microsoft.com>
Signed-off-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: zafarsadiq <zafarsadiq120@gmail.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
When the optimizer is not specified (e.g. for ZeRO-3 pure inference),
the optimizer will be of type `DeepSpeedZeRoOffload` instead of
`DeepSpeedZeroOptimizer_Stage3`, and `DeepSpeedZeRoOffload` does not
implement the methods `reload_states` and `offload_states`.
56005d2b25/deepspeed/runtime/engine.py (L1684-L1707)
```log
File "deepspeed/runtime/engine.py", line 3904, in offload_states
self.optimizer.offload_states(include=include, device=device, pin_memory=pin_memory, non_blocking=non_blocking)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AttributeError: 'DeepSpeedZeRoOffload' object has no attribute 'offload_states'
```
In addition, https://github.com/deepspeedai/DeepSpeed/pull/6855 appears
to have forgotten to remove the `assert not self.zero_offload_param()`
check. As suggested in
https://github.com/deepspeedai/DeepSpeed/issues/6833#issuecomment-2537295310,
that call returns None when `offload_param` is not given, and the newly
added assertions already cover these cases, so this PR also removes the
old check.
Signed-off-by: Hollow Man <hollowman@opensuse.org>
I want to reuse a composed module in the pipeline. For example, the
following `MyModule` has a member `linear`, which is also a module.
```python
class MyModule(torch.nn.Module):
    def __init__(self, n_in: int, n_out: int):
        super().__init__()
        self.linear = torch.nn.Linear(n_in, n_out)
        self.layer_norm = torch.nn.LayerNorm(n_out)

    def forward(self, data: torch.Tensor) -> torch.Tensor:
        hidden = self.linear(data)
        hidden = self.layer_norm(hidden)
        return hidden
```
`MyModule.linear.weight` should be synchronized among related ranks. As
a result, I add `linear.weight` to `TiedLayerSpec.tied_weight_attr`.
BTW, I generate the whole `tied_weight_attr` by the following
instruction.
```python
tied_weight_attr = [name for name, p in layer.named_parameters() if p.numel() > 1]
```
However, the builtin `getattr` used by `PipelineModule` fails to find a
nested attribute like `linear.weight`.
Hence, this PR extends the builtin `getattr` to a recursive version,
`PipelineModule._recursive_getattr`, which resolves each attribute
segment one by one.
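The idea can be sketched as follows (simplified; the actual method lives on `PipelineModule` and may differ in detail):
```python
def _recursive_getattr(module, attr_path: str):
    # Resolve a dotted attribute path such as "linear.weight"
    # by walking one segment at a time.
    obj = module
    for segment in attr_path.split("."):
        obj = getattr(obj, segment)
    return obj

# _recursive_getattr(my_module, "linear.weight") returns the nested Parameter
```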
Meanwhile, the order of tied weights matters for synchronization, so
this PR also sorts the tie keys in `PipelineModule._index_tied_modules`
to avoid hangs.
Signed-off-by: Mingjie Li <limingjie@chinamobile.com>
Co-authored-by: Mingjie Li <limingjie@chinamobile.com>
Co-authored-by: Masahiro Tanaka <81312776+tohtana@users.noreply.github.com>
Warning:
```
site-packages/deepspeed/runtime/config_utils.py:64: PydanticDeprecatedSince211: Accessing this attribute on the instance is deprecated, and will be removed in Pydantic V3. Instead, you should access this attribute from the model class. Deprecated in Pydantic V2.11 to be removed in V3.0.
kwargs = pydantic_config.model_fields[dep_field].json_schema_extra
```
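The fix follows Pydantic's guidance: read `model_fields` from the model class rather than from the instance. A self-contained illustration (the model below is hypothetical, not DeepSpeed's actual config class):
```python
from pydantic import BaseModel, Field

class ExampleConfig(BaseModel):
    old_field: int = Field(0, json_schema_extra={"new_param": "new_field"})

cfg = ExampleConfig()

# Deprecated since Pydantic V2.11: accessing model_fields on the instance
# kwargs = cfg.model_fields["old_field"].json_schema_extra

# Preferred: access it on the model class
kwargs = type(cfg).model_fields["old_field"].json_schema_extra
```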
Signed-off-by: Logan Adams <loadams@microsoft.com>
1. Add an implementation of cross-layer communication overlapping so
that communication becomes effectively "free".
2. Optimize the implementation of communication overlapping within a
transformer layer.
Signed-off-by: Hongwei Chen <hongweichen@microsoft.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
**Auto-generated PR to update version.txt after a DeepSpeed release**
Released version - 0.16.5
Author - @loadams
Co-authored-by: loadams <loadams@users.noreply.github.com>
# Background and rationale
In many use cases, particularly LLMs, one is faced with inputs
(sentences) of variable lengths. A common practice is to pack batches by
token count rather than by a fixed batch size, i.e. by putting together
sentences whose given metric (e.g. sequence length) adds up to a
user-provided value. As an example, in [Attention is all you
need](https://arxiv.org/abs/1706.03762), section 5.1:
> Sentence pairs were batched together by approximate sequence length. Each training
> batch contained a set of sentence pairs containing approximately 25000 source tokens
> and 25000 target tokens.
Dynamic batch sizes have been requested in [DeepSpeed issue
1051](https://github.com/microsoft/DeepSpeed/issues/1051), [DeepSpeed
issue 3455](https://github.com/microsoft/DeepSpeed/issues/3455),
[Pytorch Lightning issue
16914](https://github.com/Lightning-AI/pytorch-lightning/issues/16914),
[huggingface issue
2647](https://github.com/huggingface/accelerate/issues/2647) and is
available already in many libraries e.g. [NVIDIA
Triton](https://github.com/triton-inference-server/server/blob/main/docs/user_guide/model_configuration.md#dynamic-batcher)
and [Meta FairSeq](https://github.com/facebookresearch/fairseq)
(implementation
[here](34973a94d0/fairseq/data/fairseq_dataset.py (L104))
).
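As a rough illustration of the packing idea (not DeepSpeed's actual implementation), a greedy packer by token count looks like this:
```python
def pack_by_token_count(seqlens, max_tokens):
    """Greedily group sample indices so each batch stays under max_tokens."""
    batches, current, current_tokens = [], [], 0
    for idx, seqlen in enumerate(seqlens):
        if current and current_tokens + seqlen > max_tokens:
            batches.append(current)
            current, current_tokens = [], 0
        current.append(idx)
        current_tokens += seqlen
    if current:
        batches.append(current)
    return batches

# pack_by_token_count([5, 7, 20, 3], max_tokens=30) -> [[0, 1], [2, 3]]
```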
The immediate use case for this is when one needs to maximize GPU
utilization. Moreover, this is particularly relevant for curriculum
learning where a `BxTxE` (Batch x Time x Embedding) -shaped input should
ideally have high `B` and low `T` at the early curriculum steps (many
short sentences packed together as a batch), and low `B` and high `T` at
the late steps (few long sentences in the batch). A dynamic size `T` is
already supported by DeepSpeed, e.g. in the documentation for pipeline
parallelism's
[reset_activation_shape()](https://deepspeed.readthedocs.io/en/stable/pipeline.html#deepspeed.runtime.pipe.engine.PipelineEngine.reset_activation_shape):
> For curriculum learning that changes the seqlen of each sample, we
> need to call this whenever the seqlen is going to change.
However, a dynamic `B` is not supported, and it would require a
corresponding increase/decrease of the learning rate. This technique has
been applied before, and the two most common LR scaling algorithms have
been described as:
1. Linear Scaling Rule: "When the minibatch size is multiplied by k,
multiply the learning rate by k", as in [Accurate, Large Minibatch SGD:
Training ImageNet in 1 Hour, Goyal et
al.](https://arxiv.org/abs/1706.02677)
2. Square Root scaling: "when multiplying the batch size by k, multiply
the learning rate by √k, to keep the variance in the gradient
expectation constant" by [One weird trick for parallelizing
convolutional neural networks, A. Krizhevsky et
al.](https://arxiv.org/abs/1404.5997)
In practice, the user picks the total token count per batch as the
metric that drives batching, instead of batching by sentence count.
At runtime, the variable batch size is computed and the LR is adjusted
accordingly, based on the reference LR and batch size provided in the
config.
# Illustration of dynamic batch size, sequence length and LR
Imagine we picked a limit of `30` tokens per batch, and have set a
reference `lr=1e-3` for a `train_batch_size=2` (in the DeepSpeed
config). The batching algorithm for curriculum learning may pack the
data into batches of short sentences (left) at the early stages, and
batches of long sentences (right) at later stages, e.g.:

Above, we collected samples until we filled up the batch with at most 30
tokens. The batch sizes (number of samples) then became `10` and `4` for
the left and right examples, respectively. Using the linear scaling
rule, the LRs for those batches become `5e-3` and `2e-3`.
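The arithmetic above can be written out as a small sketch (illustrative only, not the signature of DeepSpeed's `scale_lr`):
```python
def scale_lr(ref_lr, ref_batch_size, batch_size, method="linear"):
    # Scale the reference LR to the effective batch size of each batch.
    if method == "linear":
        return ref_lr * batch_size / ref_batch_size
    if method == "sqrt":
        return ref_lr * (batch_size / ref_batch_size) ** 0.5
    return ref_lr  # no scaling

print(scale_lr(1e-3, 2, 10))  # 0.005 -> the batch of 10 short sentences
print(scale_lr(1e-3, 2, 4))   # 0.002 -> the batch of 4 long sentences
```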
# Pipeline parallelism
Pipeline parallelism requires the same batch size and same sequence
length across all micro-batches in a batch, as the activation sizes must
be fixed between gradient accumulation steps. Between batches, these may
change, as long as `engine.reset_activation_shape()` is called so that
the new shapes are communicated on the first gradient accumulation step
of the batch. Enforcing the same `BxTxE` across the micro-batches of a
batch may lead to smaller micro-batches. As an example, below is an
illustration of a 2-node, 2-gradient-accumulation-step (i.e. 4 micro-batches) batching
for the same dataset, when preparing data for the regular DDP (left) and
for the pipeline parallelism use cases (right):

We can see that the pipeline use case (right) has the same `BxTxE` shape
across all 4 micro-batches of the same batch and, in order to respect
that, it packs fewer samples per batch than the standard use case
(left-hand side).
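In code, the shape reset between batches looks roughly like this (a sketch assuming a pipeline `engine` and a dataloader yielding one variable-shape batch of micro-batches per step):
```python
# Shapes are fixed across the micro-batches of one batch, but may change
# between batches, so the activation shapes must be re-communicated.
for step, batch in enumerate(dataloader):
    engine.reset_activation_shape()   # allow a new B and T for this batch
    loss = engine.train_batch(data_iter=iter(batch))
```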
# Attention Head
For an input of size `BxTxE`, the attention mask has shape `TxT` when a
single fixed-size mask is shared across samples of the same length, or
`BxTxT` when each sample needs its own mask (samples of different
lengths, as in the dataset above). This 3D attention matrix can be
illustrated for DDP micro-batch 1 (top-left in the picture above, 4
sentences) as:

Note the memory savings: the attention head has a size of `BxTxT`, i.e.
a linear memory dependency on the batch size `B` and quadratic memory
dependency on the largest sequence length `T` in the (micro-) batch.
Thus, supporting a dynamic size `T` allows for an increase of `B`.
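For reference, such a per-sample `BxTxT` mask can be built from the individual sequence lengths along these lines (a standalone sketch, independent of the example code in this PR):
```python
import torch

def build_attention_mask(seqlens, max_len):
    # valid[b, t] is True for real tokens and False for padding
    valid = torch.arange(max_len)[None, :] < torch.tensor(seqlens)[:, None]
    # position (b, i, j) is attendable iff both token i and token j are real
    return valid[:, :, None] & valid[:, None, :]

mask = build_attention_mask([3, 5], max_len=5)  # shape (2, 5, 5)
```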
# PR overview
This PR implements dynamic batching and LR scaling. The required
dataloader and LR scheduler can be retrieved by calling
`get_dataloader_and_lr_scheduler_for_variable_batch_size`. A brief
explanation of that function follows:
- The logic behind the algorithms for LR scaling is in `scale_lr`;
- The partitioning of samples into batches is done by `batch_by_seqlen`.
- For pipeline parallelism, it is required that all micro-batches in a
pipeline pass have the same activation shapes. This is enabled by
setting the following parameters to `True`:
- `required_microbatches_of_same_sizes` that will force the `B`
dimension to be the same across all gradient accumulation steps of all
dataloaders on a batch;
- `required_microbatches_of_same_lengths` that will force the `T`
dimension to be the same across all gradient accumulation steps. Works
by calling the user-provided `sample_padding_fn(sentence, len)` that
pads a given sentence to the argument length;
- `batch_by_seqlen` returns `microbatch_sample_ids` (the list of sample
ids per micro-batch), `batch_sizes` (the effective batch sizes), and
`batch_max_seqlens` (the longest sequence across all micro-batches in a
batch);
- `dataloader_for_variable_batch_size` relies on `microbatch_sample_ids`
and will iterate/collate/pad samples for every batch and return a
dataloader that iterates the final (variable-size) batches;
- `lr_scheduler_for_variable_batch_size` relies on `batch_sizes` to
compute the learning rate for each effective batch: it takes the
reference batch size and LR from the config file and scales the LR of
each effective batch according to its size, using one of the scaling
rules mentioned above (linear, square root, etc.).
- A special note on the returned `lr_scheduler`, which accepts either:
1. a user-provided `Optimizer`, in which case it scales the learning
rates (in the param groups) at every batch, or
2. a user-defined `LRScheduler`, in which case it first gets the
learning rate from the scheduler and then scales it accordingly.
# Example
An example for the use case with and without pipelining is provided in
file
[`DeepSpeedExamples/training/data_efficiency/variable_batch_size_and_lr/variable_batch_size_and_lr_example.py`](https://github.com/deepspeedai/DeepSpeedExamples/tree/master/training/data_efficiency/variable_batch_size_and_lr).
The example shows an attention head with a variable-sized `BxTxT`
attention per batch, followed by a fixed-size feed-forward network.
These are the main blocks in a Large Language Model. The feed-forward (or
linear layer) that follows the attention head requires a constant input
size, equivalent to the largest sentence in the whole dataset, so the
output of the attention must be padded (see `feedforward: needs to
convert BxTxE to BxMxE by padding extra tokens` in the code).
# Config
The example file also documents the relevant DeepSpeed config entries
with inline comments:
```python
config = {
    "train_batch_size": 16,
    # `train_micro_batch_size_per_gpu` tells how many sequence packs of `max_tokens` each will be collated together.
    # I.e. the number of tokens per micro-batch (i.e. per gpu iteration) is `train_micro_batch_size_per_gpu`*`max_tokens`.
    "train_micro_batch_size_per_gpu": 2,
    "data_efficiency": {
        "enabled": True,
        # seed to be applied to all data efficiency modules, including dynamic batching
        "seed": 42,
        "data_sampling": {
            "num_workers": 0,  # dataloader num_workers argument
            "pin_memory": False,  # dataloader pin_memory argument
            "dynamic_batching": {
                # enables or disables dynamic batching
                "enabled": True,
                # how many tokens we need to fill a pack of sequences (that will be collated together as a sample)
                "max_tokens": 100,
                # path to read the length of every sequence from, or to write it to.
                # Sequence lengths will be loaded from: {metrics_path}/seqlen/seqlen_sample_to_metric.bin and *.idx
                # If the files don't exist, they'll be computed and saved on the first run, and loaded on subsequent runs.
                "metrics_path": "./curriculum_output/",
                # As the batch size increases/decreases, which method should scale the LR accordingly?
                # Options: linear, sqrt (square root), or None to disable
                "lr_scaling_method": "linear",
                # how to pick sentences to be packed into samples:
                # - dataloader: in the same order as they come in with the dataloader
                # - seqlen: by sequence length (shortest to longest)
                # - random: random order using the seed in config['data_efficiency']['seed']
                "sentence_picking_order": "dataloader",  # "random" / "seqlen" / "dataloader"
                # minimum number of sequences required to reach `max_tokens`. If a sentence pack is smaller, it's discarded.
                "min_batch_size": 1,
                # maximum number of sequences allowed in a pack of `max_tokens`. If a sentence pack is larger, it's discarded.
                "max_batch_size": 10,
                # enable the output of micro-batching information about sentence packing
                "verbose": True,
            },
        },
    },
}
```
# Future work
A follow-up PR will enable dynamic batching when calling
`deepspeed.initialize`. I.e. instead of this:
```python
engine, _, _, _ = deepspeed.initialize(config=config, model=model)
dataloader, lr_scheduler, _ = get_dataloader_and_lr_scheduler_for_variable_batch_size_deepspeed(...)
engine.lr_scheduler = lr_scheduler
```
we'd ideally have this:
```python
engine, _, dataloader, lr_scheduler = deepspeed.initialize(config=config, model=model)
```
where `initialize` will internally call
`get_dataloader_and_lr_scheduler_for_variable_batch_size_deepspeed`.
---------
Signed-off-by: Bruno Magalhaes <bruno.magalhaes@synthesia.io>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
ZeRO3 requires explicit cleaning in tests when reusing the environment.
This PR adds `destroy` calls to the tests to free memory and avoid
potential errors due to memory leaks.
Signed-off-by: Masahiro Tanaka <mtanaka@microsoft.com>
This PR is a continuation of the efforts to improve DeepSpeed
performance when using PyTorch compile.
Dynamo breaks the graph because `flat_tensor.requires_grad = False`:
* Is a side-effecting operation on tensor metadata
* Occurs in a context where Dynamo expects static tensor properties for
tracing
The `flat_tensor.requires_grad = False` assignment is redundant and can
be safely removed because:
* the `_allgather_params()` function is already decorated with
`@torch.no_grad()`, which ensures the desired property;
* `flat_tensor` is created with `torch.empty()`, which sets
`requires_grad=False` by default.
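Both points are easy to check in isolation (a standalone illustration, not the DeepSpeed code itself):
```python
import torch

@torch.no_grad()
def make_flat_buffer(numel):
    flat_tensor = torch.empty(numel, dtype=torch.float16)
    # torch.empty() already yields requires_grad=False, and @torch.no_grad()
    # keeps subsequent ops out of autograd, so an explicit
    # `flat_tensor.requires_grad = False` would be a no-op.
    assert flat_tensor.requires_grad is False
    return flat_tensor

make_flat_buffer(16)
```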
---------
Signed-off-by: Max Kovalenko <mkovalenko@habana.ai>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Hongwei Chen <33092912+hwchen2017@users.noreply.github.com>
Enhance CI/nightly coverage for the Gaudi2 device.
Tests added:
- test_autotp_training.py
- test_ulysses.py
- test_linear::TestLoRALinear and test_linear::TestBasicLinear
- test_ctx::TestEngine
These provide coverage for the model parallelism and linear features.
The tests are stable: 10/10 runs pass.
The new tests are expected to increase CI time by 3-4 minutes and the
nightly job time by 15 minutes.
Signed-off-by: Shaik Raza Sikander <srsikander@habana.ai>
Unpin transformers version for all workflows except
`nv-torch-latest-v100` as this still has a tolerance issue with some
quantization tests.
Signed-off-by: Logan Adams <loadams@microsoft.com>
Support training multiple models, such as in
[HF](https://huggingface.co/docs/accelerate/en/usage_guides/deepspeed_multiple_model)
Here is an update on supporting multiple DS engines with a single
`loss.backward()`. The main message is that I think we can support this.
First, some context. Backward pass in ZeRO is complicated because the
optimizations/features require special handling of gradients, such as:
1. Gradient partitioning
2. Overlapping backward and reduction
3. Upcasting for fp32 grad accumulation
So, we created `engine.backward(loss)` as a wrapper function to give us
fine-grained control over backward, as below:
```python
def backward(loss):
    backward_prologue()  # setup logic for special gradient handling
    loss.backward()
    backward_epilogue()  # cleanup/teardown logic
```
As demonstrated by @muellerzr, this approach breaks down when the loss
originates from multiple DS engines. Our proposed solution is to use
backward hooks on the module to launch `backward_prologue()` and
`backward_epilogue()`, as sketched below. Specifically:
1. a backward pre-hook on `engine.module` launches `backward_prologue()`
before any module gradient is created;
2. a backward post-hook on `engine.module` launches `backward_epilogue()`
after all module gradients are created.
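A minimal sketch of that hook-based approach (hypothetical and simplified; `_backward_epilogue` is a placeholder name, and only `engine._backward_prologue()` exists in the branch linked below):
```python
import torch

def attach_backward_hooks(engine, module: torch.nn.Module):
    # Run gradient-handling setup before any gradient of `module` is
    # computed, and the cleanup after its input gradients are produced.
    def pre_hook(mod, grad_output):
        engine._backward_prologue()

    def post_hook(mod, grad_input, grad_output):
        engine._backward_epilogue()  # placeholder for the epilogue logic

    module.register_full_backward_pre_hook(pre_hook)
    module.register_full_backward_hook(post_hook)
```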
We plan for this solution to preserve BC, i.e., engine.backward() will
remain correct for single engine scenarios.
The current status is that (1) is completed, while (2) is in progress.
To unblock e2e testing for multi-engine scenarios, since there are
probably other issues, we have temporarily added
`engine._backward_prologue()`. You can try this out via the following
artifacts.
1. Simple multi-engine test code:
https://gist.github.com/tjruwase/f1adccf087b8fa269ffce2ab91c4f1c6#file-multi_engine-py
2. DS branch:
https://github.com/microsoft/DeepSpeed/tree/olruwase/zero_multi_models
---------
Signed-off-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Stas Bekman <stas00@users.noreply.github.com>
As a part of joining the Linux Foundation AI&Data it makes sense to
rename the X/Twitter accounts associated with DeepSpeed.
---------
Signed-off-by: Logan Adams <loadams@microsoft.com>
Suppose `qkv_linear_weight_shape = [in_features, out_features]`. With
the fused_qkv gemm optimization, the QKV linear weight shape becomes
`[3, in_features, out_features]`, which causes a "ValueError: too many
values to unpack (expected 2)" when printing the model.
Solution: take the last two weight dimensions as `in_features` and
`out_features`.
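The fix amounts to reading only the trailing two dimensions, e.g. (a sketch with a made-up weight tensor):
```python
import torch

# Fused-QKV layout adds a leading dimension of 3 to the weight
weight = torch.empty(3, 4096, 4096)

# Take the last two dimensions regardless of whether the weight is fused
in_features, out_features = weight.shape[-2:]
```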
Signed-off-by: Lai, Yejing <yejing.lai@intel.com>
Co-authored-by: Hongwei Chen <33092912+hwchen2017@users.noreply.github.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
This PR is a continuation of the efforts to improve DeepSpeed
performance when using PyTorch compile.
The `fetch_sub_module()` routine makes use of a `frozenset`, which is
problematic because:
1. `iter_params` returns an iterable over model parameters;
2. `frozenset` wraps this iterable, making it unmodifiable;
3. PyTorch's compilation process cannot infer how `frozenset` interacts
with tensors, leading to a graph break.
Replacing the `frozenset` with a modifiable `set` removes this graph
break.
Signed-off-by: Max Kovalenko <mkovalenko@habana.ai>
Co-authored-by: Masahiro Tanaka <81312776+tohtana@users.noreply.github.com>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Running pre-commit on all files may change files that are not part of
the current branch.
The updated script runs pre-commit only on the files that have changed
in the current branch.
Signed-off-by: Hongwei <hongweichen@microsoft.com>
This PR is a continuation of the efforts to improve DeepSpeed
performance when using PyTorch compile.
The `instrument_w_nvtx` decorator is used to instrument code with NVIDIA
Tools Extension (NVTX) markers for profiling and visualizing code
execution on GPUs.
Along with executing the function itself, `instrument_w_nvtx` makes
calls to `nvtx.range_push` and `nvtx.range_pop` which can't be traced by
Dynamo.
That's why this decorator causes a graph break.
The impact on performance can be significant due to numerous uses of the
decorator throughout the code.
We propose a simple solution: Don't invoke the sourceless functions when
torch is compiling.
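A sketch of that idea (simplified, assuming a recent torch with `torch.compiler.is_compiling()`; the real decorator lives in DeepSpeed's utils and may differ):
```python
from functools import wraps

import torch


def instrument_w_nvtx(func):
    """Wrap `func` in an NVTX range, but skip the untraceable
    range_push/range_pop calls while torch.compile is tracing."""

    @wraps(func)
    def wrapped(*args, **kwargs):
        if torch.compiler.is_compiling():
            return func(*args, **kwargs)
        torch.cuda.nvtx.range_push(func.__qualname__)
        try:
            return func(*args, **kwargs)
        finally:
            torch.cuda.nvtx.range_pop()

    return wrapped
```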
---------
Signed-off-by: Max Kovalenko <mkovalenko@habana.ai>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>