1531 Commits

Author SHA1 Message Date
407708cdb6 add support for tensor learning rate (vs scalar) (#7633)
This change enables using a tensor learning rate value instead of a
scalar one.
We found this helpful in cases where the optimizer is torch.compiled (in
such cases changing the scalar LR value could cause recompilation
degrading the performance).
The implementation allows the model script to determine the type of LR
value used by setting the initial value.

Co-authored-by: Masahiro Tanaka <81312776+tohtana@users.noreply.github.com>
2025-10-20 05:32:29 +00:00
2734a6a15f Take **kwargs in __init__ of DeepSpeedZeroOptimizer subclasses (#7634)
DeepSpeedZeroOptimizer provides a rich, evolving list of keyword
arguments. It is tedious and error-prone to list all of them in its
subclasses. As an example, the recent introduction of zenflow_config in
the middle of that list has caused unit test failures (e.g.
https://github.com/deepspeedai/DeepSpeed/actions/runs/18560070656/job/52906645682?pr=7633)

Convert the keyword argument list in DeepSpeedZeroOptimizer subclasses
to **kwargs for the consistency of configurable items and their default
values. Passing an unknown parameter to such subclasses will now raise
an error on their call to DeepSpeedZeroOptimizer.__init__() instead of
their own __init__(). This still ensures that typos in such parameters
fail early.
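
The forwarding pattern described above can be sketched as follows. This is a minimal illustration, not the DeepSpeed code: `BaseOptimizer`, `SubOptimizer` and the keyword names are hypothetical stand-ins for DeepSpeedZeroOptimizer and its subclasses.

```python
class BaseOptimizer:
    """Hypothetical stand-in for a base class with a long keyword list."""

    def __init__(self, params, lr=1e-3, overlap_comm=False, zenflow_config=None):
        self.params = params
        self.lr = lr
        self.overlap_comm = overlap_comm
        self.zenflow_config = zenflow_config


class SubOptimizer(BaseOptimizer):
    """Subclass that forwards keyword arguments instead of re-listing them."""

    def __init__(self, params, **kwargs):
        # Unknown keywords are rejected by BaseOptimizer.__init__, not here.
        super().__init__(params, **kwargs)


opt = SubOptimizer([1, 2, 3], lr=0.01)

try:
    SubOptimizer([1, 2, 3], overlap_com=True)  # typo: missing the final "m"
except TypeError as exc:
    typo_error = str(exc)
```

The subclass never needs updating when the base class gains a keyword, and a misspelled keyword still fails at construction time.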

Signed-off-by: Junjie Mao <junjie.mao@linux.alibaba.com>
2025-10-19 22:07:48 -07:00
7cb1b88ec4 Add ZenFlow code for Stage 3 (#7516)
This PR completes the ZenFlow integration for DeepSpeed ZeRO Stage 3. 

Highlights:

- ZenFlowSelectiveAdamW_stage3: Optimizer with importance-aware
selective parameter updates for ZeRO Stage 3.
- ZenFlowZeroOptimizer_Stage3: Full Stage 3 optimizer integration with
partitioned parameters and CPU offload.
- Configurable via ZenFlowConfig, fully integrated with
DeepSpeedZeroConfig for Stage 3.
- Unit tests for Stage 3 cases ensuring correctness and compatibility.

Note: Integration with ZeRO Stage 1&2 was introduced in #7391

---------

Signed-off-by: Yusen Wu <xrn4ub@virginia.edu>
Co-authored-by: Ma, Guokai <guokai.ma@intel.com>
Co-authored-by: Olatunji Ruwase <tunji.ruwase@snowflake.com>
Co-authored-by: Tingfeng Lan <erc8gx@virginia.edu>
2025-10-13 12:19:18 -04:00
1b08325da3 [TiledMLP] moe support (#7622)
MoE routers seem to drop the `bs` dimension in `x`, so the `[bs, seqlen,
hidden_size]` shape is no longer expected. Support that use case.

Signed-off-by: Stas Bekman <stas@stason.org>
2025-10-07 13:33:34 +00:00
71d077da73 Enable grad scaler for ZeRO-0 + torch.autocast path (#7619)
Currently, the DeepSpeed engine does not enable the grad scaler for the
ZeRO-0 and `torch.autocast` path, even when dtype is set to `fp16`. This
leads to errors in tests when we replace our hard-coded tolerances with
PyTorch’s [standard
tolerances](https://docs.pytorch.org/docs/stable/testing.html#torch.testing.assert_close)
(Thank you @stas00 for your suggestion regarding the previous PR).

This PR enables the grad scaler for this path to improve accuracy, and
refactors the tests to simplify validation by using
`torch.testing.assert_close`. The tests now rely on PyTorch’s standard
(and stricter) tolerances, and they still pass.

---------

Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
Co-authored-by: Olatunji Ruwase <tunji.ruwase@snowflake.com>
2025-10-04 13:21:08 +00:00
4eb37729de add print_dist util (#7621)
a refactor follow up to
https://github.com/deepspeedai/DeepSpeed/pull/7617 as suggested by
@tohtana to create 2 independent utils, sharing the main logic via
another util.

Signed-off-by: Stas Bekman <stas@stason.org>
2025-10-03 19:30:26 -07:00
7d9a2f2bf3 Improve leaf module interface (enable via config, relax matching criteria, add document, etc.) (#7604)
This PR improves the usability of the leaf module feature.

Here are the changes:
- Allow enabling the leaf module via both the DeepSpeed config and APIs.
- Relax matching criteria to support class-based matching.
- Support multiple ways of specifying the target module: class, class
name (with or without package name), module name, or suffix.
- Add documentation to the training guide, including config snippets and
explanations of default behavior.
- Add default classes (e.g., Mixtral, Qwen2/Qwen3) that automatically
enable the leaf module feature. (Welcoming requests to add more classes)

---------

Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
Co-authored-by: Olatunji Ruwase <tunji.ruwase@snowflake.com>
2025-10-03 09:45:28 +00:00
2a76988958 DeepCompile: Use min_cut_rematerialization for partitioning joint graphs (#7609)
# Motivation

PyTorch provides `min_cut_rematerialization_partition()` to partition a
joint graph while respecting recomputation annotation. That algorithm
forms a data-flow-like graph from the joint graph, assigns edge weights
based on recomputation-cost heuristics and applies the min-cut
algorithm to determine which nodes to recompute. Users can force
recomputation of a node by annotating its `node.meta["recompute"]` to
MUST_RECOMPUTE or PREFER_RECOMPUTE, as is implemented in [1].

While originally designed for activation checkpointing,
min_cut_rematerialization can also be used to recompute param aliases.
When partitioning a joint graph, we don't want to save for backward the
gathered parameters and values computed from them via aliasing ops, as
that essentially means the gathered parameter will be saved. Instead of
customizing the partitioner or patching `choose_saved_values_set`, we
can achieve that by annotating such nodes to be MUST_RECOMPUTE.

Both eager and inductor backends can use min_cut_rematerialization
easily. The eager backend can use min-cut by customizing the
partition_fn for `aot_module_simplified`, and is already using that for
graphs with activation checkpointing enabled. The inductor backend has
used that algorithm since torch 2.0.0 [2], and it is still the default
after the inductor partitioner was made configurable a few weeks ago [3].

That approach also helps DeepCompile + torch autocast nicely. When
autocast is enabled, downcasted parameters are preferred to be
recomputed. It suffices to mark such casting nodes as must-recompute.

[1]
https://github.com/pytorch/pytorch/blob/main/torch/_functorch/partitioners.py#L1813
[2]
https://github.com/pytorch/pytorch/blob/v2.0.0/torch/_inductor/compile_fx.py#L459
[3] https://github.com/pytorch/pytorch/pull/157580

# Proposal

Motivated by the flexibility and the requirement for optimizing
DeepCompile + autocast, I propose to switch to the min-cut-based
partitioner for both backends. This PR implements that switch, cleans up
dead code and also recomputes downcasted parameters in the backward.

# Preliminary Evaluation

Here's a summary of the tests using
https://gist.github.com/eternalNight/3c2cf8c703f1e9e7742d3b7f9e1edae3 on
a 8x RTX 5090 node.

| Configuration | Base Time (ms) | Base Mem (GB) | Time with this PR (ms) | Mem with this PR (GB) |
|---------------------|----------------|---------------|------------------------|-----------------------|
| eager + autocast | 551.92 | 12.07 | 571.24 | 9.96 |
| eager + bf16 | 419.87 | 9.47 | 445.76 | 7.30 |
| inductor + autocast | 546.97 | 12.84 | 570.09 | 13.04 |
| inductor + bf16 | 444.03 | 10.01 | 444.70 | 10.19 |

## Reduced memory with eager backend

The initial goal of this PR is to reduce peak memory usage when torch
autocast is enabled. That is achieved according to the first row of the
table, but in two different ways simultaneously.

1. Downcasted parameters during forward are thrown away and recomputed
(by the fused cast + allgather) in the backward pass.
2. Without this PR, `fast_free_schedule` will arrange most allgathers at
the beginning of the graph. That leads to an even higher peak during
forward, which is no longer seen with this PR.
3. By diffing the graphs passed to `add_z3_gather_release`, I noticed
that the recomputations selected by min-cut are slightly different (that
test script has activation checkpointing enabled for the LLM module).
That can also impact computation time and memory usage.

Here's the shape of memory usage before this PR with eager backend +
torch autocast. eager + BF16 shows similar shapes. Numbers reported in
the table are peak during forward. The peak memory usage during backward
decreases by ~0.7GB in both cases.

<img width="1482" height="629" alt="image"
src="https://github.com/user-attachments/assets/7e7ec859-9a04-4ddd-ba37-c2d475a81058"
/>

After this PR:

<img width="1482" height="453" alt="image"
src="https://github.com/user-attachments/assets/f15c71b8-f823-4aa5-801a-a36188c5e866"
/>

## Similar memory with inductor backend

Unlike eager backend, the inductor backend uses similar memory with or
without this PR. The memory usage pattern is as follows, which requires
further analysis.

Before this PR:

<img width="1070" height="613" alt="image"
src="https://github.com/user-attachments/assets/317b9a58-d4ef-459f-ac7b-67ef2318a9de"
/>

After this PR:

<img width="911" height="536" alt="image"
src="https://github.com/user-attachments/assets/7e737a81-cf27-402c-aeea-dfe661043fc1"
/>

Signed-off-by: Junjie Mao <junjie.mao@linux.alibaba.com>
2025-10-03 03:39:38 +00:00
9cbd3edd0d [wall_clock_breakdown] always log stats when enabled (#7617)
Currently, when the main logger is at WARN level, `wall_clock_breakdown: true`
never logs, which is wrong because it disables this at times crucial
functionality. Also, I think we have a disconnect somewhere, since the
recently added `--log_level` flag doesn't seem to change this logger's
level.

The future plan is to be able to have different log levels for different
modules, but for now just use `print` if `wall_clock_breakdown` is
`True`, so this functionality is not log-level dependent.

`print` is also less noisy than the logger, because of the long prefix
generated by the latter, which is of no value to the user since we print
stats and not code related logs, so the printed results are easier to
digest.

Signed-off-by: Stas Bekman <stas@stason.org>
2025-10-02 19:08:39 -04:00
e37c37acdd Fixed save_checkpoint race when consolidating NVMe offloaded tensors (#7613)
Past Discussion: #7549

Signed-off-by: H1manshu21 <himanshuwindows8.1@gmail.com>
Co-authored-by: Olatunji Ruwase <tunji.ruwase@snowflake.com>
2025-10-01 18:08:20 +00:00
07e76bd45f Fixed the issue that universal checkpoint cannot be loaded for stage3 when world size expansion. (#7599)
When the world size expands from 2 to 4 and the checkpoint is converted
to a universal checkpoint, loading from that universal checkpoint fails.
A new rank, for example rank 3, will try to load the model file
`zero_pp_rank_3_mp_rank_00_model_states.pt`, but this file was not
produced during the previous run.
For stage 3, just load the first file, that is
`zero_pp_rank_0_mp_rank_00_model_states.pt`.
The existing unit test
TestZeROUniversalCheckpointDP::test_dp_world_size_2to4 can verify this
problem.
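
A minimal sketch of the file selection described above. The helper name `model_states_file` is hypothetical, and the behavior for stages other than 3 is an assumption made only for contrast; the actual change may be structured differently.

```python
def model_states_file(rank, zero_stage):
    """Return the model-states file a rank would read (hypothetical helper).

    For stage 3, every rank reads the rank-0 file, so ranks that did not
    exist when the checkpoint was saved can still load it.
    """
    # Assumption: other stages keep reading the per-rank file.
    file_rank = 0 if zero_stage == 3 else rank
    return f"zero_pp_rank_{file_rank}_mp_rank_00_model_states.pt"


# World size grew from 2 to 4: rank 3 has no file of its own from the old run.
new_rank_file = model_states_file(rank=3, zero_stage=3)
```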

---------

Co-authored-by: Olatunji Ruwase <tunji.ruwase@snowflake.com>
2025-10-01 15:37:19 +00:00
aa90f544e3 DeepCompile: Fix IPG bucket clearing (#7610)
PR #6993 replaces the flat IPG buffers with a dict maintaining
type-indexed buckets. The member is also renamed from
`_ipg_bucket_flat_buffer` to `ipg_buckets`.

Update the bucket clearing logic in `init_z3` accordingly.

Signed-off-by: Junjie Mao <junjie.mao@linux.alibaba.com>
2025-10-01 03:42:51 +00:00
e32e817306 Handle the case of DeepCompile's enabled but not activated (#7603)
This PR improves state management for DeepCompile in the engine.

Previously, the system relied only on the config flag indicating whether
DeepCompile was enabled. However, DeepCompile is actually activated only
when `compile()` is called. This meant that if DeepCompile was enabled
in the config but `compile()` was never called, it could lead to invalid
internal states (as shown in #7598).

Since `enabled == True` should be interpreted as an option that modifies
the behavior of `compile()`, this PR introduces clearer state
management:
- If .compile() is not called, the DeepCompile config has no effect on
behavior. A one-time message is shown instead.
- A new state, DeepCompile activated, is introduced. This represents the
condition where DeepCompile is both enabled in the config and .compile()
has been called.
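
The distinction between "enabled" and "activated" can be sketched as below. `EngineState` and its method names are hypothetical and only illustrate the state logic, not the engine's real attributes.

```python
class EngineState:
    """Hypothetical sketch: 'enabled' comes from config, 'activated' needs compile()."""

    def __init__(self, deepcompile_enabled):
        self.deepcompile_enabled = deepcompile_enabled
        self._compiled = False
        self._warned = False

    def compile(self):
        self._compiled = True

    def is_deepcompile_active(self):
        # Active only when enabled in the config AND compile() was called.
        return self.deepcompile_enabled and self._compiled

    def step(self):
        """Return the messages emitted during this step (at most one, once)."""
        messages = []
        if self.deepcompile_enabled and not self._compiled and not self._warned:
            messages.append("DeepCompile is enabled but compile() was not called")
            self._warned = True
        return messages


engine = EngineState(deepcompile_enabled=True)
first_step = engine.step()   # one-time message
second_step = engine.step()  # no repeated message
engine.compile()
```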

---------

Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
Co-authored-by: Olatunji Ruwase <tunji.ruwase@snowflake.com>
2025-09-30 17:21:55 -07:00
4efd7eca73 DeepCompile: Fuse allgather and downcast (#7588)
With autocast enabled, a majority of weights are downcasted before being
used in calculations. Today zero3_compile gathers the FP32 weights
before they are downcasted. That is sub-optimal because FP32 weights
consume more bandwidth to allgather and take more time to downcast.

To reduce communication and downcast time, fuse allgather and downcast
in the dc ops. The target type is now passed to allgather_param() and
prefetch_params_fused() which will downcast the (partial) weights before
launching allgathers.

This corresponds to issue 1 of #7577.

Tested with
https://gist.github.com/eternalNight/3c2cf8c703f1e9e7742d3b7f9e1edae3
(run with `deepspeed --num_gpus=N this_file.py -c -p -m 23` to collect
torch and memory profiles, and with DINOV2_DEPTH = SIGLIP_DEPTH = 3,
LLAMA2_DEPTH = 4 for faster compilation) on 5090 (which has limited
inter-GPU bandwidth), time per step decreases from 438ms to 337ms and
peak GPU memory usage from 9.5GB to 8.5GB.

Profiles of a single step before this PR:

<img width="1235" height="1029" alt="image"
src="https://github.com/user-attachments/assets/d9fe5296-7731-4542-924b-421ff7415054"
/>

<img width="1466" height="616" alt="image"
src="https://github.com/user-attachments/assets/aa192802-8633-4e36-b2c4-f28b1b432663"
/>

After this PR:

<img width="1218" height="1006" alt="image"
src="https://github.com/user-attachments/assets/18a0e09c-155b-4783-adb5-b4d36c5c3691"
/>

<img width="1537" height="559" alt="image"
src="https://github.com/user-attachments/assets/16a2ca74-8a89-4db9-9b68-81844295c61b"
/>

This PR also reduces peak memory usage because the
`fast_free_schedule()` today always arranges param allgathers and
downcasts at the beginning of the graph. While the original FP32 params
can be freed early, all FP16/BF16-casted params are kept in GPU memory
at the beginning of the backward graph, leading to a higher peak in
memory usage.

P.S. Probably due to organization branch rule settings, I don't find
anywhere to allow reviewers to modify the branch. So I'll update the
branch per reviewers' comments and rebase if needed.

Signed-off-by: Junjie Mao <junjie.mao@linux.alibaba.com>
2025-09-29 03:15:33 +00:00
6fcccfa2c9 DeepCompile: Specify tensor aliasing in C++ op schema (#7597)
PyTorch C++ op schema [1] allows specifying tensor storage aliasing by
annotating `(a)` after input/output types. Torch inductor takes this
information to determine where to insert explicit `del` statements for
tensors that are no longer needed.

If what an op schema specifies disagrees with the op implementation,
inductor-generated code is likely to release tensors earlier than
expected and lead to wrong results.

`wait_allgather` and `release_param` return the first argument unchanged
and that aliasing should be annotated in the schema.

Also remove the code related to `clone_custom_op_output` as it is solely
a workaround of the aforementioned issue.

[1]
https://github.com/pytorch/pytorch/blob/main/aten/src/ATen/native/README.md

Fixes: #7596

Signed-off-by: Junjie Mao <junjie.mao@linux.alibaba.com>
2025-09-29 02:40:09 +00:00
47b3fb5e7f Fixed the problem of loading universal checkpoint error in multi-machine mode. (#7601)
In a multi-machine environment, loading the stage3 universal checkpoint
will produce incorrect results, causing the loss to increase abnormally.
2025-09-28 20:26:11 +00:00
66c70312f2 Change current_device() to current_device_name() (#7600)
This PR fixes a bug where get_accelerator().current_device() is used in
some places instead of get_accelerator().current_device_name(). This is
mostly fine, but it does not work on CPU.

`torch.empty(3, device=get_accelerator().current_device())` <-- won't
work on devices other than CUDA
`torch.empty(3,
device=torch.device(get_accelerator().current_device()))` <-- works for
GPU devices, but won't work for CPU
`torch.empty(3,
device=torch.device(get_accelerator().current_device_name()))` <-- works
for both GPU devices and CPU
`torch.empty(3, device=get_accelerator().current_device_name())` <--
this also works, but is not as formal as the previous one.

This bug was exposed when I tried to run AutoTP training on a Xeon server
for debugging purposes.

---------

Signed-off-by: Guokai Ma <guokai.ma@gmail.com>
2025-09-28 10:19:49 -07:00
91d14527b6 Fix the universal checkpoint issue for stage3 when there are multiple subgroups. (#7585)
**Describe the bug**

When the model is large and there are multiple subgroups, running
ds_to_universal.py fails. The error log is below:

```
*** 1. Extracting ZeRO fragments
  0%|                                                     | 0/1 [00:03<?, ?it/s]
Traceback (most recent call last):
  File "/work/zhengchenyu/ai-project/qwen3/scripts/ds_to_universal_example.py", line 21, in <module>
    main()
  File "/work/zhengchenyu/ai-project/qwen3/scripts/ds_to_universal_example.py", line 18, in main
    ds_to_universal_main(args)
  File "/opt/conda/lib/python3.11/site-packages/deepspeed/checkpoint/ds_to_universal.py", line 523, in main
    _extract_zero_shard_files_stage3(args, optim_files, param_shapes, dp_degree, temp_dir)
  File "/opt/conda/lib/python3.11/site-packages/deepspeed/checkpoint/ds_to_universal.py", line 375, in _extract_zero_shard_files_stage3
    _do_parallel_work(do_work, list(range(dp_degree)), args.num_extract_workers)
  File "/opt/conda/lib/python3.11/site-packages/deepspeed/checkpoint/ds_to_universal.py", line 359, in _do_parallel_work
    results.append(do_work(work))
                   ^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.11/site-packages/deepspeed/checkpoint/ds_to_universal.py", line 167, in extract_zero_shards_stage3
    dump_param_fragment(temp_dir, 0, dp_index, state_key, flat_state[state_key], name, offset,
  File "/opt/conda/lib/python3.11/site-packages/deepspeed/checkpoint/ds_to_universal.py", line 194, in dump_param_fragment
    state_flat_tensor = state_flat_tensor.narrow(0, offset, numel).clone()
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: start (0) + length (155582464) exceeds dimension size (74499072).

```

**To Reproduce**
Steps to reproduce the behavior:
1. Use large model to run, or set sub_group_size to a lower value. Then
train and save model
2. Run ds_to_universal.py

**The reason**

I found that the previous stage3 universal checkpoint implementation did
not take subgroups into account. I also found the following problems
during debugging.

* Multiple sub-groups are not handled, which results in data loss.
* When load_checkpoint is True, all processes save to the same zero
model checkpoint file. If multiple processes write at the same time, the
file will be corrupted. Occasionally, file corruption was discovered
during testing.

Related issue: #7584

---------

Co-authored-by: Olatunji Ruwase <tunji.ruwase@snowflake.com>
2025-09-27 17:39:43 +00:00
6ea345ae27 Simplify leaf module hook (#7592)
This PR simplifies hooks for leaf module using PyTorch's API.

---------

Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
Co-authored-by: Olatunji Ruwase <tunji.ruwase@snowflake.com>
2025-09-27 13:12:15 -04:00
b75654001a disables ZeRO checkpoint loading path when stage=0 (#7586)
Fixes #7571 

When ZeRO is disabled (stage 0) and bf16 is enabled, the current guard
sets `load_zero_checkpoint=True`, which leads to `_load_zero_checkpoint`
and `_restore_from_bit16_weights()` being called even though no ZeRO
state exists.

This PR removes the `self.bfloat16_enabled()` condition so that
load_zero_checkpoint is tied strictly to `self.zero_optimization()`.

Stage 0 (BF16/FP16/FP32): cleanly skips ZeRO checkpoint path.

Stage ≥ 1: loads ZeRO partitioned optimizer state as before.
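
The guard change can be sketched as two small predicates. The function names are hypothetical, and the `or` in the old guard is an assumption inferred from the description above, not a quote of the engine code.

```python
def load_zero_checkpoint_old(zero_stage, bf16_enabled):
    # Assumed shape of the previous guard: bf16 alone was enough to take
    # the ZeRO checkpoint path, even with ZeRO disabled (stage 0).
    return zero_stage > 0 or bf16_enabled


def load_zero_checkpoint_new(zero_stage, bf16_enabled):
    # After the change: tied strictly to whether ZeRO is in use.
    return zero_stage > 0


stage0_bf16_old = load_zero_checkpoint_old(zero_stage=0, bf16_enabled=True)
stage0_bf16_new = load_zero_checkpoint_new(zero_stage=0, bf16_enabled=True)
```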

cc @sfc-gh-truwase

Signed-off-by: Naveenraj Kamalakannan <therealnaveenkamal@gmail.com>
Co-authored-by: Olatunji Ruwase <tunji.ruwase@snowflake.com>
2025-09-25 20:31:14 +00:00
16c1bf429f Include init file for superoffload folder (#7591)
This PR fixes a small error in PR
[7559](https://github.com/deepspeedai/DeepSpeed/pull/7559), reported in
[this comment](https://github.com/deepspeedai/DeepSpeed/pull/7559#issuecomment-3329036699).

```
[rank1]:   File "/usr/local/lib/python3.11/dist-packages/deepspeed/runtime/engine.py", line 1462, in _configure_optimizer
[rank1]:     self.optimizer = self._configure_zero_optimizer(basic_optimizer)
[rank1]:                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/usr/local/lib/python3.11/dist-packages/deepspeed/runtime/engine.py", line 1835, in _configure_zero_optimizer
[rank1]:     from deepspeed.runtime.superoffload.superoffload_stage3 import SuperOffloadOptimizer_Stage3
[rank1]: ModuleNotFoundError: No module named 'deepspeed.runtime.superoffload'
```

Create `__init__.py` for the superoffload folder to avoid an import error
when the superoffload folder is ignored by pip installation.

---------

Signed-off-by: nguyen599 <pnvmanh2123@gmail.com>
2025-09-24 16:50:17 +00:00
af56ed4d37 SuperOffload Release (#7559)
This PR introduces **SuperOffload**—an optimizer designed for Superchips
(Nvidia GH200 & GB200, AMD MI300A) with high CPU–GPU bandwidth. It
enables **full fine-tuning** of **GPT-OSS-20B, Qwen3-14B, and Phi-4** on
a single GH200 GPU, achieving up to **~500 TFLOPS**, using Hugging Face
Transformers and DeepSpeed—no custom modeling code required.

SuperOffload extends ZeRO-Offload with fine-grained control and CPUAdam
rollback utilities, allowing GPU execution to overlap with CPUAdam. This
reduces GPU idle time and improves overall efficiency.

Key changes:
- New SuperOffloadOptimizer_Stage3 optimizer.
- C++/CUDA binding for adam_rollback to revert one optimization step.
- Config additions including super_offload and cpuadam_cores_perc.

A detailed blog and tutorial will be available soon.

---------

Co-authored-by: Olatunji Ruwase <tunji.ruwase@snowflake.com>
2025-09-24 13:09:23 +00:00
17d80ce440 Deepcompile: Make size of activation to free configurable (#7582)
In deepcompile free-activation mode, only activations larger than a
threshold are eagerly freed. The threshold is hardcoded today and thus
may not be suitable in all cases.

This PR first generalizes the dc.init() interface to take the whole
compile_config object, and then converts the threshold into a config
item.

This corresponds to issue 3 of #7577.

---------

Signed-off-by: Junjie Mao <junjie.mao@linux.alibaba.com>
2025-09-24 01:37:46 +00:00
bc9ed477e9 Broadcast fp16 overflow in Z1 (#7580)
Fix #7568

Signed-off-by: Olatunji Ruwase <tunji.ruwase@snowflake.com>
2025-09-23 15:51:43 +00:00
8c7c56a932 Deepcompile: Fix bugs when applying deepcompile to VLA-like models (#7569)
**Describe the bug**

When applying deepcompile to the OpenVLA model (which is composed of two
vision transformers and a llama-7B), I met the following issues:

a. Not all parameters are trained, which leads to compile-time
exceptions as well as incorrect invocation of `endBackward()`.
b. `release_param()` can be passed a tuple, not a tensor.
c. A use-before-define error in `fast_free_schedule()`.

This PR attempts to fix all of those issues. Patch 1~2 resolves a, 3
resolves b and 4 resolves c.

**To Reproduce the issues**
Use this script:
https://gist.github.com/eternalNight/3c2cf8c703f1e9e7742d3b7f9e1edae3

1. `deepspeed --num_gpus=N openvla-like.py -c`

---------

Signed-off-by: Junjie Mao <junjie.mao@linux.alibaba.com>
2025-09-23 07:27:15 +00:00
35de2030be logging: Also set log level of logger handlers (#7576)
After #7526 the default logger passes logs to a StreamHandler, which has
its own log level. Changing the log level of the logger alone does not
take effect in such a case.

Update the log level of all handlers when changing the parent logger's.
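
A minimal sketch of the idea using only the standard `logging` module. The helper name `set_level_with_handlers` and the logger name are hypothetical; the DeepSpeed logger setup itself is not reproduced here.

```python
import logging


def set_level_with_handlers(logger, level):
    """Set the level on the logger and on every handler attached to it."""
    logger.setLevel(level)
    for handler in logger.handlers:
        # A handler filters records by its own level, independent of the logger.
        handler.setLevel(level)


demo_logger = logging.getLogger("demo_handlers_sketch")
demo_handler = logging.StreamHandler()
demo_handler.setLevel(logging.WARNING)  # would swallow DEBUG/INFO records
demo_logger.addHandler(demo_handler)

set_level_with_handlers(demo_logger, logging.DEBUG)
```

Without the loop over `logger.handlers`, the handler would keep its WARNING level and lower-severity records would still be dropped.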

Signed-off-by: Junjie Mao <junjie.mao@linux.alibaba.com>
2025-09-23 03:32:37 +00:00
325c6c5e9c DeepCompile ZeRO-3: robust allgather for uneven shards; fix profiling… (#7489)
… meta key (max_mem)

---------

Signed-off-by: Abhishek <dalakotiashu150@gmail.com>
Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
Co-authored-by: Abhishek <dalakotiashu150@gmail.com>
Co-authored-by: Masahiro Tanaka <mtanaka@anyscale.com>
2025-09-22 16:45:00 -07:00
e4f6da9685 [bugfix] fix partition context unpatch (#7566)
## Fix asymmetric patching/unpatching in
InsertPostInitMethodToModuleSubClasses

### Problem Description

The `InsertPostInitMethodToModuleSubClasses` context manager patches
`__init__` methods of model classes during entry and unpatches them
during exit.

However, asymmetric condition checks between patching and unpatching can
introduce subtle inheritance bugs.

### Root Cause Analysis

The issue occurs with classes that have multiple inheritance where:
1. **Child class A** does not override `__init__`
2. **Parent class B** does not inherit from `nn.Module`
3. **Parent class C** inherits from `nn.Module`

**Current asymmetric logic:**
```python
# Patching (entry): Only patch classes with explicit __init__
def _enable_class(cls):
    if '__init__' in cls.__dict__:  #  Strict check
        cls._old_init = cls.__init__
        cls.__init__ = partition_after(cls.__init__)

# Unpatching (exit): Restore any class with _old_init
def _disable_class(cls):
    if hasattr(cls, '_old_init'):  #  Permissive check
        cls.__init__ = cls._old_init
```

**Execution flow:**
1. **During entry**: Child A is skipped (no explicit `__init__`), Parent
C is patched
2. **During exit**: Child A inherits `_old_init` from Parent C and gets
incorrectly "restored"

**Result**: Child A's `__init__` points to Parent C's original
`__init__`, bypassing Parent B and breaking the inheritance chain.

### Reproduction Case

This pattern is common in Hugging Face models:
```python
class Qwen3ForSequenceClassification(GenericForSequenceClassification, Qwen3PreTrainedModel):
    pass  # No explicit __init__

# GenericForSequenceClassification - not a nn.Module subclass
# Qwen3PreTrainedModel - inherits from nn.Module
```

### Solution

Apply symmetric condition checking in both patch and unpatch operations:

```python
def _disable_class(cls):
    # Match the patching condition: only restore classes we explicitly patched
    if '__init__' in cls.__dict__ and hasattr(cls, '_old_init'):
        cls.__init__ = cls._old_init
        delattr(cls, '_old_init')  # Optional cleanup
```

This ensures that only classes that were explicitly patched during entry
get restored during exit.
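
The failure mode and the fix can be reproduced without PyTorch. In this sketch, `Module` is a plain-Python stand-in for `nn.Module`, the wrapper is a no-op stand-in for `partition_after`, and all class and function names are hypothetical.

```python
import functools


def make_classes():
    """Build a fresh hierarchy: ChildA(ParentB, ParentC), ParentC(Module)."""

    class Module:  # stand-in for nn.Module
        def __init__(self):
            self.calls = ["Module"]

    class ParentB:  # not a Module subclass
        def __init__(self):
            super().__init__()
            self.calls.append("ParentB")

    class ParentC(Module):
        def __init__(self):
            super().__init__()
            self.calls.append("ParentC")

    class ChildA(ParentB, ParentC):
        pass  # no explicit __init__

    return ParentC, ChildA


def enable_class(cls):
    if "__init__" in cls.__dict__:  # strict check
        original = cls.__init__
        cls._old_init = original

        @functools.wraps(original)
        def wrapper(self, *args, **kwargs):
            original(self, *args, **kwargs)

        cls.__init__ = wrapper


def disable_permissive(cls):
    if hasattr(cls, "_old_init"):  # also true for inherited _old_init
        cls.__init__ = cls._old_init


def disable_symmetric(cls):
    if "__init__" in cls.__dict__ and hasattr(cls, "_old_init"):
        cls.__init__ = cls._old_init
        delattr(cls, "_old_init")


def init_chain_after(disable):
    parent_c, child_a = make_classes()
    for cls in (parent_c, child_a):
        enable_class(cls)
    for cls in (parent_c, child_a):
        disable(cls)
    return child_a().calls


buggy_chain = init_chain_after(disable_permissive)  # ParentB is skipped
fixed_chain = init_chain_after(disable_symmetric)   # full chain runs
```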

### Testing

The fix has been validated against the Qwen3ForSequenceClassification
reproduction case and resolves the inheritance chain corruption.

### Related Issues
- External issue: https://github.com/modelscope/ms-swift/pull/5820

Co-authored-by: Masahiro Tanaka <81312776+tohtana@users.noreply.github.com>
2025-09-19 07:24:33 +00:00
2585881ae9 Make Muon optimizer easier to enable (#7555)
The original Muon optimizer PR
(https://github.com/deepspeedai/DeepSpeed/pull/7509) requires the user to
explicitly set `use_muon` flags in `model.parameters()`, as shown in
test
https://github.com/deepspeedai/DeepSpeed/blob/master/tests/unit/ops/muon/test_muon.py#L27
.

This PR integrates the setting of `use_muon` into DeepSpeed before engine
initialization. This makes the Muon optimizer easier to use: the user only
needs to change the optimizer in `config.json` from `AdamW` to `Muon`, with
no need to change code. It solves the following issue:
https://github.com/deepspeedai/DeepSpeed/issues/7552

---------

Signed-off-by: Ma, Guokai <guokai.ma@intel.com>
Co-authored-by: Olatunji Ruwase <tunji.ruwase@snowflake.com>
Co-authored-by: Masahiro Tanaka <81312776+tohtana@users.noreply.github.com>
2025-09-17 09:52:11 -04:00
2d84be8159 deepcompile: Create a full list of no-copy ops (#7562)
The list of torch no-copy ops is hard coded and does not include all
operations that may alias inputs in their outputs.

Instead of using a fixed list, iterate over all ops under torch.ops.aten
and identify those with aliasing behavior by inspecting their schema.

With PyTorch 2.7.1, the default overload of ops identified by the
updated logic include:

  - _nested_view_from_buffer
  - _reshape_alias
  - alias
  - as_strided
  - conj
  - detach
  - diagonal
  - expand
  - imag
  - lift_fresh
  - narrow
  - permute
  - pin_memory
  - positive
  - real
  - reshape
  - squeeze
  - t
  - unfold
  - unsqueeze
  - view
  - view_as_complex
  - view_as_real
  - most operations whose name ends with an underscore
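
The schema-based detection can be illustrated on schema strings alone, so the sketch runs without PyTorch. The sample strings below are illustrative schemas written in the style of PyTorch's native function declarations, and the regex-based helper is an assumption for demonstration; the actual change inspects the schemas of ops under torch.ops.aten rather than parsing text.

```python
import re

# Alias annotations look like "(a)" for views or "(a!)" for in-place writes.
_ALIAS_ANNOTATION = re.compile(r"\([a-z]+!?\)")


def returns_alias(schema):
    """True if the return part of a schema string carries an alias annotation."""
    _, _, returns = schema.partition("->")
    return bool(_ALIAS_ANNOTATION.search(returns))


# Illustrative schema strings (assumed samples, not read from a PyTorch install).
SAMPLE_SCHEMAS = {
    "view": "view(Tensor(a) self, SymInt[] size) -> Tensor(a)",
    "add": "add.Tensor(Tensor self, Tensor other, *, Scalar alpha=1) -> Tensor",
    "add_": "add_.Tensor(Tensor(a!) self, Tensor other, *, Scalar alpha=1) -> Tensor(a!)",
}

no_copy_ops = sorted(
    name for name, schema in SAMPLE_SCHEMAS.items() if returns_alias(schema)
)
```

This also shows why most ops whose name ends with an underscore are picked up: their return value aliases the mutated input.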

Signed-off-by: Junjie Mao <junjie.mao@linux.alibaba.com>
2025-09-16 09:05:11 -07:00
e9d5d416cc deepcompile: Record graph order using OrderedDict (#7563)
On clear, GraphOrder does not clear ordered_frames. That may confuse
subsequent passes after the first iteration.

Use an OrderedDict to record the mapping from frame IDs to other
graph-related information.

Also fix the type annotation of graph_order, which is a list of (int,
bool) tuples instead of a list of int.
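
A minimal sketch of keeping the order and the per-frame data in a single OrderedDict, so that one `clear()` resets both. `GraphOrderSketch` and the meaning of the boolean flag are hypothetical; only the (int, bool) tuple shape comes from the description above.

```python
from collections import OrderedDict
from typing import List, Tuple


class GraphOrderSketch:
    """Hypothetical sketch: insertion order doubles as graph order."""

    def __init__(self):
        # frame_id -> flag; no separate ordered list that could go stale.
        self.frames = OrderedDict()

    def add_graph(self, frame_id: int, flag: bool) -> None:
        self.frames[frame_id] = flag

    def get_graph_order(self) -> List[Tuple[int, bool]]:
        return list(self.frames.items())

    def clear(self) -> None:
        self.frames.clear()


order = GraphOrderSketch()
order.add_graph(3, True)
order.add_graph(1, False)
recorded = order.get_graph_order()
order.clear()
after_clear = order.get_graph_order()
```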

Signed-off-by: Junjie Mao <junjie.mao@linux.alibaba.com>
2025-09-16 05:25:32 +00:00
660ee89529 deepcompile: Create dummy inputs using empty_strided (#7564)
CUDA tensors may have a larger storage than numel() * dtype.itemsize due
to alignment considerations. Creating dummy tensors by
torch.zeros().as_strided() leads to out-of-bound errors in such cases.

Create dummy inputs by empty_strided().zero_() instead.

Signed-off-by: Junjie Mao <junjie.mao@linux.alibaba.com>
2025-09-15 14:19:06 -07:00
0e859aa0d3 Fix gradient buffer access for DeepCompile Z1/2 (#7548)
The initialization of DeepCompile+Z1/2 now fails due to the change
introduced in #7509.

This PR resolves the issue by:
- Adding an argument to optimizer.get_flat_partition
- Skipping the entire allreduce function in the engine

---------

Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
2025-09-10 18:12:02 +00:00
0012ff6ea8 Limit random seed range in tests (#7553)
`pytest-randomly` often passes a large seed value to `set_random_seed`
and causes an error
([example](https://github.com/deepspeedai/DeepSpeed/actions/runs/17620450004/job/50064585974))
```
E ValueError: Seed must be between 0 and 2**32 - 1
```

This PR limits the range of seed values by taking a modulo.
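
The modulo step can be sketched as below. The helper name `wrap_seed` is hypothetical; the bound comes from the error message quoted above.

```python
MAX_SEED_EXCLUSIVE = 2**32  # valid seeds lie in [0, 2**32 - 1]


def wrap_seed(seed):
    """Map any integer seed into the accepted range by taking a modulo."""
    return seed % MAX_SEED_EXCLUSIVE


small = wrap_seed(7)             # already valid, unchanged
wrapped = wrap_seed(2**32 + 5)   # too large, wraps around
```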

Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
2025-09-10 17:45:37 +00:00
8cbbbb539d [MoE] Fix misuse of num_experts as expert parallel group size (ep_size) (#7551)
Fixes #7535 

## Description
This PR fixes a bug in inference/engine.py where num_experts
(moe_experts) was incorrectly passed as the expert parallel group size
(ep_size) when creating expert parallel groups.

Currently:
```
if moe and dist.get_world_size() > 1:
    self._create_ep_parallel_group(config.moe.moe_experts)
```
This causes **invalid** behavior whenever `num_experts > world_size`,
because `_create_ep_parallel_group` expects a group size, not the total
number of experts, as pointed out by @Arnoochka.

## Root Cause

num_experts = number of experts inside the MoE layer.

ep_size = how many GPUs to group together for expert parallelism.

These were mixed up in the code.

## Fix

Replaced the incorrect call with the proper ep_size argument:
```
if moe and dist.get_world_size() > 1:
    self._create_ep_parallel_group(config.moe.ep_size)
```


Additionally, added a safety check in _create_ep_parallel_group to catch
invalid configurations:

```
num_ep_groups = dist.get_world_size() // moe_ep_size
if num_ep_groups == 0:
    raise ValueError(
        f"Invalid ep_size={moe_ep_size} for world_size={dist.get_world_size()}"
    )
```
## Backward compatibility
- If a user was already running with ep_size >= num_experts, the old
code worked fine and still works fine.
- The previously broken case (num_experts > world_size) now works
correctly.

Signed-off-by: Flakes342 <ayushtanwar1729@gmail.com>
2025-09-09 22:31:44 -07:00
08879a3916 avoid setting device_id to init_process_group (#7542)
In some use cases, such as vLLM, we need to create new distributed
groups not only on GPU but also on CPU. If we set `device_id` here, it
prevents us from creating a new distributed group on CPU:
[L230](https://github.com/vllm-project/vllm/blob/main/vllm/distributed/parallel_state.py#L230).
This PR fixes this bug.

---------

Signed-off-by: Liu, Kaixuan <kaixuan.liu@intel.com>
Co-authored-by: Olatunji Ruwase <tunji.ruwase@snowflake.com>
Co-authored-by: Stas Bekman <stas00@users.noreply.github.com>
2025-09-05 06:06:26 +00:00
43537d0a60 Autotune ZenFlow affinity (#7506)
This PR addresses the following ZenFlow optimizer core binding issue:
https://github.com/deepspeedai/DeepSpeed/issues/7478

With this PR, the ZenFlow optimizer worker derives its core binding
from the DeepSpeed core binding mechanism. The algorithm is as follows:
1. Each DeepSpeed rank gets its core binding from the DeepSpeed command
line flag `--bind_cores_to_rank`, which assigns the physical CPU cores
to different workers.
2. When spawning the ZenFlow optimizer worker, DeepSpeed splits the
current CPU affinity list into two sublists: pt_affinity and zf_affinity.
3. zf_affinity is used to set the affinity of the ZenFlow optimizer
worker. pt_affinity is used to set the affinity of the current pytorch
process.
4. By default, one core is reserved for each pytorch process, and the
rest are used by the ZenFlow optimizer worker. The number of cores
reserved for the pytorch process can be changed with the ZenFlow config
variable `pt_reserved_cores`.
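
The split in steps 2 to 4 can be sketched as follows. This is an
illustration only: the function name `split_affinity` is hypothetical, and
taking the first cores for the pytorch process is an assumption, since the
commit message does not say which cores end up in which sublist.

```python
from typing import List, Tuple


def split_affinity(cores: List[int], pt_reserved_cores: int = 1) -> Tuple[List[int], List[int]]:
    """Split a rank's core list into (pt_affinity, zf_affinity).

    Assumption: the first `pt_reserved_cores` cores go to the pytorch
    process and the remainder go to the ZenFlow optimizer worker.
    """
    if not 0 < pt_reserved_cores < len(cores):
        raise ValueError(
            f"pt_reserved_cores={pt_reserved_cores} must leave at least one core "
            f"for each side of a {len(cores)}-core list")
    pt_affinity = cores[:pt_reserved_cores]
    zf_affinity = cores[pt_reserved_cores:]
    return pt_affinity, zf_affinity
```

On Linux, the input list could come from `sorted(os.sched_getaffinity(0))`,
and each sublist could then be applied with `os.sched_setaffinity`.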

---------

Signed-off-by: Guokai Ma <guokai.ma@gmail.com>
Signed-off-by: Ma, Guokai <guokai.ma@intel.com>
Signed-off-by: aeeeeeep <aeeeeeep@proton.me>
Signed-off-by: Olatunji Ruwase <tunji.ruwase@snowflake.com>
Co-authored-by: Stas Bekman <stas00@users.noreply.github.com>
Co-authored-by: aeeeeeep <aeeeeeep@proton.me>
Co-authored-by: Zhipeng Wang <zhipeng.rainbowserie@gmail.com>
Co-authored-by: Zhipeng Wang <zwanga@wustl.edu>
Co-authored-by: Peng Du <pedu@linkedin.com>
Co-authored-by: pengdurice <pengduhit@gmail.com>
Co-authored-by: Zhipeng Wang <zhipengbayern@gmail.com>
Co-authored-by: Olatunji Ruwase <tunji.ruwase@snowflake.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
2025-09-04 07:10:39 -04:00
66bf2a642d Relax restrictions of torch.autocast integration (#7543)
This PR relaxes two restrictions on torch.autocast in the DeepSpeed
engine:

1) Nesting torch.autocast
Currently, we do not expect `torch.autocast` to be used outside the
DeepSpeed engine. Here is the current behavior:
- If `torch.autocast` is enabled in the DeepSpeed config and the engine
detects it is also enabled outside, a warning is displayed.
- If it is disabled in the config, the engine raises an error.

This design prevents the following usage:
```python
with torch.autocast(...):
    logits = deepspeed_model(...)
    loss = criteria_fn(logits)
```
In this case, we also want to apply autocast to `criteria_fn`. With the
current behavior, we would need to move `deepspeed_model(...)` outside
the `torch.autocast` context, leading to inconsistent code between
DeepSpeed and non-DeepSpeed setups. (This cannot be handled with the
`enabled` arg of `torch.autocast`.)

Change in this PR:
`torch.autocast` outside the DeepSpeed engine is ignored, and
- If `torch_autocast` is enabled in the config, DeepSpeed will follow
that setting.
- If it is disabled, DeepSpeed falls back to its own mixed-precision
support (or FP32).

In these cases, DeepSpeed engine shows a message to explain the
behavior.

2) Model’s dtype

Previously, DeepSpeed assumed the model’s dtype must be FP32 when
`torch.autocast` was enabled. However, models with lower-precision
parameters (e.g., BF16) can also be used with autocast. For example, if
both the model and `torch.autocast` use BF16, autocast will upcast
precision-sensitive ops as needed.

Change in this PR:
Removed the assertion that restricted the model’s dtype to FP32.

This PR also adds and updates tests to cover these new behaviors.

---------

Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
2025-09-03 12:15:10 -07:00
8af75487f4 Fix zenflow_torch_adam.py (#7544)
The `_disable_dynamo_if_unsupported` fallback wasn't getting created
under certain conditions. This PR fixes that and also removes a debug
print.

Fixes an issue installing deepspeed on torch 2.4 and 2.1 that triggered
this:
```
#42 15.84       Traceback (most recent call last):
#42 15.84         File "<string>", line 2, in <module>
#42 15.84         File "<pip-setuptools-caller>", line 34, in <module>
#42 15.84         File "/tmp/pip-install-qgzd6ybt/deepspeed_b3b4858a062d49c7b8d6ef31332a96cf/setup.py", line 40, in <module>
#42 15.84           from op_builder import get_default_compute_capabilities, OpBuilder
#42 15.84         File "/tmp/pip-install-qgzd6ybt/deepspeed_b3b4858a062d49c7b8d6ef31332a96cf/op_builder/__init__.py", line 18, in <module>
#42 15.84           import deepspeed.ops.op_builder  # noqa: F401 # type: ignore
#42 15.84         File "/tmp/pip-install-qgzd6ybt/deepspeed_b3b4858a062d49c7b8d6ef31332a96cf/deepspeed/__init__.py", line 25, in <module>
#42 15.84           from . import ops
#42 15.84         File "/tmp/pip-install-qgzd6ybt/deepspeed_b3b4858a062d49c7b8d6ef31332a96cf/deepspeed/ops/__init__.py", line 6, in <module>
#42 15.84           from . import adam
#42 15.84         File "/tmp/pip-install-qgzd6ybt/deepspeed_b3b4858a062d49c7b8d6ef31332a96cf/deepspeed/ops/adam/__init__.py", line 9, in <module>
#42 15.84           from .zenflow_torch_adam import ZenFlowSelectiveAdamW
#42 15.84         File "/tmp/pip-install-qgzd6ybt/deepspeed_b3b4858a062d49c7b8d6ef31332a96cf/deepspeed/ops/adam/zenflow_torch_adam.py", line 685, in <module>
#42 15.84           @_disable_dynamo_if_unsupported(single_tensor_fn=_single_tensor_adamw)
#42 15.84       NameError: name '_disable_dynamo_if_unsupported' is not defined
#42 15.84       [WARNING] ZenFlow disabled: torch internal optimizer symbols could not be imported: cannot import name '_disable_dynamo_if_unsupported' from 'torch.optim.optimizer' (/usr/local/lib/python3.10/dist-packages/torch/optim/optimizer.py)
```

---------

Signed-off-by: Stas Bekman <stas@stason.org>
2025-09-03 18:14:18 +00:00
1e183a6a9d Fix scaling and allgather with torch.autocast (#7534)
This PR includes these two fixes:
- Use GradScaler only for FP16 (not for BF16)
- Fix dtype conversion for ZeRO3 allgather
  - The reduce hook should be called only once, even when a parameter is
    shared across multiple layers (tied parameters).
  - Currently, the hook is triggered at each tied layer because we
    temporarily set `.data` with a different dtype.
  - The fix ensures that the parameter consistently retains the same
    dtype.

---------

Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
Signed-off-by: Olatunji Ruwase <tunji.ruwase@snowflake.com>
Signed-off-by: Stas Bekman <stas@stason.org>
Signed-off-by: jakehemmerle <jakehemmerle@protonmail.com>
Signed-off-by: Qi Bin <qibin0506@users.noreply.github.com>
Co-authored-by: Olatunji Ruwase <tunji.ruwase@snowflake.com>
Co-authored-by: Stas Bekman <stas00@users.noreply.github.com>
Co-authored-by: digger yu <digger-yu@outlook.com>
Co-authored-by: Jake Hemmerle <jakehemmerle@gmail.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Qi Bin <qibin0506@users.noreply.github.com>
2025-09-03 01:22:19 +00:00
c07b3abf9a fixed DeepSpeedCPULion with ZeRO-Offload bug (#7531)
Fixes the DeepSpeedCPULion with ZeRO-Offload bug reported in
[issues/7524](https://github.com/deepspeedai/DeepSpeed/issues/7524)

Signed-off-by: Qi Bin <qibin0506@users.noreply.github.com>
Co-authored-by: Olatunji Ruwase <tunji.ruwase@snowflake.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
2025-09-02 21:40:14 +00:00
066d912052 [logging] less startup noise (#7526)
This PR removes some and enables removing other startup noise -
especially when it's replicated rank-times and doesn't carry any
informative payload.

1. add `--log_level` flag which sets the launcher's logger to a desired
setting - defaulting to `logging.INFO` for now for BC, but will change
to `logging.WARNING` in v1
2. add `--quiet/-q` flag which sets the launcher's logger to
`logging.ERROR` which essentially disables startup info messages
3. change the logging defaults elsewhere to `logging.WARNING` (the main
impact is on accelerator.py). Once deepspeed has started, the framework
controls its log level for each rank, so the tricky part is the info
logs in this pre-start stage. This part breaks BC, as there is no
machinery to set the logger level for `real_accelerator.py`
4. builder is changed to non-verbose (BC breaking)
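
How flags like these might map onto a logger level can be sketched with
`argparse`. Only the flag names `--log_level` and `--quiet/-q` come from
the description above; the accepted level strings and the helper names
are assumptions, and this is not the launcher's actual code.

```python
import argparse
import logging


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser()
    # Assumption: levels are passed as lowercase names; INFO is the default.
    parser.add_argument("--log_level", default="info",
                        choices=["debug", "info", "warning", "error", "critical"])
    parser.add_argument("-q", "--quiet", action="store_true")
    return parser


def resolve_level(args: argparse.Namespace) -> int:
    """--quiet wins and maps to ERROR; otherwise use --log_level."""
    if args.quiet:
        return logging.ERROR
    return getattr(logging, args.log_level.upper())
```

For example, `resolve_level(build_parser().parse_args(["-q"]))` yields
`logging.ERROR`, which suppresses info-level startup messages.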

---------

Signed-off-by: Stas Bekman <stas@stason.org>
Co-authored-by: Olatunji Ruwase <tunji.ruwase@snowflake.com>
2025-09-02 19:14:57 +00:00
411e20a3f7 undo the revert (#7536)
replay https://github.com/deepspeedai/DeepSpeed/pull/3019 as it got
reverted
2025-09-02 14:24:48 -04:00
902e78c989 fix typo s/1014 /1024 (#7528)
fix typo s/1014 /1024  
         s/was_interruptted /was_interrupted

detail info 
        modified:   deepspeed/autotuning/scheduler.py
        modified:   deepspeed/autotuning/utils.py

Co-authored-by: Olatunji Ruwase <tunji.ruwase@snowflake.com>
2025-09-01 01:12:40 +00:00
eabb687ac1 ZeRO3: Improve mismatch detection (#7525)
ZeRO3 tracks DDP (SPMD) behavior by matching the values of different
training states across ranks. Some of these states are represented as
lists, and mismatches sometimes manifest as hangs during error
detection. This PR improves error detection by first validating the list
lengths across ranks before validating the list contents.
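
The ordering of the two checks can be illustrated in a single process,
with each inner list standing in for one rank's state. The function name
`check_lists_match` is hypothetical, and the real check runs over
collective communication rather than an in-memory list of lists.

```python
from typing import List, Sequence


def check_lists_match(per_rank: Sequence[List[int]]) -> None:
    """Validate lengths first, then contents, across simulated ranks."""
    lengths = [len(values) for values in per_rank]
    if len(set(lengths)) != 1:
        # Failing here avoids comparing lists of different sizes, which is
        # the situation that can hang when done with collectives.
        raise RuntimeError(f"list length mismatch across ranks: {lengths}")
    for index, values in enumerate(zip(*per_rank)):
        if len(set(values)) != 1:
            raise RuntimeError(f"value mismatch at index {index}: {list(values)}")
```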

Motivated by
https://github.com/deepspeedai/DeepSpeed/issues/7461#issuecomment-3235146207

---------

Signed-off-by: Olatunji Ruwase <tunji.ruwase@snowflake.com>
Co-authored-by: Stas Bekman <stas00@users.noreply.github.com>
2025-08-31 17:57:10 -04:00
889f0ead27 Enable non-ZeRO mode (#7515)
Enabled via `stage=0` which corresponds to DDP. 
Remove hardwired path to b16_optimizer.
Enable `torch.autocast` for DDP training
Enable native mixed precision DDP for bfloat16
Update torch.autocast and native mixed precision UTs
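
A minimal config sketch for this mode, written as a Python dict. The
`zero_optimization.stage` and `bf16.enabled` keys are standard DeepSpeed
config keys; the batch size value is an arbitrary placeholder, and the
exact combination shown is an assumption rather than something taken from
this PR.

```python
# Sketch of a DeepSpeed config that disables ZeRO (stage 0, i.e. DDP)
# and uses native bfloat16 mixed precision.
ds_config = {
    "train_micro_batch_size_per_gpu": 1,  # placeholder value
    "zero_optimization": {"stage": 0},
    "bf16": {"enabled": True},
}
```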

<img width="976" height="184" alt="image"
src="https://github.com/user-attachments/assets/92904cdc-e312-46a4-943f-011eb5ab146a"
/>

---------

Signed-off-by: Olatunji Ruwase <tunji.ruwase@snowflake.com>
Co-authored-by: Stas Bekman <stas00@users.noreply.github.com>
2025-08-27 14:07:29 -04:00
66ad278048 Enabling Muon Optimizer in DeepSpeed (#7509)
Authorship: @pengdurice and @PKUWZP 

Related Issue: #7438

# Introduction

[Muon](https://arxiv.org/abs/2502.16982), a new optimizer that has
recently attracted the community’s attention, shows promising results in
training large language models. Adding the Muon Optimizer to DeepSpeed,
a popular OSS framework for large scale training and inference, is
critically important for DeepSpeed users and developers. There has been
a [PR](https://github.com/deepspeedai/DeepSpeed/pull/7454) attempting
the adoption (huge thanks to @qimcis), which is a good starting point.
It still requires more substantial effort to make it fully compatible
and work within DeepSpeed. We are publishing this PR to fully enable
Muon Optimizer capabilities for DeepSpeed.

# Issues and solutions
## Issues
1. With stage 1, 2 or 3, the optimizer states will be partitioned within
the same data parallel group. This means that each process is already
handling only parts of the model parameters and there is no need to use
the DP solution as in the
[code](https://github.com/KellerJordan/Muon/blob/master/muon.py#L195).
2. The parameters (and the gradients) will be flattened to a 1D vector
before being used in the optimizer, thus nullifying the major hypothesis
of the muon optimizer: it works by orthogonalizing the updates for each
matrix (dim >= 2).

## Solutions
To solve the issues, we propose this new PR in which: 
1. We simplify the Muon code by
[removing](https://github.com/deepspeedai/DeepSpeed/compare/master...pengdurice:DeepSpeed:peng-add-muon-v1#diff-c9052994e41caee9ca88363749c10af08655f8019f08dc971c018663d25a3712R22)
the partitioning and muon update logic.

2. We
[move](https://github.com/deepspeedai/DeepSpeed/compare/master...pengdurice:DeepSpeed:peng-add-muon-v1#diff-99dcf26ea2876ff5bbf05b5165c4133eaa0d0f36b170685643c2f7e2eb566addR1867)
the muon update to the
[get_flat_partition](https://github.com/deepspeedai/DeepSpeed/compare/master...pengdurice:DeepSpeed:peng-add-muon-v1#diff-99dcf26ea2876ff5bbf05b5165c4133eaa0d0f36b170685643c2f7e2eb566addR1848)
function of stage 1 and 2 DeepSpeedZeroOptimizer in which per parameter
gradients are collected before being flattened and used by the optimizer
to update the model parameters. Since each parameter is still in its
original shape, we can easily apply the muon updates.
3. We also save the momentum buffer into the optimizer's state so that we
have a smooth convergence after applying the saved checkpoints.
4. We added comprehensive unit tests to validate Muon Optimizer's
correctness and functionality.
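
The ordering in solution 2 (update each gradient while it still has its
matrix shape, then flatten) can be sketched in plain Python. Everything
here is illustrative: `get_flat_partition_sketch` is not the real
`get_flat_partition` signature, and `stand_in_matrix_update` is a
placeholder row normalization, not Muon's orthogonalization.

```python
from typing import Callable, List

Matrix = List[List[float]]


def stand_in_matrix_update(grad: Matrix) -> Matrix:
    """Placeholder shape-aware update: scale each row to unit L1 norm.

    This is NOT the Muon update; it only needs the 2D shape, like Muon does.
    """
    out = []
    for row in grad:
        norm = sum(abs(v) for v in row) or 1.0
        out.append([v / norm for v in row])
    return out


def get_flat_partition_sketch(grads: List[Matrix],
                              matrix_update: Callable[[Matrix], Matrix]) -> List[float]:
    """Apply the per-matrix update first, then flatten into one 1D list."""
    flat: List[float] = []
    for grad in grads:
        updated = matrix_update(grad)  # grad is still 2D at this point
        for row in updated:
            flat.extend(row)
    return flat
```

Once the values are in the flat list, the row structure is gone, which is
why the update has to happen before flattening.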

# Future directions and roadmap
In the future, several follow-up work items are of interest:
- [ ] Create a CPU offload version.
- [ ] Apply Muon to Stage 3
- [ ] Use the highly optimized version of Adam for the Adam part of
MuonWithAuxAdam optimizer.
- [ ] More efficient implementations e.g. a) add specialized kernels for
Newton-Schulz iteration and muon updates; b) parallelize updates for the
parameters (currently, each parameter is updated separately and
sequentially)

---------

Co-authored-by: Peng Du <pedu@linkedin.com>
Co-authored-by: pengdurice <pengduhit@gmail.com>
Co-authored-by: Zhipeng Wang <zhipengbayern@gmail.com>
Co-authored-by: Olatunji Ruwase <tunji.ruwase@snowflake.com>
2025-08-26 18:34:35 -07:00
38d1a9eb64 Fix assert when 'pp_int' object has no attribute 'custom_print_str' (#7507)
Fix the assert `'pp_int' object has no attribute 'custom_print_str'`
raised when tracing the deepspeed module with debug tracing tools such as
[objwatch](https://github.com/aeeeeeep/objwatch)

```python3
import deepspeed
import objwatch

objwatch.watch(targets=[deepspeed], framework="torch.distributed", indexes=[0,], with_locals=True)
```

Signed-off-by: aeeeeeep <aeeeeeep@proton.me>
2025-08-25 10:57:08 -04:00
bc8c0db3b4 Support DeepSpeed offload and reload states with ZeRO1 and ZeRO2 (#7421)
Please refer to https://github.com/deepspeedai/DeepSpeed/issues/7251

---------

Signed-off-by: lym <letusgo126@126.com>
Signed-off-by: Max Kovalenko <mkovalenko@habana.ai>
Signed-off-by: Alex Kiefer <alexkiefer51@gmail.com>
Signed-off-by: Stas Bekman <stas@stason.org>
Signed-off-by: Sam Foreman <saforem2@gmail.com>
Signed-off-by: Stas Bekman <stas.bekman@snowflake.com>
Signed-off-by: huanyuqu <yc37960@um.edu.mo>
Signed-off-by: weeknan <zhounan0431@163.com>
Signed-off-by: WoosungMyung <dntjd517@naver.com>
Signed-off-by: Nir Sonnenschein <nsonnenschein@habana.ai>
Signed-off-by: Junjie Mao <banxing.mjj@alibaba-inc.com>
Signed-off-by: vinceliu <lpnpcs@gmail.com>
Signed-off-by: Tingfeng Lan <erc8gx@virginia.edu>
Signed-off-by: Olatunji Ruwase <tunji.ruwase@snowflake.com>
Signed-off-by: Olatunji Ruwase <tjruwase@gmail.com>
Signed-off-by: Tunji Ruwase <tunji.ruwase@snowflake.com>
Signed-off-by: Yao, Matrix <matrix.yao@intel.com>
Signed-off-by: Yusen Wu <xrn4ub@virginia.edu>
Signed-off-by: cyy <cyyever@outlook.com>
Co-authored-by: Max Kovalenko <mkovalenko@habana.ai>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Masahiro Tanaka <81312776+tohtana@users.noreply.github.com>
Co-authored-by: Alexander Kiefer <56556451+alexk101@users.noreply.github.com>
Co-authored-by: Olatunji Ruwase <tunji.ruwase@snowflake.com>
Co-authored-by: Hongwei Chen <33092912+hwchen2017@users.noreply.github.com>
Co-authored-by: Stas Bekman <stas00@users.noreply.github.com>
Co-authored-by: Sam Foreman <saforem2@gmail.com>
Co-authored-by: Stas Bekman <stas.bekman@snowflake.com>
Co-authored-by: huanyuqu <55744355+huanyuqu@users.noreply.github.com>
Co-authored-by: weeknan <57584045+weeknan@users.noreply.github.com>
Co-authored-by: Olatunji Ruwase <tjruwase@gmail.com>
Co-authored-by: Aurick Qiao <aurick.qiao@snowflake.com>
Co-authored-by: Logan Adams <loadams@microsoft.com>
Co-authored-by: Zhipeng Wang <zhipeng.rainbowserie@gmail.com>
Co-authored-by: WoosungMyung <115716986+WoosungMyung@users.noreply.github.com>
Co-authored-by: Nir Sonnenschein <nsonnenschein@habana.ai>
Co-authored-by: Junjie Mao <junjie.mao@hotmail.com>
Co-authored-by: Junjie Mao <banxing.mjj@alibaba-inc.com>
Co-authored-by: lpnpcs <lpnpcs@vip.qq.com>
Co-authored-by: Ma, Guokai <guokai.ma@gmail.com>
Co-authored-by: Tingfeng Lan <tafflann@outlook.com>
Co-authored-by: Rui Yan <49115835+yanrui27@users.noreply.github.com>
Co-authored-by: Feng Yunlong <20281571+AlongWY@users.noreply.github.com>
Co-authored-by: Yao Matrix <matrix.yao@intel.com>
Co-authored-by: Tingfeng Lan <erc8gx@virginia.edu>
Co-authored-by: Yusen Wu <xrn4ub@virginia.edu>
Co-authored-by: Yuanyuan Chen <cyyever@outlook.com>
Co-authored-by: Michael Wyatt <michael.wyatt@snowflake.com>
2025-08-20 22:03:26 +00:00
8cf5fc5787 Reduce performance impact of compiler.enable decorator (#7498)
For some accelerators (such as HPU) running in non-compile scenarios,
the `compiler.enable` decorator can cause significant performance drops
of up to 8-12%.

We can easily avoid the performance hit in non-compile scenarios by
detecting the ongoing compilation and returning immediately.

Signed-off-by: Max Kovalenko <mkovalenko@habana.ai>
Co-authored-by: Olatunji Ruwase <tunji.ruwase@snowflake.com>
2025-08-18 22:04:10 +00:00