Compare commits


158 Commits

Author SHA1 Message Date
eebd57f4fd Update
[ghstack-poisoned]
2025-11-07 17:04:23 +00:00
280d77bd86 Update
[ghstack-poisoned]
2025-11-07 16:30:18 +00:00
54c90ae440 Update (base update)
[ghstack-poisoned]
2025-11-07 16:30:18 +00:00
fee1ac927d [DebugMode] add stack traces (#166440)
Captures stack traces for torch_dispatch calls under `with DebugMode(record_stack_trace=True)`. Traces aren't rendered in the debug string, but are available in `.stack_trace` for each log entry.
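A minimal sketch of the idea, as a toy stand-in (not the real DebugMode internals): keep the formatted stack alongside each intercepted call instead of rendering it.

```python
import traceback

# Toy stand-in for DebugMode(record_stack_trace=True): store the formatted
# stack per intercepted call rather than printing it in the debug string.
logs = []

def record_call(op_name):
    logs.append({"op": op_name,
                 "stack_trace": "".join(traceback.format_stack())})

record_call("aten::mm")
assert logs[0]["op"] == "aten::mm"
assert "record_call" in logs[0]["stack_trace"]
```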

Pull Request resolved: https://github.com/pytorch/pytorch/pull/166440
Approved by: https://github.com/yushangdi
2025-11-03 02:48:09 +00:00
4a7fefd7c7 [dynamo] fix: pos-only names should be collectable in **kwargs (#166798)
See the new testcase for more details. It fails on trunk and is fixed by this PR.

```python
In [1]: def func(a, /, **kwargs):
   ...:     return a, kwargs

In [2]: func(1, a=2)
Out[2]: (1, {'a': 2})
```
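The same behavior as a self-contained check (plain CPython semantics, no torch needed):

```python
def func(a, /, **kwargs):
    # `a` is positional-only, so the name "a" is free to appear in **kwargs
    return a, kwargs

assert func(1, a=2) == (1, {"a": 2})
assert func(5) == (5, {})
```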

Pull Request resolved: https://github.com/pytorch/pytorch/pull/166798
Approved by: https://github.com/guilhermeleobas
2025-11-03 02:40:34 +00:00
3b4315940d [export] Fix static_input_indices for aot_export_joint (#166761)
`static_input_indices` is used by cudagraphs to determine which input indices are static and will not have changing addresses. Since export never integrated with cudagraphs, this information was not necessary. But now we need it!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166761
Approved by: https://github.com/BoyuanFeng
2025-11-03 01:57:51 +00:00
3eddf04922 Revert "Add min/max support for barebones uint types (#166813)"
This reverts commit 9c22bbb2dce31b854e3387db77eaff501434f352.

Reverted https://github.com/pytorch/pytorch/pull/166813 on behalf of https://github.com/pytorch-auto-revert due to Reverted automatically by pytorch's autorevert, to avoid this behaviour add the tag autorevert: disable ([comment](https://github.com/pytorch/pytorch/pull/166813#issuecomment-3478450413))
2025-11-02 22:50:36 +00:00
7c203b8420 [BE] Using std::move to reduce copy constructor calls by one. (#163599)
inspired by https://github.com/pytorch/pytorch/pull/163416

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163599
Approved by: https://github.com/Skylion007
2025-11-02 21:54:58 +00:00
3ca216ae17 Add claude skills for uint support and AT_DISPATCH_V2 (#166814)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166814
Approved by: https://github.com/Skylion007, https://github.com/malfet
ghstack dependencies: #166813
2025-11-02 21:36:19 +00:00
9c22bbb2dc Add min/max support for barebones uint types (#166813)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/166813
Approved by: https://github.com/Skylion007
2025-11-02 21:36:19 +00:00
6268883f9c [MPS] Refactor torch.cat and add fast path for contiguous inputs (#166556)
In many cases when the fast path is used, the performance is pretty similar to what it used to be. However, with tensors on the order of about 1000 elements there is a modest speedup, which increases as the number of input tensors increases and the number of dimensions increases.

This script was used for performance comparison: <1f04647bbf/cat/perf0.py>

Before change:

```
idx: cpu time, mps time, speedup, op, args, kwargs
-----------------------------------------
0: 0.000843 ms, 0.010431 ms, 0.08, cat, [[tensor(shape[5, 5]), tensor(shape[5, 5])]], {'dim': -1}
1: 0.000838 ms, 0.013467 ms, 0.06, cat, [[tensor(shape[5, 5]), tensor(shape[5, 5])]], {'dim': 1}
2: 0.000792 ms, 0.009457 ms, 0.08, cat, [[tensor(shape[10, 5]), tensor(shape[5, 5])]], {'dim': 0}
3: 0.000834 ms, 0.010694 ms, 0.08, cat, [[tensor(shape[1, 2, 3]), tensor(shape[1, 2, 3])]], {'dim': -2}
4: 0.000627 ms, 0.000641 ms, 0.98, cat, [[tensor(shape[0]), tensor(shape[0])]], {'dim': 0}
5: 0.001172 ms, 0.006493 ms, 0.18, cat, [[tensor(shape[0]), tensor(shape[5, 5])]], {'dim': 1}
6: 0.000812 ms, 0.006148 ms, 0.13, cat, [[tensor(shape[0, 5]), tensor(shape[5, 5])]], {'dim': 0}
7: 0.000686 ms, 0.009382 ms, 0.07, cat, [[tensor(shape[1]), tensor(shape[1])]], {}
8: 0.000738 ms, 0.006532 ms, 0.11, cat, [[tensor(shape[2, 2, 2, 2])], 1], {}
9: 0.003835 ms, 0.193963 ms, 0.02, cat, "[[tensor(shape[3, 1, 2]), tensor(shape[3, 2, 2]), tensor(shape[3, 3, 2]), tensor(shape[3, 1, 2]), te...", {'dim': 1}
10: 0.552435 ms, 0.690500 ms, 0.80, cat, "[[tensor(shape[3, 1, 2]), tensor(shape[3, 2, 2]), tensor(shape[3, 3, 2]), tensor(shape[3, 1, 2]), te...", {'dim': 1}
11: 0.488799 ms, 0.708988 ms, 0.69, cat, "[[tensor(shape[1, 3, 2]), tensor(shape[2, 3, 2]), tensor(shape[3, 3, 2]), tensor(shape[1, 3, 2]), te...", {'dim': 0}
12: 0.000799 ms, 0.005997 ms, 0.13, cat, [[tensor(shape[1000]), tensor(shape[1000])]], {'dim': 0}
13: 0.000916 ms, 0.011791 ms, 0.08, cat, [[tensor(shape[2, 2, 2, 2, 2, 2, 2, 2, 2, 2]), tensor(shape[2, 2, 2, 2, 2, 2, 2, 2, 2, 2])]], {'dim': 0}
14: 0.001028 ms, 0.012269 ms, 0.08, cat, "[[tensor(shape[1000]), tensor(shape[1000]), tensor(shape[1000]), tensor(shape[1000]), tensor(shape[1...", {'dim': 0}
15: 0.001127 ms, 0.025197 ms, 0.04, cat, "[[tensor(shape[2, 2, 2, 2, 2, 2, 2, 2, 2, 2]), tensor(shape[2, 2, 2, 2, 2, 2, 2, 2, 2, 2]), tensor(s...", {'dim': 0}
16: 0.321997 ms, 0.142815 ms, 2.25, cat, [[tensor(shape[1000000]), tensor(shape[1000000])]], {'dim': 0}
17: 1.989967 ms, 1.013615 ms, 1.96, cat, [[tensor(shape[1000000, 3, 2]), tensor(shape[1000000, 3, 2])]], {'dim': 0}
18: 3.161745 ms, 0.965378 ms, 3.28, cat, [[tensor(shape[3, 1000000, 2]), tensor(shape[3, 1000000, 2])]], {'dim': 1}
19: 3.416246 ms, 0.972278 ms, 3.51, cat, [[tensor(shape[3, 2, 1000000]), tensor(shape[3, 2, 1000000])]], {'dim': 2}
```

After change:

```
idx: cpu time, mps time, speedup, op, args, kwargs
-----------------------------------------
0: 0.000902 ms, 0.011074 ms, 0.08, cat, [[tensor(shape[5, 5]), tensor(shape[5, 5])]], {'dim': -1}
1: 0.000899 ms, 0.010453 ms, 0.09, cat, [[tensor(shape[5, 5]), tensor(shape[5, 5])]], {'dim': 1}
2: 0.000771 ms, 0.005843 ms, 0.13, cat, [[tensor(shape[10, 5]), tensor(shape[5, 5])]], {'dim': 0}
3: 0.000776 ms, 0.010449 ms, 0.07, cat, [[tensor(shape[1, 2, 3]), tensor(shape[1, 2, 3])]], {'dim': -2}
4: 0.000616 ms, 0.000600 ms, 1.03, cat, [[tensor(shape[0]), tensor(shape[0])]], {'dim': 0}
5: 0.001150 ms, 0.007624 ms, 0.15, cat, [[tensor(shape[0]), tensor(shape[5, 5])]], {'dim': 1}
6: 0.000728 ms, 0.007949 ms, 0.09, cat, [[tensor(shape[0, 5]), tensor(shape[5, 5])]], {'dim': 0}
7: 0.000671 ms, 0.005458 ms, 0.12, cat, [[tensor(shape[1]), tensor(shape[1])]], {}
8: 0.000770 ms, 0.006590 ms, 0.12, cat, [[tensor(shape[2, 2, 2, 2])], 1], {}
9: 0.003835 ms, 0.190193 ms, 0.02, cat, "[[tensor(shape[3, 1, 2]), tensor(shape[3, 2, 2]), tensor(shape[3, 3, 2]), tensor(shape[3, 1, 2]), te...", {'dim': 1}
10: 0.529047 ms, 0.734389 ms, 0.72, cat, "[[tensor(shape[3, 1, 2]), tensor(shape[3, 2, 2]), tensor(shape[3, 3, 2]), tensor(shape[3, 1, 2]), te...", {'dim': 1}
11: 0.512615 ms, 0.531172 ms, 0.97, cat, "[[tensor(shape[1, 3, 2]), tensor(shape[2, 3, 2]), tensor(shape[3, 3, 2]), tensor(shape[1, 3, 2]), te...", {'dim': 0}
12: 0.000740 ms, 0.004288 ms, 0.17, cat, [[tensor(shape[1000]), tensor(shape[1000])]], {'dim': 0}
13: 0.000955 ms, 0.004119 ms, 0.23, cat, [[tensor(shape[2, 2, 2, 2, 2, 2, 2, 2, 2, 2]), tensor(shape[2, 2, 2, 2, 2, 2, 2, 2, 2, 2])]], {'dim': 0}
14: 0.001037 ms, 0.004578 ms, 0.23, cat, "[[tensor(shape[1000]), tensor(shape[1000]), tensor(shape[1000]), tensor(shape[1000]), tensor(shape[1...", {'dim': 0}
15: 0.001115 ms, 0.004918 ms, 0.23, cat, "[[tensor(shape[2, 2, 2, 2, 2, 2, 2, 2, 2, 2]), tensor(shape[2, 2, 2, 2, 2, 2, 2, 2, 2, 2]), tensor(s...", {'dim': 0}
16: 0.334119 ms, 0.145008 ms, 2.30, cat, [[tensor(shape[1000000]), tensor(shape[1000000])]], {'dim': 0}
17: 2.419846 ms, 0.984192 ms, 2.46, cat, [[tensor(shape[1000000, 3, 2]), tensor(shape[1000000, 3, 2])]], {'dim': 0}
18: 3.117338 ms, 1.000345 ms, 3.12, cat, [[tensor(shape[3, 1000000, 2]), tensor(shape[3, 1000000, 2])]], {'dim': 1}
19: 3.047707 ms, 0.971730 ms, 3.14, cat, [[tensor(shape[3, 2, 1000000]), tensor(shape[3, 2, 1000000])]], {'dim': 2}
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166556
Approved by: https://github.com/malfet
2025-11-02 21:27:05 +00:00
16212f0d6b [Sparse] support for exp op (#166801)
Adds support for the `exp` op on sparse tensors.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166801
Approved by: https://github.com/eqy
2025-11-02 21:14:43 +00:00
c8adc08b3b [Fix] Optimize max unpooling index validation using aminmax (#165394)
Replace separate min() and max() calls with a single aminmax() call in max_unpool_out_mps_template to improve performance by reducing tensor traversals from two passes to one.

Changes:
- Use indices.aminmax() instead of separate indices.min()/max() calls
- Add required ATen/ops/aminmax.h header for AT_PER_OPERATOR_HEADERS
- Maintain identical bounds checking logic and error handling

This optimization is particularly beneficial for large indices tensors, improving cache locality and reducing computational overhead.
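The single-pass idea can be sketched in plain Python (a toy stand-in for the ATen `aminmax` call, not the real implementation):

```python
def aminmax(values):
    # One traversal computes both extremes, instead of separate
    # min() and max() passes over the same data.
    it = iter(values)
    lo = hi = next(it)  # empty input is an error, as with min()/max()
    for v in it:
        if v < lo:
            lo = v
        elif v > hi:
            hi = v
    return lo, hi

assert aminmax([3, -1, 7, 0]) == (-1, 7)
```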

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165394
Approved by: https://github.com/cyyever, https://github.com/Skylion007
2025-11-02 19:42:02 +00:00
23b57a445c Remove setup-env instructions; it's confusing (#166749)
Signed-off-by: Edward Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166749
Approved by: https://github.com/mlazos
2025-11-02 19:22:53 +00:00
cyy
6c7cad6972 Use Python 3.10 typing (#148418)
Use Python 3.10 typing in some files

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148418
Approved by: https://github.com/mlazos
2025-11-02 16:16:52 +00:00
bb54296258 Fix source_fn_stack being None (#166728)
Summary: Apparently source_fn_stack can be empty

Test Plan: CI

Differential Revision: D85956753

Pull Request resolved: https://github.com/pytorch/pytorch/pull/166728
Approved by: https://github.com/SS-JIA, https://github.com/Skylion007, https://github.com/mlazos, https://github.com/atalman
2025-11-02 13:50:16 +00:00
5e05a0ae99 Revert "Fix: list index out of range with softmax when using 0 dim (#166547)"
This reverts commit 0674e0a0f14775f920296e9dfb8b61e4960bf99d.

Reverted https://github.com/pytorch/pytorch/pull/166547 on behalf of https://github.com/atalman due to Fail: test/test_torchfuzz_repros.py::TestFuzzerCompileIssues::test_fuzzer_issue_163971 [GH job link](https://github.com/pytorch/pytorch/actions/runs/19008635308/job/54286552036) [HUD commit link](0674e0a0f1) ([comment](https://github.com/pytorch/pytorch/pull/166547#issuecomment-3477962809))
2025-11-02 13:29:03 +00:00
298666631b [user-streams] Switch to fx annotations at trace time (#166472)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166472
Approved by: https://github.com/anijain2305
ghstack dependencies: #164819, #165211, #165212, #165356, #164523, #162905, #166471
2025-11-02 11:55:51 +00:00
e471800dce [user-streams] cleanup StreamVariable signature (#166471)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166471
Approved by: https://github.com/Lucaskabela
ghstack dependencies: #164819, #165211, #165212, #165356, #164523, #162905
2025-11-02 11:55:51 +00:00
18f4259626 [dynamo] Remove retrieving objects by ID (#162905)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162905
Approved by: https://github.com/anijain2305
ghstack dependencies: #164819, #165211, #165212, #165356, #164523
2025-11-02 11:55:43 +00:00
d962bed157 [user-streams] Add basic stream tests (#164523)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164523
Approved by: https://github.com/anijain2305
ghstack dependencies: #164819, #165211, #165212, #165356
2025-11-02 11:55:37 +00:00
76780b1a3d [user-streams] Handle returning the current stream with/without device index (#165356)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165356
Approved by: https://github.com/anijain2305
ghstack dependencies: #164819, #165211, #165212
2025-11-02 11:55:30 +00:00
cee03634da [user-streams] Track symbolic current stream (#165212)
merge into stream tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165212
Approved by: https://github.com/anijain2305
ghstack dependencies: #164819, #165211
2025-11-02 11:55:22 +00:00
bc03d7c974 [user-streams] Add current stream source (#165211)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165211
Approved by: https://github.com/anijain2305
ghstack dependencies: #164819
2025-11-02 11:55:15 +00:00
f013e804c8 [user-streams] Fix stream graph output semantics (#164819)
Previously, we would stash a single stream value we constructed at trace time in a global and return the same value from repeated calls to the graph.

With this PR, we construct the stream value in advance, reference the constructed value in the graph via the lookup table, and if that value is returned as an output, read the value from the lookup table and return it (in bytecode, not as a graph output, since we don't support arbitrary stream outputs).
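A toy sketch of the lookup-table scheme described above (hypothetical names, heavily simplified): the stream object is built once ahead of time, the graph refers to it by index, and outputs are resolved through the table rather than returned from the graph itself.

```python
# Hypothetical, simplified lookup table for pre-constructed stream values.
stream_lookup = {}

def register_stream(index, stream):
    stream_lookup[index] = stream

def resolve_output(index):
    # resolution happens outside the graph (in bytecode, per the PR)
    return stream_lookup[index]

s = object()  # stand-in for a real stream value
register_stream(0, s)
assert resolve_output(0) is s
assert resolve_output(0) is resolve_output(0)  # repeated calls agree
```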

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164819
Approved by: https://github.com/anijain2305
2025-11-02 11:55:08 +00:00
0674e0a0f1 Fix: list index out of range with softmax when using 0 dim (#166547)
Fixes #163971

Problem:
PyTorch's inductor compiler crashed with IndexError: list index out of range when compiling code that uses 0-dimensional tensors with operations like torch.softmax(scalar_tensor, dim=0).

A 0-dim tensor has shape = torch.Size([]) (empty shape)

```
ndim = 0 (zero dimensions)

len(shape) = 0 (no indices to access)

# Line 972: Pad other_shape to match inp dimensions
other_shape = [1] * (inp_ndim - len(other_shape)) + list(other_shape)

# For scalar tensors:
# inp_ndim = 0  # as input is scalar
# other_shape = []
# Result: [1] * (0 - 0) + [] = [] (still empty!)

dim = match.kwargs["dim"]  # dim = 0
if isinstance(dim, int):
    dim = (dim,)

# crash is happening here!
return all(statically_known_true(other_shape[d] == 1) for d in dim)
#                                 ^^^^^^^^^^^^^^^^
#                                 Tries other_shape[0] but other_shape = [] (empty!)
#                                 → IndexError: list index out of range
```

The function _other_is_broadcasted_in_dim() is an optimization check for a softmax fusion pattern. It verifies whether it's safe to rewrite:

```
# From
scaled = inp * other
result = scaled - scaled.amax(dim, keepdim=True)

# To this more stable form:
result = (inp - inp.amax(dim, keepdim=True)) * other
```

The optimization is only valid if other is constant across the reduction dimension (i.e., broadcasted to size 1 in that dimension). Otherwise, scaling changes which element is the maximum.
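The guard can be sketched as a standalone function (a simplified, hypothetical version of `_other_is_broadcasted_in_dim`, where the scalar case bails out instead of indexing an empty shape):

```python
def other_is_broadcasted_in_dim(inp_ndim, other_shape, dim):
    # Pad other_shape on the left to match the input's dimensionality.
    other_shape = [1] * (inp_ndim - len(other_shape)) + list(other_shape)
    if isinstance(dim, int):
        dim = (dim,)
    if not other_shape:
        # 0-dim input: nothing to index, so the fusion check cannot apply.
        return False
    return all(other_shape[d] == 1 for d in dim)

assert other_is_broadcasted_in_dim(0, [], 0) is False  # no IndexError
assert other_is_broadcasted_in_dim(2, [1], 1) is True
assert other_is_broadcasted_in_dim(2, [3], 1) is False
```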

Pull Request resolved: https://github.com/pytorch/pytorch/pull/166547
Approved by: https://github.com/jansel, https://github.com/eellison, https://github.com/leslie-fang-intel
2025-11-02 06:43:34 +00:00
b7d348a907 [vision hash update] update the pinned vision hash (#166771)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned vision hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166771
Approved by: https://github.com/pytorchbot
2025-11-02 04:24:38 +00:00
9f9dbe0a9a add a curve for customized compilation in the kernel benchmarking scripts (#166697)
It's useful to add a curve with customized compilation options so that we can compare the perf improvement of new features side by side.

E.g. for mix-order-reduction, by running the following command
```
python benchmarks/dynamo/genai_layers/benchmark.py --tolerance=1e-2 --exit-on-accuracy-failure --visualize rmsnorm_backward --custom-compile-name="compiled-no-fusion" --custom-compile-options='{"triton.mix_order_reduction":false}'
```

I get following output:
```
Geomean speedup for benchmark RMSNormBackward
  eager 11 data points
  compiled 11 data points, 15.82x speedup
  quack 11 data points, 15.45x speedup
  liger 11 data points, 14.06x speedup
  compiled-no-fusion 11 data points, 10.26x speedup
```

The output shows that, on average, the feature improves perf by `15.82 / 10.26 = 1.54x` across the shapes tested. (I removed one shape, (32768, 32768), whose rnumel is too large to be representative.)
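The bookkeeping behind the quoted numbers is just a geometric mean of per-shape speedups and a ratio of two geomeans; a minimal sketch:

```python
import math

def geomean(speedups):
    # geometric mean of per-shape speedups, as reported per curve
    return math.exp(sum(math.log(s) for s in speedups) / len(speedups))

assert abs(geomean([2.0, 8.0]) - 4.0) < 1e-9
# the relative improvement quoted in the output is a ratio of two geomeans
assert round(15.82 / 10.26, 2) == 1.54
```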

The new curve also shows up in the figure:
<img width="3564" height="2368" alt="RMSNormBackward_bench" src="https://github.com/user-attachments/assets/1ffac2bc-e726-4f1e-806d-e9e5de711492" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/166697
Approved by: https://github.com/BoyuanFeng
ghstack dependencies: #166053, #166382, #166461, #166585, #166675
2025-11-01 22:09:56 +00:00
a19e92d433 report geomean for norm bwd benchmarking (#166675)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166675
Approved by: https://github.com/BoyuanFeng
ghstack dependencies: #166053, #166382, #166461, #166585
2025-11-01 22:09:56 +00:00
c3dc0c7089 [Inductor] mix order reduction heuristics and tuning (#166585)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166585
Approved by: https://github.com/jansel, https://github.com/PaulZhang12
ghstack dependencies: #166053, #166382, #166461
2025-11-01 22:09:48 +00:00
04d6a6f339 [inductor] Make mix-order-reduction split size not depends on split-reduction heuristics (#166461)
The split size is critical for mix-order-reduction perf, and the one picked by the split-reduction heuristics can be very bad for mix-order reduction.

<img width="1197" height="596" alt="Screenshot 2025-10-27 at 11 17 16 PM" src="https://github.com/user-attachments/assets/7faa11ad-3a7a-4b29-90ed-e85fc01077ea" />

For the first shape in the chart, split reduction picks a split size around 2000 and results in poor perf. It is important to let mix-order reduction decide the split size itself. (ss_8 in the chart means split-size == 8.)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/166461
Approved by: https://github.com/jansel, https://github.com/v0i0
ghstack dependencies: #166053, #166382
2025-11-01 22:09:40 +00:00
0573747b6a [inductor] more aggressive mix order reduction (#166382)
Apply mix-order reduction more aggressively so that we can still generate the fused kernel when rnumel is larger than 1024. Also use more warps in that case.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/166382
Approved by: https://github.com/jansel, https://github.com/v0i0
ghstack dependencies: #166053
2025-11-01 22:09:32 +00:00
a663eb9c80 [FlexFlash] CuteDSL flat indexer needs to be colexicographic in coordinate space (#166657)
Benchmarks on Hopper:
Note: the Triton impl is not using max-autotune because I didn't feel like waiting for the 90x plots.
<img width="12517" height="5995" alt="combined_comparison" src="https://github.com/user-attachments/assets/d94debd9-920d-4413-b51f-b8e906e4fb01" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/166657
Approved by: https://github.com/v0i0, https://github.com/mlazos, https://github.com/eellison
ghstack dependencies: #166359
2025-11-01 21:18:51 +00:00
764c54ecae [DebugMode] dispatch call hooks (#166348)
Adds customizable hooks on `__torch_dispatch__` calls for logging/recording arbitrary values.

Recording hooks store the hook outputs for each call at `debug_mode.operators[*].record`
```python
with DebugMode() as debug_mode, DebugMode.dispatch_hooks(record_hook = some_func):
    # some compute
    ...
```

Logging hooks annotate the string dump:
```python
with DebugMode() as debug_mode, DebugMode.dispatch_hooks(log_hook = some_func):
    ...
```

Adds default hooks `DebugMode.record_outputs()` and `DebugMode.log_tensor_hashes()`, for checking numerical equivalence. The hashing hook borrows from the Observer. Example dump:
```
aten::sum(dt: f32[8, 32]| S(0))
  aten::sum(t: f32[1, 32])  # {'hash': 3.2215590476989746}
  _c10d_functional::all_gather_into_tensor(t: f32[1, 32], 8, 0)  # {'hash': 204.8783062621951}
  _c10d_functional::wait_tensor(t: f32[8, 32])  # {'hash': 204.8783062621951}
  aten::mm(t: f32[1, 8], t: f32[8, 32])  # {'hash': 12.014171155635267}
  aten::sum(t: f32[1, 32])  # {'hash': 3.2215590476989746}
  aten::t(t: f32[1, 8])  # {'hash': 3.7167285680770874}
  aten::detach(t: f32[8, 1])  # {'hash': 3.7167285680770874}
...
```

On the FSDP2 / simple FSDP NE in https://github.com/pytorch/pytorch/pull/164939, with hashing, this produces 2 log dumps (FSDP2: P2010198620, simple FSDP: P2010198963). I asked Claude to check the hashes, it wrote an analysis script, and was able to guess RMS norm as the root cause: P2010195076

Another throw-away example for logging per-op memory usage: https://gist.github.com/pianpwk/372082bf29467aa4aa25cb26dee24aea
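A toy sketch of the hook mechanics (a hypothetical class, not the real DebugMode API): a record hook's return value is stored per call under `.record`, while a log hook annotates the string dump.

```python
class ToyDebugMode:
    # Hypothetical, heavily simplified stand-in for DebugMode dispatch hooks.
    def __init__(self, record_hook=None, log_hook=None):
        self.record_hook = record_hook
        self.log_hook = log_hook
        self.operators = []

    def dispatch(self, op_name, *args):
        entry = {"op": op_name, "args": args}
        if self.record_hook is not None:
            entry["record"] = self.record_hook(op_name, args)
        line = op_name
        if self.log_hook is not None:
            line += f"  # {self.log_hook(op_name, args)}"
        entry["log_line"] = line
        self.operators.append(entry)

mode = ToyDebugMode(record_hook=lambda op, args: len(args),
                    log_hook=lambda op, args: {"hash": sum(args)})
mode.dispatch("aten::add", 1.0, 2.25)
assert mode.operators[0]["record"] == 2
assert mode.operators[0]["log_line"].startswith("aten::add")
```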

Pull Request resolved: https://github.com/pytorch/pytorch/pull/166348
Approved by: https://github.com/yushangdi
2025-11-01 21:10:43 +00:00
0d81bb7f9c [3/N] Use 'is' in callable comparisons (#166780)
It is generally advised to use `is/is not` for comparisons against torch functions.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/166780
Approved by: https://github.com/Skylion007
2025-11-01 20:23:56 +00:00
82fafb3304 Revert "Make PT2 compile backprop through custom op without autograd key a hard error (#166367)"
This reverts commit 84776e13744db6d59b41a063bb8714e2bffe7a06.

Reverted https://github.com/pytorch/pytorch/pull/166367 on behalf of https://github.com/atalman due to backends/xnnpack/test/recipes/test_xnnpack_recipes.py::TestXnnpackRecipes::test_all_models_with_recipes [GH job link](https://github.com/pytorch/pytorch/actions/runs/18999845549/job/54266149620) [HUD commit link](84776e1374) ([comment](https://github.com/pytorch/pytorch/pull/166367#issuecomment-3476757660))
2025-11-01 20:14:22 +00:00
401c2f9657 [FP8][H100][TF32] Disable tf32 for emulated reference computation in test_scaled_mm_vs_emulated_block_wise (#162997)
Fails with 2 mismatches otherwise

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162997
Approved by: https://github.com/Skylion007
2025-11-01 20:13:11 +00:00
13549e0e10 Revert "Avoid DDE in narrow with unbacked start (#166361)"
This reverts commit 1aef88c72d3aef629b20e97a188c9dc4bab46a1a.

Reverted https://github.com/pytorch/pytorch/pull/166361 on behalf of https://github.com/atalman due to examples/models/llama/tests/test_export_llama_lib.py::ExportLlamaLibTest::test_has_expected_ops_and_op_counts [GH job link](https://github.com/pytorch/pytorch/actions/runs/18993202115/job/54257916041) [HUD commit link](1aef88c72d) ([comment](https://github.com/pytorch/pytorch/pull/166361#issuecomment-3476752974))
2025-11-01 20:07:01 +00:00
82d86bacf3 [inductor] track reduction before splitting (#166053)
Keep track of the reduction before splitting.

In the mix-order-reduction context, if one of the reductions is split, it becomes much harder to fuse with the other reduction. Track the reduction's metadata before splitting to make the fusion possible.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/166053
Approved by: https://github.com/jansel
2025-11-01 19:41:21 +00:00
3b5d38a3bc Fix comparing inductor actual strides vs bw graph for activations should not throw DDE. (#166277)
Fix https://github.com/pytorch/pytorch/issues/163894

Pull Request resolved: https://github.com/pytorch/pytorch/pull/166277
Approved by: https://github.com/Lucaskabela
2025-11-01 19:26:20 +00:00
84776e1374 Make PT2 compile backprop through custom op without autograd key a hard error (#166367)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/166367
Approved by: https://github.com/bdhirsh
2025-11-01 17:01:31 +00:00
b3861ac8e7 [reland] Warn if AccumulateGrad stream does not match producer node stream (#166136)
ghstack-source-id: 59641aa32dc6fd027abf3276017432b693aa71f8
Pull-Request-resolved: https://github.com/pytorch/pytorch/pull/165065

Fixes #ISSUE_NUMBER

Opening a new PR for codev

Pull Request resolved: https://github.com/pytorch/pytorch/pull/166136
Approved by: https://github.com/ngimel
2025-11-01 12:33:48 +00:00
4cc64d6234 [inductor] pre grad graph bisecting (#166344)
A few things to note:
1. Customers like vllm use a custom backend (e.g. VllmBackend), split the graph, and call standalone_compile for each split. If we let the bisector override the backend, we won't bisect through the custom backend. `test_configs.bisect_keep_custom_backend_for_inductor` is used to keep the custom backend when we are bisecting for inductor.
2. pre_grad_graph bisecting and lowering bisecting do not yet compose well with each other, since an issue may be captured by whichever one we try first. `test_configs.bisect_pre_grad_graph` is used to enable the 'pre_grad_graph' bisecting.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/166344
Approved by: https://github.com/eellison
2025-11-01 09:22:21 +00:00
1aef88c72d Avoid DDE in narrow with unbacked start (#166361)
Slice knows how to handle an unbacked start; we do not need to offset start before calling slice, we can leave that to slice.
The only edge case is when start < 0 and start + length == 0: there slice and narrow would deviate, so for that case we pass dim_size instead of start + length.
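The edge case can be illustrated with plain Python slicing (a simplified sketch with a hypothetical helper name; real torch additionally handles symbolic/unbacked sizes):

```python
def narrow_to_slice_bounds(start, length, dim_size):
    # narrow(start, length) normally maps to slice(start, start + length),
    # except when start < 0 and start + length == 0: slice(start, 0) is
    # empty, while narrow means "through the end", so use dim_size instead.
    end = start + length
    if start < 0 and end == 0:
        end = dim_size
    return start, end

xs = list(range(6))
s, e = narrow_to_slice_bounds(-2, 2, 6)
assert xs[s:e] == [4, 5]       # matches narrow(x, 0, -2, 2)
assert xs[slice(-2, 0)] == []  # the naive mapping would be wrong here
```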

Pull Request resolved: https://github.com/pytorch/pytorch/pull/166361
Approved by: https://github.com/aorenste
2025-11-01 07:10:23 +00:00
f0745ddb11 Replace c10::call_once with static initialization (#166381)
This PR replaces c10::call_once calls with static initialization where possible. C++11 guarantees that the initialization of static local variables is thread-safe, and static initialization also has lower cost than c10::call_once.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/166381
Approved by: https://github.com/malfet
2025-11-01 07:09:40 +00:00
4316df857c [3.14] Fix torch.package.importer (#166767)
That relies on the internal implementation of `pickle._getattribute`, whose signature changed from taking an object and a string and returning a tuple
(9ab89c026a/Lib/pickle.py (L316))
to taking an object and an iterable of strings and returning an object
(631ba3407e/Lib/pickle.py (L315)).

Test plan:
```
python -c "import torch; print(torch.package.sys_importer.get_name(torch.cuda.Stream))"
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166767
Approved by: https://github.com/williamwen42
2025-11-01 05:05:47 +00:00
9d6597b1e9 Correctly use test parameters (#166726)
This PR makes some tests actually use their previously unused parameters.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/166726
Approved by: https://github.com/rec, https://github.com/albanD, https://github.com/Skylion007
2025-11-01 04:43:31 +00:00
e8fadba28c [pytree] add treespec_{leaf,tuple,dict} functions for args_spec modification (#160843)
The goal of this PR is to provide a standard way to create simple treespec instances and hide the implementation details of the `PyTreeSpec` class.

Changes:

1. Add function `treespec_leaf()` to replace `LeafSpec()`.
2. Add function `treespec_tuple(...)` and `treespec_dict(...)` to create treespec for `tuple` / `dict` which is used for `*args` / `**kwargs`. This avoids direct modification to `treespec` instances that rely on the implementation details of the `PyTreeSpec` class.
3. Change `len(spec.children_specs)` to `spec.num_children`.
4. Change `isinstance(spec, LeafSpec)` to `spec.is_leaf()`.

------

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160843
Approved by: https://github.com/mlazos
2025-11-01 04:12:11 +00:00
60333de85d Revert "Remove setup-env instructions; it's confusing (#166749)"
This reverts commit 3dc92d69ed40fd952244e54bbda0240928756654.

Reverted https://github.com/pytorch/pytorch/pull/166749 on behalf of https://github.com/pytorch-auto-revert due to Reverted automatically by pytorch's autorevert, to avoid this behaviour add the tag autorevert: disable ([comment](https://github.com/pytorch/pytorch/pull/166749#issuecomment-3475481831))
2025-11-01 02:55:56 +00:00
3dc92d69ed Remove setup-env instructions; it's confusing (#166749)
Signed-off-by: Edward Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166749
Approved by: https://github.com/mlazos
2025-11-01 01:48:15 +00:00
f91899ca6c [2/N] Add strict parameter to Python zip calls (#166257)
This PR adds `strict=True/False` to zip calls in test utils. strict=True is passed when possible.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/166257
Approved by: https://github.com/janeyx99
2025-11-01 00:35:41 +00:00
e2dc32f4ba Replace decltype(auto) with auto (#166537)
This PR replaces `decltype(auto)` with `auto` for C++ return type deduction and simplifies some templates.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/166537
Approved by: https://github.com/Skylion007
2025-11-01 00:30:23 +00:00
83cc38d9c1 [precompile] Preserve default arguments for dynamo capture (#166654)
Summary:
Handle the case where there are default arguments in the function signature.

Test Plan:
pytest test/export/test_experimental.py -k test_dynamo_graph_capture_default_args

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/166654
Approved by: https://github.com/tugsbayasgalan
2025-11-01 00:12:10 +00:00
8d599045cf add shape check for avg_pool2d (#161952)
Fix https://github.com/pytorch/pytorch/issues/153312.

**Example:**
```python
import torch

print(torch.__version__)

tensor = torch.tensor([[ -7.8130e-88, -2.2092e-138,  -1.8673e+03, -7.6272e-253,  3.9203e+110,
           1.8380e-51,  2.8762e+268,  2.9094e+286,  5.1816e-228, -4.4916e+191,
          -7.4057e+80,  -9.1955e-18,  5.6536e+225,  8.8364e-175,  1.5053e-226],
        [-3.0521e+239, -2.8307e+306,   1.3297e-03, -9.9969e-132,  2.8920e-286,
           2.3964e+58, -6.8138e-281,  2.0321e-305,  -3.5127e+74,  -4.7560e-92,
          -8.9403e-99, -1.9739e-187, -2.5124e-173,  2.0458e+295,   4.4992e+52],
        [  6.8752e+21,  1.9332e+189, -8.6940e-189,  -6.6743e-15,   1.4691e+41,
           1.0338e+63,  -2.0779e-28, -7.6642e+104,  1.3390e+284, -8.0859e+194,
          8.4600e+107,   4.9115e-44,  1.1665e+285,  5.1275e+203,  9.7580e+303]],
       dtype=torch.float64)

try:
    res = torch.nn.functional.lp_pool1d(
        tensor,
        norm_type=-1.38119e+150,
        kernel_size=7879455037536781369,
        ceil_mode=True,
    )
    print("CPU result:", res)
except RuntimeError as e:
    print(f"CPU error: {e}")

tensor_gpu = tensor.to("cuda:0")
try:
    res = torch.nn.functional.lp_pool1d(
        tensor_gpu,
        norm_type=-1.38119e+150,
        kernel_size=7879455037536781369,
        ceil_mode=True,
    )
    print("GPU result:", res)
except RuntimeError as e:
    print(f"GPU error: {e}")
```

**Output:**

- before
```
2.9.0a0+git8703deb
CPU result: tensor([[0.],
        [0.],
        [0.]], dtype=torch.float64)
GPU error: integer out of range
```

- after
```
2.9.0a0+git2e893df
CPU error: integer out of range
GPU error: integer out of range
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161952
Approved by: https://github.com/mingfeima, https://github.com/malfet
2025-10-31 22:52:41 +00:00
fd5da81fdd [AI Codemod][DevmateFBSourceTestFailureBot] Fix for T243177299 ("Your diff, D85182174, broke some tests") (#166753)
Summary:
As per title, a bot created this diff because this test broke due to [a different PR.](https://github.com/pytorch/pytorch/pull/166026)

<Erased bot summary in case anything we don't want to make external.>

Test Plan:
Bot ran the tests and they passed.

<Erased bot test plan in case anything we don't want to make external.>

Differential Revision: D85745809

Pull Request resolved: https://github.com/pytorch/pytorch/pull/166753
Approved by: https://github.com/d4l3k
2025-10-31 22:49:59 +00:00
9261a1fb12 [MPS] Error out when BatchNorm is called for Complex (#166215)
Also error out when BatchNorm or LayerNorm is called for Long types.

Discovered while trying to enable `test_ops.py` for MPS
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166215
Approved by: https://github.com/dcci, https://github.com/kulinseth, https://github.com/Skylion007
ghstack dependencies: #166214, #166687
2025-10-31 22:44:29 +00:00
clr
d80ae738c9 compile_worker: Make a timer class (#166465)
This subclass allows us to trigger an action after we haven't seen any activity
for a certain number of seconds.
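The idea can be sketched as a small inactivity watchdog (a hypothetical illustration, not the actual `torch._inductor` compile worker class; the name `InactivityTimer` is made up here):

```python
import threading

class InactivityTimer:
    # Fires `callback` once `timeout` seconds pass with no recorded activity.
    def __init__(self, timeout, callback):
        self._timeout = timeout
        self._callback = callback
        self._timer = None
        self._lock = threading.Lock()

    def record_activity(self):
        # Each activity cancels the pending timer and arms a fresh one.
        with self._lock:
            if self._timer is not None:
                self._timer.cancel()
            self._timer = threading.Timer(self._timeout, self._callback)
            self._timer.daemon = True
            self._timer.start()

    def stop(self):
        with self._lock:
            if self._timer is not None:
                self._timer.cancel()
```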

Pull Request resolved: https://github.com/pytorch/pytorch/pull/166465
Approved by: https://github.com/masnesral
2025-10-31 22:39:31 +00:00
51667435f5 [FlexFlash] Wire up mask_mod + blockmask to flash impl (#166359)
I have some local changes that I need to push to flash first
https://github.com/Dao-AILab/flash-attention/pull/1970

Pull Request resolved: https://github.com/pytorch/pytorch/pull/166359
Approved by: https://github.com/v0i0
2025-10-31 22:07:40 +00:00
2699f5410b Revert "[xpu][feature] Integrate OneDNN SDPA training forward/backward into XPU OVERRIDEABLE Backend (#162454)"
This reverts commit fd68d409ada709450ced3030bde89ec662a3f7b7.

Reverted https://github.com/pytorch/pytorch/pull/162454 on behalf of https://github.com/atalman due to internal build failure ([comment](https://github.com/pytorch/pytorch/pull/162454#issuecomment-3475009089))
2025-10-31 21:58:52 +00:00
9970fb97ff Fix Tril Triu SymInt (#166627)
Fixes #165613

### Summary:

- This PR fixes an issue where `torch.tril` and `torch.triu` with dynamic diagonal values cause `torch.export` to incorrectly infer unnecessary constraints between dynamic dimensions.
-  Ensured proper `SymInt` type annotations for the `diagonal` parameter.
-  Updated the C++ implementation to correctly handle `SymInt` diagonal values.

### Impacts:
module: dynamic shapes

Pull Request resolved: https://github.com/pytorch/pytorch/pull/166627
Approved by: https://github.com/ezyang, https://github.com/Skylion007
2025-10-31 21:53:20 +00:00
dfebdcab86 [GraphPartition] cache get_free_symbol_uses (#166338)
Graph partition relies on `get_free_symbol_uses()` to collect symbol inputs.
ee7434be82/torch/_inductor/scheduler.py (L4869-L4885)

I empirically observed that `get_free_symbol_uses()` becomes slower for larger graphs. Specifically, I tried an aten fallback for torchtitan, which results in 10k+ aten nodes. When processing the 600th node, a single call to `get_free_symbol_uses()` takes seconds.

Why? Because `get_free_symbol_uses()` may recursively call another `get_free_symbol_uses()`, which could recursively run many times.
ee7434be82/torch/_inductor/ir.py (L4541-L4543)

This PR fixes the issue by caching the results of `get_free_symbol_uses()`. I validated on torchtitan that the issue is fixed.
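The caching fix can be illustrated with a toy IR (a hypothetical `IRNode`, not inductor's actual class): without memoization, repeated queries re-walk the whole dependency chain; with a cache, each node computes its free symbols at most once.

```python
class IRNode:
    def __init__(self, name, own_symbols=(), inputs=()):
        self.name = name
        self.own_symbols = frozenset(own_symbols)
        self.inputs = list(inputs)
        self.computes = 0   # instrumentation: how often we actually recompute
        self._cache = None

    def get_free_symbol_uses(self):
        # Cached: each node unions its dependencies' symbols at most once,
        # even when queried repeatedly (as the scheduler does per node).
        if self._cache is None:
            self.computes += 1
            result = set(self.own_symbols)
            for dep in self.inputs:
                result |= dep.get_free_symbol_uses()
            self._cache = frozenset(result)
        return self._cache

# Build a linear chain n0 <- n1 <- ... <- n99.
chain = [IRNode("n0", own_symbols={"s0"})]
for i in range(1, 100):
    chain.append(IRNode(f"n{i}", own_symbols={f"s{i}"}, inputs=[chain[-1]]))
```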

Pull Request resolved: https://github.com/pytorch/pytorch/pull/166338
Approved by: https://github.com/eellison
2025-10-31 21:24:05 +00:00
b09fb481e0 [CD] Upgrade GCC version to 13 for XPU build (#162474)
Follow #152426
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162474
Approved by: https://github.com/zxiiro, https://github.com/atalman
2025-10-31 21:15:37 +00:00
4e7232c5da [MPS] Fix smooth_l1_loss backward for fp16 (#166687)
And enable fp16 implementation for CPU, which simplifies OpInfo definitions for the op

Pull Request resolved: https://github.com/pytorch/pytorch/pull/166687
Approved by: https://github.com/Skylion007
ghstack dependencies: #166214
2025-10-31 21:13:46 +00:00
93a70c717a Revert "Add CUDA MXFP4 scaled mm support via. FBGEMM (#166526)"
This reverts commit e3ae0594d16134632ff587c9ab400d4148c83e9f.

Reverted https://github.com/pytorch/pytorch/pull/166526 on behalf of https://github.com/atalman due to Failing internal test ([comment](https://github.com/pytorch/pytorch/pull/166526#issuecomment-3474907536))
2025-10-31 21:10:28 +00:00
d97144d31e [5/N] Remove unused loop variables in tests (#166716)
This PR removes unused loop variables in tests.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/166716
Approved by: https://github.com/Lucaskabela, https://github.com/Skylion007
2025-10-31 20:47:57 +00:00
e4043884c7 [dynamo, 3.14] fix segfault due to improper create_call_function_ex (#166678)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166678
Approved by: https://github.com/malfet
2025-10-31 20:44:53 +00:00
4a7bc1d522 [BE][Typing][Dynamo] Type misc files in torch/_dynamo/variables/ (#166569)
Provides type coverage to ~3000 LOC and 200 methods in  `torch/_dynamo/variables/`

This is the first part of the final step to having 100% strict type coverage in dynamo - see previous comments in https://github.com/pytorch/pytorch/pull/166535 (combined into this one PR because ghstack was giving issues...)

### Coverage report:
```
mypy torch_dynamo/variables --linecount-report /tmp/coverage_log
```
Compare before to after - we go from 3826 to 7221 lines covered

Pull Request resolved: https://github.com/pytorch/pytorch/pull/166569
Approved by: https://github.com/williamwen42, https://github.com/Skylion007
2025-10-31 20:42:27 +00:00
8209a0506b [Pytorch] Enable aarch64 convert autovec only on clang (#166739)
Summary: We've noted issues with modern GCC versions. Until further investigation is carried out, we'll leave the code enabled only on clang

Test Plan: CI

Differential Revision: D85968395

Pull Request resolved: https://github.com/pytorch/pytorch/pull/166739
Approved by: https://github.com/mcfi, https://github.com/Skylion007, https://github.com/robert-hardwick
2025-10-31 20:22:33 +00:00
70aeb49198 [dynamo] clarify graph break handling/logging in symbolic_convert (#166587)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166587
Approved by: https://github.com/Lucaskabela
ghstack dependencies: #166476, #166477, #166586
2025-10-31 20:13:16 +00:00
cf9a834f39 [BE] Move GreenContext implementation details to cpp (#166462)
- Remove all complex defines logic from the header
- Make GreenContext constructor private, as  it should only be created via the static method as singleton
- Delete unused `getContext` and `getGreenContext` methods
- Rename `CUDA_HAS_GREEN_CONTEXT` to `HAS_CUDA_GREEN_CONTEXT()`, which results in compilation error if one accidentally makes a typo
- Suppress `-Wunused-private-field` if GreenContext is not available
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166462
Approved by: https://github.com/ngimel, https://github.com/eqy
2025-10-31 20:11:02 +00:00
856a7a5298 Add missing device to namedtensor tests (#166717)
This PR passes unused `device` argument to tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166717
Approved by: https://github.com/Skylion007
2025-10-31 20:04:41 +00:00
ef8d97efcf fix broken nn_convolution test (#166666)
Summary: Broken by an OSS diff landed during oncall by a third-party contributor

Test Plan: buck test 'fbcode//mode/dev-nosan' fbcode//caffe2/test:nn_convolution -- --run-disabled

Differential Revision: D85899891

Pull Request resolved: https://github.com/pytorch/pytorch/pull/166666
Approved by: https://github.com/atalman, https://github.com/seemethere, https://github.com/Skylion007
2025-10-31 19:59:50 +00:00
d2be06f673 [cpu][fix] Update ACL version to fix crashes with tensor sizes > 2^31-1 (#165904)
----

- Updates Arm Compute Library (ACL) to v52.6.0
- v52.6.0 contains https://github.com/ARM-software/ComputeLibrary/pull/1201 which fixes crashes with tensors of sizes > 2^31-1

fixes: #165654

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165904
Approved by: https://github.com/malfet
2025-10-31 19:37:26 +00:00
08f4535378 Refactor AOTAutogradCacheEntry into AOTAutogradResult (#166656)
This PR refactors the name AOTAutogradCacheEntry into AOTAutogradResult, and BundledAOTAutogradCacheEntry into BundledAOTAutogradResult. It also moves all corresponding code to a new file, `aot_autograd_result`, which is analogous to `output_code.py` from Inductor.

Having all these be called cache entries made sense when all we used them for was caching. But with AOT compile using BundledAOTAutogradCacheEntry, we want a more generalized naming structure.

This is a no-op change,  and all existing tests should pass.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/166656
Approved by: https://github.com/zhxchen17
ghstack dependencies: #166650
2025-10-31 18:54:09 +00:00
30157d30f0 Add regional aot eager support to AOTAutogradCacheEntry (#166650)
This PR does two things:

- It genericizes `BundledAOTAutogradCacheEntry` to support *any* OutputCode, not just CompiledFxGraphs
- It adds a brand new OutputCode for the `aot_eager_regional_inductor` backend, i.e. a graph module that has regional inductor components in it.

This allows BundledAOTAutogradCache to just integrate nicely with inductor out of the box, but more importantly, it allows the result of aot_autograd to be fully serializable when using `aot_eager_regional_inductor`. This will allow us to AOT precompile cases where we have an eager graph that has scooped up inductor bits.

It's a bit unfortunate that the naming makes BundledAOTAutogradCacheEntry sound like its primary use is for caching, but really the more common use is going to be as an AOTAutogradOutput. It may be worth revisiting how to refactor/rename these in a later PR:

- AOTAutogradCacheEntry -> AOTAutogradResult
- BundledAOTAutogradCacheEntry -> BundledAOTAutogradResult

Pull Request resolved: https://github.com/pytorch/pytorch/pull/166650
Approved by: https://github.com/zhxchen17
2025-10-31 18:54:09 +00:00
b470e59c38 partitioner option to ignore partitioner_tag for abstract usage (#166725)
Partitioner functionality is appealing to use in different scenarios (e.g., Autoparallel).

We have special logic around the "partitioner_tag" from meta that is only needed for the forward/backward split.

This PR adds an optional argument to skip it and perform only a generic split based on inputs/outputs.

Potentially we want to make `_extract_graph_with_inputs_outputs` without underscore :)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166725
Approved by: https://github.com/bdhirsh
2025-10-31 18:50:02 +00:00
85b85f6c2c Revert "[pytree] add treespec_{leaf,tuple,dict} functions for args_spec modification (#160843)"
This reverts commit 108bb224f77842593009214ebf6258030b934642.

Reverted https://github.com/pytorch/pytorch/pull/160843 on behalf of https://github.com/atalman due to failing internal builds ([comment](https://github.com/pytorch/pytorch/pull/160843#issuecomment-3474354428))
2025-10-31 18:31:32 +00:00
b71966f67b [PyTorch] Improve aarch64 performance of bfloat16 ops - retry (#166028) (#166641)
Summary:

PR allows compiler to better optimize some bfloat16-based operations, when ran on NEON

Retrying to land the code, after noting that these expressions became available in recent compiler versions.

The current CI benchmark `binary_test.py` will measure the affected codepaths.

Benchmarks show measurable improvements on clang-19, when targeting armv9-a+sve2:

Before:
bfloat16 add: 250.503us
bfloat16 sub: 245.674us
bfloat16 neg: 113.945us
bfloat16 abs: 115.953us
bfloat16 reciprocal: 262.602us

After:
bfloat16 add: 203.862us ---> 23% higher throughput
bfloat16 sub: 201.526us ---> 22% higher throughput
bfloat16 neg: 68.416us ---> 67% higher throughput
bfloat16 abs: 71.003us  ---> 63% higher throughput
bfloat16 reciprocal: 177.834us ---> 48% higher throughput

Test Plan:
Correctness:

buck2 test mode/opt //caffe2/test:test_ops
buck2 test mode/opt //caffe2/test:torch

Performance:

buck2 run mode/opt //caffe2/benchmarks/operator_benchmark/fb:operator_benchmark_test

Reviewed By: mcfi

Differential Revision: D85809843

Pull Request resolved: https://github.com/pytorch/pytorch/pull/166641
Approved by: https://github.com/Skylion007, https://github.com/malfet
2025-10-31 18:21:04 +00:00
0947765eb9 Cache even more work for return_and_correct_aliasing (#166365)
Yet another pass found even more work we can move to be done only once. This seems to knock a few microseconds off the DTensor dispatch fast path.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166365
Approved by: https://github.com/bdhirsh
2025-10-31 18:03:05 +00:00
239e7b541a [ROCm][CI] upgrade nightly wheels to ROCm 7.1 (#166730)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166730
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-10-31 17:30:47 +00:00
ffaa6578b7 Revise deprecation warning for ONNX exporter (#166692)
Updated deprecation warning for ONNX export to reflect the current state.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/166692
Approved by: https://github.com/titaiwangms
2025-10-31 17:23:55 +00:00
365ed62f61 Document LibTorch ABI more, add README to headeronly (#166661)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166661
Approved by: https://github.com/mikaylagawarecki, https://github.com/albanD
2025-10-31 17:18:13 +00:00
fcc1063566 Revert "[BE][Typing][Dynamo] Type misc files in torch/_dynamo/variables/ (#166569)"
This reverts commit aa9c96af041b26c9c55adac490f3449b98f27d06.

Reverted https://github.com/pytorch/pytorch/pull/166569 on behalf of https://github.com/Lucaskabela due to Lintrunner not fixed due to race condition at landing ([comment](https://github.com/pytorch/pytorch/pull/166569#issuecomment-3474012637))
2025-10-31 16:59:33 +00:00
121235956b update Node.is_impure check if subgraph contains impure ops (#166609)
Summary:
## Context
when `const_fold.split_const_subgraphs` sees a `call_module` node that is a GraphModule, by the existing implementation it can mark this node as const-foldable when it shouldn't.

For example, a parent graph contains a `call_module` to a subgraph that has no inputs but contain impure ops inside.
```
parent graph():
    %sub : [num_users=1] = call_module[target=sub](args = (), kwargs = {})
    %getitem : [num_users=1] = call_function[target=operator.getitem](args = (%sub, slice(None, None, None)), kwargs = {})
    return (getitem,)

submodule graph():
    %randn : [num_users=1] = call_function[target=torch.ops.aten.randn.default](args = ([5, 10],), kwargs = {device: cpu, pin_memory: False})
    %add : [num_users=1] = call_function[target=torch.ops.aten.add.Tensor](args = (%randn, 1), kwargs = {})
    return (add,)
```
When the `submodule` graph is fed to `const_fold.split_const_subgraph`, it comes out unmodified, since `randn` is impure.

But if `submodule` is called by a `parent` graph, then when `parent` is fed to `const_fold.split_const_subgraph`, it comes out folded.
```
parent after fold graph():
    %_fx_const_folded_attrs : [num_users=1] = get_attr[target=_FX_CONST_FOLDED_ATTRS]
    return (_fx_const_folded_attrs,)
```

This is because the `node.is_impure()` check inside `const_fold.split_const_subgraph` falls through, leaving the `call_module` node marked as pure.

## Fix

We can update `fx.node.Node.is_impure` function to check for ops inside a call_module node with an additional `subgraph_has_impure_ops` check:
- if a call_module node calls a GraphModule,
- check any call_function nodes are impure ops
- recursively check any call_module nodes that call GraphModule

If the call_module subgraph has impure ops, return True from `is_impure`.
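The recursive check outlined in the bullets can be sketched as follows (a simplified stand-in for `fx.Node` / `GraphModule`; the real implementation walks actual FX graphs and targets):

```python
IMPURE_TARGETS = {"aten.randn.default", "aten.dropout.default"}

class FakeNode:
    # Minimal stand-in for fx.Node: an `op` plus either a target or a subgraph.
    def __init__(self, op, target=None, subgraph=None):
        self.op = op              # "call_function" or "call_module"
        self.target = target
        self.subgraph = subgraph  # list of FakeNodes for call_module nodes

def subgraph_has_impure_ops(nodes):
    # True if any call_function is impure, recursing into call_module graphs.
    for node in nodes:
        if node.op == "call_function" and node.target in IMPURE_TARGETS:
            return True
        if node.op == "call_module" and node.subgraph is not None:
            if subgraph_has_impure_ops(node.subgraph):
                return True
    return False

# Mirror the example above: a submodule containing randn, called by a parent.
submodule = [
    FakeNode("call_function", target="aten.randn.default"),
    FakeNode("call_function", target="aten.add.Tensor"),
]
parent = [FakeNode("call_module", subgraph=submodule)]
```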

Test Plan: added tests to test_fx_const_fold.py

Differential Revision: D85798483

Pull Request resolved: https://github.com/pytorch/pytorch/pull/166609
Approved by: https://github.com/blaine-rister
2025-10-31 16:58:18 +00:00
aa9c96af04 [BE][Typing][Dynamo] Type misc files in torch/_dynamo/variables/ (#166569)
Provides type coverage to ~3000 LOC and 200 methods in  `torch/_dynamo/variables/`

This is the first part of the final step to having 100% strict type coverage in dynamo - see previous comments in https://github.com/pytorch/pytorch/pull/166535 (combined into this one PR because ghstack was giving issues...)

### Coverage report:
```
mypy torch_dynamo/variables --linecount-report /tmp/coverage_log
```
Compare before to after - we go from 3826 to 7221 lines covered

Pull Request resolved: https://github.com/pytorch/pytorch/pull/166569
Approved by: https://github.com/williamwen42
2025-10-31 16:56:50 +00:00
c3b71d5499 [ROCm][CI] remove relaxed tolerance for tf32 tests (#166478)
Instead of relaxing tolerances for certain unit tests that exercise TF32 on MI300, skip the tests until hipblaslt accuracy is improved.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/166478
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
Co-authored-by: Jagadish Krishnamoorthy <jagadish.krishnamoorthy@amd.com>
2025-10-31 16:15:42 +00:00
1e3600b528 [MPS] Move logaddexp/logaddexp2 to Metal and support complex (#166670)
NOTE: Complex inputs are only supported in `logaddexp`. Since `logaddexp2` does not support complex inputs for CPU, it is not enabled for MPS in this PR either.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166670
Approved by: https://github.com/malfet
2025-10-31 16:15:02 +00:00
fee7624bd6 [PT2] set choice handler in config (#166607)
Summary:
We were setting the custom inductor choice using `torch._inductor.virtualized.V.set_choices_handler(CustomInductorChoices())`. However, this leads to inconsistent behaviors, even for jobs that are submitted back to back.

In this diff, we pass in the choice handler via an inductor config and override the default behavior when the config is provided. This solves the inconsistent behavior.

Test Plan: see D85785892 (internal only)

Differential Revision: D85785879

Pull Request resolved: https://github.com/pytorch/pytorch/pull/166607
Approved by: https://github.com/eellison
2025-10-31 15:40:05 +00:00
24e94e021a [ROCm][CI] create ROCm 7.1 magma tarball (#166693)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166693
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-10-31 15:20:00 +00:00
69be99ee51 Remove manually synced arch versions in tools/nightly.py (#166616)
Discussed with @atalman offline. To reduce duplicate changes and reduce the number of files to change when updating arch versions.

------

Pull Request resolved: https://github.com/pytorch/pytorch/pull/166616
Approved by: https://github.com/ezyang
2025-10-31 15:11:28 +00:00
034e951b0c [CUDA][cuBLASLt] addmm -- extend bias fusions to cases with (1 by n) shapes (#166307)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166307
Approved by: https://github.com/eqy
2025-10-31 14:30:41 +00:00
160ab53dd5 Update weight tensor initialization in RMSNormalization (#166550)
Ensure the weight is a >1-d tensor for ORT compatibility.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/166550
Approved by: https://github.com/titaiwangms
2025-10-31 14:29:27 +00:00
5bcfdae71d Revert "Make PT2 compile backprop through custom op without autograd key a hard error (#166367)"
This reverts commit 4acc66f1192ab7743abcc50383aefc5447447f9d.

Reverted https://github.com/pytorch/pytorch/pull/166367 on behalf of https://github.com/atalman due to internal build failures ([comment](https://github.com/pytorch/pytorch/pull/166367#issuecomment-3473150269))
2025-10-31 13:44:05 +00:00
4e8ba37ce3 Revert "[BE] Move GreenContext implementation details to cpp (#166462)"
This reverts commit 5d288bc3f73873887f681e15af83c5525e6a60bd.

Reverted https://github.com/pytorch/pytorch/pull/166462 on behalf of https://github.com/atalman due to Sorry, Reverting. Failure: test/test_matmul_cuda.py::TestMatmulCudaCUDA::test_greencontext_carveout_cuda [GH job link](https://github.com/pytorch/pytorch/actions/runs/18962393091/job/54154156892) [HUD commit link](85b035ca9c) ([comment](https://github.com/pytorch/pytorch/pull/166462#issuecomment-3473060299))
2025-10-31 13:20:48 +00:00
26534e9809 Revert "[GraphPartition] cache get_free_symbol_uses (#166338)"
This reverts commit a6b1ef17173f56ba93ac97ff4384fa4060b5e41e.

Reverted https://github.com/pytorch/pytorch/pull/166338 on behalf of https://github.com/atalman due to Failure: test/nn/test_convolution.py::TestConvolutionNN::test_conv3d_overflow_values [GH job link](https://github.com/pytorch/pytorch/actions/runs/18961173726/job/54149112920) [HUD commit link](a6b1ef1717) ([comment](https://github.com/pytorch/pytorch/pull/166338#issuecomment-3472980329))
2025-10-31 12:57:56 +00:00
657f8c3e21 Revert "Fix torch.full with dynamic tensor fill_value in torch.compile (#166554)"
This reverts commit 32066772b3dee643b1657b8957f32b5ac8b1390a.

Reverted https://github.com/pytorch/pytorch/pull/166554 on behalf of https://github.com/atalman due to Failure: test/nn/test_pooling.py::TestPoolingNNDeviceTypeCPU::test_max_pool_nan_inf_cpu_float32 [GH job link](https://github.com/pytorch/pytorch/actions/runs/18959368975/job/54144148546) [HUD commit link](32066772b3) ([comment](https://github.com/pytorch/pytorch/pull/166554#issuecomment-3472976911))
2025-10-31 12:55:31 +00:00
b0831930ed [inductor] Mark / restrict tests that only work if ATen is used for matmul (#166518)
These tests only work if max_autotune=False (default), which for matmul means falling back to ATen. This PR just documents / makes that transparent.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/166518
Approved by: https://github.com/eellison
2025-10-31 12:29:06 +00:00
c01636e1bc Fixes the sparse tensor issue (#163535)
Fixes #148324

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163535
Approved by: https://github.com/janeyx99
2025-10-31 11:48:31 +00:00
fd68d409ad [xpu][feature] Integrate OneDNN SDPA training forward/backward into XPU OVERRIDEABLE Backend (#162454)
This is the second PR split from https://github.com/pytorch/pytorch/pull/156272

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162454
Approved by: https://github.com/guangyey, https://github.com/EikanWang, https://github.com/drisspg
2025-10-31 11:20:38 +00:00
0d3a4f7155 [CD] Enable Inductor performance test for xpu (#166289)
Add Dynamo benchmark performance tests for XPU backend

Pull Request resolved: https://github.com/pytorch/pytorch/pull/166289
Approved by: https://github.com/EikanWang, https://github.com/atalman
2025-10-31 10:52:07 +00:00
108bb224f7 [pytree] add treespec_{leaf,tuple,dict} functions for args_spec modification (#160843)
The goal of this PR is to provide a standard way to create simple treespec instances and hide the implementation details of the `PyTreeSpec` class.

Changes:

1. Add function `treespec_leaf()` to replace `LeafSpec()`.
2. Add function `treespec_tuple(...)` and `treespec_dict(...)` to create treespec for `tuple` / `dict` which is used for `*args` / `**kwargs`. This avoids direct modification to `treespec` instances that rely on the implementation details of the `PyTreeSpec` class.
3. Change `len(spec.children_specs)` to `spec.num_children`.
4. Change `isinstance(spec, LeafSpec)` to `spec.is_leaf()`.
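As a rough sketch of what these helpers abstract over (a toy `TreeSpec`, not the real `torch.utils._pytree.PyTreeSpec` implementation), the three constructors might look like:

```python
class TreeSpec:
    # Toy treespec: `type_` is None for a leaf, else the container type.
    def __init__(self, type_, context, children):
        self.type = type_
        self.context = context
        self.children_specs = children

    @property
    def num_children(self):
        return len(self.children_specs)

    def is_leaf(self):
        return self.type is None

def treespec_leaf():
    return TreeSpec(None, None, [])

def treespec_tuple(*children):
    # Spec for *args: a tuple of child specs.
    return TreeSpec(tuple, None, list(children))

def treespec_dict(mapping):
    # Spec for **kwargs: keys become context, values are child specs.
    return TreeSpec(dict, list(mapping.keys()), list(mapping.values()))
```

Callers build specs through these functions instead of poking at `children_specs` directly, which is the point of the API change.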

------

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160843
Approved by: https://github.com/mlazos
2025-10-31 10:33:16 +00:00
fc8ac1216c [4/N] Remove unused loop variables in tests (#166690)
This PR removes unused loop variables in tests.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/166690
Approved by: https://github.com/justinchuby, https://github.com/mlazos
2025-10-31 10:20:48 +00:00
030de07aff [2/N] Use 'is' in callable comparisons (#166685)
It is generally advised to use `is/is not` for comparisons against torch functions.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/166685
Approved by: https://github.com/xmfan, https://github.com/mlazos
2025-10-31 08:08:07 +00:00
7d67a41db4 make FXConverter.generate use V.fake_mode instead of _detect_fake_mode_from_gm (#166591)
Summary:
FXConverter configures `_node_metadata_hook`, passing in `fake_mode` explicitly, which is relevant for cases down the line like `_generate_triton_call` that inserts a `triton_kernel_wrapper_mutation` node.

This `fake_mode` is obtained from `_detect_fake_mode_from_gm`, which can differ from the inductor-set `V.fake_mode`.

For example, while `V.fake_mode` is not None, `_detect_fake_mode_from_gm` can be **None** for a parent graph containing only a submodule which has no input args and only constants
```
parent graph():
    %sub : [num_users=1] = call_module[target=sub](args = (), kwargs = {})
    %getitem : [num_users=1] = call_function[target=operator.getitem](args = (%sub, slice(None, None, None)), kwargs = {})
    return (getitem,)

submodule graph():
    %randn : [num_users=1] = call_function[target=torch.ops.aten.randn.default](args = ([5, 10],), kwargs = {device: cuda, pin_memory: False})
    %add : [num_users=1] = call_function[target=torch.ops.aten.add.Tensor](args = (%randn, 1), kwargs = {})
    return (add,)

```

This discrepancy is flawed: it makes `_node_metadata_hook` try running inputs in a different `fake_mode`, or no fake_mode at all, while the rest of lowering uses `V.fake_mode`. In some cases where an input is placed on a custom non-GPU device, it can even fail with "requires device to be started" or a tensor device mismatch.

So this diff updates FXConverter.generate to use `V.fake_mode` which is populated by inductor properly.

Test Plan:
added a test `test_const_folded_subgraph` in `test_fxir_backend.py`, this test:
- creates a graph module that calls a subgraph with no inputs and containing only const-foldable ops
- const fold the subgraph
- run FXConverter.generate, expect `fake_mode` used to code-generate is not None

On the prior implementation when `_detect_fake_mode_from_gm` was used, this test would fail as fake_mode would be `None`.

With this change, the test passes, `fake_mode` is properly collected from `V.fake_mode` which is not None.

Differential Revision: D85767475

Pull Request resolved: https://github.com/pytorch/pytorch/pull/166591
Approved by: https://github.com/blaine-rister, https://github.com/mlazos, https://github.com/eellison
2025-10-31 05:52:07 +00:00
85b035ca9c [nativert] Downcast triton double arguments to floats (#166620)
This diff tries to fix a limitation in the Sigmoid + Triton interaction, where float arguments are not correctly passed: NativeRT passes float arguments as double, while triton kernels read them as float, resulting in wrong values.

---

## Limitations in (de)seriazliation

In triton, float arguments to a kernel are encoded as "fp32" ([code](https://github.com/triton-lang/triton-cpu/blob/main-merged/python/triton/runtime/jit.py#L310-L326)):
```
        elif isinstance(arg, float):
            return ("fp32", None)
```
But torch export serde uses double ([code](d2eff5d454/torch/_export/serde/export_schema.thrift (L149))) because Thrift only has a double type:
```
union Argument {
  10: bool as_none;
  20: TensorArgument as_tensor;
  30: list<TensorArgument> as_tensors;
  50: i64 as_int;
  70: list<i64> as_ints;
  80: double as_float;   ===> actually double
...
```
The `TritonKernel` constructor loads attributes from a node, where `Constant` represents the variant type, and it only has `double` ([code](d2eff5d454/torch/nativert/graph/Graph.h (L86))):
```
using Constant = std::variant<
    None,
    int64_t,
    std::vector<int64_t>,
    double,    ===> triton float is loaded as double
```

So, NativeRT passes float arguments (originally in Triton) as double to triton kernels. But, all of the triton backends (nvidia, amd and cpu) are reading them as float because the signature still says `fp32`.

D84423898 was the current workaround: wrapping float arguments with tensors.

## The Fix

Fixing the thrift definition isn't viable because Thrift only supports the double type. Fixing it on the triton side is also possible (it could downcast from double to float), but that would require changing all backends.

Instead, I think this diff is the most effective way: when building `TritonKernel`, downcast the float values right after loading the double arguments.
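The downcast described here amounts to a float32 round-trip applied to every argument whose triton signature entry says `fp32` (a sketch under that assumption; `downcast_args` and the list-shaped `signature` are illustrative names, not NativeRT's actual API):

```python
import struct

def downcast_args(signature, args):
    # Replace each double destined for an `fp32` parameter with the nearest
    # float32 value, matching what the triton kernel will actually read.
    out = []
    for ty, val in zip(signature, args):
        if ty == "fp32":
            val = struct.unpack("f", struct.pack("f", val))[0]
        out.append(val)
    return out
```

For example, `0.1` is not exactly representable in float32, so the kernel sees the nearest float32 value rather than the original double.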

Test Plan:
```
buck test fbcode//mode/opt-amd-gpu fbcode//caffe2/test:test_export --
```

Differential Revision: D85747160

Pull Request resolved: https://github.com/pytorch/pytorch/pull/166620
Approved by: https://github.com/XueningXu
2025-10-31 03:52:20 +00:00
267d0197bf [dynamo] fix error_on_graph_break bug where non-empty checkpoint results in unwanted graph break resumption (#166586)
Fixes https://github.com/pytorch/pytorch/issues/166589

Pull Request resolved: https://github.com/pytorch/pytorch/pull/166586
Approved by: https://github.com/Lucaskabela
ghstack dependencies: #166476, #166477
2025-10-31 03:36:27 +00:00
1dec8a67a8 [dynamo, nested graph breaks] add disable_nested_graph_breaks decorator/context manager (#166477)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166477
Approved by: https://github.com/Lucaskabela, https://github.com/Skylion007
ghstack dependencies: #166476
2025-10-31 03:36:27 +00:00
797cd80b26 [dynamo, nested graph breaks] codegen dead nested cells correctly (#166476)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166476
Approved by: https://github.com/Lucaskabela
2025-10-31 03:36:27 +00:00
7d39401fa0 Revert "[BE][Typing][Dynamo] Type misc files in torch/_dynamo/variables/ (#166569)"
This reverts commit f1e4c42b6ef3d3cea08ab3babb693e3ce42cf08b.

Reverted https://github.com/pytorch/pytorch/pull/166569 on behalf of https://github.com/pytorch-auto-revert due to Reverted automatically by pytorch's autorevert, to avoid this behaviour add the tag autorevert: disable ([comment](https://github.com/pytorch/pytorch/pull/166569#issuecomment-3471180280))
2025-10-31 03:31:01 +00:00
e3ae0594d1 Add CUDA MXFP4 scaled mm support via. FBGEMM (#166526)
Summary:

* Pull in `f4f4bf16` from FBGemm to provide MXFP4 support for CUDA
* Add testing

Test Plan:

Reviewers:

Subscribers:

Tasks:

Tags:
Signed-off-by: Simon Layton <simonlayton@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166526
Approved by: https://github.com/drisspg, https://github.com/ngimel
2025-10-31 03:17:27 +00:00
f1e4c42b6e [BE][Typing][Dynamo] Type misc files in torch/_dynamo/variables/ (#166569)
Provides type coverage to ~3000 LOC and 200 methods in  `torch/_dynamo/variables/`

This is the first part of the final step to having 100% strict type coverage in dynamo - see previous comments in https://github.com/pytorch/pytorch/pull/166535 (combined into this one PR because ghstack was giving issues...)

### Coverage report:
```
mypy torch_dynamo/variables --linecount-report /tmp/coverage_log
```
Compare before to after - we go from 3826 to 7221 lines covered

Pull Request resolved: https://github.com/pytorch/pytorch/pull/166569
Approved by: https://github.com/williamwen42
2025-10-31 02:57:59 +00:00
d3e511f07c [Inductor] support masked vectorization for the tail_loop for fp8 datatype (#163324)
**Summary:**
Support masked vectorization for the tail_loop for fp8 datatype.

**Example:**
```
import torch

def fn(
    x,
    scale,
    zero_point,
    quant_min,
    quant_max,
    dtype,
):
    x = torch.ops.quantized_decomposed.dequantize_per_tensor(
        x,
        scale,
        zero_point,
        quant_min,
        quant_max,
        dtype,
    )
    x = torch.relu(x)
    x = torch.ops.quantized_decomposed.quantize_per_tensor(
        x, scale, zero_point, quant_min, quant_max, dtype
    )
    return x

quant_min = -128
quant_max = 127
dtype = torch.float8_e4m3fn
x = torch.clamp(torch.randn((1, 7, 7, 9), dtype=torch.float32) * 100, quant_min, quant_max).to(dtype)
zero_point = 100
scale = 0.01

with torch.no_grad():
    compiled_fn = torch.compile(fn)
    compiled_fn(x, scale, zero_point, quant_min, quant_max, dtype)
```

**Generated code:**

- Before
```
cpp_fused_dequantize_per_tensor_quantize_per_tensor_relu_0 = async_compile.cpp_pybinding(['const at::Float8_e4m3fn*', 'at::Float8_e4m3fn*'], r'''
#include <torch/csrc/inductor/cpp_prefix.h>
extern "C"  void  kernel(const at::Float8_e4m3fn* in_ptr0,
                       at::Float8_e4m3fn* out_ptr0)
{
    {
        for(int64_t x0=static_cast<int64_t>(0L); x0<static_cast<int64_t>(441L); x0+=static_cast<int64_t>(16L))
        {
            {
                if(C10_LIKELY(x0 >= static_cast<int64_t>(0) && x0 < static_cast<int64_t>(432L)))
                {
                    auto tmp0 = at::vec::Vectorized<at::Float8_e4m3fn>::loadu(in_ptr0 + static_cast<int64_t>(x0), static_cast<int64_t>(16));
                    auto tmp1 = at::vec::convert<float>(tmp0);
                    auto tmp2 = static_cast<float>(100.0);
                    auto tmp3 = at::vec::Vectorized<float>(tmp2);
                    auto tmp4 = tmp1 - tmp3;
                    auto tmp5 = static_cast<float>(0.01);
                    auto tmp6 = at::vec::Vectorized<float>(tmp5);
                    auto tmp7 = tmp4 * tmp6;
                    auto tmp8 = (tmp7);
                    auto tmp9 = at::vec::clamp_min(tmp8, decltype(tmp8)(0));
                    auto tmp10 = tmp9 * tmp3;
                    auto tmp11 = tmp10.round();
                    auto tmp12 = tmp11 + tmp3;
                    auto tmp13 = static_cast<float>(-128.0);
                    auto tmp14 = at::vec::Vectorized<float>(tmp13);
                    auto tmp15 = at::vec::maximum(tmp12, tmp14);
                    auto tmp16 = static_cast<float>(127.0);
                    auto tmp17 = at::vec::Vectorized<float>(tmp16);
                    auto tmp18 = at::vec::minimum(tmp15, tmp17);
                    auto tmp19 = at::vec::convert<at::Float8_e4m3fn>(tmp18);
                    tmp19.store(out_ptr0 + static_cast<int64_t>(x0), static_cast<int64_t>(16));
                }
                if(C10_UNLIKELY(x0 >= static_cast<int64_t>(432L) && x0 < static_cast<int64_t>(441L)))
                {
                    for (int64_t x0_tail = static_cast<int64_t>(432L);x0_tail < static_cast<int64_t>(441L); x0_tail++)
                    {
                        auto tmp0 = in_ptr0[static_cast<int64_t>(x0_tail)];
                        auto tmp1 = c10::convert<float>(tmp0);
                        auto tmp2 = static_cast<float>(100.0);
                        auto tmp3 = float(tmp1 - tmp2);
                        auto tmp4 = static_cast<float>(0.01);
                        auto tmp5 = float(tmp3 * tmp4);
                        auto tmp6 = c10::convert<float>(tmp5);
                        auto tmp7 = std::max(tmp6, decltype(tmp6)(0));
                        auto tmp8 = float(tmp7 * tmp2);
                        auto tmp9 = std::nearbyint(tmp8);
                        auto tmp10 = float(tmp9 + tmp2);
                        auto tmp11 = static_cast<float>(-128.0);
                        auto tmp12 = max_propagate_nan(tmp10, tmp11);
                        auto tmp13 = static_cast<float>(127.0);
                        auto tmp14 = min_propagate_nan(tmp12, tmp13);
                        auto tmp15 = c10::convert<at::Float8_e4m3fn>(tmp14);
                        out_ptr0[static_cast<int64_t>(x0_tail)] = tmp15;
                    }
                }
            }
        }
    }
}
''')

async_compile.wait(globals())
del async_compile

class Runner:
    def __init__(self, partitions):
        self.partitions = partitions

    def recursively_apply_fns(self, fns):
        new_callables = []
        for fn, c in zip(fns, self.partitions):
            new_callables.append(fn(c))
        self.partitions = new_callables

    def call(self, args):
        arg0_1, = args
        args.clear()
        assert_size_stride(arg0_1, (1, 7, 7, 9), (441, 63, 9, 1))
        buf0 = empty_strided_cpu((1, 7, 7, 9), (441, 63, 9, 1), torch.float8_e4m3fn)
        # [Provenance debug handles] cpp_fused_dequantize_per_tensor_quantize_per_tensor_relu_0:1
        cpp_fused_dequantize_per_tensor_quantize_per_tensor_relu_0(arg0_1, buf0)
        del arg0_1
        return (buf0, )
```
- After
```
cpp_fused_dequantize_per_tensor_quantize_per_tensor_relu_0 = async_compile.cpp_pybinding(['const at::Float8_e4m3fn*', 'at::Float8_e4m3fn*'], r'''
#include <torch/csrc/inductor/cpp_prefix.h>
extern "C"  void  kernel(const at::Float8_e4m3fn* in_ptr0,
                       at::Float8_e4m3fn* out_ptr0)
{
    {
        for(int64_t x0=static_cast<int64_t>(0L); x0<static_cast<int64_t>(441L); x0+=static_cast<int64_t>(16L))
        {
            {
                if(C10_LIKELY(x0 >= static_cast<int64_t>(0) && x0 < static_cast<int64_t>(432L)))
                {
                    auto tmp0 = at::vec::Vectorized<at::Float8_e4m3fn>::loadu(in_ptr0 + static_cast<int64_t>(x0), static_cast<int64_t>(16));
                    auto tmp1 = at::vec::convert<float>(tmp0);
                    auto tmp2 = static_cast<float>(100.0);
                    auto tmp3 = at::vec::Vectorized<float>(tmp2);
                    auto tmp4 = tmp1 - tmp3;
                    auto tmp5 = static_cast<float>(0.01);
                    auto tmp6 = at::vec::Vectorized<float>(tmp5);
                    auto tmp7 = tmp4 * tmp6;
                    auto tmp8 = (tmp7);
                    auto tmp9 = at::vec::clamp_min(tmp8, decltype(tmp8)(0));
                    auto tmp10 = tmp9 * tmp3;
                    auto tmp11 = tmp10.round();
                    auto tmp12 = tmp11 + tmp3;
                    auto tmp13 = static_cast<float>(-128.0);
                    auto tmp14 = at::vec::Vectorized<float>(tmp13);
                    auto tmp15 = at::vec::maximum(tmp12, tmp14);
                    auto tmp16 = static_cast<float>(127.0);
                    auto tmp17 = at::vec::Vectorized<float>(tmp16);
                    auto tmp18 = at::vec::minimum(tmp15, tmp17);
                    auto tmp19 = at::vec::convert<at::Float8_e4m3fn>(tmp18);
                    tmp19.store(out_ptr0 + static_cast<int64_t>(x0), static_cast<int64_t>(16));
                }
                if(C10_UNLIKELY(x0 >= static_cast<int64_t>(432L) && x0 < static_cast<int64_t>(441L)))
                {
                    auto tmp0 = at::vec::Vectorized<at::Float8_e4m3fn>::loadu(in_ptr0 + static_cast<int64_t>(x0), static_cast<int64_t>(9L));
                    auto tmp1 = at::vec::convert<float>(tmp0);
                    auto tmp2 = static_cast<float>(100.0);
                    auto tmp3 = at::vec::Vectorized<float>(tmp2);
                    auto tmp4 = tmp1 - tmp3;
                    auto tmp5 = static_cast<float>(0.01);
                    auto tmp6 = at::vec::Vectorized<float>(tmp5);
                    auto tmp7 = tmp4 * tmp6;
                    auto tmp8 = (tmp7);
                    auto tmp9 = at::vec::clamp_min(tmp8, decltype(tmp8)(0));
                    auto tmp10 = tmp9 * tmp3;
                    auto tmp11 = tmp10.round();
                    auto tmp12 = tmp11 + tmp3;
                    auto tmp13 = static_cast<float>(-128.0);
                    auto tmp14 = at::vec::Vectorized<float>(tmp13);
                    auto tmp15 = at::vec::maximum(tmp12, tmp14);
                    auto tmp16 = static_cast<float>(127.0);
                    auto tmp17 = at::vec::Vectorized<float>(tmp16);
                    auto tmp18 = at::vec::minimum(tmp15, tmp17);
                    auto tmp19 = at::vec::convert<at::Float8_e4m3fn>(tmp18);
                    tmp19.store(out_ptr0 + static_cast<int64_t>(x0), static_cast<int64_t>(9L));
                }
            }
        }
    }
}
''')

async_compile.wait(globals())
del async_compile

class Runner:
    def __init__(self, partitions):
        self.partitions = partitions

    def recursively_apply_fns(self, fns):
        new_callables = []
        for fn, c in zip(fns, self.partitions):
            new_callables.append(fn(c))
        self.partitions = new_callables

    def call(self, args):
        arg0_1, = args
        args.clear()
        assert_size_stride(arg0_1, (1, 7, 7, 9), (441, 63, 9, 1))
        buf0 = empty_strided_cpu((1, 7, 7, 9), (441, 63, 9, 1), torch.float8_e4m3fn)
        # [Provenance debug handles] cpp_fused_dequantize_per_tensor_quantize_per_tensor_relu_0:1
        cpp_fused_dequantize_per_tensor_quantize_per_tensor_relu_0(arg0_1, buf0)
        del arg0_1
        return (buf0, )
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163324
Approved by: https://github.com/Xia-Weiwen, https://github.com/mingfeima, https://github.com/jansel
2025-10-31 02:53:56 +00:00
d3be06cbdc [MTIAGraph][Pytorch][2/n] Add binding for Python to C++, and hook for Pytorch to Fbcode (#165963)
Summary:
This diff is the binding and hook layer for MTIA Graph, including
1. binding between Python and C++
2. hook between Pytorch and mtia fbcode
<img width="1780" height="754" alt="image" src="https://github.com/user-attachments/assets/31e24e5b-8324-42d8-8d3b-59536bc18340" />

[Doc](https://docs.google.com/document/d/1Q3xdZAIqhBvuy2HxGDfJyXVmxYXUEeYSZSwsp7bcJF8/edit?tab=t.osb46a42t6wb#heading=h.ayp9tkk08x00)

Test Plan: Will be tested in the Python implementation, which will use the binding and hook

Differential Revision: D84457757

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165963
Approved by: https://github.com/malfet, https://github.com/albanD
2025-10-31 02:52:51 +00:00
1129605415 [ROCm][CI] create ROCm 7.1 images for binary builds (#166665)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166665
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-10-31 02:52:37 +00:00
a6b1ef1717 [GraphPartition] cache get_free_symbol_uses (#166338)
Graph partition relies on `get_free_symbol_uses()` to collect symbol inputs.
ee7434be82/torch/_inductor/scheduler.py (L4869-L4885)

I empirically observed that `get_free_symbol_uses()` becomes slower for larger graphs. Specifically, I tried an aten fallback for torchtitan, which results in 10k+ aten nodes. When processing the 600th node, a single node's `get_free_symbol_uses()` takes seconds.

Why? Because `get_free_symbol_uses()` may recursively call `get_free_symbol_uses()` on other nodes, so the same work can be repeated many times.
ee7434be82/torch/_inductor/ir.py (L4541-L4543)

This PR fixes the issue by caching the results of `get_free_symbol_uses()`. I validated on torchtitan that the issue is fixed.
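The caching idea can be sketched in a few lines (names are illustrative, not Inductor's actual classes): memoize the recursive free-symbol walk so each node is computed once, instead of once per transitive caller.

```python
from functools import lru_cache

# Hypothetical sketch of the fix: cache the recursive free-symbol traversal.
class Node:
    def __init__(self, syms, children=()):
        self.syms = frozenset(syms)
        self.children = tuple(children)

@lru_cache(maxsize=None)  # nodes are hashed by identity, so this is safe
def get_free_symbol_uses(node):
    out = set(node.syms)
    for child in node.children:
        out |= get_free_symbol_uses(child)  # cached after the first visit
    return frozenset(out)

# A deep chain of nodes now costs O(n) total work instead of O(n^2).
chain = Node({"s0"})
for i in range(300):
    chain = Node({f"s{i % 3}"}, (chain,))
print(sorted(get_free_symbol_uses(chain)))  # ['s0', 's1', 's2']
```

The real fix has to be more careful about cache invalidation when the graph mutates, but the shape of the speedup is the same.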

Pull Request resolved: https://github.com/pytorch/pytorch/pull/166338
Approved by: https://github.com/eellison
2025-10-31 02:50:10 +00:00
12577064dd [MPS] Fix crash when max/min ops called for complex types (#166214)
Raise an exception, as it's meaningless and results in segfault otherwise:
```
% python -c "import torch;torch.rand(10, dtype=torch.cfloat, device='mps').amax()"
(mpsFileLoc): /AppleInternal/Library/BuildRoots/4~B6shugDBannYeMBGCfhw7wjvNJOfy4BrawZ7TdI/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShadersGraph/mpsgraph/MetalPerformanceShadersGraph/Core/Files/MPSGraphUtilities.mm:176:0: error: 'mps.reduction_max' op operand #0 must be tensor of mps native type values, but got 'tensor<10xcomplex<f32>>'
(mpsFileLoc): /AppleInternal/Library/BuildRoots/4~B6shugDBannYeMBGCfhw7wjvNJOfy4BrawZ7TdI/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShadersGraph/mpsgraph/MetalPerformanceShadersGraph/Core/Files/MPSGraphUtilities.mm:176:0: note: see current operation: %2 = "mps.reduction_max"(%arg0, %1) <{keep_dims, propagate_nans}> : (tensor<10xcomplex<f32>>, tensor<1xsi32>) -> tensor<1xcomplex<f32>>
(mpsFileLoc): /AppleInternal/Library/BuildRoots/4~B6shugDBannYeMBGCfhw7wjvNJOfy4BrawZ7TdI/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShadersGraph/mpsgraph/MetalPerformanceShadersGraph/Core/Files/MPSGraphUtilities.mm:176:0: error: 'mps.reduction_max' op operand #0 must be tensor of mps native type values, but got 'tensor<10xcomplex<f32>>'
(mpsFileLoc): /AppleInternal/Library/BuildRoots/4~B6shugDBannYeMBGCfhw7wjvNJOfy4BrawZ7TdI/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShadersGraph/mpsgraph/MetalPerformanceShadersGraph/Core/Files/MPSGraphUtilities.mm:176:0: note: see current operation: %2 = "mps.reduction_max"(%arg0, %1) <{keep_dims, propagate_nans}> : (tensor<10xcomplex<f32>>, tensor<1xsi32>) -> tensor<1xcomplex<f32>>
/AppleInternal/Library/BuildRoots/4~B6shugDBannYeMBGCfhw7wjvNJOfy4BrawZ7TdI/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShadersGraph/mpsgraph/MetalPerformanceShadersGraph/Core/Files/MPSGraphExecutable.mm:1347: failed assertion `original module failed verification'
zsh: abort      python -c
```

To be tested by `test_ops.py`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166214
Approved by: https://github.com/dcci, https://github.com/kulinseth, https://github.com/Skylion007
ghstack dependencies: #166272
2025-10-31 02:37:20 +00:00
24b6eb7727 [Inductor] Enable Custom op Autotune Decompositions and Parameter Tuning (#164212)
This PR introduces CustomOp autotuning. It allows users to provide a CustomOpConfig:
(1) to register (optionally) multiple decomposition implementations for custom operations and
(2) to register parameter tuning knobs and the values they want to tune for the decompositions
so that Inductor automatically selects the best-performing variant through its autotune benchmarking.

Example:
```python
 register_custom_op_autotuning(
            custom_op=my_attention_op,
            configs=[
                CustomOpConfig(attention_impl, head_dim=32, method='chunked'),
                CustomOpConfig(attention_impl, head_dim=64, method='tiled'),
                CustomOpConfig(head_dim=128), # no decompositions
            ],
            input_gen_fns={
                "query": lambda fake: torch.randn_like(fake, device='cuda'),
                "key": lambda fake: torch.randn_like(fake, device='cuda'),
                "value": lambda fake: torch.randn_like(fake, device='cuda'),
            }
    )
```

**CustomOpConfig**: Each CustomOpConfig defines exactly one autotuning variant with specific parameter values and an optional decomposition implementation in PyTorch aten ops. Users can register their own tuning knobs and optional decomposition functions for the same custom operation. The system automatically benchmarks all variants to select the best-performing one. If no decomposition is provided in the config, the CustomOp's default implementation will be used.

**Custom Input Generation**: Users can provide custom input generators via an optional `input_gen_fns` to control how synthetic inputs are created during benchmarking. This enables more realistic performance testing by generating inputs that match expected data distributions and characteristics for each tensor argument.

**More Examples with autotune logs:**:
1. Allow user to register customOp decompositions with tuning parameters for autotuning. Example usage:
```python
from torch._inductor.kernel.custom_op import CustomOpConfig, register_custom_op_autotuning

def decompose_k_implementation(a: torch.Tensor, b: torch.Tensor, k_splits: int = 4) -> torch.Tensor:
    """Matrix multiply with k-way decomposition."""
    # Implementation... with k_splits

@torch.library.custom_op("my_lib::decompose_k", mutates_args=())
def test_decompose_k_op(
        a: torch.Tensor, b: torch.Tensor, k_splits: int
    ) -> torch.Tensor:
        return decompose_k_implementation(a, b, k_splits)

# Register autotuning with different k_splits values
register_custom_op_autotuning(
    custom_op=test_decompose_k_op,
    configs=[
        CustomOpConfig(decompose_k_implementation, k_splits=2),
        CustomOpConfig(decompose_k_implementation, k_splits=32),
        CustomOpConfig(decompose_k_implementation, k_splits=64),
        CustomOpConfig(k_splits=128), # can make decomposition optional, then use default impl test_decompose_k_op
        CustomOpConfig(k_splits=256)
    ],
    input_gen_fns={
        "a": lambda fake: torch.randn_like(fake, device='cuda') * 0.1,
        "b": lambda fake: torch.randn_like(fake, device='cuda') * 0.1,
    }
)
```

Example result:
```
{"num_choices": 6, "num_triton_choices": 0, "best_kernel": "test_decompose_k_autotuned_fallback_default", "best_time": 0.09980800002813339}
AUTOTUNE test_decompose_k_autotuned(256x65536, 65536x1024)
strides: [65536, 1], [1024, 1]
dtypes: torch.float16, torch.float16
  test_decompose_k_autotuned_fallback_default 0.0998 ms 100.0%
  test_decompose_k_autotuned_decompose_k_implementation_k_splits_2_0 0.1096 ms 91.0% CustomOp decompose_k_implementation_k_splits_2
  test_decompose_k_autotuned_decompose_k_implementation_k_splits_32_1 0.1277 ms 78.2% CustomOp decompose_k_implementation_k_splits_32
  test_decompose_k_autotuned_decompose_k_implementation_k_splits_64_2 0.1454 ms 68.6% CustomOp decompose_k_implementation_k_splits_64
  test_decompose_k_autotuned_decompose_k_implementation_k_splits_128_3 0.1536 ms 65.0% CustomOp decompose_k_implementation_k_splits_128
  test_decompose_k_autotuned_decompose_k_implementation_k_splits_256_4 0.2084 ms 47.9% CustomOp decompose_k_implementation_k_splits_256
```

2. Allow user to tune parameter knob by passing the parameter and values in the CustomOpConfig.
**Example**
```python
def mlp_variants(input_tensor, gate_weight, up_weight, down_weight, method):
    """MLP implementation with different computational approaches."""
    if method == 0:
        # Standard separate matmuls
        # ... implementation
    elif method == 1:
        # Batched approach with torch.mm
        # ... implementation
    elif method == 2:
        # Fused weights approach
        # ... implementation

@torch.library.custom_op("my_lib::mlp_op", mutates_args=())
def mlp_op(
    input_tensor: torch.Tensor,
    gate_weight: torch.Tensor,
    up_weight: torch.Tensor,
    down_weight: torch.Tensor,
    method: int,
) -> torch.Tensor:
    return mlp_variants(
        input_tensor, gate_weight, up_weight, down_weight, method=method
    )

register_custom_op_autotuning(
    custom_op=mlp_op,
    configs=[
        CustomOpConfig(method=0),
        CustomOpConfig(method=1),
        CustomOpConfig(method=2),
        # method=0 is the default fallback in the original op
    ],
    input_gen_fns={
        "input_tensor": lambda fake: torch.randn_like(fake, device='cuda') * 0.1,
        "gate_weight": lambda fake: torch.randn_like(fake, device='cuda') * 0.05,
        # ... other input generators
    }
)

```

Example result:
```
AUTOTUNE test_mlp_autotuned(4x32x512, 512x1024, 512x1024, 1024x256)
  test_mlp_autotuned_mlp_variants_method_2 0.0181 ms 100.0% CustomOp mlp_variants_method_2
  test_mlp_autotuned_mlp_variants_method_1 0.0185 ms 97.8% CustomOp mlp_variants_method_1
  test_mlp_autotuned_mlp_default_fallback_method_0 0.0198 ms 91.4% CustomOp fallback
```

### Test Suite (`test/inductor/test_custom_op_autotune.py`)

*   **RMSNorm autotuning**: Tests different RMSNorm implementations with dynamic input shapes
*   **MLP autotuning**: Tests different MLP decompositions and tuning the "method" parameter
*   **DecomposeK**: Tests different k_splits values for matrix multiplication decomposition with k dim split
*   **Multi-parameter tuning**: Tests configs with multiple tuning parameters (scale_mode, chunk_size)

### Next Step:
- Enable Max-autotune with user passed in max-autotune config. https://github.com/pytorch/pytorch/pull/165526/files
- Support inline epilogue fusion for selected best customop decomposition with surrounding elementwise ops. https://github.com/pytorch/pytorch/pull/165952/files
- Support customop autotune considering fusion with multiTemplateBuffer. WIP

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164212
Approved by: https://github.com/zou3519
2025-10-31 02:28:00 +00:00
32066772b3 Fix torch.full with dynamic tensor fill_value in torch.compile (#166554)
Fixes #166253

## Summary
When `torch.full` is called with a 0-D tensor as `fill_value` inside a `torch.compile`'d function, the value was being incorrectly cached, causing subsequent calls with different values to return the first value.

## Root Cause
The Dynamo handler for `torch.full` was calling `aten._local_scalar_dense` to convert tensor fill_values to Python scalars at compile time, which baked the value into the compiled graph as a constant.

## Solution
Modified the Dynamo handler to decompose `torch.full(size, tensor_fill_value)` into `empty(size).fill_(tensor_fill_value)` when `fill_value` is a `TensorVariable`, keeping the fill value dynamic in the compiled graph.

## Testing
Added test case that verifies torch.full works correctly with dynamic tensor fill_values across multiple calls and dtypes.
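In eager terms, the equivalence the new handler relies on can be sketched as follows (a simplified illustration, not the actual Dynamo code):

```python
import torch

fill = torch.tensor(3.5)

# Old path (sketch): the scalar is extracted at compile time and baked in.
baked = torch.full((2, 3), fill.item())

# New path (sketch): the fill value stays a tensor, so it remains a dynamic
# input to the compiled graph instead of a cached constant.
dynamic = torch.empty(2, 3).fill_(fill)

assert torch.equal(baked, dynamic)
```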

Pull Request resolved: https://github.com/pytorch/pytorch/pull/166554
Approved by: https://github.com/Lucaskabela
2025-10-31 00:56:02 +00:00
47f0024310 [CI][BE] Factor out repeated test code (#166481)
Into `_run_single_arg_fwd`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/166481
Approved by: https://github.com/Skylion007
2025-10-31 00:52:50 +00:00
98d640bb11 Remove AT_USE_HIPSPARSE_GENERIC_API (#166393)
This macro is not used in OSS anymore.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166393
Approved by: https://github.com/ezyang
2025-10-31 00:49:09 +00:00
5d288bc3f7 [BE] Move GreenContext implementation details to cpp (#166462)
- Remove all complex defines logic from the header
- Make the GreenContext constructor private, as it should only be created via the static method as a singleton
- Delete unused `getContext` and `getGreenContext` methods
- Rename `CUDA_HAS_GREEN_CONTEXT` to `HAS_CUDA_GREEN_CONTEXT()`, which results in compilation error if one accidentally makes a typo
- Suppress `-Wunused-private-field` if GreenContext is not available
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166462
Approved by: https://github.com/ngimel, https://github.com/eqy
2025-10-31 00:48:01 +00:00
bfb47ec50e [dynamo] support tracing new typing union syntax X | Y (#166599)
To do in a follow-up: I think there's an approach to reconstructing typing variables.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/166599
Approved by: https://github.com/SherlockNoMad, https://github.com/anijain2305, https://github.com/Skylion007
2025-10-30 23:59:27 +00:00
7a0cd8ed09 [ROCm] Disable __builtin_amdgcn_rcpf for gfx90a (#166454)
Improves accuracy for some failing tests.

test/distributed/_composable/fsdp/test_fully_shard_clip_grad_norm_.py::TestClipGradNormWorldSize4::test_clip_grad_norm_2d [GH job link](https://github.com/pytorch/pytorch/actions/runs/18930221123/job/54046876467) [HUD commit link](f20bf77874)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/166454
Approved by: https://github.com/jerrymannil, https://github.com/jeffdaily
2025-10-30 23:39:00 +00:00
984e64b2cd [inductor] Fix constant folder (#166655)
Fixes https://fb.workplace.com/groups/1028545332188949/permalink/1351999569843522/ where the resulting graph from the constant folder uses a sym node that is only created later in the graph. Graph diff: https://www.internalfb.com/intern/diffing/?paste_number=2014609054

Before:
```
    %full_65 : [num_users=1] = call_function[target=torch.ops.aten.full.default](args = ([%sym_size_int_47, 768], 1), kwargs = {dtype: torch.int64, layout: torch.strided, device: cuda:0, pin_memory: False})
    %select_18 : [num_users=1] = call_function[target=torch.ops.aten.select.int](args = (%full_65, 1, 0), kwargs = {})
    %mul_2792 : [num_users=1] = call_function[target=torch.ops.aten.mul.Tensor](args = (%select_18, 0), kwargs = {})
    %embedding_4 : [num_users=1] = call_function[target=torch.ops.aten.embedding.default](args = (%_uv__surface_embeddings_weight, %mul_2792), kwargs = {})
```

After:
```
    %full_65 : [num_users=1] = call_function[target=torch.ops.aten.full.default](args = ([%sym_size_int_47, 768], 1), kwargs = {dtype: torch.int64, layout: torch.strided, device: cuda:0, pin_memory: False})
    %full_default_1 : [num_users=1] = call_function[target=torch.ops.aten.full.default](args = ([%sym_size_int_150], 0), kwargs = {dtype: torch.int64, layout: torch.strided, device: cuda:0, pin_memory: False})
    %embedding_4 : [num_users=1] = call_function[target=torch.ops.aten.embedding.default](args = (%_uv__surface_embeddings_weight, %full_default_1), kwargs = {})
    ...
    %sym_size_int_150 : [num_users=7] = call_function[target=torch.ops.aten.sym_size.int](args = (%view_193, 0), kwargs = {})
```

I couldn't figure out a small repro for this :/

Pull Request resolved: https://github.com/pytorch/pytorch/pull/166655
Approved by: https://github.com/eellison
2025-10-30 22:51:28 +00:00
b9bcb37f40 [DebugMode] store stringify args by default (#166347)
DebugMode currently stores dispatch call args & kwargs, which includes all intermediate tensors and more. This quickly OOMed on GPU when trying to debug some torchtitan / llama 8b models.

This PR defaults to storing the stringified version, and adds a flag `DebugMode(store_original_args=True)` for users who want to store the original args as-is (and for BC).
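The design choice can be sketched in a few lines (hypothetical names, not DebugMode's real internals): storing `repr(arg)` drops the reference to the underlying tensor, so large intermediates can be freed instead of accumulating on the GPU.

```python
class CallLog:
    """Sketch: log dispatch args by string (default) or by reference (opt-in)."""

    def __init__(self, store_original_args=False):
        self.store_original_args = store_original_args
        self.records = []

    def record(self, op_name, args):
        if self.store_original_args:
            self.records.append((op_name, args))        # keeps the objects alive
        else:
            self.records.append((op_name, repr(args)))  # only the string survives

log = CallLog()
log.record("aten.mm", ([1.0, 2.0], [3.0, 4.0]))
print(log.records[0])  # ('aten.mm', '([1.0, 2.0], [3.0, 4.0])')
```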

Pull Request resolved: https://github.com/pytorch/pytorch/pull/166347
Approved by: https://github.com/yushangdi
2025-10-30 22:12:23 +00:00
7e3b9d105e [CP][BE][2/2] Refactor the code structure (#166501)
Our CP codebase now contains several files and we are adding more. This
PR refactors the code to consolidate the files into a context_parallel
folder but keeps the existing imports so that current users of CP won't be
affected.

Unfortunately, we have to split this PR into two PRs as the PyTorch
infra cannot accept a PR with 3000+ LoC change and git cannot recognize
that _context_parallel/_attention.py is moved from _attention.py because
we want to keep BC.

This is the second PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/166501
Approved by: https://github.com/Skylion007
ghstack dependencies: #166456
2025-10-30 22:07:07 +00:00
45c3f02d69 [ROCm] moved gfx1100 back to experimental status for AOTriton (#166397)
According to the next commit to AOTriton:
8625c4faee

These changes were missed in the 0.11b release:
https://github.com/pytorch/pytorch/pull/161754

Pull Request resolved: https://github.com/pytorch/pytorch/pull/166397
Approved by: https://github.com/jeffdaily
2025-10-30 21:43:01 +00:00
f5543e3741 [wip] fix searchsorted non dense (#165064)
Fix for https://github.com/pytorch/pytorch/issues/163528

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165064
Approved by: https://github.com/benjaminglass1, https://github.com/mlazos
2025-10-30 21:21:24 +00:00
5fc2c7a2a1 [ROCm][inductor] More configs for pointwise kernels. (#166470)
This config improves performance by 250% on some kernels that contain `t1.atomic_add(...)`. Again, we conditionalize for ROCm/HIP, so there is no impact on NV.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/166470
Approved by: https://github.com/PaulZhang12, https://github.com/mlazos, https://github.com/eellison, https://github.com/jansel
2025-10-30 21:20:12 +00:00
7692fa09cd [Code Clean] Clean asserts in torch/ao/quantization/fx/* (#165420)
Replace assert statements with explicit if/raise patterns in:

- torch/ao/quantization/fx/* (177 errors)

Partially fixes #164878
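The transformation is mechanical but worth spelling out, since `assert` statements are stripped when Python runs with `-O`; a representative before/after (hypothetical example, not a specific line from the PR):

```python
# Before: silently skipped under `python -O`.
#     assert qconfig is not None, "qconfig must be set"

# After: the check always runs, regardless of optimization flags.
def require_qconfig(qconfig):
    if qconfig is None:
        raise AssertionError("qconfig must be set")
    return qconfig
```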

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165420
Approved by: https://github.com/RohitRathore1, https://github.com/fffrog, https://github.com/albanD
2025-10-30 20:53:36 +00:00
df71b70727 [cuDNN][conv] Re-enable cuDNN for 3D convolutions (fixed in 9.15+) (#166480)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166480
Approved by: https://github.com/Skylion007, https://github.com/malfet
2025-10-30 20:47:20 +00:00
80ba6e458f Add warning when users have incomplete setup for type checking (#166603)
Looking for feedback on this approach.
We received user reports of spurious pyrefly errors from users using hg instead of git. This is likely because, when using a venv and git, `make setup-env` installs requirements and pulls a nightly torch wheel, which pyrefly needs in order to type-check properly.

Initial documentation for `make setup-env` I found here: https://github.com/pytorch/pytorch/blob/main/CONTRIBUTING.md#developing-pytorch

Testing:
```
hg clone --git ssh://git@github.com/pytorch/pytorch.git
conda create -n pytorch_env python=3.10 # (or manually create venv instead of using script)
cd pytorch
pip install -r requirements.txt
pip install -r requirements-build.txt
lintrunner init
# check how many pyrefly errors - 15,709 errors (11,693 ignored)
lintrunner # confirm error message / warning appears
>>> General linter failure:
  Warning (PYREFLY) nightly-wheel-not-run
    pytorch-nightly.pth not found. You may need to run make setup-env or make
    setup-env-conda to install nightly binaries and type stubs.
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166603
Approved by: https://github.com/aorenste
2025-10-30 20:37:44 +00:00
0d50e5d8d4 [3/N] Fix unused loop variables (#166509)
This PR removes unused loop variables in tests.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/166509
Approved by: https://github.com/Lucaskabela, https://github.com/Skylion007
2025-10-30 20:13:51 +00:00
99b05d1b78 Better 1x128, 128x128 error handling on non-Hopper (#166639)
Summary:

Blockwise 1x128 and 128x128 scaling is only available on CUDA >= 12.9
and only on Hopper GPUs. Attempting to run on B200 would give a
hard-to-debug `CUBLAS_STATUS_NOT_SUPPORTED`.

Add a more helpful `NotImplementedError` to catch this case.

Also more explicitly disable ROCm builds for relevant methods, based on
lack of support per [hipBLASlt
docs](https://rocm.docs.amd.com/projects/hipBLASLt/en/latest/reference/datatypes.html#_CPPv4N28hipblasLtMatmulMatrixScale_t40HIPBLASLT_MATMUL_MATRIX_SCALE_VEC128_32FE).
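A pure-Python sketch of the kind of guard described (function and argument names are hypothetical; the version and architecture requirements are taken from the description above):

```python
def check_blockwise_scaling(cuda_version, sm_major, is_rocm=False):
    """Fail fast with a clear error instead of CUBLAS_STATUS_NOT_SUPPORTED."""
    if is_rocm:
        raise NotImplementedError(
            "1x128/128x128 blockwise scaling is not supported on ROCm"
        )
    if cuda_version < (12, 9):
        raise NotImplementedError(
            f"1x128/128x128 blockwise scaling requires CUDA >= 12.9, got {cuda_version}"
        )
    if sm_major != 9:  # Hopper is sm_90
        raise NotImplementedError(
            "1x128/128x128 blockwise scaling is only supported on Hopper (sm_90); "
            f"got a device with compute capability major version {sm_major}"
        )

check_blockwise_scaling((12, 9), 9)  # Hopper + CUDA 12.9: passes silently
```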

Signed-off-by: Simon Layton <simonlayton@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166639
Approved by: https://github.com/drisspg
2025-10-30 20:13:06 +00:00
f911d64750 [CUDA] xFail max-autotune grouped gemm tests on devices with insufficient SM count (#165921)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165921
Approved by: https://github.com/ngimel
2025-10-30 20:05:07 +00:00
52db60170d Enable verify_dynamo on Python 3.13 (#166497)
Dynamo now supports Python 3.13.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/166497
Approved by: https://github.com/Lucaskabela, https://github.com/williamwen42
2025-10-30 19:52:32 +00:00
56838bad5f [CP][BE][1/2] Refactor the code structure (#166456)
Our CP codebase now contains several files and we are adding more. This PR refactors the code to consolidate the files into a context_parallel folder but keeps the existing imports so that current users of CP won't be affected.

Unfortunately, we have to split this PR into two PRs as the PyTorch infra cannot accept a PR with 3000+ LoC change and git cannot recognize that _context_parallel/_attention.py is moved from _attention.py because we want to keep BC.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/166456
Approved by: https://github.com/Skylion007
2025-10-30 19:46:49 +00:00
ad3a56ab98 Add a compile-time flag to trigger verbose logging for device-side asserts (#166171)
Summary:
Using `CUDA_KERNEL_ASSERT_PRINTF` inside kernels allows us to log invalid values to the console (which can in turn be used to surface _hopefully_ clearer error messages).

This does have an impact on the number of registers needed for the values being logged (I confirmed via diffing PTX that there is no other impact relative to using `__assert_fail`)

To avoid causing perf bottlenecks, this change adds a compile-time switch to enable more verbose errors in some of the common kernels that cause DSAs. There is also a Buck config that can be used to configure this switch more conveniently.

## Alternatives considered
I considered making the behavior of `CUDA_KERNEL_ASSERT_PRINTF` controllable via a compile-time macro instead of writing another wrapper for it but there are kernels where the extra register pressure is not as severe and in those cases, having more useful error messages by default is pretty useful.
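The trade-off reads roughly like this as a Python sketch (the real switch is a compile-time macro, so the terse build pays no formatting cost at all; names here are illustrative):

```python
VERBOSE_DEVICE_ASSERTS = True  # stand-in for the Buck/compile-time switch

def kernel_assert(cond, cond_str, fmt=None, *values):
    """Sketch: verbose builds log the offending values; default builds stay terse."""
    if cond:
        return
    msg = f"Assertion failed: `{cond_str}`"
    if VERBOSE_DEVICE_ASSERTS and fmt is not None:
        # The extra format arguments are what cost registers on the GPU.
        msg += ": " + (fmt % values)
    raise AssertionError(msg)

idx_dim, index_size = 9999, 128
try:
    kernel_assert(
        0 <= idx_dim < index_size,
        "idx_dim >= 0 && idx_dim < index_size",
        "Expected 0 <= idx_dim < index_size (%d), but got idx_dim = %d",
        index_size, idx_dim,
    )
except AssertionError as e:
    print(e)
```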

Test Plan:
## Simple Python Driver:
```
# scatter_errors.py
import torch
def main() -> None:
    a = torch.rand(128, device="cuda:0")
    idx = torch.randint(0, 128, (100,), device="cuda:0")
    idx[0] = 9999
    b = torch.scatter(a, 0, idx, 555.0)
    print(b)
```

When running normally via:
```
$ buck2 run @//mode/opt  :scatter_errors
```
we see the following DSA message:
```
fbcode/caffe2/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:410: operator(): block: [0,0,0], thread: [0,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
```

Running via:
```
$  buck2 run @//mode/opt -c fbcode.c10_enable_verbose_assert=1 :scatter_errors
```
however produces:
```
[CUDA_KERNEL_ASSERT] fbcode/caffe2/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:410: operator(): block: [0,0,0], thread: [0,0,0]: Assertion failed: `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"`: Expected 0 <= idx_dim < index_size (128), but got idx_dim = 9999
```

Differential Revision: D85185987

Pull Request resolved: https://github.com/pytorch/pytorch/pull/166171
Approved by: https://github.com/ngimel
2025-10-30 19:43:46 +00:00
a7fd0b4001 [ROCm][CI] fix disk space message (#166645)
Fixes the disk-space check so its error message reports the free-space threshold the machine actually failed to meet, computed as difference = 100 - diskspace_cutoff_int.
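
The patch's arithmetic can be sketched in isolation (the cutoff value here is illustrative, not the CI default):

```shell
diskspace_cutoff=90                              # maximum allowed used-space percentage
diskspace_cutoff_int=$((diskspace_cutoff + 0))   # force integer arithmetic
difference=$((100 - diskspace_cutoff_int))       # free-space percentage implied by the cutoff
echo "Error: Available diskspace is less than $difference percent. Not enough diskspace."
```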

Pull Request resolved: https://github.com/pytorch/pytorch/pull/166645
Approved by: https://github.com/jeffdaily
2025-10-30 19:38:34 +00:00
181ee3bd42 fix: Add missing signals_to_handle to launcher logging (#166631)
Fixes #166630

Pull Request resolved: https://github.com/pytorch/pytorch/pull/166631
Approved by: https://github.com/Skylion007

Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com>
2025-10-30 19:31:25 +00:00
0ec0549823 Introduce a new API torch.xpu.get_per_process_memory_fraction (#165511)
# Motivation
Aligned with other backends, this PR introduces a new API torch.xpu.get_per_process_memory_fraction to allow users to retrieve the allowed memory fraction for a single process.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165511
Approved by: https://github.com/EikanWang, https://github.com/ezyang
ghstack dependencies: #165508, #165509, #165510
2025-10-30 19:30:09 +00:00
8221ee6db9 [xpu] Fix type annotation for ProcessGroupXCCL (#166418)
After #163049, this PR fixes the type annotations to match the actual implementation for ProcessGroupXCCL::Options.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166418
Approved by: https://github.com/guangyey, https://github.com/ezyang
2025-10-30 19:29:06 +00:00
b939de26d1 Avoid writing temporary modules to disk (#157713)
In some cases the warning from #147744 still gets emitted because [atexit hooks aren't called](https://github.com/python/cpython/pull/114279).

Even in those cases, if the atexit hooks _were_ called you could end up with issues due to the directory being deleted in one process, but still being used elsewhere.

It's better all round to load these modules entirely in-memory.
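
A minimal stdlib-only sketch of the in-memory approach (illustrative names, not the PR's actual helper):

```python
import sys
import types

# Illustrative sketch: execute module source straight from memory
# instead of writing a temporary .py file to disk (so there is no
# temp directory to clean up at exit, and nothing to race on).
source = "def answer():\n    return 42\n"

mod = types.ModuleType("in_memory_mod")
exec(compile(source, "<in-memory>", "exec"), mod.__dict__)
sys.modules["in_memory_mod"] = mod  # now importable like any other module

print(mod.answer())
```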

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157713
Approved by: https://github.com/xush6528
2025-10-30 19:11:16 +00:00
694db5f549 Use 'is' in callable comparisons (#166624)
Just like we use `is/is not` for class comparisons, it is generally advised to use `is/is not` for comparisons against torch functions.
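
The pattern, illustrated with plain stdlib functions standing in for torch callables (these are not the PR's actual call sites):

```python
import operator

def describe(fn):
    # Identity comparison: cheap, unambiguous, and never invokes a
    # custom __eq__ the way `fn == operator.add` could.
    if fn is operator.add:
        return "add"
    if fn is operator.mul:
        return "mul"
    return "unknown"

print(describe(operator.add))
```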

Pull Request resolved: https://github.com/pytorch/pytorch/pull/166624
Approved by: https://github.com/Lucaskabela, https://github.com/Skylion007
2025-10-30 19:00:09 +00:00
639a0b1239 Remove torch.distributed.tensor.OpSchema.has_symints (#163667)
It appears to be unused based on `cd torch; rg has_symints`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163667
Approved by: https://github.com/xmfan, https://github.com/azahed98, https://github.com/albanD
ghstack dependencies: #162990
2025-10-30 18:57:17 +00:00
398775a43e [CodeClean] Replace std::runtime_error with TORCH_CHECK (#165119)
As the title stated.

**Changes**:
- torch/csrc/inductor(Part 2)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165119
Approved by: https://github.com/janeyx99
ghstack dependencies: #165139
2025-10-30 18:43:58 +00:00
fcd5f8c352 [CodeClean] Remove the Unused MACRO for AOT Inductor Runtime (#165139)
As the title stated.

- AOTI_TORCH_CHECK depends on TORCH_CHECK_MSG, which is located in c10/util/Exception.h and may break BC
- AOTI_TORCH_CHECK is not used anywhere
- STD_TORCH_CHECK has ABI check tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165139
Approved by: https://github.com/Skylion007, https://github.com/janeyx99
2025-10-30 18:43:58 +00:00
4acc66f119 Make PT2 compile backprop through custom op without autograd key a hard error (#166367)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/166367
Approved by: https://github.com/bdhirsh
2025-10-30 18:43:07 +00:00
8f40a0c634 Revert "address DDE in matmul decomp (#166541)"
This reverts commit 90519402c2006237f891289a0afdec804515aa73.

Reverted https://github.com/pytorch/pytorch/pull/166541 on behalf of https://github.com/atalman due to breaks internal test ([comment](https://github.com/pytorch/pytorch/pull/166541#issuecomment-3469382334))
2025-10-30 18:11:33 +00:00
a5c3c08d10 [Pytorch] Use exp_u20 for aarch64's erf (#166594)
Summary:
After a precision study, we concluded it is OK to use ACL's exp function in f32's erf(). This way we can keep erf inline.

Benchmarks show about 91% higher throughput when processing a tensor of 1M elements, compiling with clang-19:

Before:
f32 erf: 2539.179us
After:
f32 erf: 1329.063us

Test Plan:
Correctness:

buck2 test mode/opt //caffe2/test:test_ops
buck2 test mode/opt //caffe2/test:torch

Performance:

buck2 run mode/opt //caffe2/benchmarks/operator_benchmark/fb:operator_benchmark_test

Differential Revision: D85730452

Pull Request resolved: https://github.com/pytorch/pytorch/pull/166594
Approved by: https://github.com/mcfi, https://github.com/fadara01
2025-10-30 18:09:05 +00:00
a553ea9ea4 Fix missing symbol when printing guards (#165723)
Fixes #165177

When converting guards to sources, if we are unable to get the expected symbol from symbol_to_source, we now try to get it from var_to_sources.

I was unable to make a simpler repro than what was described in the issue (which relies on llama3 - so inappropriate for a unit test).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165723
Approved by: https://github.com/bobrenjc93
2025-10-30 18:03:51 +00:00
ba71e9ca9a [DeviceMesh] Isolate pg creation logic in Device Mesh into a separate func _init_one_process_group (#166614)
To make the pg cache change easier and to modularize the code, we isolate the logic of process group creation into a separate function named `_init_one_process_group`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/166614
Approved by: https://github.com/lw
2025-10-30 17:57:41 +00:00
694d205143 Revert "shrink_group implementation to expose ncclCommShrink API (#164518)"
This reverts commit 311ea0dec0c50f395e6dac7b3875e81ee243fceb.

Reverted https://github.com/pytorch/pytorch/pull/164518 on behalf of https://github.com/atalman due to breaks internal builds Error: from logging_utils import ( ModuleNotFoundError: No module named 'logging_utils' ([comment](https://github.com/pytorch/pytorch/pull/164518#issuecomment-3469308568))
2025-10-30 17:52:29 +00:00
629293f568 bucket all reduce (#166528)
Bucket all-reduce in the bucketer, building on @IvanKobzarev's earlier PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/166528
Approved by: https://github.com/IvanKobzarev
ghstack dependencies: #166527
2025-10-30 17:12:34 +00:00
53dc8a0875 Update
[ghstack-poisoned]
2025-10-10 17:00:28 +00:00
9e119dd8c4 Update
[ghstack-poisoned]
2025-09-29 17:16:55 +00:00
93a6e99edc Update (base update)
[ghstack-poisoned]
2025-09-26 07:34:19 +00:00
d0892c7792 Update
[ghstack-poisoned]
2025-09-26 07:34:19 +00:00
521 changed files with 13259 additions and 8409 deletions

View File

@ -195,13 +195,16 @@ case "$tag" in
NINJA_VERSION=1.9.0
TRITON=yes
;;
pytorch-linux-jammy-xpu-n-py3)
pytorch-linux-jammy-xpu-n-py3 | pytorch-linux-jammy-xpu-n-py3-inductor-benchmarks)
ANACONDA_PYTHON_VERSION=3.10
GCC_VERSION=11
VISION=yes
XPU_VERSION=2025.2
NINJA_VERSION=1.9.0
TRITON=yes
if [[ $tag =~ "benchmarks" ]]; then
INDUCTOR_BENCHMARKS=yes
fi
;;
pytorch-linux-jammy-py3-gcc11-inductor-benchmarks)
ANACONDA_PYTHON_VERSION=3.10

View File

@ -3,7 +3,7 @@
set -eux
ACL_VERSION=${ACL_VERSION:-"v25.02"}
ACL_VERSION=${ACL_VERSION:-"v52.6.0"}
ACL_INSTALL_DIR="/acl"
# Clone ACL

View File

@ -12,8 +12,8 @@ function do_install() {
rocm_version_nodot=${rocm_version//./}
# https://github.com/icl-utk-edu/magma/pull/65
MAGMA_VERSION=d6e4117bc88e73f06d26c6c2e14f064e8fc3d1ec
# post merge of https://github.com/icl-utk-edu/magma/pull/65
MAGMA_VERSION=c0792ae825fb36872784892ea643dd6f3456bc5f
magma_archive="magma-rocm${rocm_version_nodot}-${MAGMA_VERSION}-1.tar.bz2"
rocm_dir="/opt/rocm"

View File

@ -97,7 +97,7 @@ case ${image} in
manylinux2_28-builder:xpu)
TARGET=xpu_final
GPU_IMAGE=amd64/almalinux:8
DOCKER_GPU_BUILD_ARG=" --build-arg DEVTOOLSET_VERSION=11"
DOCKER_GPU_BUILD_ARG=" --build-arg DEVTOOLSET_VERSION=13"
MANY_LINUX_VERSION="2_28"
;;
*)

View File

@ -54,12 +54,15 @@ ENV OPENSSL_DIR /opt/openssl
RUN rm install_openssl.sh
ARG INDUCTOR_BENCHMARKS
ARG ANACONDA_PYTHON_VERSION
ENV ANACONDA_PYTHON_VERSION=$ANACONDA_PYTHON_VERSION
COPY ./common/install_inductor_benchmark_deps.sh install_inductor_benchmark_deps.sh
COPY ./common/common_utils.sh common_utils.sh
COPY ci_commit_pins/huggingface-requirements.txt huggingface-requirements.txt
COPY ci_commit_pins/timm.txt timm.txt
COPY ci_commit_pins/torchbench.txt torchbench.txt
RUN if [ -n "${INDUCTOR_BENCHMARKS}" ]; then bash ./install_inductor_benchmark_deps.sh; fi
RUN rm install_inductor_benchmark_deps.sh common_utils.sh timm.txt huggingface-requirements.txt
RUN rm install_inductor_benchmark_deps.sh common_utils.sh timm.txt huggingface-requirements.txt torchbench.txt
# Install XPU Dependencies
ARG XPU_VERSION

View File

@ -1,7 +1,7 @@
SHELL=/usr/bin/env bash
DOCKER_CMD ?= docker
DESIRED_ROCM ?= 7.0
DESIRED_ROCM ?= 7.1
DESIRED_ROCM_SHORT = $(subst .,,$(DESIRED_ROCM))
PACKAGE_NAME = magma-rocm
# inherit this from underlying docker image, do not pass this env var to docker
@ -16,6 +16,7 @@ DOCKER_RUN = set -eou pipefail; ${DOCKER_CMD} run --rm -i \
magma-rocm/build_magma.sh
.PHONY: all
all: magma-rocm71
all: magma-rocm70
all: magma-rocm64
@ -24,6 +25,11 @@ clean:
$(RM) -r magma-*
$(RM) -r output
.PHONY: magma-rocm71
magma-rocm71: DESIRED_ROCM := 7.1
magma-rocm71:
$(DOCKER_RUN)
.PHONY: magma-rocm70
magma-rocm70: DESIRED_ROCM := 7.0
magma-rocm70:

View File

@ -6,8 +6,8 @@ set -eou pipefail
# The script expects DESIRED_CUDA and PACKAGE_NAME to be set
ROOT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")/.." && pwd)"
# https://github.com/icl-utk-edu/magma/pull/65
MAGMA_VERSION=d6e4117bc88e73f06d26c6c2e14f064e8fc3d1ec
# post merge of https://github.com/icl-utk-edu/magma/pull/65
MAGMA_VERSION=c0792ae825fb36872784892ea643dd6f3456bc5f
# Folders for the build
PACKAGE_FILES=${ROOT_DIR}/magma-rocm/package_files # metadata
@ -20,7 +20,7 @@ mkdir -p ${PACKAGE_DIR} ${PACKAGE_OUTPUT}/linux-64 ${PACKAGE_BUILD} ${PACKAGE_RE
# Fetch magma sources and verify checksum
pushd ${PACKAGE_DIR}
git clone https://github.com/jeffdaily/magma
git clone https://github.com/icl-utk-edu/magma
pushd magma
git checkout ${MAGMA_VERSION}
popd

View File

@ -426,7 +426,7 @@ fi
if [[ "$BUILD_ENVIRONMENT" != *libtorch* && "$BUILD_ENVIRONMENT" != *bazel* ]]; then
# export test times so that potential sharded tests that'll branch off this build will use consistent data
# don't do this for libtorch as libtorch is C++ only and thus won't have python tests run on its build
python tools/stats/export_test_times.py
PYTHONPATH=. python tools/stats/export_test_times.py
fi
# don't do this for bazel or s390x or riscv64 as they don't use sccache
if [[ "$BUILD_ENVIRONMENT" != *s390x* && "$BUILD_ENVIRONMENT" != *riscv64* && "$BUILD_ENVIRONMENT" != *-bazel-* ]]; then

View File

@ -572,6 +572,8 @@ fi
if [[ "${TEST_CONFIG}" == *cpu* ]]; then
DYNAMO_BENCHMARK_FLAGS+=(--device cpu)
elif [[ "${TEST_CONFIG}" == *xpu* ]]; then
DYNAMO_BENCHMARK_FLAGS+=(--device xpu)
else
DYNAMO_BENCHMARK_FLAGS+=(--device cuda)
fi
@ -665,6 +667,8 @@ test_perf_for_dashboard() {
device=cuda_b200
elif [[ "${TEST_CONFIG}" == *rocm* ]]; then
device=rocm
elif [[ "${TEST_CONFIG}" == *xpu* ]]; then
device=xpu
fi
for mode in "${modes[@]}"; do
@ -1757,7 +1761,7 @@ elif [[ "${TEST_CONFIG}" == *torchbench* ]]; then
else
# Do this after checkout_install_torchbench to ensure we clobber any
# nightlies that torchbench may pull in
if [[ "${TEST_CONFIG}" != *cpu* ]]; then
if [[ "${TEST_CONFIG}" != *cpu* && "${TEST_CONFIG}" != *xpu* ]]; then
install_torchrec_and_fbgemm
fi
PYTHONPATH=/torchbench test_dynamo_benchmark torchbench "$id"

View File

@ -0,0 +1,319 @@
---
name: add-uint-support
description: Add unsigned integer (uint) type support to PyTorch operators by updating AT_DISPATCH macros. Use when adding support for uint16, uint32, uint64 types to operators, kernels, or when user mentions enabling unsigned types, barebones unsigned types, or uint support.
---
# Add Unsigned Integer (uint) Support to Operators
This skill helps add support for unsigned integer types (uint16, uint32, uint64) to PyTorch operators by updating their AT_DISPATCH macros.
## When to use this skill
Use this skill when:
- Adding uint16, uint32, or uint64 support to an operator
- User mentions "unsigned types", "uint support", "barebones unsigned types"
- Enabling support for kUInt16, kUInt32, kUInt64 in kernels
- Working with operator implementations that need expanded type coverage
## Quick reference
**Add unsigned types to existing dispatch:**
```cpp
// Before
AT_DISPATCH_V2(dtype, "op", AT_WRAP([&]() {
kernel<scalar_t>();
}), AT_EXPAND(AT_ALL_TYPES));
// After (method 1: add unsigned types explicitly)
AT_DISPATCH_V2(dtype, "op", AT_WRAP([&]() {
kernel<scalar_t>();
}), AT_EXPAND(AT_ALL_TYPES), AT_EXPAND(AT_BAREBONES_UNSIGNED_TYPES));
// After (method 2: use V2 integral types if AT_INTEGRAL_TYPES present)
AT_DISPATCH_V2(dtype, "op", AT_WRAP([&]() {
kernel<scalar_t>();
}), AT_EXPAND(AT_INTEGRAL_TYPES_V2), AT_EXPAND(AT_FLOATING_TYPES));
```
## Type group reference
**Unsigned type groups:**
- `AT_BAREBONES_UNSIGNED_TYPES`: kUInt16, kUInt32, kUInt64
- `AT_INTEGRAL_TYPES_V2`: AT_INTEGRAL_TYPES + AT_BAREBONES_UNSIGNED_TYPES
**Relationship:**
```cpp
AT_INTEGRAL_TYPES // kByte, kChar, kInt, kLong, kShort
AT_BAREBONES_UNSIGNED_TYPES // kUInt16, kUInt32, kUInt64
AT_INTEGRAL_TYPES_V2 // INTEGRAL_TYPES + BAREBONES_UNSIGNED_TYPES
```
## Instructions
### Step 1: Determine if conversion to V2 is needed
Check if the file uses AT_DISPATCH_V2:
**If using old AT_DISPATCH:**
- First convert to AT_DISPATCH_V2 using the at-dispatch-v2 skill
- Then proceed with adding uint support
**If already using AT_DISPATCH_V2:**
- Proceed directly to Step 2
### Step 2: Analyze the current dispatch macro
Identify what type groups are currently in use:
```cpp
AT_DISPATCH_V2(dtype, "op", AT_WRAP([&]() {
// body
}), AT_EXPAND(AT_ALL_TYPES), kHalf, kBFloat16);
^^^^^^^^^^^^^^^^^^^^^^^^^
Current type coverage
```
Common patterns:
- `AT_EXPAND(AT_ALL_TYPES)` → includes AT_INTEGRAL_TYPES + AT_FLOATING_TYPES
- `AT_EXPAND(AT_INTEGRAL_TYPES)` → signed integers only
- `AT_EXPAND(AT_FLOATING_TYPES)` → floating point types
### Step 3: Choose the uint addition method
Two approaches:
**Method 1: Add AT_BAREBONES_UNSIGNED_TYPES explicitly**
- Use when: You want to be explicit about adding uint support
- Add `AT_EXPAND(AT_BAREBONES_UNSIGNED_TYPES)` to the type list
**Method 2: Substitute AT_INTEGRAL_TYPES with AT_INTEGRAL_TYPES_V2**
- Use when: The dispatch already uses `AT_EXPAND(AT_INTEGRAL_TYPES)`
- More concise: replaces one type group with its superset
- Only applicable if AT_INTEGRAL_TYPES is present
### Step 4: Apply the transformation
**Method 1 example:**
```cpp
// Before
AT_DISPATCH_V2(
dtype,
"min_values_cuda",
AT_WRAP([&]() {
kernel_impl<scalar_t>(iter);
}),
AT_EXPAND(AT_ALL_TYPES),
kBFloat16, kHalf, kBool
);
// After (add unsigned types)
AT_DISPATCH_V2(
dtype,
"min_values_cuda",
AT_WRAP([&]() {
kernel_impl<scalar_t>(iter);
}),
AT_EXPAND(AT_ALL_TYPES),
AT_EXPAND(AT_BAREBONES_UNSIGNED_TYPES),
kBFloat16, kHalf, kBool
);
```
**Method 2 example:**
```cpp
// Before
AT_DISPATCH_V2(
dtype,
"integral_op",
AT_WRAP([&]() {
kernel<scalar_t>();
}),
AT_EXPAND(AT_INTEGRAL_TYPES)
);
// After (substitute with V2)
AT_DISPATCH_V2(
dtype,
"integral_op",
AT_WRAP([&]() {
kernel<scalar_t>();
}),
AT_EXPAND(AT_INTEGRAL_TYPES_V2)
);
```
### Step 5: Handle AT_ALL_TYPES vs individual type groups
If the dispatch uses `AT_EXPAND(AT_ALL_TYPES)`:
- `AT_ALL_TYPES` = `AT_INTEGRAL_TYPES` + `AT_FLOATING_TYPES`
- To add uint: add `AT_EXPAND(AT_BAREBONES_UNSIGNED_TYPES)` to the list
If the dispatch separately lists INTEGRAL and FLOATING:
```cpp
// Before
AT_EXPAND(AT_INTEGRAL_TYPES), AT_EXPAND(AT_FLOATING_TYPES)
// After (Method 2 preferred)
AT_EXPAND(AT_INTEGRAL_TYPES_V2), AT_EXPAND(AT_FLOATING_TYPES)
```
### Step 6: Verify all dispatch sites
Check the file for ALL dispatch macros that need uint support:
- Some operators have multiple dispatch sites (CPU, CUDA, different functions)
- Apply the transformation consistently across all sites
- Ensure each gets the same type coverage updates
### Step 7: Validate the changes
Check that:
- [ ] AT_DISPATCH_V2 format is used (not old AT_DISPATCH)
- [ ] Unsigned types are added via one of the two methods
- [ ] All relevant dispatch sites in the file are updated
- [ ] Type groups use `AT_EXPAND()`
- [ ] Arguments are properly formatted and comma-separated
## Common patterns
### Pattern 1: AT_ALL_TYPES + extras
```cpp
// Before
AT_DISPATCH_V2(dtype, "op", AT_WRAP([&]() {
kernel<scalar_t>();
}), AT_EXPAND(AT_ALL_TYPES), kHalf, kBFloat16);
// After
AT_DISPATCH_V2(dtype, "op", AT_WRAP([&]() {
kernel<scalar_t>();
}), AT_EXPAND(AT_ALL_TYPES), AT_EXPAND(AT_BAREBONES_UNSIGNED_TYPES), kHalf, kBFloat16);
```
### Pattern 2: Separate INTEGRAL + FLOATING
```cpp
// Before
AT_DISPATCH_V2(dtype, "op", AT_WRAP([&]() {
kernel<scalar_t>();
}), AT_EXPAND(AT_INTEGRAL_TYPES), AT_EXPAND(AT_FLOATING_TYPES));
// After
AT_DISPATCH_V2(dtype, "op", AT_WRAP([&]() {
kernel<scalar_t>();
}), AT_EXPAND(AT_INTEGRAL_TYPES_V2), AT_EXPAND(AT_FLOATING_TYPES));
```
### Pattern 3: Old dispatch needs conversion first
```cpp
// Before (needs v2 conversion first)
AT_DISPATCH_ALL_TYPES_AND2(kHalf, kBFloat16, dtype, "op", [&]() {
kernel<scalar_t>();
});
// After v2 conversion
AT_DISPATCH_V2(dtype, "op", AT_WRAP([&]() {
kernel<scalar_t>();
}), AT_EXPAND(AT_ALL_TYPES), kHalf, kBFloat16);
// After adding uint support
AT_DISPATCH_V2(dtype, "op", AT_WRAP([&]() {
kernel<scalar_t>();
}), AT_EXPAND(AT_ALL_TYPES), AT_EXPAND(AT_BAREBONES_UNSIGNED_TYPES), kHalf, kBFloat16);
```
## Multiple dispatch sites example
For a file with multiple functions:
```cpp
void min_values_kernel_cuda(TensorIterator& iter) {
AT_DISPATCH_V2(iter.dtype(), "min_values_cuda", AT_WRAP([&]() {
impl<scalar_t>(iter);
}), AT_EXPAND(AT_ALL_TYPES), AT_EXPAND(AT_BAREBONES_UNSIGNED_TYPES), kBFloat16, kHalf);
// ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
// Added uint support
}
void min_launch_kernel(TensorIterator &iter) {
AT_DISPATCH_V2(iter.input_dtype(), "min_cuda", AT_WRAP([&]() {
gpu_reduce_kernel<scalar_t>(iter);
}), AT_EXPAND(AT_ALL_TYPES), AT_EXPAND(AT_BAREBONES_UNSIGNED_TYPES), kBFloat16, kHalf);
// ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
// Added uint support here too
}
```
## Decision tree
Use this decision tree to determine the approach:
```
Is the file using AT_DISPATCH_V2?
├─ No → Use at-dispatch-v2 skill first, then continue
└─ Yes
└─ Does it use AT_EXPAND(AT_INTEGRAL_TYPES)?
├─ Yes → Replace with AT_EXPAND(AT_INTEGRAL_TYPES_V2)
└─ No → Add AT_EXPAND(AT_BAREBONES_UNSIGNED_TYPES) to type list
```
## Edge cases
### Case 1: Dispatch with only floating types
If the operator only supports floating point types, don't add uint support:
```cpp
// Leave as-is - floating point only operator
AT_DISPATCH_V2(dtype, "float_op", AT_WRAP([&]() {
kernel<scalar_t>();
}), AT_EXPAND(AT_FLOATING_TYPES), kHalf);
```
### Case 2: Complex types present
Unsigned types work alongside complex types:
```cpp
AT_DISPATCH_V2(dtype, "op", AT_WRAP([&]() {
kernel<scalar_t>();
}), AT_EXPAND(AT_ALL_TYPES),
AT_EXPAND(AT_BAREBONES_UNSIGNED_TYPES),
AT_EXPAND(AT_COMPLEX_TYPES),
kHalf, kBFloat16);
```
### Case 3: Already has uint support
Check if uint types are already present:
- If `AT_INTEGRAL_TYPES_V2` is used → already has uint support
- If `AT_BAREBONES_UNSIGNED_TYPES` is already in list → already has uint support
- Skip the file if uint support is already present
## Workflow
When asked to add uint support:
1. Read the target file
2. Check if using AT_DISPATCH_V2:
- If not → use at-dispatch-v2 skill first
3. Identify all dispatch macro sites
4. For each dispatch:
- Analyze current type groups
- Choose method (add BAREBONES_UNSIGNED or upgrade to V2)
- Apply transformation with Edit tool
5. Show the user the changes
6. Explain what was modified
## Important notes
- Always check if v2 conversion is needed first
- Apply changes consistently across all dispatch sites in the file
- Method 2 (AT_INTEGRAL_TYPES_V2) is cleaner when applicable
- Method 1 (explicit AT_BAREBONES_UNSIGNED_TYPES) is more explicit
- Unsigned types are: kUInt16, kUInt32, kUInt64 (not kByte which is uint8)
- Some operators may not semantically support unsigned types - use judgment
## Testing
After adding uint support, the operator should accept uint16, uint32, and uint64 tensors. The user is responsible for functional testing.

View File

@ -0,0 +1,305 @@
---
name: at-dispatch-v2
description: Convert PyTorch AT_DISPATCH macros to AT_DISPATCH_V2 format in ATen C++ code. Use when porting AT_DISPATCH_ALL_TYPES_AND*, AT_DISPATCH_FLOATING_TYPES*, or other dispatch macros to the new v2 API. For ATen kernel files, CUDA kernels, and native operator implementations.
---
# AT_DISPATCH to AT_DISPATCH_V2 Converter
This skill helps convert PyTorch's legacy AT_DISPATCH macros to the new AT_DISPATCH_V2 format, as defined in `aten/src/ATen/Dispatch_v2.h`.
## When to use this skill
Use this skill when:
- Converting AT_DISPATCH_* macros to AT_DISPATCH_V2
- Porting ATen kernels to use the new dispatch API
- Working with files in `aten/src/ATen/native/` that use dispatch macros
- User mentions "AT_DISPATCH", "dispatch v2", "Dispatch_v2.h", or macro conversion
## Quick reference
**Old format:**
```cpp
AT_DISPATCH_ALL_TYPES_AND3(kBFloat16, kHalf, kBool, dtype, "kernel_name", [&]() {
// lambda body
});
```
**New format:**
```cpp
AT_DISPATCH_V2(dtype, "kernel_name", AT_WRAP([&]() {
// lambda body
}), AT_EXPAND(AT_ALL_TYPES), kBFloat16, kHalf, kBool);
```
## Key transformations
1. **Reorder arguments**: `scalar_type` and `name` come first, then lambda, then types
2. **Wrap the lambda**: Use `AT_WRAP(lambda)` to handle internal commas
3. **Expand type groups**: Use `AT_EXPAND(AT_ALL_TYPES)` instead of implicit expansion
4. **List individual types**: Add extra types (kHalf, kBFloat16, etc.) after expanded groups
5. **Add include**: `#include <ATen/Dispatch_v2.h>` near other Dispatch includes
## Instructions
### Step 1: Add the Dispatch_v2.h include
Add the v2 header near the existing `#include <ATen/Dispatch.h>`:
```cpp
#include <ATen/Dispatch.h>
#include <ATen/Dispatch_v2.h>
```
Keep the old Dispatch.h include for now (other code may still need it).
### Step 2: Identify the old dispatch pattern
Common patterns to convert:
- `AT_DISPATCH_ALL_TYPES_AND{2,3,4}(type1, type2, ..., scalar_type, name, lambda)`
- `AT_DISPATCH_FLOATING_TYPES_AND{2,3}(type1, type2, ..., scalar_type, name, lambda)`
- `AT_DISPATCH_ALL_TYPES_AND_COMPLEX_AND{2,3}(type1, ..., scalar_type, name, lambda)`
- `AT_DISPATCH_FLOATING_AND_COMPLEX_TYPES_AND{2,3}(type1, ..., scalar_type, name, lambda)`
### Step 3: Map the old macro to type groups
Identify which type group macro corresponds to the base types:
| Old macro base | AT_DISPATCH_V2 type group |
|----------------|---------------------------|
| `ALL_TYPES` | `AT_EXPAND(AT_ALL_TYPES)` |
| `FLOATING_TYPES` | `AT_EXPAND(AT_FLOATING_TYPES)` |
| `INTEGRAL_TYPES` | `AT_EXPAND(AT_INTEGRAL_TYPES)` |
| `COMPLEX_TYPES` | `AT_EXPAND(AT_COMPLEX_TYPES)` |
| `ALL_TYPES_AND_COMPLEX` | `AT_EXPAND(AT_ALL_TYPES_AND_COMPLEX)` |
For combined patterns, use multiple `AT_EXPAND()` entries:
```cpp
// Old: AT_DISPATCH_ALL_TYPES_AND_COMPLEX_AND2(...)
// New: AT_EXPAND(AT_ALL_TYPES), AT_EXPAND(AT_COMPLEX_TYPES), type1, type2
```
### Step 4: Extract the individual types
From `AT_DISPATCH_*_AND2(type1, type2, ...)` or `AT_DISPATCH_*_AND3(type1, type2, type3, ...)`, extract the individual types (type1, type2, etc.).
These become the trailing arguments after the type group:
```cpp
AT_DISPATCH_V2(..., AT_EXPAND(AT_ALL_TYPES), kBFloat16, kHalf, kBool)
^^^^^^^^^^^^^^^^^^^^^^^^
Individual types from AND3
```
### Step 5: Transform to AT_DISPATCH_V2
Apply the transformation:
**Pattern:**
```cpp
AT_DISPATCH_V2(
scalar_type, // 1st: The dtype expression
"name", // 2nd: The debug string
AT_WRAP(lambda), // 3rd: The lambda wrapped in AT_WRAP
type_groups, // 4th+: Type groups with AT_EXPAND()
individual_types // Last: Individual types
)
```
**Example transformation:**
```cpp
// BEFORE
AT_DISPATCH_ALL_TYPES_AND3(
kBFloat16, kHalf, kBool,
iter.dtype(),
"min_values_cuda",
[&]() {
min_values_kernel_cuda_impl<scalar_t>(iter);
}
);
// AFTER
AT_DISPATCH_V2(
iter.dtype(),
"min_values_cuda",
AT_WRAP([&]() {
min_values_kernel_cuda_impl<scalar_t>(iter);
}),
AT_EXPAND(AT_ALL_TYPES),
kBFloat16, kHalf, kBool
);
```
### Step 6: Handle multi-line lambdas
For lambdas with internal commas or complex expressions, AT_WRAP is essential:
```cpp
AT_DISPATCH_V2(
dtype,
"complex_kernel",
AT_WRAP([&]() {
gpu_reduce_kernel<scalar_t, scalar_t>(
iter,
MinOps<scalar_t>{},
thrust::pair<scalar_t, int64_t>(upper_bound(), 0) // Commas inside!
);
}),
AT_EXPAND(AT_ALL_TYPES)
);
```
### Step 7: Verify the conversion
Check that:
- [ ] `AT_WRAP()` wraps the entire lambda
- [ ] Type groups use `AT_EXPAND()`
- [ ] Individual types don't have `AT_EXPAND()` (just `kBFloat16`, not `AT_EXPAND(kBFloat16)`)
- [ ] Argument order is: scalar_type, name, lambda, types
- [ ] Include added: `#include <ATen/Dispatch_v2.h>`
## Type group reference
Available type group macros (use with `AT_EXPAND()`):
```cpp
AT_INTEGRAL_TYPES // kByte, kChar, kInt, kLong, kShort
AT_FLOATING_TYPES // kDouble, kFloat
AT_COMPLEX_TYPES // kComplexDouble, kComplexFloat
AT_QINT_TYPES // kQInt8, kQUInt8, kQInt32
AT_ALL_TYPES // INTEGRAL_TYPES + FLOATING_TYPES
AT_ALL_TYPES_AND_COMPLEX // ALL_TYPES + COMPLEX_TYPES
AT_INTEGRAL_TYPES_V2 // INTEGRAL_TYPES + unsigned types
AT_BAREBONES_UNSIGNED_TYPES // kUInt16, kUInt32, kUInt64
AT_FLOAT8_TYPES // Float8 variants
```
## Common patterns
### Pattern: AT_DISPATCH_ALL_TYPES_AND2
```cpp
// Before
AT_DISPATCH_ALL_TYPES_AND2(kHalf, kBFloat16, dtype, "op", [&]() {
kernel<scalar_t>(data);
});
// After
AT_DISPATCH_V2(dtype, "op", AT_WRAP([&]() {
kernel<scalar_t>(data);
}), AT_EXPAND(AT_ALL_TYPES), kHalf, kBFloat16);
```
### Pattern: AT_DISPATCH_FLOATING_TYPES_AND3
```cpp
// Before
AT_DISPATCH_FLOATING_TYPES_AND3(kHalf, kBFloat16, kFloat8_e4m3fn,
tensor.scalar_type(), "float_op", [&] {
process<scalar_t>(tensor);
});
// After
AT_DISPATCH_V2(tensor.scalar_type(), "float_op", AT_WRAP([&] {
process<scalar_t>(tensor);
}), AT_EXPAND(AT_FLOATING_TYPES), kHalf, kBFloat16, kFloat8_e4m3fn);
```
### Pattern: AT_DISPATCH_ALL_TYPES_AND_COMPLEX_AND2
```cpp
// Before
AT_DISPATCH_ALL_TYPES_AND_COMPLEX_AND2(
kComplexHalf, kHalf,
self.scalar_type(),
"complex_op",
[&] {
result = compute<scalar_t>(self);
}
);
// After
AT_DISPATCH_V2(
self.scalar_type(),
"complex_op",
AT_WRAP([&] {
result = compute<scalar_t>(self);
}),
AT_EXPAND(AT_ALL_TYPES),
AT_EXPAND(AT_COMPLEX_TYPES),
kComplexHalf,
kHalf
);
```
## Edge cases
### Case 1: No extra types (rare)
```cpp
// Before
AT_DISPATCH_ALL_TYPES(dtype, "op", [&]() { kernel<scalar_t>(); });
// After
AT_DISPATCH_V2(dtype, "op", AT_WRAP([&]() {
kernel<scalar_t>();
}), AT_EXPAND(AT_ALL_TYPES));
```
### Case 2: Many individual types (AND4, AND5, etc.)
```cpp
// Before
AT_DISPATCH_FLOATING_TYPES_AND4(kHalf, kBFloat16, kFloat8_e4m3fn, kFloat8_e5m2,
dtype, "float8_op", [&]() { kernel<scalar_t>(); });
// After
AT_DISPATCH_V2(dtype, "float8_op", AT_WRAP([&]() {
kernel<scalar_t>();
}), AT_EXPAND(AT_FLOATING_TYPES), kHalf, kBFloat16, kFloat8_e4m3fn, kFloat8_e5m2);
```
### Case 3: Lambda with no captures
```cpp
// Before
AT_DISPATCH_ALL_TYPES_AND2(kHalf, kBool, dtype, "op", []() {
static_kernel<scalar_t>();
});
// After
AT_DISPATCH_V2(dtype, "op", AT_WRAP([]() {
static_kernel<scalar_t>();
}), AT_EXPAND(AT_ALL_TYPES), kHalf, kBool);
```
## Benefits of AT_DISPATCH_V2
1. **No arity in macro name**: Don't need different macros for AND2, AND3, AND4
2. **Composable type sets**: Mix and match type groups with `AT_EXPAND()`
3. **Extensible**: Easy to add more types without hitting macro limits
4. **Clearer**: Type groups are explicit, not implicit in macro name
## Important notes
- Keep `#include <ATen/Dispatch.h>` - other code may need it
- The `AT_WRAP()` is mandatory - prevents comma parsing issues in the lambda
- Type groups need `AT_EXPAND()`, individual types don't
- The v2 API is in `aten/src/ATen/Dispatch_v2.h` - refer to it for full docs
- See the header file for the Python script to regenerate the macro implementation
## Workflow
When asked to convert AT_DISPATCH macros:
1. Read the file to identify all AT_DISPATCH uses
2. Add `#include <ATen/Dispatch_v2.h>` if not present
3. For each dispatch macro:
- Identify the pattern and extract components
- Map the base type group
- Extract individual types
- Construct the AT_DISPATCH_V2 call
- Apply with Edit tool
4. Show the user the complete converted file
5. Explain what was changed
Do NOT compile or test the code - focus on accurate conversion only.

View File

@ -27,7 +27,9 @@ runs:
docker system prune -af
diskspace_new=$(df -H --output=pcent ${docker_root_dir} | sed -n 2p | sed 's/%//' | sed 's/ //')
if [[ "$diskspace_new" -gt "$diskspace_cutoff" ]] ; then
echo "Error: Available diskspace is less than $diskspace_cutoff percent. Not enough diskspace."
diskspace_cutoff_int=$((diskspace_cutoff + 0))
difference=$((100 - diskspace_cutoff_int))
echo "Error: Available diskspace is less than $difference percent. Not enough diskspace."
echo "$msg"
exit 1
else

View File

@ -1 +1 @@
218d2ab791d437309f91e0486eb9fa7f00badc17
cfbc5c2f1c798991715a6b06bb3ce46478c4487c

View File

@ -19,6 +19,7 @@ ciflow_push_tags:
- ciflow/inductor-perf-test-nightly-rocm-mi300
- ciflow/inductor-perf-test-nightly-rocm-mi355
- ciflow/inductor-perf-test-nightly-x86-zen
- ciflow/inductor-perf-test-nightly-xpu
- ciflow/inductor-periodic
- ciflow/inductor-rocm
- ciflow/linux-aarch64

View File

@ -11,11 +11,17 @@ architectures:
* Latest XPU
"""
import json
import os
import re
from pathlib import Path
from typing import Optional
# NOTE: Please also update the CUDA sources in `PIP_SOURCES` in tools/nightly.py when changing this
SCRIPT_DIR = Path(__file__).absolute().parent
REPO_ROOT = SCRIPT_DIR.parent.parent
CUDA_ARCHES = ["12.6", "12.8", "12.9", "13.0"]
CUDA_STABLE = "12.8"
CUDA_ARCHES_FULL_VERSION = {
@ -31,8 +37,7 @@ CUDA_ARCHES_CUDNN_VERSION = {
"13.0": "9",
}
# NOTE: Please also update the ROCm sources in `PIP_SOURCES` in tools/nightly.py when changing this
ROCM_ARCHES = ["6.4", "7.0"]
ROCM_ARCHES = ["7.0", "7.1"]
XPU_ARCHES = ["xpu"]
@ -137,9 +142,48 @@ PYTORCH_EXTRA_INSTALL_REQUIREMENTS = {
}
def get_nccl_wheel_version(arch_version: str) -> str:
import re
# Used by tools/nightly.py
PYTORCH_NIGHTLY_PIP_INDEX_URL = "https://download.pytorch.org/whl/nightly"
NIGHTLY_SOURCE_MATRIX = {
"cpu": dict(
name="cpu",
index_url=f"{PYTORCH_NIGHTLY_PIP_INDEX_URL}/cpu",
supported_platforms=["Linux", "macOS", "Windows"],
accelerator="cpu",
)
}
CUDA_NIGHTLY_SOURCE_MATRIX = {
f"cuda-{major}.{minor}": dict(
name=f"cuda-{major}.{minor}",
index_url=f"{PYTORCH_NIGHTLY_PIP_INDEX_URL}/cu{major}{minor}",
supported_platforms=["Linux", "Windows"],
accelerator="cuda",
)
for major, minor in (map(int, version.split(".")) for version in CUDA_ARCHES)
}
ROCM_NIGHTLY_SOURCE_MATRIX = {
f"rocm-{major}.{minor}": dict(
name=f"rocm-{major}.{minor}",
index_url=f"{PYTORCH_NIGHTLY_PIP_INDEX_URL}/rocm{major}.{minor}",
supported_platforms=["Linux"],
accelerator="rocm",
)
for major, minor in (map(int, version.split(".")) for version in ROCM_ARCHES)
}
XPU_NIGHTLY_SOURCE_MATRIX = {
"xpu": dict(
name="xpu",
index_url=f"{PYTORCH_NIGHTLY_PIP_INDEX_URL}/xpu",
supported_platforms=["Linux"],
accelerator="xpu",
)
}
NIGHTLY_SOURCE_MATRIX.update(CUDA_NIGHTLY_SOURCE_MATRIX)
NIGHTLY_SOURCE_MATRIX.update(ROCM_NIGHTLY_SOURCE_MATRIX)
NIGHTLY_SOURCE_MATRIX.update(XPU_NIGHTLY_SOURCE_MATRIX)
def get_nccl_wheel_version(arch_version: str) -> str:
requirements = map(
str.strip, re.split("[;|]", PYTORCH_EXTRA_INSTALL_REQUIREMENTS[arch_version])
)
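The `*_NIGHTLY_SOURCE_MATRIX` additions above build one pip-source entry per arch string by splitting each `"major.minor"` version inside a dict comprehension. A standalone sketch of that pattern for the ROCm case, with index URLs following the `download.pytorch.org` scheme shown in the diff:

```python
# Sketch of the dict-comprehension pattern above: one nightly pip source
# entry per "major.minor" arch string.
PIP_INDEX = "https://download.pytorch.org/whl/nightly"
ROCM_ARCHES = ["7.0", "7.1"]

rocm_matrix = {
    f"rocm-{major}.{minor}": {
        "name": f"rocm-{major}.{minor}",
        "index_url": f"{PIP_INDEX}/rocm{major}.{minor}",
        "supported_platforms": ["Linux"],
        "accelerator": "rocm",
    }
    for major, minor in (map(int, version.split(".")) for version in ROCM_ARCHES)
}

print(sorted(rocm_matrix))  # -> ['rocm-7.0', 'rocm-7.1']
```

The inner generator parses each version string once, so the key, name, and URL all stay consistent by construction.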
@@ -147,17 +191,14 @@ def get_nccl_wheel_version(arch_version: str) -> str:
def read_nccl_pin(arch_version: str) -> str:
from pathlib import Path
nccl_pin_path = os.path.join(
Path(__file__).absolute().parents[2],
".ci",
"docker",
"ci_commit_pins",
f"nccl-cu{arch_version[:2]}.txt",
nccl_pin_path = (
REPO_ROOT
/ ".ci"
/ "docker"
/ "ci_commit_pins"
/ f"nccl-cu{arch_version[:2]}.txt"
)
with open(nccl_pin_path) as f:
return f.read().strip()
return nccl_pin_path.read_text().strip()
def validate_nccl_dep_consistency(arch_version: str) -> None:
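The `read_nccl_pin` rewrite above swaps `os.path.join` plus `open()`/`read()` for `pathlib`'s `/` operator and `Path.read_text()`. A self-contained sketch of the same pattern against a temporary tree (the pin filename and tag below are made up for illustration):

```python
import tempfile
from pathlib import Path

# Sketch of the pathlib refactor above: compose the path with `/` and
# read it with Path.read_text() instead of os.path.join + open().
with tempfile.TemporaryDirectory() as tmp:
    repo_root = Path(tmp)
    pin_path = repo_root / ".ci" / "docker" / "ci_commit_pins" / "nccl-cu12.txt"
    pin_path.parent.mkdir(parents=True)
    pin_path.write_text("v2.27.3-1\n")
    tag = pin_path.read_text().strip()

print(tag)  # -> v2.27.3-1
```

`Path.read_text()` opens, reads, and closes the file in one call, which is what lets the rewrite drop the `with open(...)` block entirely.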
@@ -165,7 +206,8 @@ def validate_nccl_dep_consistency(arch_version: str) -> None:
wheel_ver = get_nccl_wheel_version(arch_version)
if not nccl_release_tag.startswith(f"v{wheel_ver}"):
raise RuntimeError(
f"{arch_version} NCCL release tag version {nccl_release_tag} does not correspond to wheel version {wheel_ver}"
f"{arch_version} NCCL release tag version {nccl_release_tag} "
f"does not correspond to wheel version {wheel_ver}"
)
@@ -412,7 +454,14 @@ def generate_wheels_matrix(
return ret
validate_nccl_dep_consistency("13.0")
validate_nccl_dep_consistency("12.9")
validate_nccl_dep_consistency("12.8")
validate_nccl_dep_consistency("12.6")
arch_version = ""
for arch_version in CUDA_ARCHES:
validate_nccl_dep_consistency(arch_version)
del arch_version
if __name__ == "__main__":
# Used by tools/nightly.py
(SCRIPT_DIR / "nightly_source_matrix.json").write_text(
json.dumps(NIGHTLY_SOURCE_MATRIX, indent=4) + "\n"
)

View File

@@ -38,6 +38,10 @@ on:
default: ""
description: |
List of tests to include (empty string implies default list)
dashboard-tag:
required: false
type: string
default: ""
disable-monitor:
description: |
[Experimental] Disable utilization monitoring for tests.
@@ -58,6 +62,11 @@ on:
required: false
type: number
default: 1
secrets:
HUGGING_FACE_HUB_TOKEN:
required: false
description: |
HF Auth token to avoid rate limits when downloading models or datasets from hub
permissions:
id-token: write
contents: read
@@ -196,6 +205,8 @@ jobs:
PYTORCH_TEST_CUDA_MEM_LEAK_CHECK: ${{ matrix.mem_leak_check && '1' || '0' }}
PYTORCH_TEST_RERUN_DISABLED_TESTS: ${{ matrix.rerun_disabled_tests && '1' || '0' }}
TESTS_TO_INCLUDE: ${{ inputs.tests-to-include }}
DASHBOARD_TAG: ${{ inputs.dashboard-tag }}
HUGGING_FACE_HUB_TOKEN: ${{ secrets.HUGGING_FACE_HUB_TOKEN }}
timeout-minutes: ${{ fromJson(steps.test-timeout.outputs.timeout) }}
run: |
# Fetch aws credential from IMDs
@@ -246,6 +257,8 @@ jobs:
-e PYTORCH_TEST_RERUN_DISABLED_TESTS \
-e TESTS_TO_INCLUDE \
-e ZE_AFFINITY_MASK \
-e HUGGING_FACE_HUB_TOKEN \
-e DASHBOARD_TAG \
--env-file="/tmp/github_env_${GITHUB_RUN_ID}" \
--ulimit stack=10485760:83886080 \
--ulimit core=0 \

View File

@@ -36,7 +36,7 @@ jobs:
runs-on: linux.9xlarge.ephemeral
strategy:
matrix:
tag: ["cuda12.6", "cuda12.8", "cuda12.9", "cuda13.0", "rocm6.4", "rocm7.0", "cpu"]
tag: ["cuda12.6", "cuda12.8", "cuda12.9", "cuda13.0", "rocm7.0", "rocm7.1", "cpu"]
steps:
- name: Build docker image
uses: pytorch/pytorch/.github/actions/binary-docker-build@main

View File

@@ -52,8 +52,8 @@ jobs:
{ tag: "cuda12.9" },
{ tag: "cuda12.8" },
{ tag: "cuda12.6" },
{ tag: "rocm6.4" },
{ tag: "rocm7.0" },
{ tag: "rocm7.1" },
{ tag: "cpu" },
]
steps:

View File

@@ -34,7 +34,7 @@ jobs:
id-token: write
strategy:
matrix:
rocm_version: ["70", "64"]
rocm_version: ["71", "70"]
steps:
- name: Checkout PyTorch
uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2

View File

@@ -54,8 +54,8 @@ jobs:
{ name: "manylinuxaarch64-builder", tag: "cuda12.9", runner: "linux.arm64.2xlarge.ephemeral" },
{ name: "manylinuxaarch64-builder", tag: "cuda12.8", runner: "linux.arm64.2xlarge.ephemeral" },
{ name: "manylinuxaarch64-builder", tag: "cuda12.6", runner: "linux.arm64.2xlarge.ephemeral" },
{ name: "manylinux2_28-builder", tag: "rocm6.4", runner: "linux.9xlarge.ephemeral" },
{ name: "manylinux2_28-builder", tag: "rocm7.0", runner: "linux.9xlarge.ephemeral" },
{ name: "manylinux2_28-builder", tag: "rocm7.1", runner: "linux.9xlarge.ephemeral" },
{ name: "manylinux2_28-builder", tag: "cpu", runner: "linux.9xlarge.ephemeral" },
{ name: "manylinux2_28_aarch64-builder", tag: "cpu-aarch64", runner: "linux.arm64.2xlarge.ephemeral" },
{ name: "manylinux2_28-builder", tag: "xpu", runner: "linux.9xlarge.ephemeral" },

View File

@@ -55,7 +55,7 @@ jobs:
docker-image: ["pytorch/manylinux2_28-builder:cpu"]
include:
- device: "rocm"
rocm_version: "7.0"
rocm_version: "7.1"
runs_on: "${{ needs.get-label-type.outputs.label-type }}linux.4xlarge"
- device: "cuda"
rocm_version: ""
@@ -159,12 +159,7 @@ jobs:
WITH_CLANG_LDD="--with-clang-ldd"
fi
if [[ "${BUILD_DEVICE}" == xpu ]]; then
docker exec -t "${container_name}" bash -c "dnf install -y gcc-toolset-13-gcc-c++"
docker exec -t "${container_name}" bash -c "source /opt/rh/gcc-toolset-13/enable && ${PYTHON_EXECUTABLE} /pytorch/.github/scripts/build_triton_wheel.py --device=$BUILD_DEVICE $RELEASE"
else
docker exec -t "${container_name}" bash -c "${PYTHON_EXECUTABLE} /pytorch/.github/scripts/build_triton_wheel.py --device=$BUILD_DEVICE $RELEASE $WITH_CLANG_LDD"
fi
docker exec -t "${container_name}" bash -c "${PYTHON_EXECUTABLE} /pytorch/.github/scripts/build_triton_wheel.py --device=$BUILD_DEVICE $RELEASE $WITH_CLANG_LDD"
if [[ ("${{ matrix.device }}" == "cuda" || "${{ matrix.device }}" == "xpu") ]]; then
docker exec -t "${container_name}" bash -c "auditwheel repair --plat ${PLATFORM} //artifacts/*.whl"

View File

@@ -67,6 +67,7 @@ jobs:
pytorch-linux-jammy-py3.12-halide,
pytorch-linux-jammy-xpu-n-1-py3,
pytorch-linux-jammy-xpu-n-py3,
pytorch-linux-jammy-xpu-n-py3-inductor-benchmarks,
pytorch-linux-jammy-py3-clang18-asan,
pytorch-linux-jammy-py3-clang12-onnx,
pytorch-linux-jammy-linter,

View File

@@ -384,124 +384,6 @@ jobs:
github-token: ${{ secrets.GITHUB_TOKEN }}
uses: ./.github/workflows/_binary-upload.yml
libtorch-rocm6_4-shared-with-deps-release-build:
if: ${{ github.repository_owner == 'pytorch' }}
uses: ./.github/workflows/_binary-build-linux.yml
needs: get-label-type
with:
PYTORCH_ROOT: /pytorch
PACKAGE_TYPE: libtorch
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: rocm6.4
GPU_ARCH_VERSION: "6.4"
GPU_ARCH_TYPE: rocm
DOCKER_IMAGE: libtorch-cxx11-builder
DOCKER_IMAGE_TAG_PREFIX: rocm6.4
LIBTORCH_CONFIG: release
LIBTORCH_VARIANT: shared-with-deps
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
timeout-minutes: 300
build_name: libtorch-rocm6_4-shared-with-deps-release
build_environment: linux-binary-libtorch
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
libtorch-rocm6_4-shared-with-deps-release-test: # Testing
if: ${{ github.repository_owner == 'pytorch' }}
needs:
- libtorch-rocm6_4-shared-with-deps-release-build
- get-label-type
runs-on: linux.rocm.gpu.mi250
timeout-minutes: 240
env:
PYTORCH_ROOT: /pytorch
PACKAGE_TYPE: libtorch
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: rocm6.4
GPU_ARCH_VERSION: "6.4"
GPU_ARCH_TYPE: rocm
SKIP_ALL_TESTS: 1
DOCKER_IMAGE: libtorch-cxx11-builder
DOCKER_IMAGE_TAG_PREFIX: rocm6.4
LIBTORCH_CONFIG: release
LIBTORCH_VARIANT: shared-with-deps
permissions:
id-token: write
contents: read
steps:
- name: Setup ROCm
uses: ./.github/actions/setup-rocm
- uses: actions/download-artifact@v4.1.7
name: Download Build Artifacts
with:
name: libtorch-rocm6_4-shared-with-deps-release
path: "${{ runner.temp }}/artifacts/"
- name: Checkout PyTorch
uses: actions/checkout@v4
with:
ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }}
submodules: recursive
path: pytorch
show-progress: false
- name: Clean PyTorch checkout
run: |
# Remove any artifacts from the previous checkouts
git clean -fxd
working-directory: pytorch
- name: ROCm set GPU_FLAG
run: |
echo "GPU_FLAG=--device=/dev/mem --device=/dev/kfd --device=/dev/dri --group-add video --group-add daemon" >> "${GITHUB_ENV}"
- name: configure aws credentials
id: aws_creds
if: ${{ startsWith(github.event.ref, 'refs/tags/ciflow/') }}
uses: aws-actions/configure-aws-credentials@v4
with:
role-to-assume: arn:aws:iam::308535385114:role/gha_workflow_s3_and_ecr_read_only
aws-region: us-east-1
role-duration-seconds: 18000
- name: Calculate docker image
id: calculate-docker-image
uses: pytorch/test-infra/.github/actions/calculate-docker-image@main
with:
docker-registry: ${{ startsWith(github.event.ref, 'refs/tags/ciflow/') && '308535385114.dkr.ecr.us-east-1.amazonaws.com' || 'docker.io' }}
docker-image-name: libtorch-cxx11-builder
custom-tag-prefix: rocm6.4
docker-build-dir: .ci/docker
working-directory: pytorch
- name: Pull Docker image
uses: pytorch/test-infra/.github/actions/pull-docker-image@main
with:
docker-image: ${{ steps.calculate-docker-image.outputs.docker-image }}
- name: Test Pytorch binary
uses: ./pytorch/.github/actions/test-pytorch-binary
env:
DOCKER_IMAGE: ${{ steps.calculate-docker-image.outputs.docker-image }}
- name: Teardown ROCm
uses: ./.github/actions/teardown-rocm
libtorch-rocm6_4-shared-with-deps-release-upload: # Uploading
if: ${{ github.repository_owner == 'pytorch' }}
permissions:
id-token: write
contents: read
needs: libtorch-rocm6_4-shared-with-deps-release-test
with:
PYTORCH_ROOT: /pytorch
PACKAGE_TYPE: libtorch
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: rocm6.4
GPU_ARCH_VERSION: "6.4"
GPU_ARCH_TYPE: rocm
DOCKER_IMAGE: libtorch-cxx11-builder
DOCKER_IMAGE_TAG_PREFIX: rocm6.4
LIBTORCH_CONFIG: release
LIBTORCH_VARIANT: shared-with-deps
build_name: libtorch-rocm6_4-shared-with-deps-release
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
uses: ./.github/workflows/_binary-upload.yml
libtorch-rocm7_0-shared-with-deps-release-build:
if: ${{ github.repository_owner == 'pytorch' }}
uses: ./.github/workflows/_binary-build-linux.yml
@@ -619,3 +501,121 @@ jobs:
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
uses: ./.github/workflows/_binary-upload.yml
libtorch-rocm7_1-shared-with-deps-release-build:
if: ${{ github.repository_owner == 'pytorch' }}
uses: ./.github/workflows/_binary-build-linux.yml
needs: get-label-type
with:
PYTORCH_ROOT: /pytorch
PACKAGE_TYPE: libtorch
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: rocm7.1
GPU_ARCH_VERSION: "7.1"
GPU_ARCH_TYPE: rocm
DOCKER_IMAGE: libtorch-cxx11-builder
DOCKER_IMAGE_TAG_PREFIX: rocm7.1
LIBTORCH_CONFIG: release
LIBTORCH_VARIANT: shared-with-deps
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
timeout-minutes: 300
build_name: libtorch-rocm7_1-shared-with-deps-release
build_environment: linux-binary-libtorch
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
libtorch-rocm7_1-shared-with-deps-release-test: # Testing
if: ${{ github.repository_owner == 'pytorch' }}
needs:
- libtorch-rocm7_1-shared-with-deps-release-build
- get-label-type
runs-on: linux.rocm.gpu.mi250
timeout-minutes: 240
env:
PYTORCH_ROOT: /pytorch
PACKAGE_TYPE: libtorch
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: rocm7.1
GPU_ARCH_VERSION: "7.1"
GPU_ARCH_TYPE: rocm
SKIP_ALL_TESTS: 1
DOCKER_IMAGE: libtorch-cxx11-builder
DOCKER_IMAGE_TAG_PREFIX: rocm7.1
LIBTORCH_CONFIG: release
LIBTORCH_VARIANT: shared-with-deps
permissions:
id-token: write
contents: read
steps:
- name: Setup ROCm
uses: ./.github/actions/setup-rocm
- uses: actions/download-artifact@v4.1.7
name: Download Build Artifacts
with:
name: libtorch-rocm7_1-shared-with-deps-release
path: "${{ runner.temp }}/artifacts/"
- name: Checkout PyTorch
uses: actions/checkout@v4
with:
ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }}
submodules: recursive
path: pytorch
show-progress: false
- name: Clean PyTorch checkout
run: |
# Remove any artifacts from the previous checkouts
git clean -fxd
working-directory: pytorch
- name: ROCm set GPU_FLAG
run: |
echo "GPU_FLAG=--device=/dev/mem --device=/dev/kfd --device=/dev/dri --group-add video --group-add daemon" >> "${GITHUB_ENV}"
- name: configure aws credentials
id: aws_creds
if: ${{ startsWith(github.event.ref, 'refs/tags/ciflow/') }}
uses: aws-actions/configure-aws-credentials@v4
with:
role-to-assume: arn:aws:iam::308535385114:role/gha_workflow_s3_and_ecr_read_only
aws-region: us-east-1
role-duration-seconds: 18000
- name: Calculate docker image
id: calculate-docker-image
uses: pytorch/test-infra/.github/actions/calculate-docker-image@main
with:
docker-registry: ${{ startsWith(github.event.ref, 'refs/tags/ciflow/') && '308535385114.dkr.ecr.us-east-1.amazonaws.com' || 'docker.io' }}
docker-image-name: libtorch-cxx11-builder
custom-tag-prefix: rocm7.1
docker-build-dir: .ci/docker
working-directory: pytorch
- name: Pull Docker image
uses: pytorch/test-infra/.github/actions/pull-docker-image@main
with:
docker-image: ${{ steps.calculate-docker-image.outputs.docker-image }}
- name: Test Pytorch binary
uses: ./pytorch/.github/actions/test-pytorch-binary
env:
DOCKER_IMAGE: ${{ steps.calculate-docker-image.outputs.docker-image }}
- name: Teardown ROCm
uses: ./.github/actions/teardown-rocm
libtorch-rocm7_1-shared-with-deps-release-upload: # Uploading
if: ${{ github.repository_owner == 'pytorch' }}
permissions:
id-token: write
contents: read
needs: libtorch-rocm7_1-shared-with-deps-release-test
with:
PYTORCH_ROOT: /pytorch
PACKAGE_TYPE: libtorch
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: rocm7.1
GPU_ARCH_VERSION: "7.1"
GPU_ARCH_TYPE: rocm
DOCKER_IMAGE: libtorch-cxx11-builder
DOCKER_IMAGE_TAG_PREFIX: rocm7.1
LIBTORCH_CONFIG: release
LIBTORCH_VARIANT: shared-with-deps
build_name: libtorch-rocm7_1-shared-with-deps-release
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
uses: ./.github/workflows/_binary-upload.yml

File diff suppressed because it is too large.

View File

@@ -0,0 +1,148 @@
name: inductor-perf-nightly-xpu
on:
push:
tags:
- ciflow/inductor-perf-test-nightly-xpu/*
schedule:
- cron: 30 17 * * *
workflow_dispatch:
inputs:
training:
description: Run training (on by default)?
required: false
type: boolean
default: true
inference:
description: Run inference (on by default)?
required: false
type: boolean
default: true
default:
description: Run inductor_default?
required: false
type: boolean
default: false
dynamic:
description: Run inductor_dynamic_shapes?
required: false
type: boolean
default: false
cppwrapper:
description: Run inductor_cpp_wrapper?
required: false
type: boolean
default: false
cudagraphs:
description: Run inductor_cudagraphs?
required: false
type: boolean
default: false
freezing_cudagraphs:
description: Run inductor_cudagraphs with freezing for inference?
required: false
type: boolean
default: false
aotinductor:
description: Run aot_inductor for inference?
required: false
type: boolean
default: false
maxautotune:
description: Run inductor_max_autotune?
required: false
type: boolean
default: false
benchmark_configs:
description: The list of configs used by the benchmark
required: false
type: string
default: inductor_huggingface_perf,inductor_timm_perf,inductor_torchbench_perf,cachebench
concurrency:
group: ${{ github.workflow }}-${{ github.event.pull_request.number || github.ref_name }}-${{ github.ref_type == 'branch' && github.sha }}-${{ github.event_name == 'workflow_dispatch' }}-${{ github.event_name == 'schedule' }}
cancel-in-progress: true
permissions: read-all
jobs:
get-label-type:
name: get-label-type
uses: pytorch/pytorch/.github/workflows/_runner-determinator.yml@main
if: ${{ (github.event_name != 'schedule' || github.repository == 'pytorch/pytorch') && github.repository_owner == 'pytorch' }}
with:
triggering_actor: ${{ github.triggering_actor }}
issue_owner: ${{ github.event.pull_request.user.login || github.event.issue.user.login }}
curr_branch: ${{ github.head_ref || github.ref_name }}
curr_ref_type: ${{ github.ref_type }}
opt_out_experiments: lf
xpu-n-py3_10-inductor-benchmark-build:
name: xpu-n-py3.10-inductor-benchmark
uses: ./.github/workflows/_linux-build.yml
needs: get-label-type
with:
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
build-environment: linux-jammy-xpu-n-py3.10
docker-image-name: ci-image:pytorch-linux-jammy-xpu-n-py3-inductor-benchmarks
runner: linux.c7i.12xlarge
test-matrix: |
{ include: [
{ config: "inductor_huggingface_perf_xpu", shard: 1, num_shards: 5, runner: "linux.idc.xpu" },
{ config: "inductor_huggingface_perf_xpu", shard: 2, num_shards: 5, runner: "linux.idc.xpu" },
{ config: "inductor_huggingface_perf_xpu", shard: 3, num_shards: 5, runner: "linux.idc.xpu" },
{ config: "inductor_huggingface_perf_xpu", shard: 4, num_shards: 5, runner: "linux.idc.xpu" },
{ config: "inductor_huggingface_perf_xpu", shard: 5, num_shards: 5, runner: "linux.idc.xpu" },
{ config: "inductor_timm_perf_xpu", shard: 1, num_shards: 6, runner: "linux.idc.xpu" },
{ config: "inductor_timm_perf_xpu", shard: 2, num_shards: 6, runner: "linux.idc.xpu" },
{ config: "inductor_timm_perf_xpu", shard: 3, num_shards: 6, runner: "linux.idc.xpu" },
{ config: "inductor_timm_perf_xpu", shard: 4, num_shards: 6, runner: "linux.idc.xpu" },
{ config: "inductor_timm_perf_xpu", shard: 5, num_shards: 6, runner: "linux.idc.xpu" },
{ config: "inductor_timm_perf_xpu", shard: 6, num_shards: 6, runner: "linux.idc.xpu" },
{ config: "inductor_torchbench_perf_xpu", shard: 1, num_shards: 6, runner: "linux.idc.xpu" },
{ config: "inductor_torchbench_perf_xpu", shard: 2, num_shards: 6, runner: "linux.idc.xpu" },
{ config: "inductor_torchbench_perf_xpu", shard: 3, num_shards: 6, runner: "linux.idc.xpu" },
{ config: "inductor_torchbench_perf_xpu", shard: 4, num_shards: 6, runner: "linux.idc.xpu" },
{ config: "inductor_torchbench_perf_xpu", shard: 5, num_shards: 6, runner: "linux.idc.xpu" },
{ config: "inductor_torchbench_perf_xpu", shard: 6, num_shards: 6, runner: "linux.idc.xpu" },
]}
secrets: inherit
xpu-n-py3_10-inductor-benchmark-test-nightly:
permissions:
id-token: write
contents: read
if: github.event_name != 'workflow_dispatch'
name: xpu-n-py3.10-inductor-benchmark
uses: ./.github/workflows/_xpu-test.yml
needs: xpu-n-py3_10-inductor-benchmark-build
with:
build-environment: linux-jammy-xpu-n-py3.10
dashboard-tag: training-true-inference-true-default-true-dynamic-true-cudagraphs-false-cppwrapper-true-aotinductor-true-freezing_cudagraphs-false-cudagraphs_low_precision-false
docker-image: ${{ needs.xpu-n-py3_10-inductor-benchmark-build.outputs.docker-image }}
test-matrix: ${{ needs.xpu-n-py3_10-inductor-benchmark-build.outputs.test-matrix }}
timeout-minutes: 720
# Disable monitor in perf tests for more investigation
disable-monitor: true
monitor-log-interval: 10
monitor-data-collect-interval: 2
secrets: inherit
xpu-n-py3_10-inductor-benchmark-test:
permissions:
id-token: write
contents: read
if: github.event_name == 'workflow_dispatch'
name: xpu-n-py3.10-inductor-test
uses: ./.github/workflows/_xpu-test.yml
needs: xpu-n-py3_10-inductor-benchmark-build
with:
build-environment: linux-jammy-xpu-n-py3.10
dashboard-tag: training-${{ inputs.training }}-inference-${{ inputs.inference }}-default-${{ inputs.default }}-dynamic-${{ inputs.dynamic }}-cudagraphs-${{ inputs.cudagraphs }}-cppwrapper-${{ inputs.cppwrapper }}-aotinductor-${{ inputs.aotinductor }}-maxautotune-${{ inputs.maxautotune }}-freezing_cudagraphs-${{ inputs.freezing_cudagraphs }}-cudagraphs_low_precision-${{ inputs.cudagraphs }}
docker-image: ${{ needs.xpu-n-py3_10-inductor-benchmark-build.outputs.docker-image }}
test-matrix: ${{ needs.xpu-n-py3_10-inductor-benchmark-build.outputs.test-matrix }}
timeout-minutes: 720
disable-monitor: false
monitor-log-interval: 15
monitor-data-collect-interval: 4
secrets: inherit

.gitignore
View File

@@ -143,6 +143,7 @@ scripts/release_notes/*.json
sccache-stats*.json
lint.json
merge_record.json
.github/scripts/nightly_source_matrix.json
# These files get copied over on invoking setup.py
torchgen/packaged/*
@@ -397,3 +398,4 @@ CLAUDE.local.md
/test_*.py
/debug_*.py
CLAUDE_CONTEXT/
/.claude/settings.local.json

View File

@@ -11,7 +11,6 @@ aspects of contributing to PyTorch.
<!-- toc -->
- [Developing PyTorch](#developing-pytorch)
- [Setup the development environment](#setup-the-development-environment)
- [Tips and Debugging](#tips-and-debugging)
- [Nightly Checkout & Pull](#nightly-checkout--pull)
- [Codebase structure](#codebase-structure)
@@ -67,23 +66,6 @@ aspects of contributing to PyTorch.
Follow the instructions for [installing PyTorch from source](https://github.com/pytorch/pytorch#from-source). If you get stuck when developing PyTorch on your machine, check out the [tips and debugging](#tips-and-debugging) section below for common solutions.
### Setup the development environment
First, you need to [fork the PyTorch project on GitHub](https://github.com/pytorch/pytorch/fork) and follow the instructions at [Connecting to GitHub with SSH](https://docs.github.com/en/authentication/connecting-to-github-with-ssh) to setup your SSH authentication credentials.
Then clone the PyTorch project and setup the development environment:
```bash
git clone git@github.com:<USERNAME>/pytorch.git
cd pytorch
git remote add upstream git@github.com:pytorch/pytorch.git
make setup-env
# Or run `make setup-env-cuda` for pre-built CUDA binaries
# Or run `make setup-env-rocm` for pre-built ROCm binaries
source venv/bin/activate # or `. .\venv\Scripts\activate` on Windows
```
### Tips and Debugging
* If you want to have no-op incremental rebuilds (which are fast), see [Make no-op build fast](#make-no-op-build-fast) below.

View File

@@ -825,6 +825,14 @@ void Context::setDisplayVmapFallbackWarnings(bool enabled) {
display_vmap_fallback_warnings_ = enabled;
}
bool Context::warnOnAccumulateGradStreamMismatch() const {
return warn_on_accumulate_grad_stream_mismatch_;
}
void Context::setWarnOnAccumulateGradStreamMismatch(bool enabled) {
warn_on_accumulate_grad_stream_mismatch_ = enabled;
}
bool Context::isDefaultMobileCPUAllocatorSet() {
return prev_allocator_ptr_ != nullptr;
}

View File

@@ -404,6 +404,9 @@ class TORCH_API Context {
void setDisplayVmapFallbackWarnings(bool enabled);
bool areVmapFallbackWarningsEnabled() const;
void setWarnOnAccumulateGradStreamMismatch(bool enabled);
bool warnOnAccumulateGradStreamMismatch() const;
bool isDefaultMobileCPUAllocatorSet();
void setDefaultMobileCPUAllocator();
void unsetDefaultMobileCPUAllocator();
@@ -494,6 +497,7 @@ class TORCH_API Context {
bool release_original_weights = false;
#endif
bool display_vmap_fallback_warnings_ = false;
bool warn_on_accumulate_grad_stream_mismatch_ = true;
std::atomic<at::QEngine> quantized_engine = at::QEngine::NoQEngine;
bool enable_sparse_tensor_invariant_checks = false;
bool allow_fp16_reduction_cpu = false;

View File

@@ -19,6 +19,13 @@ inline namespace CPU_CAPABILITY {
#error "Big endian is not supported."
#endif
// GCC does not properly optimize bf16 operators
#if defined(__ARM_FEATURE_BF16) && (__clang_major__ >= 19)
#define BF16_ARITHMETIC_SUPPORTED() 1
#else
#define BF16_ARITHMETIC_SUPPORTED() 0
#endif
// Unlike the float16_t family of types, bfloat16_t is not available
// when we're not targeting bfloat16 hardware support on some
// platforms (but not Mac, so we have to be careful not to shadow the
@@ -352,18 +359,72 @@ class Vectorized<c10::BFloat16> : public Vectorized16<
other, &Vectorized<float>::name); \
}
DEFINE_UNARY_ELEMENTWISE_FUNC_VIA_FLOAT_METHOD(abs)
Vectorized frac() const;
DEFINE_UNARY_ELEMENTWISE_FUNC_VIA_FLOAT_METHOD(neg)
DEFINE_UNARY_ELEMENTWISE_FUNC_VIA_FLOAT_METHOD(trunc)
DEFINE_UNARY_ELEMENTWISE_FUNC_VIA_FLOAT_METHOD(sqrt)
#ifdef __ARM_FEATURE_BF16
// Flip sign bit
Vectorized<c10::BFloat16> neg() const {
return vreinterpretq_bf16_s16(vreinterpretq_s16_bf16(values) ^ (-32768));
}
// Fast reciprocal is fine because we are truncating results
Vectorized<c10::BFloat16> reciprocal() const {
auto x = vcvtq_low_f32_bf16(values);
auto y = vcvtq_high_f32_bf16(values);
x = vrecpeq_f32(x);
y = vrecpeq_f32(y);
return vcvtq_high_bf16_f32(vcvtq_low_bf16_f32(x), y);
}
// Clearing the sign bit
Vectorized<c10::BFloat16> abs() const {
return vreinterpretq_bf16_u16(vreinterpretq_u16_bf16(values) & 0x7FFF);
}
#else
DEFINE_UNARY_ELEMENTWISE_FUNC_VIA_FLOAT_METHOD(abs)
DEFINE_UNARY_ELEMENTWISE_FUNC_VIA_FLOAT_METHOD(neg)
DEFINE_UNARY_ELEMENTWISE_FUNC_VIA_FLOAT_METHOD(reciprocal)
#endif
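The NEON branch above implements `neg` and `abs` with single bitwise operations on the bf16 lanes: XOR with `0x8000` (the `-32768` literal reinterpreted as a 16-bit pattern) flips the sign bit, and AND with `0x7FFF` clears it. Since bfloat16 is just the top 16 bits of an IEEE-754 float32, the trick can be checked in plain Python:

```python
import struct

def f32_to_bf16_bits(x: float) -> int:
    # bfloat16 is the high 16 bits of a float32 (truncating conversion).
    return struct.unpack(">I", struct.pack(">f", x))[0] >> 16

def bf16_bits_to_f32(bits: int) -> float:
    return struct.unpack(">f", struct.pack(">I", (bits & 0xFFFF) << 16))[0]

def bf16_neg(bits: int) -> int:
    return bits ^ 0x8000  # flip the sign bit

def bf16_abs(bits: int) -> int:
    return bits & 0x7FFF  # clear the sign bit

b = f32_to_bf16_bits(-1.5)
print(bf16_bits_to_f32(bf16_neg(b)), bf16_bits_to_f32(bf16_abs(b)))  # -> 1.5 1.5
```

Because only the sign bit changes, neither operation can round, overflow, or disturb NaN payloads, which is why the vectorized code can skip the float round-trip used by the fallback path.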
// These functions are optimized on clang-21+
#if BF16_ARITHMETIC_SUPPORTED() && (__clang_major__ >= 21)
Vectorized<c10::BFloat16> operator==(
const Vectorized<c10::BFloat16>& other) const {
return values == other.values;
}
Vectorized<c10::BFloat16> operator!=(
const Vectorized<c10::BFloat16>& other) const {
return values != other.values;
}
Vectorized<c10::BFloat16> operator<(
const Vectorized<c10::BFloat16>& other) const {
return values < other.values;
}
Vectorized<c10::BFloat16> operator<=(
const Vectorized<c10::BFloat16>& other) const {
return values <= other.values;
}
Vectorized<c10::BFloat16> operator>(
const Vectorized<c10::BFloat16>& other) const {
return values > other.values;
}
Vectorized<c10::BFloat16> operator>=(
const Vectorized<c10::BFloat16>& other) const {
return values >= other.values;
}
#else
DEFINE_BINARY_COMPARISON_OPERATOR_VIA_FLOAT_METHOD(operator==)
DEFINE_BINARY_COMPARISON_OPERATOR_VIA_FLOAT_METHOD(operator!=)
DEFINE_BINARY_COMPARISON_OPERATOR_VIA_FLOAT_METHOD(operator<)
DEFINE_BINARY_COMPARISON_OPERATOR_VIA_FLOAT_METHOD(operator<=)
DEFINE_BINARY_COMPARISON_OPERATOR_VIA_FLOAT_METHOD(operator>)
DEFINE_BINARY_COMPARISON_OPERATOR_VIA_FLOAT_METHOD(operator>=)
#endif
#undef DEFINE_UNARY_ELEMENTWISE_FUNC_VIA_FLOAT_METHOD
#undef DEFINE_BINARY_ELEMENTWISE_FUNC_VIA_FLOAT_METHOD
@@ -412,28 +473,52 @@ template <>
Vectorized<c10::BFloat16> inline operator+(
const Vectorized<c10::BFloat16>& a,
const Vectorized<c10::BFloat16>& b) {
#if BF16_ARITHMETIC_SUPPORTED()
bfloat16x8_t x = a;
bfloat16x8_t y = b;
return x + y;
#else
return binary_operator_via_float(std::plus<Vectorized<float>>(), a, b);
#endif
}
template <>
Vectorized<c10::BFloat16> inline operator-(
const Vectorized<c10::BFloat16>& a,
const Vectorized<c10::BFloat16>& b) {
#if BF16_ARITHMETIC_SUPPORTED()
bfloat16x8_t x = a;
bfloat16x8_t y = b;
return x - y;
#else
return binary_operator_via_float(std::minus<Vectorized<float>>(), a, b);
#endif
}
template <>
Vectorized<c10::BFloat16> inline operator*(
const Vectorized<c10::BFloat16>& a,
const Vectorized<c10::BFloat16>& b) {
#if BF16_ARITHMETIC_SUPPORTED()
bfloat16x8_t x = a;
bfloat16x8_t y = b;
return x * y;
#else
return binary_operator_via_float(std::multiplies<Vectorized<float>>(), a, b);
#endif
}
template <>
Vectorized<c10::BFloat16> inline operator/(
const Vectorized<c10::BFloat16>& a,
const Vectorized<c10::BFloat16>& b) {
#if BF16_ARITHMETIC_SUPPORTED()
bfloat16x8_t x = a;
bfloat16x8_t y = b;
return x / y;
#else
return binary_operator_via_float(std::divides<Vectorized<float>>(), a, b);
#endif
}
// frac. Implement this here so we can use subtraction
@@ -544,12 +629,19 @@ Vectorized<c10::BFloat16> inline fmadd(
const Vectorized<c10::BFloat16>& a,
const Vectorized<c10::BFloat16>& b,
const Vectorized<c10::BFloat16>& c) {
#if BF16_ARITHMETIC_SUPPORTED()
bfloat16x8_t x = a;
bfloat16x8_t y = b;
bfloat16x8_t z = c;
return x * y + z;
#else
// NOTE [BF16 FMA]: There isn't an FMA that accumulates into BF16! Also,
// vbfmlalbq_f32 and vbfmlaltq_f32 take the even and odd-numbered
// elements, not the bottom and top half, so they don't seem
// particularly useful here. Ideally we would include dot product in
// the Vectorized interface...
return a * b + c;
#endif
}
template <>
@@ -557,8 +649,15 @@ Vectorized<c10::BFloat16> inline fnmadd(
const Vectorized<c10::BFloat16>& a,
const Vectorized<c10::BFloat16>& b,
const Vectorized<c10::BFloat16>& c) {
#if BF16_ARITHMETIC_SUPPORTED()
bfloat16x8_t x = a;
bfloat16x8_t y = b;
bfloat16x8_t z = c;
return (-x) * y + z;
#else
// See NOTE [BF16 FMA] above.
return -a * b + c;
#endif
}
template <>
@@ -566,8 +665,15 @@ Vectorized<c10::BFloat16> inline fmsub(
const Vectorized<c10::BFloat16>& a,
const Vectorized<c10::BFloat16>& b,
const Vectorized<c10::BFloat16>& c) {
#if BF16_ARITHMETIC_SUPPORTED()
bfloat16x8_t x = a;
bfloat16x8_t y = b;
bfloat16x8_t z = c;
return x * y - z;
#else
// See NOTE [BF16 FMA] above.
return a * b - c;
#endif
}
template <>
@@ -575,8 +681,15 @@ Vectorized<c10::BFloat16> inline fnmsub(
const Vectorized<c10::BFloat16>& a,
const Vectorized<c10::BFloat16>& b,
const Vectorized<c10::BFloat16>& c) {
#if BF16_ARITHMETIC_SUPPORTED()
bfloat16x8_t x = a;
bfloat16x8_t y = b;
bfloat16x8_t z = c;
return (-x) * y - z;
#else
// See NOTE [BF16 FMA] above.
return -a * b - c;
#endif
}
#endif // !defined(C10_MOBILE) && defined(__aarch64__)

View File

@@ -6,9 +6,9 @@ namespace at::vec {
inline namespace CPU_CAPABILITY {
#if (defined(__aarch64__) && !defined(CPU_CAPABILITY_SVE256))
// Enable auto-vectorization for GCC-13+ and clang-17+
// Enable auto-vectorization for clang-17+
// GCC-12 has a bug: gcc.gnu.org/bugzilla/show_bug.cgi?id=117001
#if __GNUC__ > 12 || (defined(__clang__) && (__clang_major__ >= 17))
#if defined(__clang__) && (__clang_major__ >= 17)
template <typename from_type, typename to_type>
inline void convertImpl(

View File

@@ -309,7 +309,7 @@ class Vectorized<float> {
DEFINE_SLEEF_COMPATIBLE_UNARY_ELEMENTWISE_FUNC(expm1)
// Implementation copied from Arm Optimized Routine
// https://github.com/ARM-software/optimized-routines/blob/master/math/aarch64/advsimd/expf.c
Vectorized<float> exp_u20() const {
inline Vectorized<float> vexpq_f32_u20() const {
// bail out to sleef if it's a special case:
// i.e. there's an input s.t. |input| > 87.3....
const float32x4_t special_bound = vdupq_n_f32(0x1.5d5e2ap+6f);
@@ -348,6 +348,9 @@
return vfmaq_f32(scale, poly, scale);
}
Vectorized<float> exp_u20() const {
return vexpq_f32_u20();
}
Vectorized<float> fexp_u20() const {
return exp_u20();
}
@@ -634,7 +637,7 @@ inline Vectorized<float> Vectorized<float>::erf() const {
// - exp(- x * x)
auto pow_2 = (*this) * (*this);
auto neg_pow_2 = pow_2 ^ neg_zero_vec;
auto tmp4 = neg_pow_2.exp();
auto tmp4 = neg_pow_2.vexpq_f32_u20();
auto tmp5 = tmp4 ^ neg_zero_vec;
// erf(x) = sign(x) * (1 - r * t * exp(- x * x))
auto tmp6 = t * tmp5;

View File

@@ -1,78 +1,90 @@
#include <ATen/cuda/CUDAGreenContext.h>
namespace at::cuda {
GreenContext::GreenContext(uint32_t device_id, uint32_t num_sms) {
#if CUDA_HAS_GREEN_CONTEXT
int driver_version;
C10_CUDA_CHECK(cudaDriverGetVersion(&driver_version));
TORCH_CHECK(
driver_version >= 12080, "cuda driver too old to use green context!");
CUcontext pctx = nullptr;
C10_CUDA_DRIVER_CHECK(c10::cuda::DriverAPI::get()->cuCtxGetCurrent_(&pctx));
if (C10_UNLIKELY(!pctx)) {
TORCH_WARN(
"Attempted to create a green context but"
" there was no primary context! Creating a primary context...");
cudaFree(0);
}
CUdevice device;
device_id_ = device_id;
C10_CUDA_DRIVER_CHECK(
c10::cuda::DriverAPI::get()->cuDeviceGet_(&device, device_id));
// Get device resources
CUdevResource device_resource;
C10_CUDA_DRIVER_CHECK(c10::cuda::DriverAPI::get()->cuDeviceGetDevResource_(
device, &device_resource, CU_DEV_RESOURCE_TYPE_SM));
// Split resources
std::vector<CUdevResource> result(1);
auto result_data = result.data();
unsigned int nb_groups = 1;
CUdevResource remaining;
C10_CUDA_DRIVER_CHECK(
c10::cuda::DriverAPI::get()->cuDevSmResourceSplitByCount_(
result_data,
&nb_groups,
&device_resource,
&remaining,
0, // default flags
num_sms));
TORCH_CHECK(nb_groups == 1, "Failed to create single resource group");
// Generate resource descriptor
CUdevResourceDesc desc;
C10_CUDA_DRIVER_CHECK(
c10::cuda::DriverAPI::get()->cuDevResourceGenerateDesc_(
&desc, result_data, 1));
// Create green context
// CU_GREEN_CTX_DEFAULT_STREAM is required per docs:
// https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__GREEN__CONTEXTS.html
C10_CUDA_DRIVER_CHECK(c10::cuda::DriverAPI::get()->cuGreenCtxCreate_(
&green_ctx_, desc, device, CU_GREEN_CTX_DEFAULT_STREAM));
// Convert to regular context
C10_CUDA_DRIVER_CHECK(
c10::cuda::DriverAPI::get()->cuCtxFromGreenCtx_(&context_, green_ctx_));
TORCH_CHECK(context_, "Green ctx conversion to regular ctx failed!");
#if defined(CUDA_VERSION) && !defined(USE_ROCM) && defined(PYTORCH_C10_DRIVER_API_SUPPORTED)
#include <c10/cuda/driver_api.h>
#include <stdexcept>
#include <vector>
#define HAS_CUDA_GREEN_CONTEXT() 1
#else
TORCH_CHECK(false, "Green Context is only supported on CUDA 12.8+!");
#define HAS_CUDA_GREEN_CONTEXT() 0
// Suppress unused private field warnings as this class is not supposed to be used
C10_DIAGNOSTIC_PUSH_AND_IGNORED_IF_DEFINED("-Wunused-private-field")
#endif
namespace at::cuda {
GreenContext::GreenContext(uint32_t device_id, uint32_t num_sms) {
#if HAS_CUDA_GREEN_CONTEXT()
int driver_version;
C10_CUDA_CHECK(cudaDriverGetVersion(&driver_version));
TORCH_CHECK(
driver_version >= 12080, "cuda driver too old to use green context!");
CUcontext pctx = nullptr;
C10_CUDA_DRIVER_CHECK(c10::cuda::DriverAPI::get()->cuCtxGetCurrent_(&pctx));
if (C10_UNLIKELY(!pctx)) {
TORCH_WARN(
"Attempted to create a green context but"
" there was no primary context! Creating a primary context...");
cudaFree(0);
}
CUdevice device;
device_id_ = device_id;
C10_CUDA_DRIVER_CHECK(
c10::cuda::DriverAPI::get()->cuDeviceGet_(&device, device_id));
// Get device resources
CUdevResource device_resource;
C10_CUDA_DRIVER_CHECK(c10::cuda::DriverAPI::get()->cuDeviceGetDevResource_(
device, &device_resource, CU_DEV_RESOURCE_TYPE_SM));
// Split resources
std::vector<CUdevResource> result(1);
auto result_data = result.data();
unsigned int nb_groups = 1;
CUdevResource remaining;
C10_CUDA_DRIVER_CHECK(
c10::cuda::DriverAPI::get()->cuDevSmResourceSplitByCount_(
result_data,
&nb_groups,
&device_resource,
&remaining,
0, // default flags
num_sms));
TORCH_CHECK(nb_groups == 1, "Failed to create single resource group");
// Generate resource descriptor
CUdevResourceDesc desc;
C10_CUDA_DRIVER_CHECK(
c10::cuda::DriverAPI::get()->cuDevResourceGenerateDesc_(
&desc, result_data, 1));
// Create green context
// CU_GREEN_CTX_DEFAULT_STREAM is required per docs:
// https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__GREEN__CONTEXTS.html
C10_CUDA_DRIVER_CHECK(c10::cuda::DriverAPI::get()->cuGreenCtxCreate_(
&green_ctx_, desc, device, CU_GREEN_CTX_DEFAULT_STREAM));
// Convert to regular context
C10_CUDA_DRIVER_CHECK(
c10::cuda::DriverAPI::get()->cuCtxFromGreenCtx_(&context_, green_ctx_));
TORCH_CHECK(context_, "Green ctx conversion to regular ctx failed!");
#else
TORCH_CHECK(false, "Green Context is only supported on CUDA 12.8+!");
#endif
}
std::unique_ptr<GreenContext> GreenContext::create(
uint32_t num_sms,
std::optional<uint32_t> device_id) {
#if CUDA_HAS_GREEN_CONTEXT
#if HAS_CUDA_GREEN_CONTEXT()
if (!device_id.has_value()) {
device_id = at::cuda::current_device();
}
return std::make_unique<GreenContext>(device_id.value(), num_sms);
return std::unique_ptr<GreenContext>(new GreenContext(device_id.value(), num_sms));
#else
TORCH_CHECK(false, "Green Context is only supported on CUDA 12.8+!");
#endif
@ -80,7 +92,7 @@ namespace at::cuda {
// Implement move operations
GreenContext::GreenContext(GreenContext&& other) noexcept{
#if CUDA_HAS_GREEN_CONTEXT
#if HAS_CUDA_GREEN_CONTEXT()
device_id_ = std::exchange(other.device_id_, -1);
green_ctx_ = std::exchange(other.green_ctx_, nullptr);
context_ = std::exchange(other.context_, nullptr);
@ -91,7 +103,7 @@ namespace at::cuda {
}
GreenContext& GreenContext::operator=(GreenContext&& other) noexcept{
#if CUDA_HAS_GREEN_CONTEXT
#if HAS_CUDA_GREEN_CONTEXT()
if (this != &other) {
// Clean up current resources
if (green_ctx_) {
@ -120,7 +132,7 @@ namespace at::cuda {
}
GreenContext::~GreenContext() noexcept{
#if CUDA_HAS_GREEN_CONTEXT
#if HAS_CUDA_GREEN_CONTEXT()
C10_CUDA_DRIVER_CHECK(
c10::cuda::DriverAPI::get()->cuGreenCtxDestroy_(green_ctx_));
#else
@ -128,25 +140,9 @@ namespace at::cuda {
#endif
}
// Get the underlying CUDA context
CUcontext GreenContext::getContext() const {
#if CUDA_HAS_GREEN_CONTEXT
return context_;
#else
TORCH_CHECK(false, "Green Context is only supported on CUDA 12.8+!");
#endif
}
// Get the underlying green context
#if CUDA_HAS_GREEN_CONTEXT
CUgreenCtx GreenContext::getGreenContext() const {
return green_ctx_;
}
#endif
// Make this context current
void GreenContext::setContext() {
#if CUDA_HAS_GREEN_CONTEXT
#if HAS_CUDA_GREEN_CONTEXT()
auto current_stream = c10::cuda::getCurrentCUDAStream();
parent_stream_ = current_stream.stream();
@ -175,7 +171,7 @@ namespace at::cuda {
}
void GreenContext::popContext() {
#if CUDA_HAS_GREEN_CONTEXT
#if HAS_CUDA_GREEN_CONTEXT()
// see above note about stream being hardcoded to the default stream
at::cuda::CUDAEvent ev;
ev.record(c10::cuda::getCurrentCUDAStream());

View File

@ -1,53 +1,38 @@
#pragma once
#include <ATen/cuda/CUDAEvent.h>
#if defined(CUDA_VERSION) && !defined(USE_ROCM) && defined(PYTORCH_C10_DRIVER_API_SUPPORTED)
#include <c10/cuda/driver_api.h>
#include <cuda.h>
#include <memory>
#include <stdexcept>
#include <vector>
#define CUDA_HAS_GREEN_CONTEXT 1
#else
#define CUDA_HAS_GREEN_CONTEXT 0
#endif
// Forward declare green context as opaque ptr
typedef struct CUgreenCtx_st* CUgreenCtx;
namespace at::cuda {
class TORCH_CUDA_CPP_API GreenContext {
public:
GreenContext(uint32_t device_id, uint32_t num_sms);
static std::unique_ptr<GreenContext> create(uint32_t num_sms, std::optional<uint32_t> device_id);
// Green context creation
static std::unique_ptr<GreenContext> create(
uint32_t num_sms,
std::optional<uint32_t> device_id);
~GreenContext() noexcept;
// Delete copy constructor and assignment
GreenContext(const GreenContext&) = delete;
GreenContext& operator=(const GreenContext&) = delete;
// Implement move operations
GreenContext(GreenContext&& other) noexcept;
GreenContext& operator=(GreenContext&& other) noexcept;
~GreenContext() noexcept;
// Get the underlying CUDA context
CUcontext getContext() const;
// Get the underlying green context
#if CUDA_HAS_GREEN_CONTEXT
CUgreenCtx getGreenContext() const;
#endif
// Make this context current
void setContext();
void popContext();
private:
#if CUDA_HAS_GREEN_CONTEXT
GreenContext(uint32_t device_id, uint32_t num_sms);
// Implement move operations
GreenContext(GreenContext&& other) noexcept;
GreenContext& operator=(GreenContext&& other) noexcept;
int32_t device_id_ = -1;
CUgreenCtx green_ctx_ = nullptr;
CUcontext context_ = nullptr;
cudaStream_t parent_stream_ = nullptr;
#endif
};
} // namespace at::cuda

View File

@ -7,17 +7,6 @@
#endif
#if defined(USE_ROCM)
// hipSparse const API added in v2.4.0
#if HIPSPARSE_VERSION >= 200400
#define AT_USE_HIPSPARSE_GENERIC_API() 1
#else
#define AT_USE_HIPSPARSE_GENERIC_API() 1
#endif
#else // USE_ROCM
#define AT_USE_HIPSPARSE_GENERIC_API() 0
#endif // USE_ROCM
// cuSparse Generic API spsv function was added in CUDA 11.3.0
#if defined(CUDART_VERSION) && defined(CUSPARSE_VERSION) && (CUSPARSE_VERSION >= 11500)
#define AT_USE_CUSPARSE_GENERIC_SPSV() 1

View File

@ -2,8 +2,6 @@
#include <ATen/Tensor.h>
#include <ATen/cuda/Exceptions.h>
#include <mutex>
namespace at {
namespace cuda {
namespace detail {
@ -12,39 +10,36 @@ __device__ __constant__ float cublas_one_device;
__device__ __constant__ float cublas_zero_device;
float *get_cublas_device_one() {
static c10::once_flag init_flag;
c10::call_once(init_flag, []() {
static float *ptr = nullptr;
static auto init_flag = [&]() {
const float one = 1.f;
AT_CUDA_CHECK(cudaMemcpyToSymbol(cublas_one_device, &one, sizeof(float)));
});
AT_CUDA_CHECK(cudaGetSymbolAddress(reinterpret_cast<void**>(&ptr), cublas_one_device));
return true;
}();
float *ptr;
AT_CUDA_CHECK(cudaGetSymbolAddress(reinterpret_cast<void**>(&ptr), cublas_one_device));
return ptr;
}
float *get_cublas_device_zero() {
static c10::once_flag init_flag;
c10::call_once(init_flag, []() {
static float *ptr = nullptr;
static auto init_flag = [&]() {
const float zero = 0.f;
AT_CUDA_CHECK(cudaMemcpyToSymbol(cublas_zero_device, &zero, sizeof(float)));
});
AT_CUDA_CHECK(cudaGetSymbolAddress(reinterpret_cast<void**>(&ptr), cublas_zero_device));
return true;
}();
float *ptr;
AT_CUDA_CHECK(cudaGetSymbolAddress(reinterpret_cast<void**>(&ptr), cublas_zero_device));
return ptr;
}
float *get_user_alpha_ptr() {
static float *alpha_ptr;
static c10::once_flag init_flag;
c10::call_once(init_flag, []() {
static bool init_flag [[maybe_unused]] = []() {
AT_CUDA_CHECK(cudaMalloc(&alpha_ptr, sizeof(float)));
});
return true;
}();
return alpha_ptr;
}

View File

@ -1,5 +1,6 @@
#pragma once
#include <c10/core/CachingDeviceAllocator.h>
#include <c10/core/Device.h>
#include <c10/util/Exception.h>
@ -151,6 +152,36 @@ struct TORCH_API MTIAHooksInterface : AcceleratorHooksInterface {
}
virtual bool isAvailable() const override;
/* MTIAGraph related APIs */
virtual int64_t mtiagraphCreate(bool keep_graph = false) const {
FAIL_MTIAHOOKS_FUNC(__func__);
return -1;
}
virtual void mtiagraphCaptureBegin(int64_t handle, MempoolId_t pool) const {
FAIL_MTIAHOOKS_FUNC(__func__);
}
virtual void mtiagraphCaptureEnd(int64_t handle) const {
FAIL_MTIAHOOKS_FUNC(__func__);
}
virtual void mtiagraphInstantiate(int64_t handle) const {
FAIL_MTIAHOOKS_FUNC(__func__);
}
virtual void mtiagraphReplay(int64_t handle) const {
FAIL_MTIAHOOKS_FUNC(__func__);
}
virtual void mtiagraphReset(int64_t handle) const {
FAIL_MTIAHOOKS_FUNC(__func__);
}
virtual MempoolId_t mtiagraphPool(int64_t handle) const {
FAIL_MTIAHOOKS_FUNC(__func__);
}
};
struct TORCH_API MTIAHooksArgs {};

View File

@ -534,20 +534,20 @@ Tensor trace_decomp(const Tensor& tensor) {
std::tuple<Tensor, std::optional<int64_t>> tril_batch_rule(
const Tensor& self,
std::optional<int64_t> self_bdim,
int64_t diagonal = 0) {
c10::SymInt diagonal = 0) {
TORCH_CHECK(self.dim() >= 2, "tril: The input tensor must have at least 2 dimensions.");
auto self_ = moveBatchDimToFront(self, self_bdim);
auto result = at::tril(self_, diagonal);
auto result = at::tril_symint(self_, std::move(diagonal));
return std::make_tuple(std::move(result), 0);
}
std::tuple<Tensor, std::optional<int64_t>> triu_batch_rule(
const Tensor& self,
std::optional<int64_t> self_bdim,
int64_t diagonal = 0) {
c10::SymInt diagonal = 0) {
TORCH_CHECK(self.dim() >= 2, "triu: The input tensor must have at least 2 dimensions.");
auto self_ = moveBatchDimToFront(self, self_bdim);
auto result = at::triu(self_, diagonal);
auto result = at::triu_symint(self_, std::move(diagonal));
return std::make_tuple(std::move(result), 0);
}

View File

@ -1,7 +1,5 @@
// Copyright © 2022 Apple Inc.
#include <c10/util/CallOnce.h>
#include <ATen/mps/IndexKernels.h>
#include <ATen/mps/MPSAllocatorInterface.h>
#include <ATen/mps/MPSDevice.h>
@ -10,9 +8,6 @@
namespace at::mps {
static std::unique_ptr<MPSDevice> mps_device;
static c10::once_flag mpsdev_init;
static inline MTLLanguageVersion getMetalLanguageVersion(const id<MTLDevice>& device) {
// MPS Advanced Indexing needs at least Metal 2.0 (support for Argument Buffers and function constants)
// host_name attribute needs at least Metal 2.2 and ulong needs Metal 2.3 (supported on MacOS 11+
@ -21,8 +16,8 @@ static inline MTLLanguageVersion getMetalLanguageVersion(const id<MTLDevice>& de
}
MPSDevice* MPSDevice::getInstance() {
c10::call_once(mpsdev_init, [] { mps_device = std::unique_ptr<MPSDevice>(new MPSDevice()); });
return mps_device.get();
static MPSDevice mps_device;
return &mps_device;
}
MPSDevice::~MPSDevice() {

View File

@ -25,18 +25,19 @@ TORCH_PRECOMPUTE_META_FUNC(avg_pool2d)
// #20866, #22032: Guarantee this for the official C++ API?
TORCH_CHECK(kernel_size.size() == 1 || kernel_size.size() == 2,
"avg_pool2d: kernel_size must either be a single int, or a tuple of two ints");
const int64_t kH = kernel_size[0];
const int64_t kW = kernel_size.size() == 1 ? kH : kernel_size[1];
const int kH = safe_downcast<int, int64_t>(kernel_size[0]);
const int kW = kernel_size.size() == 1 ? kH : safe_downcast<int, int64_t>(kernel_size[1]);
TORCH_CHECK(stride.empty() || stride.size() == 1 || stride.size() == 2,
"avg_pool2d: stride must either be omitted, a single int, or a tuple of two ints");
const int64_t dH = stride.empty() ? kH : stride[0];
const int64_t dW = stride.empty() ? kW : stride.size() == 1 ? dH : stride[1];
const int dH = stride.empty() ? kH : safe_downcast<int, int64_t>(stride[0]);
const int dW = stride.empty() ? kW :
stride.size() == 1 ? dH : safe_downcast<int, int64_t>(stride[1]);
TORCH_CHECK(padding.size() == 1 || padding.size() == 2,
"avg_pool2d: padding must either be a single int, or a tuple of two ints");
const int64_t padH = padding[0];
const int64_t padW = padding.size() == 1 ? padH : padding[1];
const int padH = safe_downcast<int, int64_t>(padding[0]);
const int padW = padding.size() == 1 ? padH : safe_downcast<int, int64_t>(padding[1]);
TORCH_CHECK(!divisor_override.has_value() || divisor_override.value() != 0,
"divisor must not be zero");

View File

@ -410,8 +410,8 @@ struct ConvParams {
return false;
}
static long cudnn_version = detail::getCUDAHooks().versionCuDNN();
// broken on cuDNN 9.8
if (cudnn_version >= 90800) {
// broken on cuDNN 9.8 - 9.14
if (cudnn_version >= 90800 && cudnn_version < 91500) {
if (cudnn_conv_suggest_memory_format(input, weight) == at::MemoryFormat::Contiguous &&
(input.scalar_type() == at::kBFloat16 || input.scalar_type() == at::kHalf) &&
weight.dim() == 5) {

View File

@ -640,7 +640,7 @@ Tensor einsum(std::string_view equation, TensorList operands, at::OptionalIntArr
}
}
return ops[0];
return std::move(ops[0]);
}
// _trilinear computes a trilinear einstein sum with an unrolled dimension

View File

@ -139,7 +139,7 @@ void smooth_l1_backward_cpu_kernel(TensorIterator& iter, const Scalar& norm, dou
}
);
} else {
AT_DISPATCH_ALL_TYPES(dtype, "smooth_l1_backward_cpu_out", [&] {
AT_DISPATCH_ALL_TYPES_AND(kHalf, dtype, "smooth_l1_backward_cpu_out", [&] {
auto norm_val = norm.to<scalar_t>();
scalar_t beta_val(beta);
auto norm_val_vec = Vectorized<scalar_t>(norm_val);

View File

@ -170,10 +170,14 @@ static bool isInputCompliesAddmmCudaLt(Tensor& result, const Tensor& self, const
#if defined(CUDA_VERSION) || defined(USE_ROCM)
const auto scalar_type = mat1.scalar_type();
return (beta.toComplexDouble() == 1.0
// self.dim() == 1 && result.dim() == 2 && self.sizes()[0] == mat2_sizes[1]
// is to use lt interface only when self is bias.
&& self.dim() == 1 && self.sizes()[0] == mat2_sizes[1] && self.is_contiguous()
&& result.dim() == 2 && result.is_contiguous()
// Conditions for bias to be fusable
&& (
self.is_contiguous() &&
// NOTE: fine to have 1-len dims to the left from the right-most one
(self.dim() == 1 || self.squeeze().dim() == 1) &&
self.sizes().back() == mat2_sizes[1]
)
&& ( // some dtype restrictions
#ifndef USE_ROCM
scalar_type == at::ScalarType::Double ||

View File

@ -13,7 +13,7 @@ __global__ void vectorized_gather_kernel(char * out, char * inp, index_t * idx,
if (allow_neg_indices) {
ind = (ind < 0) ? ind + ind_dim_size : ind;
}
CUDA_KERNEL_ASSERT(ind >=0 && ind < ind_dim_size && "vectorized gather kernel index out of bounds");
CUDA_KERNEL_ASSERT_VERBOSE(ind >=0 && ind < ind_dim_size && "vectorized gather kernel index out of bounds", "Expected 0 <= index < ind_dim_size(%ld), but got index = %ld", ind_dim_size, ind);
int32_t off = (blockDim.x * blockIdx.y + threadIdx.x) * Alignment; // off is guaranteed to be within int32 limits
if (off >= slice_size) return;
auto vec = at::native::memory::ld_vec<Alignment>(inp + ind * inp_stride + off);

View File

@ -794,6 +794,24 @@ void _check_deepseek_scale_stride(const Tensor& scale, const Tensor& t, const Sc
}
}
void
_check_deepseek_support() {
#ifndef USE_ROCM
auto dprops = at::cuda::getCurrentDeviceProperties();
// Only on Hopper GPUs
TORCH_CHECK_NOT_IMPLEMENTED(
dprops->major == 9,
"DeepSeek style (1x128, 128x128) scaling only supported in CUDA for SM90");
// Only in cublasLt >= 12.9
TORCH_CHECK_NOT_IMPLEMENTED(
CUBLAS_VERSION >= 120900 && cublasLtGetVersion() >= 120900,
"DeepSeek style (1x128, 128x128) scaling requires cublasLt >= 12.9"
);
#endif
}
Tensor&
_scaled_block1x128_block1x128(
const Tensor& mat_a, const Tensor& mat_b,
@ -802,8 +820,12 @@ _scaled_block1x128_block1x128(
const c10::ScalarType out_dtype,
const bool use_fast_accum,
Tensor& out) {
#ifndef USE_ROCM
// Restrictions:
// A, B are FP8, scales are fp32, shape K//128
// CUDA: Only Hopper GPUs
_check_deepseek_support();
TORCH_CHECK_VALUE(isFloat8Type(mat_a.scalar_type()) && isFloat8Type(mat_b.scalar_type()), "mat_a and mat_b must be fp8 types, got: ",
mat_a.scalar_type(), mat_b.scalar_type());
TORCH_CHECK_VALUE(scale_a.sizes()[0] == mat_a.sizes()[0] && scale_a.sizes()[1] == mat_a.sizes()[1] / 128 && scale_a.scalar_type() == kFloat,
@ -821,6 +843,12 @@ _scaled_block1x128_block1x128(
_scaled_gemm(mat_a, mat_b, scale_a, scale_b, scaling_choice_a, scaling_choice_b, bias, use_fast_accum, out);
return out;
#else
TORCH_CHECK_NOT_IMPLEMENTED(
false,
"1x128 and 128x128 scaling not available with ROCm"
);
#endif
}
Tensor&
@ -831,10 +859,12 @@ _scaled_block128x128_block1x128(
const c10::ScalarType out_dtype,
const bool use_fast_accum,
Tensor& out) {
#ifndef USE_ROCM
// Restrictions:
// A, B are FP8, scales are fp32, shape K//128
// CUDA: Only Hopper GPUs
_check_deepseek_support();
TORCH_CHECK_VALUE(isFloat8Type(mat_a.scalar_type()) && isFloat8Type(mat_b.scalar_type()), "mat_a and mat_b must be fp8 types, got: ",
mat_a.scalar_type(), mat_b.scalar_type());
TORCH_CHECK_VALUE(scale_a.sizes()[0] == ceil_div<int64_t>(mat_a.sizes()[0], 128) && scale_a.sizes()[1] == ceil_div<int64_t>(mat_a.sizes()[1], 128) && scale_a.scalar_type() == kFloat,
@ -852,6 +882,12 @@ _scaled_block128x128_block1x128(
_scaled_gemm(mat_a, mat_b, scale_a, scale_b, scaling_choice_a, scaling_choice_b, bias, use_fast_accum, out);
return out;
#else
TORCH_CHECK_NOT_IMPLEMENTED(
false,
"1x128 and 128x128 scaling not available with ROCm"
);
#endif
}
Tensor&
@ -862,8 +898,12 @@ _scaled_block1x128_block128x128(
const c10::ScalarType out_dtype,
const bool use_fast_accum,
Tensor& out) {
#ifndef USE_ROCM
// Restrictions:
// A, B are FP8, scales are fp32, A: shape K//128, B: K//128, N//128
// CUDA: Only Hopper GPUs
_check_deepseek_support();
TORCH_CHECK_VALUE(isFloat8Type(mat_a.scalar_type()) && isFloat8Type(mat_b.scalar_type()), "mat_a and mat_b must be fp8 types, got: ",
mat_a.scalar_type(), mat_b.scalar_type());
TORCH_CHECK_VALUE(scale_a.sizes()[0] == mat_a.sizes()[0] && scale_a.sizes()[1] == mat_a.sizes()[1] / 128 && scale_a.scalar_type() == kFloat,
@ -881,6 +921,12 @@ _scaled_block1x128_block128x128(
_scaled_gemm(mat_a, mat_b, scale_a, scale_b, scaling_choice_a, scaling_choice_b, bias, use_fast_accum, out);
return out;
#else
TORCH_CHECK_NOT_IMPLEMENTED(
false,
"1x128 and 128x128 scaling not available with ROCm"
);
#endif
}
Tensor&

View File

@ -160,8 +160,8 @@ struct _cuda_scatter_gather_internal_kernel {
auto offsets = offset_calc.get(i);
int64_t idx_dim = *(index_t*)(index_ptr + offsets[2]);
CUDA_KERNEL_ASSERT(idx_dim >= 0 && idx_dim < index_size
&& "scatter gather kernel index out of bounds");
CUDA_KERNEL_ASSERT_VERBOSE(idx_dim >= 0 && idx_dim < index_size
&& "scatter gather kernel index out of bounds", "Expected 0 <= idx_dim < index_size (%ld), but got idx_dim = %ld", index_size, idx_dim);
f(
(scalar_t*)(self_ptr + offsets[0]),
@ -406,9 +406,8 @@ struct _cuda_scatter_fill_internal_kernel {
auto offsets = offset_calc.get(i);
int64_t idx_dim = *(index_t*)(index_ptr + offsets[1]);
CUDA_KERNEL_ASSERT(idx_dim >= 0 && idx_dim < index_size
&& "index out of bounds"
);
CUDA_KERNEL_ASSERT_VERBOSE(idx_dim >= 0 && idx_dim < index_size
&& "index out of bounds", "Expected 0 <= idx_dim < index_size (%ld), but got idx_dim = %ld", index_size, idx_dim);
f(
(scalar_t*)(self_ptr + offsets[0]),

View File

@ -141,7 +141,8 @@ WelfordDataLN cuWelfordOnlineSum(
if constexpr (!rms_norm){
U delta = val - curr_sum.mean;
U new_count = curr_sum.count + 1.f;
#if defined(USE_ROCM) && defined(USE_LAYERNORM_FAST_RECIPROCAL)
//Due to low CU count, we run into accuracy issues on gfx90a with `__builtin_amdgcn_rcpf`
#if defined(USE_ROCM) && !defined(__gfx90a__) && defined(USE_LAYERNORM_FAST_RECIPROCAL)
U new_mean = curr_sum.mean + delta * __builtin_amdgcn_rcpf(new_count);
#else
U new_mean = curr_sum.mean + delta * (1.f/new_count); //proper division is slow, this is less accurate but noticeably faster
@ -163,7 +164,8 @@ WelfordDataLN cuWelfordCombine(
U count = dataA.count + dataB.count;
U mean, sigma2;
if (count > decltype(dataB.count){0}) {
#if defined(USE_ROCM) && defined(USE_LAYERNORM_FAST_RECIPROCAL)
//Due to low CU count, we run into accuracy issues on gfx90a with `__builtin_amdgcn_rcpf`
#if defined(USE_ROCM) && !defined(__gfx90a__) && defined(USE_LAYERNORM_FAST_RECIPROCAL)
auto coef = __builtin_amdgcn_rcpf(count);
#else
auto coef = 1.f/count; //NB we don't use --use_fast_math, but this is emulation, 1./count goes to intrinsic, `* coef` is multiplication, instead of slow fp division

View File

@ -86,6 +86,28 @@ struct zeta_functor {
}
};
struct logaddexp_functor {
template <typename T, enable_if_t<is_floating_point_v<T>, bool> = true>
inline T operator()(const T a, const T b) {
return c10::metal::logaddexp(a, b);
}
template <typename T, enable_if_t<is_integral_v<T>, bool> = true>
inline float operator()(const T a, const T b) {
return c10::metal::logaddexp(float(a), float(b));
}
};
struct logaddexp2_functor {
template <typename T, enable_if_t<is_floating_point_v<T>, bool> = true>
inline T operator()(const T a, const T b) {
return c10::metal::logaddexp2(a, b);
}
template <typename T, enable_if_t<is_integral_v<T>, bool> = true>
inline float operator()(const T a, const T b) {
return c10::metal::logaddexp2(float(a), float(b));
}
};
struct xlog1py_functor {
template <typename T, enable_if_t<is_floating_point_v<T>, bool> = true>
inline T operator()(const T a, const T b) {
@ -377,6 +399,10 @@ REGISTER_FLOAT_BINARY_OP(fmin);
REGISTER_FLOAT_BINARY_OP(nextafter);
REGISTER_FLOAT_BINARY_OP(zeta);
REGISTER_INT2FLOAT_BINARY_OP(zeta);
REGISTER_FLOAT_BINARY_OP(logaddexp);
REGISTER_INT2FLOAT_BINARY_OP(logaddexp);
REGISTER_FLOAT_BINARY_OP(logaddexp2);
REGISTER_INT2FLOAT_BINARY_OP(logaddexp2);
REGISTER_FLOAT_BINARY_OP(xlog1py);
REGISTER_INT2FLOAT_BINARY_OP(xlog1py);
REGISTER_FLOAT_BINARY_OP(chebyshev_polynomial_t);
@ -463,6 +489,8 @@ REGISTER_BINARY_OP(add, float2, float2);
REGISTER_BINARY_OP(add, half2, half2);
REGISTER_BINARY_OP(sub, float2, float2);
REGISTER_BINARY_OP(sub, half2, half2);
REGISTER_BINARY_OP(logaddexp, float2, float2);
REGISTER_BINARY_OP(logaddexp, half2, half2);
REGISTER_BINARY_ALPHA_OP(add_alpha, float2, float2, float2);
REGISTER_BINARY_ALPHA_OP(add_alpha, half2, half2, half2);
REGISTER_BINARY_ALPHA_OP(sub_alpha, float2, float2, float2);

View File

@ -89,6 +89,14 @@ static void zeta_mps_kernel(TensorIteratorBase& iter) {
lib.exec_binary_kernel(iter, "zeta");
}
static void logaddexp_mps_kernel(TensorIteratorBase& iter) {
lib.exec_binary_kernel(iter, "logaddexp");
}
static void logaddexp2_mps_kernel(TensorIteratorBase& iter) {
lib.exec_binary_kernel(iter, "logaddexp2");
}
static void xlog1py_mps_kernel(TensorIteratorBase& iter) {
TORCH_CHECK_TYPE(isFloatingType(iter.common_dtype()), "xlog1py_mps not implemented for non-floating types");
lib.exec_binary_kernel(iter, "xlog1py");
@ -211,6 +219,8 @@ REGISTER_DISPATCH(fmin_stub, &fmin_mps_kernel)
REGISTER_DISPATCH(copysign_stub, &copysign_mps_kernel)
REGISTER_DISPATCH(nextafter_stub, &nextafter_mps_kernel)
REGISTER_DISPATCH(zeta_stub, &zeta_mps_kernel)
REGISTER_DISPATCH(logaddexp_stub, &logaddexp_mps_kernel);
REGISTER_DISPATCH(logaddexp2_stub, &logaddexp2_mps_kernel);
REGISTER_DISPATCH(xlog1py_stub, &xlog1py_mps_kernel)
REGISTER_DISPATCH(chebyshev_polynomial_t_stub, &chebyshev_polynomial_t_mps_kernel)
REGISTER_DISPATCH(chebyshev_polynomial_u_stub, &chebyshev_polynomial_u_mps_kernel)

View File

@ -17,8 +17,6 @@
#include <ATen/ops/ge_native.h>
#include <ATen/ops/gt_native.h>
#include <ATen/ops/le_native.h>
#include <ATen/ops/logaddexp2_native.h>
#include <ATen/ops/logaddexp_native.h>
#include <ATen/ops/logical_and_native.h>
#include <ATen/ops/logical_or_native.h>
#include <ATen/ops/logical_xor_native.h>
@ -277,30 +275,6 @@ TORCH_IMPL_FUNC(pow_Scalar_out_mps)(const Scalar& base, const Tensor& exp, const
}
}
TORCH_IMPL_FUNC(logaddexp_out_mps)(const Tensor& self, const Tensor& other, const Tensor& output) {
mps::BinaryOpBlock logaddexp_op_block = ^BinaryOpFn(cachedGraph, primaryCastTensor, secondaryCastTensor) {
MPSGraph* mpsGraph = cachedGraph->graph();
MPSGraphTensor* sumTensor =
[mpsGraph additionWithPrimaryTensor:[mpsGraph exponentWithTensor:primaryCastTensor name:nil]
secondaryTensor:[mpsGraph exponentWithTensor:secondaryCastTensor name:nil]
name:nil];
return [mpsGraph logarithmWithTensor:sumTensor name:nil];
};
mps::binaryOpTensor(self, other, output, "logaddexp_out_mps", logaddexp_op_block);
}
TORCH_IMPL_FUNC(logaddexp2_out_mps)(const Tensor& self, const Tensor& other, const Tensor& output) {
mps::BinaryOpBlock logaddexp2_op_block = ^BinaryOpFn(cachedGraph, primaryCastTensor, secondaryCastTensor) {
MPSGraph* mpsGraph = cachedGraph->graph();
MPSGraphTensor* sumTensor =
[mpsGraph additionWithPrimaryTensor:[mpsGraph exponentBase2WithTensor:primaryCastTensor name:nil]
secondaryTensor:[mpsGraph exponentBase2WithTensor:secondaryCastTensor name:nil]
name:nil];
return [mpsGraph logarithmBase2WithTensor:sumTensor name:nil];
};
mps::binaryOpTensor(self, other, output, "logaddexp2_out_mps", logaddexp2_op_block);
}
TORCH_IMPL_FUNC(xlogy_out_mps)(const Tensor& self, const Tensor& other, const Tensor& output) {
mps::BinaryOpBlock xlogy_op_block = ^BinaryOpFn(cachedGraph, primaryCastTensor, secondaryCastTensor) {
MPSGraph* mpsGraph = cachedGraph->graph();

View File

@ -370,7 +370,7 @@ static void nllnd_loss_backward_impl(Tensor& grad_input_arg,
onValue:-1.0f
offValue:0.0f
name:nil];
oneHotTensor = castMPSTensor(mpsGraph, oneHotTensor, inputTensor.dataType);
oneHotTensor = castMPSTensor(mpsGraph, oneHotTensor, [inputTensor dataType]);
if (isWeightsArrayValid) {
oneHotTensor = [mpsGraph multiplicationWithPrimaryTensor:oneHotTensor
secondaryTensor:weightTensor
@ -705,6 +705,7 @@ static void smooth_l1_loss_template(const Tensor& input,
TORCH_CHECK(beta >= 0, "smooth_l1_loss does not support negative values for beta.");
TORCH_CHECK(input.is_mps());
TORCH_CHECK(target.is_mps());
TORCH_CHECK_NOT_IMPLEMENTED(input.scalar_type() != kLong, "MPS doesn't know how to do square_i64");
if ((input.numel() == 0) || (target.numel() == 0)) {
reduction == Reduction::Mean ? output.fill_(std::numeric_limits<float>::quiet_NaN()) : output.zero_();
return;
@ -771,7 +772,7 @@ static void smooth_l1_loss_backward_impl(const Tensor& grad_output,
MPSGraphTensor* targetTensor = mpsGraphRankedPlaceHolder(mpsGraph, target);
MPSGraphTensor* gradOutputTensor = mpsGraphRankedPlaceHolder(mpsGraph, grad_output);
MPSGraphTensor* betaTensor = [mpsGraph constantWithScalar:beta dataType:MPSDataTypeFloat32];
MPSGraphTensor* betaTensor = [mpsGraph constantWithScalar:beta dataType:[inputTensor dataType]];
// xn - yn
MPSGraphTensor* diffTensor = [mpsGraph subtractionWithPrimaryTensor:inputTensor
secondaryTensor:targetTensor
@ -797,7 +798,8 @@ static void smooth_l1_loss_backward_impl(const Tensor& grad_output,
name:@"lossTensor"];
MPSGraphTensor* outputTensor = lossTensor;
if (reduction == Reduction::Mean) {
MPSGraphTensor* numelTensor = [mpsGraph constantWithScalar:(double)input.numel() dataType:MPSDataTypeFloat32];
MPSGraphTensor* numelTensor = [mpsGraph constantWithScalar:(double)input.numel()
dataType:[lossTensor dataType]];
outputTensor = [mpsGraph divisionWithPrimaryTensor:lossTensor secondaryTensor:numelTensor name:nil];
}
MPSGraphTensor* gradInputTensor = [mpsGraph multiplicationWithPrimaryTensor:outputTensor

View File

@ -84,6 +84,9 @@ std::tuple<Tensor&, Tensor&, Tensor&> batch_norm_mps_out(const Tensor& self,
Tensor& output,
Tensor& save_mean,
Tensor& save_var) {
TORCH_CHECK_NOT_IMPLEMENTED(self.scalar_type() != kLong, "Long batch norm is not supported with MPS");
TORCH_CHECK_NOT_IMPLEMENTED(!c10::isComplexType(self.scalar_type()),
"Batch norm for complex is not supported for MPS");
using namespace at::native::mps;
struct CachedGraph : public MPSCachedGraph {
CachedGraph(MPSGraph* graph) : MPSCachedGraph(graph) {}
@ -918,6 +921,7 @@ std::tuple<Tensor, Tensor, Tensor> layer_norm_mps(const Tensor& input,
// NOLINTNEXTLINE(bugprone-narrowing-conversions,cppcoreguidelines-narrowing-conversions)
const int axis = input_ndim - normalized_ndim;
MPSStream* stream = getCurrentMPSStream();
TORCH_CHECK_NOT_IMPLEMENTED(input.scalar_type() != kLong, "Not implemented for long on MPS");
@autoreleasepool {
mps::dispatch_sync_with_rethrow(stream->queue(), ^() {
// which kernel variant to use based on the normalized axis N size

View File

@ -10,6 +10,7 @@
#include <ATen/Functions.h>
#include <ATen/NativeFunctions.h>
#else
#include <ATen/ops/aminmax.h>
#include <ATen/ops/avg_pool2d.h>
#include <ATen/ops/avg_pool2d_backward.h>
#include <ATen/ops/avg_pool2d_backward_native.h>
@ -544,8 +545,9 @@ static void max_unpool_out_mps_template(const Tensor& input,
if (indices.defined() && indices.numel() > 0) {
auto output_image_size = c10::multiply_integers(output_size_);
int64_t min_idx = indices.min().item<int64_t>();
int64_t max_idx = indices.max().item<int64_t>();
auto [min_idx_tensor, max_idx_tensor] = indices.aminmax();
int64_t min_idx = min_idx_tensor.item<int64_t>();
int64_t max_idx = max_idx_tensor.item<int64_t>();
if (min_idx < 0 || max_idx >= output_image_size) {
int64_t error_idx = (min_idx < 0) ? min_idx : max_idx;

View File

@ -1028,15 +1028,18 @@ TORCH_IMPL_FUNC(prod_out_mps)
}
TORCH_IMPL_FUNC(amax_out_mps)(const Tensor& input_t, IntArrayRef dim, bool keepdim, const Tensor& output_t) {
TORCH_CHECK(!c10::isComplexType(input_t.scalar_type()), "amax is not defined for complex types");
reduction_out_mps(input_t, dim, keepdim, std::nullopt, output_t, MPSReductionType::AMAX, "amax_out_mps");
}
TORCH_IMPL_FUNC(amin_out_mps)(const Tensor& input_t, IntArrayRef dim, bool keepdim, const Tensor& output_t) {
TORCH_CHECK(!c10::isComplexType(input_t.scalar_type()), "amin is not defined for complex types");
reduction_out_mps(input_t, dim, keepdim, std::nullopt, output_t, MPSReductionType::AMIN, "amin_out_mps");
}
TORCH_IMPL_FUNC(aminmax_out_mps)
(const Tensor& input_t, std::optional<int64_t> dim_opt, bool keepdim, const Tensor& min_t, const Tensor& max_t) {
TORCH_CHECK(!c10::isComplexType(input_t.scalar_type()), "aminmax is not defined for complex types");
reduction_out_mps(input_t,
dim_opt.has_value() ? OptionalIntArrayRef({*dim_opt}) : std::nullopt,
keepdim,

View File

@ -83,6 +83,31 @@ std::string get_type_str<int32_t>() {
return "int32_t";
}
// If all tensors are contiguous with the same dtype and the cat dimension is 0,
// then we can simply copy each tensor's underlying buffer contiguously into the
// output.
static void cat_out_mps_contiguous_impl(const ITensorListRef& inputs, const Tensor& output) {
MPSStream* stream = getCurrentMPSStream();
id<MTLBuffer> output_buffer = getMTLBufferStorage(output);
size_t output_offset = output.storage_offset() * output.itemsize();
for (const Tensor& input : inputs) {
if (cat_should_skip_tensor(input)) {
continue;
}
id<MTLBuffer> input_buffer = getMTLBufferStorage(input);
size_t input_offset = input.storage_offset() * input.itemsize();
auto nbytes = input.nbytes();
auto profile_id =
getMPSProfiler().beginProfileCopy(input_buffer, output_buffer, input, output, nbytes, /*non_blocking=*/true);
stream->copy(input_buffer, output_buffer, nbytes, input_offset, output_offset, profile_id, SyncType::NONE);
output_offset += nbytes;
}
}
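The comment above describes the fast path: when every input is contiguous with the same dtype and the concatenation dimension is 0, `cat` degenerates to copying each input's raw buffer back-to-back into the output, which is what `cat_out_mps_contiguous_impl` does with Metal blit copies. A hedged byte-level sketch of that idea:

```python
def cat_dim0_contiguous(buffers: list[bytes]) -> bytes:
    """Sketch of the fast path: concatenation along dim 0 of
    contiguous, same-dtype inputs is just sequential raw-buffer copies
    at a running output offset."""
    out = bytearray(sum(len(b) for b in buffers))
    offset = 0
    for b in buffers:
        if not b:  # mirrors cat_should_skip_tensor for empty inputs
            continue
        out[offset:offset + len(b)] = b
        offset += len(b)
    return bytes(out)
```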
// NOTE: `output` is expected to already have the correct size.
template <typename idx_type_t>
static void cat_out_mps_impl(const ITensorListRef& inputs, int64_t dimension, const Tensor& output) {
@ -105,7 +130,7 @@ static void cat_out_mps_impl(const ITensorListRef& inputs, int64_t dimension, co
// copy all the input tensor data into a packed buffer, which would not be
// ideal.
for (const Tensor& input : inputs) {
if (input.numel() == 0) {
if (cat_should_skip_tensor(input)) {
continue;
}
@ -243,101 +268,16 @@ TORCH_IMPL_FUNC(cat_out_mps)
if (out.numel() == 0) {
return;
}
auto materialized_inputs = inputs.materialize();
auto out_dtype = at::native::result_type(inputs);
bool has_large_tensor =
isTooLargeForMPSGraph(out) || std::any_of(materialized_inputs.begin(), materialized_inputs.end(), [](auto& t) {
return !cat_should_skip_tensor(t) && isTooLargeForMPSGraph(t);
});
int idx = 0;
for (const Tensor& t : materialized_inputs) {
TORCH_CHECK(t.dim() > 0, "zero-dimensional tensor (at position ", idx, ") cannot be concatenated");
auto lap = at::get_overlap_status(out, t);
TORCH_CHECK(lap != at::MemOverlapStatus::Partial && lap != at::MemOverlapStatus::Full,
"torch.cat(): unsupported operation: the input tensors cannot refer to any "
"of the output memory locations. Found overlap in input tensor ",
idx);
idx++;
}
// Check for type promotion
TORCH_CHECK(canCast(out_dtype, out.scalar_type()),
"torch.cat(): input types can't be cast to the desired output type ",
out.scalar_type());
TORCH_CHECK(!inputs.empty(), "torch.cat(): invalid number of inputs ", inputs.size());
dimension = legacy_cat_wrap_dim(dimension, materialized_inputs);
TORCH_CHECK(dimension >= 0, "torch.cat(): invalid dimension ", dimension);
// previously, size [0] tensors were the only possible empty tensors; thus, it
// wasn't possible to cat empty tensors unless all the other tensors were
// 1-dimensional, so we allowed these tensors to be "skipped". We maintain
// this behavior for backwards compatibility, but only for this specific size
// (i.e. other empty sizes are not skipped).
// FIXME: warn if this is the case
auto should_skip = [](const Tensor& t) { return t.dim() == 1 && t.size(0) == 0; };
at::assert_no_internal_overlap(out);
Tensor notSkippedTensor;
// Indices of tensors to be skipped because they're empty
std::vector<int64_t> skipped_tensor_indices;
// Tensors to be read
std::vector<Tensor> input_tensors;
int tensor_idx = 0;
for (const Tensor& t : materialized_inputs) {
if (t.numel() == 0 || should_skip(t)) {
skipped_tensor_indices.push_back(tensor_idx);
tensor_idx++;
continue;
}
input_tensors.push_back(t);
// TODO: Is this OK?
notSkippedTensor = t;
tensor_idx++;
}
// If all inputs are empty tensors, return an empty tensor
if (!notSkippedTensor.defined()) {
return;
}
for (const Tensor& t : inputs) {
TORCH_CHECK(t.device() == notSkippedTensor.device(),
"torch.cat(): all input tensors must be on the same device. Received ",
t.device(),
" and ",
notSkippedTensor.device());
}
TORCH_CHECK(out.device() == notSkippedTensor.device(),
"torch.cat(): all input tensors and out must be on the same device, but inputs are on ",
notSkippedTensor.device(),
" and out is on ",
out.device());
std::vector<int64_t> size(notSkippedTensor.sizes().vec());
// Compute size of the result in the cat dimension
int64_t cat_dim_size = 0;
idx = 0;
bool has_large_tensor = false;
for (const Tensor& tensor : materialized_inputs) {
if (isTooLargeForMPSGraph(tensor)) {
has_large_tensor |= true;
}
if (!should_skip(tensor)) {
// TODO: Factor out `check_shape_except_dim`
check_shape_except_dim(notSkippedTensor, tensor, dimension, idx);
cat_dim_size += tensor.size(dimension);
idx++;
}
}
// Compute the size of the result
size[dimension] = cat_dim_size;
// skip resizing if size of result is same as expected
if (out.sizes() != size) {
out.resize_(size, MemoryFormat::Contiguous);
}
if (out.numel() == 0) {
return;
}
has_large_tensor |= isTooLargeForMPSGraph(out);
if (has_large_tensor) {
if (all_contiguous && all_same_dtype && (memory_format == MemoryFormat::Contiguous) && (dimension == 0)) {
return mps::cat_out_mps_contiguous_impl(materialized_inputs, out);
} else if (has_large_tensor) {
return mps::cat_out_mps_impl<int64_t>(materialized_inputs, dimension, out);
} else {
return mps::cat_out_mps_impl<int32_t>(materialized_inputs, dimension, out);

View File

@ -31,6 +31,7 @@ void kthvalue_out_mps_impl(const Tensor& self, int64_t k, int64_t dim, Tensor& v
indices.copy_(values.toType(at::ScalarType::Long));
return;
}
TORCH_CHECK_NOT_IMPLEMENTED(!c10::isComplexType(self.scalar_type()), "kthvalue is not implemented for complex types");
// issue #154890, raising error to prevent crash within MPSGraph until
// workaround is implemented.
TORCH_CHECK(self.dim() - dim <= 4, "On-going issue on MPSGraph topk when ndims() - axis > 4, see issue #154890");

View File

@@ -2602,12 +2602,16 @@
device_check: NoCheck # TensorIterator
structured_delegate: exp.out
variants: function, method
dispatch:
SparseCPU, SparseCUDA, SparseMPS: exp_sparse
tags: [core, pointwise]
- func: exp_(Tensor(a!) self) -> Tensor(a!)
device_check: NoCheck # TensorIterator
structured_delegate: exp.out
variants: function, method
dispatch:
SparseCPU, SparseCUDA, SparseMPS: exp_sparse_
tags: pointwise
- func: exp.out(Tensor self, *, Tensor(a!) out) -> Tensor(a!)
@@ -2616,6 +2620,7 @@
structured_inherits: TensorIteratorBase
dispatch:
CPU, CUDA, MPS, MTIA: exp_out
SparseCPU, SparseCUDA, SparseMPS: exp_sparse_out
tags: pointwise
- func: exp2(Tensor self) -> Tensor
@@ -3622,8 +3627,7 @@
structured: True
structured_inherits: TensorIteratorBase
dispatch:
CPU, CUDA: logaddexp_out
MPS: logaddexp_out_mps
CPU, CUDA, MPS: logaddexp_out
tags: pointwise
- func: logaddexp(Tensor self, Tensor other) -> Tensor
@@ -3635,8 +3639,7 @@
structured: True
structured_inherits: TensorIteratorBase
dispatch:
CPU, CUDA: logaddexp2_out
MPS: logaddexp2_out_mps
CPU, CUDA, MPS: logaddexp2_out
tags: pointwise
- func: logaddexp2(Tensor self, Tensor other) -> Tensor
@@ -8867,11 +8870,11 @@
autogen: bitwise_right_shift.Scalar_Tensor_out
tags: pointwise
- func: tril_(Tensor(a!) self, int diagonal=0) -> Tensor(a!)
- func: tril_(Tensor(a!) self, SymInt diagonal=0) -> Tensor(a!)
structured_delegate: tril.out
variants: method
- func: triu_(Tensor(a!) self, int diagonal=0) -> Tensor(a!)
- func: triu_(Tensor(a!) self, SymInt diagonal=0) -> Tensor(a!)
structured_delegate: triu.out
variants: method
@@ -8995,25 +8998,25 @@
- func: cross(Tensor self, Tensor other, int? dim=None) -> Tensor
variants: method, function
- func: triu.out(Tensor self, int diagonal=0, *, Tensor(a!) out) -> Tensor(a!)
- func: triu.out(Tensor self, SymInt diagonal=0, *, Tensor(a!) out) -> Tensor(a!)
structured: True
dispatch:
CPU: triu_cpu
CUDA: triu_cuda
MPS: triu_mps_out
- func: triu(Tensor self, int diagonal=0) -> Tensor
- func: triu(Tensor self, SymInt diagonal=0) -> Tensor
structured_delegate: triu.out
variants: method, function
- func: tril.out(Tensor self, int diagonal=0, *, Tensor(a!) out) -> Tensor(a!)
- func: tril.out(Tensor self, SymInt diagonal=0, *, Tensor(a!) out) -> Tensor(a!)
structured: True
dispatch:
CPU: tril_cpu
CUDA: tril_cuda
MPS: tril_mps_out
- func: tril(Tensor self, int diagonal=0) -> Tensor
- func: tril(Tensor self, SymInt diagonal=0) -> Tensor
structured_delegate: tril.out
variants: method, function

View File

@ -467,6 +467,28 @@ Tensor sparse_coo_tensor(const Tensor& indices, const Tensor& values, IntArrayRe
!options.has_layout() || options.layout() == kSparse,
"expected sparse layout, but got layout ",
options.layout());
if (indices.numel() > 0) {
Tensor min_indices =
std::get</* values */ 0>(indices.min(/* dim */ 1, /* keepdim */ false));
Tensor cpu_min_indices;
if (!indices.is_cpu()) {
cpu_min_indices = min_indices.to(at::DeviceType::CPU);
} else {
cpu_min_indices = min_indices;
}
auto cpu_min_indices_accessor = cpu_min_indices.accessor<int64_t, 1>();
for (const auto d : c10::irange(indices.size(0))) {
int64_t min_index_in_dim = cpu_min_indices_accessor[d];
TORCH_CHECK(
min_index_in_dim >= 0,
"found negative index ",
min_index_in_dim,
" for dim ",
d);
}
}
return at::native::_sparse_coo_tensor_unsafe(
indices,
values,
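The validation added above takes the per-dimension minimum of the COO indices and rejects any negative entry before handing off to the unsafe constructor. A small Python sketch of the same check, with plain lists standing in for the `(ndim, nnz)` indices tensor:

```python
def check_sparse_indices(indices):
    """Sketch of the new validation: `indices` is a list of per-dimension
    index rows; every entry must be non-negative before construction."""
    for d, row in enumerate(indices):
        if row:
            m = min(row)  # stands in for indices.min(dim=1)
            if m < 0:
                raise ValueError(f"found negative index {m} for dim {d}")
    return True
```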

View File

@@ -26,6 +26,8 @@
#include <ATen/ops/erf_native.h>
#include <ATen/ops/erfinv.h>
#include <ATen/ops/erfinv_native.h>
#include <ATen/ops/exp.h>
#include <ATen/ops/exp_native.h>
#include <ATen/ops/expm1.h>
#include <ATen/ops/expm1_native.h>
#include <ATen/ops/floor.h>
@ -175,6 +177,7 @@ COALESCED_UNARY_UFUNC(atanh)
COALESCED_UNARY_UFUNC(ceil)
COALESCED_UNARY_UFUNC(deg2rad)
COALESCED_UNARY_UFUNC(erf)
COALESCED_UNARY_UFUNC(exp)
COALESCED_UNARY_UFUNC(erfinv)
COALESCED_UNARY_UFUNC(expm1)
COALESCED_UNARY_UFUNC(floor)

View File

@ -1837,6 +1837,10 @@ class BenchmarkRunner:
def skip_models_for_cuda(self):
return set()
@property
def skip_models_for_xpu(self):
return set()
@property
def skip_models_for_cpu(self):
return set()
@ -3927,6 +3931,8 @@ def run(runner, args, original_dir=None):
runner.skip_models.update(runner.skip_models_for_cpu_aarch64)
elif args.devices == ["cuda"]:
runner.skip_models.update(runner.skip_models_for_cuda)
elif args.devices == ["xpu"]:
runner.skip_models.update(runner.skip_models_for_xpu)
if not args.multiprocess:
runner.skip_models.update(runner.skip_multiprocess_models)

View File

@ -56,6 +56,20 @@ def list_benchmarks():
print(f"Available benchmarks: {list(BENCHMARK_REGISTRY.keys())}")
def _run_benchmark(
benchmark_cls,
script_args,
):
benchmark = benchmark_cls(script_args)
benchmark.benchmark()
benchmark.report_geomean_speedup()
if script_args.print_benchmark_result:
print(f"Benchmarking results {benchmark.name}:")
print(benchmark.profiling_results)
if script_args.visualize:
benchmark.visualize()
def run_benchmark(
benchmark_name: str,
script_args,
@ -71,10 +85,7 @@ def run_benchmark(
print("=" * 60)
benchmark_class = BENCHMARK_REGISTRY[benchmark_name]
benchmark = benchmark_class(script_args)
benchmark.benchmark()
if script_args.visualize:
benchmark.visualize()
_run_benchmark(benchmark_class, script_args)
return True
@ -87,10 +98,7 @@ def run_all_benchmarks(script_args):
for name, cls in BENCHMARK_REGISTRY.items():
print(f"\n{'=' * 20} {name.upper()} {'=' * 20}")
benchmark = cls(script_args)
benchmark.benchmark()
if script_args.visualize:
benchmark.visualize()
_run_benchmark(cls, script_args)
print()
@ -149,8 +157,43 @@ Examples:
help="Whether to exit with an error message for accuracy failure",
)
parser.add_argument(
"--print-benchmark-result",
action="store_true",
help="Whether to print the raw benchmarking result. Easier to quickly check the benchmark results on a server without a GUI",
)
parser.add_argument(
"--custom-compile-name",
type=str,
default=None,
help="Name for the curve with customized compilation options",
)
parser.add_argument(
"--custom-compile-options",
type=str,
default=None,
help="Json string for the custom compile options.",
)
args = parser.parse_args()
if args.custom_compile_options:
import json
try:
args.custom_compile_options = json.loads(args.custom_compile_options)
except json.decoder.JSONDecodeError as e:
raise RuntimeError(
f"Invalid json string for --custom-compile-options: {args.custom_compile_options}"
) from e
if not args.custom_compile_options:
raise RuntimeError("Found no options for --custom-compile-options")
if not args.custom_compile_name:
raise RuntimeError("Missing label name for the custom compilation")
# Handle list option
if args.list:
list_benchmarks()

View File

@ -8,6 +8,15 @@ import torch
import torch.nn.functional as F
# more important shapes used by internal models
extra_shapes_for_norm = (
(1152 * 500, 384),
(1152 * 500, 512),
(1152 * 1000, 384),
(1152 * 1000, 512),
)
class CrossEntropyForward(BenchmarkKernel):
def __init__(self, script_args):
super().__init__(script_args)
@ -346,7 +355,7 @@ class RMSNormForward(BenchmarkKernel):
(32768, 65536),
(16384, 131072),
(8192, 262144),
)
) + extra_shapes_for_norm
def get_memory_bytes(self, args, kwargs) -> int:
x, w = args
@ -438,8 +447,7 @@ class RMSNormBackward(BenchmarkKernel):
(32768, 4096),
(32768, 8192),
(32768, 16384),
(32768, 32768),
)
) + extra_shapes_for_norm
def get_memory_bytes(self, args, kwargs) -> int:
x, w, dy = args
@ -553,7 +561,7 @@ class LayerNormForward(BenchmarkKernel):
(32768, 16384),
(32768, 32768),
(32768, 65536),
)
) + extra_shapes_for_norm
def get_memory_bytes(self, args, kwargs) -> int:
x, w = args
@ -627,7 +635,7 @@ class LayerNormBackward(BenchmarkKernel):
(32768, 16384),
(32768, 32768),
(32768, 65536),
)
) + extra_shapes_for_norm
def get_memory_bytes(self, args, kwargs) -> int:
x, w, dy = args

View File

@ -6,6 +6,7 @@ from dataclasses import dataclass
from typing import Any, Optional
import matplotlib.pyplot as plt
from scipy.stats import gmean
import torch
from torch._inductor.runtime.benchmarking import benchmarker
@ -107,6 +108,18 @@ class BenchmarkKernel:
for backend in self.available_backends:
args_ref, kwargs_ref = self.clone_inputs(args, kwargs)
res[backend] = getattr(self, backend)(args_ref, kwargs_ref)()
if (
"compiled" in self.available_backends
and self.script_args.custom_compile_options
):
torch._dynamo.reset() # cause recompile
with torch._inductor.config.patch(self.script_args.custom_compile_options):
args_ref, kwargs_ref = self.clone_inputs(args, kwargs)
res[self.script_args.custom_compile_name] = self.compiled(
args_ref, kwargs_ref
)()
gold = res["eager"]
tol = {}
@@ -115,7 +128,7 @@
"atol": self.script_args.tolerance,
"rtol": self.script_args.tolerance,
}
for backend in self.available_backends:
for backend in res:
if backend == "eager":
continue
try:
@@ -134,37 +147,83 @@
print("Exit right away since --exit-on-accuracy-failure is set")
sys.exit(1)
def benchmark_single_shape_for_backend(
self, backend, args, kwargs, setting, fn=None
) -> bool:
if fn is None:
fn = getattr(self, backend)
args_ref, kwargs_ref = self.clone_inputs(args, kwargs)
try:
avg_time = benchmark_kernel_in_milliseconds(fn(args_ref, kwargs_ref))
except Exception as e:
print(
f"Failed to run {backend} backend on {self.name} kernel for {setting} due to {e}"
)
self.available_backends.remove(backend) # noqa: B909
return False
mem_bytes = self.get_memory_bytes(args_ref, kwargs_ref)
perf = Performance(setting, avg_time, mem_bytes)
print(f"{self.name} kernel on {backend} backend. {perf}")
self.profiling_results[backend].append(perf)
return True
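The refactor above extracts per-backend timing into `benchmark_single_shape_for_backend`, which records a `Performance` entry on success and returns `False` (dropping the backend) on failure instead of aborting the run. A simplified sketch of that error-tolerant loop body; the helper name and result containers here are illustrative:

```python
def benchmark_backend(name, fn, results, failed):
    """Time one backend; record a result on success, report failure
    (so the caller can stop retrying that backend) instead of raising."""
    try:
        latency_ms = fn()  # stands in for benchmark_kernel_in_milliseconds(...)
    except Exception as e:
        print(f"backend {name} failed: {e}")
        failed.append(name)
        return False
    results.setdefault(name, []).append(latency_ms)
    return True
```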
def benchmark_single_shape(
self, args, kwargs=None, should_check_accuracy=True, setting: str = ""
):
for backend in self.available_backends:
args_ref, kwargs_ref = self.clone_inputs(args, kwargs)
try:
avg_time = benchmark_kernel_in_milliseconds(
getattr(self, backend)(args_ref, kwargs_ref)
self.benchmark_single_shape_for_backend(backend, args, kwargs, setting)
if (
"compiled" in self.available_backends
and self.script_args.custom_compile_options
):
torch._dynamo.reset() # cause recompile
with torch._inductor.config.patch(self.script_args.custom_compile_options):
status = self.benchmark_single_shape_for_backend(
self.script_args.custom_compile_name,
args,
kwargs,
setting,
fn=self.compiled,
)
except Exception as e:
print(
f"Failed to run {backend} backend on {self.name} kernel for {setting} due to {e}"
if not status:
self.script_args.custom_compile_options = (
None # once fail, don't run again
)
self.available_backends.remove(backend) # noqa: B909
continue
mem_bytes = self.get_memory_bytes(args_ref, kwargs_ref)
perf = Performance(setting, avg_time, mem_bytes)
print(f"{self.name} kernel on {backend} backend. {perf}")
self.profiling_results[backend].append(perf)
if should_check_accuracy:
self.check_accuracy(args, kwargs)
def visualize(self) -> None:
device_name = torch.cuda.get_device_name(0)
visualize_comparison(
self.profiling_results,
title=f"{self.name}",
title=f"{self.name} ({device_name})",
output_path=f"{self.name}_bench",
)
return
def report_geomean_speedup(self) -> None:
print(f"Geomean speedup for benchmark {self.name}")
eager_result = {
result.setting: result for result in self.profiling_results["eager"]
}
print(f" eager {len(eager_result)} data points")
for backend, backend_result in self.profiling_results.items():
if backend == "eager":
continue
speeduplist = []
for result in backend_result:
eager_latency = eager_result[result.setting].latency
backend_latency = result.latency
speeduplist.append(
eager_latency / backend_latency if backend_latency != 0 else 0.0
)
if len(speeduplist) > 0:
print(
f" {backend} {len(speeduplist)} data points, {gmean(speeduplist):.2f}x speedup"
)
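The `report_geomean_speedup` method above divides each backend latency into the matching eager latency and aggregates with `scipy.stats.gmean`. The same aggregation can be sketched with the standard library only (helper name is illustrative):

```python
import math

def geomean_speedup(eager_ms, backend_ms):
    """Per-setting speedup is eager latency / backend latency; the
    geometric mean (exp of the mean of logs) is what gmean computes."""
    speedups = [e / b for e, b in zip(eager_ms, backend_ms) if b != 0]
    return math.exp(sum(math.log(s) for s in speedups) / len(speedups))
```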
def get_backend_colors() -> dict[str, str]:
"""Get consistent color scheme for different backends."""
@ -252,5 +311,6 @@ def visualize_comparison(
os.makedirs("pics", exist_ok=True)
full_path = os.path.join("pics", output_path + ".png")
plt.savefig(full_path, dpi=300, bbox_inches="tight", facecolor="white")
print(f"Chart saved to {full_path}")
plt.close()

View File

@ -74,7 +74,8 @@ REQUIRE_HIGHER_TOLERANCE = {
REQUIRE_HIGHER_TOLERANCE_AMP = {}
REQUIRE_EVEN_HIGHER_TOLERANCE = {
"beit_base_patch16_224",
"deit_base_distilled_patch16_224",
"vit_base_patch16_siglip_256",
}
# These models need higher tolerance in MaxAutotune mode
@ -354,7 +355,9 @@ class TimmRunner(BenchmarkRunner):
if is_training:
from torch._inductor import config as inductor_config
if name in REQUIRE_EVEN_HIGHER_TOLERANCE or (
if name == "beit_base_patch16_224":
tolerance = 16 * 1e-2
elif name in REQUIRE_EVEN_HIGHER_TOLERANCE or (
inductor_config.max_autotune
and name in REQUIRE_EVEN_HIGHER_TOLERANCE_MAX_AUTOTUNE
):

View File

@ -124,6 +124,10 @@ class TorchBenchmarkRunner(BenchmarkRunner):
def skip_models_for_cuda(self):
return self._skip["device"]["cuda"]
@property
def skip_models_for_xpu(self):
return self._skip["device"]["xpu"]
@property
def skip_models_for_freezing_cuda(self):
return self._skip["freezing"]["cuda"]

View File

@ -217,6 +217,9 @@ skip:
cuda: []
xpu:
- *DETECTRON2_MODELS
test:
training:
- *DETECTRON2_MODELS

View File

@@ -15,7 +15,6 @@ namespace c10::cuda {
namespace {
// Global stream state and constants
c10::once_flag init_flag;
DeviceIndex num_gpus = -1;
constexpr int kStreamsPerPoolBits = 5;
constexpr int kStreamsPerPool = 1 << kStreamsPerPoolBits;
@ -226,7 +225,10 @@ void initDeviceStreamState(DeviceIndex device_index) {
// Init front-end to ensure initialization only occurs once
void initCUDAStreamsOnce() {
// Inits default streams (once, globally)
c10::call_once(init_flag, initGlobalStreamState);
auto static init_flag [[maybe_unused]] = [] {
initGlobalStreamState();
return true;
}();
if (current_streams) {
return;

View File

@ -624,6 +624,64 @@ inline T spherical_bessel_j0(T x) {
return static_cast<T>(::metal::sin(x) / x);
}
template <typename T>
inline ::metal::enable_if_t<is_scalar_floating_point_v<T>, T> logaddexp(
T a,
T b) {
float a0 = static_cast<float>(a);
float b0 = static_cast<float>(b);
if (::metal::isinf(a0) && a0 == b0) {
return static_cast<T>(a0);
} else {
float m0 = ::metal::max(a0, b0);
return static_cast<T>(
m0 + ::c10::metal::log1p(::metal::exp(-::metal::abs(a0 - b0))));
}
}
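The scalar `logaddexp` kernel above uses the standard overflow-safe rewrite `log(e^a + e^b) = max(a, b) + log1p(e^{-|a-b|})`, with an early return when both arguments are the same infinity. A Python sketch of the same formula:

```python
import math

def logaddexp(a: float, b: float) -> float:
    """log(exp(a) + exp(b)) without overflow: factor out the larger
    argument so the remaining exponential is always <= 1."""
    if math.isinf(a) and a == b:  # covers (-inf, -inf) and (inf, inf)
        return a
    m = max(a, b)
    return m + math.log1p(math.exp(-abs(a - b)))
```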
// The function is ported from mlx
template <typename T>
inline ::metal::enable_if_t<is_complex_v<T>, T> logaddexp(T a, T b) {
if (::metal::isnan(a.x) || ::metal::isnan(a.y) || ::metal::isnan(b.x) ||
::metal::isnan(b.y)) {
return T(NAN, NAN);
}
T maxval = a.x > b.x ? a : b;
T minval = a.x < b.x ? a : b;
constexpr auto inf = ::metal::numeric_limits<T>::infinity().x;
if (minval.x == -inf || maxval.x == inf) {
return maxval;
}
float2 maxval_ = static_cast<float2>(maxval);
float2 minval_ = static_cast<float2>(minval);
float m = ::metal::exp(minval_.x - maxval_.x);
float2 dexp{
m * ::metal::cos(minval_.y - maxval_.y),
m * ::metal::sin(minval_.y - maxval_.y),
};
return static_cast<T>(maxval_ + ::c10::metal::log1p(dexp));
}
template <typename T>
inline T logaddexp2(T a, T b) {
constexpr auto log_2 = float(0.693147180559945309417232121458176);
constexpr auto inv_log_2 = float(1) / log_2;
float a0 = static_cast<float>(a);
float b0 = static_cast<float>(b);
if (::metal::isinf(a0) && a0 == b0) {
return static_cast<T>(a0);
} else {
float m0 = ::metal::max(a0, b0);
return static_cast<T>(
m0 +
::c10::metal::log1p(::metal::pow(float(2), -::metal::abs(a0 - b0))) *
inv_log_2);
}
}
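`logaddexp2` above applies the identical trick in base 2: `log2(2^a + 2^b) = max(a, b) + log1p(2^{-|a-b|}) / ln 2`. A Python sketch:

```python
import math

def logaddexp2(a: float, b: float) -> float:
    """log2(2**a + 2**b) via the same max-factoring rewrite; the
    division by ln 2 converts the natural-log log1p back to base 2."""
    if math.isinf(a) and a == b:
        return a
    m = max(a, b)
    return m + math.log1p(2.0 ** (-abs(a - b))) / math.log(2.0)
```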
template <typename T>
inline float xlog1py(T x, T y) {
if (::metal::isnan(y)) {

View File

@ -322,6 +322,24 @@ inline float log1p(float x) {
return rc;
}
// The function is ported from mlx
inline float2 log1p(float2 in) {
float x = in.x;
float y = in.y;
float zabs = ::metal::precise::sqrt(x * x + y * y);
float theta = ::metal::atan2(y, x + 1);
if (zabs < 0.5f) {
float r = x * (2 + x) + y * y;
if (r == 0) { // handle underflow
return {x, theta};
}
return {0.5f * log1p(r), theta};
} else {
auto z0 = ::metal::sqrt((x + 1) * (x + 1) + y * y);
return {::metal::log(z0), theta};
}
}
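The complex `log1p` above (ported from mlx, per the comment) computes the real part as `0.5 * log1p(|1+z|^2 - 1)` when `|z| < 0.5` to avoid catastrophic cancellation near `z = 0`, and the imaginary part as `atan2(y, x+1)`. A Python sketch of the same branch structure, which can be checked against `cmath.log(1 + z)`:

```python
import cmath
import math

def log1p_complex(z: complex) -> complex:
    """log(1+z) for complex z: real part is log|1+z|, computed via
    log1p of the small residual r = |1+z|^2 - 1 when |z| < 0.5;
    imaginary part is the phase atan2(y, x+1)."""
    x, y = z.real, z.imag
    theta = math.atan2(y, x + 1.0)
    if abs(z) < 0.5:
        r = x * (2.0 + x) + y * y  # |1+z|^2 - 1 without forming 1+z
        if r == 0.0:               # underflow guard, as in the kernel
            return complex(x, theta)
        return complex(0.5 * math.log1p(r), theta)
    return complex(math.log(abs(1.0 + z)), theta)
```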
template <typename T1, typename T2 = T1>
struct pair {
T1 first;

View File

@ -239,7 +239,7 @@ struct Class2 {
struct mapper_call_func {
template <class T>
decltype(auto) operator()(T) {
auto operator()(T) {
return T::type::func();
}
};
@ -254,7 +254,7 @@ TEST(TypeListTest, MapTypesToValues_members) {
struct mapper_call_nonexistent_function {
template <class T>
decltype(auto) operator()(T) {
auto operator()(T) {
return T::type::this_doesnt_exist();
}
};

View File

@ -53,7 +53,7 @@ namespace guts {
// member functions.
namespace detail {
template <class F, class Tuple, std::size_t... INDEX>
C10_HOST_DEVICE constexpr decltype(auto) apply_impl(
C10_HOST_DEVICE constexpr auto apply_impl(
F&& f,
Tuple&& t,
std::index_sequence<INDEX...>) {
@ -62,7 +62,7 @@ C10_HOST_DEVICE constexpr decltype(auto) apply_impl(
} // namespace detail
template <class F, class Tuple>
C10_HOST_DEVICE constexpr decltype(auto) apply(F&& f, Tuple&& t) {
C10_HOST_DEVICE constexpr auto apply(F&& f, Tuple&& t) {
return detail::apply_impl(
std::forward<F>(f),
std::forward<Tuple>(t),

View File

@ -469,7 +469,7 @@ C10_API std::string GetExceptionString(const std::exception& e);
namespace c10::detail {
template <typename... Args>
decltype(auto) torchCheckMsgImpl(const char* /*msg*/, const Args&... args) {
auto torchCheckMsgImpl(const char* /*msg*/, const Args&... args) {
return ::c10::str(args...);
}
inline C10_API const char* torchCheckMsgImpl(const char* msg) {

View File

@ -135,7 +135,7 @@ struct _str_wrapper<> final {
// Convert a list of string-like arguments into a single string.
template <typename... Args>
inline decltype(auto) str(const Args&... args) {
inline auto str(const Args&... args) {
return detail::_str_wrapper<
typename detail::CanonicalizeStrTypes<Args>::type...>::call(args...);
}

View File

@ -507,7 +507,7 @@ struct map_types_to_values<typelist<Types...>> final {
} // namespace detail
template <class TypeList, class Func>
decltype(auto) map_types_to_values(Func&& func) {
auto map_types_to_values(Func&& func) {
return detail::map_types_to_values<TypeList>::call(std::forward<Func>(func));
}

View File

@ -554,6 +554,17 @@ class DeviceCachingAllocator {
}
}
double getMemoryFraction() {
if (!set_fraction) {
return 1.0;
}
c10::xpu::DeviceProp device_prop;
c10::xpu::get_device_properties(&device_prop, device_index);
return static_cast<double>(allowed_memory_maximum) /
static_cast<double>(device_prop.global_mem_size);
}
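The new `getMemoryFraction` above is the inverse of `setMemoryFraction`: it reports 1.0 until a fraction has been set, otherwise the ratio of the allowed allocation cap to the device's total memory. A trivial sketch of that arithmetic (parameter names are illustrative):

```python
def get_memory_fraction(allowed_memory_maximum: int,
                        global_mem_size: int,
                        set_fraction: bool) -> float:
    """Mirror of DeviceCachingAllocator::getMemoryFraction: with no
    fraction set the whole device is allowed (1.0); otherwise report
    cap / total device memory."""
    if not set_fraction:
        return 1.0
    return allowed_memory_maximum / global_mem_size
```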
void setMemoryFraction(double fraction) {
c10::xpu::DeviceProp device_prop;
c10::xpu::get_device_properties(&device_prop, device_index);
@ -724,6 +735,11 @@ class XPUAllocator : public DeviceAllocator {
device_allocators[device]->resetAccumulatedStats();
}
double getMemoryFraction(DeviceIndex device) {
assertValidDevice(device);
return device_allocators[device]->getMemoryFraction();
}
void setMemoryFraction(double fraction, DeviceIndex device) {
assertValidDevice(device);
TORCH_CHECK_VALUE(
@ -777,6 +793,10 @@ void recordStream(const DataPtr& dataPtr, XPUStream stream) {
return allocator.recordStream(dataPtr, stream);
}
double getMemoryFraction(DeviceIndex device) {
return allocator.getMemoryFraction(device);
}
void setMemoryFraction(double fraction, DeviceIndex device) {
return allocator.setMemoryFraction(fraction, device);
}

View File

@ -25,6 +25,8 @@ C10_XPU_API void raw_delete(void* ptr);
C10_XPU_API void recordStream(const DataPtr& dataPtr, XPUStream stream);
C10_XPU_API double getMemoryFraction(DeviceIndex device);
C10_XPU_API void setMemoryFraction(double fraction, DeviceIndex device);
} // namespace c10::xpu::XPUCachingAllocator

View File

@@ -1,4 +1,3 @@
#include <c10/util/CallOnce.h>
#include <c10/util/Exception.h>
#include <c10/xpu/XPUFunctions.h>
@@ -33,7 +32,6 @@ namespace {
* one iGPU and enumerate all iGPUs on that platform.
* 3. If neither dGPUs nor iGPUs are found, conclude that no GPUs are available.
*/
c10::once_flag init_flag;
thread_local DeviceIndex curDeviceIndex = 0;
struct DevicePool {
@ -149,7 +147,10 @@ inline void initGlobalDevicePoolState() {
}
inline void initDevicePoolCallOnce() {
c10::call_once(init_flag, initGlobalDevicePoolState);
auto static init_flag [[maybe_unused]] = [] {
initGlobalDevicePoolState();
return true;
}();
}
void initDeviceProperties(DeviceProp* device_prop, DeviceIndex device) {

View File

@@ -12,7 +12,6 @@ namespace c10::xpu {
namespace {
// Global stream state and constants
c10::once_flag init_flag;
DeviceIndex num_gpus = -1;
constexpr int kStreamsPerPoolBits = 5;
constexpr int kStreamsPerPool = 1 << kStreamsPerPoolBits;
@ -163,7 +162,10 @@ void initDeviceStreamState(DeviceIndex device) {
}
void initXPUStreamsOnce() {
c10::call_once(init_flag, initGlobalStreamState);
auto static init_flag [[maybe_unused]] = [] {
initGlobalStreamState();
return true;
}();
if (current_streams) {
return;

View File

@ -423,8 +423,10 @@ Also see {ref}`saved-tensors-hooks-doc`.
```{eval-rst}
.. autofunction:: torch.autograd.graph.get_gradient_edge
```
```{eval-rst}
.. autofunction:: torch.autograd.graph.set_warn_on_accumulate_grad_stream_mismatch
```
% This module needs to be documented. Adding here in the meantime

View File

@ -394,10 +394,6 @@ an opaque group handle that can be given as a `group` argument to all collective
.. autofunction:: new_group
```
```{eval-rst}
.. autofunction:: torch.distributed.distributed_c10d.shrink_group
```
```{eval-rst}
.. autofunction:: get_group_rank
```

View File

@@ -2,9 +2,9 @@
## Overview
The LibTorch Stable ABI (Application Binary Interface) provides an interface for extending PyTorch functionality without being tightly coupled to specific PyTorch versions. This enables the development of custom operators and extensions that remain compatible across PyTorch releases.
The LibTorch Stable ABI (Application Binary Interface) provides a limited interface for extending PyTorch functionality without being tightly coupled to specific PyTorch versions. This enables the development of custom operators and extensions that remain compatible across PyTorch releases. This limited set of APIs is not intended to replace existing LibTorch, but rather to provide a stable foundation for a majority of custom extension use cases. If there is any API you would like to see added to the stable ABI, please file a request through a [new issue on the PyTorch repo](https://github.com/pytorch/pytorch/issues).
The stable ABI consists of three main components:
The limited stable ABI consists of three main components:
1. **Stable C headers** - Low-level C API implemented by libtorch (primarily `torch/csrc/inductor/aoti_torch/c/shim.h`)
2. **Header-only C++ library** - Standalone utilities implemented in only headers such that there is no dependence on libtorch (`torch/headeronly/*`)
@ -14,8 +14,8 @@ We discuss each of these in detail
### `torch/headeronly`
This is a set of inlined C++ headers are completely decoupled from libtorch. The headers consist of certain utilities that might be familiar to custom extension writers. For example, the
`c10::ScalarType` enum lives here as `torch::headeronly::ScalarType`.
The inlined C++ headers living in [`torch/headeronly`](https://github.com/pytorch/pytorch/tree/main/torch/headeronly) are completely decoupled from LibTorch. The headers consist of certain utilities that might be familiar to custom extension writers. For example, the
`c10::ScalarType` enum lives here as `torch::headeronly::ScalarType`, as well as a libtorch-independent version of `TORCH_CHECK` that is `STD_TORCH_CHECK`. You can trust all APIs in the `torch::headeronly` namespace to not depend on `libtorch.so`. These APIs are also globally listed in [torch/header_only_apis.txt](https://github.com/pytorch/pytorch/blob/main/torch/header_only_apis.txt).
### `torch/csrc/stable`
@ -34,8 +34,14 @@ We are continuing to improve coverage in our `torch/csrc/stable` APIs. Please fi
### Stable C headers
The stable C headers used by AOTInductor form the foundation of the stable ABI. However, this is **use at your own risk**. For example, users must handle the memory lifecycle of objects returned by certain APIs.
Further, the stack-based APIs discussed below which allow the user to call the PyTorch dispatcher don't provide strong guarantees on forward and backward compatibility.
The stable C headers started by AOTInductor form the foundation of the stable ABI. Presently, the available C headers include:
- [torch/csrc/inductor/aoti_torch/c/shim.h](https://github.com/pytorch/pytorch/blob/main/torch/csrc/inductor/aoti_torch/c/shim.h): Includes C-style shim APIs for commonly used operations regarding Tensors, dtypes, CUDA, and the like.
- [torch/csrc/inductor/aoti_torch/generated/c_shim_aten.h](https://github.com/pytorch/pytorch/blob/main/torch/csrc/inductor/aoti_torch/generated/c_shim_aten.h): Includes C-style shim APIs for ATen ops from `native_functions.yaml` (e.g. `aoti_torch_aten_new_empty`).
- [torch/csrc/inductor/aoti_torch/generated/c_shim_*.h](https://github.com/pytorch/pytorch/blob/main/torch/csrc/inductor/aoti_torch/generated): Includes C-style shim APIs for specific backend kernels dispatched from `native_functions.yaml` (e.g. `aoti_torch_cuda_pad`). These APIs should only be used for the specific backend they are named after (e.g. `aoti_torch_cuda_pad` should only be used within CUDA kernels), as they opt out of the dispatcher.
- [torch/csrc/stable/c/shim.h](https://github.com/pytorch/pytorch/blob/main/torch/csrc/stable/c/shim.h): We are building out more ABIs to logically live in `torch/csrc/stable/c` instead of continuing the AOTI naming that no longer makes sense for our general use case.
These headers are promised to be ABI stable across releases and adhere to a stronger backwards compatibility policy than LibTorch. Specifically, we promise not to modify them for at least 2 years after they are released. However, this is **use at your own risk**. For example, users must handle the memory lifecycle of objects returned by certain APIs. Further, the stack-based APIs discussed below which allow the user to call into the PyTorch dispatcher do not provide strong guarantees on forward and backward compatibility of the underlying op that is called.
Unless absolutely necessary, we recommend the high-level C++ API in `torch/csrc/stable`,
which handles the rough edges of the C API for the user.

View File

@ -76,6 +76,7 @@
:nosignatures:
empty_cache
get_per_process_memory_fraction
max_memory_allocated
max_memory_reserved
mem_get_info

View File

@ -266,7 +266,7 @@ class TestFullyShardPostAccGradHookMultiThread(FSDPTestMultiThread):
model(inp).sum().backward()
param_names = {param_name for param_name, _ in model.named_parameters()}
self.assertEqual(param_names, set(param_name_to_hook_count.keys()))
for param_name, count in param_name_to_hook_count.items():
for count in param_name_to_hook_count.values():
self.assertEqual(count, 1)

View File

@ -827,7 +827,7 @@ class TestFullyShardShardPlacementFnMultiProcess(FSDPTest):
torch.manual_seed(42 + self.rank)
inp = torch.randint(0, model_args.vocab_size, (2, 16), device=device_type.type)
for iter_idx in range(5):
for _ in range(5):
ref_loss = ref_model(inp).sum()
loss = model(inp).sum()
self.assertEqual(ref_loss, loss)

View File

@ -800,6 +800,7 @@ if not (TEST_WITH_DEV_DBG_ASAN or IS_WINDOWS or IS_MACOS or IS_CI):
stderr_redirects={0: stderr_redir},
ret_vals={0: queue},
queue_finished_reading_event=worker_finished_event_mock,
numa_options=None,
)
self.assertEqual("hello_0", queue.get())
if stdout_redir:

View File

@ -514,18 +514,17 @@ class TestFSDPMiscMultiProcess(FSDPTest):
def test_fsdp_cpu_training(self):
"""Tests FSDP training on CPU."""
gloo_pg = dist.new_group(backend="gloo")
for ss in [ # noqa: F841
for ss in [
ShardingStrategy.NO_SHARD,
ShardingStrategy.FULL_SHARD,
ShardingStrategy.SHARD_GRAD_OP,
ShardingStrategy.HYBRID_SHARD,
ShardingStrategy._HYBRID_SHARD_ZERO2,
]:
torch.manual_seed(42)
model = MyModel()
ref_model = DDP(deepcopy(model), process_group=gloo_pg)
model = FSDP(
model,
sharding_strategy=ss,
auto_wrap_policy=always_wrap_policy,
process_group=gloo_pg,
device_id=torch.device("cpu"),

View File

@ -1,43 +0,0 @@
import logging
import time
_start_time = time.time()
_logger = logging.getLogger(__name__)
def _ts():
return time.time() - _start_time
def configure(level=logging.INFO, force=False):
try:
logging.basicConfig(
level=level,
format="%(asctime)s %(name)s %(levelname)s: %(message)s",
force=force,
)
except TypeError:
logging.basicConfig(
level=level, format="%(asctime)s %(name)s %(levelname)s: %(message)s"
)
def log_test_info(rank, message):
_logger.info("[%7.3fs][Rank %s] %s", _ts(), rank, message)
def log_test_success(rank, message):
_logger.info("[%7.3fs][Rank %s] ✅ %s", _ts(), rank, message)
def log_test_validation(rank, message):
_logger.info("[%7.3fs][Rank %s] ✓ %s", _ts(), rank, message)
def log_test_warning(rank, message):
_logger.warning("[%7.3fs][Rank %s] ⚠️ %s", _ts(), rank, message)
def log_test_error(rank, message):
_logger.error("[%7.3fs][Rank %s] ✗ %s", _ts(), rank, message)

View File

@ -2,7 +2,6 @@
# Owner(s): ["oncall: distributed"]
import sys
from pathlib import Path
import torch
import torch.distributed as dist
@ -45,53 +44,19 @@ class TestInstantiator(TestCase):
self.assertEqual(return_type_str, "Tuple[Tensor, int, str]")
def test_instantiate_scripted_remote_module_template(self):
dir_path = Path(instantiator.INSTANTIATED_TEMPLATE_DIR_PATH)
# Cleanup.
file_paths = dir_path.glob(f"{instantiator._FILE_PREFIX}*.py")
for file_path in file_paths:
file_path.unlink()
# Check before run.
file_paths = dir_path.glob(f"{instantiator._FILE_PREFIX}*.py")
num_files_before = len(list(file_paths))
self.assertEqual(num_files_before, 0)
generated_module = instantiator.instantiate_scriptable_remote_module_template(
MyModuleInterface
)
self.assertTrue(hasattr(generated_module, "_remote_forward"))
self.assertTrue(hasattr(generated_module, "_generated_methods"))
# Check after run.
file_paths = dir_path.glob(f"{instantiator._FILE_PREFIX}*.py")
num_files_after = len(list(file_paths))
self.assertEqual(num_files_after, 1)
def test_instantiate_non_scripted_remote_module_template(self):
dir_path = Path(instantiator.INSTANTIATED_TEMPLATE_DIR_PATH)
# Cleanup.
file_paths = dir_path.glob(f"{instantiator._FILE_PREFIX}*.py")
for file_path in file_paths:
file_path.unlink()
# Check before run.
file_paths = dir_path.glob(f"{instantiator._FILE_PREFIX}*.py")
num_files_before = len(list(file_paths))
self.assertEqual(num_files_before, 0)
generated_module = (
instantiator.instantiate_non_scriptable_remote_module_template()
)
self.assertTrue(hasattr(generated_module, "_remote_forward"))
self.assertTrue(hasattr(generated_module, "_generated_methods"))
# Check after run.
file_paths = dir_path.glob(f"{instantiator._FILE_PREFIX}*.py")
num_files_after = len(list(file_paths))
self.assertEqual(num_files_after, 1)
if __name__ == "__main__":
run_tests()

View File

@ -64,6 +64,38 @@ class TestDTensorDebugMode(TestCase):
self.assertTrue(isinstance(debug_mode.operators[2], _RedistributeCall))
self.assertEqual(next(iter(debug_mode.operators[1])), torch.ops.aten.mm.default)
# check stringification
self.assertTrue(hasattr(debug_mode.operators[0], "args_str"))
self.assertFalse(hasattr(debug_mode.operators[0], "args"))
# check recording hook
def mm(x, y):
return (x @ y).sum()
eager_out = mm(x_dtensor, y_dtensor)
# check recording hook for compiled variant
with (
DebugMode() as debug_mode,
DebugMode.record_outputs(),
DebugMode.log_tensor_hashes(),
):
compiled_out = torch.compile(mm, backend="aot_eager")(x_dtensor, y_dtensor)
# check numerical equivalence
self.assertTrue(torch.equal(eager_out, compiled_out))
sum_op = next(
iter(
op
for op in debug_mode.operators
if isinstance(op, _OpCall) and str(op.op) == "aten.sum.default"
)
)
self.assertTrue(torch.equal(sum_op.record["output"], eager_out.to_local()))
self.assertTrue(
"aten::sum(t: f32[1, 32]) # {'hash': " in debug_mode.debug_string()
)
def test_debug_string_inside_context(self):
mesh = DeviceMesh(self.device_type, list(range(self.world_size)))
@ -86,7 +118,9 @@ class TestDTensorDebugMode(TestCase):
x_dtensor = DTensor.from_local(x, mesh, [Shard(0)], run_check=False)
y_dtensor = DTensor.from_local(y, mesh, [Shard(1)], run_check=False)
with DebugMode(record_torchfunction=True) as debug_mode:
with DebugMode(
record_torchfunction=True, record_stack_trace=True
) as debug_mode:
z = x_dtensor + y_dtensor
z.sum().backward()
@ -119,6 +153,9 @@ class TestDTensorDebugMode(TestCase):
aten::detach(t: f32[1, 8])""",
)
# check stack trace
self.assertTrue("z.sum().backward()" in debug_mode.operators[-1].stack_trace)
def test_debug_mode_densor_redistribution_trace(self):
mesh = DeviceMesh(self.device_type, torch.arange(self.world_size).view(4, 2))
@ -267,6 +304,7 @@ class TestDTensorDebugMode(TestCase):
record_torchfunction=True,
record_faketensor=True,
record_tensor_attributes=["a1", "a2"],
store_original_args=True,
) as debug_mode:
torch.matmul(y, x)
@ -279,6 +317,9 @@ class TestDTensorDebugMode(TestCase):
aten::_unsafe_view(t: f32[64, 8], [8, 8, 8])""",
)
self.assertTrue(hasattr(debug_mode.operators[0], "args"))
self.assertEqual(id(debug_mode.operators[0].args[0]), id(y))
@parametrize("has_inner_mode", [True, False])
@parametrize("has_outer_mode", [True, False])
def test_nested_debug_mode(self, has_inner_mode, has_outer_mode):

View File

@ -20,18 +20,18 @@ from torch.distributed.tensor.experimental._attention import (
_cp_options,
_disable_context_parallel_dispatcher,
_enable_context_parallel_dispatcher,
_HeadTailLoadBalancer,
_is_causal_behavior,
_LoadBalancer,
_PerDocumentHeadTailLoadBalancer,
_PTRRLoadBalancer,
_RotateMethod,
context_parallel,
context_parallel_unshard,
set_rotate_method,
)
from torch.distributed.tensor.experimental._cp_custom_ops import flex_cp_allgather
from torch.distributed.tensor.experimental._load_balancer import (
_HeadTailLoadBalancer,
_LoadBalancer,
_PerDocumentHeadTailLoadBalancer,
_PTRRLoadBalancer,
from torch.distributed.tensor.experimental._context_parallel._cp_custom_ops import (
flex_cp_allgather,
)
from torch.distributed.tensor.parallel import parallelize_module
from torch.nn.attention import sdpa_kernel, SDPBackend

View File

@ -204,32 +204,28 @@ class DistConvolutionOpsTest(DTensorTestBase):
self.assertTrue(b_dt.grad is not None)
self.assertTrue(x_dt.grad is None)
def _run_single_arg_fwd(self, model, arg) -> tuple[torch.Tensor, torch.Tensor]:
"""Given model and arg, runs fwd model local and distbuted given device_mesh"""
device_mesh = self.build_device_mesh()
model_copy = copy.deepcopy(model).to(device=self.device_type)
dist_model = distribute_module(model, device_mesh, _conv_fn)
arg_dt = DTensor.from_local(arg, device_mesh, [Replicate()])
out_dt = dist_model(arg_dt.to(device=self.device_type))
out = model_copy(arg)
return (out_dt.full_tensor(), out)
@with_comms
def test_conv1d(self):
device_mesh = self.build_device_mesh()
model = nn.Conv1d(64, 64, 3, padding=1)
model_gt = copy.deepcopy(model)
x = torch.randn(1, 64, 8)
x_dt = DTensor.from_local(x, device_mesh, [Replicate()])
model_dt = distribute_module(
model, device_mesh, _conv_fn, input_fn=None, output_fn=None
)
out_dt = model_dt(x_dt)
out = model_gt(x)
x = torch.randn(1, 64, 8, device=self.device_type)
out_dt, out = self._run_single_arg_fwd(model, x)
self.assertEqual(out_dt.shape, out.shape)
@with_comms
def test_conv3d(self):
device_mesh = self.build_device_mesh()
model = nn.Conv3d(64, 64, 3, padding=1)
model_gt = copy.deepcopy(model).to(device=self.device_type)
x = torch.randn(1, 64, 8, 8, 8, device=self.device_type)
x_dt = DTensor.from_local(x, device_mesh, [Replicate()])
model_dt = distribute_module(
model, device_mesh, _conv_fn, input_fn=None, output_fn=None
)
out_dt = model_dt(x_dt)
out = model_gt(x)
out_dt, out = self._run_single_arg_fwd(model, x)
self.assertEqual(out_dt.shape, out.shape)

View File

@ -520,8 +520,6 @@ class DTensorExportTest(TestCase):
2,
)
# "Explanation: SourcelessBuilder.create does not know how to wrap <class 'types.UnionType'>"
@unittest.expectedFailure
def test_union_typed_annotation(self):
def fn(leaf: torch.Tensor | DTensor):
def nest_fn(leaf: torch.Tensor | DTensor):
@ -535,7 +533,7 @@ class DTensorExportTest(TestCase):
z = torch.randn(16, 16)
gm = graph_capture_and_aot_export_joint_with_descriptors(fn, (z,))
print(gm)
self.assertEqual(fn(z), gm(z)[0])
instantiate_parametrized_tests(DTensorExportTest)

View File

@ -981,6 +981,41 @@ class TestComputeCommReorderingBucketing(TestComputeCommReorderingMultiProc):
correct = func(a, b, c, ranks=ranks)
self.assertTrue(same(out, correct))
@unittest.skipIf(not HAS_GPU, "Inductor+gpu needs triton and recent GPU arch")
@torch._inductor.config.patch(get_bucket_patches())
def test_basic_all_reduce_bucketing(self):
"""Test that independent all_reduce operations get bucketed together."""
def func(a, b, c):
# Three independent all_reduces that should be bucketed
ar1 = _functional_collectives.all_reduce(a, "sum", "0")
ar2 = _functional_collectives.all_reduce(b, "sum", "0")
ar3 = _functional_collectives.all_reduce(c, "sum", "0")
return ar1.sum() + ar2.sum() + ar3.sum()
with _dynamo_dist_per_rank_init(
self.rank,
self.world_size,
self.backend(device_type),
fake_pg=not at_least_x_gpu(2),
):
a = torch.ones(4, 4, dtype=torch.float, device=device_type) + self.rank
b = torch.ones(4, 4, dtype=torch.float, device=device_type) * 2
c = torch.ones(4, 4, dtype=torch.float, device=device_type) * 3
compiled = torch.compile(func)
out, aten_graph_str = run_and_get_aten_graph(compiled, a, b, c)
# Should see a single bucketed all_reduce
FileCheck().check_count(
"torch.ops._c10d_functional.wait_tensor.default", 1, exactly=True
).run(aten_graph_str)
# Verify correctness
correct = func(a, b, c)
self.assertTrue(same(out, correct))
if __name__ == "__main__":
from torch._dynamo.test_case import run_tests

View File

@ -2,7 +2,6 @@
import copy
import json
import logging
import os
import pickle
import random
@ -22,7 +21,6 @@ from unittest import mock, SkipTest
import torch
import torch.distributed as c10d
import torch.distributed._functional_collectives as _functional_collectives
from torch.distributed.distributed_c10d import SHRINK_ABORT as NCCL_SHRINK_ABORT
if not c10d.is_available() or not c10d.is_nccl_available():
@ -49,15 +47,12 @@ from torch._C._distributed_c10d import ErrorType, OpType, WorkResult
from torch.nn.parallel import DistributedDataParallel
from torch.testing._internal.common_cuda import _get_torch_rocm_version, TEST_MULTIGPU
from torch.testing._internal.common_distributed import (
get_required_world_size,
get_timeout,
init_multigpu_helper,
MultiProcessTestCase,
requires_multicast_support,
requires_nccl,
requires_nccl_shrink,
requires_nccl_version,
requires_world_size,
skip_if_lt_x_gpu,
skip_if_rocm_multiprocess,
sm_is_or_higher_than,
@ -92,17 +87,6 @@ BFLOAT16_AVAILABLE = torch.cuda.is_available() and (
torch.version.cuda is not None or torch.version.hip is not None
)
from logging_utils import (
configure as _log_configure,
log_test_info,
log_test_success,
log_test_validation,
log_test_warning,
)
_log_configure(level=logging.INFO, force=True)
class RendezvousEnvTest(TestCase):
@retry_on_connect_failures
@ -333,7 +317,7 @@ class ProcessGroupNCCLGroupTest(MultiProcessTestCase):
@property
def world_size(self):
return get_required_world_size(self, 2)
return 2
@property
def rank_to_GPU(self):
@ -1271,628 +1255,6 @@ class ProcessGroupNCCLGroupTest(MultiProcessTestCase):
pg_2 = c10d.new_group([0, 1])
self.assertEqual(pg_2.group_desc, "undefined")
@requires_nccl_shrink()
@requires_world_size(2)
def test_shrink_group_basic(self):
"""Test basic shrink_group functionality."""
self._perform_shrink_test([1], "Basic shrink test")
@requires_nccl_shrink()
@requires_world_size(2)
def test_shrink_group_validation(self):
"""Test input validation in shrink_group."""
device, pg = self._setup_shrink_test("validation")
def _test_invalid_input(ranks, description, expected_exception):
"""Helper to test invalid inputs."""
try:
c10d.shrink_group(ranks)
self.fail(f"Expected {expected_exception.__name__} for {description}")
except expected_exception:
log_test_validation(self.rank, f"{description}")
except Exception:
if expected_exception is Exception: # Accept any exception
log_test_validation(self.rank, f"{description}")
else:
raise
# Test cases
_test_invalid_input([], "Empty exclusion list", ValueError)
if self.world_size > 1:
_test_invalid_input([0, 0, 1], "Duplicate ranks", Exception)
_test_invalid_input([self.world_size + 1], "Out of bounds rank", Exception)
log_test_success(self.rank, "All validation tests passed")
dist.destroy_process_group()
@requires_nccl_shrink()
@requires_world_size(2)
def test_shrink_group_backend_properties(self):
"""Test that backend properties are preserved after shrinking."""
test_name = "Backend Properties Test"
ranks_to_exclude = [0]
# Reuse _setup_shrink_test for complete setup (device, environment, and process group)
device, pg = self._setup_shrink_test("backend_properties")
# Follow _perform_shrink_test pattern from here
log_test_info(self.rank, f"{test_name} (world_size={self.world_size})")
is_excluded = self.rank in ranks_to_exclude
log_test_info(
self.rank,
f"Excluding ranks: {ranks_to_exclude}, am_excluded: {is_excluded}",
)
# Store original backend property values (not references) before shrinking
original_timeout = None
original_high_priority = None
if not is_excluded:
original_backend = pg._get_backend(device)
original_timeout = original_backend.options._timeout
original_high_priority = original_backend.options.is_high_priority_stream
log_test_info(
self.rank,
f"Storing original backend properties: timeout={original_timeout}, high_priority={original_high_priority}",
)
if is_excluded:
log_test_info(
self.rank,
f"Excluded rank {self.rank} - setup complete, skipping shrink operation",
)
dist.destroy_process_group() # hang without it
return
# Only non-excluded ranks proceed with shrink (same as _perform_shrink_test)
log_test_info(self.rank, "Non-excluded rank calling shrink_group")
shrunk_pg = c10d.shrink_group(ranks_to_exclude)
# Reuse _validate_shrunk_group helper (same as _perform_shrink_test)
expected_size = self.world_size - len(ranks_to_exclude)
_ = self._validate_shrunk_group(shrunk_pg, expected_size, test_name)
# Add custom backend properties validation
new_backend = shrunk_pg._get_backend(device)
log_test_info(self.rank, "Validating backend properties are preserved")
new_timeout = new_backend.options._timeout
new_high_priority = new_backend.options.is_high_priority_stream
log_test_info(
self.rank,
f"Timeout comparison - original: {original_timeout}, new: {new_timeout}",
)
self.assertEqual(
original_timeout, new_timeout, f"{test_name}: timeout not preserved"
)
log_test_info(
self.rank,
f"High priority stream comparison - original: {original_high_priority}, new: {new_high_priority}",
)
self.assertEqual(
original_high_priority,
new_high_priority,
f"{test_name}: high_priority_stream not preserved",
)
log_test_validation(
self.rank, f"{test_name}: Backend properties preserved successfully"
)
log_test_success(
self.rank, f"{test_name} successful (shrink + backend validation)"
)
# Cleanup (same as _perform_shrink_test)
dist.destroy_process_group()
@requires_nccl_shrink()
@requires_world_size(2)
def test_shrink_group_multiple_comms(self):
"""Test shrink_group with multiple communicators and subgroup invalidation."""
device, pg = self._setup_shrink_test("multiple_comms")
# Create subgroup [0, 1] and test shrinking it
subgroup = c10d.new_group([0, 1])
if self.rank <= 1:
# Shrink subgroup: exclude rank 1
if self.rank == 0: # Only rank 0 remains
shrunk_subgroup = c10d.shrink_group([1], group=subgroup)
self.assertEqual(shrunk_subgroup.size(), 1)
# Test communication on shrunk subgroup
tensor = torch.full((1,), self.rank).cuda(device)
c10d.all_reduce(tensor, group=shrunk_subgroup)
self.assertEqual(tensor.item(), 0) # Only rank 0
log_test_success(self.rank, "Subgroup shrinking successful")
dist.barrier() # Sync before default group test
# Shrink default group: exclude last rank
ranks_to_exclude = [self.world_size - 1]
if self.rank not in ranks_to_exclude:
shrunk_default = c10d.shrink_group(ranks_to_exclude)
expected_size = self.world_size - 1
self.assertEqual(shrunk_default.size(), expected_size)
# Test collective on shrunk default group
tensor = torch.full((1,), self.rank).cuda(device)
c10d.all_reduce(tensor, group=shrunk_default)
expected_sum = sum(
range(self.world_size - 1)
) # 0 + 1 + ... + (world_size-2)
self.assertEqual(tensor.item(), expected_sum)
log_test_success(self.rank, "Default group shrinking successful")
# Note: After shrinking default group, the old subgroup is invalid
# due to global rank reassignment
dist.destroy_process_group()
def _test_shrink_group_with_flag(self, shrink_flag, flag_name, rank_to_exclude):
"""Helper method to test shrink_group with a specific flag."""
if self.world_size < 2:
log_test_info(self.rank, f"Skipping (needs ≥2 GPUs, got {self.world_size})")
return
ranks_to_exclude = [rank_to_exclude]
log_test_info(self.rank, f"Using {flag_name} flag (value: {shrink_flag})")
if flag_name == "NCCL_SHRINK_ABORT":
log_test_info(
self.rank,
"ABORT flag will terminate ongoing operations before shrinking",
)
self._perform_shrink_test(
ranks_to_exclude, f"{flag_name} flag test", shrink_flags=shrink_flag
)
@requires_nccl_shrink()
@requires_world_size(2)
def test_shrink_group_flags(self):
"""Test shrink_group with different shrink flags."""
# Test ABORT flags
log_test_info(self.rank, "Testing NCCL_SHRINK_ABORT flag")
self._test_shrink_group_with_flag(NCCL_SHRINK_ABORT, "NCCL_SHRINK_ABORT", 1)
@requires_nccl_shrink()
@requires_world_size(2)
def test_shrink_group_nccl_config(self):
"""Verify that passing NCCL config via pg_options influences the shrunk group's backend options."""
device, pg = self._setup_shrink_test("config")
if self.rank == self.world_size - 1:
# excluded rank should not call shrink_group
dist.destroy_process_group()
return
# Prepare pg_options with NCCL config overrides
# Capture parent's current backend options to ensure we can prove override vs inherit
parent_backend = pg._get_backend(torch.device("cuda"))
parent_hp = parent_backend.options.is_high_priority_stream
parent_blocking = parent_backend.options.config.blocking
# Choose overrides that differ from the parent (flip where possible)
override_hp = not parent_hp
if parent_blocking in (0, 1):
override_blocking = 1 - parent_blocking
else:
# If undefined or unexpected, set to 1 which is a concrete value
override_blocking = 1
opts = c10d.ProcessGroupNCCL.Options()
opts.is_high_priority_stream = override_hp
opts.config.blocking = override_blocking
shrunk_pg = c10d.shrink_group([self.world_size - 1], pg_options=opts)
# Validate backend options propagated
backend = shrunk_pg._get_backend(torch.device("cuda"))
# is_high_priority_stream should exactly match our override and differ from parent
self.assertEqual(backend.options.is_high_priority_stream, override_hp)
self.assertNotEqual(backend.options.is_high_priority_stream, parent_hp)
# config is a struct; check representative field and difference from parent when meaningful
self.assertEqual(backend.options.config.blocking, override_blocking)
if parent_blocking in (0, 1):
self.assertNotEqual(backend.options.config.blocking, parent_blocking)
dist.destroy_process_group()
@requires_nccl_shrink()
@requires_world_size(2)
def test_shrink_group_performance(self):
"""Test shrink_group performance and regression detection."""
import time
ranks_to_exclude = self._get_default_ranks_to_exclude()
is_excluded = self.rank in ranks_to_exclude
if not ranks_to_exclude:
log_test_info(self.rank, "Skipping performance test (world_size=1)")
return
log_test_info(self.rank, f"Performance test with {self.world_size} processes")
device, pg = self._setup_shrink_test("performance")
if not is_excluded:
log_test_info(self.rank, "Measuring shrink_group performance")
start_time = time.time()
shrunk_pg = c10d.shrink_group(ranks_to_exclude)
end_time = time.time()
elapsed_time = end_time - start_time
log_test_info(self.rank, f"shrink_group: {elapsed_time:.3f}s")
# Regression check: should complete within reasonable time
self.assertLess(
elapsed_time,
30.0,
f"shrink_group took {elapsed_time:.3f}s, possible regression",
)
# Test collective performance
expected_size = self.world_size - len(ranks_to_exclude)
self._validate_shrunk_group(shrunk_pg, expected_size, "performance")
collective_start = time.time()
_ = self._test_collective_on_shrunk_group(
shrunk_pg, device, ranks_to_exclude, "performance"
)
collective_time = time.time() - collective_start
log_test_info(self.rank, f"all_reduce: {collective_time:.3f}s")
log_test_success(self.rank, "Performance test passed")
else:
log_test_info(self.rank, "Excluded rank - waiting")
dist.destroy_process_group()
@requires_nccl_shrink()
@requires_world_size(4)
def test_shrink_group_multiple_exclusions(self):
"""Test shrink_group with multiple ranks excluded at once."""
# Scale exclusions with world size
ranks_to_exclude = list(range(2, self.world_size, 2)) # Every other rank from 2
self._perform_shrink_test(ranks_to_exclude, "Multiple exclusions test")
@requires_nccl_shrink()
@requires_world_size(3)
def test_shrink_group_multiple_iterations(self):
"""Test multiple shrink operations in sequence."""
log_test_info(
self.rank,
f"Starting test_shrink_group_multiple_iterations with world_size={self.world_size}",
)
store = c10d.FileStore(self.file_name, self.world_size)
device = torch.device(f"cuda:{self.rank}")
_ = self._create_process_group_nccl(store, self.opts(), device_id=device)
# Track current effective world size throughout shrinking operations
current_world_size = self.world_size
log_test_info(self.rank, f"Initial world_size: {current_world_size}")
# First shrinking: exclude the last rank(s)
first_exclusion = [self.world_size - 1]
if self.world_size >= 6:
first_exclusion.append(
self.world_size - 2
) # Exclude last two ranks for larger sizes
log_test_info(self.rank, f"First shrinking: excluding ranks {first_exclusion}")
if self.rank not in first_exclusion:
# Only non-excluded ranks should call shrink_group
first_pg = c10d.shrink_group(first_exclusion)
self.assertIsNotNone(first_pg)
# IMPORTANT: Update world size after first shrinking
current_world_size = first_pg.size()
expected_first_size = self.world_size - len(first_exclusion)
log_test_info(
self.rank,
f"After first shrinking: world_size {self.world_size} -> {current_world_size}",
)
self.assertEqual(first_pg.size(), expected_first_size)
# Second shrinking: exclude another rank from the remaining group
# Choose a rank that's in the middle range
if current_world_size >= 3:
second_exclusion = [
current_world_size - 1
] # Exclude the new "last" rank
log_test_info(
self.rank,
f"Second shrinking from group of size {current_world_size}: excluding ranks {second_exclusion}",
)
if self.rank not in second_exclusion:
# Only non-excluded ranks should call shrink_group for second iteration
second_pg = c10d.shrink_group(second_exclusion, group=first_pg)
self.assertIsNotNone(second_pg)
# IMPORTANT: Update world size after second shrinking
final_world_size = second_pg.size()
expected_final_size = current_world_size - len(second_exclusion)
log_test_info(
self.rank,
f"After second shrinking: world_size {current_world_size} -> {final_world_size}",
)
self.assertEqual(second_pg.size(), expected_final_size)
# Test collective on final group
tensor = torch.full((1,), self.rank).cuda(device)
log_test_info(
self.rank,
f"Performing all_reduce on final group (size {final_world_size}) with tensor: {tensor.item()}",
)
c10d.all_reduce(tensor, group=second_pg)
log_test_info(
self.rank,
f"Final all_reduce completed, result: {tensor.item()}",
)
# Calculate expected sum of remaining ranks
all_excluded = set(first_exclusion + second_exclusion)
remaining_ranks = [
r for r in range(self.world_size) if r not in all_excluded
]
expected_sum = sum(remaining_ranks)
log_test_info(
self.rank,
f"Remaining ranks: {remaining_ranks}, expected sum: {expected_sum}, actual: {tensor.item()}",
)
self.assertEqual(tensor.item(), expected_sum)
log_test_info(self.rank, "Final verification passed")
else:
log_test_info(
self.rank,
"This rank excluded in second shrinking, not calling shrink_group",
)
else:
log_test_info(
self.rank, "Skipping second shrinking (remaining group too small)"
)
else:
log_test_info(
self.rank,
"This rank excluded in first shrinking, not calling shrink_group",
)
log_test_info(self.rank, "Destroying process group")
dist.destroy_process_group()
log_test_info(self.rank, "test_shrink_group_multiple_iterations completed")
# Helper methods for optimized shrink group tests
def _setup_shrink_test(self, test_suffix, world_size=None, warmup=True):
"""Common setup for shrink group tests."""
os.environ["TORCH_NCCL_USE_COMM_NONBLOCKING"] = "1"
world_size = world_size or self.world_size
store = c10d.FileStore(self.file_name + f"_{test_suffix}", world_size)
device = torch.device(f"cuda:{self.rank}")
c10d.init_process_group(
"nccl",
world_size=world_size,
rank=self.rank,
store=store,
pg_options=self.opts(),
device_id=device,
)
pg = c10d.distributed_c10d._get_default_group()
if warmup:
c10d.all_reduce(torch.ones(1).cuda(device), group=pg)
return device, pg
def _validate_shrunk_group(self, shrunk_pg, expected_size, test_name=""):
"""Validate properties of a shrunk process group."""
self.assertIsNotNone(shrunk_pg, f"{test_name}: shrunk_pg should not be None")
actual_size = shrunk_pg.size()
self.assertEqual(
actual_size, expected_size, f"{test_name}: group size mismatch"
)
new_rank = shrunk_pg.rank()
self.assertTrue(
0 <= new_rank < expected_size, f"{test_name}: invalid new rank {new_rank}"
)
log_test_info(
self.rank,
f"{test_name}: world_size {self.world_size} -> {actual_size}, rank {self.rank} -> {new_rank}",
)
return new_rank
def _test_collective_on_shrunk_group(
self, shrunk_pg, device, ranks_to_exclude, test_name=""
):
"""Test collective communication on shrunk group and verify correctness."""
test_tensor = torch.full((1,), self.rank, device=device, dtype=torch.float32)
c10d.all_reduce(test_tensor, group=shrunk_pg)
result = test_tensor.item()
expected_sum = sum(
r for r in range(self.world_size) if r not in ranks_to_exclude
)
self.assertEqual(
result, expected_sum, f"{test_name}: collective result mismatch"
)
log_test_info(
self.rank, f"{test_name}: collective passed ({result} == {expected_sum})"
)
return result
def _perform_shrink_test(
self, ranks_to_exclude, test_name, shrink_flags=0, with_collective=True
):
"""Complete shrink test flow: setup, shrink, validate, test collective, cleanup.
Consistent API: All ranks perform setup to initialize distributed environment.
ONLY non-excluded ranks call shrink_group() for both default and non-default groups.
Excluded ranks perform setup, then exit without calling shrink_group() or waiting.
"""
log_test_info(self.rank, f"{test_name} (world_size={self.world_size})")
is_excluded = self.rank in ranks_to_exclude
log_test_info(
self.rank,
f"Excluding ranks: {ranks_to_exclude}, am_excluded: {is_excluded}",
)
# All ranks (including excluded ones) perform setup to initialize distributed environment
device, pg = self._setup_shrink_test(test_name.lower().replace(" ", "_"))
is_default_group = pg == c10d.distributed_c10d._get_default_group()
if is_excluded:
log_test_info(
self.rank,
f"Excluded rank {self.rank} - setup complete, skipping shrink operation",
)
if shrink_flags & NCCL_SHRINK_ABORT:
log_test_info(self.rank, f"Using abort for excluded rank {self.rank}")
pg._get_backend(torch.device(device)).abort()
log_test_info(
self.rank, f"cleanup resources for excluded rank {self.rank}"
)
dist.destroy_process_group()
log_test_info(self.rank, f"Excluded rank {self.rank} - exit")
else:
log_test_info(
self.rank, f"Using regular destroy for excluded rank {self.rank}"
)
dist.destroy_process_group()
return None
# Only non-excluded ranks proceed with shrink
log_test_info(
self.rank,
f"Non-excluded rank calling shrink_group (default_group={is_default_group})",
)
shrunk_pg = c10d.shrink_group(ranks_to_exclude, shrink_flags=shrink_flags)
log_test_info(
self.rank,
f"Non-excluded rank calling shrink_group (default_group={is_default_group}) done",
)
# Non-excluded ranks: validate and test the new group
expected_size = self.world_size - len(ranks_to_exclude)
_ = self._validate_shrunk_group(shrunk_pg, expected_size, test_name)
if with_collective:
_ = self._test_collective_on_shrunk_group(
shrunk_pg, device, ranks_to_exclude, test_name
)
log_test_success(self.rank, f"{test_name} successful (shrink + collective)")
else:
log_test_success(self.rank, f"{test_name} successful (shrink only)")
dist.destroy_process_group()
return shrunk_pg
def _get_default_ranks_to_exclude(self):
"""Get default ranks to exclude based on world size."""
if self.world_size <= 1:
return []
return [self.world_size - 1] # Exclude last rank by default
@requires_nccl_shrink()
@requires_world_size(3)
def test_shrink_group_vs_abort_reinit_performance(self):
"""Compare performance of shrink_group vs traditional abort+reinit (simplified for reliability)."""
log_test_info(self.rank, "=== TEST 1: abort+reinit ===")
device, pg1 = self._setup_shrink_test("_perf_reinit")
torch.cuda.synchronize(device)
# Test 1: Traditional abort + reinit
start_time = time.perf_counter()
dist.destroy_process_group()
device, new_pg = self._setup_shrink_test("perf_shrink_test1")
reinit_time = time.perf_counter() - start_time
# Test collective with original rank values for fair comparison (non-blocking mode)
test_tensor = torch.full((1,), self.rank, device=device, dtype=torch.float32)
work = c10d.all_reduce(test_tensor, group=new_pg, async_op=True)
work.wait()
torch.cuda.synchronize(device)
# Verify correctness
expected_sum = sum(r for r in range(self.world_size))
self.assertEqual(test_tensor.item(), expected_sum, "Reinit collective failed")
log_test_info(self.rank, f"abort+reinit: {reinit_time:.4f}s")
dist.destroy_process_group(new_pg)
# Test 2: shrink_group with NCCL_SHRINK_ABORT
log_test_info(self.rank, "=== TEST 2: shrink_group ===")
ranks_to_exclude = [self.world_size - 1]
is_excluded = self.rank in ranks_to_exclude
log_test_info(
self.rank,
f"Excluding ranks: {ranks_to_exclude}, am_excluded: {is_excluded}",
)
device, pg1 = self._setup_shrink_test("perf_shrink_test2") # Unique suffix
shrink_time = 0
if not is_excluded:
torch.cuda.synchronize(device) # Ensure accurate timing
start_time = time.perf_counter()
shrunk_pg = c10d.shrink_group(
ranks_to_exclude, shrink_flags=NCCL_SHRINK_ABORT
)
c10d.all_reduce(torch.ones(1).cuda(device), group=shrunk_pg)
shrink_time = time.perf_counter() - start_time
# Test collective communication on shrunk group (non-blocking mode)
test_tensor = torch.full(
(1,), self.rank, device=device, dtype=torch.float32
)
work = c10d.all_reduce(test_tensor, group=shrunk_pg, async_op=True)
work.wait()
# Verify correctness
expected_sum = sum(
r for r in range(self.world_size) if r not in ranks_to_exclude
)
self.assertEqual(
test_tensor.item(),
expected_sum,
"shrink_test: collective result mismatch",
)
torch.cuda.synchronize(device) # Ensure operations complete
log_test_info(self.rank, f"shrink_group: {shrink_time:.4f}s")
dist.destroy_process_group()
else:
log_test_info(self.rank, "Excluded from shrink test - exiting immediately")
dist.destroy_process_group()
return
# Performance analysis (only for participating ranks)
if shrink_time > 0 and reinit_time > 0:
speedup = reinit_time / shrink_time
time_saved = reinit_time - shrink_time
log_test_info(self.rank, "=== PERFORMANCE RESULTS ===")
log_test_info(self.rank, f"shrink_group: {shrink_time:.4f}s")
log_test_info(self.rank, f"abort+reinit: {reinit_time:.4f}s")
log_test_info(self.rank, f"time_saved: {time_saved:+.4f}s")
log_test_info(self.rank, f"speedup: {speedup:.2f}x")
if speedup > 1.1:
log_test_success(self.rank, "shrink_group significantly faster")
elif speedup > 0.9:
log_test_info(self.rank, "≈ comparable performance")
else:
log_test_warning(self.rank, "abort+reinit faster")
log_test_info(self.rank, "Performance test completed")
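The performance analysis above classifies the result with two thresholds: a speedup above 1.1x counts as significantly faster, anything between 0.9x and 1.1x as comparable. A torch-free sketch of that classification (hypothetical function name, mirroring the test's logic):

```python
def classify_speedup(reinit_time, shrink_time):
    # Same thresholds as the performance test: >1.1x is a clear win for
    # shrink_group, 0.9x-1.1x is a wash, below that abort+reinit won.
    speedup = reinit_time / shrink_time
    if speedup > 1.1:
        return "shrink_group significantly faster"
    elif speedup > 0.9:
        return "comparable"
    return "abort+reinit faster"
```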
@requires_nccl()
@skip_but_pass_in_sandcastle_if(not TEST_MULTIGPU, "NCCL test requires 2+ GPUs")
def test_deterministic_mode_no_break(self):


@@ -195,7 +195,7 @@ if not TEST_WITH_DEV_DBG_ASAN:
for i, t in enumerate(tensors):
self.assertEqual(t, torch.ones(5, 5, device=device) + i)
elif self.rank == 0:
-                for i, t in enumerate(tensors):
+                for t in tensors:
zeros = torch.zeros(5, 5, device=device)
self.assertEqual(t, zeros)
y = torch.sum(torch.stack(tensors), axis=0)


@@ -408,6 +408,79 @@ class TestOverlapPreservingBucketing(InductorTestCase):
"%all_gather_into_tensor_out", 1, exactly=False
).run(graph_str)
def test_can_bucket_all_reduce(self):
"""
Test that all_reduce operations CAN bucket together.
Graph structure:
ar1_start -> ar2_start -> mm1 (hides ar1) -> mm2 (hides ar2) -> ar1_wait -> ar2_wait
"""
def func(a, b):
group_name = "0"
# Start both all_reduce operations
ar1 = torch.ops._c10d_functional.all_reduce(a, "sum", group_name)
ar2 = torch.ops._c10d_functional.all_reduce(b, "sum", group_name)
# Independent compute that can hide both
mm1 = torch.mm(a, a)
mm2 = torch.mm(b, b)
# Wait for both
ar1_out = torch.ops._c10d_functional.wait_tensor(ar1)
ar2_out = torch.ops._c10d_functional.wait_tensor(ar2)
return ar1_out.sum() + ar2_out.sum() + mm1.sum() + mm2.sum()
# Use fake mode to trace without executing
with FakeTensorMode():
a = torch.ones(4, 4, device=self.device)
b = torch.ones(4, 4, device=self.device) * 2
# Trace with make_fx
traced = make_fx(func)(a, b)
# Find nodes
ar1, ar2 = traced.graph.find_nodes(
op="call_function",
target=torch.ops._c10d_functional.all_reduce.default,
)
mm1, mm2 = traced.graph.find_nodes(
op="call_function", target=torch.ops.aten.mm.default
)
# For all_reduce, start_node == wait_node (no separate wait)
hiding_annotations = {
ar1: mm1,
ar2: mm2,
}
# Build collective info
collective_info = build_collective_info(traced.graph, hiding_annotations)
node_ancestors = compute_ancestors(traced.graph)
scheduled = OrderedSet(traced.graph.nodes)
# Run bucketing
from torch._inductor.fx_passes.overlap_preserving_bucketer import (
OverlapPreservingBucketer,
)
bucketer = OverlapPreservingBucketer(
traced.graph,
collective_info,
node_ancestors,
scheduled,
)
bucketer.bucket_collectives()
# Verify: should have 1 bucketed all_reduce
# After bucketing, there should be only one all_reduce node (the bucketed one)
graph_str = str(traced.graph)
FileCheck().check_count("%all_reduce", 1, exactly=True).check_count(
"%mm", 2
).run(graph_str)
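The bucketing expectation in this test — two `all_reduce` calls merging into one — comes down to grouping collectives that share a compatible key. A toy, torch-free sketch of that grouping (names hypothetical; the real `OverlapPreservingBucketer` additionally checks hiding intervals and ancestor relations before merging):

```python
def bucket_collectives(calls):
    """Greedily group adjacent collective calls that share (op, group)."""
    buckets = []
    for call in calls:
        key = (call["op"], call["group"])
        if buckets and buckets[-1][0] == key:
            # Same op on the same process group: merge into the open bucket.
            buckets[-1][1].append(call)
        else:
            buckets.append((key, [call]))
    return [grouped for _, grouped in buckets]
```

Two `all_reduce` calls on group `"0"` land in a single bucket, matching the `check_count("%all_reduce", 1, exactly=True)` assertion above.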
def test_can_bucket_multidtype_collectives(self):
"""
Test that all_gathers with different dtypes CAN bucket together.


@@ -1718,6 +1718,39 @@ SeqNr|OrigAten|SrcFn|FwdSrcFn
self.assertEqual(eager_no_sq, comp_ind_no_sq)
self.assertEqual(eager_no_sq.stride(), comp_ind_no_sq.stride())
@torch._dynamo.config.patch(capture_scalar_outputs=True)
@torch._dynamo.config.patch(capture_dynamic_output_shape_ops=True)
def test_unbacked_activation_specialized_in_inductor(self):
"""Test compilation with unbacked operations like nonzero."""
torch._dynamo.reset()
def fuzzed_program(arg_0, sentinel):
var_node_1 = arg_0
var_node_5 = torch.full((1, 2), -66, dtype=torch.int32)
var_node_6 = torch.full((1, 2), 77, dtype=torch.int64)
var_node_4 = torch.ops.aten.add(var_node_5, var_node_6)
var_node_7 = torch.full((1, 2), -64, dtype=torch.int32)
var_node_3 = torch.ops.aten.mul(var_node_4, var_node_7)
var_node_9 = torch.full((3, 4), False, dtype=torch.bool)
var_node_8 = torch.nonzero(var_node_9)
var_node_2 = torch.ops.aten.add(var_node_3, var_node_8)
var_node_0 = torch.ops.aten.div(var_node_1, var_node_2)
result = var_node_0 * sentinel
if result.is_complex():
result = result.real
return result
sentinel = torch.tensor(1.0, requires_grad=True)
arg_0 = torch.randint(0, 3, (1, 2), dtype=torch.int64)
args = (arg_0,) + (sentinel,)
result_original = fuzzed_program(*args)
compiled_program = torch.compile(fuzzed_program, fullgraph=True, dynamic=True)
result_compiled = compiled_program(*args)
self.assertTrue(torch.allclose(result_original, result_compiled))
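The program above is "unbacked" because `torch.nonzero` has a data-dependent output shape: its row count is unknown at trace time. A minimal standalone illustration of where that dynamism enters (shapes only, no compilation):

```python
import torch

# nonzero on an all-False bool tensor returns zero rows, one column per
# input dim; the row count depends on the data, not the input shape.
mask = torch.full((3, 4), False, dtype=torch.bool)
idx = torch.nonzero(mask)  # shape (0, 2)

# Broadcasting a (1, 2) tensor against the (0, 2) result yields (0, 2),
# so every downstream op inherits the data-dependent dimension.
vals = torch.full((1, 2), -66, dtype=torch.int32)
out = vals + idx  # shape (0, 2)
```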
if __name__ == "__main__":
from torch._dynamo.test_case import run_tests


@@ -41,6 +41,20 @@ from torch.testing._internal.triton_utils import requires_cuda_and_triton
from torch.testing._internal.two_tensor import TwoTensor
def aot_eager_regional_inductor():
"""
Regional inductor backend for AOT autograd.
Uses regional_inductor as both forward and backward compiler.
"""
from torch._dynamo.backends.common import aot_autograd
from torch.fx.passes.regional_inductor import regional_inductor
return aot_autograd(
fw_compiler=regional_inductor,
bw_compiler=regional_inductor,
)
def saved_tensors_hooks_to_gm(
pack_fn,
unpack_fn,
@@ -1898,6 +1912,171 @@ class AOTAutogradCacheTests(InductorTestCase):
# no recompiles
self.assertFalse(counters)
@inductor_config.patch("fx_graph_remote_cache", False)
@inductor_config.patch("fx_graph_cache", True)
@functorch_config.patch({"enable_autograd_cache": True})
@functorch_config.patch({"bundled_autograd_cache": True})
def test_regional_inductor_basic(self):
"""
Basic test for regional inductor with bundled autograd cache.
Tests that regional inductor compilation results can be cached and hit.
"""
import torch.fx.traceback as fx_traceback
def fn(x, y):
sin = torch.sin(x)
# Mark this region to be compiled with inductor
with fx_traceback.annotate({"compile_with_inductor": 0}):
mul = sin * y
add = mul + 1
return torch.sin(add)
x = torch.randn(10, device="cpu")
y = torch.randn(10, device="cpu")
# Compile with regional inductor backend
compiled_fn = torch.compile(
fn, backend=aot_eager_regional_inductor(), fullgraph=True
)
# First call should miss in cache
result1 = compiled_fn(x, y)
self.assertEqual(counters["aot_autograd"]["autograd_cache_miss"], 1)
self.assertEqual(counters["aot_autograd"]["autograd_cache_hit"], 0)
self.assertEqual(counters["aot_autograd"]["autograd_cache_saved"], 1)
# Second call should hit (after clearing dynamo)
self._clear_dynamo_and_codecache()
result2 = compiled_fn(x, y)
self.assertEqual(counters["aot_autograd"]["autograd_cache_miss"], 1)
self.assertEqual(counters["aot_autograd"]["autograd_cache_hit"], 1)
self.assertEqual(counters["aot_autograd"]["autograd_cache_saved"], 1)
# Results should be the same
self.assertEqual(result1, result2)
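The miss/save-then-hit counter pattern these tests assert can be sketched with a toy cache (pure Python; `ToyCache` is illustrative, not the AOTAutograd cache): the first compilation for a key records a miss and a save, and a repeat of the same key records a hit without recompiling.

```python
class ToyCache:
    """Minimal sketch of the miss/save/hit accounting the tests check."""

    def __init__(self):
        self.store = {}
        self.counters = {"miss": 0, "hit": 0, "saved": 0}

    def compile(self, key, compile_fn):
        if key in self.store:
            self.counters["hit"] += 1
            return self.store[key]
        # Cache miss: run the (expensive) compilation and save the artifact.
        self.counters["miss"] += 1
        artifact = compile_fn()
        self.store[key] = artifact
        self.counters["saved"] += 1
        return artifact
```

A changed function hashes to a different key, so it misses again — the behavior `test_regional_inductor_cache_miss_on_change` below verifies against the real cache.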
@inductor_config.patch("fx_graph_remote_cache", False)
@inductor_config.patch("fx_graph_cache", True)
@functorch_config.patch({"enable_autograd_cache": True})
@functorch_config.patch({"bundled_autograd_cache": True})
def test_regional_inductor_with_backward(self):
"""
Test regional inductor with backward pass and bundled autograd cache.
Note: Regional inductor triggers multiple AOT autograd compilations:
- One for the outer graph (with regional inductor backend)
- One for each marked region (via standalone_compile)
"""
import torch.fx.traceback as fx_traceback
def fn(x, y):
sin = torch.sin(x)
# Mark this region to be compiled with inductor
with fx_traceback.annotate({"compile_with_inductor": 0}):
mul = sin * y
add = mul + 1
return torch.sin(add)
x = torch.randn(10, requires_grad=True)
y = torch.randn(10, requires_grad=True)
x2 = x.detach().clone().requires_grad_(True)
y2 = y.detach().clone().requires_grad_(True)
# Compile with regional inductor backend
compiled_fn = torch.compile(
fn, backend=aot_eager_regional_inductor(), fullgraph=True
)
# First call: AOT autograd compiles the outer graph (1 miss)
# Regional inductor then compiles the marked region (1 more miss)
result1 = compiled_fn(x, y)
result1.sum().backward()
# We expect 2 cache misses: outer graph + marked region
initial_misses = counters["aot_autograd"]["autograd_cache_miss"]
initial_saves = counters["aot_autograd"]["autograd_cache_saved"]
self.assertGreater(initial_misses, 0)
self.assertGreater(initial_saves, 0)
# Second call should hit (after clearing dynamo)
self._clear_dynamo_and_codecache()
result2 = compiled_fn(x2, y2)
result2.sum().backward()
# Should have cache hits now
final_hits = counters["aot_autograd"]["autograd_cache_hit"]
self.assertGreater(final_hits, 0)
# Cache misses and saves should not increase
self.assertEqual(
counters["aot_autograd"]["autograd_cache_miss"], initial_misses
)
self.assertEqual(
counters["aot_autograd"]["autograd_cache_saved"], initial_saves
)
# Results and gradients should be the same
self.assertEqual(result1, result2)
self.assertEqual(x.grad, x2.grad)
self.assertEqual(y.grad, y2.grad)
@inductor_config.patch("fx_graph_remote_cache", False)
@inductor_config.patch("fx_graph_cache", True)
@functorch_config.patch({"enable_autograd_cache": True})
@functorch_config.patch({"bundled_autograd_cache": True})
def test_regional_inductor_cache_miss_on_change(self):
"""
Test that changing the function causes a cache miss with regional inductor.
Regional inductor creates multiple AOT compilations, so we track
the change in cache misses rather than absolute counts.
"""
import torch.fx.traceback as fx_traceback
def fn1(x, y):
sin = torch.sin(x)
with fx_traceback.annotate({"compile_with_inductor": 0}):
mul = sin * y
add = mul + 1
return torch.sin(add)
def fn2(x, y):
sin = torch.sin(x)
with fx_traceback.annotate({"compile_with_inductor": 0}):
mul = sin * y
add = mul + 2 # Changed from +1 to +2
return torch.sin(add)
x = torch.randn(10)
y = torch.randn(10)
# Compile first function
compiled_fn1 = torch.compile(
fn1, backend=aot_eager_regional_inductor(), fullgraph=True
)
result1 = compiled_fn1(x, y)
first_misses = counters["aot_autograd"]["autograd_cache_miss"]
first_saves = counters["aot_autograd"]["autograd_cache_saved"]
self.assertGreater(first_misses, 0)
self.assertEqual(counters["aot_autograd"]["autograd_cache_hit"], 0)
self.assertGreater(first_saves, 0)
# Compile second function (different graph)
self._clear_dynamo_and_codecache()
compiled_fn2 = torch.compile(
fn2, backend=aot_eager_regional_inductor(), fullgraph=True
)
result2 = compiled_fn2(x, y)
# Should miss because graph is different (more misses than before)
self.assertGreater(
counters["aot_autograd"]["autograd_cache_miss"], first_misses
)
self.assertEqual(counters["aot_autograd"]["autograd_cache_hit"], 0)
self.assertGreater(
counters["aot_autograd"]["autograd_cache_saved"], first_saves
)
# Results should be different
self.assertNotEqual(result1, result2)
@functorch_config.patch({"bundled_autograd_cache": True})
class AOTAutogradCacheBundledTests(AOTAutogradCacheTests):


@@ -582,6 +582,23 @@ from user code:
actual = compiled_fn(fn, *inputs)
self.assertEqual(expected, actual)
def test_aot_compile_with_default_args(self):
def fn(x, y=1):
return x + x
compiled_fn = torch.compile(fn, fullgraph=True).aot_compile(
((torch.randn(3, 4),), {})
)
inputs = (torch.randn(3, 4),)
expected = fn(*inputs)
actual = compiled_fn(*inputs)
self.assertEqual(expected, actual)
compiled_fn.save_compiled_function(self.path())
with open(self.path(), "rb") as f:
compiled_fn = torch.compiler.load_compiled_function(f)
actual = compiled_fn(*inputs)
self.assertEqual(expected, actual)
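The save/reload round-trip this test exercises boils down to: serialize the compiled artifact, deserialize it, and require identical outputs. A toy sketch with a plain dict standing in for the compiled function (the real API serializes considerably more state):

```python
import io
import pickle

# Hypothetical stand-in for a compiled-function artifact.
artifact = {"inputs_spec": ((3, 4),), "graph": "x + x"}

buf = io.BytesIO()
pickle.dump(artifact, buf)   # save_compiled_function analogue
buf.seek(0)
reloaded = pickle.load(buf)  # load_compiled_function analogue
```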
if __name__ == "__main__":
from torch._dynamo.test_case import run_tests
