Compare commits

135 Commits

Author SHA1 Message Date
94b993d7e4 linting 2025-11-13 17:18:35 -08:00
0c76b784d1 MPS: Fix clamp scalar cache key to store floats in hex representation 2025-11-13 15:58:58 -08:00
0cd0bd7217 address DDE in matmul decomp (#166541)
Address https://github.com/pytorch/pytorch/issues/165081
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166541
Approved by: https://github.com/mlazos
2025-11-13 23:50:00 +00:00
fe33d7cadf Revert "address DDE in matmul decomp (#166541)"
This reverts commit c940b1fbbca8da7e526bf610ce007f8af75f6cd5.

Reverted https://github.com/pytorch/pytorch/pull/166541 on behalf of https://github.com/zou3519 due to broke Inductor CI ([comment](https://github.com/pytorch/pytorch/pull/166541#issuecomment-3530162518))
2025-11-13 23:29:06 +00:00
a9542426d0 [MPS] Add Metal complex mm implementation (#167755)
The MPSGraph implementation returns incorrect results if the matrix inner dimension exceeds 4K
Add regression test

Fixes https://github.com/pytorch/pytorch/issues/167727
Pull Request resolved: https://github.com/pytorch/pytorch/pull/167755
Approved by: https://github.com/manuelcandales
2025-11-13 22:40:59 +00:00
f79cdc89db [CD] [aarch64] unify the build.sh to build for aarch64 wheel (#166044)
related to https://github.com/pytorch/pytorch/issues/163970

Changes:
The following address review feedback from @malfet and @atalman:

1. Simplified the x86 TORCH_CUDA_ARCH_LIST logic to reuse the base list in `.ci/manywheel/build_cuda.sh`.
2. Added function filter_aarch64_archs() that filters the TORCH_CUDA_ARCH_LIST for aarch64 based on the x86 code.
3. Added function in `.ci/pytorch/build.sh` to report error if ACL is not present.
4. Deprecated previous aarch64 scripts (`.ci/aarch64_linux/` folder).

Improvements:

1. Significant improvement in build time for the CUDA ARM wheel build:

Build time dropped from 5.5–6 hours to 1 hour 40–50 minutes.
Taking this 13.0 build as an example: 6h 11m 46s to 1h 50m 1s, roughly 70% faster.
old: https://github.com/pytorch/pytorch/actions/runs/19304934204/job/55209695430
new: https://github.com/pytorch/pytorch/actions/runs/19301014750/job/55195226316
Reason: the MAX_JOBS=5 limit is removed now that we have moved away from the original aarch64 build workflow. It was previously needed because building flash-attn ran out of memory; the new MAX_JOBS is 12.
https://github.com/pytorch/pytorch/pull/166044/files#diff-ccef31095e4f2d203710232531c38bff3251e41cf73ec84ee59f224bb64034aeL280

2. Unified workflow for building x86 and sbsa wheels - more maintainable code
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166044
Approved by: https://github.com/atalman
2025-11-13 22:35:00 +00:00
3d063519bf [inductor][ez] skip cache for unit test via envvar (#167237)
It would be surprising to see the cache get hit in Unit Test when TORCHINDUCTOR_FX_GRAPH_CACHE_DEFAULT is set to 1.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/167237
Approved by: https://github.com/eellison
2025-11-13 22:28:16 +00:00
0b3bdb0d89 [EZ][BE] Remove unnecessary semicolon in Module.cpp (#167756)
`${subj}`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/167756
Approved by: https://github.com/Skylion007
2025-11-13 22:02:08 +00:00
8f00ec31ca [dynamo, nested graph breaks] disallow graph breaks in functorch ops, enable nested graph break tests on test_higher_order_ops.py (#166674)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166674
Approved by: https://github.com/ydwu4
ghstack dependencies: #166673
2025-11-13 21:52:02 +00:00
21f32e4af3 [dynamo] clean up BaseUserFunctionVariable and LocalGeneratorObjectVariable (#166673)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166673
Approved by: https://github.com/Skylion007, https://github.com/guilhermeleobas, https://github.com/mlazos
2025-11-13 21:52:02 +00:00
940979a229 [export, 3.14] handle patching methods with functools.partial correctly in non-strict export (#167396)
Note: dynamo is not affected by this since patching class methods is not supported right now.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/167396
Approved by: https://github.com/angelayi
ghstack dependencies: #167382, #167383, #167384, #167387
2025-11-13 21:47:30 +00:00
4fc688625a [3.14, dataloader] handle forkserver default mp start method in 3.14 (#167387)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/167387
Approved by: https://github.com/malfet
ghstack dependencies: #167382, #167383, #167384
2025-11-13 21:47:30 +00:00
23f4f323ea [dynamo, 3.14] enable dynamo in 3.14 (#167384)
dynamo tests are passing in the CI PR above - so we could probably just enable dynamo right now.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/167384
Approved by: https://github.com/Skylion007, https://github.com/mlazos
ghstack dependencies: #167382, #167383
2025-11-13 21:47:23 +00:00
9ac3fc0d0a [inductor, 3.14] catch pickle.PicklingError exceptions (#167383)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/167383
Approved by: https://github.com/aorenste, https://github.com/mlazos
ghstack dependencies: #167382
2025-11-13 21:47:14 +00:00
38806f381a [inductor, 3.14] fix itertools.product pickle error in test_cpu_repro (#167382)
`inductor/test_cpu_cpp_wrapper` was failing since it was attempting to pickle `itertools.product`, and that is no longer picklable in 3.14. We work around this by eagerly generating a list.
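
A minimal sketch of that workaround, not the actual test code: the parameter grids `dtypes` and `shapes` are hypothetical names chosen for illustration. It materializes the product into a list and round-trips it through `pickle`.

```python
import itertools
import pickle

# Hypothetical parameter grids; the real test uses its own values.
dtypes = ["float32", "bfloat16"]
shapes = [(4, 4), (8, 2)]

# Materialize the Cartesian product eagerly. A plain list of tuples is
# picklable on every Python version, so nothing depends on whether the
# lazy itertools.product iterator itself can be pickled.
configs = list(itertools.product(dtypes, shapes))

# Round-trip through pickle to confirm the list survives serialization.
restored = pickle.loads(pickle.dumps(configs))
```

The cost is that the whole product is held in memory at once, which is usually acceptable for small test parameter grids.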

Pull Request resolved: https://github.com/pytorch/pytorch/pull/167382
Approved by: https://github.com/atalman, https://github.com/malfet, https://github.com/mlazos
2025-11-13 21:47:06 +00:00
cfb3a6b3da [2/N][BugFix][Refactor] fix several instances which use f = open(...) without a corresponding f.close() (#167628)
Continuation of https://github.com/pytorch/pytorch/pull/167423
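
For illustration only, the kind of change this series makes might look like the sketch below. The helper `read_text` and the scratch file are hypothetical and are not taken from the PR.

```python
import os
import tempfile


def read_text(path):
    # The context manager closes the file even if read() raises,
    # unlike a bare `f = open(path)` with no matching `f.close()`.
    with open(path, encoding="utf-8") as f:
        return f.read()


# Usage: write a scratch file, read it back, then clean up.
fd, scratch_path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "w", encoding="utf-8") as f:
    f.write("hello")
content = read_text(scratch_path)
os.remove(scratch_path)
```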

Pull Request resolved: https://github.com/pytorch/pytorch/pull/167628
Approved by: https://github.com/cyyever, https://github.com/Skylion007
2025-11-13 21:15:45 +00:00
d8384e296e [Inductor] Remove bf16 fallback for atomic_add (#167380)
Fixes: #97016

Pull Request resolved: https://github.com/pytorch/pytorch/pull/167380
Approved by: https://github.com/mlazos
2025-11-13 20:41:35 +00:00
d273422582 [CUDA] Large max pool fix (#167427)
Fixes #167253
Pull Request resolved: https://github.com/pytorch/pytorch/pull/167427
Approved by: https://github.com/eqy, https://github.com/malfet
2025-11-13 20:11:41 +00:00
fadb62f592 [PyTorch] fix profiler issue with empty exported trace file (#167601)
Summary:
The previous implementation incorrectly attempted to read from a `NamedTemporaryFile` file pointer after calling `profiler.export_chrome_trace(fp.name)`. The issue is that `export_chrome_trace()` writes to a file at the path `fp.name`, but doesn't write to the file pointer `fp` itself. This meant when the code tried to read from `fp`, it got empty content.

The fix explicitly closes the temporary file first, then calls `export_chrome_trace(fp.name)` which writes the JSON trace to a file at that path. We then open that file separately for reading and copy its contents to the gzipped output file. This ensures we're reading from the actual file that was written to, not an empty file pointer.

Changes made in both `fbcode/caffe2/torch/profiler/profiler.py` and `xplat/caffe2/torch/profiler/profiler.py`:
- `export_chrome_trace()`: Fixed file reading for gzipped chrome trace exports by opening the written file separately
- `export_memory_timeline()`: Fixed file reading for gzipped memory timeline exports by opening the written file separately
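
The close-then-reopen pattern described above can be sketched with the standard library alone. This is an assumption-laden illustration, not the profiler code: `write_trace` is a hypothetical stand-in for `profiler.export_chrome_trace(path)`, and `export_gzipped_trace` is a made-up name.

```python
import gzip
import json
import os
import shutil
import tempfile


def write_trace(path):
    # Hypothetical stand-in for profiler.export_chrome_trace(path):
    # it writes JSON to the given path, not to any open file object.
    with open(path, "w", encoding="utf-8") as f:
        json.dump({"traceEvents": []}, f)


def export_gzipped_trace(out_path):
    # Create the temp file only to reserve a name, then close it so the
    # writer can open the same path independently.
    fp = tempfile.NamedTemporaryFile("w+b", suffix=".json", delete=False)
    fp.close()
    try:
        write_trace(fp.name)
        # Reopen the path that was actually written and copy its bytes
        # into the gzipped output.
        with open(fp.name, "rb") as fin, gzip.open(out_path, "wb") as fout:
            shutil.copyfileobj(fin, fout)
    finally:
        os.remove(fp.name)


out_dir = tempfile.mkdtemp()
gz_path = os.path.join(out_dir, "trace.json.gz")
export_gzipped_trace(gz_path)
with gzip.open(gz_path, "rt", encoding="utf-8") as f:
    trace = json.load(f)
shutil.rmtree(out_dir)
```

Reading from the original `fp` object instead would return whatever was written through that handle, which in the buggy version was nothing.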

Test Plan:
* run benchmark
```
buck2 run fbcode//mode/opt fbcode//torchrec/distributed/benchmark:benchmark_train_pipeline -- \
    --yaml_config=fbcode/torchrec/distributed/benchmark/yaml/sparse_data_dist_base.yml
```
* upload trace
```
DIFF=D86737513 fbcode/torchrec/fb/scripts/trace_to_manifold.sh
```
[manifold folder](https://www.internalfb.com/manifold/explorer/torchrec_benchmark_traces/tree/permanent_traces/DIFF/D86737513)
[trace-sparse_data_dist_base-rank0.json.gz](https://www.internalfb.com/intern/perfdoctor/trace_view?filepath=tree/permanent_traces/DIFF/D86737513/trace-sparse_data_dist_base-rank0.json.gz&bucket=torchrec_benchmark_traces)

Differential Revision: D86737513

Pull Request resolved: https://github.com/pytorch/pytorch/pull/167601
Approved by: https://github.com/angelayi
2025-11-13 19:40:09 +00:00
e5eb89e111 remove allocation of new unbacked symbols during mod eval (#167123)
When executing code like torch._check(numel % newsize == 0, ...), we previously allocated a new unbacked symbol due to #113165. However, this allocation is no longer necessary and can cause issues due to inconsistent behavior when tracing torch._check multiple times.

In particular, the allocation can lead to a memo disaster where the previously allocated symbol is returned instead of a new one, causing unexpected behavior.

This PR removes the unnecessary allocation, ensuring consistent behavior and avoiding potential issues. The change is validated by the following code, which now compiles without issues:
```
import torch

def fn(x):
    i0 = x.nonzero().size(0)
    y = torch.zeros((i0, 192))
    return y.view([12, -1, 192])
with torch._dynamo.config.patch({"capture_dynamic_output_shape_ops": True}):
    torch.compile(fn, fullgraph=True)(torch.ones((12,)))
```

By removing this unnecessary allocation, we simplify the code and avoid potential issues.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/167123
Approved by: https://github.com/Lucaskabela
2025-11-13 18:52:41 +00:00
b5e0e6932a Correctly populate storage offset in DTensor constructor (#167597)
The storage offset always matches the local offset because you never have rank dependent offset (your shard may be different, but your view into it will always be the same across all ranks!)

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/167597
Approved by: https://github.com/malfet
ghstack dependencies: #166868, #166867, #167076
2025-11-13 18:26:11 +00:00
6ea779188c [DebugMode] torch.hash_tensor option (#167486)
Adds `torch.hash_tensor` (#154149) as tensor hashing variant; allows tuple of hashes in log annotations for more info (e.g. `with DebugMode.log_tensor_hashes(hash_fn=["norm", "hash_tensor"]): ...`)

also fixes some corner cases around norm hashing (preserves NaNs/infs, avoids erroring on smaller dtypes)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/167486
Approved by: https://github.com/xmfan
2025-11-13 17:46:09 +00:00
460c7e196c Handle only a Tensor for IntList parsing (#167606)
Fixes https://github.com/pytorch/pytorch/issues/167562

Authored with Claude Code

Signed-off-by: Edward Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/167606
Approved by: https://github.com/colesbury
2025-11-13 17:39:38 +00:00
7aac506cdc Revert "[precompile] Integrate AOTI as a backend. (#167338)"
This reverts commit 273babeec3c6211f30b806797f35a6e9c47c737f.

Reverted https://github.com/pytorch/pytorch/pull/167338 on behalf of https://github.com/jeanschmidt due to seems to be breaking internal tests and builds, see D86919103 ([comment](https://github.com/pytorch/pytorch/pull/167338#issuecomment-3528950888))
2025-11-13 17:39:03 +00:00
374ee9e867 Fix missing thrust includes (#167450)
CCCL recently dropped a ton of transient includes that blew up thrust compile times

That means we need to include what we use

Fixes build issues found in internal CI

Pull Request resolved: https://github.com/pytorch/pytorch/pull/167450
Approved by: https://github.com/Skylion007, https://github.com/Aidyn-A
2025-11-13 17:02:43 +00:00
698aa0f3e5 [MPS] sparse_mask_projection (#166260)
Implements sparse mask projection. I'm aware that SparseMPSTensorMath needs some refactoring, which I'll do in a followup PR

Pull Request resolved: https://github.com/pytorch/pytorch/pull/166260
Approved by: https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2025-11-13 17:01:54 +00:00
eqy
d3ca4a3a4f [CUDA][64-bit indexing] Handle 64-bit outer dim cumsum case (#167326)
For #167086, same change more or less as #143696

Let's see if CI wants a large tensor test decorator

Pull Request resolved: https://github.com/pytorch/pytorch/pull/167326
Approved by: https://github.com/ngimel, https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2025-11-13 17:00:00 +00:00
c940b1fbbc address DDE in matmul decomp (#166541)
Address https://github.com/pytorch/pytorch/issues/165081
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166541
Approved by: https://github.com/mlazos
2025-11-13 16:41:35 +00:00
4de24bcc56 [Fix XPU typo] Fix a comment typo of FindSYCLToolkit.cmake (#165884)
The character U+ff1a "：" could be confused with the ASCII character U+003a ":", which is more common in source code.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165884
Approved by: https://github.com/cyyever, https://github.com/guangyey, https://github.com/EikanWang
2025-11-13 12:32:48 +00:00
f2d0a472ef [xpu][feature] Add XPU support on torch.accelerator.get_memory_info (#162564)
# Motivation
Support XPU for `torch.accelerator.get_memory_info`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162564
Approved by: https://github.com/albanD
ghstack dependencies: #156812
2025-11-13 11:03:17 +00:00
9ae0ecec7d Introduce a new API torch.accelerator.get_memory_info (#156812)
# Motivation
`torch.cuda.mem_get_info` and `torch.xpu.mem_get_info` are widely used in other popular repos, such as
- 076313bd09/python/sglang/srt/utils.py (L378),
- 7ecc2d7f39/src/accelerate/utils/modeling.py (L822),
- 7ba34b1241/vllm/worker/worker.py (L150).
This PR introduces a unified API `torch.accelerator.get_memory_info` to cover this scenario.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156812
Approved by: https://github.com/albanD
2025-11-13 11:01:39 +00:00
ce4f31f662 [OpenReg][Feat][Docs] Enrich hook implementation and add focused documentation (#165980)
## Summary
This PR enriches the implementation of `OpenRegHooks.h` and adds focused documentation for `OpenReg` hooks.

## Key Changes
- A new document: `docs/source/accelerator/hooks.md`
- New `OpenReg` hooks like `isBuilt()`, `isAvailable()` and so on...

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165980
Approved by: https://github.com/fffrog

Co-authored-by: Jiawei Li <ljw1101.vip@gmail.com>
2025-11-13 08:36:18 +00:00
2c846bb614 [xpu][test]port embedding indexing and native_mha test files for Intel GPU (#165886)
This PR ports test_indexing, test_native_mha and test_embedding to Intel GPU.
We enable Intel GPU with the following methods while keeping the original code style as far as possible:

- Use torch.accelerator for general GPU code.
- Skip cases with known issues when running on XPU.
- Use torch.nn.attention.sdpa_kernel() instead of torch.backends.cuda.sdp_kernel() on Intel GPU, as torch.backends.cuda.sdp_kernel() is deprecated and Intel XPU does not support it.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165886
Approved by: https://github.com/guangyey, https://github.com/albanD
2025-11-13 08:17:23 +00:00
8c86ccfbc9 [DebugMode] .show_stack_trace inline (#167589)
Shows inline stack traces, with `.debug_string(show_stack_trace=True)`. For bwd ops we use `.fwd_stack_trace` when available.

Needs some improvement for:
- backwards: not all dispatch calls run under an autograd node, so some just have generic traces (e.g. `loss.backward()`)
- compiled regions: stack trace isn't very meaningful to start (e.g. points to codegened line)

Sample for test_nn_module (fwd + bwd):
```
    # File: /data/users/pianpwk/pytorch/test/distributed/tensor/debug/test_debug_mode.py:396 in forward, code: return self.l2(self.l1(x))
    aten::t(t: f32[4, 4])
    aten::addmm(t: f32[4], t: f32[4, 4], t: f32[4, 4])
    aten::t(t: f32[4, 4])
    aten::addmm(t: f32[4], t: f32[4, 4], t: f32[4, 4])

    # File: /data/users/pianpwk/pytorch/test/distributed/tensor/debug/test_debug_mode.py:405 in forward, code: return self.xyz(self.abc(x))
    aten::t(t: f32[4, 4])
    aten::addmm(t: f32[4], t: f32[4, 4], t: f32[4, 4])

    # File: /data/users/pianpwk/pytorch/test/distributed/tensor/debug/test_debug_mode.py:429 in test_nn_module, code: out = mod(inp).sum()
    aten::sum(t: f32[4, 4])

    # File: /data/users/pianpwk/pytorch/test/distributed/tensor/debug/test_debug_mode.py:430 in test_nn_module, code: out.backward()
    aten::ones_like(t: f32[], pin_memory=False, memory_format=torch.preserve_format)

    # File: /data/users/pianpwk/pytorch/test/distributed/tensor/debug/test_debug_mode.py:429 in test_nn_module, code: out = mod(inp).sum()
    aten::expand(t: f32[], [4, 4])

    # File: /data/users/pianpwk/pytorch/test/distributed/tensor/debug/test_debug_mode.py:405 in forward, code: return self.xyz(self.abc(x))
    aten::t(t: f32[4, 4])
    aten::mm(t: f32[4, 4], t: f32[4, 4])
    aten::t(t: f32[4, 4])
    aten::mm(t: f32[4, 4], t: f32[4, 4])
    aten::t(t: f32[4, 4])
    aten::sum.dim_IntList(t: f32[4, 4], [0], True)
    aten::view(t: f32[1, 4], [4])

    # File: /data/users/pianpwk/pytorch/test/distributed/tensor/debug/test_debug_mode.py:430 in test_nn_module, code: out.backward()
    aten::detach(t: f32[4])

    # File: /data/users/pianpwk/pytorch/test/distributed/tensor/debug/test_debug_mode.py:405 in forward, code: return self.xyz(self.abc(x))
    aten::t(t: f32[4, 4])

    # File: /data/users/pianpwk/pytorch/test/distributed/tensor/debug/test_debug_mode.py:430 in test_nn_module, code: out.backward()
    aten::detach(t: f32[4, 4])

    # File: /data/users/pianpwk/pytorch/test/distributed/tensor/debug/test_debug_mode.py:396 in forward, code: return self.l2(self.l1(x))
    aten::t(t: f32[4, 4])
    aten::mm(t: f32[4, 4], t: f32[4, 4])
    aten::t(t: f32[4, 4])
    aten::mm(t: f32[4, 4], t: f32[4, 4])
    aten::t(t: f32[4, 4])
    aten::sum.dim_IntList(t: f32[4, 4], [0], True)
    aten::view(t: f32[1, 4], [4])

    # File: /data/users/pianpwk/pytorch/test/distributed/tensor/debug/test_debug_mode.py:430 in test_nn_module, code: out.backward()
    aten::detach(t: f32[4])

    # File: /data/users/pianpwk/pytorch/test/distributed/tensor/debug/test_debug_mode.py:396 in forward, code: return self.l2(self.l1(x))
    aten::t(t: f32[4, 4])

    # File: /data/users/pianpwk/pytorch/test/distributed/tensor/debug/test_debug_mode.py:430 in test_nn_module, code: out.backward()
    aten::detach(t: f32[4, 4])

    # File: /data/users/pianpwk/pytorch/test/distributed/tensor/debug/test_debug_mode.py:396 in forward, code: return self.l2(self.l1(x))
    aten::t(t: f32[4, 4])
    aten::mm(t: f32[4, 4], t: f32[4, 4])
    aten::t(t: f32[4, 4])
    aten::sum.dim_IntList(t: f32[4, 4], [0], True)
    aten::view(t: f32[1, 4], [4])

    # File: /data/users/pianpwk/pytorch/test/distributed/tensor/debug/test_debug_mode.py:430 in test_nn_module, code: out.backward()
    aten::detach(t: f32[4])

    # File: /data/users/pianpwk/pytorch/test/distributed/tensor/debug/test_debug_mode.py:396 in forward, code: return self.l2(self.l1(x))
    aten::t(t: f32[4, 4])

    # File: /data/users/pianpwk/pytorch/test/distributed/tensor/debug/test_debug_mode.py:430 in test_nn_module, code: out.backward()
    aten::detach(t: f32[4, 4])
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/167589
Approved by: https://github.com/yushangdi
2025-11-13 08:15:27 +00:00
8f96e7bc1d Only remove_noop in pre_grad passes if remove_noop is not in the remove_passes_list (#167479)
Summary: Only remove_noop in pre_grad passes if remove_noop is not in the remove_passes_list

Test Plan:
Tested as part of lowering for ss_omni_exp model.

f825774360

Unit Tests were run and succeeded as well!

Differential Revision: D86694854

Pull Request resolved: https://github.com/pytorch/pytorch/pull/167479
Approved by: https://github.com/mlazos
2025-11-13 07:27:31 +00:00
782fc3c72b [DTensor] Add CPU instruction count benchmark for dispatch (#167394)
Following example from #149932 and doc in
[README.md](benchmarks/dynamo/pr_time_benchmarks/README.md)

cd benchmarks/dynamo/pr_time_benchmarks
`PYTHONPATH=./:../../../ python benchmarks/dtensor.py a`

Currently outputs:

```
collecting instruction count for dtensor_dispatch_detach
instruction count for iteration 0 is 14919468
instruction count for iteration 1 is 136283
instruction count for iteration 2 is 133750
instruction count for iteration 3 is 133757
instruction count for iteration 4 is 133751
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/167394
Approved by: https://github.com/laithsakka
2025-11-13 06:54:08 +00:00
1a67403fc6 Move MemPool out of c10 and into ATen. (#167506)
Necessary to allow CachingHostAllocator, which sits in ATen, to
allocate its memory to a memory pool.

Otherwise, we would have a circular dependency, where libtorch_cuda.so
depends upon libc10_cuda.so, but libc10_cuda.so's MemPool object
references CachingHostAllocator symbols in libtorch_cuda.so.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/167506
Approved by: https://github.com/ngimel, https://github.com/malfet
2025-11-13 06:18:29 +00:00
3d801a4c01 DTensor fast path: port return_and_correct_aliasing and inplace/out checks (#167475)
This seems to generate a several-microsecond performance improvement in the detach benchmark I've been using.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/167475
Approved by: https://github.com/ezyang
ghstack dependencies: #167051, #166372, #166808
2025-11-13 06:11:38 +00:00
2034ca99ae extend C++ DTensor fast path to local operator dispatch (#166808)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166808
Approved by: https://github.com/ezyang
ghstack dependencies: #167051, #166372
2025-11-13 06:11:38 +00:00
480b4ff882 Avoid creating Python OpSchema in the DTensor dispatch fast path (#166372)
All we need to do is move a few checks around.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/166372
Approved by: https://github.com/ezyang
ghstack dependencies: #167051
2025-11-13 06:11:30 +00:00
f570e589da Add C++ fast path for DTensor.__torch_dispatch__ (#167051)
This patches the `__torch_dispatch__` machinery to detect DTensor and hand over control to a C++ fast path. Unlike #166370 and #166369 (which added a DTensor dispatch key and are intended to be replaced by this PR), this approach fundamentally *is* `__torch_dispatch__`, hopefully sidestepping all manner of thorny "does it work just like `__torch_dispatch__`?" that came up during development and review of #166370.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/167051
Approved by: https://github.com/ezyang
2025-11-13 06:11:22 +00:00
f9851af59b Add Attention ops to CI (#165915)
This pull request introduces a new attention operator microbenchmark workflow to the CI system, enabling automated benchmarking and reporting for attention-related operations. The main changes include adding a new GitHub Actions workflow, to add attention benchmarks to the existing Pytorch operator microbenchmark [dashboard](https://hud.pytorch.org/benchmark/v3/dashboard/pytorch_operator_microbenchmark?renderGroupId=main&time.start=2025-10-27T00%3A00%3A00.000Z&time.end=2025-10-29T01%3A00%3A00.000Z&filters.device=cuda&filters.arch=NVIDIA+A100-SXM4-40GB&filters.deviceName=cuda%7C%7CNVIDIA+A100-SXM4-40GB&filters.operatorName=&lcommit.commit=665df0bc7288996d638fcc3da750f8cb2addd6d0&lcommit.workflow_id=18888994873&lcommit.date=2025-10-29T00%3A00%3A00Z&lcommit.branch=refs%2Ftags%2Fciflow%2Fop-benchmark%2F165915&rcommit.commit=665df0bc7288996d638fcc3da750f8cb2addd6d0&rcommit.workflow_id=18888994873&rcommit.date=2025-10-29T00%3A00%3A00Z&rcommit.branch=refs%2Ftags%2Fciflow%2Fop-benchmark%2F165915&lbranch=refs%2Ftags%2Fciflow%2Fop-benchmark%2F165915&rbranch=refs%2Ftags%2Fciflow%2Fop-benchmark%2F165915)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165915
Approved by: https://github.com/jbschlosser
2025-11-13 05:30:04 +00:00
eeebf9f664 [dynamo] [3.14] Update broken numpy test (#167681)
This is related to upgrading numpy versions, not 3.14 specifically.  See https://github.com/numpy/numpy/pull/27148
Pull Request resolved: https://github.com/pytorch/pytorch/pull/167681
Approved by: https://github.com/williamwen42
ghstack dependencies: #167619
2025-11-13 04:27:55 +00:00
d9a50bf9a8 [dynamo] [3.14] Support np._CopyMode (#167619)
Upgrading scipy to 1.16 introduced errors related to the `copy` parameter of
`np.array`.  Add special handling for `np._CopyMode.IF_NEEDED`, which is not
handled correctly, but matches the existing behavior when `copy=None`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/167619
Approved by: https://github.com/williamwen42
2025-11-13 04:27:55 +00:00
2984331c87 [inductor][NFC][2/X] extract do_autotuning/autotune/benchmark from AlgorithmSelectorCache.__call__ (#167489)
Summary: see https://github.com/pytorch/pytorch/pull/167487 for context

Test Plan: CI

Differential Revision: D86714833

Pull Request resolved: https://github.com/pytorch/pytorch/pull/167489
Approved by: https://github.com/aorenste
2025-11-13 03:29:39 +00:00
9b68682df2 [ROCm] Enable several DISABLED issues (#167183)
Profiler:
Fixes #166422

Default:
Fixes #165386
Fixes #145019
Fixes #145069
Fixes #165295
Fixes #165294
Fixes #165093
Fixes #164235
Fixes #164194
Fixes #164193
Fixes #155217
Fixes #163918
Fixes #163917
Fixes #155235
Fixes #122352
Fixes #121576
Fixes #121806
Fixes #104366

Inductor:
Fixes #164337
Fixes #148523
Fixes #115002
Fixes #111066
Fixes #107774

Distributed
Fixes #161612
Fixes #161502
Fixes #161459
Fixes #161402
Fixes #155711
Fixes #152201
Fixes #152367
Fixes #152349
Fixes #152168
Fixes #152169
Fixes #151153
Fixes #151077
Fixes #112815

Pull Request resolved: https://github.com/pytorch/pytorch/pull/167183
Approved by: https://github.com/jeffdaily
2025-11-13 02:50:35 +00:00
8f5f89c9a0 Revert "Fix thread safety in getCurrentCUDABlasHandle and getCUDABlasLtWorkspace (#167248)"
This reverts commit 537167aa1e50a4379dca244163aaf369ed8e5161.

Reverted https://github.com/pytorch/pytorch/pull/167248 on behalf of https://github.com/pytorch-auto-revert due to Reverted automatically by pytorch's autorevert, to avoid this behaviour add the tag autorevert: disable ([comment](https://github.com/pytorch/pytorch/pull/167248#issuecomment-3524925727))
2025-11-13 02:46:35 +00:00
8919f69362 [Inductor][2/2] Decouple flags for optimization and debug symbols (#167575)
Summary:
What: Decouple flags for compile (unoptimized build) and symbols (optimized build)
Why: Reduce confusion around naming and usage

Test Plan: Unit test & CI

Differential Revision: D86683526

Pull Request resolved: https://github.com/pytorch/pytorch/pull/167575
Approved by: https://github.com/jansel, https://github.com/hl475
2025-11-13 00:59:15 +00:00
19c867873a [opaque obj] Add attribute support (#167230)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/167230
Approved by: https://github.com/zou3519
ghstack dependencies: #163284, #163714, #163936
2025-11-13 00:35:20 +00:00
e3dadb1d36 [opaque obj] torch.compile support (#163936)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163936
Approved by: https://github.com/zou3519
ghstack dependencies: #163284, #163714
2025-11-13 00:35:20 +00:00
c9b09a31e8 [opaque obj] Allow non-effectful scriptobjs (#163714)
Fixes functionalization so that we can run ops using ScriptObjects w/o needing effects. Previously we would run into an error when running functionalization on the TorchBindOpOverloads.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163714
Approved by: https://github.com/zou3519
ghstack dependencies: #163284
2025-11-13 00:35:20 +00:00
35571fe94b [effects] Add register_effectful_op (#163284)
Refactored register_effectful_op to return a handler to match how fake kernels are registered. This makes it easier to deregister effects

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163284
Approved by: https://github.com/zou3519
2025-11-13 00:35:20 +00:00
485f2b607a ProxyTorchDispatchMode: Decomposing missing sympy.SymExpr should handle constant literals (#167585)
The previous work to decompose missing sympy.SymExpr (#164717) handled combinations of sub-nodes (like `s1*s2`) but I forgot to handle explicit literals (like `2*s2`).

Added a unit test based on the report.

Fixes T244632748

Pull Request resolved: https://github.com/pytorch/pytorch/pull/167585
Approved by: https://github.com/bobrenjc93
2025-11-13 00:27:10 +00:00
0c5d5c7e9a [dynamo][invoke_subgraph] Do not restore side effects on invoke_subgraph (#167446)
Add a test that checks non-proxyable outputs. Also add a currently failing test,
to be fixed later.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/167446
Approved by: https://github.com/zou3519
ghstack dependencies: #167438, #167442
2025-11-13 00:16:40 +00:00
5f98a0363a [dynamo] Make HintsWrapperHigherOrderVariable follow wrap semantics (#167442)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/167442
Approved by: https://github.com/zou3519
ghstack dependencies: #167438
2025-11-13 00:16:40 +00:00
2d739001d3 [dynamo] speculate_subgraph_with_auto_output_flattening (#167438)
Summary

  This PR refactors the wrap higher-order operator infrastructure in PyTorch's Dynamo to introduce automatic output flattening for subgraph speculation. The key change is the addition of
  speculate_subgraph_with_auto_output_flattening() which separates the output variable trackers (VTs) that Dynamo continues tracing with from the actual FX graph outputs.

  Key Changes

  New speculate_subgraph_with_auto_output_flattening() function

  - Introduces a new approach for handling HOPs (Higher-Order Operators) that are just "subgraph placeholders", i.e. the HOP essentially just runs the subgraph with inputs (e.g., invoke_subgraph, activation checkpointing, autograd.Function)
  - Disentangles output VTs from graph outputs: Allows the subgraph to return complex Python objects (like custom user-defined objects containing tensors) while only registering tensor/symint VTs as actual FX graph outputs
  - Mirrors typical Dynamo processing where VTs can "run ahead" for continued tracing while the graph is a side data structure

  Benefits

  1. Handles non-proxyable outputs: Supports HOPs that return custom Python objects containing tensors
  2. Cleaner separation of concerns: Output VTs for continued tracing vs. graph outputs for FX representation
  3. More flexible: Returns graph_output_vts instead of treespec, giving more control over what becomes a graph output
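
As a rough illustration of separating "what the caller keeps tracing with" from "what becomes a graph output", the toy below walks a nested return value and collects only its tensor leaves. It is not Dynamo's implementation: `FakeTensor`, `ModelOutput` and `collect_graph_outputs` are hypothetical names, and no variable trackers or FX graphs are involved.

```python
from dataclasses import dataclass, fields, is_dataclass


class FakeTensor:
    """Toy stand-in for a tensor; not a real torch.Tensor."""

    def __init__(self, name):
        self.name = name


@dataclass
class ModelOutput:
    # Hypothetical user-defined object returned from a subgraph body.
    logits: FakeTensor
    label: str
    extras: list


def collect_graph_outputs(obj, out=None):
    """Walk a nested Python object and gather tensor leaves in order."""
    if out is None:
        out = []
    if isinstance(obj, FakeTensor):
        out.append(obj)
    elif is_dataclass(obj) and not isinstance(obj, type):
        for f in fields(obj):
            collect_graph_outputs(getattr(obj, f.name), out)
    elif isinstance(obj, (list, tuple)):
        for item in obj:
            collect_graph_outputs(item, out)
    elif isinstance(obj, dict):
        for value in obj.values():
            collect_graph_outputs(value, out)
    # Non-tensor leaves (str, int, ...) never become graph outputs.
    return out


# The full object stays available to the caller, while only its tensor
# leaves are listed as graph outputs.
result = ModelOutput(FakeTensor("a"), "cat", [FakeTensor("b"), 3])
graph_outputs = collect_graph_outputs(result)
graph_output_names = [t.name for t in graph_outputs]
```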

Pull Request resolved: https://github.com/pytorch/pytorch/pull/167438
Approved by: https://github.com/zou3519
2025-11-13 00:16:40 +00:00
273babeec3 [precompile] Integrate AOTI as a backend. (#167338)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/167338
Approved by: https://github.com/jamesjwu
2025-11-13 00:02:26 +00:00
a76dd6b7c6 [MPS] SparseMps mv op (#166708)
Should be merged after #166561
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166708
Approved by: https://github.com/malfet
2025-11-12 22:44:29 +00:00
2fa18d1545 [export] Codemod more tests to use dynamo_graph_capture_for_export (#167663)
Summary:
as title.

Test Plan:
CI

Pull Request resolved: https://github.com/pytorch/pytorch/pull/167663
Approved by: https://github.com/tugsbayasgalan
2025-11-12 22:44:18 +00:00
537167aa1e Fix thread safety in getCurrentCUDABlasHandle and getCUDABlasLtWorkspace (#167248)
Summary:
getCurrentCUDABlasHandle() and getCUDABlasLtWorkspace() use static mutable maps that are not protected from concurrent read-and-write. This leads to crashes.

This diff adds mutexes to synchronize access to the static maps.
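
The same idea, guarding a shared map with a mutex so lookup and insertion happen atomically, can be sketched in Python. This is only an analogy for the C++ change: `HandlePool` and the string "handles" are hypothetical and have nothing to do with real cuBLAS handles.

```python
import threading


class HandlePool:
    """Toy per-device handle cache; not the actual cuBLAS handle pool."""

    def __init__(self):
        self._handles = {}
        self._lock = threading.Lock()
        self.created = 0

    def get(self, device):
        # Hold the lock across lookup and insertion so concurrent callers
        # never observe or mutate the map mid-update.
        with self._lock:
            handle = self._handles.get(device)
            if handle is None:
                self.created += 1
                handle = f"handle-{device}"
                self._handles[device] = handle
            return handle


pool = HandlePool()
results = []


def worker(device):
    results.append(pool.get(device))


# Sixteen threads contend for handles on two hypothetical devices.
threads = [threading.Thread(target=worker, args=(i % 2,)) for i in range(16)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

With the lock in place, each device's handle is created exactly once regardless of how the threads interleave.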

Test Plan:
Use a GPU OD, run multi-threaded tests with TSAN:
```
buck test fbcode//mode/dev-tsan fbcode//caffe2:cuda_cublas_handle_pool_test  -- --stress-runs 100
```
https://www.internalfb.com/intern/testinfra/testrun/14355223937501118

TSAN: P2026731804

Differential Revision: D86316117

Pull Request resolved: https://github.com/pytorch/pytorch/pull/167248
Approved by: https://github.com/Skylion007, https://github.com/malfet
2025-11-12 22:43:56 +00:00
0dac408f43 MatMal - fix folding logic (#166891)
Summary:
Matmul can be decomposed to BMM or to folding + MM.

In the common training path for a 3D * 2D matmul, the library always folds: both Tensor1 and Tensor2 require grad, and we fold because Tensor2 has grad. But that reasoning isn't really sound. Folding was done as a memory optimization, when it is also generally the same or more performant.

However, in Chemistry / Molecular Modeling it's common to directly calculate Forces as the derivative of Energy (i.e. dl/dX, but NOT dl/dW) in inference. This exposed a bug where only 1 of the 2 Tensors requires grad and we may choose NOT to fold, resulting in a 30% regression due to the suboptimal BMM decomposition of torch.nn.Linear (which calls into matmul).

I actually think that even in cases where we need either dl/dX or dl/dW, we should fold when working with inputs of [B, M, N] and weights of [N, K]. It's strictly better for memory and the same or faster when you consider both forward + backward runtime, and M's that are not multiples of 8 are particularly brutally slow using BMM vs MM.
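
To make the two decompositions concrete, here is a small pure-Python sketch on nested lists (not the ATen implementation; `mm`, `bmm_path`, and `folded_path` are names made up for illustration). It only shows that folding [B, M, N] into [B*M, N] and doing a single MM gives the same values as one product per batch entry:

```python
def mm(a, b):
    """Plain 2D matrix multiply on nested lists: [M, N] @ [N, K] -> [M, K]."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def bmm_path(x, w):
    """Unfolded path: one [M, N] @ [N, K] product per batch entry."""
    return [mm(batch, w) for batch in x]


def folded_path(x, w):
    """Folded path: view [B, M, N] as [B*M, N], do one MM, then unfold."""
    m = len(x[0])
    flat = [row for batch in x for row in batch]  # [B*M, N]
    out = mm(flat, w)  # [B*M, K]
    return [out[i:i + m] for i in range(0, len(out), m)]  # [B, M, K]


x = [[[1, 2], [3, 4], [5, 6]], [[7, 8], [9, 10], [11, 12]]]  # B=2, M=3, N=2
w = [[1, 0, 2], [0, 1, 3]]  # N=2, K=3
assert folded_path(x, w) == bmm_path(x, w)
```

The sketch says nothing about performance; the memory and runtime claims above come from the author's measurements, not from this example.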

Also, the compiler could not solve this issue out of the box, which raises another concern (this was actually highlighted 2 years ago in comments, but still seems to be the case today: https://github.com/pytorch/pytorch/issues/118548#issuecomment-1919528910)

Differential Revision: D86128493

Pull Request resolved: https://github.com/pytorch/pytorch/pull/166891
Approved by: https://github.com/ngimel
2025-11-12 22:18:03 +00:00
158e72427b [torch] Update caffe2/c10/cuda to build under CUDA 13 (#167534)
Summary:
Update caffe2/c10/cuda to build under CUDA 13

As of CUDA 13, the cudaMemAdvise() has been updated to take in `cudaMemLocation` as argument instead of `int` device id

This is needed for building FBGEMM_GPU under CUDA 13 (see D86372925)

Test Plan:
```
# Default build
buck build  @//mode/opt fbcode//caffe2/c10/cuda:cuda

# CUDA 13 build
buck build  @//mode/opt -c fbcode.arch=aarch64 -c fbcode.nvcc_arch=b200 -c fbcode.platform010_cuda_version=13.0  fbcode//caffe2/c10/cuda:cuda

# AMD build
buck build --flagfile fbcode//mode/dev-nosan-amd-gpu fbcode//caffe2/c10/cuda:cuda
```

Reviewed By: atalman

Differential Revision: D86578286

Pull Request resolved: https://github.com/pytorch/pytorch/pull/167534
Approved by: https://github.com/seemethere
2025-11-12 22:12:40 +00:00
0184ef291d [inductor][NFC][1/X] extract create_no_valid_choices from AlgorithmSelectorCache.__call__ (#167487)
Summary:
What: moves `create_no_valid_choices` out of `AlgorithmSelectorCache.__call__` and into the body of `AlgorithmSelectorCache`
Why: nested function definitions make it harder to understand what `AlgorithmSelectorCache.__call__` is doing, on top of making patching/testing/etc more difficult
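
A minimal sketch of the refactor pattern, using a hypothetical `Selector` class rather than the real `AlgorithmSelectorCache` (the signature and error message below are made up for illustration):

```python
class Selector:
    """Hypothetical stand-in for a class with a large __call__."""

    def create_no_valid_choices(self, name: str, reason: str) -> RuntimeError:
        # Previously this would be a closure inside __call__; as a method it
        # can be overridden or patched without touching __call__ itself.
        return RuntimeError(f"No valid choices for {name}: {reason}")

    def __call__(self, name: str, choices: list[str]) -> str:
        if not choices:
            raise self.create_no_valid_choices(name, "empty choice list")
        return choices[0]
```

With the helper on the class, a test can swap it out via `unittest.mock.patch.object(Selector, "create_no_valid_choices", ...)`, which is not possible for a function defined inside `__call__`.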

Test Plan: CI

Differential Revision: D86712921

Pull Request resolved: https://github.com/pytorch/pytorch/pull/167487
Approved by: https://github.com/aorenste
2025-11-12 22:03:37 +00:00
2ca428c721 [CD] Preload libnvrtc-builtinso.so (#167614)
Fixes a regression introduced by https://github.com/pytorch/pytorch/pull/167046
that causes CuDNN SDPA to fail with an actionable `cuDNN Frontend error: [cudnn_frontend] Error: No valid execution plans built.` error

Change `cuda_libs` from dict to list, and add `test_sdpa` regression test to binary smoke tests

Fixes https://github.com/pytorch/pytorch/issues/167602
Pull Request resolved: https://github.com/pytorch/pytorch/pull/167614
Approved by: https://github.com/Aidyn-A, https://github.com/atalman, https://github.com/nWEIdia
2025-11-12 21:50:13 +00:00
1311385f9d Revert "fix failure of exporting compiled model with nested dynamic shapes (#166358)"
This reverts commit 416421c7c455e3befb0772fcc3379661a24aff71.

Reverted https://github.com/pytorch/pytorch/pull/166358 on behalf of https://github.com/jeanschmidt due to seems to be breaking internal signals, see D86790405, @angelayi may you help the author get this change landed? ([comment](https://github.com/pytorch/pytorch/pull/166358#issuecomment-3524052822))
2025-11-12 21:46:38 +00:00
5f0a5b8f87 Revert "Use stable topological sort in fuse_by_partitions (#167397)"
This reverts commit 7886070fc5cdbc9b51b7e2b6432c80ccae01c4fc.

Reverted https://github.com/pytorch/pytorch/pull/167397 on behalf of https://github.com/jeanschmidt due to seems to be breaking executorch signals internally, see D86780724 ([comment](https://github.com/pytorch/pytorch/pull/167397#issuecomment-3523992343))
2025-11-12 21:26:57 +00:00
74e85c6944 Add TORCH_BOX helper for STABLE_TORCH_LIBRARY_IMPL (#167582)
Implementation greatly adapted from @lw's https://github.com/pytorch/pytorch/pull/163505. TORCH_BOX is the StableIValue version of `make_boxed_from_unboxed_functor`.

the differences:
- uses headeronly concepts
- adds an unbox type mapping to support user kernels taking in torch::headeronly::HeaderOnlyArrayRef<T> (by calling to<std::vector<T>> in those cases)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/167582
Approved by: https://github.com/swolchok
ghstack dependencies: #167386
2025-11-12 20:29:21 +00:00
a6a0379b9c [caffe2] Address -Wswitch-default warnings in headers (#167563)
Summary: Improve compatibility with projects that have -Wswitch-default errors/warnings enabled by suppressing those errors/warnings in caffe2 headers.

Test Plan: CI Pass

Differential Revision: D86785451

Pull Request resolved: https://github.com/pytorch/pytorch/pull/167563
Approved by: https://github.com/shoumikhin
2025-11-12 19:53:43 +00:00
a95eee68d9 [user-streams] Add API for accessing current stream given a device (#167620)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/167620
Approved by: https://github.com/xuanzhang816
2025-11-12 19:32:07 +00:00
2ad70c9446 [CI] manually gen json for segfaults (#167250)
segfaults don't gen xml, so we get no info about them in clickhouse or in the xml or in the json, so this manually generates something and uploads it to s3 to be ingested

at some point some of the existing code for test reports should be changed to just use the json that gets uploaded in the job or something
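
A rough sketch of the idea, not the actual CI code: the function name and the JSON field names below are hypothetical, and the upload step is left out. It relies only on the POSIX convention that `subprocess` reports a process killed by signal N as return code -N:

```python
import json
import signal


def segfault_record(test_file: str, returncode: int):
    """Return a JSON-serializable record if the process died from SIGSEGV, else None.

    The field names are made up for illustration; the real schema may differ.
    """
    # A child killed by SIGSEGV shows up as a negative return code.
    if returncode != -signal.SIGSEGV:
        return None
    return {"file": test_file, "failure": "segfault", "returncode": returncode}


record = segfault_record("test_example.py", -signal.SIGSEGV)
print(json.dumps(record))
```
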
Pull Request resolved: https://github.com/pytorch/pytorch/pull/167250
Approved by: https://github.com/huydhn
2025-11-12 19:16:12 +00:00
bc09a84150 Hide all symbols (except stable/headeronly/shim) if TORCH_STABLE_ONLY is defined (#167496)
Fixes https://github.com/pytorch/pytorch/issues/161660

This extends the `TORCH_STABLE_ONLY` stopgap added in https://github.com/pytorch/pytorch/pull/161658

Pull Request resolved: https://github.com/pytorch/pytorch/pull/167496
Approved by: https://github.com/janeyx99
ghstack dependencies: #167495
2025-11-12 19:15:52 +00:00
760c901c9a [torch] Update caffe2/torch/csrc to build under CUDA 13 (#167401)
Summary:
Update caffe2/torch/csrc to build under CUDA 13.

As of CUDA 13, CCCL v3 is the default, and as such, nvToolsExt.h has been moved to  nvtx3/nvtx3.hpp.

This is needed for building FBGEMM_GPU under CUDA 13 (see D86372925)

Test Plan:
```
# Default build
buck build --flagfile fbcode//mode/dev-nosan fbcode//caffe2:_C_impl
buck build --flagfile fbcode//mode/dev-nosan fbcode//caffe2:_C_impl_cuda

# CUDA 13 build
buck build  @//mode/opt -c fbcode.arch=aarch64 -c fbcode.nvcc_arch=b200 -c fbcode.platform010_cuda_version=13.0  fbcode//caffe2:_C_impl
buck build  @//mode/opt -c fbcode.arch=aarch64 -c fbcode.nvcc_arch=b200 -c fbcode.platform010_cuda_version=13.0  fbcode//caffe2:_C_impl_cuda
```

Differential Revision: D86517946

Pull Request resolved: https://github.com/pytorch/pytorch/pull/167401
Approved by: https://github.com/Skylion007
2025-11-12 18:54:45 +00:00
d105e3a198 [dynamo][DebugMode] mask python keys in dispatch_key_set guard checks (#164992)
I found that running any compiled function under DebugMode more than once will trigger recompilations, e.g. with the really simple modified test case in `test_compile`:
```
[0/1] [__recompiles] Recompiling function f in /data/users/pianpwk/ptclone/pytorch/test/distributed/tensor/debug/test_debug_mode.py:268
[0/1] [__recompiles]     triggered by the following guard failure(s):
[0/1] [__recompiles]     - 0/0:
[0/2] [__recompiles] Recompiling function f in /data/users/pianpwk/ptclone/pytorch/test/distributed/tensor/debug/test_debug_mode.py:268
[0/2] [__recompiles]     triggered by the following guard failure(s):
[0/2] [__recompiles]     - 0/1:
[0/2] [__recompiles]     - 0/0:
```

Digging deeper, the guard failures were due to TENSOR_MATCH guards failing on dispatch key set checks (seemingly on the Python dispatch key):
5a1fbf45ad/torch/csrc/dynamo/guards.cpp (L199-L203)

This seems to be due to the `ignore_compile_internals=True` flag on custom dispatch modes being on, which causes these modes to "hide" themselves during compilation, making dynamo guard on the Python dispatch key being off.

The (maybe imperfect) solution is to mask out the Python keys for guard comparisons, when `_is_in_any_mode_without_ignore_compile_internals` is False.
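
As a toy illustration of what "masking out" means here (the bit positions are hypothetical and `key_sets_match` is not a real function; the actual check lives in the C++ guard code):

```python
# Hypothetical bit positions; the real DispatchKeySet layout is different.
PYTHON = 1 << 0
PYTHON_TLS_SNAPSHOT = 1 << 1
CPU = 1 << 2
AUTOGRAD_CPU = 1 << 3

PYTHON_KEYS = PYTHON | PYTHON_TLS_SNAPSHOT


def key_sets_match(guarded: int, actual: int, mask_python_keys: bool) -> bool:
    """Compare two key sets, optionally ignoring the Python-related bits."""
    if mask_python_keys:
        guarded &= ~PYTHON_KEYS
        actual &= ~PYTHON_KEYS
    return guarded == actual


compiled = CPU | AUTOGRAD_CPU  # key set seen while the mode was hidden
runtime = compiled | PYTHON    # same tensor, now under an active mode
assert not key_sets_match(compiled, runtime, mask_python_keys=False)
assert key_sets_match(compiled, runtime, mask_python_keys=True)
```

Without masking, the extra Python bit alone fails the comparison, which is the kind of spurious guard failure described above.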

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164992
Approved by: https://github.com/williamwen42
2025-11-12 18:15:07 +00:00
ed79693706 [ROCm][CI] dynamo benchmark repvgg_a2 is flaky (#167660)
Update dynamo results due to flaky model
https://github.com/pytorch/pytorch/actions/runs/19283051320/job/55139788014

Pull Request resolved: https://github.com/pytorch/pytorch/pull/167660
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-11-12 17:41:19 +00:00
10a1578408 Revert "Update Kineto Submodule (#167343)"
This reverts commit c7007e758478fcac4ed9bb0479d73d6e397e8b8a.

Reverted https://github.com/pytorch/pytorch/pull/167343 on behalf of https://github.com/jeffdaily due to causing ROCm distributed jobs to time out ([comment](https://github.com/pytorch/pytorch/pull/167343#issuecomment-3523053342))
2025-11-12 17:23:28 +00:00
bdb37536be Add Tests (#167392)
Need to wait for:
https://github.com/Dao-AILab/flash-attention/pull/1998 to land

Pull Request resolved: https://github.com/pytorch/pytorch/pull/167392
Approved by: https://github.com/jbschlosser
ghstack dependencies: #167348
2025-11-12 17:13:36 +00:00
dd7a45abc0 [5/N] Use Python 3.10 typing (#167449)
This PR applies new Union and Optional typing syntax to some files.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/167449
Approved by: https://github.com/albanD
2025-11-12 17:05:31 +00:00
7557e38e32 [ROCm] hipSPARSELt support - Update cuda_to_hip_mappings.py (#167335)
Modified cuda_to_hip_mappings.py to map cuSPARSELt headers and types to their hipSPARSELt counterparts, improving compatibility and functionality for ROCm users.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/167335
Approved by: https://github.com/jeffdaily, https://github.com/Skylion007
2025-11-12 17:04:43 +00:00
c5d91d9e3e Revert "Introduce a new API torch.accelerator.get_memory_info (#156812)"
This reverts commit abf31db2cc039ee299337bad6f7f11577c877481.

Reverted https://github.com/pytorch/pytorch/pull/156812 on behalf of https://github.com/jeanschmidt due to seems to be breaking 1000s of internal build rules, see D86638790 ([comment](https://github.com/pytorch/pytorch/pull/156812#issuecomment-3522729156))
2025-11-12 16:15:06 +00:00
a32832682c Revert "[xpu][feature] Add XPU support on torch.accelerator.get_memory_info (#162564)"
This reverts commit 3cfbf98ea9d937d23f3700168b22706c957308ce.

Reverted https://github.com/pytorch/pytorch/pull/162564 on behalf of https://github.com/jeanschmidt due to seems to be breaking 1000s of internal build rules, see D86638790 ([comment](https://github.com/pytorch/pytorch/pull/156812#issuecomment-3522729156))
2025-11-12 16:15:06 +00:00
4f6aae35fd Revert "[MPS] SparseMps mv op (#166708)"
This reverts commit 406719c3daf84b4ecec98134ef3ad6ca953b86c4.

Reverted https://github.com/pytorch/pytorch/pull/166708 on behalf of https://github.com/jeanschmidt due to breaks internal signals, see D86606212 ([comment](https://github.com/pytorch/pytorch/pull/166708#issuecomment-3522720186))
2025-11-12 16:12:11 +00:00
4cff8b5e07 Add option to disable applying side effects in dynamo (#167239)
There are two motivating use cases for this change:
1) export (when we trace pytree calls into a graph, we don't want to accidentally trace the side effect bytecode which will pollute the initial state) -> We want to warn about side effects and don't want to actually apply them
2) VLLM -> They want to detect side effects and error out.

We implement this with two configs: one controls whether we want to apply side effects (by default yes), and the other sets the warning level for side effects (warning for export and error for VLLM). We intentionally ignore input side effects, because they are captured in the graph and export would never trace the actual dynamo graph module when tracing the pytree calls.
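
A sketch of how two such knobs could interact. Everything here is hypothetical (`SideEffectPolicy`, its field names, and `handle_side_effects` are not the real dynamo config or code); it only mirrors the behavior described above:

```python
from dataclasses import dataclass
import warnings


@dataclass
class SideEffectPolicy:
    # Both field names are hypothetical; the real config names are in the PR.
    apply: bool = True     # replay side effects after tracing (default behavior)
    level: str = "silent"  # "silent", "warn", or "error"


def handle_side_effects(effects: list[str], policy: SideEffectPolicy) -> list[str]:
    """Return the side effects that should actually be applied under `policy`."""
    if effects and policy.level == "error":
        raise RuntimeError(f"side effects detected: {effects}")
    if effects and policy.level == "warn":
        warnings.warn(f"side effects detected: {effects}")
    return list(effects) if policy.apply else []
```

Under this sketch, the export use case would correspond to `SideEffectPolicy(apply=False, level="warn")` and the VLLM use case to `level="error"`.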

Pull Request resolved: https://github.com/pytorch/pytorch/pull/167239
Approved by: https://github.com/williamwen42, https://github.com/anijain2305
2025-11-12 15:36:47 +00:00
4714eb7021 Update dynamic_inductor_timm_training.csv (#167609)
These tests were failing since they were added in
https://github.com/pytorch/pytorch/pull/165381

Evidence: scroll back in HUD, on that commit they were
failing.

I'm going to (1) set the accuracy to get CI green and (2) file an issue
for this.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/167609
Approved by: https://github.com/choijon5, https://github.com/desertfire
2025-11-12 15:15:46 +00:00
780e32524c Move XPUEvent to c10 (#158336)
# Motivation
Move `XPUEvent` to `c10/xpu` to stay consistent with `XPUStream`, which is already in `c10/xpu`. Most importantly, we will leverage `XPUEvent` in our caching allocator instead of a raw sycl event.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158336
Approved by: https://github.com/EikanWang, https://github.com/albanD
2025-11-12 11:29:42 +00:00
6bf51de533 harden backed_size_oblivious and broadcast_shapes (#167232)
We probably need something similar for expand

Pull Request resolved: https://github.com/pytorch/pytorch/pull/167232
Approved by: https://github.com/ColinPeppler
2025-11-12 09:30:24 +00:00
d33d125c94 [inductor] Remove output copy_ for pallas backend in some cases (#167516)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/167516
Approved by: https://github.com/oulgen
2025-11-12 06:18:35 +00:00
dc8bb52f77 Inductor Lite Mode (#167115)
This PR introduces inductor lite mode for opt-in optimizations and numeric correctness guarantees.

Unlike the default mode, which applies all possible fusions, lite mode gives control back to the user and provides a guarantee on numeric correctness. Specifically, this mode:

- **Fallback by Default**: Fallback for ALL nodes by default, unless users explicitly mark node for inductor fusion.
- **Selective Decomposition**: Skip decomposition for all nodes except for user marked nodes.
- **Regional inductor compile**
- Skip dead code elimination
- Skip buffer reuse
- Skip reorder passes, such as reorder for peak memory, reorder for compute comm overlap, and reorder_for_reducing_graph_partitions.
- Skip all pre-grad, joint-graph, and post-grad passes.

## Example: Flex Attention

```python
import torch
import torch.fx.traceback as fx_traceback
from torch.nn.attention.flex_attention import create_block_mask, flex_attention

def _squared(score, b, h, m, n):
    return score * score

def mask_mod(b, h, q, k):
    return q >= 0

a, b = 12, 64
block_mask = create_block_mask(mask_mod, None, None, a * b, a * b, device="cuda")

def fn(x):
    x = torch.sin(x)
    with fx_traceback.annotate({"compile_with_inductor": 0}):
        x = flex_attention(x, x, x, block_mask=block_mask, score_mod=_squared)
    return torch.cos(x)

x = torch.randn(1, 1, a * b, b, dtype=torch.bfloat16, device="cuda", requires_grad=True)

opt_fn = torch.compile(fn, mode="lite", fullgraph=True,)
opt_fn(x)
```

[code diff](https://www.internalfb.com/intern/diffing/?paste_number=2027441476)

[default mode tlp](https://manifold.edge.x2p.facebook.net/v0/read/tree/logs/.tmpYAzDxX/index.html?bucketName=tlparse_reports&apiKey=tlparse_reports-key&withPayload=1&timeoutMsec=10000) vs [lite mode tlp](https://manifold.edge.x2p.facebook.net/v0/read/tree/logs/.tmpnnuh1W/index.html?bucketName=tlparse_reports&apiKey=tlparse_reports-key&withPayload=1&timeoutMsec=10000)

## Numerics

Inductor lite mode provides bitwise equivalence with `aot_eager` backend on torchtitan llama3-8b and DeepSeek v3. https://github.com/pytorch/torchtitan/pull/2005

close: #167012

Pull Request resolved: https://github.com/pytorch/pytorch/pull/167115
Approved by: https://github.com/ezyang
2025-11-12 05:36:26 +00:00
9997e853e9 [DebugMode] record triton kernels, run-to-run determinism checks (#167028)
Following up on https://github.com/pytorch/pytorch/pull/166348, extends DebugMode to capture inductor triton kernels at runtime, and adds an API for checking run-to-run determinism based on tensor hashes.

The workflow looks something like...
```python
# do 1st run with hashes, get logs
with DebugMode() as debug_mode, DebugMode.log_tensor_hashes():
    compiled_model(*inputs)
logs1 = debug_mode.logs

# do 2nd run
with DebugMode() as debug_mode, DebugMode.log_tensor_hashes():
    compiled_model(*inputs)
logs2 = debug_mode.logs

# returns list of calls w/ mismatched outputs
mismatches = DebugMode.check_hash_mismatches(logs1, logs2)
```

Example dump off a smaller version of @drisspg's FlexAttention fwd+bwd determinism tests [script](https://gist.github.com/pianpwk/f65cc63811d12853709dcc77d7eb69f1) (without forced reduction order):
```
cfg: TestConfig(name='Standard', B=2, Hq=32, Hkv=32, Q=2048, KV=2048, Dqk=128, Dv=128)
DETERMINISM: fwd: True, bwd_q: False, bwd_k: False, bwd_v: True

$$$ DEBUG MODE DUMP $$$  (this is what the logs look like)

    [triton] triton_tem_fused_0(arg_Q=t: bf16[2, 32, 2048, 128], arg_K=t: bf16[2, 32, 2048, 128], arg_V=t: bf16[2, 32, 2048, 128], arg_LSE=t: f32[2, 32, 2048], arg_MAX=t: f32[2, 32, 2048], arg_KV_NUM_BLKS=t: i32[2, 32, 16], arg_KV_IDX=t: i32[2, 32, 16, 16], arg_FULL_KV_NUM_BLKS=t: i32[2, 32, 16], arg_FULL_KV_IDX=t: i32[2, 32, 16, 16], out_ptr0=t: bf16[2, 32, 2048, 128])
    # post-kernel hashes: {arg_Q: 13385916.068706088, arg_K: 13389356.409105342, arg_V: 13384993.48412523, arg_LSE: 1347168.9026973695, arg_MAX: 81775.3811062593, arg_KV_NUM_BLKS: 1024.0, arg_KV_IDX: 122880.0, arg_FULL_KV_NUM_BLKS: 7680.0, arg_FULL_KV_IDX: 122880.0, out_ptr0: 924917.7918248245}

    [triton] triton_per_fused_zeros_0(in_ptr0=t: bf16[2, 32, 2048, 128], in_ptr1=t: bf16[2, 32, 2048, 128], out_ptr1=t: f32[2, 32, 2048], xnumel=131072, r0_numel=128)
    # post-kernel hashes: {in_ptr0: 924917.7918248245, in_ptr1: 13389213.797377996, out_ptr1: 81775.38106592931}

    [triton] triton_tem_fused_zeros_1(arg_Q=t: bf16[2, 32, 2048, 128], arg_K=t: bf16[2, 32, 2048, 128], arg_V=t: bf16[2, 32, 2048, 128], arg_LSE=t: f32[2, 32, 2048], arg_DELTA=t: f32[2, 32, 2048], arg_DO=t: bf16[2, 32, 2048, 128], arg_DQ=t: bf16[2, 32, 2048, 128], arg_DV=t: bf16[2, 32, 2048, 128], arg_KV_NUM_BLKS=t: i32[2, 32, 16], arg_KV_IDX=t: i32[2, 32, 16, 16], arg_Q_NUM_BLKS=t: i32[2, 32, 16], arg_Q_IDX=t: i32[2, 32, 16, 16], arg_FULL_KV_NUM_BLKS=t: i32[2, 32, 16], arg_FULL_KV_IDX=t: i32[2, 32, 16, 16], arg_FULL_Q_NUM_BLKS=t: i32[2, 32, 16], arg_FULL_Q_IDX=t: i32[2, 32, 16, 16], out_ptr0=t: bf16[2, 32, 2048, 128])
    # post-kernel hashes: {arg_Q: 13385916.068706088, arg_K: 13389356.409105342, arg_V: 13384993.48412523, arg_LSE: 1347168.9026973695, arg_DELTA: 81775.38106592931, arg_DO: 13389213.797377996, arg_DQ: 874474.8084187683, arg_DV: 727742.3138379117, arg_KV_NUM_BLKS: 1024.0, arg_KV_IDX: 122880.0, arg_Q_NUM_BLKS: 1024.0, arg_Q_IDX: 122880.0, arg_FULL_KV_NUM_BLKS: 7680.0, arg_FULL_KV_IDX: 122880.0, arg_FULL_Q_NUM_BLKS: 7680.0, arg_FULL_Q_IDX: 122880.0, out_ptr0: 700542.3431890717}

$$$ MISMATCHES $$$
mismatch: {'call_type': 'triton kernel', 'call': 'triton_tem_fused_0', 'arg_name': 'arg_MAX', 'pytree_path': None, 'hash1': 0.0, 'hash2': 81775.3811062593, 'rel_diff': 1.0, 'is_input_hash': False}  # I guess this one is misleading? not sure if I'm doing something wrong with waiting for kernel results
mismatch: {'call_type': 'triton kernel', 'call': 'triton_per_fused_zeros_0', 'arg_name': 'out_ptr1', 'pytree_path': None, 'hash1': 81775.3811062593, 'hash2': 81775.38106592931, 'rel_diff': 4.931801261646669e-10, 'is_input_hash': False}
mismatch: {'call_type': 'triton kernel', 'call': 'triton_tem_fused_zeros_1', 'arg_name': 'arg_DELTA', 'pytree_path': None, 'hash1': 81775.3811062593, 'hash2': 81775.38106592931, 'rel_diff': 4.931801261646669e-10, 'is_input_hash': False}
mismatch: {'call_type': 'triton kernel', 'call': 'triton_tem_fused_zeros_1', 'arg_name': 'arg_DQ', 'pytree_path': None, 'hash1': 874474.8097136207, 'hash2': 874474.8084187683, 'rel_diff': 1.480720012120795e-09, 'is_input_hash': False}
mismatch: {'call_type': 'triton kernel', 'call': 'triton_tem_fused_zeros_1', 'arg_name': 'out_ptr0', 'pytree_path': None, 'hash1': 700542.3488049245, 'hash2': 700542.3431890717, 'rel_diff': 8.016435812581196e-09, 'is_input_hash': False}
```

note: current hash implementation is basically tensor norm, so tensor closeness -> hash closeness. This is likely to change soon, e.g. maybe to `torch.hash_tensor` (https://github.com/pytorch/pytorch/pull/154149) by default
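
To illustrate why a norm-style hash makes "close tensors" produce "close hashes", here is a toy sketch on flat lists of floats. Both helpers are assumptions: `norm_hash` uses an L1 norm purely for illustration, and the `rel_diff` formula is a guess that happens to be consistent with the values in the dump above, not the confirmed DebugMode implementation:

```python
import math


def norm_hash(values):
    """Toy norm-style hash of a flat list of floats (L1 norm, assumed)."""
    return math.fsum(abs(v) for v in values)


def rel_diff(hash1: float, hash2: float) -> float:
    """Relative difference, assumed to be |h1 - h2| / max(|h1|, |h2|)."""
    denom = max(abs(hash1), abs(hash2))
    return 0.0 if denom == 0.0 else abs(hash1 - hash2) / denom


a = [1.0, -2.0, 3.0]
b = [1.0, -2.0, 3.0 + 1e-9]  # tiny perturbation of one element
assert rel_diff(norm_hash(a), norm_hash(b)) < 1e-9
```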

Sample paste diff between log dumps from 2 runs:
<img width="1665" height="445" alt="Screenshot 2025-11-05 at 11 27 24 PM" src="https://github.com/user-attachments/assets/41402e37-f50b-4a9e-a17c-bb98b5917076" />

In another case, running this for FSDP2 on Llama3-8B helped narrow down the divergence b/w aot_eager <-> inductor to inductor's FWD RMSNorm kernels: P2027003180

Pull Request resolved: https://github.com/pytorch/pytorch/pull/167028
Approved by: https://github.com/v0i0
2025-11-12 05:21:07 +00:00
2a09f6e02e [4/N] Use Python 3.10 typing (#167458)
This PR applies new Union and Optional typing syntax to some files.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/167458
Approved by: https://github.com/albanD
2025-11-12 05:15:40 +00:00
bf380fbd4c [vision hash update] update the pinned vision hash (#167491)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned vision hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/167491
Approved by: https://github.com/pytorchbot
2025-11-12 04:40:02 +00:00
148fd9a522 [audio hash update] update the pinned audio hash (#167490)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned audio hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/167490
Approved by: https://github.com/pytorchbot
2025-11-12 04:31:31 +00:00
7bb8d8c200 [ROCm][CI] Add trunk-rocm-mi300.yml to test new MI3xx CI capacity (#167587)
This adds a workflow to run full set of UTs on default and distributed configs on ROCm MI3xx CI runners, to _eventually_ assess if the CI capacity can handle the PR-based workload for trunk.yml. The plan was to keep this workflow in unstable as we test out this new CI capacity, so it wouldn't impact PR merges. However, since upstream maintainers have indicated that, as of today, even unstable workflows will block PR merges, we are going with branch push-based triggers to at least pipeclean this workflow on the new CI capacity.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/167587
Approved by: https://github.com/jeffdaily
2025-11-12 03:44:15 +00:00
5ce4a8b49f Revert "fix wrong accuracy_status when exception. (#165731)"
This reverts commit bfcdbd0a970e5ce08cecd0aa33dd389819f0ec4f.

Reverted https://github.com/pytorch/pytorch/pull/165731 on behalf of https://github.com/zou3519 due to broke inductor periodic ([comment](https://github.com/pytorch/pytorch/pull/165731#issuecomment-3519743601))
2025-11-12 03:36:27 +00:00
7dd56474f2 [annotation] Skip copying custom meta for gradient accumulation nodes; tag with is_gradient_acc=True (#167572)
The seq_nr doesn't always increment for gradient accumulation nodes, so they might be copying annotations from forward nodes.

I'm just going to skip copying the custom meta for any gradient accumulation nodes and give them a special tag e.g. node.meta["is_gradient_acc"]=True

Example repro for deepseek torchtitan (without using DTensor): https://gist.github.com/yushangdi/aae13ea382732f31d0fdfb3ffeda12c8

(side note: if you want some more hints on these gradient acc nodes: 1) they have the torch.ops.aten.add.Tensor op, not add.default. 2) they have the highest seq_nr(s) )
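
A sketch of the described behavior on plain dicts. `copy_custom_meta` is a made-up helper, and how gradient accumulation nodes are detected is deliberately not modeled (it is passed in as a flag):

```python
def copy_custom_meta(fwd_meta: dict, bwd_meta: dict, is_gradient_acc: bool) -> dict:
    """Return backward-node meta, tagging accumulation nodes instead of copying `custom`."""
    out = dict(bwd_meta)
    if is_gradient_acc:
        # The seq_nr match is unreliable for these nodes, so do not inherit
        # the forward annotation; just mark the node.
        out["is_gradient_acc"] = True
    elif "custom" in fwd_meta:
        out["custom"] = fwd_meta["custom"]
    return out
```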

Pull Request resolved: https://github.com/pytorch/pytorch/pull/167572
Approved by: https://github.com/mlazos
2025-11-12 03:35:57 +00:00
3260bf3b19 [export] stop gap strict export v2 enable and testing. (#167236)
Summary:
Added a new flag called "use_legacy_dynamo_graph_capture", which defaults to True and is only False in the updated test_strict_export_v2.py

In addition to this flag, we also use the legacy tracer when the following features are used:
1. dynamic shape
2. preserve module call signature
3. retracing.
4. draft mode.

Test Plan:
test_strict_export_v2.py

Pull Request resolved: https://github.com/pytorch/pytorch/pull/167236
Approved by: https://github.com/tugsbayasgalan
2025-11-12 03:33:40 +00:00
05c6a06b2b Add FA4 to sdpa (#167348)
# Summary
See title ;)

## Design

Currently, once you install there is no going back in the same Python process. This need not be the case; cc @mikaylagawarecki's work on being able to grab the original impl. I'll leave this for a follow up.

Okay I added an open reg, but I really want the backends to be found so some weird typing but we get
<img width="523" height="197" alt="Screenshot 2025-11-07 at 3 30 32 PM" src="https://github.com/user-attachments/assets/586de943-bbed-40cf-abd1-131f747a4cf1" />

## Overheads:
<img width="799" height="735" alt="Screenshot 2025-11-07 at 2 35 04 PM" src="https://github.com/user-attachments/assets/f9217f31-3e42-4816-8fb3-29ea8b49d735" />
First call to forward -> majority of time is spent in jit for FA

First call to backward: 3 sec. Interestingly, it doesn't appear that with_stack gets events in the backwards loop. @albanD is this expected?
<img width="948" height="385" alt="Screenshot 2025-11-07 at 2 35 50 PM" src="https://github.com/user-attachments/assets/a40bacd0-3fb0-4bd8-b33e-bec8fb3f36c0" />

Getting from the PT op to the impl takes about 43 us, which is dwarfed by other CPU overheads
<img width="1227" height="649" alt="Screenshot 2025-11-07 at 2 37 41 PM" src="https://github.com/user-attachments/assets/51da0615-facd-41e1-a6e2-fb7778079ab6" />

Just invoking the jit object from CuTe DSL takes 100s of us
<img width="545" height="414" alt="Screenshot 2025-11-07 at 2 38 19 PM" src="https://github.com/user-attachments/assets/d20345a0-6c47-4dcb-892f-9ef9894a1cf5" />

### Example usage
```python
#!/usr/bin/env python3

"""Minimal FA4 smoke test for scaled dot product attention."""

from __future__ import annotations

import sys
from jsonargparse import CLI

import torch
import torch.nn.functional as F
from torch.nn.attention import (
    install_flash_attention_impl,
    sdpa_kernel,
    SDPBackend,
)

def _map_dtype(kind: str) -> torch.dtype:
    return torch.bfloat16 if kind == "bf16" else torch.float16

# To infinity and beyond
install_flash_attention_impl("FA4")

@sdpa_kernel([SDPBackend.FLASH_ATTENTION])
def main(
    module_path: str = "flash_attn.cute.interface",
    batch: int = 4,
    seq: int = 81292,
    heads: int = 16,
    head_dim: int = 128,
    device: int = 0,
    dtype: str = "bf16"
    ) -> None:
    if not torch.cuda.is_available():
        sys.exit("CUDA is required for FA4 smoke testing")
    torch.cuda.set_device(device)
    dtype = _map_dtype(dtype)
    generator = torch.Generator(device="cuda").manual_seed(0)
    q = torch.randn(
        batch,
        heads,
        seq,
        head_dim,
        device="cuda",
        dtype=dtype,
        requires_grad=True,
        generator=generator,
    )
    k = torch.randn(
        batch,
        heads,
        seq,
        head_dim,
        device="cuda",
        dtype=dtype,
        requires_grad=True,
        generator=generator,
    )
    v = torch.randn(
        batch,
        heads,
        seq,
        head_dim,
        device="cuda",
        dtype=dtype,
        requires_grad=True,
        generator=generator,
    )
    from transformer_nuggets.utils.benchmark import profiler
    with profiler("sdpa_FA4", with_stack=False):
        for _ in range(3):
            out = F.scaled_dot_product_attention(q, k, v, attn_mask=None, dropout_p=0.0, is_causal=False)
            loss = out.real.sum()
            loss.backward()
    print("Scaled dot product attention output norm:", out.norm().item())
    print("dq norm:", q.grad.norm().item())

if __name__ == "__main__":
    CLI(main)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/167348
Approved by: https://github.com/albanD, https://github.com/malfet
2025-11-12 03:29:07 +00:00
25e9d8124c Revert "Use c7i.2xlarge for B200 build (#167078)"
This reverts commit bb3748346484d49ace45dcc92b72c12b2ba30d98.

Reverted https://github.com/pytorch/pytorch/pull/167078 on behalf of https://github.com/zxiiro due to This seems to be breaking build when compile is not using sscache. Needs more investigation. ([comment](https://github.com/pytorch/pytorch/pull/167078#issuecomment-3519717750))
2025-11-12 03:22:48 +00:00
bc882f8284 Support Python 3.14 Lazy Function Annotations on FX graph (#167573)
This was interesting to debug:
(1) Python 3.14 ships with a lazy way for retrieving annotations. The annotations field can be a callable that lazily evaluates it
(2) On the dynamo side, `SET_FUNCTION_ATTRIBUTE` needs to handle an extra flag value (0x10)
(3) The decorator `functools.wraps`, used extensively in the codebase (e.g. `make_fx`, `functionalize`), doesn't copy the `__annotations__` attribute by default. To correctly retrieve an annotation, we need to walk the chain of `__wrapped__` objects and retrieve the attribute from the first function. Fortunately, there are functions in the stdlib to do this.

Fixes:
```
'test/functorch/test_eager_transforms.py::TestFunctionalizeCPU::test_functionalize_fx_out_op_cpu', 'test/functorch/test_eager_transforms.py::TestFunctionalizeCPU::test_functionalize_fx_transpose_simple_cpu', 'test/functorch/test_eager_transforms.py::TestFunctionalizeCPU::test_functionalize_optional_tensorlist2_cpu', 'test/functorch/test_eager_transforms.py::TestFunctionalizeCPU::test_functionalize_fx_multi_out_op_cpu', 'test/functorch/test_eager_transforms.py::TestFunctionalizeCPU::test_functionalize_fx_reapply_views_simple_cpu', 'test/functorch/test_eager_transforms.py::TestFunctionalizeCPU::test_functionalize_fx_simple_cpu', 'test/functorch/test_eager_transforms.py::TestFunctionalizeCPU::test_functionalize_nonfunctional_output_cpu', 'test/functorch/test_eager_transforms.py::TestFunctionalizeCPU::test_functionalize_opt_tensor_list_cpu', 'test/functorch/test_eager_transforms.py::TestFunctionalizeCPU::test_functionalize_optional_tensorlist1_cpu'
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/167573
Approved by: https://github.com/williamwen42
2025-11-12 02:55:57 +00:00
edd365ed4a [MemTracker] Fix: Remove monkey patching DTensor dispatch (#167580)
Fixes `MemTracker` for #167051

Pull Request resolved: https://github.com/pytorch/pytorch/pull/167580
Approved by: https://github.com/anshul-si
2025-11-12 02:51:40 +00:00
1366a2fa55 [ROCm][CI] Enable uploading of artifacts from docker-builds.yml (#167379)
Needed for https://github.com/pytorch/pytorch/pull/167554 so that we can enable docker caching for ROCm MI3xx runners

Replaces https://github.com/pytorch/pytorch/pull/167378 (by filing from pytorch/pytorch branch so OIDC login doesn't fail due to forked repo)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/167379
Approved by: https://github.com/jeffdaily
2025-11-12 02:24:05 +00:00
91f0c5a9da [simplefsdp] add manual bucketing pass (#165487)
As titled, this PR adds a manual bucketing pass to SimpleFSDP. Users need to specify the FQNs they want to bucket together using `module_bucket_plans`. Then `_manual_bucket_collectives` gets the nodes of the subgraphs corresponding to each `bucket_module` and buckets bucketable (FSDP-style) AG/RS together. `_manual_reorder_graph` reorders them for overlapping.

For detailed performance, see this torchtitan PR: https://github.com/pytorch/torchtitan/pull/1881.

There are a few todo items listed in the torchtitan PR. Let's start with this PR, which implements FSDP+TP+llama3 manual bucketing. I will fix/add the rest in follow-up PRs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165487
Approved by: https://github.com/ezyang
2025-11-12 02:18:34 +00:00
67390692c5 [ROCm][CI] Restrict docker-cache-rocm.yml to main/release branches (#167593)
Follow-up to https://github.com/pytorch/pytorch/pull/167554

Pull Request resolved: https://github.com/pytorch/pytorch/pull/167593
Approved by: https://github.com/jeffdaily
2025-11-12 02:08:23 +00:00
1debfd44fd Revert "Add FA4 to sdpa (#167348)"
This reverts commit cdf0a9c21f7c27298a5bc71620206353125c5494.

Reverted https://github.com/pytorch/pytorch/pull/167348 on behalf of https://github.com/malfet due to Looks like it broke lint? ([comment](https://github.com/pytorch/pytorch/pull/167348#issuecomment-3519549113))
2025-11-12 02:05:30 +00:00
cdf0a9c21f Add FA4 to sdpa (#167348)
# Summary
See title ;)

## Design

Currently, once you install, there is no going back in the same Python process. This need not be the case; cc @mikaylagawarecki's work on being able to grab the original impl. I'll leave that for a follow-up.

Okay, I added an open reg, but I really want the backends to be found, so there is some weird typing, but we get:
<img width="523" height="197" alt="Screenshot 2025-11-07 at 3 30 32 PM" src="https://github.com/user-attachments/assets/586de943-bbed-40cf-abd1-131f747a4cf1" />

## Overheads:
<img width="799" height="735" alt="Screenshot 2025-11-07 at 2 35 04 PM" src="https://github.com/user-attachments/assets/f9217f31-3e42-4816-8fb3-29ea8b49d735" />
First call to forward -> majority of time is spent in jit for FA

First call to backward: 3 sec. Interestingly, it doesn't appear that with_stack gets events in the backward loop. @albanD, is this expected?
<img width="948" height="385" alt="Screenshot 2025-11-07 at 2 35 50 PM" src="https://github.com/user-attachments/assets/a40bacd0-3fb0-4bd8-b33e-bec8fb3f36c0" />

Getting from PT op to impl is about 43 us, which is dwarfed by other CPU overheads
<img width="1227" height="649" alt="Screenshot 2025-11-07 at 2 37 41 PM" src="https://github.com/user-attachments/assets/51da0615-facd-41e1-a6e2-fb7778079ab6" />

Just invoking the JIT object from cutedsl takes 100s of us
<img width="545" height="414" alt="Screenshot 2025-11-07 at 2 38 19 PM" src="https://github.com/user-attachments/assets/d20345a0-6c47-4dcb-892f-9ef9894a1cf5" />

### Example usage
```Py
#!/usr/bin/env python3

"""Minimal FA4 smoke test for scaled dot product attention."""

from __future__ import annotations

import sys
from jsonargparse import CLI

import torch
import torch.nn.functional as F
from torch.nn.attention import (
    install_flash_attention_impl,
    sdpa_kernel,
    SDPBackend,
)

def _map_dtype(kind: str) -> torch.dtype:
    return torch.bfloat16 if kind == "bf16" else torch.float16

# To infinity and beyond
install_flash_attention_impl("FA4")

@sdpa_kernel([SDPBackend.FLASH_ATTENTION])
def main(
    module_path: str = "flash_attn.cute.interface",
    batch: int = 4,
    seq: int = 81292,
    heads: int = 16,
    head_dim: int = 128,
    device: int = 0,
    dtype: str = "bf16"
    ) -> None:
    if not torch.cuda.is_available():
        sys.exit("CUDA is required for FA4 smoke testing")
    torch.cuda.set_device(device)
    dtype = _map_dtype(dtype)
    generator = torch.Generator(device="cuda").manual_seed(0)
    q = torch.randn(
        batch,
        heads,
        seq,
        head_dim,
        device="cuda",
        dtype=dtype,
        requires_grad=True,
        generator=generator,
    )
    k = torch.randn(
        batch,
        heads,
        seq,
        head_dim,
        device="cuda",
        dtype=dtype,
        requires_grad=True,
        generator=generator,
    )
    v = torch.randn(
        batch,
        heads,
        seq,
        head_dim,
        device="cuda",
        dtype=dtype,
        requires_grad=True,
        generator=generator,
    )
    from transformer_nuggets.utils.benchmark import profiler
    with profiler("sdpa_FA4", with_stack=False):
        for _ in range(3):
            out = F.scaled_dot_product_attention(q, k, v, attn_mask=None, dropout_p=0.0, is_causal=False)
            loss = out.real.sum()
            loss.backward()
    print("Scaled dot product attention output norm:", out.norm().item())
    print("dq norm:", q.grad.norm().item())

if __name__ == "__main__":
    CLI(main)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/167348
Approved by: https://github.com/albanD
2025-11-12 01:07:59 +00:00
115016f1a2 [Device Mesh][ez] Clean up unused parameters and duplicate codes (#167581)
While refactoring the code, I found that we re-init `_flatten_mapping` and still keep `_flatten_mesh_list` in the code, which is not needed anymore. Let's remove it.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/167581
Approved by: https://github.com/fegin
2025-11-12 00:59:32 +00:00
971e6ca434 fix sym_size_, sym_stride lowering (#167565)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/167565
Approved by: https://github.com/bobrenjc93, https://github.com/Microve, https://github.com/Skylion007
ghstack dependencies: #167345
2025-11-12 00:53:36 +00:00
e8d411e7f7 FSDPMemTracker fix with multihandler hooks. (#165662)
Fixes #164663

## Issue
A torch model with multiple layers that is wrapped with fsdp2 registers pre- and post-forward hooks as a group using `_MultiHandler`. This becomes an issue inside the tracker's context manager, where the hooks are reset and replaced. The hooks all use the same fsdp state pointer, so one reset will reset all of them. As a result, when the output layer was modified with new pre- and post-forward hooks, it would delete the previous layer's initialization, causing a `KeyError` for the Norm layer because it no longer exists.

## The Fix
Check to see if there are multiple `_MultiHandler` objects and `RemoveHandler` objects and only execute the remove hook once.
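
The idea behind the fix can be sketched as follows. This is a simplified illustration, not the code in the PR: `SharedHandle` and `remove_hooks_once` are hypothetical names, and the real `_MultiHandler` bookkeeping is more involved.

```python
class SharedHandle:
    """Hypothetical stand-in for a hook handle shared by several modules."""

    def __init__(self):
        self.remove_calls = 0

    def remove(self):
        self.remove_calls += 1


def remove_hooks_once(handles):
    """Call remove() once per distinct handle object, even if it appears several times."""
    seen = set()
    removed = 0
    for handle in handles:
        if id(handle) in seen:
            continue  # this shared handle was already removed via another module
        seen.add(id(handle))
        handle.remove()
        removed += 1
    return removed


shared = SharedHandle()
own = SharedHandle()
# Three modules: two of them hold the same shared handle object.
count = remove_hooks_once([shared, shared, own])
```

Deduplicating by object identity means a handle shared across layers is removed a single time, so later layers do not wipe state that earlier layers still rely on.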

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165662
Approved by: https://github.com/sanketpurandare
2025-11-11 23:49:36 +00:00
2e5233d7bd Revert "Support AC in default partitioner when functionalization is enabled (#166610)"
This reverts commit de773364be041ca7fd2dcaf35ca15c093fc9370b.

Reverted https://github.com/pytorch/pytorch/pull/166610 on behalf of https://github.com/soulitzer due to breaking internal tests ([comment](https://github.com/pytorch/pytorch/pull/166610#issuecomment-3519047226))
2025-11-11 23:01:09 +00:00
514dd96376 Remove --no-use-pep517 flag (#167096)
In pip 25.3 and newer, use of --no-use-pep517 has been removed (https://pip.pypa.io/en/stable/news/). In builds with pip 25.2, a warning message notes:

> DEPRECATION: Building 'torchvision' using the legacy setup.py bdist_wheel mechanism, which will be removed in a future version. pip 25.3 will enforce this behaviour change. A possible replacement is to use the standardized build interface by setting the `--use-pep517` option, (possibly combined with `--no-build-isolation`), or adding a `pyproject.toml` file to the source tree of 'torchvision'. Discussion can be found at https://github.com/pypa/pip/issues/6334

Pull Request resolved: https://github.com/pytorch/pytorch/pull/167096
Approved by: https://github.com/atalman
2025-11-11 23:00:35 +00:00
9ae62fcc18 [ROCm][CI] dynamo benchmarks update ci expected accuracy (#167574)
repvgg_a2 IMPROVED: accuracy=pass, expected=fail_accuracy

Pull Request resolved: https://github.com/pytorch/pytorch/pull/167574
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-11-11 22:54:55 +00:00
ae71b0e163 Fix typo in torch._refs (#167310)
This looks like a typo, but it doesn't raise an error because the inner function splits it into `a` and `,`, and the check for the `,` case is skipped.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/167310
Approved by: https://github.com/eellison
2025-11-11 22:31:09 +00:00
5b6ff8148d Revert "[ARM] Improve LLM performance & mem usage using int4-bf16 KleidiAI kernels (#158250)"
This reverts commit 402c46503002f98ccfc023a733081fb0719223a1.

Reverted https://github.com/pytorch/pytorch/pull/158250 on behalf of https://github.com/izaitsevfb due to Broke some torch.compile jobs ([comment](https://github.com/pytorch/pytorch/pull/158250#issuecomment-3518944863))
2025-11-11 22:27:51 +00:00
1f7e4343e7 [ROCm][CI] Add docker-cache-rocm.yml to test MI3xx CI docker caching (#167554)
* Trigger this workflow on every completed run of `docker-builds.yml`
* Uses `ubuntu-latest` for downloading artifacts from `docker-build` workflow run
* Uses `linux.rocm.gfx942.docker-cache` to cache docker images as tarballs for MI3xx CI

Pull Request resolved: https://github.com/pytorch/pytorch/pull/167554
Approved by: https://github.com/jeffdaily
2025-11-11 21:32:22 +00:00
b21856f5fc Revert "[DebugMode] record triton kernels, run-to-run determinism checks (#167028)"
This reverts commit 259ba0ecabd809edd35d12b4f992777cb5923b68.

Reverted https://github.com/pytorch/pytorch/pull/167028 on behalf of https://github.com/pytorch-auto-revert due to Reverted automatically by pytorch's autorevert, to avoid this behaviour add the tag autorevert: disable ([comment](https://github.com/pytorch/pytorch/pull/167028#issuecomment-3518811298))
2025-11-11 21:31:12 +00:00
259ba0ecab [DebugMode] record triton kernels, run-to-run determinism checks (#167028)
Following up on https://github.com/pytorch/pytorch/pull/166348, extends DebugMode to capture inductor triton kernels at runtime, and adds an API for checking run-to-run determinism based on tensor hashes.

The workflow looks something like...
```python
# do 1st run with hashes, get logs
with DebugMode() as debug_mode, DebugMode.log_tensor_hashes():
    compiled_model(*inputs)
logs1 = debug_mode.logs

# do 2nd run
with DebugMode() as debug_mode, DebugMode.log_tensor_hashes():
    compiled_model(*inputs)
logs2 = debug_mode.logs

# returns list of calls w/ mismatched outputs
mismatches = DebugMode.check_hash_mismatches(logs1, logs2)
```

Example dump off a smaller version of @drisspg's FlexAttention fwd+bwd determinism tests [script](https://gist.github.com/pianpwk/f65cc63811d12853709dcc77d7eb69f1) (without forced reduction order):
```
cfg: TestConfig(name='Standard', B=2, Hq=32, Hkv=32, Q=2048, KV=2048, Dqk=128, Dv=128)
DETERMINISM: fwd: True, bwd_q: False, bwd_k: False, bwd_v: True

$$$ DEBUG MODE DUMP $$$  (this is what the logs look like)

    [triton] triton_tem_fused_0(arg_Q=t: bf16[2, 32, 2048, 128], arg_K=t: bf16[2, 32, 2048, 128], arg_V=t: bf16[2, 32, 2048, 128], arg_LSE=t: f32[2, 32, 2048], arg_MAX=t: f32[2, 32, 2048], arg_KV_NUM_BLKS=t: i32[2, 32, 16], arg_KV_IDX=t: i32[2, 32, 16, 16], arg_FULL_KV_NUM_BLKS=t: i32[2, 32, 16], arg_FULL_KV_IDX=t: i32[2, 32, 16, 16], out_ptr0=t: bf16[2, 32, 2048, 128])
    # post-kernel hashes: {arg_Q: 13385916.068706088, arg_K: 13389356.409105342, arg_V: 13384993.48412523, arg_LSE: 1347168.9026973695, arg_MAX: 81775.3811062593, arg_KV_NUM_BLKS: 1024.0, arg_KV_IDX: 122880.0, arg_FULL_KV_NUM_BLKS: 7680.0, arg_FULL_KV_IDX: 122880.0, out_ptr0: 924917.7918248245}

    [triton] triton_per_fused_zeros_0(in_ptr0=t: bf16[2, 32, 2048, 128], in_ptr1=t: bf16[2, 32, 2048, 128], out_ptr1=t: f32[2, 32, 2048], xnumel=131072, r0_numel=128)
    # post-kernel hashes: {in_ptr0: 924917.7918248245, in_ptr1: 13389213.797377996, out_ptr1: 81775.38106592931}

    [triton] triton_tem_fused_zeros_1(arg_Q=t: bf16[2, 32, 2048, 128], arg_K=t: bf16[2, 32, 2048, 128], arg_V=t: bf16[2, 32, 2048, 128], arg_LSE=t: f32[2, 32, 2048], arg_DELTA=t: f32[2, 32, 2048], arg_DO=t: bf16[2, 32, 2048, 128], arg_DQ=t: bf16[2, 32, 2048, 128], arg_DV=t: bf16[2, 32, 2048, 128], arg_KV_NUM_BLKS=t: i32[2, 32, 16], arg_KV_IDX=t: i32[2, 32, 16, 16], arg_Q_NUM_BLKS=t: i32[2, 32, 16], arg_Q_IDX=t: i32[2, 32, 16, 16], arg_FULL_KV_NUM_BLKS=t: i32[2, 32, 16], arg_FULL_KV_IDX=t: i32[2, 32, 16, 16], arg_FULL_Q_NUM_BLKS=t: i32[2, 32, 16], arg_FULL_Q_IDX=t: i32[2, 32, 16, 16], out_ptr0=t: bf16[2, 32, 2048, 128])
    # post-kernel hashes: {arg_Q: 13385916.068706088, arg_K: 13389356.409105342, arg_V: 13384993.48412523, arg_LSE: 1347168.9026973695, arg_DELTA: 81775.38106592931, arg_DO: 13389213.797377996, arg_DQ: 874474.8084187683, arg_DV: 727742.3138379117, arg_KV_NUM_BLKS: 1024.0, arg_KV_IDX: 122880.0, arg_Q_NUM_BLKS: 1024.0, arg_Q_IDX: 122880.0, arg_FULL_KV_NUM_BLKS: 7680.0, arg_FULL_KV_IDX: 122880.0, arg_FULL_Q_NUM_BLKS: 7680.0, arg_FULL_Q_IDX: 122880.0, out_ptr0: 700542.3431890717}

$$$ MISMATCHES $$$
mismatch: {'call_type': 'triton kernel', 'call': 'triton_tem_fused_0', 'arg_name': 'arg_MAX', 'pytree_path': None, 'hash1': 0.0, 'hash2': 81775.3811062593, 'rel_diff': 1.0, 'is_input_hash': False}  # I guess this one is misleading? not sure if I'm doing something wrong with waiting for kernel results
mismatch: {'call_type': 'triton kernel', 'call': 'triton_per_fused_zeros_0', 'arg_name': 'out_ptr1', 'pytree_path': None, 'hash1': 81775.3811062593, 'hash2': 81775.38106592931, 'rel_diff': 4.931801261646669e-10, 'is_input_hash': False}
mismatch: {'call_type': 'triton kernel', 'call': 'triton_tem_fused_zeros_1', 'arg_name': 'arg_DELTA', 'pytree_path': None, 'hash1': 81775.3811062593, 'hash2': 81775.38106592931, 'rel_diff': 4.931801261646669e-10, 'is_input_hash': False}
mismatch: {'call_type': 'triton kernel', 'call': 'triton_tem_fused_zeros_1', 'arg_name': 'arg_DQ', 'pytree_path': None, 'hash1': 874474.8097136207, 'hash2': 874474.8084187683, 'rel_diff': 1.480720012120795e-09, 'is_input_hash': False}
mismatch: {'call_type': 'triton kernel', 'call': 'triton_tem_fused_zeros_1', 'arg_name': 'out_ptr0', 'pytree_path': None, 'hash1': 700542.3488049245, 'hash2': 700542.3431890717, 'rel_diff': 8.016435812581196e-09, 'is_input_hash': False}
```

note: current hash implementation is basically tensor norm, so tensor closeness -> hash closeness. This is likely to change soon, e.g. maybe to `torch.hash_tensor` (https://github.com/pytorch/pytorch/pull/154149) by default
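
To illustrate why norm-like hashes make near-identical tensors produce near-identical hashes, here is a small sketch. Both helpers are assumptions for illustration: `norm_hash` uses an L1 norm over a plain list, and `rel_diff` divides by the larger magnitude, which is consistent with the `rel_diff` values in the dump above but is not taken from the DebugMode source.

```python
import math


def norm_hash(values):
    """Assumed hash: L1 norm of the values (stand-in for a tensor norm)."""
    return math.fsum(abs(v) for v in values)


def rel_diff(hash1, hash2):
    """Assumed relative difference: |h1 - h2| / max(|h1|, |h2|), 0.0 if both are zero."""
    denom = max(abs(hash1), abs(hash2))
    return 0.0 if denom == 0.0 else abs(hash1 - hash2) / denom


run1 = [0.5, -1.25, 2.0]
run2 = [0.5, -1.25, 2.0 + 1e-9]  # tiny run-to-run perturbation

small = rel_diff(norm_hash(run1), norm_hash(run2))

# Hash pairs copied from the mismatch lines in the dump above.
from_dump = rel_diff(81775.3811062593, 81775.38106592931)
zero_case = rel_diff(0.0, 81775.3811062593)
```

Under these assumptions, a tiny perturbation yields a tiny `rel_diff`, while a hash of `0.0` against a non-zero hash yields exactly `1.0`, matching the first mismatch line.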

Sample paste diff between log dumps from 2 runs:
<img width="1665" height="445" alt="Screenshot 2025-11-05 at 11 27 24 PM" src="https://github.com/user-attachments/assets/41402e37-f50b-4a9e-a17c-bb98b5917076" />

Another case: running this for FSDP2 on Llama3-8B helped narrow down divergence between aot_eager and inductor to inductor's FWD RMSNorm kernels: P2027003180

Pull Request resolved: https://github.com/pytorch/pytorch/pull/167028
Approved by: https://github.com/v0i0
2025-11-11 20:37:53 +00:00
051f1fe8e3 Revert "[ROCm][CI] Update docker-cache-mi300.yml to test MI300 CI docker caching (#167554)"
This reverts commit ee387c43feada1cc2049b42a970ec4e2f12f210e.

Reverted https://github.com/pytorch/pytorch/pull/167554 on behalf of https://github.com/jithunnair-amd due to workflow had failure 'Unexpected input(s) 'run_id'' ([comment](https://github.com/pytorch/pytorch/pull/167554#issuecomment-3518642191))
2025-11-11 20:34:44 +00:00
ee387c43fe [ROCm][CI] Update docker-cache-mi300.yml to test MI300 CI docker caching (#167554)
Trigger this workflow on every completed run of `docker-builds.yml` and run on `ubuntu-latest` so it doesn't queue infinitely for `rocm-docker` label

Pull Request resolved: https://github.com/pytorch/pytorch/pull/167554
Approved by: https://github.com/jeffdaily
2025-11-11 19:49:00 +00:00
3a944661d6 Cpython test_math.FMATests (#167217)
Resolves issues running the dynamo cpython math.fma tests.

Though math.fma is enabled to perform a multiply-add in dynamo, torch.addcmul is currently used, which doesn't guarantee the fused multiply-add the user requested. It was decided not to use the inductor fma prim, as it would break the contract of using aten/core IR in dynamo output, which may cause issues with export=True.
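
To see why a separate multiply and add does not honor an fma request, consider this pure-Python sketch. It does not use dynamo or `torch.addcmul`; the exact result is computed with `fractions.Fraction` as a reference, and `math.fma` is only checked when the interpreter provides it (Python 3.13+). The input values are chosen for illustration.

```python
import math
from fractions import Fraction

a = 1.0 + 2.0**-30
b = 1.0 - 2.0**-30
c = -1.0

# Unfused: a * b is rounded to a double first (1 - 2**-60 rounds to 1.0), then c is added.
unfused = a * b + c

# Reference for a fused result: compute a * b + c exactly, then round once.
fused_reference = float(Fraction(a) * Fraction(b) + Fraction(c))

# math.fma was added in Python 3.13; older interpreters skip this check.
if hasattr(math, "fma"):
    assert math.fma(a, b, c) == fused_reference

print(unfused, fused_reference)
```

The unfused path loses the low-order product bits to the intermediate rounding, which is the behavior a true fma is meant to avoid.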

Pull Request resolved: https://github.com/pytorch/pytorch/pull/167217
Approved by: https://github.com/guilhermeleobas
2025-11-11 19:26:18 +00:00
56034074ca Revert "[Inductor] Naive foreach autotune support (#162053)"
This reverts commit 6c5db82584bf71f5b1db3b598bbd00f44140c28d.

Reverted https://github.com/pytorch/pytorch/pull/162053 on behalf of https://github.com/mlazos due to Sorry, there's an internal slowdown due to the extra triton configs you added ([comment](https://github.com/pytorch/pytorch/pull/162053#issuecomment-3518423369))
2025-11-11 19:23:40 +00:00
8def619bbe [user-streams] wait_stream op (#167512)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/167512
Approved by: https://github.com/williamwen42
ghstack dependencies: #167510, #167511
2025-11-11 19:18:03 +00:00
61883a5787 [user-streams] Allow new streams to be created and registered during compilation (#167511)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/167511
Approved by: https://github.com/williamwen42
ghstack dependencies: #167510
2025-11-11 19:18:03 +00:00
d8ada1ee76 [user-streams] Allow new events to be created and registered during compilation (#167510)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/167510
Approved by: https://github.com/williamwen42
2025-11-11 19:18:03 +00:00
fe841a1db4 [DeviceMesh] Log DeviceMesh.__init__ usage (#167375)
Adds (meta-internal-only) API usage logging for DeviceMesh creation.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/167375
Approved by: https://github.com/fduwjj
ghstack dependencies: #167374
2025-11-11 19:15:47 +00:00
b65829b84f [DTensor] Log API usage metrics for DTensor and DeviceMesh (#167374)
Logging propagate_op_sharding_non_cached is a compromise between
 - logging in DTensor.__init__ to catch ALL DTensor usage
 - sparing the overhead in a latency-sensitive region like
   DTensor.__init__
 - and 'real' DTensor usage should incur at least one call to sharding
   propagation

Pull Request resolved: https://github.com/pytorch/pytorch/pull/167374
Approved by: https://github.com/zpcore
2025-11-11 19:15:47 +00:00
b0e0ae97ba include thrust/distance.h explicitly in cuda sparse softmax (#167436)
`thrust::distance` is defined there
Pull Request resolved: https://github.com/pytorch/pytorch/pull/167436
Approved by: https://github.com/Skylion007
2025-11-11 19:10:55 +00:00
f44a1ddcb2 Revert "[ROCm][CI] Update docker-cache-mi300.yml to test MI300 CI docker caching (#167554)"
This reverts commit 184e2cbc89570e1bf466b15d70fc36ed71be0eb9.

Reverted https://github.com/pytorch/pytorch/pull/167554 on behalf of https://github.com/jithunnair-amd due to Need to fix lint ([comment](https://github.com/pytorch/pytorch/pull/167554#issuecomment-3518382341))
2025-11-11 19:09:45 +00:00
184e2cbc89 [ROCm][CI] Update docker-cache-mi300.yml to test MI300 CI docker caching (#167554)
Trigger this workflow on every completed run of `docker-builds.yml` and run on `ubuntu-latest` so it doesn't queue infinitely for `rocm-docker` label

Pull Request resolved: https://github.com/pytorch/pytorch/pull/167554
Approved by: https://github.com/jeffdaily
2025-11-11 19:07:19 +00:00
416421c7c4 fix failure of exporting compiled model with nested dynamic shapes (#166358)
## Problems
When exporting a compiled model with nested inputs, as in the script below:
```python
import torch
from torch.export import export, Dim

def test_export_compiled_model_with_nested_dynamic_shapes():
   """Test exporting a compiled model with nested dict inputs and dynamic shapes."""
   print("Running test_export_compiled_model_with_nested_dynamic_shapes...")

   class M(torch.nn.Module):
       def forward(self, data_batch):
           return data_batch["a1"] + data_batch["a2"]

   m = M()
   compiled_m = torch.compile(m)
   example_args = ({
       "a1": torch.ones(3, 3),
       "a2": torch.ones(3, 3),
   },)
   dynamic_shapes = ({
       "a1": {0: Dim.DYNAMIC},
       "a2": {0: Dim.DYNAMIC},
   },)

   try:
       ep = export(compiled_m, example_args, dynamic_shapes=dynamic_shapes, strict=True)
       gm = ep.module()
       result_exported = gm(*example_args)
       result_compiled = compiled_m(*example_args)

       assert torch.allclose(result_exported, result_compiled), "Results don't match!"
       print("✓ test_export_compiled_model_with_nested_dynamic_shapes PASSED")
       return True
   except Exception as e:
       print(f"✗ test_export_compiled_model_with_nested_dynamic_shapes FAILED")
       print(f"Error: {e}")
       import traceback
       traceback.print_exc()
       return False

def test_export_compiled_model_with_kwargs_dynamic_shapes():
   """Test exporting a compiled model with kwargs and dynamic shapes."""
   print("\nRunning test_export_compiled_model_with_kwargs_dynamic_shapes...")

   class M(torch.nn.Module):
       def forward(self, a1, a2):
           return a1 + a2

   m = M()
   compiled_m = torch.compile(m)
   example_args = ()
   example_kwargs = {
       "a1": torch.ones(3, 3),
       "a2": torch.ones(3, 3),
   }
   dynamic_shapes = {
       "a1": {0: Dim.DYNAMIC},
       "a2": {0: Dim.DYNAMIC},
   }

   try:
       ep = export(compiled_m, example_args, kwargs=example_kwargs, dynamic_shapes=dynamic_shapes, strict=True)
       gm = ep.module()
       result_exported = gm(**example_kwargs)
       result_compiled = compiled_m(**example_kwargs)

       assert torch.allclose(result_exported, result_compiled), "Results don't match!"
       print("✓ test_export_compiled_model_with_kwargs_dynamic_shapes PASSED")
       return True
   except Exception as e:
       print(f"✗ test_export_compiled_model_with_kwargs_dynamic_shapes FAILED")
       print(f"Error: {e}")
       import traceback
       traceback.print_exc()
       return False

if __name__ == "__main__":
   print("Testing export of compiled models with dynamic shapes\n")
   print("=" * 70)

   results = []
   results.append(test_export_compiled_model_with_nested_dynamic_shapes())
   results.append(test_export_compiled_model_with_kwargs_dynamic_shapes())

   print("\n" + "=" * 70)
   print(f"\nResults: {sum(results)}/{len(results)} tests passed")

   if all(results):
       print("✓ All tests passed!")
   else:
       print("✗ Some tests failed")
       exit(1)
```

It will report
```
======================================================================
Running test_export_compiled_model_with_nested_dynamic_shapes...
✗ test_export_compiled_model_with_nested_dynamic_shapes FAILED
Error: Detected mismatch between the structure of `inputs` and `dynamic_shapes`: `inputs[0]` is a <class 'tuple'>, but `dynamic_shapes[0]` is a <class 'dict'>
For more information about this error, see: https://pytorch.org/docs/main/generated/exportdb/index.html#dynamic-shapes-validation

The error above occurred when calling torch.export.export. If you would like to view some more information about this error, and get a list of all other errors that may occur in your export call, you can replace your `export()` call with `draft_export()`.
Traceback (most recent call last):
  File "/home/chzhu/infinitrain/build/infinitrain/environments/development-venv/lib/python3.10/site-packages/torch/export/dynamic_shapes.py", line 614, in _tree_map_with_path
    return tree_map_with_path(f, tree, *dynamic_shapes, is_leaf=is_leaf)
  File "/home/chzhu/infinitrain/build/infinitrain/environments/development-venv/lib/python3.10/site-packages/torch/utils/_pytree.py", line 2055, in tree_map_with_path
    all_keypath_leaves = keypath_leaves + [treespec.flatten_up_to(r) for r in rests]
  File "/home/chzhu/infinitrain/build/infinitrain/environments/development-venv/lib/python3.10/site-packages/torch/utils/_pytree.py", line 2055, in <listcomp>
    all_keypath_leaves = keypath_leaves + [treespec.flatten_up_to(r) for r in rests]
  File "/home/chzhu/infinitrain/build/infinitrain/environments/development-venv/lib/python3.10/site-packages/torch/utils/_pytree.py", line 1188, in flatten_up_to
    helper(self, tree, subtrees)
  File "/home/chzhu/infinitrain/build/infinitrain/environments/development-venv/lib/python3.10/site-packages/torch/utils/_pytree.py", line 1185, in helper
    helper(subspec, subtree, subtrees)
  File "/home/chzhu/infinitrain/build/infinitrain/environments/development-venv/lib/python3.10/site-packages/torch/utils/_pytree.py", line 1141, in helper
    raise ValueError(
ValueError: Node type mismatch; expected <class 'tuple'>, but got <class 'dict'>.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/chzhu/infinitrain/test_exprot.py", line 25, in test_export_compiled_model_with_nested_dynamic_shapes
    ep = export(compiled_m, example_args, dynamic_shapes=dynamic_shapes, strict=True)
  File "/home/chzhu/infinitrain/build/infinitrain/environments/development-venv/lib/python3.10/site-packages/torch/export/__init__.py", line 311, in export
    raise e
  File "/home/chzhu/infinitrain/build/infinitrain/environments/development-venv/lib/python3.10/site-packages/torch/export/__init__.py", line 277, in export
    return _export(
  File "/home/chzhu/infinitrain/build/infinitrain/environments/development-venv/lib/python3.10/site-packages/torch/export/_trace.py", line 1163, in wrapper
    raise e
  File "/home/chzhu/infinitrain/build/infinitrain/environments/development-venv/lib/python3.10/site-packages/torch/export/_trace.py", line 1129, in wrapper
    ep = fn(*args, **kwargs)
  File "/home/chzhu/infinitrain/build/infinitrain/environments/development-venv/lib/python3.10/site-packages/torch/export/exported_program.py", line 124, in wrapper
    return fn(*args, **kwargs)
  File "/home/chzhu/infinitrain/build/infinitrain/environments/development-venv/lib/python3.10/site-packages/torch/export/_trace.py", line 2255, in _export
    ep = _export_for_training(
  File "/home/chzhu/infinitrain/build/infinitrain/environments/development-venv/lib/python3.10/site-packages/torch/export/_trace.py", line 1163, in wrapper
    raise e
  File "/home/chzhu/infinitrain/build/infinitrain/environments/development-venv/lib/python3.10/site-packages/torch/export/_trace.py", line 1129, in wrapper
    ep = fn(*args, **kwargs)
  File "/home/chzhu/infinitrain/build/infinitrain/environments/development-venv/lib/python3.10/site-packages/torch/export/exported_program.py", line 124, in wrapper
    return fn(*args, **kwargs)
  File "/home/chzhu/infinitrain/build/infinitrain/environments/development-venv/lib/python3.10/site-packages/torch/export/_trace.py", line 2071, in _export_for_training
    export_artifact = export_func(
  File "/home/chzhu/infinitrain/build/infinitrain/environments/development-venv/lib/python3.10/site-packages/torch/export/_trace.py", line 1415, in _strict_export
    gm_torch_level = _export_to_torch_ir(
  File "/home/chzhu/infinitrain/build/infinitrain/environments/development-venv/lib/python3.10/site-packages/torch/export/_trace.py", line 785, in _export_to_torch_ir
    _check_dynamic_shapes(combined_args, dynamic_shapes)
  File "/home/chzhu/infinitrain/build/infinitrain/environments/development-venv/lib/python3.10/site-packages/torch/export/dynamic_shapes.py", line 1031, in _check_dynamic_shapes
    _tree_map_with_path(check_shape, combined_args, dynamic_shapes, tree_name="inputs")
  File "/home/chzhu/infinitrain/build/infinitrain/environments/development-venv/lib/python3.10/site-packages/torch/export/dynamic_shapes.py", line 686, in _tree_map_with_path
    _compare(tree_spec, other_tree_spec, [])
  File "/home/chzhu/infinitrain/build/infinitrain/environments/development-venv/lib/python3.10/site-packages/torch/export/dynamic_shapes.py", line 677, in _compare
    _compare(
  File "/home/chzhu/infinitrain/build/infinitrain/environments/development-venv/lib/python3.10/site-packages/torch/export/dynamic_shapes.py", line 652, in _compare
    raise_mismatch_error(
  File "/home/chzhu/infinitrain/build/infinitrain/environments/development-venv/lib/python3.10/site-packages/torch/export/dynamic_shapes.py", line 634, in raise_mismatch_error
    raise UserError(
torch._dynamo.exc.UserError: Detected mismatch between the structure of `inputs` and `dynamic_shapes`: `inputs[0]` is a <class 'tuple'>, but `dynamic_shapes[0]` is a <class 'dict'>
For more information about this error, see: https://pytorch.org/docs/main/generated/exportdb/index.html#dynamic-shapes-validation

The error above occurred when calling torch.export.export. If you would like to view some more information about this error, and get a list of all other errors that may occur in your export call, you can replace your `export()` call with `draft_export()`.

Running test_export_compiled_model_with_kwargs_dynamic_shapes...
✗ test_export_compiled_model_with_kwargs_dynamic_shapes FAILED
Error: When `dynamic_shapes` is specified as a dict, its top-level keys must be the arg names ['kwargs'] of `inputs`, but here they are ['a1', 'a2']. Since here `inputs` is a list/tuple enclosing a single dict, maybe you just forgot to enclose `dynamic_shapes` in a list/tuple?
For more information about this error, see: https://pytorch.org/docs/main/generated/exportdb/index.html#dynamic-shapes-validation

The error above occurred when calling torch.export.export. If you would like to view some more information about this error, and get a list of all other errors that may occur in your export call, you can replace your `export()` call with `draft_export()`.
Traceback (most recent call last):
  File "/home/chzhu/infinitrain/test_exprot.py", line 62, in test_export_compiled_model_with_kwargs_dynamic_shapes
    ep = export(compiled_m, example_args, kwargs=example_kwargs, dynamic_shapes=dynamic_shapes, strict=True)
  File "/home/chzhu/infinitrain/build/infinitrain/environments/development-venv/lib/python3.10/site-packages/torch/export/__init__.py", line 311, in export
    raise e
  File "/home/chzhu/infinitrain/build/infinitrain/environments/development-venv/lib/python3.10/site-packages/torch/export/__init__.py", line 277, in export
    return _export(
  File "/home/chzhu/infinitrain/build/infinitrain/environments/development-venv/lib/python3.10/site-packages/torch/export/_trace.py", line 1163, in wrapper
    raise e
  File "/home/chzhu/infinitrain/build/infinitrain/environments/development-venv/lib/python3.10/site-packages/torch/export/_trace.py", line 1129, in wrapper
    ep = fn(*args, **kwargs)
  File "/home/chzhu/infinitrain/build/infinitrain/environments/development-venv/lib/python3.10/site-packages/torch/export/exported_program.py", line 124, in wrapper
    return fn(*args, **kwargs)
  File "/home/chzhu/infinitrain/build/infinitrain/environments/development-venv/lib/python3.10/site-packages/torch/export/_trace.py", line 2255, in _export
    ep = _export_for_training(
  File "/home/chzhu/infinitrain/build/infinitrain/environments/development-venv/lib/python3.10/site-packages/torch/export/_trace.py", line 1163, in wrapper
    raise e
  File "/home/chzhu/infinitrain/build/infinitrain/environments/development-venv/lib/python3.10/site-packages/torch/export/_trace.py", line 1129, in wrapper
    ep = fn(*args, **kwargs)
  File "/home/chzhu/infinitrain/build/infinitrain/environments/development-venv/lib/python3.10/site-packages/torch/export/exported_program.py", line 124, in wrapper
    return fn(*args, **kwargs)
  File "/home/chzhu/infinitrain/build/infinitrain/environments/development-venv/lib/python3.10/site-packages/torch/export/_trace.py", line 2071, in _export_for_training
    export_artifact = export_func(
  File "/home/chzhu/infinitrain/build/infinitrain/environments/development-venv/lib/python3.10/site-packages/torch/export/_trace.py", line 1415, in _strict_export
    gm_torch_level = _export_to_torch_ir(
  File "/home/chzhu/infinitrain/build/infinitrain/environments/development-venv/lib/python3.10/site-packages/torch/export/_trace.py", line 785, in _export_to_torch_ir
    _check_dynamic_shapes(combined_args, dynamic_shapes)
  File "/home/chzhu/infinitrain/build/infinitrain/environments/development-venv/lib/python3.10/site-packages/torch/export/dynamic_shapes.py", line 1007, in _check_dynamic_shapes
    raise UserError(
torch._dynamo.exc.UserError: When `dynamic_shapes` is specified as a dict, its top-level keys must be the arg names ['kwargs'] of `inputs`, but here they are ['a1', 'a2']. Since here `inputs` is a list/tuple enclosing a single dict, maybe you just forgot to enclose `dynamic_shapes` in a list/tuple?
For more information about this error, see: https://pytorch.org/docs/main/generated/exportdb/index.html#dynamic-shapes-validation

The error above occurred when calling torch.export.export. If you would like to view some more information about this error, and get a list of all other errors that may occur in your export call, you can replace your `export()` call with `draft_export()`.

======================================================================
```
## Torch Version
(reproducible nightly version)

## Other Behavior
The model exports normally when we test without compiling it:
```python
import torch
from torch.export import export, Dim

def test_export_compiled_model_with_nested_dynamic_shapes():
   """Test exporting a compiled model with nested dict inputs and dynamic shapes."""
   print("Running test_export_compiled_model_with_nested_dynamic_shapes...")

   class M(torch.nn.Module):
       def forward(self, data_batch):
           return data_batch["a1"] + data_batch["a2"]

   m = M()
   example_args = ({
       "a1": torch.ones(3, 3),
       "a2": torch.ones(3, 3),
   },)
   dynamic_shapes = ({
       "a1": {0: Dim.DYNAMIC},
       "a2": {0: Dim.DYNAMIC},
   },)

   try:
       ep = export(m, example_args, dynamic_shapes=dynamic_shapes, strict=True)
       gm = ep.module()
       result_exported = gm(*example_args)
       result_compiled = m(*example_args)

       assert torch.allclose(result_exported, result_compiled), "Results don't match!"
       print("✓ test_export_compiled_model_with_nested_dynamic_shapes PASSED")
       return True
   except Exception as e:
       print(f"✗ test_export_compiled_model_with_nested_dynamic_shapes FAILED")
       print(f"Error: {e}")
       import traceback
       traceback.print_exc()
       return False

def test_export_compiled_model_with_kwargs_dynamic_shapes():
   """Test exporting a compiled model with kwargs and dynamic shapes."""
   print("\nRunning test_export_compiled_model_with_kwargs_dynamic_shapes...")

   class M(torch.nn.Module):
       def forward(self, a1, a2):
           return a1 + a2

   m = M()
   example_args = ()
   example_kwargs = {
       "a1": torch.ones(3, 3),
       "a2": torch.ones(3, 3),
   }
   dynamic_shapes = {
       "a1": {0: Dim.DYNAMIC},
       "a2": {0: Dim.DYNAMIC},
   }

   try:
       ep = export(m, example_args, kwargs=example_kwargs, dynamic_shapes=dynamic_shapes, strict=True)
       gm = ep.module()
       result_exported = gm(**example_kwargs)
       result_compiled = m(**example_kwargs)

       assert torch.allclose(result_exported, result_compiled), "Results don't match!"
       print("✓ test_export_compiled_model_with_kwargs_dynamic_shapes PASSED")
       return True
   except Exception as e:
       print(f"✗ test_export_compiled_model_with_kwargs_dynamic_shapes FAILED")
       print(f"Error: {e}")
       import traceback
       traceback.print_exc()
       return False

if __name__ == "__main__":
   print("Testing export of compiled models with dynamic shapes\n")
   print("=" * 70)

   results = []
   results.append(test_export_compiled_model_with_nested_dynamic_shapes())
   results.append(test_export_compiled_model_with_kwargs_dynamic_shapes())

   print("\n" + "=" * 70)
   print(f"\nResults: {sum(results)}/{len(results)} tests passed")

   if all(results):
       print("✓ All tests passed!")
   else:
       print("✗ Some tests failed")
       exit(1)

```
## Root Cause

This is a side effect of torch.compile(model). When the model is compiled, its input signature automatically becomes (*args, **kwargs). In the example above, `data_batch` is added to `args` in combined_args [here](dc011d3203/torch/export/dynamic_shapes.py (L720)), which then looks like
```
{'args': ({'a1': tensor([[1., 1., 1.]... 1., 1.]]), 'a2': tensor([[1., 1., 1.]... 1., 1.]])},)}
```
Without compiling, the combined args look like
```
{'data_batch': {'a1': tensor([[1., 1., 1.],
        [1., 1., 1.],
        [1., 1., 1.]]), 'a2': tensor([[1., 1., 1.],
        [1., 1., 1.],
        [1., 1., 1.]])}}

```
This causes a mismatch when we use treemap to match the dynamic shapes with the input args.

The error is also reproducible when we set up kwargs as example args (see the 2nd test above).
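
The signature effect can be illustrated without torch. The following is a minimal sketch that relies only on `inspect.signature(...).bind(...)` from the standard library. `forward`, `wrapped`, and `combine_args` are hypothetical stand-ins introduced for illustration, and `combine_args` is an assumption about how the combined args end up keyed, not a copy of the code in `dynamic_shapes.py`.

```python
import inspect


def forward(data_batch):
    # Stand-in for the uncompiled M.forward(self, data_batch).
    return data_batch["a1"] + data_batch["a2"]


def wrapped(*args, **kwargs):
    # Stand-in for a wrapper whose visible signature is (*args, **kwargs).
    return forward(*args, **kwargs)


def combine_args(fn, args, kwargs=None):
    # Hypothetical helper: bind the example inputs to fn's signature and
    # return the resulting {parameter name: value} mapping.
    bound = inspect.signature(fn).bind(*args, **(kwargs or {}))
    return dict(bound.arguments)


batch = {"a1": 1, "a2": 2}
plain = combine_args(forward, (batch,))    # keyed by the real parameter name
nested = combine_args(wrapped, (batch,))   # keyed by the variadic name "args"
kw_only = combine_args(wrapped, (), batch)  # keyed by the variadic name "kwargs"

print(list(plain), list(nested), list(kw_only))
```

Under this keying, a `dynamic_shapes` dict whose top-level keys are `a1` and `a2` has no counterpart in the wrapped case, which is consistent with the `['kwargs']` versus `['a1', 'a2']` mismatch reported in the error message above.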
## Fix
Proposed fix: In [_combine_args](dc011d3203/torch/export/dynamic_shapes.py (L720)) we explicitly flatten out the kwargs and args into combined args.
## Side Effects
There are 2 existing tests that assume this behavior and either
1. add `args` explicitly to dynamic shapes, or
2. wrap args into a nested format in dynamic_shapes.

I have modified those tests so that args and dynamic_shapes use a consistent format.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/166358
Approved by: https://github.com/angelayi
2025-11-11 19:04:58 +00:00
bd99ae3315 [Docs] Add warning that torch.export.load uses pickle (#167557)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/167557
Approved by: https://github.com/zhxchen17, https://github.com/angelayi
2025-11-11 18:47:14 +00:00
ce8672c24f Fix use of TORCH_CHECK in torch/csrc/stable (#167495)
Tested by above PR

Pull Request resolved: https://github.com/pytorch/pytorch/pull/167495
Approved by: https://github.com/janeyx99
ghstack dependencies: #166579, #166694, #166695, #167362
2025-11-11 17:58:30 +00:00
402c465030 [ARM] Improve LLM performance & mem usage using int4-bf16 KleidiAI kernels (#158250)
Co-authored-by: Nikhil Gupta [nikhil.gupta2@arm.com](mailto:nikhil.gupta2@arm.com)

This PR enables the use of KleidiAI INT4 kernels that directly produce BF16 outputs within PyTorch to boost LLM prefill & decode performance

**This change improves decode throughput by ~15% & reduces the memory required for model inference by 50%**

### Benchmark Setup
```
Model: meta-llama/Llama-3.1-8B
Test Platform: Neoverse V2
```
### Detailed Results

| Metric                           | With `--compile`         | Without `--compile`      |
|----------------------------------|---------------------------|---------------------------|
| Quantization Scheme              | INT4 symmetric channelwise | INT4 symmetric channelwise |
| Input Precision                  | BF16                      | BF16                      |
| Number of Layers Quantized       | 32                        | 32                        |
| Average Compression Ratio        | 87.49%                    | 87.49%                    |
| Total Quantization Time (s)      | 9.62                      | 10.32                     |
| Compile Time (First) (s)         | 134.48                    | 1.69                      |
| Compile Time (Second) (s)        | 80.44                     | 1.60                      |
| Compile Time (Subsequent) (s)    | 0.19                      | 0.22                      |
| Prefill Tokens                   | 54                        | 54                        |
| Decoded Tokens                   | 33                        | 33                        |
| Prefill Time (s)                 | 0.19                      | 0.22                      |
| Decode Time (s)                  | 0.76                      | 1.38                      |
| E2E Generation Time (s)          | 0.95                      | 1.60                      |
| Prefill Throughput (tokens/s)    | 288.13                    | 249.91                    |
| Decode Throughput (tokens/s)     | 43.42                     | 23.83                     |
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158250
Approved by: https://github.com/malfet, https://github.com/aditew01, https://github.com/fadara01

Co-authored-by: Nikhil Gupta <nikhil.gupta2@arm.com>
Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2025-11-11 17:50:22 +00:00
573a79fffa [OpenReg] Initialize device stream states for all devices in initOpenRegStreamsOnce (#167528)
Fixes #167527

Pull Request resolved: https://github.com/pytorch/pytorch/pull/167528
Approved by: https://github.com/fffrog
2025-11-11 16:53:22 +00:00
4945180468 Add empty tensor check for _pad_packed_sequence (#167521)
That prevents null pointer dereference

Fixes https://github.com/pytorch/pytorch/issues/149622
Pull Request resolved: https://github.com/pytorch/pytorch/pull/167521
Approved by: https://github.com/albanD
2025-11-11 16:46:13 +00:00
1df723e6f5 [inductor] Fix constant creation (#167398)
We ran into this issue when debugging inductor-lite. Calling `torch.tensor` within a fake mode (which is the case inside of inductor) will create a FakeTensor, which causes this FakeTensor to be used as a constant within inductor.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/167398
Approved by: https://github.com/eellison, https://github.com/BoyuanFeng
2025-11-11 16:30:46 +00:00
f9b81e23e4 [ROCm] Disable group gemm CK path when composable kernel (CK) is not enabled (#167403)
For ROCm builds without CK support, ensure use_fast_path is false so that the CK path is not triggered, since CK is currently not available in this configuration.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/167403
Approved by: https://github.com/Skylion007, https://github.com/ScottTodd, https://github.com/jeffdaily
2025-11-11 16:15:51 +00:00
296 changed files with 10757 additions and 4451 deletions

@ -1,19 +0,0 @@
# Aarch64 (ARM/Graviton) Support Scripts
Scripts for building aarch64 PyTorch PIP Wheels. These scripts build the following wheels:
* torch
* torchvision
* torchaudio
* torchtext
* torchdata
## Aarch64_ci_build.sh
This script is designed to support CD operations within the PyPi manylinux aarch64 container, and to be executed in the container. It prepares the container and then executes __aarch64_wheel_ci_build.py__ to build the wheels. The script "assumes" the PyTorch repo is located at: ```/pytorch``` and will put the wheels into ```/artifacts```.
### Usage
```DESIRED_PYTHON=<PythonVersion> aarch64_ci_build.sh```
__NOTE:__ CI build is currently __EXPERIMENTAL__
## Build_aarch64_wheel.py
This app allows a person to build using AWS EC2 resources and requires AWS-CLI and Boto3 with AWS credentials to support building EC2 instances for the wheel builds. It can be used in a codebuild CD or from a local system.
### Usage
```build_aarch64_wheel.py --key-name <YourPemKey> --use-docker --python 3.8 --branch <RCtag>```

@ -1,53 +0,0 @@
#!/bin/bash
set -eux -o pipefail

GPU_ARCH_VERSION=${GPU_ARCH_VERSION:-}

# Set CUDA architecture lists to match x86 build_cuda.sh
if [[ "$GPU_ARCH_VERSION" == *"12.6"* ]]; then
    export TORCH_CUDA_ARCH_LIST="8.0;9.0"
elif [[ "$GPU_ARCH_VERSION" == *"12.8"* ]]; then
    export TORCH_CUDA_ARCH_LIST="8.0;9.0;10.0;12.0"
elif [[ "$GPU_ARCH_VERSION" == *"12.9"* ]]; then
    export TORCH_CUDA_ARCH_LIST="8.0;9.0;10.0;12.0"
elif [[ "$GPU_ARCH_VERSION" == *"13.0"* ]]; then
    export TORCH_CUDA_ARCH_LIST="8.0;9.0;10.0;11.0;12.0+PTX"
fi

# Compress the fatbin with -compress-mode=size for CUDA 13
if [[ "$DESIRED_CUDA" == *"13"* ]]; then
    export TORCH_NVCC_FLAGS="-compress-mode=size"
    # Bundle ptxas into the cu13 wheel, see https://github.com/pytorch/pytorch/issues/163801
    export BUILD_BUNDLE_PTXAS=1
fi

SCRIPTPATH="$( cd -- "$(dirname "$0")" >/dev/null 2>&1 ; pwd -P )"
source $SCRIPTPATH/aarch64_ci_setup.sh

###############################################################################
# Run aarch64 builder python
###############################################################################
cd /
# adding safe directory for git as the permissions will be
# on the mounted pytorch repo
git config --global --add safe.directory /pytorch
pip install -r /pytorch/requirements.txt
pip install auditwheel==6.2.0 wheel
if [ "$DESIRED_CUDA" = "cpu" ]; then
    echo "BASE_CUDA_VERSION is not set. Building cpu wheel."
    python /pytorch/.ci/aarch64_linux/aarch64_wheel_ci_build.py --enable-mkldnn
else
    echo "BASE_CUDA_VERSION is set to: $DESIRED_CUDA"
    export USE_SYSTEM_NCCL=1

    # Check if we should use NVIDIA libs from PyPI (similar to x86 build_cuda.sh logic)
    if [[ -z "$PYTORCH_EXTRA_INSTALL_REQUIREMENTS" ]]; then
        echo "Bundling CUDA libraries with wheel for aarch64."
    else
        echo "Using nvidia libs from pypi for aarch64."
        echo "Updated PYTORCH_EXTRA_INSTALL_REQUIREMENTS for aarch64: $PYTORCH_EXTRA_INSTALL_REQUIREMENTS"
        export USE_NVIDIA_PYPI_LIBS=1
    fi

    python /pytorch/.ci/aarch64_linux/aarch64_wheel_ci_build.py --enable-mkldnn --enable-cuda
fi

@ -1,21 +0,0 @@
#!/bin/bash
set -eux -o pipefail

# This script is used to prepare the Docker container for aarch64_ci_wheel_build.py python script
# By creating symlinks from desired /opt/python to /usr/local/bin/

NUMPY_VERSION=2.0.2
if [[ "$DESIRED_PYTHON" == "3.13" || "$DESIRED_PYTHON" == "3.13t" ]]; then
    NUMPY_VERSION=2.1.2
fi

SCRIPTPATH="$( cd "$(dirname "$0")" ; pwd -P )"
source $SCRIPTPATH/../manywheel/set_desired_python.sh

pip install -q numpy==${NUMPY_VERSION} pyyaml==6.0.2 scons==4.7.0 ninja==1.11.1 patchelf==0.17.2

for tool in python python3 pip pip3 ninja scons patchelf; do
    ln -sf ${DESIRED_PYTHON_BIN_DIR}/${tool} /usr/local/bin;
done

python --version

@ -1,333 +0,0 @@
#!/usr/bin/env python3
# encoding: UTF-8

import os
import shutil
from subprocess import check_call, check_output


def list_dir(path: str) -> list[str]:
    """'
    Helper for getting paths for Python
    """
    return check_output(["ls", "-1", path]).decode().split("\n")


def replace_tag(filename) -> None:
    with open(filename) as f:
        lines = f.readlines()
    for i, line in enumerate(lines):
        if line.startswith("Tag:"):
            lines[i] = line.replace("-linux_", "-manylinux_2_28_")
            print(f"Updated tag from {line} to {lines[i]}")
            break

    with open(filename, "w") as f:
        f.writelines(lines)


def patch_library_rpath(
    folder: str,
    lib_name: str,
    use_nvidia_pypi_libs: bool = False,
    desired_cuda: str = "",
) -> None:
    """Apply patchelf to set RPATH for a library in torch/lib"""
    lib_path = f"{folder}/tmp/torch/lib/{lib_name}"

    if use_nvidia_pypi_libs:
        # For PyPI NVIDIA libraries, construct CUDA RPATH
        cuda_rpaths = [
            "$ORIGIN/../../nvidia/cudnn/lib",
            "$ORIGIN/../../nvidia/nvshmem/lib",
            "$ORIGIN/../../nvidia/nccl/lib",
            "$ORIGIN/../../nvidia/cusparselt/lib",
        ]

        if "130" in desired_cuda:
            cuda_rpaths.append("$ORIGIN/../../nvidia/cu13/lib")
        else:
            cuda_rpaths.extend(
                [
                    "$ORIGIN/../../nvidia/cublas/lib",
                    "$ORIGIN/../../nvidia/cuda_cupti/lib",
                    "$ORIGIN/../../nvidia/cuda_nvrtc/lib",
                    "$ORIGIN/../../nvidia/cuda_runtime/lib",
                    "$ORIGIN/../../nvidia/cufft/lib",
                    "$ORIGIN/../../nvidia/curand/lib",
                    "$ORIGIN/../../nvidia/cusolver/lib",
                    "$ORIGIN/../../nvidia/cusparse/lib",
                    "$ORIGIN/../../nvidia/nvtx/lib",
                    "$ORIGIN/../../nvidia/cufile/lib",
                ]
            )

        # Add $ORIGIN for local torch libs
        rpath = ":".join(cuda_rpaths) + ":$ORIGIN"
    else:
        # For bundled libraries, just use $ORIGIN
        rpath = "$ORIGIN"

    if os.path.exists(lib_path):
        os.system(
            f"cd {folder}/tmp/torch/lib/; "
            f"patchelf --set-rpath '{rpath}' --force-rpath {lib_name}"
        )


def copy_and_patch_library(
    src_path: str,
    folder: str,
    use_nvidia_pypi_libs: bool = False,
    desired_cuda: str = "",
) -> None:
    """Copy a library to torch/lib and patch its RPATH"""
    if os.path.exists(src_path):
        lib_name = os.path.basename(src_path)
        shutil.copy2(src_path, f"{folder}/tmp/torch/lib/{lib_name}")
        patch_library_rpath(folder, lib_name, use_nvidia_pypi_libs, desired_cuda)


def package_cuda_wheel(wheel_path, desired_cuda) -> None:
    """
    Package the cuda wheel libraries
    """
    folder = os.path.dirname(wheel_path)
    os.mkdir(f"{folder}/tmp")
    os.system(f"unzip {wheel_path} -d {folder}/tmp")
    # Delete original wheel since it will be repackaged
    os.system(f"rm {wheel_path}")

    # Check if we should use PyPI NVIDIA libraries or bundle system libraries
    use_nvidia_pypi_libs = os.getenv("USE_NVIDIA_PYPI_LIBS", "0") == "1"

    if use_nvidia_pypi_libs:
        print("Using nvidia libs from pypi - skipping CUDA library bundling")
        # For PyPI approach, we don't bundle CUDA libraries - they come from PyPI packages
        # We only need to bundle non-NVIDIA libraries
        minimal_libs_to_copy = [
            "/lib64/libgomp.so.1",
            "/usr/lib64/libgfortran.so.5",
            "/acl/build/libarm_compute.so",
            "/acl/build/libarm_compute_graph.so",
            "/usr/local/lib/libnvpl_lapack_lp64_gomp.so.0",
            "/usr/local/lib/libnvpl_blas_lp64_gomp.so.0",
            "/usr/local/lib/libnvpl_lapack_core.so.0",
            "/usr/local/lib/libnvpl_blas_core.so.0",
        ]

        # Copy minimal libraries to unzipped_folder/torch/lib
        for lib_path in minimal_libs_to_copy:
            copy_and_patch_library(lib_path, folder, use_nvidia_pypi_libs, desired_cuda)

        # Patch torch libraries used for searching libraries
        torch_libs_to_patch = [
            "libtorch.so",
            "libtorch_cpu.so",
            "libtorch_cuda.so",
            "libtorch_cuda_linalg.so",
            "libtorch_global_deps.so",
            "libtorch_python.so",
            "libtorch_nvshmem.so",
            "libc10.so",
            "libc10_cuda.so",
            "libcaffe2_nvrtc.so",
            "libshm.so",
        ]
        for lib_name in torch_libs_to_patch:
            patch_library_rpath(folder, lib_name, use_nvidia_pypi_libs, desired_cuda)
    else:
        print("Bundling CUDA libraries with wheel")
        # Original logic for bundling system CUDA libraries
        # Common libraries for all CUDA versions
        common_libs = [
            # Non-NVIDIA system libraries
            "/lib64/libgomp.so.1",
            "/usr/lib64/libgfortran.so.5",
            "/acl/build/libarm_compute.so",
            "/acl/build/libarm_compute_graph.so",
            # Common CUDA libraries (same for all versions)
            "/usr/local/lib/libnvpl_lapack_lp64_gomp.so.0",
            "/usr/local/lib/libnvpl_blas_lp64_gomp.so.0",
            "/usr/local/lib/libnvpl_lapack_core.so.0",
            "/usr/local/lib/libnvpl_blas_core.so.0",
            "/usr/local/cuda/extras/CUPTI/lib64/libnvperf_host.so",
            "/usr/local/cuda/lib64/libcudnn.so.9",
            "/usr/local/cuda/lib64/libcusparseLt.so.0",
            "/usr/local/cuda/lib64/libcurand.so.10",
            "/usr/local/cuda/lib64/libnccl.so.2",
            "/usr/local/cuda/lib64/libnvshmem_host.so.3",
            "/usr/local/cuda/lib64/libcudnn_adv.so.9",
            "/usr/local/cuda/lib64/libcudnn_cnn.so.9",
            "/usr/local/cuda/lib64/libcudnn_graph.so.9",
            "/usr/local/cuda/lib64/libcudnn_ops.so.9",
            "/usr/local/cuda/lib64/libcudnn_engines_runtime_compiled.so.9",
            "/usr/local/cuda/lib64/libcudnn_engines_precompiled.so.9",
            "/usr/local/cuda/lib64/libcudnn_heuristic.so.9",
            "/usr/local/cuda/lib64/libcufile.so.0",
            "/usr/local/cuda/lib64/libcufile_rdma.so.1",
            "/usr/local/cuda/lib64/libcusparse.so.12",
        ]

        # CUDA version-specific libraries
        if "13" in desired_cuda:
            minor_version = desired_cuda[-1]
            version_specific_libs = [
                "/usr/local/cuda/extras/CUPTI/lib64/libcupti.so.13",
                "/usr/local/cuda/lib64/libcublas.so.13",
                "/usr/local/cuda/lib64/libcublasLt.so.13",
                "/usr/local/cuda/lib64/libcudart.so.13",
                "/usr/local/cuda/lib64/libcufft.so.12",
                "/usr/local/cuda/lib64/libcusolver.so.12",
                "/usr/local/cuda/lib64/libnvJitLink.so.13",
                "/usr/local/cuda/lib64/libnvrtc.so.13",
                f"/usr/local/cuda/lib64/libnvrtc-builtins.so.13.{minor_version}",
            ]
        elif "12" in desired_cuda:
            # Get the last character for libnvrtc-builtins version (e.g., "129" -> "9")
            minor_version = desired_cuda[-1]
            version_specific_libs = [
                "/usr/local/cuda/extras/CUPTI/lib64/libcupti.so.12",
                "/usr/local/cuda/lib64/libcublas.so.12",
                "/usr/local/cuda/lib64/libcublasLt.so.12",
                "/usr/local/cuda/lib64/libcudart.so.12",
                "/usr/local/cuda/lib64/libcufft.so.11",
                "/usr/local/cuda/lib64/libcusolver.so.11",
                "/usr/local/cuda/lib64/libnvJitLink.so.12",
                "/usr/local/cuda/lib64/libnvrtc.so.12",
                f"/usr/local/cuda/lib64/libnvrtc-builtins.so.12.{minor_version}",
            ]
        else:
            raise ValueError(f"Unsupported CUDA version: {desired_cuda}.")

        # Combine all libraries
        libs_to_copy = common_libs + version_specific_libs

        # Copy libraries to unzipped_folder/torch/lib
        for lib_path in libs_to_copy:
            copy_and_patch_library(lib_path, folder, use_nvidia_pypi_libs, desired_cuda)

    # Make sure the wheel is tagged with manylinux_2_28
    for f in os.scandir(f"{folder}/tmp/"):
        if f.is_dir() and f.name.endswith(".dist-info"):
            replace_tag(f"{f.path}/WHEEL")
            break

    os.system(f"wheel pack {folder}/tmp/ -d {folder}")
    os.system(f"rm -rf {folder}/tmp/")


def complete_wheel(folder: str) -> str:
    """
    Complete wheel build and put in artifact location
    """
    wheel_name = list_dir(f"/{folder}/dist")[0]

    # Please note for cuda we don't run auditwheel since we use custom script to package
    # the cuda dependencies to the wheel file using update_wheel() method.
    # However we need to make sure filename reflects the correct Manylinux platform.
    if "pytorch" in folder and not enable_cuda:
        print("Repairing Wheel with AuditWheel")
        check_call(["auditwheel", "repair", f"dist/{wheel_name}"], cwd=folder)
        repaired_wheel_name = list_dir(f"/{folder}/wheelhouse")[0]

        print(f"Moving {repaired_wheel_name} wheel to /{folder}/dist")
        os.rename(
            f"/{folder}/wheelhouse/{repaired_wheel_name}",
            f"/{folder}/dist/{repaired_wheel_name}",
        )
    else:
        repaired_wheel_name = list_dir(f"/{folder}/dist")[0]

    print(f"Copying {repaired_wheel_name} to artifacts")
    shutil.copy2(
        f"/{folder}/dist/{repaired_wheel_name}", f"/artifacts/{repaired_wheel_name}"
    )

    return repaired_wheel_name


def parse_arguments():
    """
    Parse inline arguments
    """
    from argparse import ArgumentParser

    parser = ArgumentParser("AARCH64 wheels python CD")
    parser.add_argument("--debug", action="store_true")
    parser.add_argument("--build-only", action="store_true")
    parser.add_argument("--test-only", type=str)
    parser.add_argument("--enable-mkldnn", action="store_true")
    parser.add_argument("--enable-cuda", action="store_true")
    return parser.parse_args()


if __name__ == "__main__":
    """
    Entry Point
    """
    args = parse_arguments()
    enable_mkldnn = args.enable_mkldnn
    enable_cuda = args.enable_cuda
    branch = check_output(
        ["git", "rev-parse", "--abbrev-ref", "HEAD"], cwd="/pytorch"
    ).decode()

    print("Building PyTorch wheel")
    build_vars = ""
    # MAX_JOB=5 is not required for CPU backend (see commit 465d98b)
    if enable_cuda:
        build_vars += "MAX_JOBS=5 "

        # Handle PyPI NVIDIA libraries vs bundled libraries
        use_nvidia_pypi_libs = os.getenv("USE_NVIDIA_PYPI_LIBS", "0") == "1"
        if use_nvidia_pypi_libs:
            print("Configuring build for PyPI NVIDIA libraries")
            # Configure for dynamic linking (matching x86 logic)
            build_vars += "ATEN_STATIC_CUDA=0 USE_CUDA_STATIC_LINK=0 USE_CUPTI_SO=1 "
        else:
            print("Configuring build for bundled NVIDIA libraries")
            # Keep existing static linking approach - already configured above

    override_package_version = os.getenv("OVERRIDE_PACKAGE_VERSION")
    desired_cuda = os.getenv("DESIRED_CUDA")
    if override_package_version is not None:
        version = override_package_version
        build_vars += (
            f"BUILD_TEST=0 PYTORCH_BUILD_VERSION={version} PYTORCH_BUILD_NUMBER=1 "
        )
    elif branch in ["nightly", "main"]:
        build_date = (
            check_output(["git", "log", "--pretty=format:%cs", "-1"], cwd="/pytorch")
            .decode()
            .replace("-", "")
        )
        version = (
            check_output(["cat", "version.txt"], cwd="/pytorch").decode().strip()[:-2]
        )
        if enable_cuda:
            build_vars += f"BUILD_TEST=0 PYTORCH_BUILD_VERSION={version}.dev{build_date}+{desired_cuda} PYTORCH_BUILD_NUMBER=1 "
        else:
            build_vars += f"BUILD_TEST=0 PYTORCH_BUILD_VERSION={version}.dev{build_date} PYTORCH_BUILD_NUMBER=1 "
    elif branch.startswith(("v1.", "v2.")):
        build_vars += f"BUILD_TEST=0 PYTORCH_BUILD_VERSION={branch[1 : branch.find('-')]} PYTORCH_BUILD_NUMBER=1 "

    if enable_mkldnn:
        print("build pytorch with mkldnn+acl backend")
        build_vars += "USE_MKLDNN=ON USE_MKLDNN_ACL=ON "
        build_vars += "ACL_ROOT_DIR=/acl "
        if enable_cuda:
            build_vars += "BLAS=NVPL "
        else:
            build_vars += "BLAS=OpenBLAS OpenBLAS_HOME=/opt/OpenBLAS "
    else:
        print("build pytorch without mkldnn backend")

    os.system(f"cd /pytorch; {build_vars} python3 -m build --wheel --no-isolation")
    if enable_cuda:
        print("Updating Cuda Dependency")
        filename = os.listdir("/pytorch/dist/")
        wheel_path = f"/pytorch/dist/{filename[0]}"
        package_cuda_wheel(wheel_path, desired_cuda)
    pytorch_wheel_name = complete_wheel("/pytorch/")
    print(f"Build Complete. Created {pytorch_wheel_name}..")

@ -1,999 +0,0 @@
#!/usr/bin/env python3
# This script is for building AARCH64 wheels using AWS EC2 instances.
# To generate binaries for the release follow these steps:
# 1. Update mappings for each of the Domain Libraries by adding new row to a table like this:
# "v1.11.0": ("0.11.0", "rc1"),
# 2. Run script with following arguments for each of the supported python versions and required tag, for example:
# build_aarch64_wheel.py --key-name <YourPemKey> --use-docker --python 3.8 --branch v1.11.0-rc3
import os
import subprocess
import sys
import time
from typing import Optional, Union
import boto3
# AMI images for us-east-1, change the following based on your ~/.aws/config
os_amis = {
"ubuntu20_04": "ami-052eac90edaa9d08f", # login_name: ubuntu
"ubuntu22_04": "ami-0c6c29c5125214c77", # login_name: ubuntu
"redhat8": "ami-0698b90665a2ddcf1", # login_name: ec2-user
}
ubuntu20_04_ami = os_amis["ubuntu20_04"]
def compute_keyfile_path(key_name: Optional[str] = None) -> tuple[str, str]:
if key_name is None:
key_name = os.getenv("AWS_KEY_NAME")
if key_name is None:
return os.getenv("SSH_KEY_PATH", ""), ""
homedir_path = os.path.expanduser("~")
default_path = os.path.join(homedir_path, ".ssh", f"{key_name}.pem")
return os.getenv("SSH_KEY_PATH", default_path), key_name
ec2 = boto3.resource("ec2")
def ec2_get_instances(filter_name, filter_value):
return ec2.instances.filter(
Filters=[{"Name": filter_name, "Values": [filter_value]}]
)
def ec2_instances_of_type(instance_type="t4g.2xlarge"):
return ec2_get_instances("instance-type", instance_type)
def ec2_instances_by_id(instance_id):
rc = list(ec2_get_instances("instance-id", instance_id))
return rc[0] if len(rc) > 0 else None
def start_instance(
key_name, ami=ubuntu20_04_ami, instance_type="t4g.2xlarge", ebs_size: int = 50
):
inst = ec2.create_instances(
ImageId=ami,
InstanceType=instance_type,
SecurityGroups=["ssh-allworld"],
KeyName=key_name,
MinCount=1,
MaxCount=1,
BlockDeviceMappings=[
{
"DeviceName": "/dev/sda1",
"Ebs": {
"DeleteOnTermination": True,
"VolumeSize": ebs_size,
"VolumeType": "standard",
},
}
],
)[0]
print(f"Create instance {inst.id}")
inst.wait_until_running()
running_inst = ec2_instances_by_id(inst.id)
print(f"Instance started at {running_inst.public_dns_name}")
return running_inst
class RemoteHost:
addr: str
keyfile_path: str
login_name: str
container_id: Optional[str] = None
ami: Optional[str] = None
def __init__(self, addr: str, keyfile_path: str, login_name: str = "ubuntu"):
self.addr = addr
self.keyfile_path = keyfile_path
self.login_name = login_name
def _gen_ssh_prefix(self) -> list[str]:
return [
"ssh",
"-o",
"StrictHostKeyChecking=no",
"-i",
self.keyfile_path,
f"{self.login_name}@{self.addr}",
"--",
]
@staticmethod
def _split_cmd(args: Union[str, list[str]]) -> list[str]:
return args.split() if isinstance(args, str) else args
def run_ssh_cmd(self, args: Union[str, list[str]]) -> None:
subprocess.check_call(self._gen_ssh_prefix() + self._split_cmd(args))
def check_ssh_output(self, args: Union[str, list[str]]) -> str:
return subprocess.check_output(
self._gen_ssh_prefix() + self._split_cmd(args)
).decode("utf-8")
def scp_upload_file(self, local_file: str, remote_file: str) -> None:
subprocess.check_call(
[
"scp",
"-i",
self.keyfile_path,
local_file,
f"{self.login_name}@{self.addr}:{remote_file}",
]
)
def scp_download_file(
self, remote_file: str, local_file: Optional[str] = None
) -> None:
if local_file is None:
local_file = "."
subprocess.check_call(
[
"scp",
"-i",
self.keyfile_path,
f"{self.login_name}@{self.addr}:{remote_file}",
local_file,
]
)
def start_docker(self, image="quay.io/pypa/manylinux2014_aarch64:latest") -> None:
self.run_ssh_cmd("sudo apt-get install -y docker.io")
self.run_ssh_cmd(f"sudo usermod -a -G docker {self.login_name}")
self.run_ssh_cmd("sudo service docker start")
self.run_ssh_cmd(f"docker pull {image}")
self.container_id = self.check_ssh_output(
f"docker run -t -d -w /root {image}"
).strip()
def using_docker(self) -> bool:
return self.container_id is not None
def run_cmd(self, args: Union[str, list[str]]) -> None:
if not self.using_docker():
return self.run_ssh_cmd(args)
assert self.container_id is not None
docker_cmd = self._gen_ssh_prefix() + [
"docker",
"exec",
"-i",
self.container_id,
"bash",
]
p = subprocess.Popen(docker_cmd, stdin=subprocess.PIPE)
p.communicate(
input=" ".join(["source .bashrc && "] + self._split_cmd(args)).encode(
"utf-8"
)
)
rc = p.wait()
if rc != 0:
raise subprocess.CalledProcessError(rc, docker_cmd)
def check_output(self, args: Union[str, list[str]]) -> str:
if not self.using_docker():
return self.check_ssh_output(args)
assert self.container_id is not None
docker_cmd = self._gen_ssh_prefix() + [
"docker",
"exec",
"-i",
self.container_id,
"bash",
]
p = subprocess.Popen(docker_cmd, stdin=subprocess.PIPE, stdout=subprocess.PIPE)
(out, err) = p.communicate(
input=" ".join(["source .bashrc && "] + self._split_cmd(args)).encode(
"utf-8"
)
)
rc = p.wait()
if rc != 0:
raise subprocess.CalledProcessError(rc, docker_cmd, output=out, stderr=err)
return out.decode("utf-8")
def upload_file(self, local_file: str, remote_file: str) -> None:
if not self.using_docker():
return self.scp_upload_file(local_file, remote_file)
tmp_file = os.path.join("/tmp", os.path.basename(local_file))
self.scp_upload_file(local_file, tmp_file)
self.run_ssh_cmd(
["docker", "cp", tmp_file, f"{self.container_id}:/root/{remote_file}"]
)
self.run_ssh_cmd(["rm", tmp_file])
def download_file(self, remote_file: str, local_file: Optional[str] = None) -> None:
if not self.using_docker():
return self.scp_download_file(remote_file, local_file)
tmp_file = os.path.join("/tmp", os.path.basename(remote_file))
self.run_ssh_cmd(
["docker", "cp", f"{self.container_id}:/root/{remote_file}", tmp_file]
)
self.scp_download_file(tmp_file, local_file)
self.run_ssh_cmd(["rm", tmp_file])
def download_wheel(
self, remote_file: str, local_file: Optional[str] = None
) -> None:
if self.using_docker() and local_file is None:
basename = os.path.basename(remote_file)
local_file = basename.replace(
"-linux_aarch64.whl", "-manylinux2014_aarch64.whl"
)
self.download_file(remote_file, local_file)
def list_dir(self, path: str) -> list[str]:
return self.check_output(["ls", "-1", path]).split("\n")
def wait_for_connection(addr, port, timeout=15, attempt_cnt=5):
import socket
for i in range(attempt_cnt):
try:
with socket.create_connection((addr, port), timeout=timeout):
return
except (ConnectionRefusedError, TimeoutError): # noqa: PERF203
if i == attempt_cnt - 1:
raise
time.sleep(timeout)
def update_apt_repo(host: RemoteHost) -> None:
time.sleep(5)
host.run_cmd("sudo systemctl stop apt-daily.service || true")
host.run_cmd("sudo systemctl stop unattended-upgrades.service || true")
host.run_cmd(
"while systemctl is-active --quiet apt-daily.service; do sleep 1; done"
)
host.run_cmd(
"while systemctl is-active --quiet unattended-upgrades.service; do sleep 1; done"
)
host.run_cmd("sudo apt-get update")
time.sleep(3)
host.run_cmd("sudo apt-get update")
def install_condaforge(
host: RemoteHost, suffix: str = "latest/download/Miniforge3-Linux-aarch64.sh"
) -> None:
print("Install conda-forge")
host.run_cmd(f"curl -OL https://github.com/conda-forge/miniforge/releases/{suffix}")
host.run_cmd(f"sh -f {os.path.basename(suffix)} -b")
host.run_cmd(f"rm -f {os.path.basename(suffix)}")
if host.using_docker():
host.run_cmd("echo 'PATH=$HOME/miniforge3/bin:$PATH'>>.bashrc")
else:
host.run_cmd(
[
"sed",
"-i",
"'/^# If not running interactively.*/i PATH=$HOME/miniforge3/bin:$PATH'",
".bashrc",
]
)
def install_condaforge_python(host: RemoteHost, python_version="3.8") -> None:
if python_version == "3.6":
# Python 3.6 has reached EOL and is not compatible with conda-4.11
install_condaforge(
host, suffix="download/4.10.3-10/Miniforge3-4.10.3-10-Linux-aarch64.sh"
)
host.run_cmd(f"conda install -y python={python_version} numpy pyyaml")
else:
install_condaforge(
host, suffix="download/4.11.0-4/Miniforge3-4.11.0-4-Linux-aarch64.sh"
)
# Pytorch-1.10 or older are not compatible with setuptools=59.6 or newer
host.run_cmd(
f"conda install -y python={python_version} numpy pyyaml setuptools>=59.5.0"
)
def embed_libgomp(host: RemoteHost, use_conda, wheel_name) -> None:
host.run_cmd("pip3 install auditwheel")
host.run_cmd(
"conda install -y patchelf" if use_conda else "sudo apt-get install -y patchelf"
)
from tempfile import NamedTemporaryFile
with NamedTemporaryFile() as tmp:
tmp.write(embed_library_script.encode("utf-8"))
tmp.flush()
host.upload_file(tmp.name, "embed_library.py")
print("Embedding libgomp into wheel")
if host.using_docker():
host.run_cmd(f"python3 embed_library.py {wheel_name} --update-tag")
else:
host.run_cmd(f"python3 embed_library.py {wheel_name}")
def checkout_repo(
host: RemoteHost,
*,
branch: str = "main",
url: str,
git_clone_flags: str,
mapping: dict[str, tuple[str, str]],
) -> Optional[str]:
for prefix in mapping:
if not branch.startswith(prefix):
continue
tag = f"v{mapping[prefix][0]}-{mapping[prefix][1]}"
host.run_cmd(f"git clone {url} -b {tag} {git_clone_flags}")
return mapping[prefix][0]
host.run_cmd(f"git clone {url} -b {branch} {git_clone_flags}")
return None
def build_torchvision(
host: RemoteHost,
*,
branch: str = "main",
use_conda: bool = True,
git_clone_flags: str,
run_smoke_tests: bool = True,
) -> str:
print("Checking out TorchVision repo")
build_version = checkout_repo(
host,
branch=branch,
url="https://github.com/pytorch/vision",
git_clone_flags=git_clone_flags,
mapping={
"v1.7.1": ("0.8.2", "rc2"),
"v1.8.0": ("0.9.0", "rc3"),
"v1.8.1": ("0.9.1", "rc1"),
"v1.9.0": ("0.10.0", "rc1"),
"v1.10.0": ("0.11.1", "rc1"),
"v1.10.1": ("0.11.2", "rc1"),
"v1.10.2": ("0.11.3", "rc1"),
"v1.11.0": ("0.12.0", "rc1"),
"v1.12.0": ("0.13.0", "rc4"),
"v1.12.1": ("0.13.1", "rc6"),
"v1.13.0": ("0.14.0", "rc4"),
"v1.13.1": ("0.14.1", "rc2"),
"v2.0.0": ("0.15.1", "rc2"),
"v2.0.1": ("0.15.2", "rc2"),
},
)
print("Building TorchVision wheel")
# Please note libpng and jpeg are required to build image.so extension
if use_conda:
host.run_cmd("conda install -y libpng jpeg")
# Remove .so files to force static linking
host.run_cmd(
"rm miniforge3/lib/libpng.so miniforge3/lib/libpng16.so miniforge3/lib/libjpeg.so"
)
# And patch setup.py to include libz dependency for libpng
host.run_cmd(
[
'sed -i -e \'s/image_link_flags\\.append("png")/image_link_flags += ["png", "z"]/\' vision/setup.py'
]
)
build_vars = ""
if branch == "nightly":
version = host.check_output(
["if [ -f vision/version.txt ]; then cat vision/version.txt; fi"]
).strip()
if len(version) == 0:
# In older revisions, version was embedded in setup.py
version = (
host.check_output(["grep", '"version = \'"', "vision/setup.py"])
.strip()
.split("'")[1][:-2]
)
build_date = (
host.check_output("cd vision && git log --pretty=format:%s -1")
.strip()
.split()[0]
.replace("-", "")
)
build_vars += f"BUILD_VERSION={version}.dev{build_date}"
elif build_version is not None:
build_vars += f"BUILD_VERSION={build_version} PYTORCH_VERSION={branch[1:].split('-', maxsplit=1)[0]}"
if host.using_docker():
build_vars += " CMAKE_SHARED_LINKER_FLAGS=-Wl,-z,max-page-size=0x10000"
host.run_cmd(f"cd vision && {build_vars} python3 -m build --wheel --no-isolation")
vision_wheel_name = host.list_dir("vision/dist")[0]
embed_libgomp(host, use_conda, os.path.join("vision", "dist", vision_wheel_name))
print("Copying TorchVision wheel")
host.download_wheel(os.path.join("vision", "dist", vision_wheel_name))
if run_smoke_tests:
host.run_cmd(
f"pip3 install {os.path.join('vision', 'dist', vision_wheel_name)}"
)
host.run_cmd("python3 vision/test/smoke_test.py")
print("Delete vision checkout")
host.run_cmd("rm -rf vision")
return vision_wheel_name
def build_torchdata(
host: RemoteHost,
*,
branch: str = "main",
use_conda: bool = True,
git_clone_flags: str = "",
) -> str:
print("Checking out TorchData repo")
git_clone_flags += " --recurse-submodules"
build_version = checkout_repo(
host,
branch=branch,
url="https://github.com/pytorch/data",
git_clone_flags=git_clone_flags,
mapping={
"v1.13.1": ("0.5.1", ""),
"v2.0.0": ("0.6.0", "rc5"),
"v2.0.1": ("0.6.1", "rc1"),
},
)
print("Building TorchData wheel")
build_vars = ""
if branch == "nightly":
version = host.check_output(
["if [ -f data/version.txt ]; then cat data/version.txt; fi"]
).strip()
build_date = (
host.check_output("cd data && git log --pretty=format:%s -1")
.strip()
.split()[0]
.replace("-", "")
)
build_vars += f"BUILD_VERSION={version}.dev{build_date}"
elif build_version is not None:
build_vars += f"BUILD_VERSION={build_version} PYTORCH_VERSION={branch[1:].split('-', maxsplit=1)[0]}"
if host.using_docker():
build_vars += " CMAKE_SHARED_LINKER_FLAGS=-Wl,-z,max-page-size=0x10000"
host.run_cmd(f"cd data && {build_vars} python3 -m build --wheel --no-isolation")
wheel_name = host.list_dir("data/dist")[0]
embed_libgomp(host, use_conda, os.path.join("data", "dist", wheel_name))
print("Copying TorchData wheel")
host.download_wheel(os.path.join("data", "dist", wheel_name))
return wheel_name
def build_torchtext(
host: RemoteHost,
*,
branch: str = "main",
use_conda: bool = True,
git_clone_flags: str = "",
) -> str:
print("Checking out TorchText repo")
git_clone_flags += " --recurse-submodules"
build_version = checkout_repo(
host,
branch=branch,
url="https://github.com/pytorch/text",
git_clone_flags=git_clone_flags,
mapping={
"v1.9.0": ("0.10.0", "rc1"),
"v1.10.0": ("0.11.0", "rc2"),
"v1.10.1": ("0.11.1", "rc1"),
"v1.10.2": ("0.11.2", "rc1"),
"v1.11.0": ("0.12.0", "rc1"),
"v1.12.0": ("0.13.0", "rc2"),
"v1.12.1": ("0.13.1", "rc5"),
"v1.13.0": ("0.14.0", "rc3"),
"v1.13.1": ("0.14.1", "rc1"),
"v2.0.0": ("0.15.1", "rc2"),
"v2.0.1": ("0.15.2", "rc2"),
},
)
print("Building TorchText wheel")
build_vars = ""
if branch == "nightly":
version = host.check_output(
["if [ -f text/version.txt ]; then cat text/version.txt; fi"]
).strip()
build_date = (
host.check_output("cd text && git log --pretty=format:%s -1")
.strip()
.split()[0]
.replace("-", "")
)
build_vars += f"BUILD_VERSION={version}.dev{build_date}"
elif build_version is not None:
build_vars += f"BUILD_VERSION={build_version} PYTORCH_VERSION={branch[1:].split('-', maxsplit=1)[0]}"
if host.using_docker():
build_vars += " CMAKE_SHARED_LINKER_FLAGS=-Wl,-z,max-page-size=0x10000"
host.run_cmd(f"cd text && {build_vars} python3 -m build --wheel --no-isolation")
wheel_name = host.list_dir("text/dist")[0]
embed_libgomp(host, use_conda, os.path.join("text", "dist", wheel_name))
print("Copying TorchText wheel")
host.download_wheel(os.path.join("text", "dist", wheel_name))
return wheel_name
def build_torchaudio(
host: RemoteHost,
*,
branch: str = "main",
use_conda: bool = True,
git_clone_flags: str = "",
) -> str:
print("Checking out TorchAudio repo")
git_clone_flags += " --recurse-submodules"
build_version = checkout_repo(
host,
branch=branch,
url="https://github.com/pytorch/audio",
git_clone_flags=git_clone_flags,
mapping={
"v1.9.0": ("0.9.0", "rc2"),
"v1.10.0": ("0.10.0", "rc5"),
"v1.10.1": ("0.10.1", "rc1"),
"v1.10.2": ("0.10.2", "rc1"),
"v1.11.0": ("0.11.0", "rc1"),
"v1.12.0": ("0.12.0", "rc3"),
"v1.12.1": ("0.12.1", "rc5"),
"v1.13.0": ("0.13.0", "rc4"),
"v1.13.1": ("0.13.1", "rc2"),
"v2.0.0": ("2.0.1", "rc3"),
"v2.0.1": ("2.0.2", "rc2"),
},
)
print("Building TorchAudio wheel")
build_vars = ""
if branch == "nightly":
version = (
host.check_output(["grep", '"version = \'"', "audio/setup.py"])
.strip()
.split("'")[1][:-2]
)
build_date = (
host.check_output("cd audio && git log --pretty=format:%s -1")
.strip()
.split()[0]
.replace("-", "")
)
build_vars += f"BUILD_VERSION={version}.dev{build_date}"
elif build_version is not None:
build_vars += f"BUILD_VERSION={build_version} PYTORCH_VERSION={branch[1:].split('-', maxsplit=1)[0]}"
if host.using_docker():
build_vars += " CMAKE_SHARED_LINKER_FLAGS=-Wl,-z,max-page-size=0x10000"
host.run_cmd(
f"cd audio && export FFMPEG_ROOT=$(pwd)/third_party/ffmpeg && export USE_FFMPEG=1 \
&& ./packaging/ffmpeg/build.sh \
&& {build_vars} python3 -m build --wheel --no-isolation"
)
wheel_name = host.list_dir("audio/dist")[0]
embed_libgomp(host, use_conda, os.path.join("audio", "dist", wheel_name))
print("Copying TorchAudio wheel")
host.download_wheel(os.path.join("audio", "dist", wheel_name))
return wheel_name
def configure_system(
host: RemoteHost,
*,
compiler: str = "gcc-8",
use_conda: bool = True,
python_version: str = "3.8",
) -> None:
if use_conda:
install_condaforge_python(host, python_version)
print("Configuring the system")
if not host.using_docker():
update_apt_repo(host)
host.run_cmd("sudo apt-get install -y ninja-build g++ git cmake gfortran unzip")
else:
host.run_cmd("yum install -y sudo")
host.run_cmd("conda install -y ninja scons")
if not use_conda:
host.run_cmd(
"sudo apt-get install -y python3-dev python3-yaml python3-setuptools python3-wheel python3-pip"
)
host.run_cmd("pip3 install dataclasses typing-extensions")
if not use_conda:
print("Installing Cython + numpy from PyPI")
host.run_cmd("sudo pip3 install Cython")
host.run_cmd("sudo pip3 install numpy")
def build_domains(
host: RemoteHost,
*,
branch: str = "main",
use_conda: bool = True,
git_clone_flags: str = "",
) -> tuple[str, str, str, str]:
vision_wheel_name = build_torchvision(
host, branch=branch, use_conda=use_conda, git_clone_flags=git_clone_flags
)
audio_wheel_name = build_torchaudio(
host, branch=branch, use_conda=use_conda, git_clone_flags=git_clone_flags
)
data_wheel_name = build_torchdata(
host, branch=branch, use_conda=use_conda, git_clone_flags=git_clone_flags
)
text_wheel_name = build_torchtext(
host, branch=branch, use_conda=use_conda, git_clone_flags=git_clone_flags
)
return (vision_wheel_name, audio_wheel_name, data_wheel_name, text_wheel_name)
def start_build(
host: RemoteHost,
*,
branch: str = "main",
compiler: str = "gcc-8",
use_conda: bool = True,
python_version: str = "3.8",
pytorch_only: bool = False,
pytorch_build_number: Optional[str] = None,
shallow_clone: bool = True,
enable_mkldnn: bool = False,
) -> tuple[str, str, str, str, str]:
git_clone_flags = " --depth 1 --shallow-submodules" if shallow_clone else ""
if host.using_docker() and not use_conda:
print("Auto-selecting conda option for docker images")
use_conda = True
if not host.using_docker():
print("Disable mkldnn for host builds")
enable_mkldnn = False
configure_system(
host, compiler=compiler, use_conda=use_conda, python_version=python_version
)
if host.using_docker():
print("Move libgfortran.a into a standard location")
# HACK: pypa gfortran.a is compiled without PIC, which leads to the following error
# libgfortran.a(error.o)(.text._gfortrani_st_printf+0x34): unresolvable R_AARCH64_ADR_PREL_PG_HI21 relocation against symbol `__stack_chk_guard@@GLIBC_2.17' # noqa: E501, B950
# Workaround by copying gfortran library from the host
host.run_ssh_cmd("sudo apt-get install -y gfortran-8")
host.run_cmd("mkdir -p /usr/lib/gcc/aarch64-linux-gnu/8")
host.run_ssh_cmd(
[
"docker",
"cp",
"/usr/lib/gcc/aarch64-linux-gnu/8/libgfortran.a",
f"{host.container_id}:/opt/rh/devtoolset-10/root/usr/lib/gcc/aarch64-redhat-linux/10/",
]
)
print("Checking out PyTorch repo")
host.run_cmd(
f"git clone --recurse-submodules -b {branch} https://github.com/pytorch/pytorch {git_clone_flags}"
)
host.run_cmd("pytorch/.ci/docker/common/install_openblas.sh")
print("Building PyTorch wheel")
build_opts = ""
if pytorch_build_number is not None:
build_opts += f" -C--build-option=--build-number={pytorch_build_number}"
# Breakpad build fails on aarch64
build_vars = "USE_BREAKPAD=0 "
if branch == "nightly":
build_date = (
host.check_output("cd pytorch && git log --pretty=format:%s -1")
.strip()
.split()[0]
.replace("-", "")
)
version = host.check_output("cat pytorch/version.txt").strip()[:-2]
build_vars += f"BUILD_TEST=0 PYTORCH_BUILD_VERSION={version}.dev{build_date} PYTORCH_BUILD_NUMBER=1"
if branch.startswith(("v1.", "v2.")):
build_vars += f"BUILD_TEST=0 PYTORCH_BUILD_VERSION={branch[1 : branch.find('-')]} PYTORCH_BUILD_NUMBER=1"
if host.using_docker():
build_vars += " CMAKE_SHARED_LINKER_FLAGS=-Wl,-z,max-page-size=0x10000"
if enable_mkldnn:
host.run_cmd("pytorch/.ci/docker/common/install_acl.sh")
print("build pytorch with mkldnn+acl backend")
build_vars += " USE_MKLDNN=ON USE_MKLDNN_ACL=ON"
build_vars += " BLAS=OpenBLAS"
build_vars += " OpenBLAS_HOME=/opt/OpenBLAS"
build_vars += " ACL_ROOT_DIR=/acl"
host.run_cmd(
f"cd $HOME/pytorch && {build_vars} python3 -m build --wheel --no-isolation{build_opts}"
)
print("Repair the wheel")
pytorch_wheel_name = host.list_dir("pytorch/dist")[0]
ld_library_path = "/acl/build:$HOME/pytorch/build/lib"
host.run_cmd(
f"export LD_LIBRARY_PATH={ld_library_path} && auditwheel repair $HOME/pytorch/dist/{pytorch_wheel_name}"
)
print("replace the original wheel with the repaired one")
pytorch_repaired_wheel_name = host.list_dir("wheelhouse")[0]
host.run_cmd(
f"cp $HOME/wheelhouse/{pytorch_repaired_wheel_name} $HOME/pytorch/dist/{pytorch_wheel_name}"
)
else:
print("build pytorch without mkldnn backend")
host.run_cmd(
f"cd pytorch && {build_vars} python3 -m build --wheel --no-isolation{build_opts}"
)
print("Deleting build folder")
host.run_cmd("cd pytorch && rm -rf build")
pytorch_wheel_name = host.list_dir("pytorch/dist")[0]
embed_libgomp(host, use_conda, os.path.join("pytorch", "dist", pytorch_wheel_name))
print("Copying the wheel")
host.download_wheel(os.path.join("pytorch", "dist", pytorch_wheel_name))
print("Installing PyTorch wheel")
host.run_cmd(f"pip3 install pytorch/dist/{pytorch_wheel_name}")
if pytorch_only:
return (pytorch_wheel_name, None, None, None, None)
domain_wheels = build_domains(
host, branch=branch, use_conda=use_conda, git_clone_flags=git_clone_flags
)
return (pytorch_wheel_name, *domain_wheels)
embed_library_script = """
#!/usr/bin/env python3
from auditwheel.patcher import Patchelf
from auditwheel.wheeltools import InWheelCtx
from auditwheel.elfutils import elf_file_filter
from auditwheel.repair import copylib
from auditwheel.lddtree import lddtree
from subprocess import check_call
import os
import shutil
import sys
from tempfile import TemporaryDirectory
def replace_tag(filename):
with open(filename, 'r') as f:
lines = f.read().split("\\n")
for i,line in enumerate(lines):
if not line.startswith("Tag: "):
continue
lines[i] = line.replace("-linux_", "-manylinux2014_")
print(f'Updated tag from {line} to {lines[i]}')
with open(filename, 'w') as f:
f.write("\\n".join(lines))
class AlignedPatchelf(Patchelf):
def set_soname(self, file_name: str, new_soname: str) -> None:
check_call(['patchelf', '--page-size', '65536', '--set-soname', new_soname, file_name])
def replace_needed(self, file_name: str, soname: str, new_soname: str) -> None:
check_call(['patchelf', '--page-size', '65536', '--replace-needed', soname, new_soname, file_name])
def embed_library(whl_path, lib_soname, update_tag=False):
patcher = AlignedPatchelf()
out_dir = TemporaryDirectory()
whl_name = os.path.basename(whl_path)
tmp_whl_name = os.path.join(out_dir.name, whl_name)
with InWheelCtx(whl_path) as ctx:
torchlib_path = os.path.join(ctx._tmpdir.name, 'torch', 'lib')
ctx.out_wheel=tmp_whl_name
new_lib_path, new_lib_soname = None, None
for filename, elf in elf_file_filter(ctx.iter_files()):
if not filename.startswith('torch/lib'):
continue
libtree = lddtree(filename)
if lib_soname not in libtree['needed']:
continue
lib_path = libtree['libs'][lib_soname]['path']
if lib_path is None:
print(f"Can't embed {lib_soname} as it could not be found")
break
if lib_path.startswith(torchlib_path):
continue
if new_lib_path is None:
new_lib_soname, new_lib_path = copylib(lib_path, torchlib_path, patcher)
patcher.replace_needed(filename, lib_soname, new_lib_soname)
print(f'Replacing {lib_soname} with {new_lib_soname} for {filename}')
if update_tag:
# Add manylinux2014 tag
for filename in ctx.iter_files():
if os.path.basename(filename) != 'WHEEL':
continue
replace_tag(filename)
shutil.move(tmp_whl_name, whl_path)
if __name__ == '__main__':
embed_library(sys.argv[1], 'libgomp.so.1', len(sys.argv) > 2 and sys.argv[2] == '--update-tag')
"""
def run_tests(host: RemoteHost, whl: str, branch="main") -> None:
print("Configuring the system")
update_apt_repo(host)
host.run_cmd("sudo apt-get install -y python3-pip git")
host.run_cmd("sudo pip3 install Cython")
host.run_cmd("sudo pip3 install numpy")
host.upload_file(whl, ".")
host.run_cmd(f"sudo pip3 install {whl}")
host.run_cmd("python3 -c 'import torch;print(torch.rand((3,3)))'")
host.run_cmd(f"git clone -b {branch} https://github.com/pytorch/pytorch")
host.run_cmd("cd pytorch/test; python3 test_torch.py -v")
def get_instance_name(instance) -> Optional[str]:
if instance.tags is None:
return None
for tag in instance.tags:
if tag["Key"] == "Name":
return tag["Value"]
return None
def list_instances(instance_type: str) -> None:
print(f"All instances of type {instance_type}")
for instance in ec2_instances_of_type(instance_type):
ifaces = instance.network_interfaces
az = ifaces[0].subnet.availability_zone if len(ifaces) > 0 else None
print(
f"{instance.id} {get_instance_name(instance)} {instance.public_dns_name} {instance.state['Name']} {az}"
)
def terminate_instances(instance_type: str) -> None:
print(f"Terminating all instances of type {instance_type}")
instances = list(ec2_instances_of_type(instance_type))
for instance in instances:
print(f"Terminating {instance.id}")
instance.terminate()
print("Waiting for termination to complete")
for instance in instances:
instance.wait_until_terminated()
def parse_arguments():
from argparse import ArgumentParser
parser = ArgumentParser("Build and test AARCH64 wheels using EC2")
parser.add_argument("--key-name", type=str)
parser.add_argument("--debug", action="store_true")
parser.add_argument("--build-only", action="store_true")
parser.add_argument("--test-only", type=str)
group = parser.add_mutually_exclusive_group()
group.add_argument("--os", type=str, choices=list(os_amis.keys()))
group.add_argument("--ami", type=str)
parser.add_argument(
"--python-version",
type=str,
choices=[f"3.{d}" for d in range(6, 12)],
default=None,
)
parser.add_argument("--alloc-instance", action="store_true")
parser.add_argument("--list-instances", action="store_true")
parser.add_argument("--pytorch-only", action="store_true")
parser.add_argument("--keep-running", action="store_true")
parser.add_argument("--terminate-instances", action="store_true")
parser.add_argument("--instance-type", type=str, default="t4g.2xlarge")
parser.add_argument("--ebs-size", type=int, default=50)
parser.add_argument("--branch", type=str, default="main")
parser.add_argument("--use-docker", action="store_true")
parser.add_argument(
"--compiler",
type=str,
choices=["gcc-7", "gcc-8", "gcc-9", "clang"],
default="gcc-8",
)
parser.add_argument("--use-torch-from-pypi", action="store_true")
parser.add_argument("--pytorch-build-number", type=str, default=None)
parser.add_argument("--disable-mkldnn", action="store_true")
return parser.parse_args()
if __name__ == "__main__":
args = parse_arguments()
ami = (
args.ami
if args.ami is not None
else os_amis[args.os]
if args.os is not None
else ubuntu20_04_ami
)
keyfile_path, key_name = compute_keyfile_path(args.key_name)
if args.list_instances:
list_instances(args.instance_type)
sys.exit(0)
if args.terminate_instances:
terminate_instances(args.instance_type)
sys.exit(0)
if len(key_name) == 0:
raise RuntimeError("""
Cannot start build without key_name, please specify
--key-name argument or AWS_KEY_NAME environment variable.""")
if len(keyfile_path) == 0 or not os.path.exists(keyfile_path):
raise RuntimeError(f"""
Cannot find keyfile with name: [{key_name}] in path: [{keyfile_path}], please
check `~/.ssh/` folder or manually set SSH_KEY_PATH environment variable.""")
# Starting the instance
inst = start_instance(
key_name, ami=ami, instance_type=args.instance_type, ebs_size=args.ebs_size
)
instance_name = f"{args.key_name}-{args.os}"
if args.python_version is not None:
instance_name += f"-py{args.python_version}"
inst.create_tags(
DryRun=False,
Tags=[
{
"Key": "Name",
"Value": instance_name,
}
],
)
addr = inst.public_dns_name
wait_for_connection(addr, 22)
host = RemoteHost(addr, keyfile_path)
host.ami = ami
if args.use_docker:
update_apt_repo(host)
host.start_docker()
if args.test_only:
run_tests(host, args.test_only)
sys.exit(0)
if args.alloc_instance:
if args.python_version is None:
sys.exit(0)
install_condaforge_python(host, args.python_version)
sys.exit(0)
python_version = args.python_version if args.python_version is not None else "3.10"
if args.use_torch_from_pypi:
configure_system(host, compiler=args.compiler, python_version=python_version)
print("Installing PyTorch wheel")
host.run_cmd("pip3 install torch")
build_domains(
host, branch=args.branch, git_clone_flags=" --depth 1 --shallow-submodules"
)
else:
start_build(
host,
branch=args.branch,
compiler=args.compiler,
python_version=python_version,
pytorch_only=args.pytorch_only,
pytorch_build_number=args.pytorch_build_number,
enable_mkldnn=not args.disable_mkldnn,
)
if not args.keep_running:
print(f"Waiting for instance {inst.id} to terminate")
inst.terminate()
inst.wait_until_terminated()
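The nightly branches of the build functions above derive `BUILD_VERSION` from two remote reads: the contents of `version.txt` and the subject of the latest commit. A minimal sketch of that string handling, isolated from the SSH calls, might look like the following. The helper name `nightly_build_version` and the sample inputs are hypothetical, and it assumes (as the script does) that the nightly commit subject begins with an ISO date.

```python
def nightly_build_version(version_txt: str, commit_subject: str) -> str:
    """Combine a version.txt value with the date that leads a nightly commit subject."""
    # Drop the two-character pre-release suffix, e.g. "2.1.0a0" -> "2.1.0".
    version = version_txt.strip()[:-2]
    # Assumption: the subject starts with an ISO date such as "2023-05-01".
    build_date = commit_subject.strip().split()[0].replace("-", "")
    return f"{version}.dev{build_date}"


# Hypothetical inputs, for illustration only.
print(nightly_build_version("2.1.0a0\n", "2023-05-01 nightly release"))
```

If the commit subject does not start with a date, the result is a malformed version string rather than an error, so a caller may want to validate the date token first.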

@ -1,87 +0,0 @@
#!/usr/bin/env python3
import os
import shutil
import sys
from subprocess import check_call
from tempfile import TemporaryDirectory
from auditwheel.elfutils import elf_file_filter
from auditwheel.lddtree import lddtree
from auditwheel.patcher import Patchelf
from auditwheel.repair import copylib
from auditwheel.wheeltools import InWheelCtx
def replace_tag(filename):
with open(filename) as f:
lines = f.read().split("\n")
for i, line in enumerate(lines):
if not line.startswith("Tag: "):
continue
lines[i] = line.replace("-linux_", "-manylinux2014_")
print(f"Updated tag from {line} to {lines[i]}")
with open(filename, "w") as f:
f.write("\n".join(lines))
class AlignedPatchelf(Patchelf):
def set_soname(self, file_name: str, new_soname: str) -> None:
check_call(
["patchelf", "--page-size", "65536", "--set-soname", new_soname, file_name]
)
def replace_needed(self, file_name: str, soname: str, new_soname: str) -> None:
check_call(
[
"patchelf",
"--page-size",
"65536",
"--replace-needed",
soname,
new_soname,
file_name,
]
)
def embed_library(whl_path, lib_soname, update_tag=False):
patcher = AlignedPatchelf()
out_dir = TemporaryDirectory()
whl_name = os.path.basename(whl_path)
tmp_whl_name = os.path.join(out_dir.name, whl_name)
with InWheelCtx(whl_path) as ctx:
torchlib_path = os.path.join(ctx._tmpdir.name, "torch", "lib")
ctx.out_wheel = tmp_whl_name
new_lib_path, new_lib_soname = None, None
for filename, _ in elf_file_filter(ctx.iter_files()):
if not filename.startswith("torch/lib"):
continue
libtree = lddtree(filename)
if lib_soname not in libtree["needed"]:
continue
lib_path = libtree["libs"][lib_soname]["path"]
if lib_path is None:
print(f"Can't embed {lib_soname} as it could not be found")
break
if lib_path.startswith(torchlib_path):
continue
if new_lib_path is None:
new_lib_soname, new_lib_path = copylib(lib_path, torchlib_path, patcher)
patcher.replace_needed(filename, lib_soname, new_lib_soname)
print(f"Replacing {lib_soname} with {new_lib_soname} for {filename}")
if update_tag:
# Add manylinux2014 tag
for filename in ctx.iter_files():
if os.path.basename(filename) != "WHEEL":
continue
replace_tag(filename)
shutil.move(tmp_whl_name, whl_path)
if __name__ == "__main__":
embed_library(
sys.argv[1], "libgomp.so.1", len(sys.argv) > 2 and sys.argv[2] == "--update-tag"
)
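The tag rewrite performed by `replace_tag` can be exercised without touching a real wheel. The sketch below applies the same line-by-line substitution to an in-memory string; the function name `replace_tag_text` and the sample `WHEEL` contents are hypothetical and only illustrate the transformation.

```python
def replace_tag_text(wheel_metadata: str) -> str:
    """Rewrite `Tag:` lines of a WHEEL file from -linux_ to -manylinux2014_."""
    lines = wheel_metadata.split("\n")
    for i, line in enumerate(lines):
        # Only Tag lines carry the platform; leave every other field untouched.
        if line.startswith("Tag: "):
            lines[i] = line.replace("-linux_", "-manylinux2014_")
    return "\n".join(lines)


# Hypothetical WHEEL contents, for illustration only.
sample = "Wheel-Version: 1.0\nRoot-Is-Purelib: false\nTag: cp310-cp310-linux_aarch64\n"
print(replace_tag_text(sample))
```

Splitting on a real newline matters here: splitting on the two-character sequence backslash-n would leave the file as a single "line" that never starts with `Tag: `, so no tag would be updated.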

@ -4,14 +4,17 @@ set -ex
SCRIPTPATH="$( cd "$( dirname "${BASH_SOURCE[0]}" )" >/dev/null 2>&1 && pwd )"
# Source the common build script for architecture-specific configurations (MKLDNN, ACL, etc.)
source "${SCRIPTPATH}/../pytorch/build.sh" || true
case "${GPU_ARCH_TYPE:-BLANK}" in
cuda)
cuda | cuda-aarch64)
bash "${SCRIPTPATH}/build_cuda.sh"
;;
rocm)
bash "${SCRIPTPATH}/build_rocm.sh"
;;
cpu | cpu-cxx11-abi | cpu-s390x)
cpu | cpu-cxx11-abi | cpu-aarch64 | cpu-s390x)
bash "${SCRIPTPATH}/build_cpu.sh"
;;
xpu)

@ -18,12 +18,31 @@ retry () {
$* || (sleep 1 && $*) || (sleep 2 && $*) || (sleep 4 && $*) || (sleep 8 && $*)
}
# Detect architecture first
ARCH=$(uname -m)
echo "Detected architecture: $ARCH"
PLATFORM=""
# TODO move this into the Docker images
OS_NAME=$(awk -F= '/^NAME/{print $2}' /etc/os-release)
if [[ "$OS_NAME" == *"AlmaLinux"* ]]; then
retry yum install -q -y zip openssl
PLATFORM="manylinux_2_28_x86_64"
# Set platform based on architecture
case $ARCH in
x86_64)
PLATFORM="manylinux_2_28_x86_64"
;;
aarch64)
PLATFORM="manylinux_2_28_aarch64"
;;
s390x)
PLATFORM="manylinux_2_28_s390x"
;;
*)
echo "Unsupported architecture: $ARCH"
exit 1
;;
esac
elif [[ "$OS_NAME" == *"Red Hat Enterprise Linux"* ]]; then
retry dnf install -q -y zip openssl
elif [[ "$OS_NAME" == *"Ubuntu"* ]]; then
@ -38,6 +57,8 @@ else
exit 1
fi
echo "Platform set to: $PLATFORM"
# We use the package name to test the package by passing this to 'pip install'
# This is the env variable that setup.py uses to name the package. Note that
# pip 'normalizes' the name first by changing all - to _
@ -299,8 +320,8 @@ for pkg in /$WHEELHOUSE_DIR/torch_no_python*.whl /$WHEELHOUSE_DIR/torch*linux*.w
# ROCm workaround for roctracer dlopens
if [[ "$DESIRED_CUDA" == *"rocm"* ]]; then
patchedpath=$(fname_without_so_number $destpath)
# Keep the so number for XPU dependencies and libgomp.so.1 to avoid twice load
elif [[ "$DESIRED_CUDA" == *"xpu"* || "$filename" == "libgomp.so.1" ]]; then
# Keep the so number for XPU dependencies, libgomp.so.1, libgfortran.so.5, ACL libraries, and NVPL libraries to avoid loading them twice
elif [[ "$DESIRED_CUDA" == *"xpu"* || "$filename" == "libgomp.so.1" || "$filename" == libarm_compute* || "$filename" == libnvpl* || "$filename" == "libgfortran.so.5" ]]; then
patchedpath=$destpath
else
patchedpath=$(fname_with_sha256 $destpath)
@ -346,9 +367,22 @@ for pkg in /$WHEELHOUSE_DIR/torch_no_python*.whl /$WHEELHOUSE_DIR/torch*linux*.w
done
# Create the manylinux_2_28 tag; this needs to happen before regenerating the RECORD
if [[ $PLATFORM == "manylinux_2_28_x86_64" && $GPU_ARCH_TYPE != "cpu-s390x" && $GPU_ARCH_TYPE != "xpu" ]]; then
# Support all architectures (x86_64, aarch64, s390x)
if [[ "$IS_MANYLINUX2_28" == "1" && $GPU_ARCH_TYPE != "xpu" ]]; then
wheel_file=$(echo $(basename $pkg) | sed -e 's/-cp.*$/.dist-info\/WHEEL/g')
sed -i -e s#linux_x86_64#"${PLATFORM}"# $wheel_file;
echo "Updating wheel tag for $ARCH architecture"
# Replace linux_* with manylinux_2_28_* based on architecture
case $ARCH in
x86_64)
sed -i -e 's#linux_x86_64#manylinux_2_28_x86_64#g' $wheel_file
;;
aarch64)
sed -i -e 's#linux_aarch64#manylinux_2_28_aarch64#g' $wheel_file
;;
s390x)
sed -i -e 's#linux_s390x#manylinux_2_28_s390x#g' $wheel_file
;;
esac
fi
# regenerate the RECORD file with new hashes
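The two `case $ARCH` blocks in this file share one rule: a supported `uname -m` value maps to `manylinux_2_28_<arch>`, and the wheel tag `linux_<arch>` is rewritten to that platform. Re-expressed in Python for illustration only (the build itself stays in shell), the rule could be sketched as follows. The names `manylinux_platform` and `retag` are hypothetical.

```python
SUPPORTED_ARCHS = ("x86_64", "aarch64", "s390x")


def manylinux_platform(arch: str) -> str:
    """Return the manylinux_2_28 platform tag for a `uname -m` value."""
    if arch not in SUPPORTED_ARCHS:
        # Mirrors the shell script's "Unsupported architecture" exit path.
        raise ValueError(f"Unsupported architecture: {arch}")
    return f"manylinux_2_28_{arch}"


def retag(wheel_metadata: str, arch: str) -> str:
    """Replace every linux_<arch> occurrence, like the sed 's#...#...#g' calls."""
    return wheel_metadata.replace(f"linux_{arch}", manylinux_platform(arch))


print(retag("Tag: cp310-cp310-linux_aarch64", "aarch64"))
```

Note one difference from the shell version: the tag-rewriting `case` in the script has no default branch, so an unknown architecture is silently skipped there, whereas this sketch raises.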

@ -15,6 +15,10 @@ if [[ -z "$EXTRA_CAFFE2_CMAKE_FLAGS" ]]; then
EXTRA_CAFFE2_CMAKE_FLAGS=()
fi
# Detect architecture
ARCH=$(uname -m)
echo "Building CPU wheel for architecture: $ARCH"
WHEELHOUSE_DIR="wheelhousecpu"
LIBTORCH_HOUSE_DIR="libtorch_housecpu"
if [[ -z "$PYTORCH_FINAL_PACKAGE_DIR" ]]; then
@ -34,8 +38,10 @@ elif [[ "$OS_NAME" == *"Red Hat Enterprise Linux"* ]]; then
elif [[ "$OS_NAME" == *"AlmaLinux"* ]]; then
LIBGOMP_PATH="/usr/lib64/libgomp.so.1"
elif [[ "$OS_NAME" == *"Ubuntu"* ]]; then
if [[ "$(uname -m)" == "s390x" ]]; then
if [[ "$ARCH" == "s390x" ]]; then
LIBGOMP_PATH="/usr/lib/s390x-linux-gnu/libgomp.so.1"
elif [[ "$ARCH" == "aarch64" ]]; then
LIBGOMP_PATH="/usr/lib/aarch64-linux-gnu/libgomp.so.1"
else
LIBGOMP_PATH="/usr/lib/x86_64-linux-gnu/libgomp.so.1"
fi
@ -49,6 +55,32 @@ DEPS_SONAME=(
"libgomp.so.1"
)
# Add ARM-specific library dependencies for CPU builds
if [[ "$ARCH" == "aarch64" ]]; then
echo "Adding ARM-specific CPU library dependencies"
# ARM Compute Library (if available)
if [[ -d "/acl/build" ]]; then
echo "Adding ARM Compute Library for CPU"
DEPS_LIST+=(
"/acl/build/libarm_compute.so"
"/acl/build/libarm_compute_graph.so"
)
DEPS_SONAME+=(
"libarm_compute.so"
"libarm_compute_graph.so"
)
fi
# ARM system libraries
DEPS_LIST+=(
"/usr/lib64/libgfortran.so.5"
)
DEPS_SONAME+=(
"libgfortran.so.5"
)
fi
rm -rf /usr/local/cuda*
SOURCE_DIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )" >/dev/null && pwd )"

@ -29,6 +29,10 @@ if [[ -z "$EXTRA_CAFFE2_CMAKE_FLAGS" ]]; then
EXTRA_CAFFE2_CMAKE_FLAGS=()
fi
# Detect architecture
ARCH=$(uname -m)
echo "Building for architecture: $ARCH"
# Determine CUDA version and architectures to build for
#
# NOTE: We should first check `DESIRED_CUDA` when determining `CUDA_VERSION`,
@ -53,34 +57,60 @@ fi
cuda_version_nodot=$(echo $CUDA_VERSION | tr -d '.')
EXTRA_CAFFE2_CMAKE_FLAGS+=("-DATEN_NO_TEST=ON")
# Function to remove architectures from a list
remove_archs() {
local result="$1"
shift
for arch in "$@"; do
result="${result//${arch};/}"
done
echo "$result"
}
# Function to filter CUDA architectures for aarch64
# aarch64 ARM GPUs only support certain compute capabilities
# Keep: 8.0 (A100), 9.0+ (Hopper, Grace Hopper, newer)
# Remove: < 8.0 (no ARM GPUs), 8.6 (x86_64 RTX 3090/A6000 only)
filter_aarch64_archs() {
local arch_list="$1"
# Explicitly remove architectures not needed on aarch64
arch_list=$(remove_archs "$arch_list" "5.0" "6.0" "7.0" "7.5" "8.6")
echo "$arch_list"
}
# Base: Common architectures across all modern CUDA versions
TORCH_CUDA_ARCH_LIST="7.0;7.5;8.0;8.6;9.0"
case ${CUDA_VERSION} in
#removing sm_50-sm_60 as these architectures are deprecated in CUDA 12.8/9 and will be removed in future releases
#however we would like to keep sm_70 architecture see: https://github.com/pytorch/pytorch/issues/157517
12.8)
TORCH_CUDA_ARCH_LIST="7.0;7.5;8.0;8.6;9.0;10.0;12.0"
;;
12.9)
TORCH_CUDA_ARCH_LIST="7.0;7.5;8.0;8.6;9.0;10.0;12.0+PTX"
# WAR to resolve the ld error in libtorch build with CUDA 12.9
12.6) TORCH_CUDA_ARCH_LIST="5.0;6.0;${TORCH_CUDA_ARCH_LIST}" ;; # Only 12.6 includes Legacy Maxwell/Pascal that will be removed in future releases
12.8) TORCH_CUDA_ARCH_LIST="${TORCH_CUDA_ARCH_LIST};10.0;12.0" ;; # +Hopper/Blackwell support
12.9) TORCH_CUDA_ARCH_LIST="${TORCH_CUDA_ARCH_LIST};10.0;12.0+PTX" # +Hopper/Blackwell support + PTX for forward compatibility
if [[ "$PACKAGE_TYPE" == "libtorch" ]]; then
TORCH_CUDA_ARCH_LIST="7.5;8.0;9.0;10.0;12.0+PTX"
TORCH_CUDA_ARCH_LIST="${TORCH_CUDA_ARCH_LIST//7.0;/}" # Remove 7.0 to resolve the ld error
TORCH_CUDA_ARCH_LIST="${TORCH_CUDA_ARCH_LIST//8.6;/}" # Remove 8.6 for libtorch
fi
;;
13.0)
TORCH_CUDA_ARCH_LIST="7.5;8.0;8.6;9.0;10.0;12.0+PTX"
;;
12.6)
TORCH_CUDA_ARCH_LIST="5.0;6.0;7.0;7.5;8.0;8.6;9.0"
;;
*)
echo "unknown cuda version $CUDA_VERSION"
exit 1
TORCH_CUDA_ARCH_LIST="7.5;8.0;8.6;9.0;10.0;$([[ "$ARCH" == "aarch64" ]] && echo "11.0;" || echo "")12.0+PTX"
export TORCH_NVCC_FLAGS="-compress-mode=size"
export BUILD_BUNDLE_PTXAS=1
;;
*) echo "unknown cuda version $CUDA_VERSION"; exit 1 ;;
esac
# Filter for aarch64: Remove < 8.0 and 8.6
[[ "$ARCH" == "aarch64" ]] && TORCH_CUDA_ARCH_LIST=$(filter_aarch64_archs "$TORCH_CUDA_ARCH_LIST")
echo "TORCH_CUDA_ARCH_LIST set to: $TORCH_CUDA_ARCH_LIST"
export TORCH_CUDA_ARCH_LIST=${TORCH_CUDA_ARCH_LIST}
echo "${TORCH_CUDA_ARCH_LIST}"
# Disable MAGMA for aarch64 as pre-built libraries are x86-64 only
if [[ "$ARCH" == "aarch64" ]]; then
echo "Disabling MAGMA for aarch64 architecture"
export USE_MAGMA=0
fi
# Package directories
WHEELHOUSE_DIR="wheelhouse$cuda_version_nodot"
LIBTORCH_HOUSE_DIR="libtorch_house$cuda_version_nodot"
@ -244,6 +274,51 @@ else
exit 1
fi
# Add ARM-specific library dependencies
if [[ "$ARCH" == "aarch64" ]]; then
echo "Adding ARM-specific library dependencies"
# ARM Compute Library (if available)
if [[ -d "/acl/build" ]]; then
echo "Adding ARM Compute Library"
DEPS_LIST+=(
"/acl/build/libarm_compute.so"
"/acl/build/libarm_compute_graph.so"
)
DEPS_SONAME+=(
"libarm_compute.so"
"libarm_compute_graph.so"
)
fi
# ARM system libraries
DEPS_LIST+=(
"/lib64/libgomp.so.1"
"/usr/lib64/libgfortran.so.5"
)
DEPS_SONAME+=(
"libgomp.so.1"
"libgfortran.so.5"
)
# NVPL libraries (ARM optimized BLAS/LAPACK)
if [[ -d "/usr/local/lib" && -f "/usr/local/lib/libnvpl_blas_lp64_gomp.so.0" ]]; then
echo "Adding NVPL libraries for ARM"
DEPS_LIST+=(
"/usr/local/lib/libnvpl_lapack_lp64_gomp.so.0"
"/usr/local/lib/libnvpl_blas_lp64_gomp.so.0"
"/usr/local/lib/libnvpl_lapack_core.so.0"
"/usr/local/lib/libnvpl_blas_core.so.0"
)
DEPS_SONAME+=(
"libnvpl_lapack_lp64_gomp.so.0"
"libnvpl_blas_lp64_gomp.so.0"
"libnvpl_lapack_core.so.0"
"libnvpl_blas_core.so.0"
)
fi
fi
# run_tests.sh requires DESIRED_CUDA to know what tests to exclude
export DESIRED_CUDA="$cuda_version_nodot"
@ -251,9 +326,11 @@ export DESIRED_CUDA="$cuda_version_nodot"
rm -rf /usr/local/cuda || true
ln -s "/usr/local/cuda-${CUDA_VERSION}" /usr/local/cuda
# Switch `/usr/local/magma` to the desired CUDA version
rm -rf /usr/local/magma || true
ln -s /usr/local/cuda-${CUDA_VERSION}/magma /usr/local/magma
# Switch `/usr/local/magma` to the desired CUDA version (skip for aarch64)
if [[ "$ARCH" != "aarch64" ]]; then
rm -rf /usr/local/magma || true
ln -s /usr/local/cuda-${CUDA_VERSION}/magma /usr/local/magma
fi
export CUDA_VERSION=$(ls /usr/local/cuda/lib64/libcudart.so.*|sort|tac | head -1 | rev | cut -d"." -f -3 | rev) # 10.0.130
export CUDA_VERSION_SHORT=$(ls /usr/local/cuda/lib64/libcudart.so.*|sort|tac | head -1 | rev | cut -d"." -f -3 | rev | cut -f1,2 -d".") # 10.0
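The aarch64 filtering step in this script relies on a `remove_archs` helper that is not visible in this hunk. As a rough illustration of the intended behavior, the sketch below re-expresses the filter in Python. The name `filter_aarch64_archs` mirrors the shell function, but matching an entry such as `8.6+PTX` by its base version is an assumption that the diff does not confirm.

```python
# Illustrative sketch only: a Python analogue of filter_aarch64_archs().
# Assumption: an entry such as "8.6+PTX" is matched by its base version "8.6".
AARCH64_EXCLUDED = {"5.0", "6.0", "7.0", "7.5", "8.6"}


def filter_aarch64_archs(arch_list: str) -> str:
    """Drop architectures that are not built for aarch64 wheels."""
    kept = []
    for entry in arch_list.split(";"):
        if not entry:
            continue  # tolerate stray or trailing semicolons
        base = entry.split("+", 1)[0]
        if base not in AARCH64_EXCLUDED:
            kept.append(entry)
    return ";".join(kept)


if __name__ == "__main__":
    # The CUDA 13.0 list from the diff, including the aarch64-only 11.0 entry.
    print(filter_aarch64_archs("7.5;8.0;8.6;9.0;10.0;11.0;12.0+PTX"))
```

Keeping the exclusion set in one place means the x86 base list can stay the single source of truth, which appears to be the goal of the shell refactor.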


@ -86,10 +86,20 @@ else
fi
fi
# Enable MKLDNN with ARM Compute Library for ARM builds
if [[ "$BUILD_ENVIRONMENT" == *aarch64* ]]; then
export USE_MKLDNN=1
# ACL is required for aarch64 builds
if [[ ! -d "/acl" ]]; then
echo "ERROR: ARM Compute Library not found at /acl"
echo "ACL is required for aarch64 builds. Check Docker image setup."
exit 1
fi
export USE_MKLDNN_ACL=1
export ACL_ROOT_DIR=/acl
echo "ARM Compute Library enabled for MKLDNN: ACL_ROOT_DIR=/acl"
fi
if [[ "$BUILD_ENVIRONMENT" == *riscv64* ]]; then


@ -96,7 +96,6 @@ function pip_build_and_install() {
python3 -m pip wheel \
--no-build-isolation \
--no-deps \
--no-use-pep517 \
-w "${wheel_dir}" \
"${build_target}"
fi
@ -308,6 +307,28 @@ function install_torchao() {
pip_build_and_install "git+https://github.com/pytorch/ao.git@${commit}" dist/ao
}
function install_flash_attn_cute() {
echo "Installing FlashAttention CuTe from GitHub..."
# Grab the latest main until we have a pinned commit
local flash_attn_commit
flash_attn_commit=$(git ls-remote https://github.com/Dao-AILab/flash-attention.git HEAD | cut -f1)
# Clone the repo to a temporary directory
rm -rf flash-attention-build
git clone --depth 1 --recursive https://github.com/Dao-AILab/flash-attention.git flash-attention-build
pushd flash-attention-build
git checkout "${flash_attn_commit}"
# Install only the 'cute' sub-directory
pip_install -e flash_attn/cute/
popd
# remove the local repo
rm -rf flash-attention-build
echo "FlashAttention CuTe installation complete."
}
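`install_flash_attn_cute` resolves the commit to build by taking the first tab-separated field of `git ls-remote ... HEAD`. A minimal Python sketch of that parsing step follows; `head_commit_from_ls_remote` and the sample SHA are hypothetical and only illustrate what `cut -f1` extracts.

```python
# Hypothetical helper mirroring `git ls-remote <url> HEAD | cut -f1`.
def head_commit_from_ls_remote(output: str) -> str:
    """Return the commit SHA from the first line of ls-remote output."""
    lines = output.splitlines()
    if not lines:
        raise ValueError("empty ls-remote output")
    # Each line has the form "<sha>\t<ref>"; keep only the SHA.
    return lines[0].split("\t", 1)[0]


if __name__ == "__main__":
    sample = "0123456789abcdef0123456789abcdef01234567\tHEAD\n"  # made-up SHA
    print(head_commit_from_ls_remote(sample))
```

Once a pinned commit exists, the lookup could presumably be replaced by a constant, which would make the smoke test reproducible.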
function print_sccache_stats() {
echo 'PyTorch Build Statistics'
sccache --show-stats


@ -100,6 +100,337 @@ def check_lib_statically_linked_libstdc_cxx_abi_symbols(lib: str) -> None:
)
def _compile_and_extract_symbols(
    cpp_content: str, compile_flags: list[str], exclude_list: list[str] | None = None
) -> list[str]:
    """
    Helper to compile a C++ file and extract all symbols.
    Args:
        cpp_content: C++ source code to compile
        compile_flags: Compilation flags
        exclude_list: List of symbol names to exclude. Defaults to ["main"].
    Returns:
        List of all symbols found in the object file (excluding those in exclude_list).
    """
    import subprocess
    import tempfile
    if exclude_list is None:
        exclude_list = ["main"]
    with tempfile.TemporaryDirectory() as tmpdir:
        tmppath = Path(tmpdir)
        cpp_file = tmppath / "test.cpp"
        obj_file = tmppath / "test.o"
        cpp_file.write_text(cpp_content)
        result = subprocess.run(
            compile_flags + [str(cpp_file), "-o", str(obj_file)],
            capture_output=True,
            text=True,
            timeout=60,
        )
        if result.returncode != 0:
            raise RuntimeError(f"Compilation failed: {result.stderr}")
        symbols = get_symbols(str(obj_file))
        # Return all symbol names, excluding those in the exclude list
        return [name for _addr, _stype, name in symbols if name not in exclude_list]
def check_stable_only_symbols(install_root: Path) -> None:
"""
Test TORCH_STABLE_ONLY and TORCH_TARGET_VERSION by compiling test code and comparing symbol counts.
This approach tests:
1. WITHOUT macros -> many torch symbols exposed
2. WITH TORCH_STABLE_ONLY -> zero torch symbols (all hidden)
3. WITH TORCH_TARGET_VERSION -> zero torch symbols (all hidden)
4. WITH both macros -> zero torch symbols (all hidden)
"""
include_dir = install_root / "include"
assert include_dir.exists(), f"Expected {include_dir} to be present"
test_cpp_content = """
// Main torch C++ API headers
#include <torch/torch.h>
#include <torch/all.h>
// ATen tensor library
#include <ATen/ATen.h>
// Core c10 headers (commonly used)
#include <c10/core/Device.h>
#include <c10/core/DeviceType.h>
#include <c10/core/ScalarType.h>
#include <c10/core/TensorOptions.h>
#include <c10/util/Optional.h>
int main() { return 0; }
"""
base_compile_flags = [
"g++",
"-std=c++17",
f"-I{include_dir}",
f"-I{include_dir}/torch/csrc/api/include",
"-c", # Compile only, don't link
]
# Compile WITHOUT any macros
symbols_without = _compile_and_extract_symbols(
cpp_content=test_cpp_content,
compile_flags=base_compile_flags,
)
# We expect constexpr symbols, inline functions used by other headers etc.
# to produce symbols
num_symbols_without = len(symbols_without)
print(f"Found {num_symbols_without} symbols without any macros defined")
assert num_symbols_without != 0, (
"Expected a non-zero number of symbols without any macros"
)
# Compile WITH TORCH_STABLE_ONLY (expect 0 symbols)
compile_flags_with_stable_only = base_compile_flags + ["-DTORCH_STABLE_ONLY"]
symbols_with_stable_only = _compile_and_extract_symbols(
cpp_content=test_cpp_content,
compile_flags=compile_flags_with_stable_only,
)
num_symbols_with_stable_only = len(symbols_with_stable_only)
assert num_symbols_with_stable_only == 0, (
f"Expected no symbols with TORCH_STABLE_ONLY macro, but found {num_symbols_with_stable_only}"
)
# Compile WITH TORCH_TARGET_VERSION (expect 0 symbols)
compile_flags_with_target_version = base_compile_flags + [
"-DTORCH_TARGET_VERSION=1"
]
symbols_with_target_version = _compile_and_extract_symbols(
cpp_content=test_cpp_content,
compile_flags=compile_flags_with_target_version,
)
num_symbols_with_target_version = len(symbols_with_target_version)
assert num_symbols_with_target_version == 0, (
f"Expected no symbols with TORCH_TARGET_VERSION macro, but found {num_symbols_with_target_version}"
)
# Compile WITH both macros (expect 0 symbols)
compile_flags_with_both = base_compile_flags + [
"-DTORCH_STABLE_ONLY",
"-DTORCH_TARGET_VERSION=1",
]
symbols_with_both = _compile_and_extract_symbols(
cpp_content=test_cpp_content,
compile_flags=compile_flags_with_both,
)
num_symbols_with_both = len(symbols_with_both)
assert num_symbols_with_both == 0, (
f"Expected no symbols with both macros, but found {num_symbols_with_both}"
)
def check_stable_api_symbols(install_root: Path) -> None:
"""
Test that stable API headers still expose symbols with TORCH_STABLE_ONLY.
The torch/csrc/stable/c/shim.h header is tested in check_stable_c_shim_symbols
"""
include_dir = install_root / "include"
assert include_dir.exists(), f"Expected {include_dir} to be present"
stable_dir = include_dir / "torch" / "csrc" / "stable"
assert stable_dir.exists(), f"Expected {stable_dir} to be present"
stable_headers = list(stable_dir.rglob("*.h"))
if not stable_headers:
raise RuntimeError("Could not find any stable headers")
includes = []
for header in stable_headers:
rel_path = header.relative_to(include_dir)
includes.append(f"#include <{rel_path.as_posix()}>")
includes_str = "\n".join(includes)
test_stable_content = f"""
{includes_str}
int main() {{ return 0; }}
"""
compile_flags = [
"g++",
"-std=c++17",
f"-I{include_dir}",
f"-I{include_dir}/torch/csrc/api/include",
"-c",
"-DTORCH_STABLE_ONLY",
]
symbols_stable = _compile_and_extract_symbols(
cpp_content=test_stable_content,
compile_flags=compile_flags,
)
num_symbols_stable = len(symbols_stable)
print(f"Found {num_symbols_stable} symbols in torch/csrc/stable")
assert num_symbols_stable > 0, (
f"Expected stable headers to expose symbols with TORCH_STABLE_ONLY, "
f"but found {num_symbols_stable} symbols"
)
def check_headeronly_symbols(install_root: Path) -> None:
"""
Test that header-only utility headers still expose symbols with TORCH_STABLE_ONLY.
"""
include_dir = install_root / "include"
assert include_dir.exists(), f"Expected {include_dir} to be present"
# Find all headers in torch/headeronly
headeronly_dir = include_dir / "torch" / "headeronly"
assert headeronly_dir.exists(), f"Expected {headeronly_dir} to be present"
headeronly_headers = list(headeronly_dir.rglob("*.h"))
if not headeronly_headers:
raise RuntimeError("Could not find any headeronly headers")
# Filter out platform-specific headers that may not compile everywhere
platform_specific_keywords = [
"cpu/vec",
]
filtered_headers = []
for header in headeronly_headers:
rel_path = header.relative_to(include_dir).as_posix()
if not any(
keyword in rel_path.lower() for keyword in platform_specific_keywords
):
filtered_headers.append(header)
includes = []
for header in filtered_headers:
rel_path = header.relative_to(include_dir)
includes.append(f"#include <{rel_path.as_posix()}>")
includes_str = "\n".join(includes)
test_headeronly_content = f"""
{includes_str}
int main() {{ return 0; }}
"""
compile_flags = [
"g++",
"-std=c++17",
f"-I{include_dir}",
f"-I{include_dir}/torch/csrc/api/include",
"-c",
"-DTORCH_STABLE_ONLY",
]
symbols_headeronly = _compile_and_extract_symbols(
cpp_content=test_headeronly_content,
compile_flags=compile_flags,
)
num_symbols_headeronly = len(symbols_headeronly)
print(f"Found {num_symbols_headeronly} symbols in torch/headeronly")
assert num_symbols_headeronly > 0, (
f"Expected headeronly headers to expose symbols with TORCH_STABLE_ONLY, "
f"but found {num_symbols_headeronly} symbols"
)
def check_aoti_shim_symbols(install_root: Path) -> None:
"""
Test that AOTI shim headers still expose symbols with TORCH_STABLE_ONLY.
"""
include_dir = install_root / "include"
assert include_dir.exists(), f"Expected {include_dir} to be present"
# There are no constexpr symbols etc., so we need to actually use functions
# so that some symbols are found.
test_shim_content = """
#include <torch/csrc/inductor/aoti_torch/c/shim.h>
int main() {
int32_t (*fp1)() = &aoti_torch_device_type_cpu;
int32_t (*fp2)() = &aoti_torch_dtype_float32;
(void)fp1; (void)fp2;
return 0;
}
"""
compile_flags = [
"g++",
"-std=c++17",
f"-I{include_dir}",
f"-I{include_dir}/torch/csrc/api/include",
"-c",
"-DTORCH_STABLE_ONLY",
]
symbols_shim = _compile_and_extract_symbols(
cpp_content=test_shim_content,
compile_flags=compile_flags,
)
num_symbols_shim = len(symbols_shim)
assert num_symbols_shim > 0, (
f"Expected shim headers to expose symbols with TORCH_STABLE_ONLY, "
f"but found {num_symbols_shim} symbols"
)
def check_stable_c_shim_symbols(install_root: Path) -> None:
"""
Test that stable C shim headers still expose symbols with TORCH_STABLE_ONLY.
"""
include_dir = install_root / "include"
assert include_dir.exists(), f"Expected {include_dir} to be present"
# Check if the stable C shim exists
stable_shim = include_dir / "torch" / "csrc" / "stable" / "c" / "shim.h"
if not stable_shim.exists():
raise RuntimeError("Could not find stable c shim")
# There are no constexpr symbols etc., so we need to actually use functions
# so that some symbols are found.
test_stable_shim_content = """
#include <torch/csrc/stable/c/shim.h>
int main() {
// Reference stable C API functions to create undefined symbols
AOTITorchError (*fp1)(const char*, uint32_t*, int32_t*) = &torch_parse_device_string;
AOTITorchError (*fp2)(uint32_t*) = &torch_get_num_threads;
(void)fp1; (void)fp2;
return 0;
}
"""
compile_flags = [
"g++",
"-std=c++17",
f"-I{include_dir}",
f"-I{include_dir}/torch/csrc/api/include",
"-c",
"-DTORCH_STABLE_ONLY",
]
symbols_stable_shim = _compile_and_extract_symbols(
cpp_content=test_stable_shim_content,
compile_flags=compile_flags,
)
num_symbols_stable_shim = len(symbols_stable_shim)
assert num_symbols_stable_shim > 0, (
f"Expected stable C shim headers to expose symbols with TORCH_STABLE_ONLY, "
f"but found {num_symbols_stable_shim} symbols"
)
def check_lib_symbols_for_abi_correctness(lib: str) -> None:
print(f"lib: {lib}")
cxx11_symbols = grep_symbols(lib, LIBTORCH_CXX11_PATTERNS)
@ -129,6 +460,13 @@ def main() -> None:
check_lib_symbols_for_abi_correctness(libtorch_cpu_path)
check_lib_statically_linked_libstdc_cxx_abi_symbols(libtorch_cpu_path)
# Check symbols when TORCH_STABLE_ONLY is defined
check_stable_only_symbols(install_root)
check_stable_api_symbols(install_root)
check_headeronly_symbols(install_root)
check_aoti_shim_symbols(install_root)
check_stable_c_shim_symbols(install_root)
if __name__ == "__main__":
main()
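`_compile_and_extract_symbols` depends on `get_symbols`, which is defined elsewhere in this script and does not appear in the diff. The sketch below is a guess at the general shape of such a helper: it parses text in the default `nm` output format into `(address, type, name)` tuples, matching how the caller unpacks them. `parse_nm_output` and the sample listing are hypothetical.

```python
# Hypothetical helper: parse default `nm` text output into tuples.
# This is not the script's get_symbols(); it only illustrates the
# (address, symbol type, name) shape that the caller unpacks.
def parse_nm_output(text: str) -> list[tuple[str, str, str]]:
    symbols = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 3:
            addr, stype, name = parts  # defined symbol
        elif len(parts) == 2:
            stype, name = parts  # undefined symbol: nm prints no address
            addr = ""
        else:
            continue  # skip blank or unexpected lines
        symbols.append((addr, stype, name))
    return symbols


# Made-up listing resembling an object file that only references shim functions.
SAMPLE = """\
0000000000000000 T main
                 U aoti_torch_device_type_cpu
                 U aoti_torch_dtype_float32
"""

if __name__ == "__main__":
    names = [n for _addr, _stype, n in parse_nm_output(SAMPLE) if n not in ["main"]]
    print(names)
```

Undefined (`U`) entries are what the shim checks count on: taking a function's address in `main` forces a reference into the object file even though nothing is linked.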


@ -353,6 +353,17 @@ def test_linalg(device="cpu") -> None:
torch.linalg.svd(A)
def test_sdpa(device="cpu", dtype=torch.float16) -> None:
    """Regression test for https://github.com/pytorch/pytorch/issues/167602
    Without nvrtc_builtins, SDPA with CuDNN-9.13 on CUDA-13 fails with `No valid execution plans built.`
    """
    print(f"Testing SDPA on {device} using type {dtype}")
    k, q, v = torch.rand(3, 1, 16, 77, 64, dtype=dtype, device=device).unbind(0)
    attn = torch.rand(1, 1, 77, 77, dtype=dtype, device=device)
    rc = torch.nn.functional.scaled_dot_product_attention(q, k, v, attn)
    assert rc.isnan().any().item() is False
def smoke_test_compile(device: str = "cpu") -> None:
supported_dtypes = [torch.float16, torch.float32, torch.float64]
@ -489,10 +500,12 @@ def main() -> None:
smoke_test_conv2d()
test_linalg()
test_numpy()
test_sdpa()
if is_cuda_system:
test_linalg("cuda")
test_cuda_gds_errors_captured()
test_sdpa("cuda")
if options.package == "all":
smoke_test_modules()
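For context on what `test_sdpa` exercises: with a floating-point `attn_mask`, scaled dot-product attention computes softmax(QK^T / sqrt(d) + mask) V. The pure-Python sketch below handles a single head stored as nested lists. `sdpa_reference` is a hypothetical name, and this is a reference formula, not how PyTorch implements the operator.

```python
import math


def sdpa_reference(q, k, v, mask):
    """Single-head attention on nested lists: softmax(q k^T / sqrt(d) + mask) v."""
    d = len(q[0])
    out = []
    for i, q_row in enumerate(q):
        # Scaled score of query i against every key, plus the additive mask.
        scores = [
            sum(a * b for a, b in zip(q_row, k_row)) / math.sqrt(d) + mask[i][j]
            for j, k_row in enumerate(k)
        ]
        # Numerically stable softmax: subtract the row maximum before exp.
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        # Weighted average of the value rows.
        out.append(
            [
                sum(w * v_row[c] for w, v_row in zip(weights, v)) / total
                for c in range(len(v[0]))
            ]
        )
    return out


if __name__ == "__main__":
    # A zero query gives uniform weights, so the output is the mean of the value rows.
    print(sdpa_reference([[0.0, 0.0]], [[1.0, 0.0], [0.0, 1.0]],
                         [[2.0, 4.0], [6.0, 8.0]], [[0.0, 0.0]]))
```

The smoke test only asserts that the result contains no NaNs; a reference like this would be needed to check numerical correctness as well.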


@ -344,8 +344,18 @@ test_python_smoke() {
}
test_python_smoke_b200() {
# Targeted smoke tests for B200 - staged approach to avoid too many failures
time python test/run_test.py --include test_matmul_cuda test_scaled_matmul_cuda inductor/test_fp8 $PYTHON_TEST_EXTRA_OPTION --upload-artifacts-while-running
# Targeted smoke tests for B200 including FlashAttention CuTe coverage
install_flash_attn_cute
time python test/run_test.py \
--include \
test_matmul_cuda \
test_scaled_matmul_cuda \
inductor/test_fp8 \
nn/attention/test_fa4 \
nn/attention/test_open_registry \
inductor/test_flex_flash \
$PYTHON_TEST_EXTRA_OPTION \
--upload-artifacts-while-running
assert_git_not_dirty
}
@ -1670,6 +1680,22 @@ test_operator_microbenchmark() {
done
}
test_attention_microbenchmark() {
TEST_REPORTS_DIR=$(pwd)/test/test-reports
mkdir -p "$TEST_REPORTS_DIR"
TEST_DIR=$(pwd)
# Install attention-gym dependency
echo "Installing attention-gym..."
python -m pip install git+https://github.com/meta-pytorch/attention-gym.git@main
pip show triton
cd "${TEST_DIR}"/benchmarks/transformer
$TASKSET python score_mod.py --config configs/config_basic.yaml \
--output-json-for-dashboard "${TEST_REPORTS_DIR}/attention_microbenchmark.json"
}
if ! [[ "${BUILD_ENVIRONMENT}" == *libtorch* || "${BUILD_ENVIRONMENT}" == *-bazel-* ]]; then
(cd test && python -c "import torch; print(torch.__config__.show())")
(cd test && python -c "import torch; print(torch.__config__.parallel_info())")
@ -1727,6 +1753,8 @@ elif [[ "${TEST_CONFIG}" == *operator_benchmark* ]]; then
fi
elif [[ "${TEST_CONFIG}" == *operator_microbenchmark* ]]; then
test_operator_microbenchmark
elif [[ "${TEST_CONFIG}" == *attention_microbenchmark* ]]; then
test_attention_microbenchmark
elif [[ "${TEST_CONFIG}" == *inductor_distributed* ]]; then
test_inductor_distributed
elif [[ "${TEST_CONFIG}" == *inductor-halide* ]]; then


@ -63,7 +63,7 @@ self-hosted-runner:
- linux.rocm.gpu.gfx942.1
- linux.rocm.gpu.gfx942.2
- linux.rocm.gpu.gfx942.4
- rocm-docker
- linux.rocm.gfx942.docker-cache
# Org-wide AWS `mac2.metal` runners (2020 Mac mini hardware powered by Apple silicon M1 processors)
- macos-m1-stable
- macos-m1-14


@ -1 +1 @@
ad5816f0eee1c873df1b7d371c69f1f811a89387
07b6cbde121417a70e4dc871adb6d27030e0ce3f


@ -1 +1 @@
ccb801b88af136454798b945175c4c87e636ac33
acccf86477759b2d3500f1ae1be065f7b1e409ec


@ -50,7 +50,7 @@ def get_tag() -> str:
def get_base_version() -> str:
root = get_pytorch_root()
dirty_version = open(root / "version.txt").read().strip()
dirty_version = Path(root / "version.txt").read_text().strip()
# Strips trailing a0 from version.txt, not too sure why it's there in the
# first place
return re.sub(LEGACY_BASE_VERSION_SUFFIX_PATTERN, "", dirty_version)
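The one-line change in this hunk swaps a bare `open(...).read()` for `Path.read_text()`, which opens, reads, and closes the file in a single call. The sketch below shows the pattern in isolation. The regex `a0$` is an assumed stand-in for `LEGACY_BASE_VERSION_SUFFIX_PATTERN`, whose definition is not shown in the diff, and `read_base_version` and the sample version string are hypothetical.

```python
import re
import tempfile
from pathlib import Path

# Assumption: the real LEGACY_BASE_VERSION_SUFFIX_PATTERN is not shown in the diff.
ASSUMED_SUFFIX_PATTERN = re.compile(r"a0$")


def read_base_version(version_file: Path) -> str:
    """Read a version file and strip a trailing legacy suffix."""
    # read_text() closes the file handle itself, unlike a bare open().read().
    dirty_version = version_file.read_text().strip()
    return ASSUMED_SUFFIX_PATTERN.sub("", dirty_version)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmpdir:
        path = Path(tmpdir) / "version.txt"
        path.write_text("2.10.0a0\n")  # made-up sample content
        print(read_base_version(path))
```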


@ -260,11 +260,8 @@ jobs:
"${DOCKER_IMAGE}"
)
docker exec -t -w "${PYTORCH_ROOT}" "${container_name}" bash -c "bash .circleci/scripts/binary_populate_env.sh"
if [[ ${BUILD_ENVIRONMENT} == *"aarch64"* ]]; then
docker exec -t "${container_name}" bash -c "source ${BINARY_ENV_FILE} && bash /pytorch/.ci/aarch64_linux/aarch64_ci_build.sh"
else
docker exec -t "${container_name}" bash -c "source ${BINARY_ENV_FILE} && bash /pytorch/.ci/${{ inputs.PACKAGE_TYPE }}/build.sh"
fi
# Unified build script for all architectures (x86_64, aarch64, s390x)
docker exec -t "${container_name}" bash -c "source ${BINARY_ENV_FILE} && bash /pytorch/.ci/${{ inputs.PACKAGE_TYPE }}/build.sh"
- name: Chown artifacts
if: ${{ steps.filter.outputs.is-test-matrix-empty == 'False' && inputs.build_environment != 'linux-s390x-binary-manywheel' }}


@ -0,0 +1,73 @@
name: attention_op_microbenchmark
on:
push:
tags:
- ciflow/op-benchmark/*
workflow_dispatch:
schedule:
# Run at 07:00 UTC every day
- cron: 0 7 * * *
concurrency:
group: ${{ github.workflow }}-${{ github.event.pull_request.number || github.ref_name }}-${{ github.ref_type == 'branch' && github.sha }}-${{ github.event_name == 'workflow_dispatch' }}-${{ github.event_name == 'schedule' }}
cancel-in-progress: true
permissions:
id-token: write
contents: read
jobs:
attn-microbenchmark-build:
if: github.repository_owner == 'pytorch'
uses: ./.github/workflows/_linux-build.yml
with:
runner: linux.12xlarge.memory
build-environment: linux-jammy-cuda12.8-py3.10-gcc9-sm80
docker-image-name: ci-image:pytorch-linux-jammy-cuda12.8-cudnn9-py3-gcc11
cuda-arch-list: '8.0 9.0'
test-matrix: |
{ include: [
{ config: "attention_microbenchmark_test", shard: 1, num_shards: 1, runner: "linux.aws.a100" },
{ config: "attention_microbenchmark_test", shard: 1, num_shards: 1, runner: "linux.aws.h100" },
]}
secrets: inherit
attn-microbenchmark-test:
name: attn-microbenchmark-test
uses: ./.github/workflows/_linux-test.yml
needs: attn-microbenchmark-build
with:
timeout-minutes: 500
build-environment: linux-jammy-cuda12.8-py3.10-gcc9-sm80
docker-image: ${{ needs.attn-microbenchmark-build.outputs.docker-image }}
test-matrix: ${{ needs.attn-microbenchmark-build.outputs.test-matrix }}
secrets: inherit
# B200 runner
opmicrobenchmark-build-b200:
if: github.repository_owner == 'pytorch'
name: opmicrobenchmark-build-b200
uses: ./.github/workflows/_linux-build.yml
with:
runner: linux.12xlarge.memory
build-environment: linux-jammy-cuda12.8-py3.10-gcc9-sm100
docker-image-name: ci-image:pytorch-linux-jammy-cuda12.8-cudnn9-py3-gcc11
cuda-arch-list: '10.0'
test-matrix: |
{ include: [
{ config: "operator_microbenchmark_test", shard: 1, num_shards: 1, runner: "linux.dgx.b200" },
]}
secrets: inherit
opmicrobenchmark-test-b200:
name: opmicrobenchmark-test-b200
uses: ./.github/workflows/_linux-test.yml
needs: opmicrobenchmark-build-b200
with:
timeout-minutes: 500
build-environment: linux-jammy-cuda12.8-py3.10-gcc9-sm100
docker-image: ${{ needs.opmicrobenchmark-build-b200.outputs.docker-image }}
test-matrix: ${{ needs.opmicrobenchmark-build-b200.outputs.test-matrix }}
aws-role-to-assume: arn:aws:iam::308535385114:role/gha_workflow_s3_and_ecr_read_only
secrets: inherit


@ -37,6 +37,7 @@ jobs:
needs: get-label-type
with:
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
runner: linux.12xlarge.memory
build-environment: linux-jammy-cuda12.8-py3.10-gcc11-distributed-b200
docker-image-name: ci-image:pytorch-linux-jammy-cuda12.8-cudnn9-py3-gcc11
cuda-arch-list: '10.0'


@ -37,6 +37,7 @@ jobs:
needs: get-label-type
with:
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
runner: linux.12xlarge.memory
build-environment: linux-jammy-cuda12.8-py3.10-gcc11-sm100-symm
docker-image-name: ci-image:pytorch-linux-jammy-cuda12.8-cudnn9-py3-gcc11
cuda-arch-list: '10.0'


@ -119,6 +119,22 @@ jobs:
with:
docker-image: ${{ steps.build-docker-image.outputs.docker-image }}
- name: Generate output
if: contains(matrix.docker-image-name, 'rocm')
id: generate_output
run: |
docker_image_name="${{ matrix.docker-image-name }}"
docker_image_tag="${{ steps.build-docker-image.outputs.docker-image }}"
echo "${docker_image_name}=${docker_image_tag}" >> docker-builds-output-${docker_image_name}.txt
- name: Upload artifacts
uses: actions/upload-artifact@v4.4.0
if: contains(matrix.docker-image-name, 'rocm')
with:
name: docker-builds-artifacts-${{ matrix.docker-image-name }}
retention-days: 14
path: ./docker-builds-output-${{ matrix.docker-image-name }}.txt
- uses: nick-fields/retry@7152eba30c6575329ac0576536151aca5a72780e # v3.0.0
name: Push to https://ghcr.io/
id: push-to-ghcr-io


@ -1,55 +0,0 @@
name: docker-cache-mi300
on:
# run every 6 hours
schedule:
- cron: 0 0,6,12,18 * * *
workflow_dispatch:
concurrency:
group: ${{ github.workflow }}-${{ github.event.pull_request.number || github.sha }}-${{ github.event_name }}
cancel-in-progress: true
permissions:
id-token: write
contents: read
jobs:
docker-cache:
if: github.repository_owner == 'pytorch'
runs-on: rocm-docker
steps:
- name: Checkout PyTorch
uses: pytorch/pytorch/.github/actions/checkout-pytorch@main
with:
no-sudo: true
- name: configure aws credentials
id: aws_creds
uses: aws-actions/configure-aws-credentials@ececac1a45f3b08a01d2dd070d28d111c5fe6722 # v4.1.0
with:
role-to-assume: arn:aws:iam::308535385114:role/gha_workflow_s3_and_ecr_read_only
aws-region: us-east-1
role-duration-seconds: 18000
- name: Login to Amazon ECR
id: login-ecr
continue-on-error: false
uses: aws-actions/amazon-ecr-login@062b18b96a7aff071d4dc91bc00c4c1a7945b076 # v2.0.1
- name: Calculate docker image
id: calculate-docker-image
uses: pytorch/test-infra/.github/actions/calculate-docker-image@main
with:
docker-image-name: ci-image:pytorch-linux-jammy-rocm-n-py3
push: false
- name: Pull docker image
uses: pytorch/test-infra/.github/actions/pull-docker-image@main
with:
docker-image: ${{ steps.calculate-docker-image.outputs.docker-image }}
- name: Tar and upload to S3 bucket
run: |
sudo docker save -o ~/docker-data/pytorch/pytorch_docker_image.tar ${{ steps.calculate-docker-image.outputs.docker-image }}
sudo rclone copy -P --s3-upload-concurrency 64 --s3-chunk-size 200M --s3-upload-cutoff 300M ~/docker-data/pytorch/pytorch_docker_image.tar oci:pytorchbucket0002/pytorch_docker_image --progress

.github/workflows/docker-cache-rocm.yml vendored Normal file

@ -0,0 +1,105 @@
name: docker-cache-rocm
on:
workflow_run:
workflows: [docker-builds]
branches: [main, release]
types:
- completed
workflow_dispatch:
concurrency:
group: ${{ github.workflow }}-${{ github.event.pull_request.number || github.sha }}-${{ github.event_name }}
cancel-in-progress: true
permissions:
id-token: write
contents: read
actions: read
jobs:
download-docker-builds-artifacts:
if: github.repository_owner == 'pytorch'
name: download-docker-builds-artifacts
runs-on: ubuntu-latest
outputs:
pytorch-linux-jammy-rocm-n-py3: ${{ steps.process-artifacts.outputs.pytorch-linux-jammy-rocm-n-py3 }}
pytorch-linux-noble-rocm-n-py3: ${{ steps.process-artifacts.outputs.pytorch-linux-noble-rocm-n-py3 }}
pytorch-linux-jammy-rocm-n-py3-benchmarks: ${{ steps.process-artifacts.outputs.pytorch-linux-jammy-rocm-n-py3-benchmarks }}
steps:
- name: Download artifacts
uses: actions/download-artifact@v4.1.7
with:
run-id: ${{ github.event.workflow_run.id }}
path: ./docker-builds-artifacts
merge-multiple: true
github-token: ${{ secrets.GITHUB_TOKEN }}
- name: Process artifacts
id: process-artifacts
run: |
ls -R ./docker-builds-artifacts
cat ./docker-builds-artifacts/*txt >> "${GITHUB_OUTPUT}"
cat "${GITHUB_OUTPUT}"
docker-cache:
if: github.repository_owner == 'pytorch'
needs: download-docker-builds-artifacts
strategy:
fail-fast: false
matrix:
runner: [linux.rocm.gfx942.docker-cache]
docker-image: [
"${{ needs.download-docker-builds-artifacts.outputs.pytorch-linux-jammy-rocm-n-py3 }}",
"${{ needs.download-docker-builds-artifacts.outputs.pytorch-linux-noble-rocm-n-py3 }}",
"${{ needs.download-docker-builds-artifacts.outputs.pytorch-linux-jammy-rocm-n-py3-benchmarks }}"
]
runs-on: "${{ matrix.runner }}"
steps:
- name: debug
run: |
JSON_STRINGIFIED="${{ toJSON(needs.download-docker-builds-artifacts.outputs) }}"
echo "Outputs of download-docker-builds-artifacts job: ${JSON_STRINGIFIED}"
- name: configure aws credentials
id: aws_creds
uses: aws-actions/configure-aws-credentials@ececac1a45f3b08a01d2dd070d28d111c5fe6722 # v4.1.0
with:
role-to-assume: arn:aws:iam::308535385114:role/gha_workflow_s3_and_ecr_read_only
aws-region: us-east-1
role-duration-seconds: 18000
- name: Login to Amazon ECR
id: login-ecr
continue-on-error: false
uses: aws-actions/amazon-ecr-login@062b18b96a7aff071d4dc91bc00c4c1a7945b076 # v2.0.1
- name: Generate ghcr.io tag
id: ghcr-io-tag
run: |
ecr_image="${{ matrix.docker-image }}"
ghcr_image="ghcr.io/pytorch/ci-image:${ecr_image##*:}"
echo "ghcr_image=${ghcr_image}" >> "$GITHUB_OUTPUT"
- name: Pull docker image
uses: pytorch/test-infra/.github/actions/pull-docker-image@main
with:
docker-image: ${{ steps.ghcr-io-tag.outputs.ghcr_image }}
- name: Save as tarball
run: |
docker_image_tag=${{ matrix.docker-image }}
docker_image_tag="${docker_image_tag#*:}" # Remove everything before and including first ":"
docker_image_tag="${docker_image_tag%-*}" # Remove everything after and including last "-"
ref_name=${{ github.event.workflow_run.head_branch }}
if [[ $ref_name =~ "release/" ]]; then
ref_suffix="release"
elif [[ $ref_name == "main" ]]; then
ref_suffix="main"
else
echo "Unexpected branch in ref_name: ${ref_name}" && exit 1
fi
docker tag ${{ steps.ghcr-io-tag.outputs.ghcr_image }} ${{ matrix.docker-image }}
# mv is an atomic operation, so we write to an intermediate tar.tmp file to prevent read-write contention
docker save -o ~/pytorch-data/docker/${docker_image_tag}.tar.tmp ${{ matrix.docker-image }}
mv ~/pytorch-data/docker/${docker_image_tag}.tar.tmp ~/pytorch-data/docker/${docker_image_tag}_${ref_suffix}.tar
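The tag handling in this workflow leans on three Bash parameter expansions (`##*:`, `#*:`, `%-*`) plus a branch check. The sketch below restates that string logic in Python to make the intended result easier to see. The function names and the sample image reference are hypothetical, and the sketch assumes the image reference looks like `<registry path>:<name>-<hash>`.

```python
# Illustrative restatement of the workflow's shell string handling.
def ghcr_image_for(ecr_image: str) -> str:
    # Mirrors ${ecr_image##*:}: keep what follows the last ":".
    return "ghcr.io/pytorch/ci-image:" + ecr_image.rsplit(":", 1)[-1]


def tarball_name(docker_image: str, head_branch: str) -> str:
    # Mirrors ${var#*:}: drop everything up to and including the first ":".
    tag = docker_image.split(":", 1)[-1]
    # Mirrors ${var%-*}: drop the last "-" and everything after it.
    tag = tag.rsplit("-", 1)[0]
    # Mirrors [[ $ref_name =~ "release/" ]]: a quoted pattern is a substring match.
    if "release/" in head_branch:
        suffix = "release"
    elif head_branch == "main":
        suffix = "main"
    else:
        raise ValueError(f"Unexpected branch in ref_name: {head_branch}")
    return f"{tag}_{suffix}.tar"


if __name__ == "__main__":
    image = "registry.example/pytorch/ci-image:pytorch-linux-jammy-rocm-n-py3-0123abcd"
    print(ghcr_image_for(image))
    print(tarball_name(image, "main"))
```

Because the hash is stripped from the file name, each branch keeps one tarball per image name, and the `mv` from the `.tar.tmp` file replaces it in a single step.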


@ -5,7 +5,9 @@
# Flow:
# 1. Builds PyTorch with CUDA 12.8+ and sm100 architecture for B200
# 2. Runs smoke tests on linux.dgx.b200 runner
# 3. Tests executed are defined in .ci/pytorch/test.sh -> test_python_smoke() function
# 3. Tests executed are defined in .ci/pytorch/test.sh -> test_python_smoke_b200() function
# - Includes matmul, scaled_matmul, FP8, and FlashAttention CuTe tests
# - FlashAttention CuTe DSL is installed as part of test execution
#
# Triggered by:
# - Pull requests modifying this workflow file
@ -52,6 +54,7 @@ jobs:
needs: get-label-type
with:
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
runner: linux.12xlarge.memory
build-environment: linux-jammy-cuda12.8-py3.10-gcc11-sm100
docker-image-name: ci-image:pytorch-linux-jammy-cuda12.8-cudnn9-py3-gcc11
cuda-arch-list: '10.0'
@ -72,4 +75,4 @@ jobs:
docker-image: ${{ needs.linux-jammy-cuda12_8-py3_10-gcc11-sm100-build.outputs.docker-image }}
test-matrix: ${{ needs.linux-jammy-cuda12_8-py3_10-gcc11-sm100-build.outputs.test-matrix }}
aws-role-to-assume: arn:aws:iam::308535385114:role/gha_workflow_s3_and_ecr_read_only
secrets: inherit
secrets: inherit

.github/workflows/trunk-rocm-mi300.yml vendored Normal file

@ -0,0 +1,83 @@
name: trunk-rocm-mi300
on:
push:
branches:
- main
- release/*
workflow_dispatch:
schedule:
- cron: 29 8 * * * # about 1:29am PDT
concurrency:
group: ${{ github.workflow }}-${{ github.event.pull_request.number || github.ref_name }}-${{ github.ref_type == 'branch' && github.sha }}-${{ github.event_name == 'workflow_dispatch' }}-${{ github.event_name == 'schedule' }}
cancel-in-progress: true
permissions:
id-token: write
contents: read
jobs:
llm-td:
if: github.repository_owner == 'pytorch'
name: before-test
uses: ./.github/workflows/llm_td_retrieval.yml
permissions:
id-token: write
contents: read
target-determination:
name: before-test
uses: ./.github/workflows/target_determination.yml
needs: llm-td
permissions:
id-token: write
contents: read
get-label-type:
name: get-label-type
uses: pytorch/pytorch/.github/workflows/_runner-determinator.yml@main
if: ${{ (github.event_name != 'schedule' || github.repository == 'pytorch/pytorch') && github.repository_owner == 'pytorch' }}
with:
triggering_actor: ${{ github.triggering_actor }}
issue_owner: ${{ github.event.pull_request.user.login || github.event.issue.user.login }}
curr_branch: ${{ github.head_ref || github.ref_name }}
curr_ref_type: ${{ github.ref_type }}
linux-jammy-rocm-py3_10-build:
name: linux-jammy-rocm-py3.10
uses: ./.github/workflows/_linux-build.yml
needs: get-label-type
with:
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
build-environment: linux-jammy-rocm-py3.10
docker-image-name: ci-image:pytorch-linux-jammy-rocm-n-py3
sync-tag: rocm-build
test-matrix: |
{ include: [
{ config: "default", shard: 1, num_shards: 6, runner: "linux.rocm.gpu.gfx942.1.b" },
{ config: "default", shard: 2, num_shards: 6, runner: "linux.rocm.gpu.gfx942.1.b" },
{ config: "default", shard: 3, num_shards: 6, runner: "linux.rocm.gpu.gfx942.1.b" },
{ config: "default", shard: 4, num_shards: 6, runner: "linux.rocm.gpu.gfx942.1.b" },
{ config: "default", shard: 5, num_shards: 6, runner: "linux.rocm.gpu.gfx942.1.b" },
{ config: "default", shard: 6, num_shards: 6, runner: "linux.rocm.gpu.gfx942.1.b" },
{ config: "distributed", shard: 1, num_shards: 3, runner: "linux.rocm.gpu.gfx942.4.b" },
{ config: "distributed", shard: 2, num_shards: 3, runner: "linux.rocm.gpu.gfx942.4.b" },
{ config: "distributed", shard: 3, num_shards: 3, runner: "linux.rocm.gpu.gfx942.4.b" },
]}
secrets: inherit
linux-jammy-rocm-py3_10-test:
permissions:
id-token: write
contents: read
name: linux-jammy-rocm-py3.10
uses: ./.github/workflows/_rocm-test.yml
needs:
- linux-jammy-rocm-py3_10-build
- target-determination
with:
build-environment: linux-jammy-rocm-py3.10
docker-image: ${{ needs.linux-jammy-rocm-py3_10-build.outputs.docker-image }}
test-matrix: ${{ needs.linux-jammy-rocm-py3_10-build.outputs.test-matrix }}
secrets: inherit


@ -5,6 +5,7 @@ on:
workflows:
- pull
- trunk
- trunk-rocm-mi300
- periodic
- periodic-rocm-mi200
- periodic-rocm-mi300


@ -18,6 +18,8 @@
#include <unordered_set>
#include <utility>
C10_DIAGNOSTIC_PUSH_AND_IGNORED_IF_DEFINED("-Wswitch-default")
namespace torch {
class TORCH_API CustomClassHolder : public c10::intrusive_ptr_target {};
namespace jit {
@ -1630,4 +1632,6 @@ struct TORCH_API WeakOrStrongTypePtr {
} // namespace c10
C10_DIAGNOSTIC_POP()
#include <ATen/core/ivalue_inl.h> // IWYU pragma: keep

View File

@ -29,6 +29,8 @@
#include <c10/util/intrusive_ptr.h>
#include <c10/util/irange.h>
C10_DIAGNOSTIC_PUSH_AND_IGNORED_IF_DEFINED("-Wswitch-default")
namespace torch {
namespace jit {
struct Function;
@ -2567,3 +2569,5 @@ TypePtr IValue::type() const {
}
} // namespace c10
C10_DIAGNOSTIC_POP()

View File

@ -11,6 +11,8 @@
#include <sleef.h>
#endif
C10_DIAGNOSTIC_PUSH_AND_IGNORED_IF_DEFINED("-Wswitch-default")
// Sleef offers vectorized versions of some transcedentals
// such as sin, cos, tan etc..
// However for now opting for STL, since we are not building
@ -650,3 +652,5 @@ inline Vectorized<float> Vectorized<float>::erf() const {
} // namespace CPU_CAPABILITY
} // namespace at::vec
C10_DIAGNOSTIC_POP()

View File

@ -1,6 +1,7 @@
#include <ATen/cuda/CUDAGeneratorImpl.h>
#include <ATen/cuda/CUDAGraph.h>
#include <ATen/cuda/Exceptions.h>
#include <ATen/cuda/MemPool.h>
#include <ATen/Functions.h>
#include <c10/cuda/CUDAFunctions.h>
@ -13,7 +14,7 @@ static bool _cuda_graphs_debug = false;
MempoolId_t graph_pool_handle() {
// Sets just the second value, to distinguish it from MempoolId_ts created from
// cudaStreamGetCaptureInfo id_s in capture_begin.
return c10::cuda::MemPool::graph_pool_handle();
return at::cuda::MemPool::graph_pool_handle();
}
/**
@ -90,7 +91,7 @@ void CUDAGraph::capture_begin(MempoolId_t pool/*=0*/, cudaStreamCaptureMode capt
} else {
// User did not ask us to share a mempool. Create graph pool handle using is_user_created=false.
// Sets just the first value, to distinguish it from MempoolId_ts created by graph_pool_handle().
mempool_id_ = c10::cuda::MemPool::graph_pool_handle(false);
mempool_id_ = at::cuda::MemPool::graph_pool_handle(false);
TORCH_INTERNAL_ASSERT(mempool_id_.first > 0);
}

View File

@ -0,0 +1,69 @@
#include <ATen/core/CachingHostAllocator.h>
#include <ATen/cuda/MemPool.h>
namespace at::cuda {
// uid_ is incremented when a user creates a MemPool,
// for example: using graph_pool_handle() or c10::cuda::MemPool().
//
// uuid_ is incremented when CUDAGraph creates a MemPool
// as a result of a user not providing a pool.
//
// MempoolId_t of {0, 0} is used to denote when no MemPool has been
// passed to a function, either by user or CUDAGraphs. For example,
// default value of MempoolId_t for capture_begin function is {0, 0}.
// That's why uid_ and uuid_ start at 1.
std::atomic<CaptureId_t> MemPool::uid_{1};
std::atomic<CaptureId_t> MemPool::uuid_{1};
MemPool::MemPool(
CUDACachingAllocator::CUDAAllocator* allocator,
bool is_user_created,
bool use_on_oom)
: allocator_(allocator), is_user_created_(is_user_created) {
if (is_user_created_) {
id_ = {0, uid_++};
} else {
id_ = {uuid_++, 0};
}
device_ = c10::cuda::current_device();
CUDACachingAllocator::createOrIncrefPool(device_, id_, allocator);
if (use_on_oom) {
CUDACachingAllocator::setUseOnOOM(device_, id_);
}
}
MemPool::~MemPool() {
// TORCH_INTERNAL_ASSERT(use_count() == 1);
// We used to assert that TORCH_INTERNAL_ASSERT(use_count() == 1);
// However, this assertion is not true if a memory pool is shared
// with a cuda graph. That CUDAGraph will increase the use count
// until it is reset.
CUDACachingAllocator::releasePool(device_, id_);
c10::cuda::CUDACachingAllocator::emptyCache(id_);
}
MempoolId_t MemPool::id() {
return id_;
}
CUDACachingAllocator::CUDAAllocator* MemPool::allocator() {
return allocator_;
}
int MemPool::use_count() {
return CUDACachingAllocator::getPoolUseCount(device_, id_);
}
c10::DeviceIndex MemPool::device() {
return device_;
}
MempoolId_t MemPool::graph_pool_handle(bool is_user_created) {
if (is_user_created) {
return {0, uid_++};
}
return {uuid_++, 0};
}
} // namespace at::cuda
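
Review note (illustrative sketch, not part of the change): the comments above describe two counters that both start at 1 so that `{0, 0}` can stand for "no pool passed". The snippet below restates that ID scheme in plain C++. `CaptureId`, `PoolId`, `next_pool_id` and `is_unset` are hypothetical names used only for illustration; the real types are `c10::CaptureId_t` and `c10::MempoolId_t`.

```cpp
#include <atomic>
#include <cstdint>
#include <utility>

// Hypothetical stand-ins for c10::CaptureId_t and c10::MempoolId_t.
using CaptureId = std::uint64_t;
using PoolId = std::pair<CaptureId, CaptureId>;

// Both counters start at 1 so that {0, 0} can mean "no pool was passed".
static std::atomic<CaptureId> user_counter{1};   // plays the role of uid_
static std::atomic<CaptureId> graph_counter{1};  // plays the role of uuid_

// User-created pools fill only the second slot; pools created internally
// (for example by a graph capture) fill only the first slot.
PoolId next_pool_id(bool is_user_created) {
  if (is_user_created) {
    return {0, user_counter++};
  }
  return {graph_counter++, 0};
}

// True only for the sentinel value that denotes "no pool".
bool is_unset(const PoolId& id) {
  return id.first == 0 && id.second == 0;
}
```

Because exactly one slot is non-zero, an ID produced for a user pool cannot collide with one produced for a graph-private pool, and neither can equal the `{0, 0}` sentinel.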

View File

@ -0,0 +1,44 @@
#pragma once
#include <c10/core/Allocator.h>
#include <c10/cuda/CUDACachingAllocator.h>
namespace at::cuda {
// Keep BC only
using c10::CaptureId_t;
using c10::MempoolId_t;
// MemPool represents a pool of memory in a caching allocator. Currently,
// it's just the ID of the pool object maintained in the CUDACachingAllocator.
//
// An allocator pointer can be passed to the MemPool to define how the
// allocations should be done in the pool. For example: using a different
// system allocator such as ncclMemAlloc.
struct TORCH_CUDA_CPP_API MemPool {
MemPool(
c10::cuda::CUDACachingAllocator::CUDAAllocator* allocator = nullptr,
bool is_user_created = true,
bool use_on_oom = false);
MemPool(const MemPool&) = delete;
MemPool(MemPool&&) = default;
MemPool& operator=(const MemPool&) = delete;
MemPool& operator=(MemPool&&) = default;
~MemPool();
MempoolId_t id();
c10::cuda::CUDACachingAllocator::CUDAAllocator* allocator();
int use_count();
c10::DeviceIndex device();
static MempoolId_t graph_pool_handle(bool is_user_created = true);
private:
static std::atomic<CaptureId_t> uid_;
static std::atomic<CaptureId_t> uuid_;
c10::cuda::CUDACachingAllocator::CUDAAllocator* allocator_;
bool is_user_created_;
MempoolId_t id_;
c10::DeviceIndex device_;
};
} // namespace at::cuda

View File

@ -1936,7 +1936,7 @@ static bool should_fold(const Tensor& tensor1, const Tensor& tensor2, bool has_o
// We order the tensors. t1 will be the larger tensor
// We can always transpose tensor2 as the dimensions are always >= 1 (precondition from matmul)
// and tensor1_larger iff tensor2.dim() > tensor1.dim(9
// and tensor1_larger iff tensor2.dim() > tensor1.dim()
const auto t1 = tensor1_larger ? MaybeOwned<Tensor>::borrowed(tensor1)
: MaybeOwned<Tensor>::owned(tensor2.mT());
const int64_t dim_t1 = t1->dim();
@ -1948,20 +1948,11 @@ static bool should_fold(const Tensor& tensor1, const Tensor& tensor2, bool has_o
return false;
}
// In this case we *do* incur in an extra copy to avoid creating an unnecessary large tensor in the backward
// Suppose we don't fold here. Let t1.shape = [b, m, n] t2.shape = [n, k] like in a transformer
// t2 will be expanded to a tensor of shape [b, n, k] and then we do t1.bmm(t2_expanded)
// The issue appears in the backward.
// The output gradient g of this operation would have shape [b, m, k]
// The backward wrt. t2 of bmm would be given by t1.mH @ g, which has shape [b, n, k]
// Then, the backward of expand is simply `sum(0)`. As such, we are instantiating a tensor
// of shape [b, n, k] unnecessarily, which may cause a large memory footprint, and in the
// worst case, an OOM
bool t2_requires_grad = tensor1_larger ? tensor2.requires_grad() : tensor1.requires_grad();
if (t2_requires_grad && !has_out) {
// We should be checking !at::GradMode::is_enabled(), but apparently
// this regresses performance in some cases:
// https://github.com/pytorch/pytorch/issues/118548#issuecomment-1916022394
// If we require a gradient, we should fold to minimize backward memory usage - even if this
// leads to a copy in forward because is needed in backward,
// only time we avoid this strict pre-allocated memory usage (has_out = True)
bool requires_grad = tensor1.requires_grad() || tensor2.requires_grad();
if (requires_grad && !has_out) {
return true;
}

View File

@ -142,6 +142,7 @@ Tensor _pack_padded_sequence_backward_symint(const Tensor& grad, c10::SymIntArra
std::tuple<Tensor, Tensor> _pad_packed_sequence(const Tensor& data, const Tensor& _batch_sizes, bool batch_first, const Scalar& padding_value, int64_t total_length) {
auto batch_sizes_t = _batch_sizes.contiguous();
checkLongTensor(batch_sizes_t);
TORCH_CHECK(batch_sizes_t.numel() > 0, "batch_sizes can not be empty");
int64_t * batch_sizes = batch_sizes_t.data_ptr<int64_t>();
int64_t max_batch_size = batch_sizes[0];

View File

@ -1,6 +1,8 @@
#pragma once
#include <c10/util/Exception.h>
C10_DIAGNOSTIC_PUSH_AND_IGNORED_IF_DEFINED("-Wswitch-default")
namespace at::native {
// Used as an interface between the different BLAS-like libraries
@ -21,3 +23,5 @@ static inline char to_blas(TransposeType trans) {
}
} // namespace at::native
C10_DIAGNOSTIC_POP()

View File

@ -1,6 +1,7 @@
#pragma once
#include <ATen/native/CompositeRandomAccessorCommon.h>
#include <thrust/swap.h>
#include <thrust/tuple.h>
namespace at { namespace native {

View File

@ -75,30 +75,52 @@ static inline bool can_use_int32_nhwc(
return true;
}
static inline bool can_use_int32_nchw(
int64_t nbatch, int64_t channels,
int64_t height, int64_t width,
int64_t pooled_height, int64_t pooled_width) {
int64_t hw = height * width;
return can_use_int32_nhwc(
nbatch, channels, height, width,
pooled_height, pooled_width,
channels * hw, // in_stride_n
hw, // in_stride_c
width, // in_stride_h
1 // in_stride_w
);
}
// kernels borrowed from Caffe
template <typename scalar_t>
__global__ void max_pool_forward_nchw(const int nthreads, const scalar_t* bottom_data,
const int64_t channels, const int64_t height,
const int64_t width, const int pooled_height, const int pooled_width,
const int kernel_h, const int kernel_w, const int stride_h,
const int stride_w, const int pad_h, const int pad_w,
const int dilation_h, const int dilation_w, scalar_t* top_data,
template <typename scalar_t, typename index_t>
__global__ void max_pool_forward_nchw(
const index_t nthreads,
const scalar_t* bottom_data,
const int64_t channels,
const int64_t height,
const int64_t width,
const int pooled_height,
const int pooled_width,
const int kernel_h, const int kernel_w,
const int stride_h, const int stride_w,
const int pad_h, const int pad_w,
const int dilation_h, const int dilation_w,
scalar_t* top_data,
int64_t* top_mask) {
CUDA_KERNEL_LOOP(index, nthreads) {
int pw = index % pooled_width;
int ph = (index / pooled_width) % pooled_height;
int c = (index / pooled_width / pooled_height) % channels;
int n = index / pooled_width / pooled_height / channels;
int hstart = ph * stride_h - pad_h;
int wstart = pw * stride_w - pad_w;
int hend = min(hstart + (kernel_h - 1) * dilation_h + 1, height);
int wend = min(wstart + (kernel_w - 1) * dilation_w + 1, width);
CUDA_KERNEL_LOOP_TYPE(index, nthreads, index_t) {
index_t pw = index % pooled_width;
index_t ph = (index / pooled_width) % pooled_height;
index_t c = (index / pooled_width / pooled_height) % channels;
index_t n = index / pooled_width / pooled_height / channels;
index_t hstart = ph * stride_h - pad_h;
index_t wstart = pw * stride_w - pad_w;
index_t hend = min(hstart + (kernel_h - 1) * dilation_h + 1, height);
index_t wend = min(wstart + (kernel_w - 1) * dilation_w + 1, width);
while(hstart < 0)
hstart += dilation_h;
while(wstart < 0)
wstart += dilation_w;
scalar_t maxval = at::numeric_limits<scalar_t>::lower_bound(); // -Infinity
int maxidx = hstart * width + wstart;
index_t maxidx = hstart * width + wstart;
const scalar_t* btm_data = bottom_data + (n * channels + c) * height * width;
for (int h = hstart; h < hend; h += dilation_h) {
for (int w = wstart; w < wend; w += dilation_w) {
@ -251,32 +273,39 @@ __global__ void max_pool_forward_nhwc(
static constexpr int BLOCK_THREADS = 256;
template <typename scalar_t, typename accscalar_t>
template <typename scalar_t, typename accscalar_t, typename index_t>
#if defined (USE_ROCM)
C10_LAUNCH_BOUNDS_2(BLOCK_THREADS, 4)
#else
C10_LAUNCH_BOUNDS_2(BLOCK_THREADS, 8)
#endif
__global__ void max_pool_backward_nchw(const scalar_t* top_diff,
const int64_t* top_mask, const int num, const int64_t channels,
const int64_t height, const int64_t width, const int pooled_height,
const int pooled_width, const int kernel_h, const int kernel_w,
const int stride_h, const int stride_w, const int pad_h, const int pad_w,
__global__ void max_pool_backward_nchw(
const scalar_t* top_diff,
const int64_t* top_mask,
const index_t num,
const index_t channels,
const index_t height,
const index_t width,
const index_t pooled_height,
const index_t pooled_width,
const int kernel_h, const int kernel_w,
const int stride_h, const int stride_w,
const int pad_h, const int pad_w,
const int dilation_h, const int dilation_w,
scalar_t* bottom_diff) {
CUDA_KERNEL_LOOP(index, height*width) {
int h = index / width;
int w = index - h * width;
int phstart = p_start(h, pad_h, kernel_h, dilation_h, stride_h);
int phend = p_end(h, pad_h, pooled_height, stride_h);
int pwstart = p_start(w, pad_w, kernel_w, dilation_w, stride_w);
int pwend = p_end(w, pad_w, pooled_width, stride_w);
for (int n = blockIdx.y; n < num; n += gridDim.y) {
for (int c = blockIdx.z; c < channels; c+= gridDim.z) {
CUDA_KERNEL_LOOP_TYPE(index, height*width, index_t) {
index_t h = index / width;
index_t w = index - h * width;
index_t phstart = p_start(h, pad_h, kernel_h, dilation_h, stride_h);
index_t phend = p_end(h, pad_h, pooled_height, stride_h);
index_t pwstart = p_start(w, pad_w, kernel_w, dilation_w, stride_w);
index_t pwend = p_end(w, pad_w, pooled_width, stride_w);
for (index_t n = blockIdx.y; n < num; n += gridDim.y) {
for (index_t c = blockIdx.z; c < channels; c += gridDim.z) {
accscalar_t gradient = accscalar_t(0);
int offset = (n * channels + c) * pooled_height * pooled_width;
for (int ph = phstart; ph < phend; ++ph) {
for (int pw = pwstart; pw < pwend; ++pw) {
index_t offset = (n * channels + c) * pooled_height * pooled_width;
for (index_t ph = phstart; ph < phend; ++ph) {
for (index_t pw = pwstart; pw < pwend; ++pw) {
if (top_mask[ph * pooled_width + pw + offset] == h * width + w) {
gradient += static_cast<accscalar_t>(top_diff[ph * pooled_width + pw + offset]);
}
@ -469,8 +498,6 @@ const Tensor& indices) {
const int64_t in_stride_h = input.stride(-2);
const int64_t in_stride_w = input.stride(-1);
const int count = safe_downcast<int, int64_t>(output.numel());
AT_DISPATCH_FLOATING_TYPES_AND2(kHalf, kBFloat16, input.scalar_type(),
"max_pool2d_with_indices_out_cuda_frame",
[&] {
@ -553,14 +580,42 @@ const Tensor& indices) {
break;
}
case MemoryFormat::Contiguous: {
const int num_threads = std::min(at::cuda::getCurrentDeviceProperties()->maxThreadsPerBlock,
BLOCK_THREADS);
max_pool_forward_nchw<scalar_t>
<<<ceil_div(count, num_threads), num_threads, 0, at::cuda::getCurrentCUDAStream()>>>(
count, input_data,
nInputPlane, inputHeight, inputWidth, outputHeight, outputWidth,
kH, kW, dH, dW, padH, padW, dilationH, dilationW,
output_data, indices_data);
const int threads = std::min(
at::cuda::getCurrentDeviceProperties()->maxThreadsPerBlock,
BLOCK_THREADS);
const int64_t nthreads = output.numel();
bool use_int32 = can_use_int32_nchw(
nbatch, nInputPlane, inputHeight, inputWidth, outputHeight, outputWidth);
const int maxGridX = at::cuda::getCurrentDeviceProperties()->maxGridSize[0];
const int blocks = static_cast<int>(std::min<int64_t>(
ceil_div(nthreads, static_cast<int64_t>(threads)),
static_cast<int64_t>(maxGridX)));
auto stream = at::cuda::getCurrentCUDAStream();
if (use_int32) {
max_pool_forward_nchw<scalar_t, int32_t>
<<<blocks, threads, 0, stream>>>(
static_cast<int32_t>(nthreads),
input_data,
static_cast<int32_t>(nInputPlane),
static_cast<int32_t>(inputHeight),
static_cast<int32_t>(inputWidth),
static_cast<int32_t>(outputHeight),
static_cast<int32_t>(outputWidth),
kH, kW, dH, dW, padH, padW, dilationH, dilationW,
output_data, indices_data);
} else {
max_pool_forward_nchw<scalar_t, int64_t>
<<<blocks, threads, 0, stream>>>(
nthreads,
input_data,
nInputPlane,
inputHeight,
inputWidth,
outputHeight,
outputWidth,
kH, kW, dH, dW, padH, padW, dilationH, dilationW,
output_data, indices_data);
}
C10_CUDA_KERNEL_LAUNCH_CHECK();
break;
}
@ -633,8 +688,6 @@ const Tensor& gradInput) {
gradInput.zero_();
int64_t count = input.numel();
AT_DISPATCH_FLOATING_TYPES_AND2(kHalf, kBFloat16, input.scalar_type(),
"max_pool2d_with_indices_out_cuda_frame",
[&] {
@ -692,25 +745,45 @@ const Tensor& gradInput) {
break;
}
case MemoryFormat::Contiguous: {
int imgcount = inputWidth * inputHeight;
dim3 grid;
const int blocks = (imgcount + BLOCK_THREADS - 1) / BLOCK_THREADS;
grid.x = blocks;
grid.y = nbatch;
uint64_t maxGridY = at::cuda::getCurrentDeviceProperties()->maxGridSize[1];
if (maxGridY < grid.y) grid.y = maxGridY;
grid.z = nInputPlane;
uint64_t maxGridZ = at::cuda::getCurrentDeviceProperties()->maxGridSize[2];
if (maxGridZ < grid.z) grid.z = maxGridZ;
max_pool_backward_nchw<scalar_t, accscalar_t>
<<<grid, BLOCK_THREADS, 0, at::cuda::getCurrentCUDAStream()>>>(
gradOutput_data,
indices_data,
nbatch,
nInputPlane, inputHeight, inputWidth, outputHeight, outputWidth,
kH, kW, dH, dW, padH, padW, dilationH, dilationW,
gradInput_data);
const int threads = std::min(
at::cuda::getCurrentDeviceProperties()->maxThreadsPerBlock,
BLOCK_THREADS);
const int imgcount = inputWidth * inputHeight;
const int maxGridX = at::cuda::getCurrentDeviceProperties()->maxGridSize[0];
const int maxGridY = at::cuda::getCurrentDeviceProperties()->maxGridSize[1];
const int maxGridZ = at::cuda::getCurrentDeviceProperties()->maxGridSize[2];
const int blocks_x = std::min(ceil_div(imgcount, threads), maxGridX);
dim3 grid(blocks_x, static_cast<unsigned>(std::min<int64_t>(nbatch, maxGridY)), static_cast<unsigned>(std::min<int64_t>(nInputPlane, maxGridZ)));
bool use_int32 = can_use_int32_nchw(
nbatch, nInputPlane, inputHeight, inputWidth, outputHeight, outputWidth);
auto stream = at::cuda::getCurrentCUDAStream();
if (use_int32) {
max_pool_backward_nchw<scalar_t, accscalar_t, int32_t>
<<<grid, threads, 0, stream>>>(
gradOutput_data,
indices_data,
static_cast<int32_t>(nbatch),
static_cast<int32_t>(nInputPlane),
static_cast<int32_t>(inputHeight),
static_cast<int32_t>(inputWidth),
static_cast<int32_t>(outputHeight),
static_cast<int32_t>(outputWidth),
kH, kW, dH, dW, padH, padW, dilationH, dilationW,
gradInput_data);
} else {
max_pool_backward_nchw<scalar_t, accscalar_t, int64_t>
<<<grid, threads, 0, stream>>>(
gradOutput_data,
indices_data,
nbatch,
nInputPlane,
inputHeight,
inputWidth,
outputHeight,
outputWidth,
kH, kW, dH, dW, padH, padW, dilationH, dilationW,
gradInput_data);
}
C10_CUDA_KERNEL_LAUNCH_CHECK();
break;
}
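
Review note (illustrative sketch, not part of the change): the NCHW kernels are now templated on `index_t`, and the host picks `int32_t` or `int64_t` at launch time. The host-only snippet below shows the same idea. The body of `can_use_int32_nhwc` is not visible in this diff, so the criterion in `fits_int32` (input and output element counts must fit in a signed 32-bit integer) is an assumption, and `fits_int32`, `OutputCoord` and `decompose` are hypothetical names.

```cpp
#include <cstdint>
#include <limits>

// Assumed criterion: every flat offset into the input and the output must be
// representable as a signed 32-bit integer. The real check may differ.
bool fits_int32(std::int64_t nbatch, std::int64_t channels,
                std::int64_t height, std::int64_t width,
                std::int64_t pooled_height, std::int64_t pooled_width) {
  const std::int64_t limit = std::numeric_limits<std::int32_t>::max();
  const std::int64_t in_numel = nbatch * channels * height * width;
  const std::int64_t out_numel = nbatch * channels * pooled_height * pooled_width;
  return in_numel <= limit && out_numel <= limit;
}

template <typename index_t>
struct OutputCoord {
  index_t n;
  index_t c;
  index_t ph;
  index_t pw;
};

// Splits a flat NCHW output index into (n, c, ph, pw) using the same
// divisions and remainders as the start of the forward kernel loop.
template <typename index_t>
OutputCoord<index_t> decompose(index_t index, index_t channels,
                               index_t pooled_height, index_t pooled_width) {
  OutputCoord<index_t> r;
  r.pw = index % pooled_width;
  r.ph = (index / pooled_width) % pooled_height;
  r.c = (index / pooled_width / pooled_height) % channels;
  r.n = index / pooled_width / pooled_height / channels;
  return r;
}
```

The 32-bit instantiation keeps the index arithmetic cheap for typical shapes, while the 64-bit one stays correct for tensors whose flat indices exceed the 32-bit range.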

View File

@ -669,9 +669,12 @@ std::optional<c10::ScalarType> out_dtype) {
// _scaled_mm_allowed_device is used here within _grouped_mm_cuda which seems incorrect since scale is not used.
// the _grouped_mm_fallback should be safe for any ROCm GPU since it's just calling typical mm/bmm
bool use_fast_path = false;
// On non CK system(w/ ROCm), make sure use_fast_path is false
#if defined(USE_ROCM_CK_GEMM)
if (at::detail::getCUDAHooks().isGPUArch({"gfx942", "gfx950"})) {
use_fast_path = true;
}
#endif //USE_ROCM_CK_GEMM
#endif
const auto out_dtype_ = _resolve_grouped_mm_out_dtype(mat_a, mat_b, out_dtype);
Tensor out = create_grouped_gemm_output_tensor(mat_a, mat_b, offs, out_dtype_);
@ -680,7 +683,11 @@ std::optional<c10::ScalarType> out_dtype) {
#ifndef USE_ROCM
at::cuda::detail::bf16bf16_grouped_mm(mat_a, mat_b, offs, bias, out);
#else
#if defined(USE_ROCM_CK_GEMM)
at::hip::detail::group_gemm_ck(mat_a, mat_b, offs, bias, out);
#else
TORCH_WARN("ROCm: Group Gemm through CK not selected.");
#endif //USE_ROCM_CK_GEMM
#endif
} else {
_grouped_mm_fallback(mat_a, mat_b, offs, bias, out_dtype, out);

View File

@ -267,15 +267,15 @@ void scan_dim_with_indices(const TensorBase& self, const TensorBase& values, con
* outer dimensions, which contains several "inner rows").
* Each thread processes a single inner row at a time.
*/
template<typename scalar_t, class BinaryOp>
template<typename scalar_t, typename index_t, class BinaryOp>
__global__ void tensor_kernel_scan_outer_dim(scalar_t *tgt_, const scalar_t *src_,
const uint32_t num_orows, const uint32_t num_irows, const uint32_t row_size,
const scalar_t init, BinaryOp binary_op)
{
for (uint32_t orow = blockIdx.x; orow < num_orows; orow += gridDim.x) {
for (uint32_t irow = blockIdx.y * blockDim.x + threadIdx.x; irow < num_irows; irow += gridDim.y * blockDim.x) {
const scalar_t *src = src_ + orow * row_size * num_irows + irow;
scalar_t *tgt = tgt_ + orow * row_size * num_irows + irow;
const scalar_t *src = src_ + static_cast<index_t>(orow) * row_size * num_irows + irow;
scalar_t *tgt = tgt_ + (index_t) orow * row_size * num_irows + irow;
scalar_t acc = init;
for (uint32_t col = 0; col < row_size; ++col) {
@ -409,10 +409,15 @@ __host__ void scan_outer_dim(const TensorBase& self, const TensorBase& result,
check_fits_in_unsigned(num_irows, "num_irows");
check_fits_in_unsigned(num_orows, "num_orows");
check_fits_in_unsigned(row_size, "row_size");
tensor_kernel_scan_outer_dim<scalar_t><<<grid, threads, 0, at::cuda::getCurrentCUDAStream()>>>(
if (static_cast<size_t>(num_irows) * num_orows * row_size <= UINT_MAX) {
tensor_kernel_scan_outer_dim<scalar_t, uint32_t><<<grid, threads, 0, at::cuda::getCurrentCUDAStream()>>>(
result.mutable_data_ptr<scalar_t>(), self.const_data_ptr<scalar_t>(),
num_orows, num_irows, row_size, init, binary_op);
} else {
tensor_kernel_scan_outer_dim<scalar_t, size_t><<<grid, threads, 0, at::cuda::getCurrentCUDAStream()>>>(
result.mutable_data_ptr<scalar_t>(), self.const_data_ptr<scalar_t>(),
num_orows, num_irows, row_size, init, binary_op);
}
C10_CUDA_KERNEL_LAUNCH_CHECK();
}
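
Review note (illustrative sketch, not part of the change): the kernel previously computed `orow * row_size * num_irows` entirely in `uint32_t`, which wraps once the product reaches 2^32. Casting the first operand to a wider `index_t` makes the whole product wide. The host-only snippet below reproduces both behaviours; `row_offset` and `fits_uint32` are hypothetical helper names.

```cpp
#include <cstdint>

// Offset of element (orow, irow) with the product evaluated in index_t.
// With index_t = uint32_t the multiplication wraps modulo 2^32.
template <typename index_t>
index_t row_offset(std::uint32_t orow, std::uint32_t row_size,
                   std::uint32_t num_irows, std::uint32_t irow) {
  return static_cast<index_t>(orow) * row_size * num_irows + irow;
}

// Host-side check in the spirit of the dispatch above: 32-bit offsets are
// only safe when the total element count fits in 32 bits.
bool fits_uint32(std::uint32_t num_irows, std::uint32_t num_orows,
                 std::uint32_t row_size) {
  return static_cast<std::uint64_t>(num_irows) * num_orows * row_size <=
      UINT32_MAX;
}
```

For `orow = 3`, `row_size = 65536` and `num_irows = 65536`, the 32-bit variant collapses the product to zero, so the kernel would read and write the wrong rows; the 64-bit variant yields the intended offset.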

View File

@ -82,6 +82,7 @@ NSArray<NSNumber*>* getTensorAxes(const TensorBase& t);
NSArray<NSNumber*>* getTensorAxes(const IntArrayRef& sizes, at::OptionalIntArrayRef dim);
std::string getMPSShapeString(MPSShape* shape);
std::string getTensorsStringKey(const TensorList& tensors, bool short_dtype = true, bool exclude_shape = false);
std::string to_hex_key(float);
std::string getArrayRefString(const IntArrayRef s);
// use has_storage() on the returned tensor to determine if src actually is a view
Tensor gatherViewTensor(const Tensor& src, Tensor& dst);

View File

@ -301,6 +301,10 @@ std::string getArrayRefString(const IntArrayRef s) {
return fmt::to_string(fmt::join(s, ","));
}
std::string to_hex_key(float f) {
return fmt::format("{:a}", f);
}
std::string getTensorsStringKey(const TensorList& tensors, bool short_dtype, bool exclude_shape) {
fmt::basic_memory_buffer<char, 100> buffer;
auto buf_iterator = std::back_inserter(buffer);
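
Review note (illustrative sketch, not part of the change): `to_hex_key` exists because a fixed six-decimal rendering of a float can map different scalars to the same cache key. The snippet below shows the collision with `snprintf`, which avoids any dependency on `fmt`. `decimal_key` and `hex_key` are hypothetical names; `"%f"` matches what `std::to_string(float)` has traditionally produced, though newer standard revisions may format differently.

```cpp
#include <cstdio>
#include <string>

// Fixed-point rendering with six digits after the decimal point.
std::string decimal_key(float f) {
  char buf[64];
  std::snprintf(buf, sizeof(buf), "%f", f);
  return buf;
}

// Hexadecimal floating-point rendering; without an explicit precision it
// represents the value exactly, so distinct finite floats get distinct keys.
std::string hex_key(float f) {
  char buf[64];
  std::snprintf(buf, sizeof(buf), "%a", f);
  return buf;
}
```

With the decimal form, a clamp bound of `1e-7f` and one of `2e-7f` both render as `0.000000`, so a graph cached for the first bound could be reused for the second. The hex form keeps the two keys apart.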

View File

@ -40,7 +40,7 @@ inline c10::metal::opmath_t<T> matmul_inner(
threadgroup_barrier(mem_flags::mem_threadgroup);
for (uint k = 0; k < TILE_DIM; k++) {
sum += A_tile[tid.y][k] * B_tile[k][tid.x];
sum += c10::metal::mul(A_tile[tid.y][k], B_tile[k][tid.x]);
}
threadgroup_barrier(mem_flags::mem_threadgroup);
@ -832,6 +832,10 @@ INSTANTIATE_MM_OPS(float);
INSTANTIATE_MM_OPS(half);
INSTANTIATE_MM_OPS(bfloat);
// Complex MM
INSTANTIATE_MM_OPS(float2);
INSTANTIATE_MM_OPS(half2);
// Integral MM
INSTANTIATE_MM_OPS(long);
INSTANTIATE_MM_OPS(int);
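
Review note (illustrative sketch, not part of the change): instantiating the kernels for `float2` and `half2` only works because the inner product now goes through `c10::metal::mul` instead of `operator*`. On a two-component vector type, `*` multiplies component by component, which is not complex multiplication. The snippet below contrasts the two in plain C++. `Float2`, `componentwise_mul` and `complex_mul` are hypothetical names, and treating `x` as the real part and `y` as the imaginary part is an assumption about how complex values are packed.

```cpp
// x holds the real part and y the imaginary part (assumed layout).
struct Float2 {
  float x;
  float y;
};

// What a vector-style operator* computes: each lane independently.
Float2 componentwise_mul(Float2 a, Float2 b) {
  return {a.x * b.x, a.y * b.y};
}

// Complex product: (a.x + i a.y) * (b.x + i b.y).
Float2 complex_mul(Float2 a, Float2 b) {
  return {a.x * b.x - a.y * b.y, a.x * b.y + a.y * b.x};
}
```

For (1 + 2i) and (3 + 4i), the component-wise product gives (3, 8), while the complex product gives (-5, 10).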

View File

@ -190,10 +190,16 @@ std::tuple<MPSGraphTensor*, MPSGraphTensor*, MPSGraphTensor*> do_mm(MPSGraph* gr
bool use_metal_mm(const Tensor& self, const Tensor& other, const Tensor& output) {
static bool always_use_metal = c10::utils::has_env("PYTORCH_MPS_PREFER_METAL");
constexpr auto max_stride_size = 32768;
constexpr auto max_complex_inner_size = 2048;
static bool is_macos_14_4_or_newer = is_macos_13_or_newer(MacOSVersion::MACOS_VER_14_4_PLUS);
if (always_use_metal || c10::isIntegralType(self.scalar_type(), true)) {
return true;
}
// multiplicationWithPrimaryTensor: returns incorrect results if inner size exceeds 2048
// See https://github.com/pytorch/pytorch/issues/167727#issuecomment-3529308548
if (c10::isComplexType(self.scalar_type()) && self.size(1) > max_complex_inner_size) {
return true;
}
return !is_macos_14_4_or_newer &&
(self.stride(0) > max_stride_size || self.stride(1) > max_stride_size || self.size(0) > max_stride_size ||
self.size(1) > max_stride_size || other.stride(0) > max_stride_size || other.stride(1) > max_stride_size ||

View File

@ -244,8 +244,8 @@ static void clamp_scalar_out_mps(const Tensor& input_t,
@autoreleasepool {
// the optional min/max refs could affect how we build the cached graph
std::string key = op_name + (has_min ? ("_min:" + std::to_string(min_scalar)) : "") +
(has_max ? ("_max:" + std::to_string(max_scalar)) : "") + "_scalar:" + getTensorsStringKey({input_t});
std::string key = op_name + (has_min ? ("_min:" + to_hex_key(min_scalar)) : "") +
(has_max ? ("_max:" + to_hex_key(max_scalar)) : "") + "_scalar:" + getTensorsStringKey({input_t});
auto cachedGraph = LookUpOrCreateCachedGraph<CachedGraph>(key, [&](auto mpsGraph, auto newCachedGraph) {
if (has_min)
newCachedGraph->minTensor = [mpsGraph constantWithScalar:min_scalar

View File

@ -7518,7 +7518,7 @@
- func: _sparse_mask_projection(Tensor self, Tensor mask, bool accumulate_matches=False) -> Tensor
variants: method
dispatch:
SparseCPU, SparseCUDA: sparse_mask_projection
SparseCPU, SparseCUDA, SparseMPS: sparse_mask_projection
autogen: _sparse_mask_projection.out
- func: _to_cpu(Tensor[] tensors) -> Tensor[]

View File

@ -30,10 +30,12 @@
#include <thrust/binary_search.h>
#include <thrust/device_ptr.h>
#include <thrust/distance.h>
#include <thrust/iterator/constant_iterator.h>
#include <thrust/scan.h>
#include <thrust/sequence.h>
#include <thrust/sort.h>
#include <thrust/system/cuda/execution_policy.h>
#include <thrust/iterator/constant_iterator.h>
#include <cuda_runtime_api.h>
#include <cusparse.h>
@ -47,6 +49,7 @@
#include <c10/macros/Macros.h>
#include <thrust/copy.h>
#include <thrust/device_ptr.h>
#include <thrust/distance.h>
#include <thrust/for_each.h>
#include <thrust/functional.h>
#include <thrust/gather.h>

View File

@ -445,6 +445,33 @@ static SparseTensor& mul_out_dense_sparse_mps(
return out;
}
static std::tuple<Tensor, Tensor, int64_t> mps_intersect_binary_search(
const Tensor& A_keys,
const Tensor& B_keys,
int64_t lenA,
int64_t lenB,
bool boolean_flag) {
auto stream = getCurrentMPSStream();
auto outA_idx = at::empty({lenA}, A_keys.options().dtype(at::kLong));
auto outB_idx = at::empty({lenA}, A_keys.options().dtype(at::kLong));
auto counter = at::zeros({1}, A_keys.options().dtype(at::kInt));
dispatch_sync_with_rethrow(stream->queue(), ^() {
@autoreleasepool {
auto pso = lib.getPipelineStateForFunc("intersect_binary_search");
auto enc = stream->commandEncoder();
[enc setComputePipelineState:pso];
mtl_setArgs(enc, A_keys, B_keys, outA_idx, outB_idx, counter,
static_cast<uint32_t>(lenB), boolean_flag);
mtl_dispatch1DJob(enc, pso, static_cast<uint32_t>(lenA));
}
});
const auto match_count = static_cast<int64_t>(counter.item<int32_t>());
return std::make_tuple(std::move(outA_idx), std::move(outB_idx), match_count);
}
SparseTensor& mul_out_sparse_mps(const Tensor& t_, const Tensor& src_, SparseTensor& r_) {
TORCH_CHECK(r_.is_mps(), "mul: expected 'out' to be MPS, but got ", r_.device());
@ -523,22 +550,10 @@ SparseTensor& mul_out_sparse_mps(const Tensor& t_, const Tensor& src_, SparseTen
auto A_keys = A_is_lhs ? lhs_keys : rhs_keys;
auto B_keys = A_is_lhs ? rhs_keys : lhs_keys;
auto outA_idx = at::empty({lenA}, at::device(device).dtype(kLong));
auto outB_idx = at::empty({lenA}, at::device(device).dtype(kLong));
auto counter = at::zeros({1}, at::device(device).dtype(kInt));
auto [outA_idx, outB_idx, M_int64] = mps_intersect_binary_search(
A_keys, B_keys, lenA, lenB, A_is_lhs);
dispatch_sync_with_rethrow(stream->queue(), ^() {
@autoreleasepool {
auto pso = lib.getPipelineStateForFunc("intersect_binary_search");
auto enc = stream->commandEncoder();
[enc setComputePipelineState:pso];
mtl_setArgs(enc, A_keys, B_keys, outA_idx, outB_idx, counter,
static_cast<uint32_t>(lenB), A_is_lhs);
mtl_dispatch1DJob(enc, pso, static_cast<uint32_t>(lenA));
}
});
const uint32_t M = counter.item<int32_t>(); // number of structural matches
const auto M = static_cast<uint32_t>(M_int64); // number of structural matches
r_.resize_as_(lhs);
@ -762,6 +777,14 @@ SparseTensor& add_out_sparse_mps(const SparseTensor& self,
using OptTensor = std::optional<Tensor>;
static Tensor create_sparse_output_values(
const Tensor& template_values,
int64_t output_nnz,
ScalarType dtype) {
auto out_val_sizes = template_values.sizes().vec();
out_val_sizes[0] = output_nnz;
return at::zeros(out_val_sizes, template_values.options().dtype(dtype));
}
static void sparse_mask_apply_out_mps_kernel(
Tensor& result,
@ -783,9 +806,9 @@ static void sparse_mask_apply_out_mps_kernel(
auto src = src_in.coalesce();
auto mask = coalesce_mask ? mask_in.coalesce() : mask_in;
const int64_t src_nnz = src._nnz();
const int64_t mask_nnz = mask._nnz();
const int64_t sd = src.sparse_dim();
const auto src_nnz = src._nnz();
const auto mask_nnz = mask._nnz();
const auto sd = src.sparse_dim();
result.sparse_resize_(mask.sizes(), mask.sparse_dim(), mask.dense_dim());
auto commonDtype = at::result_type(src, mask);
@ -814,53 +837,27 @@ static void sparse_mask_apply_out_mps_kernel(
return;
}
auto mask_indices = mask._indices().contiguous();
auto src_values = src._values().to(commonDtype).contiguous();
auto out_values = create_sparse_output_values(src_values, mask_nnz, commonDtype);
if (src_nnz == 0) {
auto out_indices = mask._indices().contiguous();
auto src_values = src._values().to(commonDtype);
auto out_val_sizes = src_values.sizes().vec();
out_val_sizes[0] = mask_nnz;
auto out_values = at::zeros(out_val_sizes, src_values.options());
alias_into_sparse(result, out_indices, out_values);
alias_into_sparse(result, mask_indices, out_values);
result._coalesced_(mask.is_coalesced());
return;
}
auto mask_indices = mask._indices().contiguous();
auto src_indices = src._indices().contiguous();
auto src_values = src._values().to(commonDtype).contiguous();
auto mask_keys = flatten_indices(mask._indices().contiguous(), mask.sizes().slice(0, sd)).contiguous();
auto src_keys = flatten_indices(src._indices().contiguous(), src.sizes().slice(0, sd)).contiguous();
auto mask_keys = flatten_indices(mask_indices, mask.sizes().slice(0, sd)).contiguous();
auto src_keys = flatten_indices(src_indices, src.sizes().slice(0, sd)).contiguous();
const bool A_is_src = (src_nnz <= mask_nnz);
const int64_t lenA = A_is_src ? src_nnz : mask_nnz;
const int64_t lenB = A_is_src ? mask_nnz : src_nnz;
const auto A_is_src = (src_nnz <= mask_nnz);
const auto lenA = A_is_src ? src_nnz : mask_nnz;
const auto lenB = A_is_src ? mask_nnz : src_nnz;
auto A_keys = A_is_src ? src_keys : mask_keys;
auto B_keys = A_is_src ? mask_keys : src_keys;
const auto device = result.device();
auto stream = getCurrentMPSStream();
auto outA_idx = at::empty({lenA}, at::device(device).dtype(at::kLong));
auto outB_idx = at::empty({lenA}, at::device(device).dtype(at::kLong));
auto counter = at::zeros({1}, at::device(device).dtype(at::kInt));
dispatch_sync_with_rethrow(stream->queue(), ^() {
@autoreleasepool {
auto pso = lib.getPipelineStateForFunc("intersect_binary_search");
auto enc = stream->commandEncoder();
[enc setComputePipelineState:pso];
mtl_setArgs(enc, A_keys, B_keys, outA_idx, outB_idx, counter,
static_cast<uint32_t>(lenB), A_is_src);
mtl_dispatch1DJob(enc, pso, static_cast<uint32_t>(lenA));
}
});
const int64_t M = static_cast<int64_t>(counter.item<int32_t>());
auto out_val_sizes = src_values.sizes().vec();
out_val_sizes[0] = mask_nnz;
auto out_values = at::zeros(out_val_sizes, src_values.options());
auto [outA_idx, outB_idx, M] = mps_intersect_binary_search(
A_keys, B_keys, lenA, lenB, A_is_src);
if (M > 0) {
auto src_match = outA_idx.narrow(0, 0, M);
@ -878,6 +875,70 @@ static void sparse_mask_apply_out_mps_kernel(
result._coalesced_(mask.is_coalesced());
}
static void sparse_mask_projection_out_mps_kernel(
Tensor& result,
const Tensor& lhs,
const Tensor& rhs,
const OptTensor& /*x_hash_opt*/,
bool accumulate_matches) {
TORCH_CHECK(lhs.is_sparse() && rhs.is_sparse(), "sparse_mask_projection: expected sparse COO");
TORCH_CHECK(lhs.is_mps() && rhs.is_mps(), "sparse_mask_projection: expected MPS tensors");
TORCH_CHECK(lhs.sparse_dim() == rhs.sparse_dim(), "sparse_dim mismatch");
auto lhs_c = lhs.coalesce();
auto rhs_c = rhs.coalesce();
const auto sd = lhs_c.sparse_dim();
const auto lhs_nnz = lhs_c._nnz();
const auto rhs_nnz = rhs_c._nnz();
auto commonDtype = at::result_type(lhs_c, rhs_c);
TORCH_CHECK(canCast(commonDtype, result.scalar_type()),
"Can't convert ", commonDtype, " to output ", result.scalar_type());
result.sparse_resize_(lhs.sizes(), lhs.sparse_dim(), lhs.dense_dim());
auto lhs_indices = lhs_c._indices().contiguous();
auto rhs_values = rhs_c._values().to(commonDtype).contiguous();
auto out_values = create_sparse_output_values(rhs_values, lhs_nnz, commonDtype);
if (lhs_nnz > 0 && rhs_nnz > 0) {
auto lhs_keys = flatten_indices(lhs_indices, lhs_c.sizes().slice(0, sd)).contiguous();
auto rhs_keys = flatten_indices(rhs_c._indices().contiguous(), rhs_c.sizes().slice(0, sd)).contiguous();
const auto A_is_lhs = (lhs_nnz <= rhs_nnz);
const auto lenA = A_is_lhs ? lhs_nnz : rhs_nnz;
const auto lenB = A_is_lhs ? rhs_nnz : lhs_nnz;
auto A_keys = A_is_lhs ? lhs_keys : rhs_keys;
auto B_keys = A_is_lhs ? rhs_keys : lhs_keys;
auto [outA_idx, outB_idx, M] = mps_intersect_binary_search(
A_keys, B_keys, lenA, lenB, A_is_lhs);
if (M > 0) {
auto idx_in_A = outA_idx.narrow(0, 0, M);
auto idx_in_B = outB_idx.narrow(0, 0, M);
auto idx_in_lhs = A_is_lhs ? idx_in_A : idx_in_B;
auto idx_in_rhs = A_is_lhs ? idx_in_B : idx_in_A;
const auto view_cols = rhs_values.numel() / std::max<int64_t>(rhs_nnz, 1);
auto rhs_rows = rhs_values.index_select(0, idx_in_rhs).contiguous();
auto rhs_rows_2d = rhs_rows.view({M, view_cols});
auto out_2d = out_values.view({lhs_nnz, view_cols});
if (accumulate_matches) {
out_2d.index_add_(0, idx_in_lhs, rhs_rows_2d);
} else {
out_2d.index_copy_(0, idx_in_lhs, rhs_rows_2d);
}
}
}
alias_into_sparse(result, lhs._indices(), out_values);
result._coalesced_(lhs.is_coalesced());
}
static void sparse_mask_intersection_out_mps_kernel(
Tensor& result,
const Tensor& lhs,
@ -1002,4 +1063,5 @@ Tensor sparse_sparse_matmul_mps(const Tensor& mat1_, const Tensor& mat2_) {
}
REGISTER_MPS_DISPATCH(sparse_mask_intersection_out_stub, &sparse_mask_intersection_out_mps_kernel);
REGISTER_MPS_DISPATCH(sparse_mask_projection_out_stub, &sparse_mask_projection_out_mps_kernel);
} // namespace at::native


@ -1,191 +1,3 @@
#pragma once
#include <ATen/xpu/XPUContext.h>
#include <optional>
namespace at::xpu {
/*
* XPUEvent are movable not copyable wrappers around SYCL event. XPUEvent are
* constructed lazily when first recorded. It has a device, and this device is
* acquired from the first recording stream. Later streams that record the event
* must match the same device.
*
* Currently, XPUEvent does NOT support to export an inter-process event from
* another process via inter-process communication(IPC). So it means that
* inter-process communication for event handles between different processes is
* not available. This could impact some applications that rely on cross-process
* synchronization and communication.
*/
struct TORCH_XPU_API XPUEvent {
// Constructors
XPUEvent(bool enable_timing = false) noexcept
: enable_timing_{enable_timing} {}
~XPUEvent() {
if (isCreated()) {
const c10::impl::PyInterpreter* interp = c10::impl::GPUTrace::get_trace();
if (C10_UNLIKELY(interp)) {
(*interp)->trace_gpu_event_deletion(
at::kXPU, reinterpret_cast<uintptr_t>(event_.get()));
}
}
}
XPUEvent(const XPUEvent&) = delete;
XPUEvent& operator=(const XPUEvent&) = delete;
XPUEvent(XPUEvent&& other) = default;
XPUEvent& operator=(XPUEvent&& other) = default;
operator sycl::event&() const {
return event();
}
std::optional<at::Device> device() const {
if (isCreated()) {
return at::Device(at::kXPU, device_index_);
} else {
return std::nullopt;
}
}
inline bool isCreated() const {
return (event_.get() != nullptr);
}
DeviceIndex device_index() const {
return device_index_;
}
sycl::event& event() const {
return *event_;
}
bool query() const {
using namespace sycl::info;
if (!isCreated()) {
return true;
}
return event().get_info<event::command_execution_status>() ==
event_command_status::complete;
}
void record() {
record(getCurrentXPUStream());
}
void recordOnce(const XPUStream& stream) {
if (!isCreated()) {
record(stream);
}
}
void record(const XPUStream& stream) {
if (!isCreated()) {
device_index_ = stream.device_index();
assignEvent(stream.queue());
const c10::impl::PyInterpreter* interp = c10::impl::GPUTrace::get_trace();
if (C10_UNLIKELY(interp)) {
(*interp)->trace_gpu_event_creation(
at::kXPU, reinterpret_cast<uintptr_t>(event_.get()));
}
} else {
TORCH_CHECK(
device_index_ == stream.device_index(),
"Event device ",
device_index_,
" does not match recording stream's device ",
stream.device_index(),
".");
reassignEvent(stream.queue());
}
const c10::impl::PyInterpreter* interp = c10::impl::GPUTrace::get_trace();
if (C10_UNLIKELY(interp)) {
(*interp)->trace_gpu_event_record(
at::kXPU,
reinterpret_cast<uintptr_t>(event_.get()),
reinterpret_cast<uintptr_t>(&stream.queue()));
}
}
void block(const XPUStream& stream) {
if (isCreated()) {
std::vector<sycl::event> event_list{event()};
// Make this stream wait until event_ is completed.
stream.queue().ext_oneapi_submit_barrier(event_list);
const c10::impl::PyInterpreter* interp = c10::impl::GPUTrace::get_trace();
if (C10_UNLIKELY(interp)) {
(*interp)->trace_gpu_event_wait(
at::kXPU,
reinterpret_cast<uintptr_t>(event_.get()),
reinterpret_cast<uintptr_t>(&stream.queue()));
}
}
}
double elapsed_time(const XPUEvent& other) const {
TORCH_CHECK(
isCreated() && other.isCreated(),
"Both events must be recorded before calculating elapsed time.");
TORCH_CHECK(
query() && other.query(),
"Both events must be completed before calculating elapsed time.");
TORCH_CHECK(
enable_timing_ && other.enable_timing_,
"Both events must be created with argument 'enable_timing=True'.");
#if SYCL_COMPILER_VERSION < 20250000
TORCH_CHECK_NOT_IMPLEMENTED(
false,
"elapsed_time of XPUEvent requires PyTorch to be built with SYCL compiler version 2025.0.0 or newer.");
#endif
using namespace sycl::info::event_profiling;
// Block until both of the recorded events are completed.
uint64_t end_time_ns = other.event().get_profiling_info<command_end>();
uint64_t start_time_ns = event().get_profiling_info<command_end>();
// Return the elapsed time in milliseconds.
return 1e-6 *
(static_cast<double>(end_time_ns) - static_cast<double>(start_time_ns));
}
void synchronize() const {
if (isCreated()) {
const c10::impl::PyInterpreter* interp = c10::impl::GPUTrace::get_trace();
if (C10_UNLIKELY(interp)) {
(*interp)->trace_gpu_event_synchronization(
at::kXPU, reinterpret_cast<uintptr_t>(event_.get()));
}
event().wait_and_throw();
}
}
private:
void assignEvent(sycl::queue& queue) {
#if SYCL_COMPILER_VERSION >= 20250000
if (enable_timing_) {
event_ = std::make_unique<sycl::event>(
sycl::ext::oneapi::experimental::submit_profiling_tag(queue));
} else {
event_ = std::make_unique<sycl::event>(queue.ext_oneapi_submit_barrier());
}
#else
event_ = std::make_unique<sycl::event>(queue.ext_oneapi_submit_barrier());
#endif
}
void reassignEvent(sycl::queue& queue) {
event_.reset();
assignEvent(queue);
}
bool enable_timing_ = false;
DeviceIndex device_index_ = -1;
// Only need to track the last event, as events in an in-order queue are
// executed sequentially.
std::unique_ptr<sycl::event> event_;
};
} // namespace at::xpu
#include <c10/xpu/XPUEvent.h>


@ -50,6 +50,7 @@ def check_accuracy(actual_csv, expected_csv, expected_filename):
"mobilenet_v2",
"pytorch_CycleGAN_and_pix2pix",
"pytorch_stargan",
"repvgg_a2",
"resnet152",
"resnet18",
"resnet50",


@ -10,7 +10,7 @@ beit_base_patch16_224,pass,7
convnextv2_nano.fcmae_ft_in22k_in1k,pass,7
convnextv2_nano.fcmae_ft_in22k_in1k,fail_accuracy,7
@ -66,7 +66,7 @@ visformer_small,pass,7
vit_base_patch14_dinov2.lvd142m,pass,7
vit_base_patch14_dinov2.lvd142m,fail_accuracy,7

name,accuracy,graph_breaks
mobilenetv2_100,pass,7
mobilenetv3_large_100,pass,7
mobilevit_s,pass,6
nfnet_l0,pass,7
repvgg_a2,pass,7
swin_base_patch4_window7_224,pass,7
tf_efficientnet_b0,pass,6


@ -50,7 +50,7 @@ nfnet_l0,pass,7
repvgg_a2,fail_accuracy,7
repvgg_a2,pass,7

name,accuracy,graph_breaks


@ -2288,11 +2288,9 @@ class BenchmarkRunner:
)
):
is_same = False
except Exception as e:
except Exception:
# Sometimes torch.allclose may throw RuntimeError
exception_string = str(e)
accuracy_status = f"fail_exception: {exception_string}"
return record_status(accuracy_status, dynamo_start_stats=start_stats)
is_same = False
if not is_same:
accuracy_status = "eager_two_runs_differ"
@ -2409,11 +2407,9 @@ class BenchmarkRunner:
force_max_multiplier=force_max_multiplier,
):
is_same = False
except Exception as e:
except Exception:
# Sometimes torch.allclose may throw RuntimeError
exception_string = str(e)
accuracy_status = f"fail_exception: {exception_string}"
return record_status(accuracy_status, dynamo_start_stats=start_stats)
is_same = False
if not is_same:
if self.args.skip_accuracy_check:


@ -0,0 +1,62 @@
import sys
from benchmark_base import BenchmarkBase
import torch
from torch.distributed._tensor import DTensor, Replicate
from torch.testing._internal.distributed.fake_pg import FakeStore
class BenchmarkDTensorDispatch(BenchmarkBase):
def __init__(self, operator, world_size) -> None:
super().__init__(
category=f"dtensor_dispatch_{operator}",
device="cuda",
)
self.world_size = world_size
def name(self) -> str:
prefix = f"{self.category()}"
return prefix
def description(self) -> str:
return f"DTensor dispatch time for {self.category()}"
def _prepare_once(self) -> None:
self.mesh = torch.distributed.device_mesh.init_device_mesh(
"cuda", (self.world_size,), mesh_dim_names=("dp",)
)
self.a = DTensor.from_local(
torch.ones(10, 10, device=self.device()), self.mesh, [Replicate()]
)
self.b = DTensor.from_local(
torch.ones(10, 10, device=self.device()), self.mesh, [Replicate()]
)
def _prepare(self) -> None:
pass
class BenchmarkDetach(BenchmarkDTensorDispatch):
def __init__(self, world_size) -> None:
super().__init__(operator="detach", world_size=world_size)
def _work(self) -> None:
self.a.detach()
def main():
world_size = 256
fake_store = FakeStore()
torch.distributed.init_process_group(
"fake", store=fake_store, rank=0, world_size=world_size
)
result_path = sys.argv[1]
BenchmarkDetach(world_size).enable_instruction_count().collect_all().append_results(
result_path
)
torch.distributed.destroy_process_group()
if __name__ == "__main__":
main()


@ -125,6 +125,17 @@ AttentionType = Literal[
]
DtypeString = Literal["bfloat16", "float16", "float32"]
SpeedupType = Literal["fwd", "bwd"]
# Operator Name mapping
backend_to_operator_name = {
"math": "math attention kernel",
"efficient": "efficient attention kernel",
"cudnn": "cudnn attention kernel",
"fav2": "flash attention 2 kernel",
"fav3": "flash attention 3 kernel",
"fakv": "flash attention kv cache kernel",
"og-eager": "eager attention kernel",
"flex": "flex attention kernel",
}
def benchmark_torch_function_in_microseconds(func: Callable, *args, **kwargs) -> float:
@ -1265,12 +1276,14 @@ def _output_json_for_dashboard(
model: ModelInfo
metric: MetricInfo
operator_name = backend_to_operator_name.get(backend, backend)
# Benchmark extra info
benchmark_extra_info = {
"input_config": input_config,
"device": device,
"arch": device_arch,
"operator_name": backend,
"operator_name": operator_name,
"attn_type": config.attn_type,
"shape": str(config.shape),
"max_autotune": config.max_autotune,
@ -1288,7 +1301,7 @@ def _output_json_for_dashboard(
type="attention-benchmark",
origins=["pytorch"],
extra_info={
"operator_name": backend,
"operator_name": operator_name,
"attn_type": config.attn_type,
},
),
@ -1315,7 +1328,7 @@ def _output_json_for_dashboard(
type="attention-benchmark",
origins=["pytorch"],
extra_info={
"operator_name": backend,
"operator_name": operator_name,
},
),
metric=MetricInfo(
@ -1341,7 +1354,7 @@ def _output_json_for_dashboard(
type="attention-benchmark",
origins=["pytorch"],
extra_info={
"operator_name": backend,
"operator_name": operator_name,
},
),
metric=MetricInfo(
@ -1371,7 +1384,7 @@ def _output_json_for_dashboard(
type="attention-benchmark",
origins=["pytorch"],
extra_info={
"operator_name": backend,
"operator_name": operator_name,
},
),
metric=MetricInfo(


@ -19,6 +19,17 @@
namespace c10 {
using CaptureId_t = unsigned long long;
// first is set if the instance is created by CUDAGraph::capture_begin.
// second is set if the instance is created by at::cuda::graph_pool_handle.
using MempoolId_t = std::pair<CaptureId_t, CaptureId_t>;
struct MempoolIdHash {
std::size_t operator()(const MempoolId_t& mempool_id) const noexcept {
return mempool_id.first != 0 ? mempool_id.first : mempool_id.second;
}
};
// A DataPtr is a unique pointer (with an attached deleter and some
// context for the deleter) to some memory, which also records what
// device is for its data.


@ -99,7 +99,10 @@ struct C10_API DeviceAllocator : public c10::Allocator {
// Return the free memory size and total memory size in bytes for the
// specified device.
virtual std::pair<size_t, size_t> getMemoryInfo(c10::DeviceIndex device) = 0;
virtual std::pair<size_t, size_t> getMemoryInfo(c10::DeviceIndex device) {
TORCH_CHECK_NOT_IMPLEMENTED(
false, "getMemoryInfo is not implemented for this allocator yet.");
}
};
// This function is used to get the DeviceAllocator for a specific device type


@ -27,6 +27,7 @@
#include <torch/headeronly/core/ScalarType.h>
C10_DIAGNOSTIC_PUSH_AND_IGNORED_IF_DEFINED("-Wswitch-enum")
C10_DIAGNOSTIC_PUSH_AND_IGNORED_IF_DEFINED("-Wswitch-default")
namespace c10 {
@ -205,6 +206,12 @@ inline bool isSignedType(ScalarType t) {
break;
// Do not add default here, but rather define behavior of every new entry
// here. `-Wswitch-enum` would raise a warning in those cases.
// TODO: get PyTorch to adopt exhaustive switches by default with a way to
// opt specific switches to being non-exhaustive.
// Exhaustive:
// `-Wswitch-enum`, `-Wswitch-default`, `-Wno-covered-switch-default`
// Non-Exhaustive:
// `-Wno-switch-enum`, `-Wswitch-default`, `-Wcovered-switch-default`
}
TORCH_CHECK(false, "Unknown ScalarType ", t);
#undef CASE_ISSIGNED


@ -57,6 +57,8 @@ C10_DECLARE_bool(caffe2_keep_on_shrink);
// respect caffe2_keep_on_shrink.
C10_DECLARE_int64(caffe2_max_keep_on_shrink_memory);
C10_DIAGNOSTIC_PUSH_AND_IGNORED_IF_DEFINED("-Wswitch-default")
namespace at {
class Tensor;
class TensorBase;
@ -3303,3 +3305,5 @@ static_assert(
#undef C10_GCC_VERSION_MINOR
} // namespace c10
C10_DIAGNOSTIC_POP()


@ -1012,12 +1012,6 @@ PrivatePoolState::PrivatePoolState(
}
}
struct MempoolIdHash {
std::size_t operator()(const MempoolId_t& mempool_id) const noexcept {
return mempool_id.first != 0 ? mempool_id.first : mempool_id.second;
}
};
cudaError_t allocPrimitive(void** ptr, size_t size, AllocParams& p) {
if (p.pool->owner_PrivatePool && p.pool->owner_PrivatePool->allocator()) {
*ptr = p.pool->owner_PrivatePool->allocator()->raw_alloc(size);
@ -4510,66 +4504,3 @@ std::atomic<CUDAAllocator*> allocator;
static BackendStaticInitializer backend_static_initializer;
} // namespace cuda::CUDACachingAllocator
} // namespace c10
namespace c10::cuda {
// uid_ is incremented when a user creates a MemPool,
// for example: using graph_pool_handle() or c10::cuda::MemPool().
//
// uuid_ is incremented when CUDAGraph creates a MemPool
// as a result of a user not providing a pool.
//
// MempoolId_t of {0, 0} is used to denote when no MemPool has been
// passed to a function, either by user or CUDAGraphs. For example,
// default value of MempoolId_t for capture_begin function is {0, 0}.
// That's why uid_ and uuid_ start at 1.
std::atomic<CaptureId_t> MemPool::uid_{1};
std::atomic<CaptureId_t> MemPool::uuid_{1};
MemPool::MemPool(
CUDACachingAllocator::CUDAAllocator* allocator,
bool is_user_created,
bool use_on_oom)
: allocator_(allocator), is_user_created_(is_user_created) {
if (is_user_created_) {
id_ = {0, uid_++};
} else {
id_ = {uuid_++, 0};
}
device_ = c10::cuda::current_device();
CUDACachingAllocator::createOrIncrefPool(device_, id_, allocator);
if (use_on_oom) {
CUDACachingAllocator::setUseOnOOM(device_, id_);
}
}
MemPool::~MemPool() {
TORCH_INTERNAL_ASSERT(use_count() == 1);
CUDACachingAllocator::releasePool(device_, id_);
c10::cuda::CUDACachingAllocator::emptyCache(id_);
}
MempoolId_t MemPool::id() {
return id_;
}
CUDACachingAllocator::CUDAAllocator* MemPool::allocator() {
return allocator_;
}
int MemPool::use_count() {
return CUDACachingAllocator::getPoolUseCount(device_, id_);
}
c10::DeviceIndex MemPool::device() {
return device_;
}
MempoolId_t MemPool::graph_pool_handle(bool is_user_created) {
if (is_user_created) {
return {0, uid_++};
}
return {uuid_++, 0};
}
} // namespace c10::cuda


@ -562,41 +562,7 @@ inline std::string getUserMetadata() {
} // namespace c10::cuda::CUDACachingAllocator
namespace c10::cuda {
// Keep BC only
using c10::CaptureId_t;
using c10::MempoolId_t;
// MemPool represents a pool of memory in a caching allocator. Currently,
// it's just the ID of the pool object maintained in the CUDACachingAllocator.
//
// An allocator pointer can be passed to the MemPool to define how the
// allocations should be done in the pool. For example: using a different
// system allocator such as ncclMemAlloc.
struct C10_CUDA_API MemPool {
MemPool(
CUDACachingAllocator::CUDAAllocator* allocator = nullptr,
bool is_user_created = true,
bool use_on_oom = false);
MemPool(const MemPool&) = delete;
MemPool(MemPool&&) = default;
MemPool& operator=(const MemPool&) = delete;
MemPool& operator=(MemPool&&) = default;
~MemPool();
MempoolId_t id();
CUDACachingAllocator::CUDAAllocator* allocator();
int use_count();
c10::DeviceIndex device();
static MempoolId_t graph_pool_handle(bool is_user_created = true);
private:
static std::atomic<CaptureId_t> uid_;
static std::atomic<CaptureId_t> uuid_;
CUDACachingAllocator::CUDAAllocator* allocator_;
bool is_user_created_;
MempoolId_t id_;
c10::DeviceIndex device_;
};
} // namespace c10::cuda


@ -295,11 +295,19 @@ DeviceAssertionsData* CUDAKernelLaunchRegistry::
C10_CUDA_CHECK_WO_DSA(
cudaMallocManaged(&uvm_assertions_ptr, sizeof(DeviceAssertionsData)));
#if CUDART_VERSION >= 13000
cudaMemLocation cpuDevice;
cpuDevice.type = cudaMemLocationTypeDevice;
cpuDevice.id = cudaCpuDeviceId;
#else
const auto cpuDevice = cudaCpuDeviceId;
#endif
C10_CUDA_CHECK_WO_DSA(cudaMemAdvise(
uvm_assertions_ptr,
sizeof(DeviceAssertionsData),
cudaMemAdviseSetPreferredLocation,
cudaCpuDeviceId));
cpuDevice));
// GPU will establish direct mapping of data in CPU memory, no page faults
// will be generated
@ -307,7 +315,7 @@ DeviceAssertionsData* CUDAKernelLaunchRegistry::
uvm_assertions_ptr,
sizeof(DeviceAssertionsData),
cudaMemAdviseSetAccessedBy,
cudaCpuDeviceId));
cpuDevice));
// Initialize the memory from the CPU; otherwise, pages may have to be created
// on demand. We think that UVM documentation indicates that first access may


@ -24,6 +24,7 @@ set(C10_XPU_HEADERS
XPUCachingAllocator.h
XPUDeviceProp.h
XPUException.h
XPUEvent.h
XPUFunctions.h
XPUMacros.h
XPUStream.h

c10/xpu/XPUEvent.h

@ -0,0 +1,178 @@
#pragma once
#include <c10/xpu/XPUStream.h>
namespace c10::xpu {
/*
* XPUEvent are movable not copyable wrappers around SYCL event. XPUEvent are
* constructed lazily when first recorded. It has a device, and this device is
* acquired from the first recording stream. Later streams that record the event
* must match the same device.
*
* Currently, XPUEvent does NOT support to export an inter-process event from
* another process via inter-process communication(IPC). So it means that
* inter-process communication for event handles between different processes is
* not available. This could impact some applications that rely on cross-process
* synchronization and communication.
*/
struct XPUEvent {
// Constructors
XPUEvent(bool enable_timing = false) noexcept
: enable_timing_{enable_timing} {}
~XPUEvent() {
if (isCreated()) {
const c10::impl::PyInterpreter* interp = c10::impl::GPUTrace::get_trace();
if (C10_UNLIKELY(interp)) {
(*interp)->trace_gpu_event_deletion(
c10::kXPU, reinterpret_cast<uintptr_t>(event_.get()));
}
}
}
C10_DISABLE_COPY_AND_ASSIGN(XPUEvent);
XPUEvent(XPUEvent&& other) = default;
XPUEvent& operator=(XPUEvent&& other) = default;
operator sycl::event&() const {
return event();
}
std::optional<c10::Device> device() const {
if (isCreated()) {
return c10::Device(c10::kXPU, device_index_);
} else {
return std::nullopt;
}
}
inline bool isCreated() const {
return (event_.get() != nullptr);
}
DeviceIndex device_index() const {
return device_index_;
}
sycl::event& event() const {
return *event_;
}
bool query() const {
using namespace sycl::info;
if (!isCreated()) {
return true;
}
return event().get_info<event::command_execution_status>() ==
event_command_status::complete;
}
void record() {
record(getCurrentXPUStream());
}
void recordOnce(const XPUStream& stream) {
if (!isCreated()) {
record(stream);
}
}
void record(const XPUStream& stream) {
if (!isCreated()) {
device_index_ = stream.device_index();
assignEvent(stream.queue());
const c10::impl::PyInterpreter* interp = c10::impl::GPUTrace::get_trace();
if (C10_UNLIKELY(interp)) {
(*interp)->trace_gpu_event_creation(
c10::kXPU, reinterpret_cast<uintptr_t>(event_.get()));
}
} else {
TORCH_CHECK(
device_index_ == stream.device_index(),
"Event device ",
device_index_,
" does not match recording stream's device ",
stream.device_index(),
".");
reassignEvent(stream.queue());
}
const c10::impl::PyInterpreter* interp = c10::impl::GPUTrace::get_trace();
if (C10_UNLIKELY(interp)) {
(*interp)->trace_gpu_event_record(
c10::kXPU,
reinterpret_cast<uintptr_t>(event_.get()),
reinterpret_cast<uintptr_t>(&stream.queue()));
}
}
void block(const XPUStream& stream) {
if (isCreated()) {
std::vector<sycl::event> event_list{event()};
// Make this stream wait until event_ is completed.
stream.queue().ext_oneapi_submit_barrier(event_list);
const c10::impl::PyInterpreter* interp = c10::impl::GPUTrace::get_trace();
if (C10_UNLIKELY(interp)) {
(*interp)->trace_gpu_event_wait(
c10::kXPU,
reinterpret_cast<uintptr_t>(event_.get()),
reinterpret_cast<uintptr_t>(&stream.queue()));
}
}
}
double elapsed_time(const XPUEvent& other) const {
TORCH_CHECK(
isCreated() && other.isCreated(),
"Both events must be recorded before calculating elapsed time.");
TORCH_CHECK(
query() && other.query(),
"Both events must be completed before calculating elapsed time.");
TORCH_CHECK(
enable_timing_ && other.enable_timing_,
"Both events must be created with argument 'enable_timing=True'.");
using namespace sycl::info::event_profiling;
// Block until both of the recorded events are completed.
uint64_t end_time_ns = other.event().get_profiling_info<command_end>();
uint64_t start_time_ns = event().get_profiling_info<command_end>();
// Return the elapsed time in milliseconds.
return 1e-6 *
(static_cast<double>(end_time_ns) - static_cast<double>(start_time_ns));
}
void synchronize() const {
if (isCreated()) {
const c10::impl::PyInterpreter* interp = c10::impl::GPUTrace::get_trace();
if (C10_UNLIKELY(interp)) {
(*interp)->trace_gpu_event_synchronization(
c10::kXPU, reinterpret_cast<uintptr_t>(event_.get()));
}
event().wait_and_throw();
}
}
private:
void assignEvent(sycl::queue& queue) {
if (enable_timing_) {
event_ = std::make_unique<sycl::event>(
sycl::ext::oneapi::experimental::submit_profiling_tag(queue));
} else {
event_ = std::make_unique<sycl::event>(queue.ext_oneapi_submit_barrier());
}
}
void reassignEvent(sycl::queue& queue) {
event_.reset();
assignEvent(queue);
}
bool enable_timing_ = false;
c10::DeviceIndex device_index_ = -1;
// Only need to track the last event, as events in an in-order queue are
// executed sequentially.
std::unique_ptr<sycl::event> event_;
};
} // namespace c10::xpu


@ -1,7 +1,7 @@
# This will define the following variables:
# SYCL_FOUND : True if the system has the SYCL library.
# SYCL_INCLUDE_DIR : Include directories needed to use SYCL.
# SYCL_LIBRARY_DIR The path to the SYCL library.
# SYCL_LIBRARY_DIR : The path to the SYCL library.
# SYCL_LIBRARY : SYCL library fullname.
# SYCL_COMPILER_VERSION : SYCL compiler version.


@ -0,0 +1,164 @@
# Accelerator Hooks
## Background
OpenReg hooks provide a mechanism for integrating custom accelerator devices into PyTorch's runtime system. OpenReg (Open Registration) is PyTorch's extensibility framework that allows accelerator vendors to register custom device backends without modifying PyTorch core code.
## Design
The following tables list the hooks that accelerator vendors can implement when integrating a new device backend. The hooks fall into two priority levels:
- **High Priority Hooks**: Core APIs that the PyTorch runtime depends on directly. Accelerator vendors should implement all high priority hooks to ensure full PyTorch compatibility and enable basic device functionality.
- **Low Priority Hooks**: Device management and utility APIs that PyTorch does not directly depend on. These hooks enhance user experience and multi-device support but are *optional*. Accelerator vendors can choose to implement them based on their specific requirements and use cases.
### High Priority Hooks
| Hook Method | Description | Application Scenario |
| ---------------------------------- | --------------------------------------------------------- | -------------------------------------------------------------------------------- |
| `init()` | Initializes the accelerator runtime and device contexts | Set up necessary state when PyTorch first accesses the device |
| `hasPrimaryContext(DeviceIndex)` | Checks if a primary context exists for the device | Determine whether device initialization has occurred |
| `getDefaultGenerator(DeviceIndex)` | Returns the default random number generator for a device | Access the device's primary RNG for reproducible random operations |
| `getNewGenerator(DeviceIndex)` | Creates a new independent random number generator | Create isolated RNG instances for parallel operations |
| `getDeviceFromPtr(void*)` | Determines which device a memory pointer belongs to | Identify the accelerator device associated with a memory allocation |
| `getPinnedMemoryAllocator()` | Returns an allocator for pinned (page-locked) host memory | Allocate host memory that can be efficiently transferred to/from the accelerator |
| `isPinnedPtr(void*)` | Checks if a pointer points to pinned memory | Validate memory types before performing operations |
### Low Priority Hooks
| Hook Method | Description | Application Scenario |
| ---------------------------------- | ---------------------------------------------------------------------------- | -------------------------------------------------------------------- |
| `isBuilt()` | Returns whether the accelerator backend is built/compiled into the extension | Check whether the accelerator library is available at compile time |
| `isAvailable()` | Returns whether the accelerator hardware is available at runtime | Verify whether accelerator devices can be detected and initialized |
| `deviceCount()` | Returns the number of available accelerator devices | Enumerate all available accelerator devices for device selection |
| `setCurrentDevice(DeviceIndex)` | Sets the active device for the current thread | Switch the current thread's context to a specific accelerator device |
| `getCurrentDevice()` | Returns the currently active device index | Query which accelerator device is active in the current thread |
| `exchangeDevice(DeviceIndex)` | Atomically exchanges the current device and returns the previous one | Temporarily switch devices and restore the previous device afterward |
| `maybeExchangeDevice(DeviceIndex)` | Conditionally exchanges device only if the index is valid | Safely attempt device switching with validation |
## Implementation
Take `getDefaultGenerator` as an implementation example:
```{eval-rst}
.. literalinclude:: ../../../test/cpp_extensions/open_registration_extension/torch_openreg/csrc/runtime/OpenRegHooks.h
:language: c++
:start-after: LITERALINCLUDE START: OPENREG HOOK EXAMPLES
:end-before: LITERALINCLUDE END: OPENREG HOOK EXAMPLES
:linenos:
```
In this implementation:
1. **Override the base interface**: The `getDefaultGenerator` method overrides the virtual method from `at::PrivateUse1HooksInterface`.
2. **Delegate to device-specific implementation**: It calls `getDefaultOpenRegGenerator(device_index)`, which manages a per-device generator instance.
3. **Return device-specific generator**: The returned `at::Generator` wraps an `OpenRegGeneratorImpl` that implements device-specific random number generation.
This pattern applies to all hooks: override the interface method, validate inputs, delegate to your device-specific API, and return results in PyTorch's expected format.
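The override, validate, delegate, return pattern can be modeled in plain Python. This is a minimal sketch, not the PyTorch API: `HooksInterface`, `ToyHooks`, `ToyGenerator`, `toy_device_default_generator`, and the device count of 2 are all hypothetical names and values chosen for illustration.

```python
class HooksInterface:
    """Stand-in for a base hooks interface: each hook fails until overridden."""

    def get_default_generator(self, device_index: int):
        raise NotImplementedError("get_default_generator is not implemented")


class ToyGenerator:
    """Hypothetical per-device generator object."""

    def __init__(self, device_index: int) -> None:
        self.device_index = device_index
        self.seed = 0


def toy_device_default_generator(device_index: int) -> ToyGenerator:
    """Hypothetical device-specific API that the hook delegates to."""
    return ToyGenerator(device_index)


class ToyHooks(HooksInterface):
    DEVICE_COUNT = 2  # assumed number of devices for this sketch

    def get_default_generator(self, device_index: int) -> ToyGenerator:
        # 1. Validate the input before touching the device-specific layer.
        if not 0 <= device_index < self.DEVICE_COUNT:
            raise IndexError(f"invalid device index {device_index}")
        # 2. Delegate to the device-specific API and return its result.
        return toy_device_default_generator(device_index)
```

Calling `ToyHooks().get_default_generator(1)` returns a generator bound to device 1, while the un-overridden base class raises `NotImplementedError`, which mirrors how an unimplemented hook would surface to the caller.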
## Integration Example
The following sections demonstrate how PyTorch integrates with accelerator hooks when accessing the default random number generator. The example traces the complete flow from user-facing Python code down to the device-specific implementation.
### Layer 1: User Code
User code initiates the operation by calling `manual_seed` to set the random seed for reproducible results:
```python
import torch
torch.openreg.manual_seed(42)
```
### Layer 2: Extension Python API
The Python API layer handles device management and calls into the C++ extension (defined in [`torch_openreg/openreg/random.py`][random.py]):
```{eval-rst}
.. literalinclude:: ../../../test/cpp_extensions/open_registration_extension/torch_openreg/torch_openreg/openreg/random.py
:language: python
:start-after: LITERALINCLUDE START: OPENREG MANUAL SEED
:end-before: LITERALINCLUDE END: OPENREG MANUAL SEED
:linenos:
```
The `manual_seed` function gets the current device index and calls `torch_openreg._C._get_default_generator(idx)` to obtain the device-specific generator, then sets the seed on it.
### Layer 3: Python/C++ Bridge
The C++ extension exposes `_getDefaultGenerator` to Python, which bridges to PyTorch's core runtime:
```{eval-rst}
.. literalinclude:: ../../../test/cpp_extensions/open_registration_extension/torch_openreg/torch_openreg/csrc/Module.cpp
:language: c++
:start-after: LITERALINCLUDE START: OPENREG GET DEFAULT GENERATOR
:end-before: LITERALINCLUDE END: OPENREG GET DEFAULT GENERATOR
:linenos:
:emphasize-lines: 10-11
```
```{eval-rst}
.. literalinclude:: ../../../test/cpp_extensions/open_registration_extension/torch_openreg/torch_openreg/csrc/Module.cpp
:language: c++
:start-after: LITERALINCLUDE START: OPENREG MODULE METHODS
:end-before: LITERALINCLUDE END: OPENREG MODULE METHODS
:linenos:
:emphasize-lines: 3
```
This function unpacks the device index from Python, creates a `PrivateUse1` device object, and calls `at::globalContext().defaultGenerator()`. PyTorch's context then dispatches to the registered hooks.
### Layer 4: PyTorch Core Context
PyTorch's Context class dispatches to the appropriate accelerator hooks ([`aten/src/ATen/Context.h`][Context.h]):
```{eval-rst}
.. literalinclude:: ../../../aten/src/ATen/Context.h
:language: c++
:lines: 60-103
:linenos:
:emphasize-lines: 8-9, 24-25
```
This layered architecture enables PyTorch to remain device-agnostic while delegating hardware-specific operations to accelerator implementations. The hooks are registered once at module load time:
```{eval-rst}
.. literalinclude:: ../../../test/cpp_extensions/open_registration_extension/torch_openreg/csrc/runtime/OpenRegHooks.cpp
:language: c++
:start-after: LITERALINCLUDE START: OPENREG HOOK REGISTER
:end-before: LITERALINCLUDE END: OPENREG HOOK REGISTER
:linenos:
:emphasize-lines: 4
```
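The register-once, dispatch-by-device-type flow described in this layer can be sketched in plain Python. This is an illustrative model only: `register_hooks`, `default_generator`, `ToyHooks`, and the `"toy"` device type are hypothetical, and rejecting a second registration is an assumption of the sketch rather than a statement about PyTorch's behavior.

```python
from typing import Dict


class ToyHooks:
    """Hypothetical hooks object; only the one hook used below is modeled."""

    def get_default_generator(self, device_index: int) -> str:
        return f"toy-generator:{device_index}"


# Registry keyed by device type, filled once when the extension module loads.
_registered_hooks: Dict[str, ToyHooks] = {}


def register_hooks(device_type: str, hooks: ToyHooks) -> None:
    """Register hooks for a device type (assumption: only one registration)."""
    if device_type in _registered_hooks:
        raise RuntimeError(f"hooks already registered for {device_type!r}")
    _registered_hooks[device_type] = hooks


def default_generator(device_type: str, device_index: int) -> str:
    """Device-agnostic entry point: look up the hooks, then delegate."""
    hooks = _registered_hooks.get(device_type)
    if hooks is None:
        raise RuntimeError(f"no hooks registered for {device_type!r}")
    return hooks.get_default_generator(device_index)


register_hooks("toy", ToyHooks())  # happens once, at "module load"
```

The caller of `default_generator` never names `ToyHooks` directly, which is the property that lets the core stay device-agnostic.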
### Layer 5: Accelerator Hooks
The hooks interface provides the abstraction that PyTorch uses to delegate to device-specific implementations:
```{eval-rst}
.. literalinclude:: ../../../test/cpp_extensions/open_registration_extension/torch_openreg/csrc/runtime/OpenRegHooks.h
:language: c++
:start-after: LITERALINCLUDE START: OPENREG HOOK EXAMPLES
:end-before: LITERALINCLUDE END: OPENREG HOOK EXAMPLES
:linenos:
```
The `getDefaultGenerator` hook method overrides the base interface and delegates to `getDefaultOpenRegGenerator`, which manages the actual generator instances.
### Layer 6: Device-Specific Implementation
The device-specific implementation manages per-device generator instances:
```{eval-rst}
.. literalinclude:: ../../../test/cpp_extensions/open_registration_extension/torch_openreg/csrc/runtime/OpenRegGenerator.cpp
:language: c++
:start-after: LITERALINCLUDE START: OPENREG GET DEFAULT GENERATOR IMPL
:end-before: LITERALINCLUDE END: OPENREG GET DEFAULT GENERATOR IMPL
:linenos:
```
This function maintains a static vector of generators (one per device), initializes them on first access, validates the device index, and returns the appropriate generator instance.
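
The behaviour described above (one generator per device, created on first access, with index validation) can be sketched in Python as follows. This is a simplified analogy under stated assumptions, not the OpenReg implementation: `FakeGenerator`, `DEVICE_COUNT`, and `get_default_generator` are hypothetical names, and treating `-1` as "the current device" is an assumption made for this example.

```python
# Illustrative analogy only; these names are hypothetical, not PyTorch APIs.
from typing import List, Optional

DEVICE_COUNT = 2  # assumed number of devices for this sketch


class FakeGenerator:
    """Stand-in for a per-device random number generator."""

    def __init__(self, device_index: int, seed: int = 0) -> None:
        self.device_index = device_index
        self.seed = seed


# Module-level storage plays the role of the static vector.
_default_generators: Optional[List[FakeGenerator]] = None


def get_default_generator(device_index: int = -1, current_device: int = 0) -> FakeGenerator:
    """Return the default generator for a device, creating all of them on first use."""
    global _default_generators
    if _default_generators is None:
        # First access: build one generator per device.
        _default_generators = [FakeGenerator(i) for i in range(DEVICE_COUNT)]
    # Assumption for this sketch: -1 selects the current device.
    idx = current_device if device_index == -1 else device_index
    if not 0 <= idx < DEVICE_COUNT:
        raise IndexError(f"invalid device index {idx}")
    return _default_generators[idx]
```
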
[random.py]: https://github.com/pytorch/pytorch/tree/main/test/cpp_extensions/open_registration_extension/torch_openreg/torch_openreg/openreg/random.py#L48-L53 "random.py"
[Context.h]: https://github.com/pytorch/pytorch/tree/main/aten/src/ATen/Context.h#L61-L102 "Context.h"

View File

@ -42,6 +42,7 @@ Next, we will delve into each chapter of this guide. Each chapter focuses on a k
:glob:
:maxdepth: 1
hooks
autoload
operators
amp

View File

@ -14,6 +14,10 @@ Utils
sdpa_kernel
SDPBackend
register_flash_attention_impl
activate_flash_attention_impl
list_flash_attention_impls
current_flash_attention_impl
Submodules
----------

View File

@ -24,15 +24,11 @@ def gen_data(special_op_lists, analysis_name):
all_ops = get_ops_for_key(None)
composite_ops = get_ops_for_key("CompositeImplicitAutograd")
noncomposite_ops = all_ops - composite_ops
with open("../../aten/src/ATen/native/native_functions.yaml") as f:
ops = yaml.load(f.read(), Loader=yaml.CLoader)
ops = yaml.load(
open("../../aten/src/ATen/native/native_functions.yaml").read(),
Loader=yaml.CLoader,
)
annotated_ops = {
a.strip(): b.strip() for a, b in list(csv.reader(open("annotated_ops")))
}
with open("annotated_ops") as f:
annotated_ops = {a.strip(): b.strip() for a, b in csv.reader(f)}
uniq_ops = []
uniq_names = set()

View File

@ -10,7 +10,7 @@ tp2_dir="$top_dir/third_party"
pip install ninja
# Install onnx
pip install --no-use-pep517 -e "$tp2_dir/onnx"
pip install -e "$tp2_dir/onnx"
# Install caffe2 and pytorch
pip install -r "$top_dir/caffe2/requirements.txt"

View File

@ -1358,6 +1358,45 @@ class concat_license_files:
# Need to create the proper LICENSE.txt for the wheel
class bdist_wheel(setuptools.command.bdist_wheel.bdist_wheel):
def _wrap_headers_with_macro(self, bdist_dir: Path) -> None:
"""Wrap all header files with #if !defined(TORCH_STABLE_ONLY) && !defined(TORCH_TARGET_VERSION).
Excludes:
- torch/include/torch/headeronly/*
- torch/include/torch/csrc/stable/*
- torch/include/torch/csrc/inductor/aoti_torch/c/ (only shim headers)
- torch/include/torch/csrc/inductor/aoti_torch/generated/
"""
header_extensions = (".h", ".hpp", ".cuh")
header_files = [
f for ext in header_extensions for f in bdist_dir.rglob(f"*{ext}")
]
# Paths to exclude from wrapping
exclude_dir_patterns = [
"torch/include/torch/headeronly/",
"torch/include/torch/csrc/stable/",
"torch/include/torch/csrc/inductor/aoti_torch/c/",
"torch/include/torch/csrc/inductor/aoti_torch/generated/",
]
for header_file in header_files:
rel_path = header_file.relative_to(bdist_dir).as_posix()
if any(rel_path.startswith(pattern) for pattern in exclude_dir_patterns):
report(f"Skipping header: {rel_path}")
continue
original_content = header_file.read_text(encoding="utf-8")
wrapped_content = (
"#if !defined(TORCH_STABLE_ONLY) && !defined(TORCH_TARGET_VERSION)\n"
f"{original_content}"
"\n#endif // !defined(TORCH_STABLE_ONLY) && !defined(TORCH_TARGET_VERSION)\n"
)
header_file.write_text(wrapped_content, encoding="utf-8")
report(f"Wrapped header: {rel_path}")
def run(self) -> None:
with concat_license_files(include_files=True):
super().run()
@ -1380,6 +1419,14 @@ class bdist_wheel(setuptools.command.bdist_wheel.bdist_wheel):
# need an __init__.py file otherwise we wouldn't have a package
(bdist_dir / "torch" / "__init__.py").touch()
# Wrap all header files with TORCH_STABLE_ONLY macro
assert self.bdist_dir is not None, "bdist_dir should be set during wheel build"
bdist_dir = Path(self.bdist_dir)
report(
"-- Wrapping header files with if !defined(TORCH_STABLE_ONLY) && !defined(TORCH_TARGET_VERSION)"
)
self._wrap_headers_with_macro(bdist_dir)
class clean(Command):
user_options: ClassVar[list[tuple[str, str | None, str]]] = []
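
The header-wrapping step in `_wrap_headers_with_macro` can be tried in isolation. The sketch below mirrors the logic shown in the diff against a temporary directory. `wrap_headers` and the sample file names are made up for this example, only two of the four excluded prefixes are listed, and only `.h` files are scanned to keep it short.

```python
import tempfile
from pathlib import Path

GUARD = "!defined(TORCH_STABLE_ONLY) && !defined(TORCH_TARGET_VERSION)"

# Subset of the excluded directories from the diff, kept short for the sketch.
EXCLUDED_PREFIXES = (
    "torch/include/torch/headeronly/",
    "torch/include/torch/csrc/stable/",
)


def wrap_headers(root: Path) -> list[str]:
    """Wrap every non-excluded .h file under root in the guard; return wrapped paths."""
    wrapped = []
    for header in sorted(root.rglob("*.h")):
        rel = header.relative_to(root).as_posix()
        if rel.startswith(EXCLUDED_PREFIXES):
            continue  # stable / header-only headers stay untouched
        body = header.read_text(encoding="utf-8")
        header.write_text(
            f"#if {GUARD}\n{body}\n#endif // {GUARD}\n", encoding="utf-8"
        )
        wrapped.append(rel)
    return wrapped


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        sample = root / "torch/include/ATen/Foo.h"  # hypothetical header
        sample.parent.mkdir(parents=True)
        sample.write_text("int x;\n", encoding="utf-8")
        print(wrap_headers(root))
        print(sample.read_text(encoding="utf-8"))
```

With the guard in place, a translation unit compiled with `-DTORCH_STABLE_ONLY` sees the wrapped headers as empty, so anything it uses from them fails to compile.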

View File

@ -308,12 +308,16 @@ class StepcurrentPlugin:
self.report_status = ""
assert config.cache is not None
self.cache: pytest.Cache = config.cache
self.directory = f"{STEPCURRENT_CACHE_DIR}/{config.getoption('stepcurrent')}"
self.lastrun: Optional[str] = self.cache.get(self.directory, None)
directory = f"{STEPCURRENT_CACHE_DIR}/{config.getoption('stepcurrent')}"
self.lastrun_location = f"{directory}/lastrun"
self.lastrun: Optional[str] = self.cache.get(self.lastrun_location, None)
self.initial_val = self.lastrun
self.skip: bool = config.getoption("stepcurrent_skip")
self.run_single: bool = config.getoption("run_single")
self.made_failing_xml_location = f"{directory}/made_failing_xml"
self.cache.set(self.made_failing_xml_location, False)
def pytest_collection_modifyitems(self, config: Config, items: list[Any]) -> None:
if not self.lastrun:
self.report_status = "Cannot find last run test, not skipping"
@ -349,8 +353,10 @@ class StepcurrentPlugin:
def pytest_runtest_protocol(self, item, nextitem) -> None:
self.lastrun = item.nodeid
self.cache.set(self.directory, self.lastrun)
self.cache.set(self.lastrun_location, self.lastrun)
def pytest_sessionfinish(self, session, exitstatus):
if exitstatus == 0:
self.cache.set(self.directory, self.initial_val)
self.cache.set(self.lastrun_location, self.initial_val)
if exitstatus != 0:
self.cache.set(self.made_failing_xml_location, True)
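
The change above splits the single cache entry into two keys under the same directory: one for the last run test and one flag recording a failing session. A minimal sketch of that bookkeeping, using a dict-backed stand-in for `pytest.Cache`, might look like this. `DictCache` and `StepState` are hypothetical names for illustration and are not part of pytest or the plugin.

```python
from typing import Any, Dict, Optional


class DictCache:
    """In-memory stand-in for pytest's cache (get/set by string key)."""

    def __init__(self) -> None:
        self._data: Dict[str, Any] = {}

    def get(self, key: str, default: Any) -> Any:
        return self._data.get(key, default)

    def set(self, key: str, value: Any) -> None:
        self._data[key] = value


class StepState:
    """Tracks the last run test and whether the session ended in failure."""

    def __init__(self, cache: DictCache, directory: str) -> None:
        self.cache = cache
        self.lastrun_location = f"{directory}/lastrun"
        self.made_failing_xml_location = f"{directory}/made_failing_xml"
        self.lastrun: Optional[str] = cache.get(self.lastrun_location, None)
        self.initial_val = self.lastrun
        # Reset the failure flag at the start of every session.
        cache.set(self.made_failing_xml_location, False)

    def on_test_start(self, nodeid: str) -> None:
        self.lastrun = nodeid
        self.cache.set(self.lastrun_location, nodeid)

    def on_session_finish(self, exitstatus: int) -> None:
        if exitstatus == 0:
            # Clean exit: restore the value seen at startup.
            self.cache.set(self.lastrun_location, self.initial_val)
        else:
            self.cache.set(self.made_failing_xml_location, True)
```
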

View File

@ -38,7 +38,7 @@ using torch::stable::Tensor;
Tensor sgd_out_of_place(
const Tensor param,
const Tensor grad,
const float weight_decay,
const double weight_decay,
const double lr,
const bool maximize) {
STD_TORCH_CHECK(param.dim() == 1, "param must be 1D");
@ -57,7 +57,7 @@ Tensor sgd_out_of_place(
reinterpret_cast<float*>(param.data_ptr()),
reinterpret_cast<float*>(grad.data_ptr()),
reinterpret_cast<float*>(out.data_ptr()),
weight_decay,
float(weight_decay),
lr,
maximize,
param.numel()
@ -66,44 +66,29 @@ Tensor sgd_out_of_place(
return out;
}
void boxed_sgd_out_of_place(StableIValue* stack, uint64_t num_args, uint64_t num_outputs) {
Tensor res = sgd_out_of_place(
torch::stable::detail::to<Tensor>(stack[0]),
torch::stable::detail::to<Tensor>(stack[1]),
float(torch::stable::detail::to<double>(stack[2])),
torch::stable::detail::to<double>(stack[3]),
torch::stable::detail::to<bool>(stack[4]));
stack[0] = torch::stable::detail::from(res);
}
STABLE_TORCH_LIBRARY(libtorch_agnostic, m) {
m.def("sgd_out_of_place(Tensor param, Tensor grad, float weight_decay, float lr, bool maximize) -> Tensor");
}
STABLE_TORCH_LIBRARY_IMPL(libtorch_agnostic, CPU, m) {
m.impl("sgd_out_of_place", &boxed_sgd_out_of_place);
m.impl("sgd_out_of_place", TORCH_BOX(&sgd_out_of_place));
}
Tensor identity(Tensor t) {
return t;
}
void boxed_identity(StableIValue* stack, uint64_t num_args, uint64_t num_outputs) {
Tensor res = identity(torch::stable::detail::to<Tensor>(stack[0]));
stack[0] = torch::stable::detail::from(res);
}
STABLE_TORCH_LIBRARY_FRAGMENT(libtorch_agnostic, m) {
m.def("identity(Tensor t) -> Tensor");
}
STABLE_TORCH_LIBRARY_IMPL(libtorch_agnostic, CUDA, m) {
m.impl("identity", &boxed_identity);
m.impl("identity", TORCH_BOX(&identity));
}
STABLE_TORCH_LIBRARY_IMPL(libtorch_agnostic, CPU, m) {
m.impl("identity", &boxed_identity);
m.impl("identity", TORCH_BOX(&identity));
}
Tensor my_abs(Tensor t) {
@ -114,17 +99,12 @@ Tensor my_abs(Tensor t) {
return torch::stable::detail::to<Tensor>(stack[0]);
}
void boxed_my_abs(StableIValue* stack, uint64_t num_args, uint64_t num_outputs) {
Tensor tensor_res = my_abs(torch::stable::detail::to<Tensor>(stack[0]));
stack[0] = torch::stable::detail::from(tensor_res);
}
STABLE_TORCH_LIBRARY_FRAGMENT(libtorch_agnostic, m) {
m.def("my_abs(Tensor t) -> Tensor");
}
STABLE_TORCH_LIBRARY_IMPL(libtorch_agnostic, CompositeExplicitAutograd, m) {
m.impl("my_abs", &boxed_my_abs);
m.impl("my_abs", TORCH_BOX(&my_abs));
}
Tensor my_ones_like(Tensor t, StableIValue device) {
@ -145,17 +125,12 @@ Tensor my_ones_like(Tensor t, StableIValue device) {
return torch::stable::detail::to<Tensor>(stack[0]);
}
void boxed_my_ones_like(StableIValue* stack, uint64_t num_args, uint64_t num_outputs) {
Tensor res = my_ones_like(torch::stable::detail::to<Tensor>(stack[0]), stack[1]);
stack[0] = torch::stable::detail::from(res);
}
STABLE_TORCH_LIBRARY_FRAGMENT(libtorch_agnostic, m) {
m.def("my_ones_like(Tensor t, Device d) -> Tensor");
}
STABLE_TORCH_LIBRARY_IMPL(libtorch_agnostic, CompositeExplicitAutograd, m) {
m.impl("my_ones_like", &boxed_my_ones_like);
m.impl("my_ones_like", TORCH_BOX(&my_ones_like));
}
std::tuple<Tensor, Tensor, bool> exp_neg_is_leaf(Tensor t1, Tensor t2, Tensor t3) {
@ -177,19 +152,12 @@ std::tuple<Tensor, Tensor, bool> exp_neg_is_leaf(Tensor t1, Tensor t2, Tensor t3
torch::stable::detail::to<bool>(stack_is_leaf[0]));
}
void boxed_exp_neg_is_leaf(StableIValue* stack, uint64_t num_args, uint64_t num_outputs) {
auto tuple = exp_neg_is_leaf(torch::stable::detail::to<Tensor>(stack[0]), torch::stable::detail::to<Tensor>(stack[1]), torch::stable::detail::to<Tensor>(stack[2]));
stack[0] = torch::stable::detail::from(std::get<0>(tuple));
stack[1] = torch::stable::detail::from(std::get<1>(tuple));
stack[2] = torch::stable::detail::from(std::get<2>(tuple));
}
STABLE_TORCH_LIBRARY_FRAGMENT(libtorch_agnostic, m) {
m.def("exp_neg_is_leaf(Tensor t1, Tensor t2, Tensor t3) -> (Tensor, Tensor, bool)");
}
STABLE_TORCH_LIBRARY_IMPL(libtorch_agnostic, CompositeExplicitAutograd, m) {
m.impl("exp_neg_is_leaf", &boxed_exp_neg_is_leaf);
m.impl("exp_neg_is_leaf", TORCH_BOX(&exp_neg_is_leaf));
}
Tensor neg_exp(Tensor t) {
@ -200,17 +168,12 @@ Tensor neg_exp(Tensor t) {
return torch::stable::detail::to<Tensor>(stack[0]);
}
void boxed_neg_exp(StableIValue* stack, uint64_t num_args, uint64_t num_outputs) {
Tensor res = neg_exp(torch::stable::detail::to<Tensor>(stack[0]));
stack[0] = torch::stable::detail::from(res);
}
STABLE_TORCH_LIBRARY_FRAGMENT(libtorch_agnostic, m) {
m.def("neg_exp(Tensor t) -> Tensor");
}
STABLE_TORCH_LIBRARY_IMPL(libtorch_agnostic, CompositeExplicitAutograd, m) {
m.impl("neg_exp", &boxed_neg_exp);
m.impl("neg_exp", TORCH_BOX(&neg_exp));
}
Tensor divide_neg_exp(Tensor t) {
@ -229,108 +192,53 @@ Tensor divide_neg_exp(Tensor t) {
return torch::stable::detail::to<Tensor>(stack_div[0]);
}
void boxed_divide_neg_exp(StableIValue* stack, uint64_t num_args, uint64_t num_outputs) {
Tensor res = divide_neg_exp(torch::stable::detail::to<Tensor>(stack[0]));
stack[0] = torch::stable::detail::from(res);
}
STABLE_TORCH_LIBRARY_FRAGMENT(libtorch_agnostic, m) {
m.def("divide_neg_exp(Tensor t) -> Tensor");
}
STABLE_TORCH_LIBRARY_IMPL(libtorch_agnostic, CompositeExplicitAutograd, m) {
m.impl("divide_neg_exp", &boxed_divide_neg_exp);
m.impl("divide_neg_exp", TORCH_BOX(&divide_neg_exp));
}
bool is_contiguous(Tensor t) {
return t.is_contiguous();
}
void boxed_is_contiguous(StableIValue* stack, uint64_t num_args, uint64_t num_outputs) {
bool res = is_contiguous(torch::stable::detail::to<Tensor>(stack[0]));
stack[0] = torch::stable::detail::from(res);
}
STABLE_TORCH_LIBRARY_FRAGMENT(libtorch_agnostic, m) {
m.def("is_contiguous(Tensor t) -> bool");
}
STABLE_TORCH_LIBRARY_IMPL(libtorch_agnostic, CompositeExplicitAutograd, m) {
m.impl("is_contiguous", &boxed_is_contiguous);
m.impl("is_contiguous", TORCH_BOX(&is_contiguous));
}
Tensor my_transpose(Tensor t, int64_t dim0, int64_t dim1) {
return transpose(t, dim0, dim1);
}
void boxed_my_transpose(StableIValue* stack, uint64_t num_args, uint64_t num_outputs) {
auto res = my_transpose(torch::stable::detail::to<Tensor>(stack[0]), torch::stable::detail::to<int64_t>(stack[1]), torch::stable::detail::to<int64_t>(stack[2]));
stack[0] = torch::stable::detail::from(res);
}
Tensor my_empty_like(Tensor t) {
return empty_like(t);
}
void boxed_empty_like(StableIValue* stack, uint64_t num_args, uint64_t num_outputs) {
auto res = my_empty_like(torch::stable::detail::to<Tensor>(stack[0]));
stack[0] = torch::stable::detail::from(res);
}
bool my_is_cpu(Tensor t) {
return t.is_cpu();
}
void boxed_my_is_cpu(StableIValue* stack, uint64_t num_args, uint64_t num_outputs) {
auto res = my_is_cpu(torch::stable::detail::to<Tensor>(stack[0]));
stack[0] = torch::stable::detail::from(res);
}
Tensor fill_infinity(Tensor t) {
auto value = std::numeric_limits<float>::infinity();
return fill_(t, value);
}
void boxed_fill_infinity(
StableIValue* stack,
uint64_t num_args,
uint64_t num_outputs) {
auto res = fill_infinity(torch::stable::detail::to<Tensor>(stack[0]));
stack[0] = torch::stable::detail::from(res);
}
Tensor my_pad(Tensor t) {
std::string mode = "constant";
double value = 0.0;
return pad(t, {1, 2, 2, 1}, mode, value);
}
void boxed_my_pad(
StableIValue* stack,
uint64_t num_args,
uint64_t num_outputs) {
auto res = my_pad(torch::stable::detail::to<Tensor>(stack[0]));
stack[0] = torch::stable::detail::from(res);
}
Tensor my_narrow(Tensor t, int64_t dim, int64_t start, int64_t length) {
return narrow(t, dim, start, length);
}
void boxed_my_narrow(
StableIValue* stack,
uint64_t num_args,
uint64_t num_outputs) {
auto res = my_narrow(
torch::stable::detail::to<Tensor>(stack[0]),
torch::stable::detail::to<int64_t>(stack[1]),
torch::stable::detail::to<int64_t>(stack[2]),
torch::stable::detail::to<int64_t>(stack[3]));
stack[0] = torch::stable::detail::from(res);
}
Tensor my_new_empty_dtype_variant(Tensor t) {
// Still using a std::vector below even though people can just pass in an
// initializer list (which will be implicitly converted to a HeaderOnlyArrayRef)
@ -342,40 +250,19 @@ Tensor my_new_empty_dtype_variant(Tensor t) {
return new_empty(t, sizes, dtype);
}
void boxed_my_new_empty_dtype_variant(StableIValue* stack, uint64_t num_args, uint64_t num_outputs) {
auto res = my_new_empty_dtype_variant(torch::stable::detail::to<Tensor>(stack[0]));
stack[0] = torch::stable::detail::from(res);
}
Tensor my_new_zeros_dtype_variant(Tensor t) {
auto dtype = std::make_optional(at::ScalarType::Float);
return new_zeros(t, {2, 5}, dtype);
}
void boxed_my_new_zeros_dtype_variant(StableIValue* stack, uint64_t num_args, uint64_t num_outputs) {
auto res = my_new_zeros_dtype_variant(torch::stable::detail::to<Tensor>(stack[0]));
stack[0] = torch::stable::detail::from(res);
}
Tensor my_copy_(Tensor dst, Tensor src, bool non_blocking) {
return copy_(dst, src, non_blocking);
}
void boxed_my_copy_(StableIValue* stack, uint64_t num_args, uint64_t num_outputs) {
Tensor tensor_res = my_copy_(torch::stable::detail::to<Tensor>(stack[0]), torch::stable::detail::to<Tensor>(stack[1]), torch::stable::detail::to<bool>(stack[2]));
stack[0] = torch::stable::detail::from(tensor_res);
}
Tensor my_clone(Tensor t) {
return clone(t);
}
void boxed_my_clone(StableIValue* stack, uint64_t num_args, uint64_t num_outputs) {
Tensor tensor_res = my_clone(torch::stable::detail::to<Tensor>(stack[0]));
stack[0] = torch::stable::detail::from(tensor_res);
}
STABLE_TORCH_LIBRARY_FRAGMENT(libtorch_agnostic, m) {
m.def("my_transpose(Tensor t, int dim0, int dim1) -> Tensor");
m.def("my_empty_like(Tensor t) -> Tensor");
@ -389,57 +276,39 @@ STABLE_TORCH_LIBRARY_FRAGMENT(libtorch_agnostic, m) {
}
STABLE_TORCH_LIBRARY_IMPL(libtorch_agnostic, CompositeExplicitAutograd, m) {
m.impl("my_transpose", &boxed_my_transpose);
m.impl("my_empty_like", &boxed_empty_like);
m.impl("fill_infinity", &boxed_fill_infinity);
m.impl("my_is_cpu", &boxed_my_is_cpu);
m.impl("my_new_empty_dtype_variant", &boxed_my_new_empty_dtype_variant);
m.impl("my_new_zeros_dtype_variant", &boxed_my_new_zeros_dtype_variant);
m.impl("my_copy_", &boxed_my_copy_);
m.impl("my_clone", &boxed_my_clone);
m.impl("my_transpose", TORCH_BOX(&my_transpose));
m.impl("my_empty_like", TORCH_BOX(&my_empty_like));
m.impl("fill_infinity", TORCH_BOX(&fill_infinity));
m.impl("my_is_cpu", TORCH_BOX(&my_is_cpu));
m.impl("my_new_empty_dtype_variant", TORCH_BOX(&my_new_empty_dtype_variant));
m.impl("my_new_zeros_dtype_variant", TORCH_BOX(&my_new_zeros_dtype_variant));
m.impl("my_copy_", TORCH_BOX(&my_copy_));
m.impl("my_clone", TORCH_BOX(&my_clone));
}
STABLE_TORCH_LIBRARY_IMPL(libtorch_agnostic, CompositeImplicitAutograd, m) {
m.impl("my_pad", &boxed_my_pad);
m.impl("my_narrow", &boxed_my_narrow);
m.impl("my_pad", TORCH_BOX(&my_pad));
m.impl("my_narrow", TORCH_BOX(&my_narrow));
}
Tensor my_zero_(Tensor t) {
return zero_(t);
}
void boxed_my_zero_(StableIValue* stack, uint64_t num_args, uint64_t num_outputs) {
auto res = my_zero_(torch::stable::detail::to<Tensor>(stack[0]));
stack[0] = torch::stable::detail::from(res);
}
Tensor my_amax(Tensor t) {
return amax(t, 0, false);
}
void boxed_my_amax(StableIValue* stack, uint64_t num_args, uint64_t num_outputs) {
auto res = my_amax(torch::stable::detail::to<Tensor>(stack[0]));
stack[0] = torch::stable::detail::from(res);
}
Tensor my_amax_vec(Tensor t) {
return amax(t, {0,1}, false);
}
void boxed_my_amax_vec(StableIValue* stack, uint64_t num_args, uint64_t num_outputs) {
auto res = my_amax_vec(torch::stable::detail::to<Tensor>(stack[0]));
stack[0] = torch::stable::detail::from(res);
}
STABLE_TORCH_LIBRARY_FRAGMENT(libtorch_agnostic, m) {
m.def("my_zero_(Tensor(a!) t) -> Tensor(a!)");
m.def("my_amax(Tensor a) -> Tensor");
m.def("my_amax_vec(Tensor a) -> Tensor");
m.def("my_is_cpu(Tensor t) -> bool");
}
STABLE_TORCH_LIBRARY_IMPL(libtorch_agnostic, CPU, m) {
m.impl("my_zero_", &boxed_my_zero_);
m.def("test_default_constructor(bool undefined) -> bool");
}
bool test_default_constructor(bool defined) {
@ -461,22 +330,12 @@ bool test_default_constructor(bool defined) {
return out.defined();
}
void boxed_test_default_constructor(
StableIValue* stack,
uint64_t num_args,
uint64_t num_outputs) {
bool res = test_default_constructor(torch::stable::detail::to<bool>(stack[0]));
stack[0] = torch::stable::detail::from(res);
}
STABLE_TORCH_LIBRARY_FRAGMENT(libtorch_agnostic, m) {
m.def("test_default_constructor(bool undefined) -> bool");
}
STABLE_TORCH_LIBRARY_IMPL(libtorch_agnostic, CompositeExplicitAutograd, m) {
m.impl("test_default_constructor", &boxed_test_default_constructor);
m.impl("my_amax", &boxed_my_amax);
m.impl("my_amax_vec", &boxed_my_amax_vec);
m.impl("my_zero_", TORCH_BOX(&my_zero_));
m.impl("my_amax", TORCH_BOX(&my_amax));
m.impl("my_amax_vec", TORCH_BOX(&my_amax_vec));
m.impl("test_default_constructor", TORCH_BOX(&test_default_constructor));
}
std::vector<Tensor> my__foreach_mul(torch::headeronly::HeaderOnlyArrayRef<Tensor> self, torch::headeronly::HeaderOnlyArrayRef<Tensor> other) {
@ -485,23 +344,11 @@ std::vector<Tensor> my__foreach_mul(torch::headeronly::HeaderOnlyArrayRef<Tensor
return torch::stable::detail::to<std::vector<Tensor>>(stack[0]);
}
void boxed_my__foreach_mul(StableIValue* stack, uint64_t num_args, uint64_t num_outputs) {
// Why is the following NOT torch::stable::detail::to<HeaderOnlyArrayRef<Tensor>>(stack[0])? Because calling `to`
// on a StableIValue means that the result is owning its underlying data now! HeaderOnlyArrayRef
// is not owning, so it cannot safely steward the result of the torch::stable::detail::to<>.
auto res = my__foreach_mul(torch::stable::detail::to<std::vector<Tensor>>(stack[0]), torch::stable::detail::to<std::vector<Tensor>>(stack[1]));
stack[0] = torch::stable::detail::from(res);
}
void my__foreach_mul_(torch::headeronly::HeaderOnlyArrayRef<Tensor> self, torch::headeronly::HeaderOnlyArrayRef<Tensor> other) {
std::array<StableIValue, 2> stack = {torch::stable::detail::from(self), torch::stable::detail::from(other)};
aoti_torch_call_dispatcher("aten::_foreach_mul_", "List", stack.data());
}
void boxed_my__foreach_mul_(StableIValue* stack, uint64_t num_args, uint64_t num_outputs) {
my__foreach_mul_(torch::stable::detail::to<std::vector<Tensor>>(stack[0]), torch::stable::detail::to<std::vector<Tensor>>(stack[1]));
}
std::vector<Tensor> make_tensor_clones_and_call_foreach(Tensor t1, Tensor t2) {
// This function tests that my__foreach_mul can take in std::initializer_lists
// in addition to std::vectors.
@ -512,11 +359,6 @@ std::vector<Tensor> make_tensor_clones_and_call_foreach(Tensor t1, Tensor t2) {
return my__foreach_mul({t1_1, t2_1}, {t1_2, t2_2});
}
void boxed_make_tensor_clones_and_call_foreach(StableIValue* stack, uint64_t num_args, uint64_t num_outputs) {
auto res = make_tensor_clones_and_call_foreach(torch::stable::detail::to<Tensor>(stack[0]), torch::stable::detail::to<Tensor>(stack[1]));
stack[0] = torch::stable::detail::from(res);
}
STABLE_TORCH_LIBRARY_FRAGMENT(libtorch_agnostic, m) {
m.def("my__foreach_mul(Tensor[] self, Tensor[] other) -> Tensor[]");
m.def("my__foreach_mul_(Tensor(a!)[] self, Tensor[] other) -> ()");
@ -524,9 +366,9 @@ STABLE_TORCH_LIBRARY_FRAGMENT(libtorch_agnostic, m) {
}
STABLE_TORCH_LIBRARY_IMPL(libtorch_agnostic, CompositeExplicitAutograd, m) {
m.impl("my__foreach_mul", &boxed_my__foreach_mul);
m.impl("my__foreach_mul_", &boxed_my__foreach_mul_);
m.impl("make_tensor_clones_and_call_foreach", &boxed_make_tensor_clones_and_call_foreach);
m.impl("my__foreach_mul", TORCH_BOX(&my__foreach_mul));
m.impl("my__foreach_mul_", TORCH_BOX(&my__foreach_mul_));
m.impl("make_tensor_clones_and_call_foreach", TORCH_BOX(&make_tensor_clones_and_call_foreach));
}
// Test functions for torch::stable::Tensor device method
@ -690,14 +532,6 @@ int64_t test_device_guard(int64_t device_index) {
return currentDevice;
}
void boxed_test_device_guard(
StableIValue* stack,
uint64_t num_args,
uint64_t num_outputs) {
int res = test_device_guard(static_cast<int64_t>(torch::stable::detail::to<int64_t>(stack[0])));
stack[0] = torch::stable::detail::from(res);
}
int64_t test_device_guard_set_index() {
using torch::stable::accelerator::DeviceGuard;
@ -709,14 +543,6 @@ int64_t test_device_guard_set_index() {
return currentDevice;
}
void boxed_test_device_guard_set_index(
StableIValue* stack,
uint64_t num_args,
uint64_t num_outputs) {
int64_t res = test_device_guard_set_index();
stack[0] = torch::stable::detail::from(res);
}
int64_t test_stream(int32_t device_index) {
STD_TORCH_CHECK(
device_index >= std::numeric_limits<int32_t>::min() &&
@ -726,26 +552,10 @@ int64_t test_stream(int32_t device_index) {
return torch::stable::accelerator::getCurrentStream(device_index).id();
}
void boxed_test_stream(
StableIValue* stack,
uint64_t num_args,
uint64_t num_outputs) {
int64_t res = test_stream(static_cast<int64_t>(torch::stable::detail::to<int64_t>(stack[0])));
stack[0] = torch::stable::detail::from(res);
}
int64_t test_get_current_device_index() {
return torch::stable::accelerator::getCurrentDeviceIndex();
}
void boxed_test_get_current_device_index(
StableIValue* stack,
uint64_t num_args,
uint64_t num_outputs) {
int64_t res = test_get_current_device_index();
stack[0] = torch::stable::detail::from(res);
}
STABLE_TORCH_LIBRARY_FRAGMENT(libtorch_agnostic, m) {
m.def("test_device_guard(int device_index) -> int");
m.def("test_device_guard_set_index() -> int");
@ -754,10 +564,10 @@ STABLE_TORCH_LIBRARY_FRAGMENT(libtorch_agnostic, m) {
}
STABLE_TORCH_LIBRARY_IMPL(libtorch_agnostic, CompositeExplicitAutograd, m) {
m.impl("test_device_guard", &boxed_test_device_guard);
m.impl("test_device_guard_set_index", &boxed_test_device_guard_set_index);
m.impl("test_stream", &boxed_test_stream);
m.impl("test_get_current_device_index", &boxed_test_get_current_device_index);
m.impl("test_device_guard", TORCH_BOX(&test_device_guard));
m.impl("test_device_guard_set_index", TORCH_BOX(&test_device_guard_set_index));
m.impl("test_stream", TORCH_BOX(&test_stream));
m.impl("test_get_current_device_index", TORCH_BOX(&test_get_current_device_index));
}
#endif // LAE_USE_CUDA
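
The hand-written `boxed_*` wrappers in this diff all follow the same pattern: read the arguments from the stack, call the typed function, and write the results back starting at `stack[0]`. The Python sketch below models that pattern generically. It is an analogy for readers, not a description of how `TORCH_BOX` is implemented. `box` is a hypothetical helper, and the stack is assumed to have room for `max(num_args, num_outputs)` entries.

```python
from typing import Any, Callable, List


def box(fn: Callable[..., Any]) -> Callable[[List[Any], int, int], None]:
    """Turn a typed function into one that operates on a shared argument stack."""

    def boxed(stack: List[Any], num_args: int, num_outputs: int) -> None:
        # Unpack the inputs from the front of the stack and call the function.
        result = fn(*stack[:num_args])
        if num_outputs == 0:
            return
        # A single return value is treated as a 1-tuple of outputs.
        outputs = result if isinstance(result, tuple) else (result,)
        if len(outputs) != num_outputs:
            raise ValueError(f"expected {num_outputs} outputs, got {len(outputs)}")
        # Write the outputs back in place, starting at stack[0].
        for i, out in enumerate(outputs):
            stack[i] = out

    return boxed


def add(a: int, b: int) -> int:
    return a + b


if __name__ == "__main__":
    stack = [2, 3]
    box(add)(stack, 2, 1)
    print(stack[0])
```

Generating the wrapper from the function's signature removes one hand-written function per operator, which is where most of the deleted lines in this file come from.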

View File

@ -33,7 +33,7 @@ class clean(distutils.command.clean.clean):
def get_extension():
extra_compile_args = {
"cxx": ["-fdiagnostics-color=always"],
"cxx": ["-fdiagnostics-color=always", "-DTORCH_STABLE_ONLY"],
}
extension = CppExtension

View File

@ -5,6 +5,7 @@ static std::vector<at::Generator> default_generators;
namespace c10::openreg {
// LITERALINCLUDE START: OPENREG GET DEFAULT GENERATOR IMPL
const at::Generator& getDefaultOpenRegGenerator(c10::DeviceIndex device_index) {
static bool flag [[maybe_unused]] = []() {
auto deivce_nums = device_count();
@ -24,5 +25,6 @@ const at::Generator& getDefaultOpenRegGenerator(c10::DeviceIndex device_index) {
}
return default_generators[idx];
}
// LITERALINCLUDE END: OPENREG GET DEFAULT GENERATOR IMPL
} // namespace c10::openreg

View File

@ -1,5 +1,6 @@
#include "OpenRegHooks.h"
// LITERALINCLUDE START: OPENREG HOOK REGISTER
namespace c10::openreg {
static bool register_hook_flag [[maybe_unused]] = []() {
@ -9,3 +10,4 @@ static bool register_hook_flag [[maybe_unused]] = []() {
}();
} // namespace c10::openreg
// LITERALINCLUDE END: OPENREG HOOK REGISTER

View File

@ -8,17 +8,58 @@
#include <include/openreg.h>
#include "OpenRegFunctions.h"
#include "OpenRegGenerator.h"
namespace c10::openreg {
struct OpenRegHooksInterface : public at::PrivateUse1HooksInterface {
struct OPENREG_EXPORT OpenRegHooksInterface : public at::PrivateUse1HooksInterface {
OpenRegHooksInterface() {};
~OpenRegHooksInterface() override = default;
bool hasPrimaryContext(c10::DeviceIndex device_index) const override {
void init() const override {
// Initialize OpenReg runtime if needed
// This is called when PyTorch first accesses the device
}
bool hasPrimaryContext(DeviceIndex device_index) const override {
return true;
}
bool isBuilt() const override {
// This extension is compiled as part of the OpenReg test extension.
return true;
}
bool isAvailable() const override {
// Consider OpenReg available if there's at least one device reported.
return device_count() > 0;
}
DeviceIndex deviceCount() const override {
return device_count();
}
void setCurrentDevice(DeviceIndex device) const override {
set_device(device);
}
DeviceIndex getCurrentDevice() const override {
return current_device();
}
DeviceIndex exchangeDevice(DeviceIndex device) const override {
return ExchangeDevice(device);
}
DeviceIndex maybeExchangeDevice(DeviceIndex device) const override {
// Only exchange if the requested device is valid; otherwise, no-op and return current
auto count = device_count();
if (device < 0 || device >= count) {
return getCurrentDevice();
}
return exchangeDevice(device);
}
at::Allocator* getPinnedMemoryAllocator() const override {
return at::getHostAllocator(at::kPrivateUse1);
}
@ -30,12 +71,23 @@ struct OpenRegHooksInterface : public at::PrivateUse1HooksInterface {
return attr.type == orMemoryTypeHost;
}
const at::Generator& getDefaultGenerator(
c10::DeviceIndex device_index) const override {
at::Device getDeviceFromPtr(void* data) const override {
orPointerAttributes attr{};
auto err = orPointerGetAttributes(&attr, data);
if (err == orSuccess && attr.type == orMemoryTypeDevice) {
return at::Device(at::DeviceType::PrivateUse1, static_cast<int>(attr.device));
} else {
TORCH_CHECK(false, "failed to get device from pointer");
}
return at::Device(at::DeviceType::PrivateUse1, current_device());
}
// LITERALINCLUDE START: OPENREG HOOK EXAMPLES
const at::Generator& getDefaultGenerator(DeviceIndex device_index) const override {
return getDefaultOpenRegGenerator(device_index);
}
// LITERALINCLUDE END: OPENREG HOOK EXAMPLES
at::Generator getNewGenerator(c10::DeviceIndex device_index) const override {
at::Generator getNewGenerator(DeviceIndex device_index) const override {
return at::make_generator<OpenRegGeneratorImpl>(device_index);
}
};
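
The difference between `exchangeDevice` and `maybeExchangeDevice` in the hooks above can be sketched in Python. `FakeRuntime` is a hypothetical stand-in for the OpenReg runtime, and the sketch assumes that exchanging a device means switching to the new device and returning the previous one.

```python
class FakeRuntime:
    """Hypothetical device runtime used only to illustrate the two hooks."""

    def __init__(self, device_count: int) -> None:
        self.device_count = device_count
        self.current = 0

    def exchange_device(self, device: int) -> int:
        """Switch to `device` and return the previously active device."""
        previous = self.current
        self.current = device
        return previous

    def maybe_exchange_device(self, device: int) -> int:
        """Switch only if `device` is valid; otherwise leave state alone."""
        if device < 0 or device >= self.device_count:
            return self.current  # no-op: report the current device
        return self.exchange_device(device)
```
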

View File

@ -140,6 +140,11 @@ static void initDeviceStreamState(DeviceIndex device_index) {
static void initOpenRegStreamsOnce() {
c10::call_once(init_flag, initGlobalStreamState);
for (const auto i : c10::irange(num_devices)) {
c10::call_once(
device_flags[i], initDeviceStreamState, static_cast<DeviceIndex>(i));
}
if (current_streams) {
return;
}
@ -202,8 +207,6 @@ OpenRegStream getStreamFromPool(const int priority, DeviceIndex device_index) {
if (device_index == -1) {
device_index = current_device();
}
c10::call_once(
device_flags[device_index], initDeviceStreamState, device_index);
auto pri_idx =
std::clamp(priority, 0, max_compile_time_stream_priorities - 1);
const auto idx = get_idx(priority_counters[device_index][pri_idx]);

View File

@ -17,6 +17,7 @@ static PyObject* _initExtension(PyObject* self, PyObject* noargs) {
END_HANDLE_TH_ERRORS
}
// LITERALINCLUDE START: OPENREG GET DEFAULT GENERATOR
static PyObject* _getDefaultGenerator(PyObject* self, PyObject* arg) {
HANDLE_TH_ERRORS
TORCH_CHECK(
@ -31,6 +32,7 @@ static PyObject* _getDefaultGenerator(PyObject* self, PyObject* arg) {
END_HANDLE_TH_ERRORS
}
// LITERALINCLUDE END: OPENREG GET DEFAULT GENERATOR
PyObject* _setDevice(PyObject* self, PyObject* arg) {
HANDLE_TH_ERRORS
@ -73,6 +75,7 @@ PyObject* _getDeviceCount(PyObject* self, PyObject* noargs) {
END_HANDLE_TH_ERRORS
}
// LITERALINCLUDE START: OPENREG MODULE METHODS
static PyMethodDef methods[] = {
{"_init", _initExtension, METH_NOARGS, nullptr},
{"_get_default_generator", _getDefaultGenerator, METH_O, nullptr},
@ -81,7 +84,7 @@ static PyMethodDef methods[] = {
{"_exchangeDevice", _exchangeDevice, METH_O, nullptr},
{"_get_device_count", _getDeviceCount, METH_NOARGS, nullptr},
{nullptr, nullptr, 0, nullptr}};
// LITERALINCLUDE END: OPENREG MODULE METHODS
/*
* When ASAN is enabled, PyTorch modifies the dlopen flag during import,
* causing all global and weak symbols in _C.so and its dependent libraries

View File

@ -45,6 +45,7 @@ def initial_seed() -> int:
return default_generator.initial_seed()
# LITERALINCLUDE START: OPENREG MANUAL SEED
def manual_seed(seed: int) -> None:
seed = int(seed)
@ -53,6 +54,9 @@ def manual_seed(seed: int) -> None:
default_generator.manual_seed(seed)
# LITERALINCLUDE END: OPENREG MANUAL SEED
def manual_seed_all(seed: int) -> None:
seed = int(seed)

View File

@ -1,67 +0,0 @@
import distutils.command.clean
import shutil
from pathlib import Path
from setuptools import find_packages, setup
from torch.utils.cpp_extension import BuildExtension, CppExtension
ROOT_DIR = Path(__file__).parent
CSRC_DIR = ROOT_DIR / "torch_stable_test" / "csrc"
class clean(distutils.command.clean.clean):
def run(self):
# Run default behavior first
distutils.command.clean.clean.run(self)
# Remove extension
for path in (ROOT_DIR / "torch_stable_test").glob("**/*.so"):
path.unlink()
# Remove build and dist and egg-info directories
dirs = [
ROOT_DIR / "build",
ROOT_DIR / "dist",
ROOT_DIR / "torch_stable_test.egg-info",
]
for path in dirs:
if path.exists():
shutil.rmtree(str(path), ignore_errors=True)
def get_extension():
extra_compile_args = {
"cxx": ["-fdiagnostics-color=always", "-DTORCH_STABLE_ONLY"],
}
sources = list(CSRC_DIR.glob("**/*.cpp"))
return [
CppExtension(
"torch_stable_test._C",
sources=sorted(str(s) for s in sources),
py_limited_api=True,
extra_compile_args=extra_compile_args,
extra_link_args=[],
)
]
setup(
name="torch_stable_test",
version="0.0",
author="PyTorch Core Team",
description="Test extension to verify TORCH_STABLE_ONLY flag",
packages=find_packages(exclude=("test",)),
package_data={"torch_stable_test": ["*.dll", "*.dylib", "*.so"]},
install_requires=[
"torch",
],
ext_modules=get_extension(),
cmdclass={
"build_ext": BuildExtension.with_options(no_python_abi_suffix=True),
"clean": clean,
},
options={"bdist_wheel": {"py_limited_api": "cp39"}},
)

View File

@ -1 +0,0 @@
#include <ATen/core/TensorBase.h> // This should trigger the TORCH_STABLE_ONLY error

View File

@ -1,22 +0,0 @@
# Owner(s): ["module: cpp"]
from pathlib import Path
from torch.testing._internal.common_utils import (
install_cpp_extension,
IS_WINDOWS,
run_tests,
TestCase,
)
if not IS_WINDOWS:
class TestTorchStable(TestCase):
def test_setup_fails(self):
with self.assertRaisesRegex(RuntimeError, "build failed for cpp extension"):
install_cpp_extension(extension_root=Path(__file__).parent.parent)
if __name__ == "__main__":
run_tests()

View File

@ -180,6 +180,47 @@ class TestTrackerFullyShard1DTrainingCore(FSDPTest):
del model
del optim
def _test_tracker_multihandler_hook(self):
"""Should run without KeyError."""
class TestModule(nn.Module):
def __init__(self, dim: int):
super().__init__()
self.norm1 = nn.RMSNorm(dim)
self.output1 = nn.Linear(dim, dim)
self.norm2 = nn.RMSNorm(dim)
self.output2 = nn.Linear(dim, dim)
def forward(self, x: torch.Tensor) -> torch.Tensor:
x = self.norm1(x)
x = self.output1(x)
x = self.norm2(x)
x = self.output2(x)
return x
gc.collect()
torch.manual_seed(42)
dev = torch.device(torch.accelerator.current_device_index())
with torch.device(dev):
model = TestModule(128)
mesh = init_device_mesh(dev.type, (self.world_size,))
fully_shard([model.norm1, model.output1], mesh=mesh)
fully_shard([model.norm2, model.output2], mesh=mesh)
fully_shard(model, mesh=mesh)
fmt = FSDPMemTracker(model)
with fmt:
inp = torch.randn(16, 128, device=dev)
y = model(inp)
loss = y.sum()
loss.backward()
del inp
del model
class TestTrackerFullyShard1DTrainingCompose(FSDPTest):
@property

View File

@ -225,9 +225,11 @@ class ApiTest(unittest.TestCase):
raise_child_failure_error_fn("trainer", trainer_error_file)
pf = cm.exception.get_first_failure()[1]
# compare worker error file with reply file and overridden error code
expect = json.load(open(pf.error_file))
with open(pf.error_file) as f:
expect = json.load(f)
expect["message"]["errorCode"] = pf.exitcode
actual = json.load(open(self.test_error_file))
with open(self.test_error_file) as f:
actual = json.load(f)
self.assertTrue(
json.dumps(expect, sort_keys=True),
json.dumps(actual, sort_keys=True),
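
Note on the hunk above: it switches the two `json.load(open(...))` calls to `with open(...)` blocks, so the file handles are closed deterministically. The surrounding context lines pass both serialized payloads to `assertTrue`, whose second positional argument is only the failure message, so that call cannot fail for a non-empty first string. The sketch below illustrates both points with the standard library only. The helper names `load_json` and `passes` and the sample payloads are hypothetical and are not part of the test file.

```python
import json
import os
import tempfile
import unittest


def load_json(path):
    # The context manager closes the file handle even if json.load raises.
    with open(path) as f:
        return json.load(f)


def passes(assert_name, expect, actual):
    """Return True if TestCase.<assert_name>(expect_str, actual_str) does not fail."""
    case = unittest.TestCase()
    try:
        getattr(case, assert_name)(
            json.dumps(expect, sort_keys=True),
            json.dumps(actual, sort_keys=True),
        )
    except AssertionError:
        return False
    return True


# Round-trip a payload through a temporary file (hypothetical data).
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "error.json")
    with open(path, "w") as f:
        json.dump({"message": {"errorCode": 1}}, f)
    expect = load_json(path)

actual = {"message": {"errorCode": 2}}

# assertTrue only checks the truthiness of its first argument.
print(passes("assertTrue", expect, actual))   # True, although the payloads differ
# assertEqual compares the two serialized strings.
print(passes("assertEqual", expect, actual))  # False
```

If the intent of the assertion is to compare the two payloads, `assertEqual` is the form that does so.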

View File

@ -1,9 +1,11 @@
# Owner(s): ["oncall: distributed"]
import contextlib
import unittest
import torch
import torch.distributed as dist
from torch._dynamo.testing import CompileCounterWithBackend
from torch._subclasses.fake_tensor import FakeTensorMode
from torch.distributed.tensor import (
DeviceMesh,
@ -23,8 +25,17 @@ from torch.testing._internal.common_utils import (
TestCase,
)
from torch.testing._internal.distributed.fake_pg import FakeStore
from torch.utils._debug_mode import _OpCall, _RedistributeCall, DebugMode
from torch.testing._internal.inductor_utils import GPU_TYPE, HAS_GPU
from torch.utils._debug_mode import (
_OpCall,
_RedistributeCall,
_TritonKernelCall,
DebugMode,
hash_tensor_fn,
norm_hash_fn,
)
from torch.utils._python_dispatch import TorchDispatchMode
from torch.utils._triton import has_triton_package
@requires_cuda
@ -106,6 +117,28 @@ class TestDTensorDebugMode(TestCase):
"aten::sum(t: f32[1, 32]) # {'hash': " in debug_mode.debug_string()
)
# check tuple hash functions
with (
DebugMode() as debug_mode,
DebugMode.log_tensor_hashes(hash_fn=["norm", "hash_tensor"]),
):
mm(x_dtensor, y_dtensor)
output_hash = debug_mode.operators[-1].log["hash"]
norm_ = lambda x: norm_hash_fn(x, use_scalar=True) # noqa: E731
hash_ = lambda x: hash_tensor_fn(x, use_scalar=True) # noqa: E731
self.assertEqual(output_hash[0], norm_(eager_out))
self.assertEqual(output_hash[1], hash_(eager_out))
# some edge cases
self.assertEqual(norm_(torch.tensor(torch.nan)), torch.nan)
self.assertEqual(norm_(torch.tensor(torch.inf)), torch.inf)
self.assertEqual(norm_(torch.complex(torch.ones(4), torch.zeros(4))), 4)
self.assertEqual(hash_(torch.ones(4, dtype=torch.float8_e5m2)), 0)
self.assertEqual(hash_(torch.ones(4, dtype=torch.int8)), 0)
self.assertEqual(hash_(torch.ones(5, dtype=torch.int8)), 1)
def test_debug_string_inside_context(self):
mesh = DeviceMesh(self.device_type, list(range(self.world_size)))
@ -376,14 +409,22 @@ class TestDTensorDebugMode(TestCase):
self.assertIn("torch.ops.higher_order.cond", debug_mode.debug_string())
def test_compile(self):
@torch.compile
cnt = CompileCounterWithBackend("inductor")
@torch.compile(backend=cnt)
def f(x):
return x.sin().cos()
x = torch.randn(8)
f(x)
with DebugMode() as debug_mode:
f(x)
self.assertEqual(len(debug_mode.debug_string()), 0)
self.assertEqual(len(debug_mode.debug_string()), 0)
f(x)
f(x)
self.assertEqual(
cnt.frame_count, 1
) # check DebugMode doesn't trigger additional recompilations
def test_nn_module(self):
class Foo(torch.nn.Module):
@ -433,6 +474,113 @@ class TestDTensorDebugMode(TestCase):
op for op in debug_mode.operators if str(op.op) == "aten.sum.dim_IntList"
][-1]
self.assertTrue("self.l2(self.l1(x))" in sum_op.fwd_stack_trace)
self.assertTrue(
"self.l2(self.l1(x))" in debug_mode.debug_string(show_stack_trace=True)
)
@unittest.skipIf(not HAS_GPU, "requires GPU")
@unittest.skipIf(not has_triton_package(), "requires triton")
def test_triton_kernel_logs(self):
import triton
from torch.testing._internal.triton_utils import add_kernel_autotuned
def call_triton(x, y):
output = torch.zeros_like(x)
n_elements = output.numel()
grid = lambda meta: (triton.cdiv(n_elements, meta["BLOCK_SIZE"]),) # noqa: E731
add_kernel_autotuned[grid](x, y, output, n_elements)
return output
x = torch.randn(128, device=GPU_TYPE)
y = torch.randn(128, device=GPU_TYPE)
with DebugMode() as debug_mode:
torch.compile(call_triton)(x, y)
triton_calls = [
op for op in debug_mode.operators if isinstance(op, _TritonKernelCall)
]
self.assertGreater(len(triton_calls), 0)
self.assertIn("[triton]", triton_calls[0].render([]))
def test_check_hash_mismatches(self):
x = torch.randn(64, 64, device=GPU_TYPE)
x_different = torch.randn(64, 64, device=GPU_TYPE)
# Identical runs should have no mismatches
with DebugMode() as dm1, DebugMode.log_tensor_hashes():
x.sin().sum()
with DebugMode() as dm2, DebugMode.log_tensor_hashes():
x.sin().sum()
mismatches = DebugMode.check_hash_mismatches(dm1.logs, dm2.logs)
self.assertEqual(len(mismatches), 0)
# Different inputs should produce hash mismatches
with DebugMode() as dm3, DebugMode.log_tensor_hashes():
x_different.sin().sum()
# Check that mismatches are detected
mismatches = DebugMode.check_hash_mismatches(dm1.logs, dm3.logs)
self.assertEqual(len(mismatches), 2)
self.assertEqual(
[call["call"] for call in mismatches], ["aten::sin", "aten::sum"]
)
@unittest.skipIf(not HAS_GPU, "requires GPU")
@unittest.skipIf(not has_triton_package(), "requires triton")
def test_check_triton_hash_mismatches(self):
import triton
from torch.testing._internal.triton_utils import add_kernel_autotuned
def call_triton(x, y):
output = torch.zeros_like(x)
n_elements = output.numel()
grid = lambda meta: (triton.cdiv(n_elements, meta["BLOCK_SIZE"]),) # noqa: E731
add_kernel_autotuned[grid](x, y, output, n_elements)
return output
a = torch.randn(128, device=GPU_TYPE)
b = torch.randn(128, device=GPU_TYPE)
c = torch.randn(128, device=GPU_TYPE)
# Run with hash logging to verify triton kernels can be hashed
with DebugMode() as dm_t1, DebugMode.log_tensor_hashes(hash_inputs=True):
torch.compile(call_triton)(a, b)
# Different inputs should have different hashes in triton kernels
with DebugMode() as dm_t2, DebugMode.log_tensor_hashes(hash_inputs=True):
torch.compile(call_triton)(a, c)
# Compare triton kernel hashes
mismatches = DebugMode.check_hash_mismatches(
dm_t1.logs, dm_t2.logs, compare_inputs=True
)
triton_mismatches = [m for m in mismatches if m["call_type"] == "triton kernel"]
self.assertGreater(len(triton_mismatches), 0)
# check both input & output hash mismatches are detected
self.assertGreater(len([m for m in triton_mismatches if m["is_input_hash"]]), 0)
self.assertGreater(
len([m for m in triton_mismatches if not m["is_input_hash"]]), 0
)
def test_check_structure_mismatches(self):
x = torch.randn(32, 32, device=self.device_type)
with DebugMode() as dm1, DebugMode.log_tensor_hashes():
x.sin()
with DebugMode() as dm2, DebugMode.log_tensor_hashes():
x.cos()
with DebugMode() as dm3, DebugMode.log_tensor_hashes():
x.sin().cos()
with self.assertRaisesRegex(ValueError, "Operators don't match"):
DebugMode.check_hash_mismatches(dm1.logs, dm2.logs)
with self.assertRaisesRegex(ValueError, "Log lengths don't match"):
DebugMode.check_hash_mismatches(dm1.logs, dm3.logs)
def test_pretty_print_dtensor_make_fx(self):
mesh = DeviceMesh(self.device_type, list(range(self.world_size)))
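
Note on the hash-mismatch tests above: they expect an empty result for identical runs, one entry per differing operator for different inputs, and a `ValueError` when the two logs differ in length or in operator sequence. The sketch below mimics that contract in plain Python. It is not the `DebugMode.check_hash_mismatches` implementation: the function name `compare_hash_logs` and the log-entry layout (dicts with `"call"` and `"hash"` keys) are assumptions made for illustration. Only the two error-message prefixes are taken from the tests.

```python
def compare_hash_logs(logs1, logs2):
    """Hypothetical stand-in: compare two per-operator hash logs entry by entry."""
    # Structural check 1: both runs must have recorded the same number of calls.
    if len(logs1) != len(logs2):
        raise ValueError(f"Log lengths don't match: {len(logs1)} vs {len(logs2)}")

    mismatches = []
    for index, (a, b) in enumerate(zip(logs1, logs2)):
        # Structural check 2: the same operator must appear at the same position.
        if a["call"] != b["call"]:
            raise ValueError(
                f"Operators don't match at index {index}: {a['call']} vs {b['call']}"
            )
        # Numerical check: same operator, different output hash.
        if a["hash"] != b["hash"]:
            mismatches.append(
                {"call": a["call"], "index": index, "hashes": (a["hash"], b["hash"])}
            )
    return mismatches


# Hypothetical logs for x.sin().sum() on two different inputs.
run_a = [{"call": "aten::sin", "hash": 10.5}, {"call": "aten::sum", "hash": 3.25}]
run_b = [{"call": "aten::sin", "hash": 11.0}, {"call": "aten::sum", "hash": 4.0}]

print(compare_hash_logs(run_a, run_a))                       # []
print([m["call"] for m in compare_hash_logs(run_a, run_b)])  # ['aten::sin', 'aten::sum']
```

Checking structure before hashes means a hash comparison is only reported when both runs executed the same operator at the same position.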

View File

@ -1,14 +1,12 @@
# Owner(s): ["oncall: distributed"]
import contextlib
import unittest
import torch
import torch.distributed as dist
import torch.fx.traceback as fx_traceback
from torch._dynamo.functional_export import (
_dynamo_graph_capture_for_export,
dynamo_graph_capture_for_export,
)
from torch._dynamo.functional_export import dynamo_graph_capture_for_export
from torch._functorch.aot_autograd import aot_export_joint_with_descriptors
from torch._functorch.partitioners import min_cut_rematerialization_partition
from torch._guards import tracing, TracingContext
@ -152,17 +150,6 @@ def graph_capture_and_aot_export_joint_with_descriptors_v2(model, args, kwargs=N
return aot_export_joint_with_descriptors_alone(gm, args, kwargs)
def graph_capture_and_aot_export_joint_with_descriptors(model, args, kwargs=None):
if kwargs is None:
kwargs = {}
with torch._dynamo.config.patch(install_free_tensors=True):
# TODO: switch to use the official graph_capture API once it is ready
gm = _dynamo_graph_capture_for_export(model)(*args, **kwargs)
fake_mode = gm.meta.get("fake_mode", None)
with tracing(TracingContext(fake_mode)):
return aot_export_joint_with_descriptors_alone(gm, args, kwargs)
def aot_export_joint_with_descriptors_alone(model, args, kwargs=None):
if kwargs is None:
kwargs = {}
@ -359,7 +346,6 @@ class DTensorExportTest(TestCase):
"export_fn",
[
graph_capture_and_aot_export_joint_with_descriptors_v2,
graph_capture_and_aot_export_joint_with_descriptors,
aot_export_joint_with_descriptors_alone,
],
)
@ -371,6 +357,7 @@ class DTensorExportTest(TestCase):
# aot_export_joint_with_descriptors on strict-exported exported_program.module()
# is producing a joint graph with backward region missing
@unittest.expectedFailure
def test_strict_export_parallelize_module_with_dtensor_input(self):
self._run_test(strict_export_and_aot_export_joint_with_descriptors)
@ -384,10 +371,6 @@ class DTensorExportTest(TestCase):
graph_capture_and_aot_export_joint_with_descriptors_v2,
"[[4, 10], [4], [10, 4], [10], [4, 10], [4], [10, 4], [10], [s64, 10], [s64, 10]]",
),
(
graph_capture_and_aot_export_joint_with_descriptors,
"[[4, 10], [4], [10, 4], [10], [s22, 10], [s22, 10]]",
),
],
)
def test_dynamic_shapes(self, export_fn_with_answer):
@ -432,7 +415,6 @@ class DTensorExportTest(TestCase):
"export_fn",
[
dynamo_graph_capture_for_export,
_dynamo_graph_capture_for_export,
],
)
def test_einsum_dtensor_export(self, export_fn):
@ -454,11 +436,7 @@ class DTensorExportTest(TestCase):
# Run model to verify it works
output = model(*inputs)
with torch._dynamo.config.patch(
install_free_tensors=(export_fn is _dynamo_graph_capture_for_export)
):
# TODO: switch to use the official graph_capture API once it is ready
gm = export_fn(model)(*inputs)
gm = export_fn(model)(*inputs)
output_gm = gm(*inputs)
self.assertEqual(output, output_gm)
@ -466,7 +444,6 @@ class DTensorExportTest(TestCase):
"export_fn",
[
graph_capture_and_aot_export_joint_with_descriptors_v2,
graph_capture_and_aot_export_joint_with_descriptors,
],
)
def test_flex_attention_dtensor_export(self, export_fn):
@ -529,7 +506,7 @@ class DTensorExportTest(TestCase):
return nest_fn(leaf) + 1
z = torch.randn(16, 16)
gm = graph_capture_and_aot_export_joint_with_descriptors(fn, (z,))
gm = graph_capture_and_aot_export_joint_with_descriptors_v2(fn, (z,))
self.assertEqual(fn(z), gm(z)[0])
@ -544,7 +521,7 @@ class DTensorExportTest(TestCase):
y = torch.randint(1, (10,)).bool()
x_dt = distribute_tensor(x, device_mesh, placements=[Replicate()])
y_dt = distribute_tensor(y, device_mesh, placements=[Replicate()])
_dynamo_graph_capture_for_export(Foo())(x_dt, y_dt)
dynamo_graph_capture_for_export(Foo())(x_dt, y_dt)
class Bar(torch.nn.Module):
def forward(self, x):
@ -554,25 +531,25 @@ class DTensorExportTest(TestCase):
x = torch.randint(1000, (4, 64, 16))
x_dt = distribute_tensor(x, device_mesh, placements=[Replicate()])
gm = _dynamo_graph_capture_for_export(Bar())(x_dt)
gm = dynamo_graph_capture_for_export(Bar())(x_dt)
self.assertExpectedInline(
str(gm.graph).strip(),
"""\
graph():
%l_flat_args_0_ : [num_users=2] = placeholder[target=arg_0]
%max_1 : [num_users=1] = call_method[target=max](args = (%l_flat_args_0_,), kwargs = {})
%l_x_ : torch.distributed.tensor.DTensor [num_users=2] = placeholder[target=L_x_]
%max_1 : [num_users=1] = call_method[target=max](args = (%l_x_,), kwargs = {})
%clamp : [num_users=1] = call_function[target=torch.clamp](args = (%max_1,), kwargs = {min: 1})
%item : [num_users=2] = call_method[target=item](args = (%clamp,), kwargs = {})
%ge_1 : [num_users=1] = call_function[target=operator.ge](args = (%item, 1), kwargs = {})
%_assert_scalar_default : [num_users=0] = call_function[target=torch.ops.aten._assert_scalar.default](args = (%ge_1, Runtime assertion failed for expression u0 >= 1 on node 'ge_1'), kwargs = {})
%res : [num_users=2] = call_function[target=operator.getitem](args = (%l_flat_args_0_, slice(None, item, None)), kwargs = {})
%getattr_1 : [num_users=1] = call_function[target=builtins.getattr](args = (%res, _local_tensor), kwargs = {})
%getitem : [num_users=2] = call_function[target=operator.getitem](args = (%l_x_, slice(None, item, None)), kwargs = {})
%getattr_1 : [num_users=1] = call_function[target=builtins.getattr](args = (%getitem, _local_tensor), kwargs = {})
%sym_size_int : [num_users=2] = call_function[target=torch.ops.aten.sym_size.int](args = (%getattr_1, 0), kwargs = {})
%ge_2 : [num_users=1] = call_function[target=operator.ge](args = (%sym_size_int, 0), kwargs = {})
%_assert_scalar_default_1 : [num_users=0] = call_function[target=torch.ops.aten._assert_scalar.default](args = (%ge_2, Runtime assertion failed for expression u2 >= 0 on node 'ge_2'), kwargs = {})
%le : [num_users=1] = call_function[target=operator.le](args = (%sym_size_int, 4), kwargs = {})
%_assert_scalar_default_2 : [num_users=0] = call_function[target=torch.ops.aten._assert_scalar.default](args = (%le, Runtime assertion failed for expression u2 <= 4 on node 'le'), kwargs = {})
return (res,)""", # noqa: B950
str(gm.graph).strip(),
return (getitem,)""", # noqa: B950
)

View File

@ -706,11 +706,11 @@ class DistTensorOpsTest(DTensorTestBase):
@with_comms
def test_dtensor_dtype_conversion(self):
from torch.distributed.tensor.debug import (
_clear_sharding_prop_cache,
_get_sharding_prop_cache_info,
_clear_fast_path_sharding_prop_cache,
_get_fast_path_sharding_prop_cache_stats,
)
_clear_sharding_prop_cache()
_clear_fast_path_sharding_prop_cache()
device_mesh = self.build_device_mesh()
shard_spec = [Shard(0)]
# by default we start from bf16 dtype
@ -730,13 +730,13 @@ class DistTensorOpsTest(DTensorTestBase):
self.assertEqual(bf16_sharded_dtensor1.to_local().dtype, torch.bfloat16)
# by this point we only have cache misses
hits, misses, _, _ = _get_sharding_prop_cache_info()
hits, misses = _get_fast_path_sharding_prop_cache_stats()
self.assertEqual(hits, 0)
self.assertEqual(misses, 2)
# convert to fp32 again and see if there's cache hit
bf16_sharded_dtensor1.float()
hits, misses, _, _ = _get_sharding_prop_cache_info()
hits, misses = _get_fast_path_sharding_prop_cache_stats()
# by now we should have cache hit
self.assertEqual(hits, 1)
self.assertEqual(misses, 2)

View File

@ -664,6 +664,101 @@ class TestViewOps(DTensorTestBase):
)
self.assertEqual(dist_x.placements, [Partial(), Shard(0)])
@with_comms
def test_storage_offset_slice(self):
"""
Test that storage_offset is properly tracked on DTensor when slicing
a replicated tensor.
"""
mesh = init_device_mesh(self.device_type, (self.world_size,))
# Create a replicated DTensor
tensor = torch.randn(10, device=self.device_type)
dtensor = distribute_tensor(tensor, mesh, [Replicate()])
# Perform a slice operation [1:]
with CommDebugMode() as comm_mode:
sliced_dtensor = dtensor[1:]
# Slicing should not trigger any communication
self.assertEqual(comm_mode.get_total_counts(), 0)
# Verify that the DTensor's storage_offset matches the expected value
self.assertEqual(sliced_dtensor.storage_offset(), 1)
# Verify that the local tensor also has the correct storage_offset
self.assertEqual(sliced_dtensor.to_local().storage_offset(), 1)
# Verify the shape is correct
self.assertEqual(sliced_dtensor.shape, torch.Size([9]))
# Verify the values are correct
expected = tensor[1:]
self.assertEqual(sliced_dtensor.full_tensor(), expected)
@with_comms
def test_storage_offset_shard_dim0_slice_dim1(self):
"""
Test that storage_offset is properly tracked when tensor is sharded on dim 0
and sliced on dim 1.
"""
mesh = init_device_mesh(self.device_type, (self.world_size,))
# Create a 2D tensor and shard on dim 0
tensor = torch.randn(12, 8, device=self.device_type)
dtensor = distribute_tensor(tensor, mesh, [Shard(0)])
# Perform a slice operation [:, 2:]
with CommDebugMode() as comm_mode:
sliced_dtensor = dtensor[:, 2:]
# Slicing should not trigger any communication
self.assertEqual(comm_mode.get_total_counts(), 0)
# The storage_offset should be 2 (skipping 2 elements in each row)
self.assertEqual(sliced_dtensor.storage_offset(), 2)
# Verify that the local tensor also has the correct storage_offset
self.assertEqual(sliced_dtensor.to_local().storage_offset(), 2)
# Verify the shape is correct
expected_shape = torch.Size([12, 6])
self.assertEqual(sliced_dtensor.shape, expected_shape)
# Verify the values are correct
expected = tensor[:, 2:]
self.assertEqual(sliced_dtensor.full_tensor(), expected)
@with_comms
def test_storage_offset_shard_dim1_slice_dim0(self):
"""
Test that storage_offset is properly tracked when tensor is sharded on dim 1
and sliced on dim 0.
"""
mesh = init_device_mesh(self.device_type, (self.world_size,))
# Create a 2D tensor and shard on dim 1
tensor = torch.randn(10, 12, device=self.device_type)
dtensor = distribute_tensor(tensor, mesh, [Shard(1)])
# Perform a slice operation [2:, :]
with CommDebugMode() as comm_mode:
sliced_dtensor = dtensor[2:, :]
# Slicing should not trigger any communication
self.assertEqual(comm_mode.get_total_counts(), 0)
local_dim1_size = 12 // self.world_size
expected_offset = 2 * local_dim1_size
self.assertEqual(sliced_dtensor.storage_offset(), expected_offset)
self.assertEqual(sliced_dtensor.to_local().storage_offset(), expected_offset)
# Verify the shape is correct
expected_shape = torch.Size([8, 12])
self.assertEqual(sliced_dtensor.shape, expected_shape)
# Verify the values are correct
expected = tensor[2:, :]
self.assertEqual(sliced_dtensor.full_tensor(), expected)
TestViewOpsWithLocalTensor = create_local_tensor_test_class(
TestViewOps,
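
Note on the storage-offset tests above: the expected offsets follow from row-major stride arithmetic on the local shard. The sketch below reproduces the three expected values in plain Python, assuming contiguous local tensors and a sharded dimension that divides evenly by the world size. The helpers `local_shape` and `sliced_storage_offset` are hypothetical names, and the world size of 4 is an example value rather than something fixed by the tests.

```python
def local_shape(global_shape, shard_dim, world_size):
    """Shape of one shard, assuming the sharded dim divides evenly (assumption)."""
    shape = list(global_shape)
    shape[shard_dim] //= world_size
    return tuple(shape)


def sliced_storage_offset(shape, starts):
    """Element offset of x[starts[0]:, starts[1]:, ...] for a contiguous row-major x."""
    # Row-major strides: the last dim has stride 1, each earlier dim
    # strides over the product of all later dim sizes.
    strides = []
    stride = 1
    for size in reversed(shape):
        strides.append(stride)
        stride *= size
    strides.reverse()
    return sum(start * st for start, st in zip(starts, strides))


world_size = 4  # example value

# Replicated 1-D tensor of 10 elements, slice [1:].
print(sliced_storage_offset((10,), (1,)))  # 1

# Shard(0) on a (12, 8) tensor, slice [:, 2:]: the row length is unchanged,
# so the offset is just the 2 skipped columns.
print(sliced_storage_offset(local_shape((12, 8), 0, world_size), (0, 2)))  # 2

# Shard(1) on a (10, 12) tensor, slice [2:, :]: each local row holds
# 12 // world_size elements, so skipping 2 rows skips 2 * (12 // world_size).
print(sliced_storage_offset(local_shape((10, 12), 1, world_size), (2, 0)))  # 6
```

The last case is why the test computes `expected_offset` from `local_dim1_size` instead of the global row length.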

View File

@ -1062,6 +1062,307 @@ class TestComputeCommReorderingBucketing(TestComputeCommReorderingMultiProc):
self.assertTrue(same(out, correct))
def get_toy_model(device_type: str):
"""
Helper to construct a small multi-layer ToyModel
"""
class ToyBlock(torch.nn.Module):
def __init__(self):
super().__init__()
self.wq = torch.nn.Linear(4, 4)
self.wk = torch.nn.Linear(4, 4)
self.proj = torch.nn.Linear(4, 4)
def forward(self, x):
attn = self.wq(x) + self.wk(x)
return self.proj(torch.nn.functional.relu(attn))
class ToyModel(torch.nn.Module):
def __init__(self):
super().__init__()
self.layers = torch.nn.ModuleList([ToyBlock() for _ in range(2)])
self.norm = torch.nn.LayerNorm(4)
def forward(self, x):
for blk in self.layers:
x = blk(x)
return self.norm(x)
model = ToyModel().to(device_type)
return model
def apply_manual_reordering_and_get_graph(graph, module_bucket_plans, out_li) -> None:
gm = graph.owning_module
from torch._inductor.fx_passes.overlap_manual_scheduling import (
ManualOverlapScheduler,
)
for node in list(gm.graph.nodes):
if (
node.name == "all_gather_into_tensor"
or node.name == "all_gather_into_tensor_1"
or node.name == "wait_tensor"
or node.name == "wait_tensor_1"
):
node.meta["nn_module_stack"] = {"test": ["module_1", ""]}
if (
node.name == "all_gather_into_tensor_2"
or node.name == "all_gather_into_tensor_3"
or node.name == "wait_tensor_2"
or node.name == "wait_tensor_3"
):
node.meta["nn_module_stack"] = {"test": ["module_2", ""]}
overlapped_gm = ManualOverlapScheduler(
gm, module_bucket_plans, insert_overlap_deps=False
).run()
overlapped_gm.graph.lint()
out_li.append(overlapped_gm.graph)
def run_and_get_manual_aten_graph(fn, module_bucket_plans, *inputs):
li = []
apply = functools.partial(
apply_manual_reordering_and_get_graph,
module_bucket_plans=module_bucket_plans,
out_li=li,
)
with torch._inductor.config.patch(post_grad_custom_post_pass=apply):
out = fn(*inputs)
return out, li[0]
class TestManualOverlapBucketing(TestComputeCommReorderingMultiProc):
"""
Tests for manual overlap scheduling and subgraph utilities.
"""
@unittest.skipIf(not HAS_GPU, "Inductor+gpu needs triton and recent GPU arch")
def test_make_graph_view_and_get_subgraph_by_path(self):
from torch._inductor.fx_passes.graph_view import (
get_subgraph_by_path,
make_graph_view,
)
model = get_toy_model(device_type)
gm = torch.fx.symbolic_trace(model)
graph_view = make_graph_view(gm.graph)
# Fetch subgraph for first transformer layer
sub_nodes = get_subgraph_by_path(graph_view, "layers.0.wq")
self.assertEqual([n.name for n in sub_nodes], ["layers_0_wq"])
# Fetch multiple paths at once
multi_nodes = get_subgraph_by_path(graph_view, ["layers.0.wq", "layers.0.proj"])
self.assertEqual(
[n.name for n in multi_nodes], ["layers_0_wq", "layers_0_proj"]
)
# Fetch nonexistent paths
non_exist_nodes = get_subgraph_by_path(graph_view, "nonexistent.module.path")
self.assertEqual(non_exist_nodes, [])
# Fetch a mix of existing and nonexistent paths
mixed_nodes = get_subgraph_by_path(
graph_view, ["layers.0.wq", "nonexistent.module.path"]
)
self.assertEqual([n.name for n in mixed_nodes], ["layers_0_wq"])
@unittest.skipIf(not HAS_GPU, "Inductor+gpu needs triton and recent GPU arch")
def test_manual_reordering_bucketing_pass_separate_buckets(
self,
):
def func(a, b, c, d, *, ranks):
# All 4 all-gathers are independent - COULD be bucketed together
ag1 = _functional_collectives.all_gather_tensor(a, 0, ranks)
ag2 = _functional_collectives.all_gather_tensor(b, 0, ranks)
ag3 = _functional_collectives.all_gather_tensor(c[:4], 0, ranks)
ag4 = _functional_collectives.all_gather_tensor(d[:4], 0, ranks)
# First compute - can hide ag1 and ag2
e = a * 5 # Use a to avoid fusion
mm1 = torch.matmul(e, e.T)
# Force ag1/ag2 to complete before mm2 (but ag3/ag4 can still be deferred)
# Use first 8x8 elements to match mm1's shape
intermediate = ag1[:8, :8] + ag2[:8, :8]
# Second compute - depends on ag1/ag2 through intermediate, can hide ag3/ag4
mm2 = torch.matmul(mm1 + intermediate, c[:8])
# Use all results
result = (
ag1.sum() * 1.1
+ ag2.sum() * 1.2
+ ag3.sum() * 1.3
+ ag4.sum() * 1.4
+ mm1.sum()
+ mm2.sum()
)
return result
with _dynamo_dist_per_rank_init(
self.rank,
self.world_size,
self.backend(device_type),
fake_pg=not at_least_x_gpu(2),
):
a = torch.ones(8, 8, dtype=torch.float, device=device_type)
b = torch.ones(8, 8, dtype=torch.float, device=device_type) * 2
c = torch.ones(8, 8, dtype=torch.float, device=device_type) * 3
d = torch.ones(8, 8, dtype=torch.float, device=device_type) * 4
ranks = list(range(self.world_size))
func_c = functools.partial(func, ranks=ranks)
compiled = torch.compile(func_c)
out, aten_graph = run_and_get_manual_aten_graph(
compiled, ["module_1", "module_2"], a, b, c, d
)
(
FileCheck()
.check("_pre_bucket_all_gather")
.check("all_gather_into_tensor_out")
.check("_pre_bucket_all_gather_1")
.check("all_gather_into_tensor_out_1")
.check("wait_tensor_4")
.check("wait_tensor_5")
.run(str(aten_graph))
)
correct = func(a, b, c, d, ranks=ranks)
self.assertTrue(same(out, correct))
@unittest.skipIf(not HAS_GPU, "Inductor+gpu needs triton and recent GPU arch")
def test_bucketing_reordering_pass_no_bucket(
self,
):
def func(a, b, c, d, *, ranks):
# All 4 all-gathers are independent - COULD be bucketed together
ag1 = _functional_collectives.all_gather_tensor(a, 0, ranks)
ag2 = _functional_collectives.all_gather_tensor(b, 0, ranks)
ag3 = _functional_collectives.all_gather_tensor(c[:4], 0, ranks)
ag4 = _functional_collectives.all_gather_tensor(d[:4], 0, ranks)
# First compute - can hide ag1 and ag2
e = a * 5 # Use a to avoid fusion
mm1 = torch.matmul(e, e.T)
# Force ag1/ag2 to complete before mm2 (but ag3/ag4 can still be deferred)
# Use first 8x8 elements to match mm1's shape
intermediate = ag1[:8, :8] + ag2[:8, :8]
# Second compute - depends on ag1/ag2 through intermediate, can hide ag3/ag4
mm2 = torch.matmul(mm1 + intermediate, c[:8])
# Use all results
result = (
ag1.sum() * 1.1
+ ag2.sum() * 1.2
+ ag3.sum() * 1.3
+ ag4.sum() * 1.4
+ mm1.sum()
+ mm2.sum()
)
return result
with _dynamo_dist_per_rank_init(
self.rank,
self.world_size,
self.backend(device_type),
fake_pg=not at_least_x_gpu(2),
):
a = torch.ones(8, 8, dtype=torch.float, device=device_type)
b = torch.ones(8, 8, dtype=torch.float, device=device_type) * 2
c = torch.ones(8, 8, dtype=torch.float, device=device_type) * 3
d = torch.ones(8, 8, dtype=torch.float, device=device_type) * 4
ranks = list(range(self.world_size))
func_c = functools.partial(func, ranks=ranks)
compiled = torch.compile(func_c)
out, aten_graph = run_and_get_manual_aten_graph(compiled, [], a, b, c, d)
(
FileCheck()
.check("all_gather_into_tensor")
.check("all_gather_into_tensor_1")
.check("all_gather_into_tensor_2")
.check("all_gather_into_tensor_3")
.check("wait_tensor")
.check("wait_tensor_1")
.check("wait_tensor_2")
.check("wait_tensor_3")
.run(str(aten_graph))
)
correct = func(a, b, c, d, ranks=ranks)
self.assertTrue(same(out, correct))
@unittest.skipIf(not HAS_GPU, "Inductor+gpu needs triton and recent GPU arch")
def test_bucketing_reordering_pass_single_bucket(
self,
):
def func(a, b, c, d, *, ranks):
# All 4 all-gathers are independent - COULD be bucketed together
ag1 = _functional_collectives.all_gather_tensor(a, 0, ranks)
ag2 = _functional_collectives.all_gather_tensor(b, 0, ranks)
ag3 = _functional_collectives.all_gather_tensor(c[:4], 0, ranks)
ag4 = _functional_collectives.all_gather_tensor(d[:4], 0, ranks)
# First compute - can hide ag1 and ag2
e = a * 5 # Use a to avoid fusion
mm1 = torch.matmul(e, e.T)
# Force ag1/ag2 to complete before mm2 (but ag3/ag4 can still be deferred)
# Use first 8x8 elements to match mm1's shape
intermediate = ag1[:8, :8] + ag2[:8, :8]
# Second compute - depends on ag1/ag2 through intermediate, can hide ag3/ag4
mm2 = torch.matmul(mm1 + intermediate, c[:8])
# Use all results
result = (
ag1.sum() * 1.1
+ ag2.sum() * 1.2
+ ag3.sum() * 1.3
+ ag4.sum() * 1.4
+ mm1.sum()
+ mm2.sum()
)
return result
with _dynamo_dist_per_rank_init(
self.rank,
self.world_size,
self.backend(device_type),
fake_pg=not at_least_x_gpu(2),
):
a = torch.ones(8, 8, dtype=torch.float, device=device_type)
b = torch.ones(8, 8, dtype=torch.float, device=device_type) * 2
c = torch.ones(8, 8, dtype=torch.float, device=device_type) * 3
d = torch.ones(8, 8, dtype=torch.float, device=device_type) * 4
ranks = list(range(self.world_size))
func_c = functools.partial(func, ranks=ranks)
compiled = torch.compile(func_c)
out, aten_graph = run_and_get_manual_aten_graph(
compiled, [["module_1", "module_2"]], a, b, c, d
)
(
FileCheck()
.check("_pre_bucket_all_gather")
.check("all_gather_into_tensor_out")
.check("wait_tensor_4")
.run(str(aten_graph))
)
correct = func(a, b, c, d, ranks=ranks)
self.assertTrue(same(out, correct))
if __name__ == "__main__":
from torch._dynamo.test_case import run_tests

View File

@ -54,12 +54,10 @@ from torch.testing._internal.common_distributed import (
verify_ddp_error_logged,
)
from torch.testing._internal.common_utils import (
MI300_ARCH,
retry_on_connect_failures,
run_tests,
skip_but_pass_in_sandcastle,
skipIfRocm,
skipIfRocmArch,
TestCase,
)
@ -1233,7 +1231,7 @@ class ProcessGroupGlooTest(MultiProcessTestCase):
self._test_gather_stress(inputs, lambda t: t.clone())
@skip_if_lt_x_gpu(2)
@skipIfRocmArch(MI300_ARCH)
@skipIfRocm
@requires_gloo()
def test_gather_stress_cuda(self):
inputs = [torch.tensor([i + self.rank]).cuda() for i in range(1000)]

View File

@ -15,7 +15,7 @@ import torch._functorch.config
import torch.distributed as dist
import torch.nn as nn
import torch.utils.checkpoint
from functorch.compile import default_partition, min_cut_rematerialization_partition
from functorch.compile import min_cut_rematerialization_partition
from torch._dynamo.backends.common import aot_autograd
from torch._dynamo.testing import (
AotEagerAndRecordGraphs,
@ -24,7 +24,7 @@ from torch._dynamo.testing import (
)
from torch._higher_order_ops.wrap import tag_activation_checkpoint
from torch.testing._internal.common_device_type import instantiate_device_type_tests
from torch.testing._internal.common_utils import IS_WINDOWS, parametrize, skipIfHpu
from torch.testing._internal.common_utils import IS_WINDOWS, skipIfHpu
from torch.testing._internal.inductor_utils import HAS_CUDA_AND_TRITON
from torch.testing._internal.triton_utils import requires_cuda_and_triton
from torch.testing._internal.two_tensor import TwoTensor
@ -281,14 +281,7 @@ class ActivationCheckpointingViaTagsTests(
run(export_compiler)
@parametrize(
"partition_fn",
[
min_cut_rematerialization_partition,
default_partition,
],
)
def test_tags_function(self, device, partition_fn):
def test_tags_function(self, device):
def gn(x, y):
return torch.sigmoid(torch.matmul(x, y))
@ -304,22 +297,11 @@ class ActivationCheckpointingViaTagsTests(
bw_compiler = functools.partial(
count_ops, freq=3, op=torch.ops.aten.mm.default
) # mm recomputed in the bwd
backend = aot_autograd(
fw_compiler=fw_compiler,
bw_compiler=bw_compiler,
partition_fn=partition_fn,
)
backend = aot_autograd(fw_compiler=fw_compiler, bw_compiler=bw_compiler)
self._validate(fn, backend, x, y)
@requires_cuda_and_triton
@parametrize(
"partition_fn",
[
min_cut_rematerialization_partition,
default_partition,
],
)
def test_tags_function_via_global_checkpoint(self, device, partition_fn):
def test_tags_function_via_global_checkpoint(self, device):
def gn(x, y):
return torch.sigmoid(torch.matmul(x, y))
@ -334,28 +316,17 @@ class ActivationCheckpointingViaTagsTests(
bw_compiler = functools.partial(
count_ops, freq=3, op=torch.ops.aten.mm.default
) # mm recomputed in the bwd
backend = aot_autograd(
fw_compiler=fw_compiler,
bw_compiler=bw_compiler,
partition_fn=partition_fn,
)
backend = aot_autograd(fw_compiler=fw_compiler, bw_compiler=bw_compiler)
self._validate(fn, backend, x, y)
@requires_cuda_and_triton
@parametrize(
"partition_fn",
[
min_cut_rematerialization_partition,
default_partition,
],
)
def test_tags_function_with_kwargs(self, device, partition_fn):
def test_tags_function_with_kwargs(self, device):
def gn(x, y):
return torch.sigmoid(torch.matmul(x, y))
def fn(x, y):
return torch.utils.checkpoint.checkpoint(
gn, torch.sin(x), y, use_reentrant=False
gn, torch.sin(x), y, use_reentrant=True, preserve_rng_state=False
)
x = torch.randn(4, 4, device=device, requires_grad=True)
@ -365,22 +336,11 @@ class ActivationCheckpointingViaTagsTests(
bw_compiler = functools.partial(
count_ops, freq=3, op=torch.ops.aten.mm.default
) # mm recomputed in the bwd
backend = aot_autograd(
fw_compiler=fw_compiler,
bw_compiler=bw_compiler,
partition_fn=partition_fn,
)
backend = aot_autograd(fw_compiler=fw_compiler, bw_compiler=bw_compiler)
self._validate(fn, backend, x, y)
@requires_cuda_and_triton
@parametrize(
"partition_fn",
[
min_cut_rematerialization_partition,
default_partition,
],
)
def test_tags_sequential_layers(self, device, partition_fn):
def test_tags_sequential_layers(self, device):
def gn(x):
x = x.cos()
for _ in range(3):
@ -401,22 +361,11 @@ class ActivationCheckpointingViaTagsTests(
freqs=[2, 18],
ops=[torch.ops.aten.cos.default, torch.ops.aten.mm.default],
) # mm recomputed in the bwd
backend = aot_autograd(
fw_compiler=fw_compiler,
bw_compiler=bw_compiler,
partition_fn=partition_fn,
)
backend = aot_autograd(fw_compiler=fw_compiler, bw_compiler=bw_compiler)
self._validate(fn, backend, x)
@requires_cuda_and_triton
@parametrize(
"partition_fn",
[
min_cut_rematerialization_partition,
default_partition,
],
)
def test_tags_multiple_checkpoints(self, device, partition_fn):
def test_tags_multiple_checkpoints(self, device):
def gn(x, y):
return torch.sigmoid(torch.matmul(x, y))
@ -434,22 +383,11 @@ class ActivationCheckpointingViaTagsTests(
bw_compiler = functools.partial(
count_ops, freq=6, op=torch.ops.aten.mm.default
) # mm recomputed in the bwd
backend = aot_autograd(
fw_compiler=fw_compiler,
bw_compiler=bw_compiler,
partition_fn=partition_fn,
)
backend = aot_autograd(fw_compiler=fw_compiler, bw_compiler=bw_compiler)
self._validate(fn, backend, x, y)
@requires_cuda_and_triton
@parametrize(
"partition_fn",
[
min_cut_rematerialization_partition,
default_partition,
],
)
def test_tags_module(self, device, partition_fn):
def test_tags_module(self, device):
class MockModule(torch.nn.Module):
def __init__(self) -> None:
super().__init__()
@ -473,22 +411,11 @@ class ActivationCheckpointingViaTagsTests(
bw_compiler = functools.partial(
count_ops, freq=1, op=torch.ops.aten.sigmoid.default
)
backend = aot_autograd(
fw_compiler=fw_compiler,
bw_compiler=bw_compiler,
partition_fn=partition_fn,
)
backend = aot_autograd(fw_compiler=fw_compiler, bw_compiler=bw_compiler)
self._validate(fn, backend, x)
@requires_cuda_and_triton
@parametrize(
"partition_fn",
[
min_cut_rematerialization_partition,
default_partition,
],
)
def test_tags_decomps(self, device, partition_fn):
def test_tags_decomps(self, device):
# Ensures that tags are passed on through decompositions as well
class MockModule(torch.nn.Module):
def __init__(self) -> None:
@ -516,7 +443,6 @@ class ActivationCheckpointingViaTagsTests(
backend = aot_autograd(
fw_compiler=fw_compiler,
bw_compiler=bw_compiler,
partition_fn=partition_fn,
decompositions=lambda: import_module(
"torch._inductor.compile_fx"
).select_decomp_table(),
@ -776,14 +702,7 @@ Non-primal fwd outputs from model w/o backward hook: {mod_no_hook_fwd_outputs_no
@requires_cuda_and_triton
@unittest.skipIf(IS_WINDOWS, "torch.compile doesn't work with windows")
@parametrize(
"partition_fn",
[
min_cut_rematerialization_partition,
default_partition,
],
)
def test_compile_selective_checkpoint_must_recompute(self, device, partition_fn):
def test_compile_selective_checkpoint_must_recompute(self, device):
def context_fn_must_recompute_mm():
must_recompute_list = [
torch.ops.aten.mm.default,
@ -804,9 +723,9 @@ Non-primal fwd outputs from model w/o backward hook: {mod_no_hook_fwd_outputs_no
),
)
def _test(context_fn, bw_compiler, partition_fn):
def _test(context_fn, bw_compiler):
def gn(x):
return torch.cos(torch.sin(torch.matmul(x, x) @ x))
return torch.sigmoid(torch.matmul(x, x))
def fn(x):
return torch.utils.checkpoint.checkpoint(
@ -820,14 +739,14 @@ Non-primal fwd outputs from model w/o backward hook: {mod_no_hook_fwd_outputs_no
fw_compiler = functools.partial(
count_ops,
freq=2,
freq=1,
op=torch.ops.aten.mm.default,
)
backend = aot_autograd(
fw_compiler=fw_compiler,
bw_compiler=bw_compiler,
partition_fn=partition_fn,
partition_fn=min_cut_rematerialization_partition,
)
self._validate(fn, backend, x)
@ -835,19 +754,17 @@ Non-primal fwd outputs from model w/o backward hook: {mod_no_hook_fwd_outputs_no
context_fn=context_fn_must_recompute_mm,
bw_compiler=functools.partial(
count_ops,
freq=6, # 1 matmul recompute and 2 bwd mm ops per fwd matmul, so 2 + 2 * 2 = 6)
freq=3, # 1 matmul recompute and 2 bwd mm ops per fwd matmul, so 1 + 2 * 1 = 3)
op=torch.ops.aten.mm.default,
),
partition_fn=partition_fn,
)
_test(
context_fn=context_fn_no_recompute_mm,
bw_compiler=functools.partial(
count_ops,
freq=4, # 2 bwd mm ops per fwd matmul
freq=2, # 2 bwd mm ops per fwd matmul
op=torch.ops.aten.mm.default,
),
partition_fn=partition_fn,
)
def test_sac_with_partial_context_fn(self):
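
The `freq` values in the hunk above follow from simple counting, as their inline comments state: each forward matmul contributes two `aten.mm` calls to the backward graph, plus one more when it is recomputed there. A minimal sketch of that arithmetic, using a hypothetical helper `expected_bwd_mm_count` that is not part of the test suite:

```python
def expected_bwd_mm_count(num_fwd_matmuls: int, recompute: bool) -> int:
    """Count the aten.mm calls expected in the backward graph.

    Assumption (taken from the comments in the diff): every forward matmul
    yields 2 mm ops in the backward pass, plus 1 extra mm when the forward
    matmul is recomputed during backward.
    """
    grad_mms = 2 * num_fwd_matmuls
    recomputed_mms = num_fwd_matmuls if recompute else 0
    return recomputed_mms + grad_mms


# Variant of gn with two matmuls: matmul(x, x) @ x
two_mm_recompute = expected_bwd_mm_count(2, recompute=True)      # 2 + 2 * 2
two_mm_no_recompute = expected_bwd_mm_count(2, recompute=False)  # 2 * 2

# Variant of gn with one matmul: matmul(x, x)
one_mm_recompute = expected_bwd_mm_count(1, recompute=True)      # 1 + 2 * 1
one_mm_no_recompute = expected_bwd_mm_count(1, recompute=False)  # 2 * 1

print(two_mm_recompute, two_mm_no_recompute)
print(one_mm_recompute, one_mm_no_recompute)
```

The four results correspond to the `freq=6`, `freq=4`, `freq=3`, and `freq=2` values passed to `count_ops` in the two versions of the test.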
@ -884,16 +801,7 @@ Non-primal fwd outputs from model w/o backward hook: {mod_no_hook_fwd_outputs_no
@requires_cuda_and_triton
@unittest.skipIf(IS_WINDOWS, "torch.compile doesn't work with windows")
@parametrize(
"partition_fn",
[
min_cut_rematerialization_partition,
default_partition,
],
)
def test_compile_selective_checkpoint_must_not_recompute_gemm(
self, device, partition_fn
):
def test_compile_selective_checkpoint_must_not_recompute_gemm(self, device):
def selective_checkpointing_context_fn():
no_recompute_list = [
torch.ops.aten.mm.default,
@ -933,22 +841,15 @@ Non-primal fwd outputs from model w/o backward hook: {mod_no_hook_fwd_outputs_no
backend = aot_autograd(
fw_compiler=fw_compiler,
bw_compiler=bw_compiler,
partition_fn=partition_fn,
partition_fn=min_cut_rematerialization_partition,
)
self._validate(fn, backend, x, y)
self._compare_orig_and_checkpointed_fns(gn, fn, x, y)
@requires_cuda_and_triton
@unittest.skipIf(IS_WINDOWS, "torch.compile doesn't work with windows")
@parametrize(
"partition_fn",
[
min_cut_rematerialization_partition,
default_partition,
],
)
def test_compile_selective_checkpoint_must_not_recompute_gemm_no_functionalization(
self, device, partition_fn
self, device
):
def selective_checkpointing_context_fn():
no_recompute_list = [
@ -988,7 +889,7 @@ Non-primal fwd outputs from model w/o backward hook: {mod_no_hook_fwd_outputs_no
backend = aot_autograd(
fw_compiler=fw_compiler,
bw_compiler=bw_compiler,
partition_fn=partition_fn,
partition_fn=min_cut_rematerialization_partition,
disable_functionalization=True,
)
self._validate(fn, backend, x, y)
@ -996,14 +897,7 @@ Non-primal fwd outputs from model w/o backward hook: {mod_no_hook_fwd_outputs_no
@requires_cuda_and_triton
@unittest.skipIf(IS_WINDOWS, "torch.compile doesn't work with windows")
@parametrize(
"partition_fn",
[
min_cut_rematerialization_partition,
default_partition,
],
)
def test_compile_selective_checkpoint_triton_kernel(self, device, partition_fn):
def test_compile_selective_checkpoint_triton_kernel(self, device):
# Copy of the above test, but make sure that having a triton kernel in the
# region does not error.
def add_one(x):
@ -1063,21 +957,14 @@ Non-primal fwd outputs from model w/o backward hook: {mod_no_hook_fwd_outputs_no
backend = aot_autograd(
fw_compiler=fw_compiler,
bw_compiler=bw_compiler,
partition_fn=partition_fn,
partition_fn=min_cut_rematerialization_partition,
)
self._validate(fn, backend, x, y)
self._compare_orig_and_checkpointed_fns(gn, fn, x, y)
@requires_cuda_and_triton
@unittest.skipIf(IS_WINDOWS, "torch.compile doesn't work with windows")
@parametrize(
"partition_fn",
[
min_cut_rematerialization_partition,
default_partition,
],
)
def test_compile_selective_checkpoint_tensor_subclass(self, device, partition_fn):
def test_compile_selective_checkpoint_tensor_subclass(self, device):
def selective_checkpointing_context_fn():
no_recompute_list = [
torch.ops.aten.mm.default,
@ -1120,21 +1007,14 @@ Non-primal fwd outputs from model w/o backward hook: {mod_no_hook_fwd_outputs_no
backend = aot_autograd(
fw_compiler=fw_compiler,
bw_compiler=bw_compiler,
partition_fn=partition_fn,
partition_fn=min_cut_rematerialization_partition,
)
self._validate(fn, backend, x, y)
self._compare_orig_and_checkpointed_fns(gn, fn, x, y)
@requires_cuda_and_triton
@unittest.skipIf(IS_WINDOWS, "torch.compile doesn't work with windows")
@parametrize(
"partition_fn",
[
min_cut_rematerialization_partition,
default_partition,
],
)
def test_compile_selective_checkpoint_custom_rule(self, device, partition_fn):
def test_compile_selective_checkpoint_custom_rule(self, device):
def _get_custom_policy(meta):
no_recompute_list = [
torch.ops.aten.mm.default,
@ -1192,21 +1072,14 @@ Non-primal fwd outputs from model w/o backward hook: {mod_no_hook_fwd_outputs_no
backend = aot_autograd(
fw_compiler=fw_compiler,
bw_compiler=bw_compiler,
partition_fn=partition_fn,
partition_fn=min_cut_rematerialization_partition,
)
self._validate(fn, backend, x, y)
self._compare_orig_and_checkpointed_fns(gn, fn, x, y)
@requires_cuda_and_triton
@unittest.skipIf(IS_WINDOWS, "torch.compile doesn't work with windows")
@parametrize(
"partition_fn",
[
min_cut_rematerialization_partition,
default_partition,
],
)
def test_compile_selective_checkpoint_partial_ctx_fn(self, device, partition_fn):
def test_compile_selective_checkpoint_partial_ctx_fn(self, device):
def selective_checkpointing_context_fn(no_recompute_list):
return create_selective_checkpoint_contexts(
_get_custom_policy(no_recompute_list=no_recompute_list)
@ -1245,21 +1118,14 @@ Non-primal fwd outputs from model w/o backward hook: {mod_no_hook_fwd_outputs_no
backend = aot_autograd(
fw_compiler=fw_compiler,
bw_compiler=bw_compiler,
partition_fn=partition_fn,
partition_fn=min_cut_rematerialization_partition,
)
self._validate(fn, backend, x, y)
self._compare_orig_and_checkpointed_fns(gn, fn, x, y)
@requires_cuda_and_triton
@unittest.skipIf(IS_WINDOWS, "torch.compile doesn't work with windows")
@parametrize(
"partition_fn",
[
min_cut_rematerialization_partition,
default_partition,
],
)
def test_compile_selective_checkpoint_outplace_op(self, device, partition_fn):
def test_compile_selective_checkpoint_outplace_op(self, device):
def selective_checkpointing_context_fn():
no_recompute_list = [
torch.ops.aten.mm.default,
@ -1297,21 +1163,14 @@ Non-primal fwd outputs from model w/o backward hook: {mod_no_hook_fwd_outputs_no
backend = aot_autograd(
fw_compiler=fw_compiler,
bw_compiler=bw_compiler,
partition_fn=partition_fn,
partition_fn=min_cut_rematerialization_partition,
)
self._validate(fn, backend, x, y)
self._compare_orig_and_checkpointed_fns(gn, fn, x, y)
@requires_cuda_and_triton
@unittest.skipIf(IS_WINDOWS, "torch.compile doesn't work with windows")
@parametrize(
"partition_fn",
[
min_cut_rematerialization_partition,
default_partition,
],
)
def test_compile_selective_checkpoint_list_ops(self, device, partition_fn):
def test_compile_selective_checkpoint_list_ops(self, device):
def selective_checkpointing_context_fn():
# recompute everything
no_recompute_list = []
@ -1347,7 +1206,7 @@ Non-primal fwd outputs from model w/o backward hook: {mod_no_hook_fwd_outputs_no
backend = aot_autograd(
fw_compiler=fw_compiler,
bw_compiler=bw_compiler,
partition_fn=partition_fn,
partition_fn=min_cut_rematerialization_partition,
)
self._validate(fn, backend, x, y)
self._compare_orig_and_checkpointed_fns(gn, fn, x, y)
@ -1358,14 +1217,7 @@ Non-primal fwd outputs from model w/o backward hook: {mod_no_hook_fwd_outputs_no
"requires TorchDispatchMode + torch.compile work to complete"
)
@requires_cuda_and_triton
@parametrize(
"partition_fn",
[
min_cut_rematerialization_partition,
default_partition,
],
)
def test_compile_selective_checkpoint_inplace_op(self, device, partition_fn):
def test_compile_selective_checkpoint_inplace_op(self, device):
def selective_checkpointing_context_fn():
no_recompute_list = [
torch.ops.aten.mm.default,
@ -1405,7 +1257,7 @@ Non-primal fwd outputs from model w/o backward hook: {mod_no_hook_fwd_outputs_no
backend = aot_autograd(
fw_compiler=fw_compiler,
bw_compiler=bw_compiler,
partition_fn=partition_fn,
partition_fn=min_cut_rematerialization_partition,
)
self._validate(fn, backend, x, y)
self._compare_orig_and_checkpointed_fns(gn, fn, x, y)
@ -1413,14 +1265,7 @@ Non-primal fwd outputs from model w/o backward hook: {mod_no_hook_fwd_outputs_no
@requires_cuda_and_triton
@unittest.skipIf(IS_WINDOWS, "torch.compile doesn't work with windows")
@torch._inductor.config.patch(fallback_random=True)
@parametrize(
"partition_fn",
[
min_cut_rematerialization_partition,
default_partition,
],
)
def test_compile_selective_checkpoint_random_op(self, device, partition_fn):
def test_compile_selective_checkpoint_random_op(self, device):
for preserve_rng_state in [True, False]:
def selective_checkpointing_context_fn():
@ -1467,7 +1312,7 @@ Non-primal fwd outputs from model w/o backward hook: {mod_no_hook_fwd_outputs_no
backend = aot_autograd(
fw_compiler=fw_compiler,
bw_compiler=bw_compiler,
partition_fn=partition_fn,
partition_fn=min_cut_rematerialization_partition,
)
# NOTE: when `preserve_rng_state` is False, gradient will mismatch between torch.compile and eager,
@ -1479,14 +1324,7 @@ Non-primal fwd outputs from model w/o backward hook: {mod_no_hook_fwd_outputs_no
@requires_cuda_and_triton
@unittest.skipIf(IS_WINDOWS, "torch.compile doesn't work with windows")
@parametrize(
"partition_fn",
[
min_cut_rematerialization_partition,
default_partition,
],
)
def test_compile_selective_checkpoint_invalid_context(self, partition_fn):
def test_compile_selective_checkpoint_invalid_context(self):
def gn(x, y):
return torch.sigmoid(torch.matmul(x, y)) * y
@ -1515,7 +1353,7 @@ Non-primal fwd outputs from model w/o backward hook: {mod_no_hook_fwd_outputs_no
backend = aot_autograd(
fw_compiler=fw_compiler,
bw_compiler=bw_compiler,
partition_fn=partition_fn,
partition_fn=min_cut_rematerialization_partition,
)
with self.assertRaisesRegex(
Exception, "must generate a tuple of two `TorchDispatchMode`s"
@ -1524,14 +1362,7 @@ Non-primal fwd outputs from model w/o backward hook: {mod_no_hook_fwd_outputs_no
@requires_cuda_and_triton
@torch._dynamo.config.patch(inline_inbuilt_nn_modules=True)
@parametrize(
"partition_fn",
[
min_cut_rematerialization_partition,
default_partition,
],
)
def test_compile_selective_checkpoint_parametrization(self, partition_fn):
def test_compile_selective_checkpoint_parametrization(self):
def sac_policy():
def _recomp_policy():
def _custom_policy(ctx, func, *args, **kwargs):
@ -1594,9 +1425,7 @@ Non-primal fwd outputs from model w/o backward hook: {mod_no_hook_fwd_outputs_no
bw_compiler = functools.partial(
count_ops,
freqs=[
# 1 from mul recompute, 1 from mul backward
# w/o CSE, we have one extra mul
3 if partition_fn is default_partition else 2,
2, # 1 from mul recompute, 1 from mul backward
1,
],
ops=[torch.ops.aten.mul.Tensor, torch.ops.aten.sigmoid.default],
@ -1605,7 +1434,7 @@ Non-primal fwd outputs from model w/o backward hook: {mod_no_hook_fwd_outputs_no
backend = aot_autograd(
fw_compiler=fw_compiler,
bw_compiler=bw_compiler,
partition_fn=partition_fn,
partition_fn=min_cut_rematerialization_partition,
)
model = MLPModule()
@ -1852,14 +1681,13 @@ class GraphModule(torch.nn.Module):
wrap_body_0 = self.wrap_body_0
tag_activation_checkpoint = torch.ops.higher_order.tag_activation_checkpoint(wrap_body_0, l_x_, use_reentrant = True); wrap_body_0 = l_x_ = None
getitem: "f32[4, 4]" = tag_activation_checkpoint[0]
getitem_1: "f32[4, 4]" = tag_activation_checkpoint[1]; tag_activation_checkpoint = None
return (getitem, getitem_1)
getitem: "f32[4, 4]" = tag_activation_checkpoint[0]; tag_activation_checkpoint = None
return (getitem,)
class wrap_body_0(torch.nn.Module):
def forward(self, l_x_: "f32[4, 4]"):
y: "f32[4, 4]" = torch.sin(l_x_); l_x_ = None
return (y, y)
return (y,)
""",
)
@ -1969,9 +1797,9 @@ class GraphModule(torch.nn.Module):
out: "f32[4, 4]" = l_x_.sin()
sin_1: "f32[4, 4]" = torch.sin(o)
child: "f32[4, 4]" = torch.cos(sin_1)
child_1: "f32[4, 4]" = torch.sin(l_x_); l_x_ = None
return (child, child_1, matmul, o, out, sin_1)
cos: "f32[4, 4]" = torch.cos(sin_1)
sin_2: "f32[4, 4]" = torch.sin(l_x_); l_x_ = None
return (cos, sin_2, matmul, o, out, sin_1)
""",
)

@ -950,7 +950,7 @@ SeqNr|OrigAten|SrcFn|FwdSrcFn
2|aten.threshold_backward.default||relu
1|aten.native_batch_norm_backward.default||batch_norm
0|aten.convolution_backward.default||conv2d
11|aten.add.Tensor||l1_loss
11|aten.add.Tensor||
"""
),
)

@ -222,13 +222,13 @@ class GraphModule(torch.nn.Module):
matmul: "f32[3, 3]" = l_x_ @ l_y_
sin: "f32[3, 3]" = matmul.sin(); matmul = None
child: "f32[3, 3]" = sin.cos(); sin = None
cos: "f32[3, 3]" = sin.cos(); sin = None
child_1: "f32[3, 3]" = l_x_ + l_y_
child_2: "f32[3, 3]" = l_x_ - l_y_
add: "f32[3, 3]" = l_x_ + l_y_
sub: "f32[3, 3]" = l_x_ - l_y_
child_3: "f32[3, 3]" = l_x_ @ l_y_; l_x_ = l_y_ = None
return (child, child_1, child_2, child_3)
matmul_1: "f32[3, 3]" = l_x_ @ l_y_; l_x_ = l_y_ = None
return (cos, add, sub, matmul_1)
""", # noqa: B950
)
self.assertExpectedInline(

@ -962,7 +962,7 @@ class ExceptionTests(torch._dynamo.test_case.TestCase):
x = (torch.randn(4, 16, requires_grad=True),)
with self.assertRaisesRegex(Exception, "weight = self.linear.w"):
torch._dynamo.functional_export._dynamo_graph_capture_for_export(Model())(x)
torch._dynamo.functional_export.dynamo_graph_capture_for_export(Model())(x)
instantiate_parametrized_tests(ExceptionTests)

@ -2363,6 +2363,34 @@ class FunctionTests(torch._dynamo.test_case.TestCase):
self.assertTrue(same(output, expected))
assert cnt.frame_count == 1
@unittest.skipIf(sys.version_info < (3, 13), "math.fma introduced in python 3.13")
def test_math_fma(self):
def fma_func(a, b, c):
return math.fma(a, b, c)
# Test with scalar constants (constant folding path)
cnt = torch._dynamo.testing.CompileCounter()
cfma_scalars = torch._dynamo.optimize_assert(cnt)(fma_func)
assert cnt.frame_count == 0
expected = fma_func(2.0, 3.0, 4.0)
output = cfma_scalars(2.0, 3.0, 4.0)
self.assertEqual(output, expected)
assert cnt.frame_count == 0
# Test with tensors (Inductor path)
cnt2 = torch._dynamo.testing.CompileCounter()
cfma_tensors = torch._dynamo.optimize_assert(cnt2)(fma_func)
assert cnt2.frame_count == 0
x = torch.tensor(2.0)
y = torch.tensor(3.0)
z = torch.tensor(4.0)
expected_tensors = x * y + z
output_tensors = cfma_tensors(x, y, z)
torch.testing.assert_close(output_tensors, expected_tensors)
assert cnt2.frame_count == 1
@make_test
def test_numpy_meshgrid(x, y):
r1, r2 = np.meshgrid(x.numpy(), y.numpy())
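
The scalar branch of `test_math_fma` above relies on `math.fma(a, b, c)` computing `a * b + c`. A minimal sketch of that behavior outside Dynamo, using a hypothetical wrapper `fma_or_fallback` so that it also runs on interpreters older than Python 3.13. The fallback is an assumption for illustration only: it rounds twice, so it matches the fused result only when the intermediate product is exactly representable, as it is for the inputs used here.

```python
import math


def fma_or_fallback(a: float, b: float, c: float) -> float:
    """Return a * b + c, fused when math.fma is available (Python 3.13+)."""
    fma = getattr(math, "fma", None)
    if fma is not None:
        # Single rounding of the exact value a * b + c.
        return fma(a, b, c)
    # Hypothetical fallback for older interpreters: two roundings.
    return a * b + c


# Same inputs as the scalar branch of the test.
result = fma_or_fallback(2.0, 3.0, 4.0)
print(result)  # 10.0
```

For tensor inputs the test instead compares against `x * y + z` with `torch.testing.assert_close`, which tolerates small rounding differences.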

@ -131,7 +131,7 @@ def default_args_generator(seed_value):
yield new_args
class HigherOrderOpTests(torch._dynamo.test_case.TestCase):
class HigherOrderOpTests(torch._dynamo.test_case.TestCaseWithNestedGraphBreaks):
def _assert_wrap_fallback(self, func, args, setup=lambda: None):
counters.clear()
backend = EagerAndRecordGraphs()
@ -249,7 +249,7 @@ class HigherOrderOpTests(torch._dynamo.test_case.TestCase):
# when testing with dynamic shape, symbols are lifted as input
arg_count = ifdynstaticdefault(2, 3)
self._test_wrap_simple(fn, default_args_generator((x,)), arg_count)
self._test_wrap_simple(fn, default_args_generator((x,)), arg_count, 1)
def test_return_captured_vars(self):
freevar1 = torch.randn(3)
@ -267,7 +267,7 @@ class HigherOrderOpTests(torch._dynamo.test_case.TestCase):
# be the input.
# when testing with dynamic shape, a symbol is lifted as input
arg_count = ifdynstaticdefault(3, 4)
self._test_wrap_simple(fn, default_args_generator((x,)), arg_count, 4)
self._test_wrap_simple(fn, default_args_generator((x,)), arg_count, 1)
def test_return_captured_var_used_multiple_times(self):
freevar = torch.randn(3)
@ -282,7 +282,7 @@ class HigherOrderOpTests(torch._dynamo.test_case.TestCase):
x = torch.randn(3)
# when testing with dynamic shape, a symbol is lifted as input
arg_count = ifdynstaticdefault(3, 4)
self._test_wrap_simple(fn, default_args_generator((x,)), arg_count, 3)
self._test_wrap_simple(fn, default_args_generator((x,)), arg_count, 2)
def test_capture_untracked_global(self):
def f(x):
@ -762,15 +762,15 @@ class GraphModule(torch.nn.Module):
def forward(self, s77: "Sym(s77)", l_x_: "f32[s77]", u0: "Sym(u0)", c: "i64[u0, 1]"):
wrap_body_0 = self.wrap_body_0
wrap = torch.ops.higher_order.wrap(wrap_body_0, s77, l_x_, u0, c); wrap_body_0 = s77 = l_x_ = u0 = c = None
child: "f32[s77]" = wrap[0]
child_1: "f32[u0, 1]" = wrap[1]; wrap = None
return (child, child_1)
getitem: "f32[s77]" = wrap[0]
getitem_1: "f32[u0, 1]" = wrap[1]; wrap = None
return (getitem, getitem_1)
class wrap_body_0(torch.nn.Module):
def forward(self, s77: "Sym(s77)", l_x_: "f32[s77]", u0: "Sym(u0)", c: "i64[u0, 1]"):
child: "f32[s77]" = l_x_.sin(); l_x_ = None
child_1: "f32[u0, 1]" = c.sin(); c = None
return (child, child_1)
sin: "f32[s77]" = l_x_.sin(); l_x_ = None
sin_1: "f32[u0, 1]" = c.sin(); c = None
return (sin, sin_1)
""",
)
else:
@ -801,15 +801,15 @@ class GraphModule(torch.nn.Module):
def forward(self, l_x_: "f32[3]", u0: "Sym(u0)", c: "i64[u0, 1]"):
wrap_body_0 = self.wrap_body_0
wrap = torch.ops.higher_order.wrap(wrap_body_0, l_x_, u0, c); wrap_body_0 = l_x_ = u0 = c = None
child: "f32[3]" = wrap[0]
child_1: "f32[u0, 1]" = wrap[1]; wrap = None
return (child, child_1)
getitem: "f32[3]" = wrap[0]
getitem_1: "f32[u0, 1]" = wrap[1]; wrap = None
return (getitem, getitem_1)
class wrap_body_0(torch.nn.Module):
def forward(self, l_x_: "f32[3]", u0: "Sym(u0)", c: "i64[u0, 1]"):
child: "f32[3]" = l_x_.sin(); l_x_ = None
child_1: "f32[u0, 1]" = c.sin(); c = None
return (child, child_1)
sin: "f32[3]" = l_x_.sin(); l_x_ = None
sin_1: "f32[u0, 1]" = c.sin(); c = None
return (sin, sin_1)
""",
)
@ -922,16 +922,16 @@ class GraphModule(torch.nn.Module):
def forward(self, l_x_: "f32[3]", size: "Sym(u0)", c: "i64[u0, 1]"):
wrap_body_0 = self.wrap_body_0
wrap = torch.ops.higher_order.wrap(wrap_body_0, l_x_, size, c); wrap_body_0 = l_x_ = size = c = None
child: "f32[3]" = wrap[0]
child_1: "f32[u0, 1]" = wrap[1]; wrap = None
return (child, child_1)
getitem: "f32[3]" = wrap[0]
getitem_1: "f32[u0, 1]" = wrap[1]; wrap = None
return (getitem, getitem_1)
class wrap_body_0(torch.nn.Module):
def forward(self, l_x_: "f32[3]", size: "Sym(u0)", c: "i64[u0, 1]"):
sin: "f32[3]" = l_x_.sin(); l_x_ = None
child: "f32[3]" = sin + size; sin = size = None
child_1: "f32[u0, 1]" = c.sin(); c = None
return (child, child_1)
add: "f32[3]" = sin + size; sin = size = None
sin_1: "f32[u0, 1]" = c.sin(); c = None
return (add, sin_1)
""",
)
@ -2458,10 +2458,10 @@ class GraphModule(torch.nn.Module):
class wrap_body_0(torch.nn.Module):
def forward(self, l_arg1_0_: "f32[3]", l_arg2_0_: "f32[3]"):
child: "f32[3]" = l_arg1_0_ + 1; l_arg1_0_ = None
add: "f32[3]" = l_arg1_0_ + 1; l_arg1_0_ = None
child_1: "f32[3]" = l_arg2_0_ + 1; l_arg2_0_ = None
return (child, child_1)
add_1: "f32[3]" = l_arg2_0_ + 1; l_arg2_0_ = None
return (add, add_1)
""",
)
@ -2655,9 +2655,9 @@ class GraphModule(torch.nn.Module):
class wrap_body_0(torch.nn.Module):
def forward(self, l_x_: "f32[2, 3]"):
child: "f32[2, 3]" = l_x_.sin()
child_1: "f32[2, 3]" = l_x_.cos(); l_x_ = None
return (child, child_1)
sin: "f32[2, 3]" = l_x_.sin()
cos: "f32[2, 3]" = l_x_.cos(); l_x_ = None
return (sin, cos)
""",
)
@ -2687,13 +2687,13 @@ class GraphModule(torch.nn.Module):
wrap_body_0 = self.wrap_body_0
wrap = torch.ops.higher_order.wrap(wrap_body_0, l_x_); wrap_body_0 = l_x_ = None
value: "f32[3]" = wrap[0]; wrap = None
return (value,)
getitem: "f32[3]" = wrap[0]; wrap = None
return (getitem,)
class wrap_body_0(torch.nn.Module):
def forward(self, l_x_: "f32[3]"):
child: "f32[3]" = -l_x_; l_x_ = None
return (child,)
neg: "f32[3]" = -l_x_; l_x_ = None
return (neg,)
""",
)
@ -3318,17 +3318,17 @@ class GraphModule(torch.nn.Module):
hints_wrapper_body_1 = self.hints_wrapper_body_1
hints_wrapper = torch.ops.higher_order.hints_wrapper(hints_wrapper_body_1, (x, l_y_), {}, hints = {'outer_body': True}); hints_wrapper_body_1 = x = l_y_ = None
res: "f32[2, 4]" = hints_wrapper[0]; hints_wrapper = None
return (res,)
getitem: "f32[2, 4]" = hints_wrapper[0]; hints_wrapper = None
return (getitem,)
class hints_wrapper_body_1(torch.nn.Module):
def forward(self, x: "f32[2, 4]", l_y_: "f32[4]"):
hints_wrapper_body_0 = self.hints_wrapper_body_0
hints_wrapper = torch.ops.higher_order.hints_wrapper(hints_wrapper_body_0, (x, l_y_), {}, hints = {'inner_body': True}); hints_wrapper_body_0 = x = l_y_ = None
x_1: "f32[2, 4]" = hints_wrapper[0]; hints_wrapper = None
getitem: "f32[2, 4]" = hints_wrapper[0]; hints_wrapper = None
x_2: "f32[2, 4]" = torch.abs(x_1); x_1 = None
return (x_2,)
x_1: "f32[2, 4]" = torch.abs(getitem); getitem = None
return (x_1,)
class hints_wrapper_body_0(torch.nn.Module):
def forward(self, x: "f32[2, 4]", l_y_: "f32[4]"):
@ -3396,7 +3396,9 @@ class GraphModule(torch.nn.Module):
fn_with_hints(x, y)
class HigherOrderOpVmapGuardTests(LoggingTestCase):
class HigherOrderOpVmapGuardTests(
torch._dynamo.test_case.TestCaseWithNestedGraphBreaks, LoggingTestCase
):
@make_logging_test(recompiles=True)
def test_vmap_grad_guard_ok(self, records):
vmap = torch.vmap
@ -3665,7 +3667,9 @@ class HigherOrderOpVmapGuardTests(LoggingTestCase):
self.assertGreater(len(records), 0)
class FuncTorchHigherOrderOpTests(torch._dynamo.test_case.TestCase):
class FuncTorchHigherOrderOpTests(
torch._dynamo.test_case.TestCaseWithNestedGraphBreaks
):
def tearDown(self):
# Ensure that in the case of a test failure, the next test won't fail
# because of a previous call to _vmap_increment_nesting that wasn't undone
@ -6782,7 +6786,9 @@ class GraphModule(torch.nn.Module):
self.assertEqual(expected, actual)
class ActivationCheckpointingTests(torch._dynamo.test_case.TestCase):
class ActivationCheckpointingTests(
torch._dynamo.test_case.TestCaseWithNestedGraphBreaks
):
def _validate(self, fn, backend, *args, skip_check=False, fullgraph=True):
cloned_args = []
for arg in args:
@ -7173,7 +7179,7 @@ xfail_hops_compile = {
}
class TestHigherOrderOpsOpInfo(torch._dynamo.test_case.TestCase):
class TestHigherOrderOpsOpInfo(torch._dynamo.test_case.TestCaseWithNestedGraphBreaks):
@requires_cuda_and_triton
@parametrize("backend", ("aot_eager", "inductor"))
@ops(

Some files were not shown because too many files have changed in this diff