Compare commits


122 Commits

Author SHA1 Message Date
bac6f34ed0 [dynamo][export] Make the source_stack and fqn info same between dynamo and export 2025-09-28 14:44:31 -07:00
8239ba4087 Fix various bugs in subclass input in export (#163770)
This adds basic support for subclass inputs in export (specifically for non-strict). I had to make fakify a little more complicated, which risks further divergence from dynamo fakification. But the dynamo one is so complex that I feel it is better to do it this way. Also improved the fake mode detection logic to recursively look into subclass inner tensors.

Differential Revision: [D83156489](https://our.internmc.facebook.com/intern/diff/D83156489)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163770
Approved by: https://github.com/avikchaudhuri
2025-09-28 18:03:32 +00:00
1fdd99de71 Building guards should be under metrics_context (#163967)
Differential Revision: [D83354042](https://our.internmc.facebook.com/intern/diff/D83354042)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163967
Approved by: https://github.com/avikchaudhuri
2025-09-28 16:28:34 +00:00
38ed608956 Better error handling in torch/nativert/* (#163308)
Replace the **runtime_error** of the vanilla C++ exceptions with **TORCH_CHECK** in **torch/nativert/***

Vanilla C++ exceptions should not exist in the core part of PyTorch because of its cross-language nature. Compared with vanilla C++ exceptions, TORCH_CHECK has richer error context and a unified error handling mechanism. This commit replaces runtime_error with TORCH_CHECK in the files under torch/nativert/*.

Fixes part of #148114

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163308
Approved by: https://github.com/dolpm
2025-09-28 14:23:44 +00:00
238dc65368 [ROCm] use hipSolver instead of MAGMA for Cholesky (#163977)
Currently, the Cholesky factorization and least squares operations default to MAGMA when PyTorch is compiled for ROCm, which shows suboptimal performance.
This change allows PyTorch to rely on hipSolver instead of MAGMA.
@jeffdaily

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163977
Approved by: https://github.com/Skylion007
2025-09-28 06:53:06 +00:00
7bbde0c094 Remove unused argument from DEFINE_BINARY macro. (#163868)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163868
Approved by: https://github.com/Skylion007
ghstack dependencies: #163822
2025-09-28 06:32:41 +00:00
dfcab0e7e1 Handle DDE in infer_size_impl (#163822)
Hit this while running vLLM with unbacked symints for the model Qwen/Qwen2-1.5B-Instruct.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163822
Approved by: https://github.com/bobrenjc93, https://github.com/Skylion007
2025-09-28 06:32:41 +00:00
1cc9263f52 [vllm hash update] update the pinned vllm hash (#164053)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned vllm hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164053
Approved by: https://github.com/pytorchbot
2025-09-28 04:35:17 +00:00
c2862c8e66 [distributed] Remove python code older than 3.10 (#163613)
Because the minimum Python version is now 3.10.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163613
Approved by: https://github.com/XuehaiPan, https://github.com/kwen2501
2025-09-28 04:15:24 +00:00
b377c9e365 graph break on tolist if capture_scalar_outputs is false (#163807)
address https://github.com/pytorch/pytorch/issues/163798

It's problematic not to graph break because:
1. It breaks the current contract.
2. If Dynamo traces through, we end up with an .item() call, and if we ever re-trace later (in autograd, for example) we hit a failure (we do not know where to graph break at that point)! See the added unit test and the sketch below.
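
As a rough illustration (a minimal sketch; the function, shapes, and config toggling here are assumptions, not the added unit test itself), the flag that controls this behavior is `torch._dynamo.config.capture_scalar_outputs`:

```python
import torch
import torch._dynamo as dynamo

def f(x):
    vals = x.tolist()       # data-dependent: each element becomes a Python scalar
    return x + len(vals)

x = torch.arange(4)

# Default config (capture_scalar_outputs=False): .tolist() causes a graph break
# and the call falls back to eager for that part.
dynamo.config.capture_scalar_outputs = False
print(torch.compile(f)(x))

# With capture_scalar_outputs=True, Dynamo may capture the scalars instead of breaking.
dynamo.reset()
dynamo.config.capture_scalar_outputs = True
print(torch.compile(f)(x))
```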

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163807
Approved by: https://github.com/bobrenjc93
2025-09-28 04:02:52 +00:00
3059b08012 [inductor] add subsystem to pattern matcher (#163922)
Summary:
Running a toy example through `torch.compile(fullgraph=True, backend="inductor")` with default inductor config, I tried to see what passes are run in each of pre-grad, joint-graph, and post-grad phases by printing out the subsystem in `GraphTransformObserver`. However the subsystem showed up as None in a bunch of transforms that were run in each of those phases, so this PR adds some additional annotations.

Note that these annotations are probably not a complete set, since other transforms may run based on changes to the config that are not covered here.

Hopefully this doesn't change behavior. However, I did notice that bisecting relies on disabling various phases, which means that while before some passes would *not* be disabled (because their subsystem was `None`), now they would.

Test Plan: existing tests + manual test described in summary

Differential Revision: D83306676

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163922
Approved by: https://github.com/jansel
2025-09-28 03:15:23 +00:00
5504a06e01 [BE]: Update NCCL to 2.28.3 (#162351)
@eqy The new NCCL has a bunch of bugfixes for features, including reducing the number of SMs needed by NVLINK collectives, as well as some very useful new APIs for SymmetricMemory. It also allows FP8 support for non-reductive operations on pre-sm90 devices.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162351
Approved by: https://github.com/ezyang, https://github.com/malfet, https://github.com/atalman
2025-09-28 01:38:59 +00:00
1ad491dd88 Better error handling in torch/csrc/jit/ir/* (#163757)
Refactor error handling to use TORCH_CHECK for improved clarity in constants and scope management

Fixes some parts of issue #148114

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163757
Approved by: https://github.com/albanD
2025-09-28 01:18:24 +00:00
fd20889d0b Add type annotations to MPS profiler utilities (#163486)
## Summary
- drop the local mypy allow-untyped-defs escape hatch in the MPS profiler helpers
- annotate the context managers and bool helpers so they type-check cleanly

## Testing
- python -m mypy torch/mps/profiler.py --config-file mypy-strict.ini

------
https://chatgpt.com/codex/tasks/task_e_68d0ce4df2e483268d06673b65ef7745
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163486
Approved by: https://github.com/Skylion007
2025-09-27 23:00:53 +00:00
2ce2e48a05 [WIP][symm_mem] Add a wait for signal and put signal for one side API (#159837)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159837
Approved by: https://github.com/kwen2501
2025-09-27 21:20:13 +00:00
1d98be6abf [NFC] fixed typo in sparse semi structured filename (#163904)
Make sure all semi structured files use "SparseSemiStructured"

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163904
Approved by: https://github.com/Skylion007
2025-09-27 21:19:48 +00:00
dfda239cce [DTensor] Raise an RuntimeError when checkpointing APIs are used with Partial placement (#163941)
A DTensor that contains partial placement shouldn't be checkpointed (DCP.save) -- the result is not correct and DCP doesn't know how to handle it.

There are several APIs that are only used by checkpointing, e.g.,`__create_write_items__`. These APIs should raise an exception if the DTensor, `self`, has Partial placement.

Ideally, we want to add the following test:

```
        with self.assertRaisesRegex(
            RuntimeError, "Any checkpointing related operations are not supported for"
        ):

            dcp.save({"dtensor": dtensor}, checkpoint_id=tempfile.gettempdir())
```

While we do see the RuntimeError being raised, it is raised in another thread because the DTensor checkpoint APIs are called by DCP in a separate thread, which assertRaisesRegex cannot capture.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163941
Approved by: https://github.com/tianyu-l
2025-09-27 19:50:16 +00:00
991e3d0d16 [dynamo][guards] Revert introduction of different types of lambda_guards (#163385)
With the
https://fb.workplace.com/groups/260102303573409/permalink/787294574187510/
issue, it might be a better idea to just speed up _realize_dict and keep
the changes very local. So reverting this PR as well, to return to a clean
slate.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163385
Approved by: https://github.com/jansel
2025-09-27 18:20:48 +00:00
8f6dbc0ba8 [scan] create fw and bw graphs via partitioning (#162754)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162754
Approved by: https://github.com/zou3519
ghstack dependencies: #161557, #161664, #161808, #162025, #161732
2025-09-27 18:13:15 +00:00
3413490f53 [scan] materialize combine_fn in forward add more autograd tests (#161732)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161732
Approved by: https://github.com/zou3519
ghstack dependencies: #161557, #161664, #161808, #162025
2025-09-27 18:13:15 +00:00
b85bee3bbb [hop] refactor check input alias and mutation to be a graph pass (#162025)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162025
Approved by: https://github.com/zou3519
ghstack dependencies: #161557, #161664, #161808
2025-09-27 18:13:15 +00:00
66dbf2c9f5 [scan][autograd] clone outputs that's aliasing with inputs or outputs in bw (#161808)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161808
Approved by: https://github.com/zou3519
ghstack dependencies: #161557, #161664
2025-09-27 18:13:15 +00:00
f5d85874dd [scan][be] remove unnecessary tensor checks (#161664)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161664
Approved by: https://github.com/Skylion007, https://github.com/zou3519
ghstack dependencies: #161557
2025-09-27 18:13:14 +00:00
8f15d6a0c9 [test][scan] refactor inductor test and prepare for adding bw tests (#161557)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161557
Approved by: https://github.com/zou3519
2025-09-27 18:13:14 +00:00
e78792a70d Update ctc loss docs float32 input required for CuDNN (#162042)
Discovered while working on https://github.com/pytorch/pytorch/pull/159106 the non-obvious requirement that inputs must be float32 to use CuDNN (https://github.com/pytorch/pytorch/pull/159106#issuecomment-3189981705), otherwise the native CUDA implementation is called.

Updates the docs.
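
A hedged sketch of the documented requirement (shapes and lengths are illustrative assumptions): keeping `log_probs` in float32 is what makes the CuDNN path eligible on CUDA; float64 inputs fall back to the native CUDA implementation.

```python
import torch
import torch.nn.functional as F

T, C, N, S = 50, 20, 4, 10  # input length, classes, batch size, target length

# float32 log_probs are required for the CuDNN CTC path; kept device-agnostic here,
# move the tensors to CUDA to actually exercise CuDNN.
log_probs = torch.randn(T, N, C, dtype=torch.float32).log_softmax(2)
targets = torch.randint(1, C, (N, S), dtype=torch.long)
input_lengths = torch.full((N,), T, dtype=torch.long)
target_lengths = torch.full((N,), S, dtype=torch.long)

loss = F.ctc_loss(log_probs, targets, input_lengths, target_lengths)
print(loss)
```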

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162042
Approved by: https://github.com/mikaylagawarecki

Co-authored-by: mikaylagawarecki <mikaylagawarecki@gmail.com>
2025-09-27 18:10:17 +00:00
d9db838f58 [CI] Re-enable test_all_to_all_vdev_2d_offset (#163985)
Fixes https://github.com/pytorch/pytorch/issues/163847
Moving allocations upfront and collectives later. The hang goes away.

My investigation indicates that the hang is inside the last call `torch.testing.assert_close(out_expected, out[:out_numel])`. Rank 3 calls into it, but never gets out. Don't know why yet. I will investigate more.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163985
Approved by: https://github.com/fegin
2025-09-27 16:56:25 +00:00
6ba83e06a5 [AMP] Add deprecated decorator for torch.xxx.amp.autocast class (#163654)
As the title stated.

**Changes:**
- torch.cuda.amp.autocast
- torch.cpu.amp.autocast
- add explicit `__new__` and `__init_subclass__` to the classes above so that inspect.signature can retrieve the correct signature (see the sketch below)
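
A minimal sketch of the user-facing effect (the exact warning category and text are assumptions): the device-specific classes warn, while `torch.amp.autocast` with an explicit `device_type` is the replacement.

```python
import warnings
import torch

x = torch.randn(4, 4)

# Deprecated spelling: expected to emit a deprecation-style warning.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    with torch.cpu.amp.autocast():
        y = x @ x
print([str(w.message) for w in caught])

# Recommended replacement.
with torch.amp.autocast(device_type="cpu"):
    y = x @ x
print(y.dtype)  # typically bfloat16 under CPU autocast
```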

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163654
Approved by: https://github.com/Skylion007
2025-09-27 14:37:12 +00:00
960290d629 [Docs] Add standard-imghdr for PyTorch Doc (#163944)
As the title stated.

Python [PEP 594](https://peps.python.org/pep-0594) removed imghdr from the Python standard library, and older versions of Sphinx don't add it as an installation dependency, so we need to add it to the requirements as a temporary dependency.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163944
Approved by: https://github.com/albanD, https://github.com/svekars
2025-09-27 08:14:51 +00:00
b1a4efc302 [amd] Add cudaHostFn_t to cuda_to_hip_mappings (#164007)
Summary: See title

Test Plan:
```
buck build --flagfile fbcode//mode/opt-amd-gpu fbcode//comms/ctran/algos/common/tests:ctran_algo_gpe_kernel_sync_test
```
After fix: https://www.internalfb.com/buck2/362ff91e-53f2-4b82-9536-cb84c91384a2

Before fix: failed in D83294731 (version 1):
https://www.internalfb.com/sandcastle/workflow/1792432651703947243

Differential Revision: D83375414

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164007
Approved by: https://github.com/llxxee
2025-09-27 06:09:50 +00:00
96182faf96 [CI][Distributed][CUDA][Symm-Mem] Enable B200 Symm Mem Test (#162988)
Inspired by https://github.com/pytorch/pytorch/pull/162981 and motivated by https://github.com/pytorch/pytorch/pull/159323 taking a total of 20 hours to finish (and unlikely to make it in short time due to https://github.com/pytorch/pytorch/issues/162178 )

Creating this subtest to get *something distributed* on B200.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162988
Approved by: https://github.com/malfet
2025-09-27 05:12:05 +00:00
dcb8af7501 [torchfuzz] fix bool propagation (#164003)
Bools can't propagate through the current pointwise ops such as add/mul. Once we add more ops that can, we'll probably want to add an additional subclass that supports pointwise bools, but for now just don't allow it.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164003
Approved by: https://github.com/pianpwk
ghstack dependencies: #163743, #163812, #163890, #164002
2025-09-27 04:51:29 +00:00
280e712c13 [vllm hash update] update the pinned vllm hash (#164029)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned vllm hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164029
Approved by: https://github.com/pytorchbot
2025-09-27 04:34:57 +00:00
254d2864d6 Add runtime_overhead PR Time Benchmark (#163866)
This adds a PR time benchmark that checks for runtime overhead on a very small graph. This will help track regressions in runtime overhead.

Example Results:
```
runtime_overhead_inductor,instruction_count,222645
runtime_overhead_inductor_inference_mode,instruction_count,234998
runtime_overhead_inductor_requires_grad,instruction_count,293556
runtime_overhead_inductor_requires_grad_backward,instruction_count,78181
runtime_overhead_inductor_dynamic,instruction_count,234870
runtime_overhead_inductor_inference_mode_dynamic,instruction_count,248711
runtime_overhead_inductor_requires_grad_dynamic,instruction_count,309979
runtime_overhead_inductor_requires_grad_backward_dynamic,instruction_count,77599
```
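
For intuition, a rough wall-clock sketch of the per-call overhead that the instruction-count benchmark above tracks (the function and iteration counts are assumptions):

```python
import timeit
import torch

@torch.compile
def tiny(x):
    return x + 1

x = torch.randn(8)
tiny(x)  # trigger compilation outside the timed region

eager_us = timeit.timeit(lambda: x + 1, number=10_000) * 1e6 / 10_000
compiled_us = timeit.timeit(lambda: tiny(x), number=10_000) * 1e6 / 10_000
print(f"eager {eager_us:.1f}us, compiled {compiled_us:.1f}us (difference ~= runtime overhead)")
```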

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163866
Approved by: https://github.com/jansel, https://github.com/mlazos, https://github.com/anijain2305
2025-09-27 03:26:59 +00:00
9dac6437da lint: Filter out /usr/include from results (#164012)
Signed-off-by: Eli Uriegas <eliuriegas@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164012
Approved by: https://github.com/ZainRizvi
ghstack dependencies: #164008
2025-09-27 00:54:07 +00:00
8a0e8cad5f lint: Only include files in pytorch (#164008)
We were seeing instances of stdlib files in clang-tidy output, so this
just essentially removes them from the things that lintrunner will
report. The longer-term fix here would be to modify the clang-tidy
configuration to do the correct thing, but that requires a
bit more investigation as to why this is only happening in CI and is not
reproducible locally.

Signed-off-by: Eli Uriegas <eliuriegas@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164008
Approved by: https://github.com/ZainRizvi
2025-09-27 00:54:07 +00:00
3a115da3e6 [torchfuzz] ones over zero (#164002)
Reduces the likelihood of divide-by-zero errors. Long term we'll probably want to just fuzz these values entirely.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164002
Approved by: https://github.com/pianpwk
ghstack dependencies: #163743, #163812, #163890
2025-09-27 00:53:02 +00:00
b48a3d0a38 [CuTe] Add layout overlap checking util function in _MeshLayout (#163367)
While refactoring the bookkeeping for DeviceMesh to leverage the CuTe layout, we found that we need two more util functions. One is to check whether a layout has internal overlap or not. For example, (2,2):(2,1) has no overlap while (2,2):(2,2) has overlap.
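
A standalone sketch of what "overlap" means here (not the actual `_MeshLayout` code): enumerate every coordinate of the shape, map it through the strides, and check whether two coordinates hit the same linear offset.

```python
from itertools import product

def layout_has_overlap(shape: tuple[int, ...], stride: tuple[int, ...]) -> bool:
    """Return True if two distinct coordinates map to the same linear offset."""
    offsets = set()
    for coord in product(*(range(s) for s in shape)):
        off = sum(c * st for c, st in zip(coord, stride))
        if off in offsets:
            return True
        offsets.add(off)
    return False

print(layout_has_overlap((2, 2), (2, 1)))  # False: offsets are {0, 1, 2, 3}
print(layout_has_overlap((2, 2), (2, 2)))  # True: (0,1) and (1,0) both map to 2
```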

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163367
Approved by: https://github.com/fegin
ghstack dependencies: #163212, #163288, #163928, #163930
2025-09-27 00:22:14 +00:00
8d474bdc14 Change python grid calc for MTIA back to python mode (#163601)
Differential Revision: D83000165

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163601
Approved by: https://github.com/blaine-rister
2025-09-27 00:12:53 +00:00
008051b13c [Dynamic Shape][BE] trim _DimHint serialization (#163891)
Summary:
current serialization is a bit hard to read
```
Exporting with the dynamic shape spec: {getitem_123: (_DimHint(type=<_DimHintType.DYNAMIC: 3>, min=1, max=64, _factory=False)), getitem_118: (_DimHint(type=<_DimHintType.DYNAMIC: 3>,
min=489, max=31232, _factory=False)), getitem_117: (_DimHint(type=<_DimHintType.DYNAMIC: 3>, min=489, max=31232, _factory=False)), getitem_116: (_DimHint(type=<_DimHintType.DYNAMIC: 3>, min=489, max=31232, _factory=False)), getitem_115: (
_DimHint(type=<_DimHintType.STATIC: 2>, min=None, max=None, _factory=True), _DimHint(type=<_DimHintType.DYNAMIC: 3>, min=1, max=64, _factory=False)), getitem_46: (_DimHint(type=<_DimHintType.DYNAMIC: 3>, min=29, max=1792, _factory=False),
 _DimHint(type=<_DimHintType.STATIC: 2>, min=None, max=None, _factory=True)), _predict_module__base_model_model_ro_sparse_arch_ebc__output_dists_0__dist: (_DimHint(type=<_DimHintType.DYNAMIC: 3>, min=1, max=64, _factory=False), _DimHint(t
ype=<_DimHintType.STATIC: 2>, min=None, max=None, _factory=True)), _predict_module__base_model_model_nro_sparse_arch_ebc__output_dists_0__dist: (_DimHint(type=<_DimHintType.DYNAMIC: 3>, min=29, max=1792, _factory=False)...
```

Test Plan: UT

Differential Revision: D83175131

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163891
Approved by: https://github.com/pianpwk
2025-09-27 00:08:01 +00:00
e4ffd718ec Fix setting of memory fraction in test_garbage_collect_expandable (#164000)
Fixes #160598
Fixes #160551
Fixes #160507

This PR fixes a bug in the `test_garbage_collect_expandable` unit test where the finally block incorrectly re-reads the current per-process memory fraction instead of setting the original value. Without the fix, the other tests in the `test/test_cuda.py` test suite were impacted and failed with an OOM error on ROCm.

This ensures proper cleanup and isolation of test state, maintaining test correctness and avoiding side effects like the below OOM error that it caused.

For example, `test_autocast_checkpointing`  failed with the below error https://github.com/pytorch/pytorch/actions/runs/17982223758/job/51153974194 on ROCm

`torch.OutOfMemoryError: HIP out of memory. Tried to allocate 76.00 MiB. GPU 0 has a total capacity of 255.69 GiB of which 252.97 GiB is free. 1.20 GiB allowed; Of the allocated memory 1.14 GiB is allocated by PyTorch, with 17.00 MiB allocated in private pools (e.g., HIP Graphs), and 18.63 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164000
Approved by: https://github.com/jeffdaily
2025-09-26 23:57:32 +00:00
ed3085814a [cuDNN][SDPA] Disable dropout for cuDNN SDPA on 9.11 - 9.13 (#163903)
cuDNN introduced some broken heuristics for these cases so we need to disable dropout to avoid unexpected crashes due to heuristics refusing to proceed
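
A hedged repro-style sketch (shapes are assumptions, and it needs a CUDA device): explicitly preferring the cuDNN backend with a non-zero `dropout_p`, which is the combination now routed away from cuDNN on the affected versions.

```python
import torch
import torch.nn.functional as F
from torch.nn.attention import SDPBackend, sdpa_kernel

q = torch.randn(2, 8, 128, 64, device="cuda", dtype=torch.float16)
k, v = torch.randn_like(q), torch.randn_like(q)

# Prefer the cuDNN backend but allow the math fallback: with dropout_p > 0 on
# cuDNN 9.11-9.13, cuDNN is no longer eligible, so the fallback is taken instead
# of risking a crash in cuDNN's heuristics.
with sdpa_kernel([SDPBackend.CUDNN_ATTENTION, SDPBackend.MATH]):
    out = F.scaled_dot_product_attention(q, k, v, dropout_p=0.1)
print(out.shape)
```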

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163903
Approved by: https://github.com/ngimel, https://github.com/malfet, https://github.com/atalman
2025-09-26 23:50:09 +00:00
e2817ac204 [cuDNN][Convolution] Disable cuDNN for 3D convolutions with kernel size != 1 for cuDNN 9.8+ (#163581)
To workaround #163539

Still confirming whether 9.10 is affected. The original test states that the convolution is "large," but note that the input size does not appear to require 64-bit indexing.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163581
Approved by: https://github.com/ngimel, https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2025-09-26 23:47:29 +00:00
1d138e658d [AOTI] log error triton kernel name during autotune (#163889)
Summary: can't tell from the current error message which kernel got the exception

Test Plan: lint & pyre

Reviewed By: muchulee8

Differential Revision: D83246522

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163889
Approved by: https://github.com/jansel
2025-09-26 23:29:49 +00:00
f9095fb285 [Windows] Update libuv version from 1.39 to 1.51 (#160318)
Fixes: [#148315](https://github.com/pytorch/pytorch/issues/148315)

The PR updates `libuv` version as `conda-forge` channel doesn't contain `libuv=1.39` for Windows.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160318
Approved by: https://github.com/iremyux, https://github.com/malfet
2025-09-26 23:29:21 +00:00
a0136f149c [MPS] Fix nan behavior in grid_sampler_3d (#163881)
Fixes #163851
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163881
Approved by: https://github.com/malfet
2025-09-26 23:08:00 +00:00
62b0ebd8f9 [MPS] [Sparse] unique_dim and sparse broadcast (#163694)
Implements unique_dim, sparse broadcast ops and adds dtypes for mps for tests where we expect to fail, otherwise they would always fail due to being run in double precision

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163694
Approved by: https://github.com/malfet
2025-09-26 23:03:13 +00:00
19f16a65b4 [torchfuzz] Add support for fuzz templates (#163890)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163890
Approved by: https://github.com/pianpwk
ghstack dependencies: #163743, #163812
2025-09-26 22:51:45 +00:00
0ebfa3d7d2 Avoid fast path mask left-align check in compiled TransformerEncoder (#163773)
Fixes #163640

This PR avoids a mask left align check in the case that we're operating under torch.compile / torch.export. Originally, I planned to make a more invasive change to auto-disable the fast path entirely underneath torch.compile / torch.export, but I realized during testing that the fast path wasn't actually causing compile issues outside of the narrow issue identified here.
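
A hedged sketch of the kind of call this affects (layer sizes and mask contents are assumptions): compiling a `TransformerEncoder` forward that passes a padding mask.

```python
import torch
import torch.nn as nn

layer = nn.TransformerEncoderLayer(d_model=16, nhead=4, batch_first=True)
encoder = nn.TransformerEncoder(layer, num_layers=1).eval()

x = torch.randn(2, 5, 16)
# Padding mask: True marks positions to ignore.
src_key_padding_mask = torch.tensor([[False, False, False, True, True],
                                     [False, False, True, True, True]])

compiled = torch.compile(encoder)
with torch.no_grad():
    out = compiled(x, src_key_padding_mask=src_key_padding_mask)
print(out.shape)
```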
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163773
Approved by: https://github.com/mikaylagawarecki
2025-09-26 22:29:37 +00:00
0ea10f9912 [cuDNN][conv][64-bit] Disable cuDNN for 64-bit depthwise convs again (#163171)
The test is breaking; will check if there's an older version that we can enable this on, to avoid completely dropping support.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163171
Approved by: https://github.com/ngimel, https://github.com/malfet
2025-09-26 22:12:17 +00:00
48a852b7ae [AOTI] Update AOTInductor tutorial (#163808)
Summary: Remove the BC breaking warning. Add inductor_config to the example code.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163808
Approved by: https://github.com/yushangdi
2025-09-26 22:01:31 +00:00
f1260c9b9a [ROCm][CI/CD] upgrade nightly wheels to ROCm 7.0 (#163937)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163937
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-09-26 21:42:09 +00:00
28c7d11428 [AOTI] Pass in shape_env for get_stride_order (#163925)
Summary:
As titled.
Without the diff, we got P1963055009

With the diff passing in the environment, we can do correct sym_int deduction:
https://fburl.com/mlhub/p5zy7o28

Test Plan:
```
buck2 test 'fbcode//mode/opt' fbcode//caffe2/test/inductor:unbacked_symints -- test_sdfpa_unbacked_strides --print-passing-details --env TORCHDYNAMO_EXTENDED_DEBUG_CPP=1 --env TORCHDYNAMO_EXTENDED_DEBUG_GUARD_ADDED="Eq(u0, 0)"
```
Without the fix: P1964887260
With the fix: P1964888579

Differential Revision: D83211018

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163925
Approved by: https://github.com/ColinPeppler
2025-09-26 21:10:03 +00:00
a60c6ed99f [DeviceMesh][ez] Extract the pg creation as a util function (#163930)
This is just to extract common logic into a util function because we will use it many times for the following stack of Device Mesh refactoring.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163930
Approved by: https://github.com/fegin
ghstack dependencies: #163212, #163288, #163928
2025-09-26 20:42:58 +00:00
c257570e6c [inductor] require shape in TritonCSEVariable (#162275)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162275
Approved by: https://github.com/mlazos
2025-09-26 20:41:12 +00:00
2f85de0b42 Fix preserve annotation with decomp (#163896)
If we use `fx_traceback.preserve_node_meta()`, we will have a few extra node.meta fields on nodes, such as "seq_nr", added from `fx/proxy.py`. As a result, there might be non-empty node.meta on graph nodes.
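
A small hedged sketch of the context manager in question (the traced function is an assumption): under `preserve_node_meta()`, proxy-created nodes may carry extra metadata such as `seq_nr`, so downstream passes should not assume `node.meta` is empty.

```python
import torch
import torch.fx as fx
import torch.fx.traceback as fx_traceback

def f(x):
    return torch.sin(x) + 1

with fx_traceback.preserve_node_meta():
    gm = fx.symbolic_trace(f)

for node in gm.graph.nodes:
    # With metadata preservation on, node.meta may be non-empty (e.g. "seq_nr").
    print(node.name, dict(node.meta))
```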
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163896
Approved by: https://github.com/SherlockNoMad, https://github.com/ydwu4
2025-09-26 20:28:47 +00:00
e21b037756 Add tests for aot_export_joint_with_descriptors annotation (#163893)
As title, test

1) Annotation works with aot_export_joint_with_descriptor API
2) Annotation works with the 2 step "strict export.export + aot_export_joint_with_descriptor"
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163893
Approved by: https://github.com/SherlockNoMad
2025-09-26 19:25:44 +00:00
f8c7505855 [inductor] Fix unbounded number of substitutions when equality checks contain Max expr (#163685)
## Issue

From an internal use case, we found that if we have an equality rule like:

```
Max(15, u0) == s0 * Max(15, u0)
```

This would lead to a wrong substitution rule being generated in the substitution table; as a result, the process gets stuck in the substitution loop and appears to hang indefinitely, as it keeps doing the following substitutions:

```
Max(15, u0)
--> s0 * Max(15, u0)
--> s0 ** 2 * Max(15, u0)
--> s0 ** 3 * Max(15, u0)
--> s0 ** 4 * Max(15, u0)
...
```

The root cause is with SymPy expression comparison: as `Max` is [not inside the op class table](https://github.com/sympy/sympy/blob/1.14/sympy/core/basic.py#L50-L86), it takes the [UNKNOWN](https://github.com/sympy/sympy/blob/1.14/sympy/core/basic.py#L120) order and is considered bigger than any other type of expression.

## Fix
1. Added a break-out from the substitution while-loop to warn about any excessive substitutions; what threshold should be used here and how to pass it are open to suggestion, using a hard-coded static value to keep it simple for now
2. Enhanced the SymPy expression comparison logic so that we first check whether one expr "has" the other one, to help work around the issue with `Max` here (see the sketch below)
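
A standalone SymPy sketch of why a naive substitution loop diverges with the rule `Max(15, u0) -> s0 * Max(15, u0)` (symbol names mirror the ones above; the loop limit is an assumption):

```python
from sympy import Max, symbols

u0, s0 = symbols("u0 s0", positive=True, integer=True)

expr = Max(15, u0)
rule = {Max(15, u0): s0 * Max(15, u0)}  # the problematic rule from the equality above

limit = 5
for _ in range(limit):
    new_expr = expr.subs(rule)   # every substitution re-introduces Max(15, u0)
    if new_expr == expr:
        break
    expr = new_expr
print(expr)  # s0**5 * Max(15, u0): never converges, hence the substitution limit
```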

## Testing

- with the unittest alone --> the unittest gets stuck
- with the unittest and the while-loop breakout, we see the tests finish with the warning "**Substitution limit reached**":
```
test/inductor/test_aot_inductor.py::AOTInductorTestABICompatibleCpu::test_unbounded_expr_substitutions_cpu W0923 13:00:37.864000 46140 /data/users/q1l1/pytorch/torch/_export/__init__.py:70] +============================+
W0923 13:00:37.864000 46140 /data/users/q1l1/pytorch/torch/_export/__init__.py:71] |     !!!   WARNING   !!!    |
W0923 13:00:37.865000 46140 /data/users/q1l1/pytorch/torch/_export/__init__.py:72] +============================+
W0923 13:00:37.865000 46140 /data/users/q1l1/pytorch/torch/_export/__init__.py:73] torch._export.aot_compile()/torch._export.aot_load() is being deprecated, please switch to directly calling torch._inductor.aoti_compile_and_package(torch.export.export())/torch._inductor.aoti_load_package() instead.
stats [('calls_captured', 5), ('unique_graphs', 1)]
inductor [('extern_calls', 2)]
graph_break []
aten_mm_info [('aten.mm_Max(15, u0)_16_64', 1)]
PASSED [5.6947s]
test/inductor/test_aot_inductor.py::AOTInductorTestABICompatibleGpu::test_unbounded_expr_substitutions_cuda W0923 13:00:39.633000 46140 /data/users/q1l1/pytorch/torch/_inductor/sizevars.py:765] [0/0] Substitution limit (30) reached w/ u1**30*Max(15, u0)
W0923 13:00:39.679000 46140 /data/users/q1l1/pytorch/torch/_inductor/sizevars.py:765] [0/0] Substitution limit (30) reached w/ 64*u1**30*Max(15, u0)
stats [('calls_captured', 5), ('unique_graphs', 1)]
inductor [('extern_calls', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('async_compile_cache_miss', 1)]
graph_break []
aten_mm_info [('aten.mm_Max(15, u0)_16_64', 1)]
PASSED [5.6278s]
test/inductor/test_aot_inductor.py::AOTInductorTestABICompatibleMps::test_unbounded_expr_substitutions_mps SKIPPED [0.0002s]

============================ 2 passed, 1 skipped, 870 deselected in 19.66s ============================
```

- with the unittest + comparison logic enhanced, we don't see the warning any more:
```
Running 3 items in this shard

test/inductor/test_aot_inductor.py::AOTInductorTestABICompatibleCpu::test_unbounded_expr_substitutions_cpu W0923 13:15:39.560000 290812 /data/users/q1l1/pytorch/torch/_export/__init__.py:70] +============================+
W0923 13:15:39.561000 290812 /data/users/q1l1/pytorch/torch/_export/__init__.py:71] |     !!!   WARNING   !!!    |
W0923 13:15:39.561000 290812 /data/users/q1l1/pytorch/torch/_export/__init__.py:72] +============================+
W0923 13:15:39.562000 290812 /data/users/q1l1/pytorch/torch/_export/__init__.py:73] torch._export.aot_compile()/torch._export.aot_load() is being deprecated, please switch to directly calling torch._inductor.aoti_compile_and_package(torch.export.export())/torch._inductor.aoti_load_package() instead.
stats [('calls_captured', 5), ('unique_graphs', 1)]
inductor [('extern_calls', 2)]
graph_break []
aten_mm_info [('aten.mm_Max(15, u0)_16_64', 1)]
PASSED [6.6093s]
test/inductor/test_aot_inductor.py::AOTInductorTestABICompatibleGpu::test_unbounded_expr_substitutions_cuda stats [('calls_captured', 5), ('unique_graphs', 1)]
inductor [('extern_calls', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('async_compile_cache_miss', 1)]
graph_break []
aten_mm_info [('aten.mm_Max(15, u0)_16_64', 1)]
PASSED [6.0502s]
test/inductor/test_aot_inductor.py::AOTInductorTestABICompatibleMps::test_unbounded_expr_substitutions_mps SKIPPED [0.0002s]

============================ 2 passed, 1 skipped, 870 deselected in 21.99s ============================
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163685
Approved by: https://github.com/jansel
2025-09-26 18:46:36 +00:00
425ea90f95 [testing] Add test owner labels for some cuda? tests (#163296)
I am trying to give some test files better owner labels than `module: unknown`.  I am not sure about them, but they seem pretty reasonable.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163296
Approved by: https://github.com/eqy, https://github.com/msaroufim
2025-09-26 18:26:56 +00:00
5b764267f4 [testing] Add test owner labels for some distributed tests (#163174)
I am trying to give some test files better owner labels than `module: unknown`.  I am not sure about them, but they seem pretty reasonable.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163174
Approved by: https://github.com/ezyang
2025-09-26 18:19:04 +00:00
50c0550f5a Add magic TORCH_MAKE_PYBIND_ENUM_FASTER macro (#163527)
See comment on the macro definition. In short, pybind11 3.x
added `py::native_enum`, and also had to add overhead for that new way
to bind enums on the critical path for calling functions that take
regular old `py::enum_`s as arguments (for example, `__eq__`).

Differential Revision: [D82873169](https://our.internmc.facebook.com/intern/diff/D82873169/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163527
Approved by: https://github.com/ezyang
2025-09-26 17:59:22 +00:00
d7491fb1c1 Fix tensor creation with empty names crash (#163957)
Partially fixes #148324

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163957
Approved by: https://github.com/malfet, https://github.com/janeyx99
2025-09-26 17:41:00 +00:00
9534c59311 [Inductor] address comments from https://github.com/pytorch/pytorch/pull/163803 (#163901)
Summary: address comments from https://github.com/pytorch/pytorch/pull/163803

Differential Revision: D83291637

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163901
Approved by: https://github.com/desertfire
2025-09-26 17:18:44 +00:00
5880996b4c Expose torch.nn.utils.parametrize (#163835)
`torch.nn.utils.parametrize` is not imported from `torch/nn/utils/__init__.py`, thus is not exposed, making it hard for code editors to statically analyze the code and provide auto-completion based on the function signature.

<img width="615" height="292" alt="Screenshot 2025-09-25 at 12 01 52 PM" src="https://github.com/user-attachments/assets/a276f6f0-87f3-4732-943d-2a92ea871974" />

after the fix:

<img width="964" height="393" alt="Screenshot 2025-09-25 at 12 02 16 PM" src="https://github.com/user-attachments/assets/ca47f09e-dc4e-4420-a2d2-11669e07471a" />
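
A hedged sketch of what the re-export enables (the symmetric parametrization below is just an illustrative use): after this change, `import torch.nn.utils` is enough for editors and code to reach `parametrize`.

```python
import torch
import torch.nn as nn
import torch.nn.utils as nn_utils  # parametrize is now reachable from here

class Symmetric(nn.Module):
    def forward(self, W):
        return W.triu() + W.triu(1).transpose(-1, -2)

linear = nn.Linear(4, 4)
nn_utils.parametrize.register_parametrization(linear, "weight", Symmetric())
print(torch.allclose(linear.weight, linear.weight.T))  # True: weight is kept symmetric
```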

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163835
Approved by: https://github.com/albanD
2025-09-26 16:38:18 +00:00
1d26eb0fcc Move inductor.aot_compile to use new tracer (#163137)
Differential Revision: [D82603768](https://our.internmc.facebook.com/intern/diff/D82603768)

I feel no one probably uses this API now, but it is still a useful path for more test cases.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163137
Approved by: https://github.com/avikchaudhuri
ghstack dependencies: #163136
2025-09-26 15:54:24 +00:00
a05f6ecfec Fix bug with renaming submodules in dynamo for new tracer (#163136)
Differential Revision: [D82603767](https://our.internmc.facebook.com/intern/diff/D82603767)

Previously, I forgot to handle the call_module case, which now will have export_root prepended to its names. Basically I want to clean up something like:
```
graph():
      %l_self_export_root_sub_mod = call_module[target=l_self_export_root_sub_mod](%x, %y)
      %l_self_export_root_sub_mod_1 = call_module[target=l_self_export_root_sub_mod](%x, %y)
  ```

The Dynamo graph can have call_module nodes with messed-up names due to our wrapper.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163136
Approved by: https://github.com/avikchaudhuri
2025-09-26 15:54:24 +00:00
c106ee8515 [FakeTensor] Supplement the relevant logic for converting conv1d to conv2d in meta_conv (#160408)
## Fixes https://github.com/pytorch/pytorch/issues/159462 also fixes #163569 , #163604

## summary
the issue is caused by the wrong stride of conv1d's result generated by meta_conv:
4d5b3f2d5a/torch/_meta_registrations.py (L2453-L2471)

and the wrong stride will be used to codegen size assert in inductor:
4d5b3f2d5a/torch/_inductor/ir.py (L6152-L6163)

## reason
So why is the computed stride wrong in the meta_conv function? Because the corresponding backend will convert conv1d to conv2d and change the input tensor's size and memory_format (channels last), but meta_conv does not do this transformation, so a mismatch happens.
4d5b3f2d5a/aten/src/ATen/native/Convolution.cpp (L1502-L1510)
This PR just adds the corresponding logic in meta_conv.
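
A hedged repro-style sketch of the symptom (shapes are assumptions, and whether the strides actually diverge depends on which backend the eager call dispatches to): compare the eager conv1d result's strides with the meta result that inductor uses for its size/stride asserts.

```python
import torch
import torch.nn.functional as F

x = torch.randn(20, 16, 50)
w = torch.randn(33, 16, 3)

eager_out = F.conv1d(x, w)
meta_out = F.conv1d(x.to("meta"), w.to("meta"))  # goes through meta_conv

# Before the fix, the meta stride could disagree with the eager stride when the
# backend internally rewrites conv1d as a channels-last conv2d.
print("eager:", eager_out.stride())
print("meta: ", meta_out.stride())
```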

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160408
Approved by: https://github.com/eellison, https://github.com/jansel, https://github.com/mlazos
2025-09-26 15:45:02 +00:00
8aba513506 [MPS] test sparse add MPS dtypes so we get proper expected failure (#163951)
Adds dtypeIfMPS so that if an op is supported we get a proper error such as an unexpected success. Before, we would never get an unexpected success because the tests were run with the torch.double dtype, which always fails on MPS since MPS does not support that dtype.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163951
Approved by: https://github.com/malfet
2025-09-26 14:48:58 +00:00
8c194a367e [DeviceMesh][ez] Add a type alias for backend config (#163928)
Create a type alias for `tuple[Optional[str], Optional[C10dBackend.Options]]` since it is too long.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163928
Approved by: https://github.com/fegin
ghstack dependencies: #163212, #163288
2025-09-26 14:46:53 +00:00
33f3413bd3 [WIP][precompile] Set fake_mode of base tensor in fx graph pickler (#163738)
Summary:
When unpickling a fake tensor in the fx graph pickler, it only sets the fake mode of the current tensor's metadata to the one that is consistent with the pickler's `unpickle_state`. However, it doesn't set the fake mode of a tensor's base tensor when that tensor is a view.

This will cause an issue when dumping and loading the following graph
```
class GraphModule(torch.nn.Module):
    def forward(self, s77: "Sym(s77)", L_x_: "f32[s77, 8]"):
        l_x_ = L_x_
        chunk = l_x_.chunk(2, dim = -1);  l_x_ = None
        y: "f32[s77, 4]" = chunk[0];  chunk = None
        y_repeat: "f32[s77, 8]" = y.repeat_interleave(2, dim = -1);  y = None
        return (y_repeat,)
```
because `repeat_interleave` will create an intermediate fake tensor of size `[s77, 2, 4]` and it will become the base of the node `y_repeat`'s `meta['val']`.

This causes issues during the deserialization phase when applying AOT precompile to DeepSeek in vLLM.

Test Plan:
This has been tested in vLLM with DeepSeek.

As for unittest, ideally it should be `test_aot_compile_repeat_interleave` with mark_dynamic turned on. However, that's leading to some other pickle issues.

```
python test/dynamo/test_aot_compile.py -k test_aot_compile_repeat_interleave
```

I have yet to figure out a more appropriate unittest. But a proof-of-concept demo would be the following:
```
import inspect
import sympy
import torch
from torch.fx._graph_pickler import GraphPickler
from torch.fx.experimental.symbolic_shapes import ShapeEnv
from torch._subclasses import FakeTensorMode
from torch.fx._graph_pickler import GraphPickler, Options
from unittest.mock import patch

class M(torch.nn.Module):
    def forward(self, x):
        chunk = x.chunk(2, dim=-1)
        y = chunk[0]
        y_repeat = y.repeat_interleave(2, dim=-1)
        return y_repeat

def my_custom_backend(gm, example_inputs):
    global gm_global
    gm_global = gm
    return gm.forward

m = M()
m_opt = torch.compile(m, backend=my_custom_backend, fullgraph=True)

sample_inputs = (torch.randn(2, 8),)
torch._dynamo.mark_dynamic(sample_inputs[0], [0])
opt_out = m_opt(*sample_inputs)

graph_reducer_override = GraphPickler.reducer_override

def _graph_reducer_override(self, obj):
    if (inspect.isclass(obj) and issubclass(obj, sympy.Function)
            and hasattr(obj, "_torch_unpickler")):
        return obj._torch_unpickler, (obj._torch_handler_name, )
    if isinstance(obj, FakeTensorMode):
        return type(None), ()
    return graph_reducer_override(self, obj)

with patch.object(GraphPickler, "reducer_override", _graph_reducer_override):
    pickled_gm = GraphPickler.dumps(gm_global, Options(ops_filter=None))

fake_mode = FakeTensorMode(shape_env=ShapeEnv())
loaded_gm = GraphPickler.loads(pickled_gm, fake_mode)
```

Differential Revision: D83112599

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163738
Approved by: https://github.com/zhxchen17
2025-09-26 14:36:37 +00:00
d4e4f70768 Fix overflow in slow_conv3d when kernel size is too large. (#162718)
Also, adding a check for padding to avoid a segmentation fault caused by overflow.

Fixes #141846

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162718
Approved by: https://github.com/jgong5, https://github.com/Skylion007
2025-09-26 13:39:29 +00:00
bfd21cd3e6 Revert "Add less warps config to inner reductions (#162447)"
This reverts commit 768361e67f0eb36491d7b763ef38d7c928ebefe6.

Reverted https://github.com/pytorch/pytorch/pull/162447 on behalf of https://github.com/PaulZhang12 due to failed to land internally ([comment](https://github.com/pytorch/pytorch/pull/162447#issuecomment-3338680532))
2025-09-26 13:16:04 +00:00
7441a1b9b1 Update ruff to 0.13.1 (#163744)
Update ruff to 0.13.1 so that we can remove `UP038` from `pyproject.toml`, because it has been removed from ruff's supported rules.
There are some fixes, the most notable one is [(PYI059)](https://docs.astral.sh/ruff/rules/generic-not-last-base-class/#generic-not-last-base-class-pyi059)
```
Checks for classes inheriting from typing.Generic[] where Generic[] is not the last base class in the bases tuple.

```

A BC-breaking change is introduced to change the typing of `OrderedSet.storage`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163744
Approved by: https://github.com/Skylion007, https://github.com/jingsh
2025-09-26 10:12:21 +00:00
6a2bd1f4ee [inductor] skip bmm when converting channel last (#159459)
Workaround for #159458 by removing some nodes from the channels-last output set.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159459
Approved by: https://github.com/etaf, https://github.com/eellison, https://github.com/shunting314
2025-09-26 09:11:40 +00:00
4783e3ff49 Update torch-xpu-ops commit pin (#163758)
Update the torch-xpu-ops commit to [intel/torch-xpu-ops@229e8b](229e8ba104), which includes:

- Revert tracking of Work status for FlightRecorder in ProcessGroupXCCL to fix memory leak
- Enable SYCL warnings on Linux
- Fix accuracy issues with CTC loss
- Enable aten::nonzero_static on XPU backend
- Stop recursive calculations in polynomial kernels if tensor has NaNs
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163758
Approved by: https://github.com/EikanWang
2025-09-26 09:05:08 +00:00
c8e5b7dabb Add SDPA patterns for T5 variants when batch size is 1 (#163252)
As mentioned in
https://github.com/pytorch/pytorch/blob/main/torch/_inductor/fx_passes/fuse_attention.py#L838, this PR generates patterns for the case where batch size == 1.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163252
Approved by: https://github.com/Valentine233, https://github.com/jansel
2025-09-26 08:50:06 +00:00
04b51499f7 [CPU] Support transpose and packing fusion for bit8 (#163233)
To be used by CPU INT8 SDPA in TorchAO https://github.com/pytorch/ao/pull/3025. This change has a kernel improvement of about 9%.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163233
Approved by: https://github.com/mingfeima, https://github.com/jansel
2025-09-26 07:15:04 +00:00
54461a53bd [Inductor] Check if profiling before using record_function in CompiledFxGraph (#163747)
The call to `record_function` adds overhead even if profiling is disabled, which can as much as double the total runtime overhead of a compiled function. #163566 aims to make `record_function` more efficient, but doesn't fully eliminate overhead. This change adds a check if profiling is active before using `record_function`, which avoids this issue all together.

`TestExecutionTrace.test_execution_trace_with_pt2` in https://github.com/pytorch/pytorch/blob/main/test/profiler/test_execution_trace.py#L372 already checks that the `record_function` region is tracked during profiling.
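
A hedged illustration of the pattern (the `profiling_active` flag and the region name are stand-ins; the actual change queries the profiler state internally): only pay for `record_function` when a profiler is attached.

```python
import torch
from torch.profiler import profile, record_function

def run_compiled(fn, args, profiling_active: bool):
    # Guard the record_function call so its overhead is only paid while profiling.
    if profiling_active:
        with record_function("CompiledFxGraph.run"):  # illustrative region name
            return fn(*args)
    return fn(*args)

f = torch.compile(lambda x: x * 2)
x = torch.randn(4)

run_compiled(f, (x,), profiling_active=False)      # hot path: no record_function
with profile() as prof:
    run_compiled(f, (x,), profiling_active=True)   # region shows up in the trace
print([e.name for e in prof.events() if "CompiledFxGraph" in e.name])
```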

Comparison of the `benchmarks/dynamo/microbenchmarks/overheads.py ` results:

Before Change:
```
requires_grad=False
compiled 56.9us (warmup=10.7s)

requires_grad=True
compiled 99.4us (warmup=0.2s)

inference_mode()
compiled 55.7us (warmup=0.1s)
```

After Change:
```
requires_grad=False
eager    6.9us (warmup=0.0s)
compiled 23.9us (warmup=22.3s)

requires_grad=True
eager    8.7us (warmup=0.0s)
compiled 56.8us (warmup=0.1s)

inference_mode()
eager    6.3us (warmup=0.0s)
compiled 22.2us (warmup=0.1s)
```

Additionally, #163866 introduces an instruction count benchmark. Because that is not merged and activated yet, here is a comparison:

Before Change:
```
runtime_overhead_inductor,instruction_count,222645
runtime_overhead_inductor_inference_mode,instruction_count,234998
runtime_overhead_inductor_requires_grad,instruction_count,293556
runtime_overhead_inductor_requires_grad_backward,instruction_count,78181
runtime_overhead_inductor_dynamic,instruction_count,234870
runtime_overhead_inductor_inference_mode_dynamic,instruction_count,248711
runtime_overhead_inductor_requires_grad_dynamic,instruction_count,309979
runtime_overhead_inductor_requires_grad_backward_dynamic,instruction_count,77599
```

After Change:
```
runtime_overhead_inductor,instruction_count,149997
runtime_overhead_inductor_inference_mode,instruction_count,163397
runtime_overhead_inductor_requires_grad,instruction_count,220722
runtime_overhead_inductor_requires_grad_backward,instruction_count,78276
runtime_overhead_inductor_dynamic,instruction_count,161177
runtime_overhead_inductor_inference_mode_dynamic,instruction_count,175495
runtime_overhead_inductor_requires_grad_dynamic,instruction_count,235674
runtime_overhead_inductor_requires_grad_backward_dynamic,instruction_count,77475
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163747
Approved by: https://github.com/mlazos, https://github.com/anijain2305
2025-09-26 06:49:40 +00:00
d1403250c9 Fix specialize_impl from triton.runtime.jit (#163844)
Summary:
In https://github.com/triton-lang/triton/pull/7771/ , create_specialize_impl is removed. We extend the support using native_specialize_impl.

Otherwise, PyTorch won't work with trunk triton.

Test Plan:
scripts/lufang/llm/launch_qwen3_vl_235b_a22b_thinking_2507_h100.sh

No more error message like
```
(Worker_TP0_EP0 pid=190353) [rank0]:W0924 23:24:48.190000 190353 /data/users/lufang/fbsource/fbcode/caffe2/torch/_higher_order_ops/triton_kernel_wrap.py:948] [0/0] Encountered an exception in identify_mutated_tensors, assuming every input is mutated
(Worker_TP0_EP0 pid=190353) [rank0]:W0924 23:24:48.190000 190353 /data/users/lufang/fbsource/fbcode/caffe2/torch/_higher_order_ops/triton_kernel_wrap.py:948] [0/0] Traceback (most recent call last):
(Worker_TP0_EP0 pid=190353) [rank0]:W0924 23:24:48.190000 190353 /data/users/lufang/fbsource/fbcode/caffe2/torch/_higher_order_ops/triton_kernel_wrap.py:948] [0/0]   File "/data/users/lufang/fbsource/buck-out/v2/gen/fbcode/4e83bca020adbfd7/smart/inference_platform_sp/llm_predictor_gpu/__service__/service#link-tree/to
rch/_higher_order_ops/triton_kernel_wrap.py", line 924, in identify_mutated_tensors
(Worker_TP0_EP0 pid=190353) [rank0]:W0924 23:24:48.190000 190353 /data/users/lufang/fbsource/fbcode/caffe2/torch/_higher_order_ops/triton_kernel_wrap.py:948] [0/0]     ttir_module, ordered_tensor_names = generate_ttir(
(Worker_TP0_EP0 pid=190353) [rank0]:W0924 23:24:48.190000 190353 /data/users/lufang/fbsource/fbcode/caffe2/torch/_higher_order_ops/triton_kernel_wrap.py:948] [0/0]   File "/data/users/lufang/fbsource/buck-out/v2/gen/fbcode/4e83bca020adbfd7/smart/inference_platform_sp/llm_predictor_gpu/__service__/service#link-tree/to
rch/_higher_order_ops/triton_kernel_wrap.py", line 419, in generate_ttir
(Worker_TP0_EP0 pid=190353) [rank0]:W0924 23:24:48.190000 190353 /data/users/lufang/fbsource/fbcode/caffe2/torch/_higher_order_ops/triton_kernel_wrap.py:948] [0/0]     specialization = _get_specialization(ordered_args.values())
(Worker_TP0_EP0 pid=190353) [rank0]:W0924 23:24:48.190000 190353 /data/users/lufang/fbsource/fbcode/caffe2/torch/_higher_order_ops/triton_kernel_wrap.py:948] [0/0]   File "/data/users/lufang/fbsource/buck-out/v2/gen/fbcode/4e83bca020adbfd7/smart/inference_platform_sp/llm_predictor_gpu/__service__/service#link-tree/to
rch/_higher_order_ops/triton_kernel_wrap.py", line 390, in _get_specialization
(Worker_TP0_EP0 pid=190353) [rank0]:W0924 23:24:48.190000 190353 /data/users/lufang/fbsource/fbcode/caffe2/torch/_higher_order_ops/triton_kernel_wrap.py:948] [0/0]     from triton.runtime.jit import specialize_impl as specialize_impl_orig
(Worker_TP0_EP0 pid=190353) [rank0]:W0924 23:24:48.190000 190353 /data/users/lufang/fbsource/fbcode/caffe2/torch/_higher_order_ops/triton_kernel_wrap.py:948] [0/0] ImportError: cannot import name 'specialize_impl' from 'triton.runtime.jit' (/data/users/lufang/fbsource/buck-out/v2/gen/fbcode/4e83bca020adbfd7/smart/inf
erence_platform_sp/llm_predictor_gpu/__service__/service#link-tree/triton/runtime/jit.py)
(Worker_TP1_EP1 pid=190354) [rank1]:W0924 23:24:48.210000 190354 /data/users/lufang/fbsource/fbcode/caffe2/torch/_higher_order_ops/triton_kernel_wrap.py:948] [0/0] Encountered an exception in identify_mutated_tensors, assuming every input is mutated
(Worker_TP1_EP1 pid=190354) [rank1]:W0924 23:24:48.210000 190354 /data/users/lufang/fbsource/fbcode/caffe2/torch/_higher_order_ops/triton_kernel_wrap.py:948] [0/0] Traceback (most recent call last):
(Worker_TP1_EP1 pid=190354) [rank1]:W0924 23:24:48.210000 190354 /data/users/lufang/fbsource/fbcode/caffe2/torch/_higher_order_ops/triton_kernel_wrap.py:948] [0/0]   File "/data/users/lufang/fbsource/buck-out/v2/gen/fbcode/4e83bca020adbfd7/smart/inference_platform_sp/llm_predictor_gpu/__service__/service#link-tree/to
rch/_higher_order_ops/triton_kernel_wrap.py", line 924, in identify_mutated_tensors
(Worker_TP1_EP1 pid=190354) [rank1]:W0924 23:24:48.210000 190354 /data/users/lufang/fbsource/fbcode/caffe2/torch/_higher_order_ops/triton_kernel_wrap.py:948] [0/0]     ttir_module, ordered_tensor_names = generate_ttir(
(Worker_TP1_EP1 pid=190354) [rank1]:W0924 23:24:48.210000 190354 /data/users/lufang/fbsource/fbcode/caffe2/torch/_higher_order_ops/triton_kernel_wrap.py:948] [0/0]   File "/data/users/lufang/fbsource/buck-out/v2/gen/fbcode/4e83bca020adbfd7/smart/inference_platform_sp/llm_predictor_gpu/__service__/service#link-tree/to
rch/_higher_order_ops/triton_kernel_wrap.py", line 419, in generate_ttir
(Worker_TP1_EP1 pid=190354) [rank1]:W0924 23:24:48.210000 190354 /data/users/lufang/fbsource/fbcode/caffe2/torch/_higher_order_ops/triton_kernel_wrap.py:948] [0/0]     specialization = _get_specialization(ordered_args.values())
(Worker_TP1_EP1 pid=190354) [rank1]:W0924 23:24:48.210000 190354 /data/users/lufang/fbsource/fbcode/caffe2/torch/_higher_order_ops/triton_kernel_wrap.py:948] [0/0]   File "/data/users/lufang/fbsource/buck-out/v2/gen/fbcode/4e83bca020adbfd7/smart/inference_platform_sp/llm_predictor_gpu/__service__/service#link-tree/to
rch/_higher_order_ops/triton_kernel_wrap.py", line 390, in _get_specialization
(Worker_TP1_EP1 pid=190354) [rank1]:W0924 23:24:48.210000 190354 /data/users/lufang/fbsource/fbcode/caffe2/torch/_higher_order_ops/triton_kernel_wrap.py:948] [0/0]     from triton.runtime.jit import specialize_impl as specialize_impl_orig
(Worker_TP1_EP1 pid=190354) [rank1]:W0924 23:24:48.210000 190354 /data/users/lufang/fbsource/fbcode/caffe2/torch/_higher_order_ops/triton_kernel_wrap.py:948] [0/0] ImportError: cannot import name 'specialize_impl' from 'triton.runtime.jit' (/data/users/lufang/fbsource/buck-out/v2/gen/fbcode/4e83bca020adbfd7/smart/inf
erence_platform_sp/llm_predictor_gpu/__service__/service#link-tree/triton/runtime/jit.py)
(Worker_TP5_EP5 pid=190359) [rank5]:W0924 23:24:48.216000 190359 /data/users/lufang/fbsource/fbcode/caffe2/torch/_higher_order_ops/triton_kernel_wrap.py:948] [0/0] Encountered an exception in identify_mutated_tensors, assuming every input is mutated
(Worker_TP5_EP5 pid=190359) [rank5]:W0924 23:24:48.216000 190359 /data/users/lufang/fbsource/fbcode/caffe2/torch/_higher_order_ops/triton_kernel_wrap.py:948] [0/0] Traceback (most recent call last):
(Worker_TP5_EP5 pid=190359) [rank5]:W0924 23:24:48.216000 190359 /data/users/lufang/fbsource/fbcode/caffe2/torch/_higher_order_ops/triton_kernel_wrap.py:948] [0/0]   File "/data/users/lufang/fbsource/buck-out/v2/gen/fbcode/4e83bca020adbfd7/smart/inference_platform_sp/llm_predictor_gpu/__service__/service#link-tree/to
rch/_higher_order_ops/triton_kernel_wrap.py", line 924, in identify_mutated_tensors
```

Differential Revision: D83229128

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163844
Approved by: https://github.com/henryoier, https://github.com/davidberard98, https://github.com/BoyuanFeng
2025-09-26 06:37:26 +00:00
b42e81def5 Allow unbacked to unbacked replacements if rhs unbacked symbols are all inputs (#163652)
This partially solves the issue https://github.com/pytorch/pytorch/issues/163641. We do not need to ban unbacked-to-unbacked replacements if all RHS unbacked symbols are inputs, since we know those symbols are seen by the whole program.

This issue was found while tracing some vLLM models with unbacked symints, namely Qwen/Qwen2-1.5B-Instruct; doing those replacements makes the reasoning logic easier.

As for the similar data-dependent pattern, I am thinking of creating a set of replacements that we apply only during static eval instead of none, to make reasoning better.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163652
Approved by: https://github.com/bobrenjc93
2025-09-26 06:23:22 +00:00
2a45f30ae7 Exporting aten.conv with cuda under fake mode on a cuda-less machine (#163912)
Summary:
Improve op coverage of exporting a CUDA model on a CPU-only machine under fake tensor mode.

For `torch.nn.functional.conv2d`, it will `_select_conv_backend` based on input and weight shapes.

When calling into `supportsDepthwiseConvolutionWithCuDNN()`, it calls `at::cuda::getCurrentDeviceProperties()` and fails on a CPU-only machine.

So we check if CUDA is actually enabled first.
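
A hedged sketch of the flow this targets (shapes are assumptions, and exact behavior may vary by version): running a depthwise CUDA conv2d under `FakeTensorMode` on a machine without CUDA, the configuration that previously crashed inside conv-backend selection.

```python
import torch
import torch.nn.functional as F
from torch._subclasses.fake_tensor import FakeTensorMode

with FakeTensorMode():
    # Fake tensors carry a device without allocating real CUDA memory, which is
    # what allows this to run on a CPU-only machine.
    x = torch.empty(1, 32, 56, 56, device="cuda")
    w = torch.empty(32, 1, 3, 3, device="cuda")   # depthwise-style weight (groups=32)
    out = F.conv2d(x, w, groups=32)
    print(out.shape, out.device)
```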

Test Plan: TORCH_SHOW_CPP_STACKTRACES=1 buck2 run fbcode//caffe2/test:test_export -- --r nn_functional_conv2d

Reviewed By: angelayi, henryoier

Differential Revision: D80562984

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163912
Approved by: https://github.com/SherlockNoMad
2025-09-26 06:04:20 +00:00
11b4c0eb9e [aoti] Save compute information (#163792)
Metadata looks like:
```
{
  'AOTI_DEVICE_KEY': 'cpu',
  'AOTI_PLATFORM': 'linux',
  'AOTI_MACHINE': 'x86_64',
  'AOTI_CPU_ISA': 'AVX512',
  'AOTI_COMPUTE_CAPABILITY': '90'
}
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163792
Approved by: https://github.com/yushangdi, https://github.com/desertfire
ghstack dependencies: #163779
2025-09-26 05:40:44 +00:00
fb93491ddc [aoti] Load metadata w/o loading package (#163779)
Add a function to load the metadata stored in aoti without needing to load the .so. This can be used to store what platform we are compiling the .so on which we can check before loading the .so.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163779
Approved by: https://github.com/yushangdi, https://github.com/desertfire
2025-09-26 05:40:44 +00:00
39df24fe04 [Code Clean] Replace std::runtime_error with TORCH_CHECK (#163610)
Including:
- `torch/csrc/instruction_counter`
- `torch/csrc/lazy`
- `torch/csrc/monitor`
- `torch/csrc/profiler`
- `torch/csrc/dynamo`

Fixes part of #148114

Personal mistake about (PR #163317), this PR does the same thing **and PR #163317 has already been approved by @albanD.**

This is a personal mistake on my part, and I'm so sorry about that. Hope you won't mind @albanD. 🥹

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163610
Approved by: https://github.com/albanD, https://github.com/Skylion007
2025-09-26 04:52:48 +00:00
bbde16fe98 [vllm hash update] update the pinned vllm hash (#163823)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned vllm hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163823
Approved by: https://github.com/pytorchbot
2025-09-26 04:29:52 +00:00
1b78ca2ef5 [Triton] [Inductor] Prune template selection based on decompose_k (#163781)
Summary:

Triton templates tend to perform very poorly on large K, hence the introduction of decompose_k. As a result, when decompose_k is selected we will disable exploring the Triton templates. We may want to consider an override in the future.

Note: Based on the timing results it may be desirable to better refine/prune the decompose k decisions.

Testing:

Tested by looking at the autotune/compilation time using a single shape in TritonBench.
`TORCHINDUCTOR_FORCE_DISABLE_CACHES=1 ENABLE_PERSISTENT_TMA_MATMUL=1 python run --op gemm --rep 1000 --sleep 1.0 --m 512 --n 512 --k 300000 --only pt2_matmul_maxautotune`
Before this change:
`SingleProcess AUTOTUNE benchmarking takes 13.5368 seconds and 0.1595 seconds precompiling for 38 choices`
With this change:
`SingleProcess AUTOTUNE benchmarking takes 9.9626 seconds and 0.0020 seconds precompiling for 11 choices`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163781
Approved by: https://github.com/eellison, https://github.com/PaulZhang12
2025-09-26 04:09:35 +00:00
082eaf4aae [DeviceMesh] Add extra check in flatten result cache lookup (#163288)
While refactoring DeviceMesh bookkeeping, we found one corner case that we simply don't check: whether the dims to be flattened are the same as the dims that an existing flattened name maps to. So we need to add extra cases to the unit test and extra check logic to the code.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163288
Approved by: https://github.com/wz337, https://github.com/ezyang, https://github.com/fegin
ghstack dependencies: #163212
2025-09-26 03:41:58 +00:00
f1f2e3e4da [DeviceMesh] Introduce CuTe layout into devicemesh code base for internal bookkeeping (#163212)
DeviceMesh is essentially a way to specify how devices interact with each other, i.e. the device layout. Ranks are all integers, but because meshes can have various shapes, internal bookkeeping becomes much more challenging. Our current internal bookkeeping inside DeviceMesh is not scalable, so in order to support new functions like `_unflatten`, we would need to introduce very complicated logic inside DeviceMesh, as pointed out in this comment (https://github.com/pytorch/pytorch/pull/159482/files#r2256025452). Thanks to @lw's suggestion and PoC PR (https://github.com/pytorch/pytorch/pull/160429), we realized that leveraging the CuTe layout algebra ([ref](https://docs.nvidia.com/cutlass/media/docs/cpp/cute/02_layout_algebra.html)) from Cutlass will greatly simplify our internal mechanical bookkeeping and make the abstraction ops much easier to build on top of it. To make things go incrementally, we proposed a couple of steps here: https://github.com/pytorch/pytorch/issues/160337#issuecomment-3195106243.

On top of what we have been doing with PyCute, we continue adding methods to the wrapper class so that we can get the rank indexes needed for ProcessGroup creation from a layout object. We also added detailed explanations and comments (thanks to LLM assistance) and a unit test to showcase that the code works as expected.
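For intuition, a tiny sketch of the layout idea (illustrative only, not the PyCute wrapper API): a layout is a (shape, stride) pair that maps a logical mesh coordinate to a linear rank index.

```
def layout_index(coord, shape, stride):
    # CuTe-style mapping: rank = sum(coordinate * stride) over all mesh dimensions.
    assert len(coord) == len(shape) == len(stride)
    return sum(c * s for c, s in zip(coord, stride))

# A 2x4 mesh laid out row-major over 8 ranks: coordinate (dim0, dim1) -> rank.
shape, stride = (2, 4), (4, 1)
print([[layout_index((i, j), shape, stride) for j in range(shape[1])] for i in range(shape[0])])
# [[0, 1, 2, 3], [4, 5, 6, 7]]
```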

More PRs are on the way.

This is a continue of https://github.com/pytorch/pytorch/pull/161016 (originally messed with EasyCLA)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163212
Approved by: https://github.com/ezyang, https://github.com/fegin, https://github.com/lw
2025-09-26 03:32:19 +00:00
67cc0e0ac9 Add Static Dispatch Kernels (#163676) (#163870)
Summary:
X-link: https://github.com/facebookresearch/FBGEMM/pull/1951

X-link: https://github.com/pytorch/FBGEMM/pull/4927

Add a few missing static dispatch kernels for remote_ro.

Test Plan: Tested with scripts in D83028841.

Differential Revision: D83258808

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163870
Approved by: https://github.com/henryoier
2025-09-26 03:00:07 +00:00
bbf8aa43ef [a2av] Separate in/out splits into two tensors (#163837)
Old signature:
`all_to_all_vdev(Tensor input, Tensor(a!) out, Tensor(a!) in_out_splits, str group_name)`
New signature:
`all_to_all_vdev(Tensor input, Tensor(a!) out, Tensor in_splits, Tensor(a!) out_splits_offsets, str group_name)`

i.e., split `in_out_splits` into an IN tensor and an OUT tensor so that we can define the TORCH_LIBRARY signature better.
This also brings it in line with the 2D version.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163837
Approved by: https://github.com/fduwjj
ghstack dependencies: #163886
2025-09-26 01:03:54 +00:00
5daa79fd6e Remove dataclass_slots (#163623)
`dataclass` now has `slots` kwarg.
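For reference, a minimal example of the built-in replacement (Python 3.10+):

```
from dataclasses import dataclass

@dataclass(slots=True)  # generates __slots__, so instances have no per-instance __dict__
class Point:
    x: int
    y: int

p = Point(1, 2)
assert not hasattr(p, "__dict__")
```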

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163623
Approved by: https://github.com/Skylion007
2025-09-26 00:54:42 +00:00
b776e0c71e [ROCm][CI/CD] create ROCm 7.0 magma tarball (#163883)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163883
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-09-26 00:51:17 +00:00
5c2f09d1f9 [export] _detect_attribute_assignment gives warning instead of raising ValueError (#163809)
Summary:
LSTM was not exportable with non-strict export as it failed at `_detect_attribute_assignment`

This is because the `_flat_weights` attribute in LSTM is a list of registered parameters and will be updated by the `_update_flat_weights` method in `forward`.

However, in `_detect_attribute_assignment`, we manually restore the state of the module by `mod.__dict__.update(snapshot)`. Therefore, it should be fine to turn the `ValueError` into a warning so that RNN models are exportable with non-strict export.

Added test to verify that there is no lifted tensor constant and no fake tensor leakage.
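A minimal sketch of the now-passing path, assuming a build that includes this change:

```
import torch

lstm = torch.nn.LSTM(input_size=8, hidden_size=16)
x = torch.randn(5, 2, 8)  # (seq_len, batch, input_size)

# Non-strict export used to raise ValueError in _detect_attribute_assignment because
# LSTM's forward refreshes its _flat_weights list; it now only emits a warning.
ep = torch.export.export(lstm, (x,), strict=False)
print(ep.graph_signature)
```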

Test Plan: buck2 run mode/dev-nosan caffe2/test:test_export -- -r test_export_rnn_variants_with_warning

Differential Revision: D83196971

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163809
Approved by: https://github.com/tugsbayasgalan
2025-09-26 00:43:29 +00:00
b4be380480 [ROCm] Implement float32 copy kernel (#163869)
* Add `float32_copy_kernel` for vectorizing float16/bfloat16 to float32 conversion

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163869
Approved by: https://github.com/jeffdaily
2025-09-26 00:39:30 +00:00
5b8fef3f17 Extend triton_mm auto-tune options for HIM shapes (#163273)
Summary:
Add an option to auto-tune for shape:
```
M=1024 N=171712 K=1024
```

Test Plan:
```
TRITON_PRINT_AUTOTUNING=1 buck2 run mode/opt-amd-gpu -c fbcode.enable_gpu_sections=true //pytorch/tritonbench:run -- --op fp8_gemm_rowwise --no_use_tma --no_use_persistent --m 1024 --n 171712 --k 1024 --bias
```
Before:
 {F1982074581}
After, we saw a ~10% boost:
{F1982074585}

Differential Revision: D82687336

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163273
Approved by: https://github.com/jananisriram, https://github.com/Camyll
2025-09-26 00:05:57 +00:00
ff2f319e6e [MPS] Fix conv layout handling (#162776)
What started as a simple fix for `mps_convolution_backward_input` resulted in a pretty significant refactor/set of fixes:
- Updated `mps_conv_use_channels_last` to return a channels-last output if either the input or the weights are channels last
- Use the same primitive throughout `Convolution.mm` to determine whether the output should be allocated in channels-last format or not

But doing only those two resulted in a crash in `test_memory_format_nn_Conv2d_mps_float32` when the weights were channels last and a bias was present:
```
% python -c "import torch;print(torch.nn.functional.conv2d(torch.rand(2, 4, 3, 4,device='mps'), torch.rand(5, 4, 3, 3,device='mps').to(memory_format=torch.channels_last), torch.rand(5,device='mps')))"
/AppleInternal/Library/BuildRoots/4~B5E4ugDCh2RsPWAjMEoPu8LC5w1yXEwd7XweDhg/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShadersGraph/mpsgraph/MetalPerformanceShadersGraph/Core/Files/MPSGraphExecutable.mm:3619: failed assertion `Error: MLIR pass manager failed'
zsh: abort      python -c
```

This required a more thorough redesign/cleanup, namely:
- Do not alter the layout based on MacOS version, but rather do additional copies on MacOS-14 if the inputs/output or weight are in channels-last format (done by defining `std::optional<Tensor> output_c;` that contains a contiguous copy of the output tensor)
- Introduced `input_suggested_layout`, which is set to ChannelsLast if and only if the input is channels last and we are running on MacOS-15+
- Delete unused `memory_layout` and `group` arguments from `fill_depthwise_conv_desc`
- Fix bias broadcasting logic for channels last

As a result, in addition to adding one more regression test, this change removes `expectedFailures` from:
- `TestModule.test_memory_format` for `Conv2d`, `ConvTranspose2d`, `LazyConv1d`, `LazyConvTranspose1d`
- `test_require_stride_expanded_dynamic_shapes`
-  `test_mutable_custom_op_fixed_layout2` for MacOS-14

Fixes https://github.com/pytorch/pytorch/issues/161905

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162776
Approved by: https://github.com/Skylion007
2025-09-25 23:41:34 +00:00
94195a37ae [BE] Remove HermeticPyObjectTLS and Simplify PythonOpRegistrationTrampoline (#163464)
Removes HermeticPyObjectTLS, as it is no longer needed now that torch deploy is no longer supported. PythonOpRegistrationTrampoline is also drastically simplified and is being prepped for removal in a future PR.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163464
Approved by: https://github.com/albanD, https://github.com/Skylion007
2025-09-25 23:30:50 +00:00
c58e096cd0 [DTensor] implement logsumexp (#163879)
As the title says; mostly copy-pasted from internal. I am a DTensor noob, so please scrutinize my added test.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163879
Approved by: https://github.com/XilunWu
2025-09-25 23:08:30 +00:00
2a6e6a9e3b [FSDP][Replicate] tests replicate parity for shared parameters (#162836)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162836
Approved by: https://github.com/mori360
ghstack dependencies: #162830
2025-09-25 23:08:22 +00:00
6e6c899347 [Reland][163423] Promote @requires_nvshmem instead of enable_triton (#163549)
#163423 was approved but reverted due to a revert of base.
Relanding without base.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163549
Approved by: https://github.com/wdvr

Co-authored-by: Wouter Devriendt <wouterdevriendt@meta.com>
2025-09-25 23:02:00 +00:00
366961df78 [FSDP][Replicate] tests replicate parity with activation checkpointing (#162830)
**Summary:** In order to ensure that replicate acts as intended (a specialized version of HSDP), we need to make sure that it can pass the same training tests that fully_shard can. This tests that the replicate function works correctly when combined with activation checkpointing.

**Test Case**
1. pytest test/distributed/_composable/test_replicate_training.py -k test_train_parity_with_activation_checkpointing

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162830
Approved by: https://github.com/mori360
2025-09-25 22:57:00 +00:00
520fca82c8 Refactor Provenance Tracking (#163378)
Summary:
- Move the `provenance_level` flag check to inside the `set_kernel_post_grad_provenance_tracing` call to simplify the code

- Move the `set_kernel_post_grad_provenance_tracing` call and `write_provenance_debug_handle` call to `codegen_comment`.

- If some `call_kernel` call sites don't have a preceding `codegen_comment` call, add one. Now all `call_kernel` call sites are accompanied by a `codegen_comment` call.

- Add a `codegen_comment` method to BaseScheduling and remove the noop `codegen_comment` method in Scheduling

- Remove `debug_handle` from `call_kernel`.

Test Plan:
CI

```
buck run @//mode/opt-split-dwarf fbcode//caffe2/test/inductor:provenance_tracing
```

Differential Revision: D82839271

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163378
Approved by: https://github.com/angelayi
2025-09-25 22:55:59 +00:00
908bcfd403 [AOTInductor] Add input information for Triton Kernels in AOTI (#160380)
Summary:
We use record_function to pass input information through so that Kineto can display it in traces.
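A minimal sketch of the mechanism (illustrative, not the AOTI codegen itself): annotations made with `record_function` show up as labeled ranges in profiler/Kineto traces.

```
import torch
from torch.profiler import profile, record_function

x, w = torch.randn(64, 64), torch.randn(64, 64)

with profile() as prof:
    # Encode the kernel name / input info in the annotation so it is visible in the trace.
    with record_function("aoti_triton_mm: (64,64)x(64,64)"):
        y = x @ w

print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=5))
```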

Test Plan:
Before:
<img width="459" height="582" alt="Screenshot 2025-09-19 at 10 45 10 AM" src="https://github.com/user-attachments/assets/baa0c251-86e9-49ca-8c6c-fcd2619f7f48" />

After:
<img width="473" height="1130" alt="Screenshot 2025-09-19 at 10 44 53 AM" src="https://github.com/user-attachments/assets/b7942d84-0362-4b9e-9232-14de92bbdd00" />


Pull Request resolved: https://github.com/pytorch/pytorch/pull/160380
Approved by: https://github.com/desertfire
ghstack dependencies: #163593
2025-09-25 22:41:04 +00:00
96275dbf88 [CI] Fix test_triton_wait_until hang (#163886)
I don't know why `nvshmem_barrier_all_kernel` causes the test to hang; I will investigate.
But since it is an unnecessary call here, I am removing it to unblock other PRs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163886
Approved by: https://github.com/fegin
2025-09-25 22:22:16 +00:00
b14a14a662 [torchfuzz] make generated code much more concise and cleaner (#163812)
```
import torch

torch._dynamo.config.capture_scalar_outputs = True
torch.manual_seed(42)

def fuzzed_program(arg_0, arg_1, arg_2):
    var_node_3 = arg_0 # size=(1,), stride=(1,), dtype=complex128, device=cuda
    var_node_4 = torch.full((1,), (-0.29262632146522655-0.7687848816195035j), dtype=torch.complex128) # size=(1,), stride=(1,), dtype=complex128, device=cuda
    var_node_2 = torch.ops.aten.add(var_node_3, var_node_4) # size=(1,), stride=(1,), dtype=complex128, device=cuda
    var_node_6 = arg_1 # size=(1,), stride=(1,), dtype=complex128, device=cuda
    var_node_7 = arg_2 # size=(1,), stride=(1,), dtype=complex128, device=cuda
    var_node_5 = torch.ops.aten.add(var_node_6, var_node_7) # size=(1,), stride=(1,), dtype=complex128, device=cuda
    var_node_1 = torch.ops.aten.add(var_node_2, var_node_5) # size=(1,), stride=(1,), dtype=complex128, device=cuda
    var_node_0 = var_node_1.item() # dtype=complex128
    return var_node_0

arg_0 = torch.as_strided(torch.randn(1).to(torch.complex128), (1,), (1,))
arg_1 = torch.as_strided(torch.randn(1).to(torch.complex128), (1,), (1,))
arg_2 = torch.as_strided(torch.randn(1).to(torch.complex128), (1,), (1,))

args = (arg_0, arg_1, arg_2)
result_original = fuzzed_program(*args)
print(' eager success')
compiled_program = torch.compile(fuzzed_program, fullgraph=False, dynamic=True)
result_compiled = compiled_program(*args)
print(' compile success')
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163812
Approved by: https://github.com/pianpwk
ghstack dependencies: #163743
2025-09-25 22:12:33 +00:00
92f7361e27 [DTensor] fix uneven _StridedShard (#163843)
The previous uneven `_StridedShard` from https://github.com/pytorch/pytorch/pull/150490 seems to fail for cases like sharding `tensor = torch.arange(6)` with FSDP 2, TP 2.

This PR attempts to reinvent `_StridedShard`.

I didn't test nested `_StridedShard`, because there shouldn't be any use cases. I think it will become quite messy when it comes to **nested uneven** `_StridedShard`. We are probably going to deprecate it anyway after @zpcore 's work https://github.com/pytorch/pytorch/pull/160266 on ordered sharding, so IMO not worth it to make it too general.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163843
Approved by: https://github.com/ezyang
2025-09-25 22:12:29 +00:00
6a6d838832 Add H100 runner to be recognized in actionlint (#163795)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163795
Approved by: https://github.com/huydhn, https://github.com/seemethere
2025-09-25 22:09:11 +00:00
183dca423f [Inductor] add a new config fallback_embedding_bag_byte_unpack (#163803)
Differential Revision: D82988783

Introduce an Inductor config `fallback_embedding_bag_byte_unpack` so we have the option to stop Inductor from decomposing the op.
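Assuming the flag is exposed like other Inductor options (the exact exposure path is not verified here), usage would look roughly like:

```
import torch
import torch._inductor.config as inductor_config

# Hypothetical usage of the new flag (name taken from this PR): keep the
# embedding_bag byte-unpack op as a fallback instead of letting Inductor decompose it.
inductor_config.fallback_embedding_bag_byte_unpack = True
```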

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163803
Approved by: https://github.com/henryoier
2025-09-25 22:07:04 +00:00
b8efa336d2 [torchfuzz] simplify codegen and runner (#163743)
Much less code. A follow-up PR will make these repro files even smaller.
Small is important since it reduces the time it takes users to understand
what the repro is doing. Here's a sample:

```
(/home/bobren/local/a/pytorch-env) [21:34] devgpu009:/home/bobren/local/a/pytorch/tools/experimental/dynamic_shapes/torchfuzz [130] python fuzzer.py --seed 42
Running single fuzz_and_execute...
Using seed: 42, max_depth: 10
Running generated program...
Selected CUDA_VISIBLE_DEVICES=2
=== Program Output ===
 eager success
 compile success

===============================
=== Program Source ===
import torch
import sys
import os
fuzzer_dir = r'/home/bobren/local/a/pytorch/tools/experimental/dynamic_shapes/torchfuzz'
if fuzzer_dir not in sys.path:
    sys.path.insert(0, fuzzer_dir)
from tensor_fuzzer import fuzz_scalar, fuzz_tensor_simple, ScalarSpec, TensorSpec

def fuzzed_program(arg_0, arg_1, arg_2, arg_3, arg_4, arg_5, arg_6, arg_7, arg_8, arg_9, arg_10, arg_11, arg_12, arg_13, arg_14, arg_15, arg_16, arg_17, arg_18, arg_19, arg_20, arg_21, arg_22, arg_23, arg_24, arg_25, arg_26):
    # Node node_4: arg (depth 6)
    var_node_4 = arg_0 # size=(1,), stride=(1,), dtype=complex128, device=cuda

    # Node node_7: constant (depth 4)
    var_node_7 = torch.full((1,), (-0.8353595860703585-0.8384634248041143j), dtype=torch.complex128) # size=(1,), stride=(1,), dtype=complex128, device=cuda

    # Node node_8: arg (depth 4)
    var_node_8 = arg_1 # size=(1,), stride=(1,), dtype=complex128, device=cuda

    # Node node_6: tensor_pointwise (depth 5)
    var_node_6 = torch.ops.aten.mul(var_node_7, var_node_8) # size=(1,), stride=(1,), dtype=complex128, device=cuda

    # Node node_9: constant (depth 5)
    var_node_9 = torch.full((1,), (-0.32478860712861235+0.033909682598544454j), dtype=torch.complex128) # size=(1,), stride=(1,), dtype=complex128, device=cuda

    # Node node_5: tensor_pointwise (depth 6)
    var_node_5 = torch.ops.aten.mul(var_node_6, var_node_9) # size=(1,), stride=(1,), dtype=complex128, device=cuda

    # Node node_3: tensor_pointwise (depth 7)
    var_node_3 = torch.ops.aten.sub(var_node_4, var_node_5) # size=(1,), stride=(1,), dtype=complex128, device=cuda

    # Node node_11: arg (depth 6)
    var_node_11 = arg_2 # size=(1,), stride=(1,), dtype=complex128, device=cuda

    # Node node_18: constant (depth 0)
    var_node_18 = torch.full((1,), (0.12855308616305575+1.5268033634325642j), dtype=torch.complex128) # size=(1,), stride=(1,), dtype=complex128, device=cuda

    # Node node_19: arg (depth 0)
    var_node_19 = arg_3 # size=(1,), stride=(1,), dtype=complex128, device=cuda

    # Node node_17: tensor_pointwise (depth 1)
    var_node_17 = torch.ops.aten.mul(var_node_18, var_node_19) # size=(1,), stride=(1,), dtype=complex128, device=cuda

    # Node node_21: arg (depth 0)
    var_node_21 = arg_4 # size=(1,), stride=(1,), dtype=complex128, device=cuda

    # Node node_22: arg (depth 0)
    var_node_22 = arg_5 # size=(1,), stride=(1,), dtype=complex128, device=cuda

    # Node node_20: tensor_pointwise (depth 1)
    var_node_20 = torch.ops.aten.sub(var_node_21, var_node_22) # size=(1,), stride=(1,), dtype=complex128, device=cuda

    # Node node_16: tensor_pointwise (depth 2)
    var_node_16 = torch.ops.aten.add(var_node_17, var_node_20) # size=(1,), stride=(1,), dtype=complex128, device=cuda

    # Node node_25: arg (depth 0)
    var_node_25 = arg_6 # size=(1,), stride=(1,), dtype=complex128, device=cuda

    # Node node_26: arg (depth 0)
    var_node_26 = arg_7 # size=(1,), stride=(1,), dtype=complex128, device=cuda

    # Node node_24: tensor_pointwise (depth 1)
    var_node_24 = torch.ops.aten.add(var_node_25, var_node_26) # size=(1,), stride=(1,), dtype=complex128, device=cuda

    # Node node_27: constant (depth 1)
    var_node_27 = torch.full((1,), (-0.6315711191260084+1.342004076501214j), dtype=torch.complex128) # size=(1,), stride=(1,), dtype=complex128, device=cuda

    # Node node_23: tensor_pointwise (depth 2)
    var_node_23 = torch.ops.aten.mul(var_node_24, var_node_27) # size=(1,), stride=(1,), dtype=complex128, device=cuda

    # Node node_15: tensor_pointwise (depth 3)
    var_node_15 = torch.ops.aten.mul(var_node_16, var_node_23) # size=(1,), stride=(1,), dtype=complex128, device=cuda

    # Node node_28: constant (depth 3)
    var_node_28 = torch.full((1,), (1.064498531874825-0.37289464356501284j), dtype=torch.complex128) # size=(1,), stride=(1,), dtype=complex128, device=cuda

    # Node node_14: tensor_pointwise (depth 4)
    var_node_14 = torch.ops.aten.mul(var_node_15, var_node_28) # size=(1,), stride=(1,), dtype=complex128, device=cuda

    # Node node_30: arg (depth 3)
    var_node_30 = arg_8 # size=(1,), stride=(1,), dtype=complex128, device=cuda

    # Node node_32: arg (depth 2)
    var_node_32 = arg_9 # size=(1,), stride=(1,), dtype=complex128, device=cuda

    # Node node_33: constant (depth 2)
    var_node_33 = torch.full((1,), (1.5815627438573372+0.5124667911691704j), dtype=torch.complex128) # size=(1,), stride=(1,), dtype=complex128, device=cuda

    # Node node_31: tensor_pointwise (depth 3)
    var_node_31 = torch.ops.aten.div(var_node_32, var_node_33) # size=(1,), stride=(1,), dtype=complex128, device=cuda

    # Node node_29: tensor_pointwise (depth 4)
    var_node_29 = torch.ops.aten.div(var_node_30, var_node_31) # size=(1,), stride=(1,), dtype=complex128, device=cuda

    # Node node_13: tensor_pointwise (depth 5)
    var_node_13 = torch.ops.aten.div(var_node_14, var_node_29) # size=(1,), stride=(1,), dtype=complex128, device=cuda

    # Node node_39: arg (depth 0)
    var_node_39 = arg_10 # size=(1,), stride=(1,), dtype=complex128, device=cuda

    # Node node_40: constant (depth 0)
    var_node_40 = torch.full((1,), (-0.5987350493494642-0.5711360569376475j), dtype=torch.complex128) # size=(1,), stride=(1,), dtype=complex128, device=cuda

    # Node node_38: tensor_pointwise (depth 1)
    var_node_38 = torch.ops.aten.mul(var_node_39, var_node_40) # size=(1,), stride=(1,), dtype=complex128, device=cuda

    # Node node_41: arg (depth 1)
    var_node_41 = arg_11 # size=(1,), stride=(1,), dtype=complex128, device=cuda

    # Node node_37: tensor_pointwise (depth 2)
    var_node_37 = torch.ops.aten.add(var_node_38, var_node_41) # size=(1,), stride=(1,), dtype=complex128, device=cuda

    # Node node_42: constant (depth 2)
    var_node_42 = torch.full((1,), (0.7246044564672116-0.5930730980273312j), dtype=torch.complex128) # size=(1,), stride=(1,), dtype=complex128, device=cuda

    # Node node_36: tensor_pointwise (depth 3)
    var_node_36 = torch.ops.aten.mul(var_node_37, var_node_42) # size=(1,), stride=(1,), dtype=complex128, device=cuda

    # Node node_43: constant (depth 3)
    var_node_43 = torch.full((1,), (-0.7582976293117148+1.1880929376258396j), dtype=torch.complex128) # size=(1,), stride=(1,), dtype=complex128, device=cuda

    # Node node_35: tensor_pointwise (depth 4)
    var_node_35 = torch.ops.aten.mul(var_node_36, var_node_43) # size=(1,), stride=(1,), dtype=complex128, device=cuda

    # Node node_45: constant (depth 3)
    var_node_45 = torch.full((1,), (1.0896212896322774+0.3124038130417098j), dtype=torch.complex128) # size=(1,), stride=(1,), dtype=complex128, device=cuda

    # Node node_46: arg (depth 3)
    var_node_46 = arg_12 # size=(1,), stride=(1,), dtype=complex128, device=cuda

    # Node node_44: tensor_pointwise (depth 4)
    var_node_44 = torch.ops.aten.add(var_node_45, var_node_46) # size=(1,), stride=(1,), dtype=complex128, device=cuda

    # Node node_34: tensor_pointwise (depth 5)
    var_node_34 = torch.ops.aten.div(var_node_35, var_node_44) # size=(1,), stride=(1,), dtype=complex128, device=cuda

    # Node node_12: tensor_pointwise (depth 6)
    var_node_12 = torch.ops.aten.div(var_node_13, var_node_34) # size=(1,), stride=(1,), dtype=complex128, device=cuda

    # Node node_10: tensor_pointwise (depth 7)
    var_node_10 = torch.ops.aten.mul(var_node_11, var_node_12) # size=(1,), stride=(1,), dtype=complex128, device=cuda

    # Node node_2: tensor_pointwise (depth 8)
    var_node_2 = torch.ops.aten.div(var_node_3, var_node_10) # size=(1,), stride=(1,), dtype=complex128, device=cuda

    # Node node_48: constant (depth 7)
    var_node_48 = torch.full((1,), (-1.047745491289218+0.279447315087422j), dtype=torch.complex128) # size=(1,), stride=(1,), dtype=complex128, device=cuda

    # Node node_54: arg (depth 2)
    var_node_54 = arg_13 # size=(1,), stride=(1,), dtype=complex128, device=cuda

    # Node node_55: arg (depth 2)
    var_node_55 = arg_14 # size=(1,), stride=(1,), dtype=complex128, device=cuda

    # Node node_53: tensor_pointwise (depth 3)
    var_node_53 = torch.ops.aten.div(var_node_54, var_node_55) # size=(1,), stride=(1,), dtype=complex128, device=cuda

    # Node node_56: arg (depth 3)
    var_node_56 = arg_15 # size=(1,), stride=(1,), dtype=complex128, device=cuda

    # Node node_52: tensor_pointwise (depth 4)
    var_node_52 = torch.ops.aten.div(var_node_53, var_node_56) # size=(1,), stride=(1,), dtype=complex128, device=cuda

    # Node node_59: arg (depth 2)
    var_node_59 = arg_16 # size=(1,), stride=(1,), dtype=complex128, device=cuda

    # Node node_60: arg (depth 2)
    var_node_60 = arg_17 # size=(1,), stride=(1,), dtype=complex128, device=cuda

    # Node node_58: tensor_pointwise (depth 3)
    var_node_58 = torch.ops.aten.div(var_node_59, var_node_60) # size=(1,), stride=(1,), dtype=complex128, device=cuda

    # Node node_61: constant (depth 3)
    var_node_61 = torch.full((1,), (-0.7386327586576402-0.027025998767172658j), dtype=torch.complex128) # size=(1,), stride=(1,), dtype=complex128, device=cuda

    # Node node_57: tensor_pointwise (depth 4)
    var_node_57 = torch.ops.aten.add(var_node_58, var_node_61) # size=(1,), stride=(1,), dtype=complex128, device=cuda

    # Node node_51: tensor_pointwise (depth 5)
    var_node_51 = torch.ops.aten.sub(var_node_52, var_node_57) # size=(1,), stride=(1,), dtype=complex128, device=cuda

    # Node node_64: arg (depth 3)
    var_node_64 = arg_18 # size=(1,), stride=(1,), dtype=complex128, device=cuda

    # Node node_67: arg (depth 1)
    var_node_67 = arg_19 # size=(1,), stride=(1,), dtype=complex128, device=cuda

    # Node node_68: constant (depth 1)
    var_node_68 = torch.full((1,), (-0.6840241429755998+1.327637020136433j), dtype=torch.complex128) # size=(1,), stride=(1,), dtype=complex128, device=cuda

    # Node node_66: tensor_pointwise (depth 2)
    var_node_66 = torch.ops.aten.mul(var_node_67, var_node_68) # size=(1,), stride=(1,), dtype=complex128, device=cuda

    # Node node_69: arg (depth 2)
    var_node_69 = arg_20 # size=(1,), stride=(1,), dtype=complex128, device=cuda

    # Node node_65: tensor_pointwise (depth 3)
    var_node_65 = torch.ops.aten.sub(var_node_66, var_node_69) # size=(1,), stride=(1,), dtype=complex128, device=cuda

    # Node node_63: tensor_pointwise (depth 4)
    var_node_63 = torch.ops.aten.sub(var_node_64, var_node_65) # size=(1,), stride=(1,), dtype=complex128, device=cuda

    # Node node_70: arg (depth 4)
    var_node_70 = arg_21 # size=(1,), stride=(1,), dtype=complex128, device=cuda

    # Node node_62: tensor_pointwise (depth 5)
    var_node_62 = torch.ops.aten.sub(var_node_63, var_node_70) # size=(1,), stride=(1,), dtype=complex128, device=cuda

    # Node node_50: tensor_pointwise (depth 6)
    var_node_50 = torch.ops.aten.mul(var_node_51, var_node_62) # size=(1,), stride=(1,), dtype=complex128, device=cuda

    # Node node_76: constant (depth 1)
    var_node_76 = torch.full((1,), (1.864651314238342+0.27066487315113186j), dtype=torch.complex128) # size=(1,), stride=(1,), dtype=complex128, device=cuda

    # Node node_77: arg (depth 1)
    var_node_77 = arg_22 # size=(1,), stride=(1,), dtype=complex128, device=cuda

    # Node node_75: tensor_pointwise (depth 2)
    var_node_75 = torch.ops.aten.mul(var_node_76, var_node_77) # size=(1,), stride=(1,), dtype=complex128, device=cuda

    # Node node_78: arg (depth 2)
    var_node_78 = arg_23 # size=(1,), stride=(1,), dtype=complex128, device=cuda

    # Node node_74: tensor_pointwise (depth 3)
    var_node_74 = torch.ops.aten.add(var_node_75, var_node_78) # size=(1,), stride=(1,), dtype=complex128, device=cuda

    # Node node_79: arg (depth 3)
    var_node_79 = arg_24 # size=(1,), stride=(1,), dtype=complex128, device=cuda

    # Node node_73: tensor_pointwise (depth 4)
    var_node_73 = torch.ops.aten.mul(var_node_74, var_node_79) # size=(1,), stride=(1,), dtype=complex128, device=cuda

    # Node node_80: arg (depth 4)
    var_node_80 = arg_25 # size=(1,), stride=(1,), dtype=complex128, device=cuda

    # Node node_72: tensor_pointwise (depth 5)
    var_node_72 = torch.ops.aten.mul(var_node_73, var_node_80) # size=(1,), stride=(1,), dtype=complex128, device=cuda

    # Node node_82: constant (depth 4)
    var_node_82 = torch.full((1,), (1.6341547018841247+0.3096989611326181j), dtype=torch.complex128) # size=(1,), stride=(1,), dtype=complex128, device=cuda

    # Node node_84: constant (depth 3)
    var_node_84 = torch.full((1,), (0.9609065596935821+0.2920229825681946j), dtype=torch.complex128) # size=(1,), stride=(1,), dtype=complex128, device=cuda

    # Node node_85: arg (depth 3)
    var_node_85 = arg_26 # size=(1,), stride=(1,), dtype=complex128, device=cuda

    # Node node_83: tensor_pointwise (depth 4)
    var_node_83 = torch.ops.aten.add(var_node_84, var_node_85) # size=(1,), stride=(1,), dtype=complex128, device=cuda

    # Node node_81: tensor_pointwise (depth 5)
    var_node_81 = torch.ops.aten.sub(var_node_82, var_node_83) # size=(1,), stride=(1,), dtype=complex128, device=cuda

    # Node node_71: tensor_pointwise (depth 6)
    var_node_71 = torch.ops.aten.sub(var_node_72, var_node_81) # size=(1,), stride=(1,), dtype=complex128, device=cuda

    # Node node_49: tensor_pointwise (depth 7)
    var_node_49 = torch.ops.aten.mul(var_node_50, var_node_71) # size=(1,), stride=(1,), dtype=complex128, device=cuda

    # Node node_47: tensor_pointwise (depth 8)
    var_node_47 = torch.ops.aten.add(var_node_48, var_node_49) # size=(1,), stride=(1,), dtype=complex128, device=cuda

    # Node node_1: tensor_pointwise (depth 9)
    var_node_1 = torch.ops.aten.add(var_node_2, var_node_47) # size=(1,), stride=(1,), dtype=complex128, device=cuda

    # Node node_0: torch.ops.aten.item (depth 10)
    var_node_0 = var_node_1.item() # dtype=complex128

    # Final result from root node
    return var_node_0

arg_0 = fuzz_tensor_simple((1,), (1,), torch.complex128, seed=10042)
arg_1 = fuzz_tensor_simple((1,), (1,), torch.complex128, seed=10043)
arg_2 = fuzz_tensor_simple((1,), (1,), torch.complex128, seed=10044)
arg_3 = fuzz_tensor_simple((1,), (1,), torch.complex128, seed=10045)
arg_4 = fuzz_tensor_simple((1,), (1,), torch.complex128, seed=10046)
arg_5 = fuzz_tensor_simple((1,), (1,), torch.complex128, seed=10047)
arg_6 = fuzz_tensor_simple((1,), (1,), torch.complex128, seed=10048)
arg_7 = fuzz_tensor_simple((1,), (1,), torch.complex128, seed=10049)
arg_8 = fuzz_tensor_simple((1,), (1,), torch.complex128, seed=10050)
arg_9 = fuzz_tensor_simple((1,), (1,), torch.complex128, seed=10051)
arg_10 = fuzz_tensor_simple((1,), (1,), torch.complex128, seed=10052)
arg_11 = fuzz_tensor_simple((1,), (1,), torch.complex128, seed=10053)
arg_12 = fuzz_tensor_simple((1,), (1,), torch.complex128, seed=10054)
arg_13 = fuzz_tensor_simple((1,), (1,), torch.complex128, seed=10055)
arg_14 = fuzz_tensor_simple((1,), (1,), torch.complex128, seed=10056)
arg_15 = fuzz_tensor_simple((1,), (1,), torch.complex128, seed=10057)
arg_16 = fuzz_tensor_simple((1,), (1,), torch.complex128, seed=10058)
arg_17 = fuzz_tensor_simple((1,), (1,), torch.complex128, seed=10059)
arg_18 = fuzz_tensor_simple((1,), (1,), torch.complex128, seed=10060)
arg_19 = fuzz_tensor_simple((1,), (1,), torch.complex128, seed=10061)
arg_20 = fuzz_tensor_simple((1,), (1,), torch.complex128, seed=10062)
arg_21 = fuzz_tensor_simple((1,), (1,), torch.complex128, seed=10063)
arg_22 = fuzz_tensor_simple((1,), (1,), torch.complex128, seed=10064)
arg_23 = fuzz_tensor_simple((1,), (1,), torch.complex128, seed=10065)
arg_24 = fuzz_tensor_simple((1,), (1,), torch.complex128, seed=10066)
arg_25 = fuzz_tensor_simple((1,), (1,), torch.complex128, seed=10067)
arg_26 = fuzz_tensor_simple((1,), (1,), torch.complex128, seed=10068)
import torch
import sys
torch._dynamo.config.capture_scalar_outputs = True

args = (arg_0, arg_1, arg_2, arg_3, arg_4, arg_5, arg_6, arg_7, arg_8, arg_9, arg_10, arg_11, arg_12, arg_13, arg_14, arg_15, arg_16, arg_17, arg_18, arg_19, arg_20, arg_21, arg_22, arg_23, arg_24, arg_25, arg_26)
result_original = fuzzed_program(*args)
print(' eager success')
sys.exit(1)
compiled_program = torch.compile(fuzzed_program, fullgraph=False, dynamic=True)
result_compiled = compiled_program(*args)
print(' compile success')

======================
Program exited with code: 1
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163743
Approved by: https://github.com/pianpwk
2025-09-25 21:42:22 +00:00
1cffa42d4d PyTorch histc fix for values with large magnitudes (#163506)
Summary:
The current implementation of the `histc` function on CPU doesn't take into account the nature of the floating point precision representation when two numbers have very different magnitudes.

In the code of `histc` there is a following logic, which tries to fix an issue when automatically calculated `min` and `max` are identical:
```
if (leftmost_edge == rightmost_edge) {
        leftmost_edge -= 1;
        rightmost_edge += 1;
    }

...

TORCH_CHECK(leftmost_edge < rightmost_edge, "torch.histc: max must be larger than min");
```

However, expanding the range by exactly 1 does not, for all floating point values, yield a representable result that differs from the original value.

The test code:

```
import torch as th

info = th.finfo(th.float32)
f_min = info.min

test_tensor = th.ones((224, 224), dtype=th.float64) * f_min
res = th.histc(test_tensor, bins=10)
```

Actual result:
```
RuntimeError: torch.histc: max must be larger than min
```

Expected result:
Everything should work fine.

NOTICE: If we set `f_min` to a small enough (in magnitude) number, the code works, which demonstrates the intended purpose of the range correction.

In short, `f_min + 1 == f_min` evaluates to true, since we hit the limits of the floating point representation.
Please note this is not a limitation of the float32 data type, since all computations happen in float64 (C++ data type `double`). The magnitudes are just different enough that the simple `+/-1` approach falls below the representable precision.

Interestingly, the `histogram` function doesn't throw an exception, because its edge-range selection is implemented differently.

The fix we propose is to use `std::nextafter`, which returns the next representable floating point value starting from the current one in the direction of the lowest or highest number. In theory, the mathematically correct approach is to use this function without constraints, but to maintain backward compatibility in case there is code that relies on the current `+/-1` offset logic, we call `std::min` and `std::max` to pick the right representable value (i.e. for small floating point values the next representable value is a step smaller than 1; for large values it is larger than 1).
We could stick to the `histogram` implementation, but again, to avoid possible backward compatibility breaks, we decided to use the fix presented in this change.
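A small Python illustration of the proposed edge expansion (the actual fix is in the C++ `histc` implementation); `math.nextafter` plays the role of `std::nextafter`:

```
import math

def expand_degenerate_range(lo, hi):
    # Expand by at least one representable step, but keep the legacy +/-1 offset
    # whenever it moves the edge further (i.e. for small-magnitude values).
    if lo == hi:
        lo = min(lo - 1, math.nextafter(lo, -math.inf))
        hi = max(hi + 1, math.nextafter(hi, math.inf))
    return lo, hi

f_min = -3.4028234663852886e38  # float32 lowest, held as a float64 value
print(expand_degenerate_range(f_min, f_min))  # edges now differ, so histc no longer errors
print(expand_degenerate_range(0.0, 0.0))      # (-1.0, 1.0), matching the old behaviour
```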

*The real use case scenario:*
In our project we use a well-known transformer model from HuggingFace, which fills a buffer with the float32 min (please note this is not the smallest value close to 0; it is the lowest representable value, effectively `-max`).
The code where it sits is here:
https://github.com/huggingface/transformers/blob/v4.51.1/src/transformers/models/mimi/modeling_mimi.py#L1159

Switching to another version of the transformers library would lead to other issues in our project, and the bug we fix here may appear in other projects and scenarios.

The real-world problem appears when the CPU version of `histc` is called on such a tensor. In our use case, this happens because the tensor is an input to the softmax activation function, and as part of quantisation the input should go through the observer as well. In our case the default Histogram observer is selected, which calls `histc`.

Test Plan:
The simple test code snippet doesn't produce failure:
```
import torch as th

f_min = th.finfo(th.float32).min
test_tensor = th.ones((224, 224), dtype=th.float32) * f_min
th.histc(test_tensor, bins=10)
```

**Testing update:**
The `test_histc` has been updated accordingly.
Now, when all values of the tensor are +INF, the previous representable floating point number should be <max_float>, hence the assert message is changed from `[inf, inf]` to `[<max_float>|inf, inf]`.
The test is also extended to check the assert message when the tensor is filled with -INF and with a combination of (-INF, +INF).
The new regexp assert accepts `inf` or any floating point number in scientific notation for one of the bin edges; we left `inf` as a possible value due to possible differences in implementation between CPU and CUDA.

Differential Revision: D82955597

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163506
Approved by: https://github.com/jermenkoo, https://github.com/malfet
2025-09-25 20:55:25 +00:00
ebfc87e303 Always produce kernel_info.json (#163715)
Summary: Always produce kernel_info.json so that zoomer can use this JSON to populate GPU traces.

Test Plan: CI

Differential Revision: D82762435

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163715
Approved by: https://github.com/angelayi
2025-09-25 19:38:49 +00:00
21a41edd4f Add fake_impl for _native_multi_head_attention (#163700)
Test Plan: See added test in test_export.py

Differential Revision: D83099187

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163700
Approved by: https://github.com/angelayi
2025-09-25 19:01:27 +00:00
7bad9c5a64 Revert "Update ruff to 0.13.1 (#163744)"
This reverts commit 3dd89a079f2b0c1d39351f98ff5d5ca882523152.

Reverted https://github.com/pytorch/pytorch/pull/163744 on behalf of https://github.com/malfet due to Broke lint, see https://github.com/pytorch/pytorch/actions/runs/18016220484/job/51261729375 looks like a landrace with PR that updated min-version to 3.10 ([comment](https://github.com/pytorch/pytorch/pull/163744#issuecomment-3335534084))
2025-09-25 18:54:03 +00:00
151e66e50d Update documentation for torch.index_select (#163616)
The description said "entries in index which is a LongTensor", but index_select can also accept an IntTensor as the index parameter.
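A quick check of the documented behaviour:

```
import torch

x = torch.arange(12).reshape(3, 4)
idx = torch.tensor([0, 2], dtype=torch.int32)  # IntTensor works, not only LongTensor
print(torch.index_select(x, 0, idx))
# tensor([[ 0,  1,  2,  3],
#         [ 8,  9, 10, 11]])
```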
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163616
Approved by: https://github.com/jbschlosser

Co-authored-by: Joel Schlosser <75754324+jbschlosser@users.noreply.github.com>
2025-09-25 18:29:17 +00:00
b61bdc7cc4 Fix cpp build (#162774)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162774
Approved by: https://github.com/malfet, https://github.com/atalman
2025-09-25 18:21:45 +00:00
3dd89a079f Update ruff to 0.13.1 (#163744)
Update ruff to 0.13.1 so that we can remove `UP038` from `pyproject.toml`, because it has been removed from ruff's supported rules.
There are some fixes, the most notable one is [(PYI059)](https://docs.astral.sh/ruff/rules/generic-not-last-base-class/#generic-not-last-base-class-pyi059)
```
Checks for classes inheriting from typing.Generic[] where Generic[] is not the last base class in the bases tuple.

```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163744
Approved by: https://github.com/Skylion007, https://github.com/jingsh
2025-09-25 17:52:35 +00:00
6539537a59 [ROCm][CD] create ROCm 7.0 images for binary builds (#163860)
Adds gfx950.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163860
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-09-25 17:26:40 +00:00
3cbfbbd691 [ROCm] Transformer/SDPA unit test parity (#163745)
## Major Changes

* Efficient Attention on ROCm requires the last dimension of the input tensors to be aligned to 16 bytes.
  - Unlike FA, ME does not pad input tensors in `scaled_dot_product_attention`, hence this requirement.
* Fix `atomic_counter` handling in varlen FA API
* Unskips a few unit tests.

Fixes #157120
Fixes #157121
Fixes #157122
Fixes #157167
Fixes #155217
Fixes #157043
Fixes #157060

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163745
Approved by: https://github.com/jeffdaily
2025-09-25 17:14:19 +00:00
112e204797 Revert "[CUDA] Compare major version of the runtime device arch against the built version of the pytorch binary (#161299)"
This reverts commit 7163dce1e091cb5564c723110314bb372b5e81a8.

Reverted https://github.com/pytorch/pytorch/pull/161299 on behalf of https://github.com/nWEIdia due to Incorrectly suppressing useful warnings when running sm89 binary on sm86 ([comment](https://github.com/pytorch/pytorch/pull/161299#issuecomment-3335127621))
2025-09-25 17:13:32 +00:00
f9821b1be7 DebugMode supports_higher_order_operators=True (#163824)
Make DebugMode supports HOP

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163824
Approved by: https://github.com/ydwu4
2025-09-25 17:11:43 +00:00
c4312b443f [Tools] Adapting the Hypothesis library (version 5.x) for use with the PyTorch framework (#163748)
Starting from version 5.x, the Hypothesis library removed the timeout setting and only retained the deadline.
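A minimal sketch of the replacement setting in Hypothesis 5.x+, where per-example time budgets are expressed via `deadline` (milliseconds, or `None` to disable):

```
from hypothesis import given, settings, strategies as st

@settings(deadline=500)  # each example must finish within 500 ms; deadline=None disables the check
@given(st.integers())
def test_roundtrip(x):
    assert int(str(x)) == x
```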
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163748
Approved by: https://github.com/albanD, https://github.com/Skylion007
2025-09-25 16:41:50 +00:00
7194d77550 Revert "enable test_sampled_addmm_zero_sized_cuda for rocm (#121940)" (#163848)
This reverts commit 5494b2a8d38c3ddbeb2d96a5ac990e20ec4c48fd.

Need to skip `test_sparse_csr.py::TestSparseCSRCUDA::test_sampled_addmm_zero_sized_cuda_*` again. Tests are failing now with "core dumped" error
```
python test_sparse_csr.py -v -k test_sampled_addmm_zero_sized_cuda_float64

  test_sampled_addmm_zero_sized_cuda_float64 (__main__.TestSparseCSRCUDA) ... /tmp/pytorch/test/test_sparse_csr.py:2503:   c = torch.empty(m, n, dtype=dtype, device=device, layout=torch.sparse_csr)
GPU core dump created: gpucore.186789
:0:rocdevice.cpp            :2992: 4701819131755 us:  Callback: Queue 0x760cdcd00000 aborting with error : HSA_STATUS_ERROR_EXCEPTION: An HSAIL operation resulted in a hardware exception. code: 0x1016
Aborted (core dumped)
```
These failures are linked to `test_sparse_csr.py::TestSparseCSRCUDA::test_select_SparseBSC_int32_cuda_*` due to incorrect test log parsing. We will also be able to close these issues:

- Fixes https://github.com/pytorch/pytorch/issues/163663
- Fixes https://github.com/pytorch/pytorch/issues/160786
- Fixes https://github.com/pytorch/pytorch/issues/160785
- Fixes https://github.com/pytorch/pytorch/issues/160784

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163848
Approved by: https://github.com/jeffdaily
2025-09-25 16:38:00 +00:00
22d5f5ff94 [OpenReg][BE] Replacing explicit prefix/suffix with CMake variables (#163850)
As the title states, suffixes like `.dylib` and `lib` can be replaced by `CMAKE_SHARED_LIBRARY_SUFFIX`, and prefixes like `lib` can be replaced by `CMAKE_SHARED_LIBRARY_PREFIX` on Unix or `CMAKE_IMPORT_LIBRARY_PREFIX` on Windows.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163850
Approved by: https://github.com/albanD
2025-09-25 16:33:16 +00:00
308 changed files with 8232 additions and 3930 deletions

View File

@ -69,7 +69,8 @@ RUN bash ./install_cuda.sh 13.0
ENV DESIRED_CUDA=13.0
FROM ${ROCM_IMAGE} as rocm
ENV PYTORCH_ROCM_ARCH="gfx900;gfx906;gfx908;gfx90a;gfx942;gfx1030;gfx1100;gfx1101;gfx1102;gfx1200;gfx1201"
ARG PYTORCH_ROCM_ARCH
ENV PYTORCH_ROCM_ARCH ${PYTORCH_ROCM_ARCH}
ADD ./common/install_mkl.sh install_mkl.sh
RUN bash ./install_mkl.sh && rm install_mkl.sh
ENV MKLROOT /opt/intel

View File

@ -36,6 +36,12 @@ case ${DOCKER_TAG_PREFIX} in
;;
rocm*)
BASE_TARGET=rocm
PYTORCH_ROCM_ARCH="gfx900;gfx906;gfx908;gfx90a;gfx942;gfx1030;gfx1100;gfx1101;gfx1102;gfx1200;gfx1201"
# add gfx950 conditionally starting in ROCm 7.0
if [[ "$ROCM_VERSION" == *"7.0"* ]]; then
PYTORCH_ROCM_ARCH="${PYTORCH_ROCM_ARCH};gfx950"
fi
EXTRA_BUILD_ARGS="${EXTRA_BUILD_ARGS} --build-arg PYTORCH_ROCM_ARCH=${PYTORCH_ROCM_ARCH}"
;;
*)
echo "ERROR: Unknown docker tag ${DOCKER_TAG_PREFIX}"

View File

@ -1 +1 @@
v2.27.5-1
v2.28.3-1

View File

@ -1 +1 @@
v2.27.7-1
v2.28.3-1

View File

@ -12,8 +12,8 @@ function do_install() {
rocm_version_nodot=${rocm_version//./}
# Version 2.7.2 + ROCm related updates
MAGMA_VERSION=a1625ff4d9bc362906bd01f805dbbe12612953f6
# https://github.com/icl-utk-edu/magma/pull/65
MAGMA_VERSION=d6e4117bc88e73f06d26c6c2e14f064e8fc3d1ec
magma_archive="magma-rocm${rocm_version_nodot}-${MAGMA_VERSION}-1.tar.bz2"
rocm_dir="/opt/rocm"

View File

@ -40,12 +40,16 @@ case ${DOCKER_TAG_PREFIX} in
;;
rocm*)
# we want the patch version of 6.4 instead
if [[ $(ver $GPU_ARCH_VERSION) -eq $(ver 6.4) ]]; then
if [[ "$GPU_ARCH_VERSION" == *"6.4"* ]]; then
GPU_ARCH_VERSION="${GPU_ARCH_VERSION}.2"
fi
BASE_TARGET=rocm
GPU_IMAGE=rocm/dev-ubuntu-22.04:${GPU_ARCH_VERSION}-complete
PYTORCH_ROCM_ARCH="gfx900;gfx906;gfx908;gfx90a;gfx942;gfx1030;gfx1100;gfx1101;gfx1102;gfx1200;gfx1201"
# add gfx950 conditionally starting in ROCm 7.0
if [[ "$GPU_ARCH_VERSION" == *"7.0"* ]]; then
PYTORCH_ROCM_ARCH="${PYTORCH_ROCM_ARCH};gfx950"
fi
DOCKER_GPU_BUILD_ARG="--build-arg PYTORCH_ROCM_ARCH=${PYTORCH_ROCM_ARCH} --build-arg ROCM_VERSION=${GPU_ARCH_VERSION}"
;;
*)

View File

@ -82,7 +82,7 @@ case ${image} in
;;
manylinux2_28-builder:rocm*)
# we want the patch version of 6.4 instead
if [[ $(ver $GPU_ARCH_VERSION) -eq $(ver 6.4) ]]; then
if [[ "$GPU_ARCH_VERSION" == *"6.4"* ]]; then
GPU_ARCH_VERSION="${GPU_ARCH_VERSION}.2"
fi
TARGET=rocm_final
@ -90,6 +90,10 @@ case ${image} in
DEVTOOLSET_VERSION="11"
GPU_IMAGE=rocm/dev-almalinux-8:${GPU_ARCH_VERSION}-complete
PYTORCH_ROCM_ARCH="gfx900;gfx906;gfx908;gfx90a;gfx942;gfx1030;gfx1100;gfx1101;gfx1102;gfx1200;gfx1201"
# add gfx950 conditionally starting in ROCm 7.0
if [[ "$GPU_ARCH_VERSION" == *"7.0"* ]]; then
PYTORCH_ROCM_ARCH="${PYTORCH_ROCM_ARCH};gfx950"
fi
DOCKER_GPU_BUILD_ARG="--build-arg ROCM_VERSION=${GPU_ARCH_VERSION} --build-arg PYTORCH_ROCM_ARCH=${PYTORCH_ROCM_ARCH} --build-arg DEVTOOLSET_VERSION=${DEVTOOLSET_VERSION}"
;;
manylinux2_28-builder:xpu)

View File

@ -1,8 +1,15 @@
sphinx==5.3.0
#Description: This is used to generate PyTorch docs
#Pinned versions: 5.3.0
-e git+https://github.com/pytorch/pytorch_sphinx_theme.git@d53b0ffb9b1cda68260693ea98f3483823c88d8e#egg=pytorch_sphinx_theme2
standard-imghdr==3.13.0; python_version >= "3.13"
#Description: This is needed by Sphinx, so it needs to be added here.
# The reasons are as follows:
# 1) This module has been removed from the Python standard library since Python 3.13(https://peps.python.org/pep-0594/#imghdr);
# 2) The current version of Sphinx (5.3.0) is not compatible with Python 3.13.
# Once Sphinx is upgraded to a version compatible with Python 3.13 or later, we can remove this dependency.
-e git+https://github.com/pytorch/pytorch_sphinx_theme.git@d53b0ffb9b1cda68260693ea98f3483823c88d8e#egg=pytorch_sphinx_theme2
# TODO: sphinxcontrib.katex 0.9.0 adds a local KaTeX server to speed up pre-rendering
# but it doesn't seem to work and hangs around idly. The initial thought that it is probably
# something related to Docker setup. We can investigate this later.

View File

@ -1,11 +1,11 @@
SHELL=/usr/bin/env bash
DOCKER_CMD ?= docker
DESIRED_ROCM ?= 6.4
DESIRED_ROCM ?= 7.0
DESIRED_ROCM_SHORT = $(subst .,,$(DESIRED_ROCM))
PACKAGE_NAME = magma-rocm
# inherit this from underlying docker image, do not pass this env var to docker
#PYTORCH_ROCM_ARCH ?= gfx900;gfx906;gfx908;gfx90a;gfx942;gfx1030;gfx1100;gfx1101;gfx1102;gfx1200;gfx1201
#PYTORCH_ROCM_ARCH ?= gfx900;gfx906;gfx908;gfx90a;gfx942;gfx950;gfx1030;gfx1100;gfx1101;gfx1102;gfx1200;gfx1201
DOCKER_RUN = set -eou pipefail; ${DOCKER_CMD} run --rm -i \
-v $(shell git rev-parse --show-toplevel)/.ci:/builder \
@ -16,6 +16,7 @@ DOCKER_RUN = set -eou pipefail; ${DOCKER_CMD} run --rm -i \
magma-rocm/build_magma.sh
.PHONY: all
all: magma-rocm70
all: magma-rocm64
all: magma-rocm63
@ -24,6 +25,11 @@ clean:
$(RM) -r magma-*
$(RM) -r output
.PHONY: magma-rocm70
magma-rocm70: DESIRED_ROCM := 7.0
magma-rocm70:
$(DOCKER_RUN)
.PHONY: magma-rocm64
magma-rocm64: DESIRED_ROCM := 6.4
magma-rocm64:

View File

@ -6,8 +6,8 @@ set -eou pipefail
# The script expects DESIRED_CUDA and PACKAGE_NAME to be set
ROOT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")/.." && pwd)"
# Version 2.7.2 + ROCm related updates
MAGMA_VERSION=a1625ff4d9bc362906bd01f805dbbe12612953f6
# https://github.com/icl-utk-edu/magma/pull/65
MAGMA_VERSION=d6e4117bc88e73f06d26c6c2e14f064e8fc3d1ec
# Folders for the build
PACKAGE_FILES=${ROOT_DIR}/magma-rocm/package_files # metadata
@ -20,7 +20,7 @@ mkdir -p ${PACKAGE_DIR} ${PACKAGE_OUTPUT}/linux-64 ${PACKAGE_BUILD} ${PACKAGE_RE
# Fetch magma sources and verify checksum
pushd ${PACKAGE_DIR}
git clone https://bitbucket.org/icl/magma.git
git clone https://github.com/jeffdaily/magma
pushd magma
git checkout ${MAGMA_VERSION}
popd

View File

@ -58,7 +58,7 @@ time python tools/setup_helpers/generate_code.py \
# Build the docs
pushd docs/cpp
time make VERBOSE=1 html -j
time make VERBOSE=1 html
popd
popd

View File

@ -1794,6 +1794,8 @@ elif [[ "${TEST_CONFIG}" == h100_distributed ]]; then
test_h100_distributed
elif [[ "${TEST_CONFIG}" == "h100-symm-mem" ]]; then
test_h100_symm_mem
elif [[ "${TEST_CONFIG}" == "b200-symm-mem" ]]; then
test_h100_symm_mem
elif [[ "${TEST_CONFIG}" == h100_cutlass_backend ]]; then
test_h100_cutlass_backend
else

View File

@ -63,7 +63,7 @@ if errorlevel 1 exit /b 1
call %CONDA_HOME%\condabin\activate.bat testenv
if errorlevel 1 exit /b 1
call conda install -y -q -c conda-forge libuv=1.39
call conda install -y -q -c conda-forge libuv=1.51
call conda install -y -q intel-openmp
echo "install and test libtorch"

View File

@ -69,6 +69,8 @@ readability-string-compare,
'
HeaderFilterRegex: '^(aten/|c10/|torch/).*$'
WarningsAsErrors: '*'
LineFilter:
- name: '/usr/include/.*'
CheckOptions:
cppcoreguidelines-special-member-functions.AllowSoleDefaultDtor: true
cppcoreguidelines-special-member-functions.AllowImplicitlyDeletedCopyOrMove: true

View File

@ -22,6 +22,9 @@ self-hosted-runner:
- linux.arm64.m7g.4xlarge
- linux.arm64.m7g.4xlarge.ephemeral
- linux.arm64.r7g.12xlarge.memory
- linux.aws.h100
- linux.aws.h100.4
- linux.aws.h100.8
- linux.4xlarge.nvidia.gpu
- linux.8xlarge.nvidia.gpu
- linux.16xlarge.nvidia.gpu

View File

@ -1 +1 @@
1983609239caaab24ab1ed2bfa2aa92e8c76c1b1
da63274d9f3d06ba5815b5c8786a7194923a0234

View File

@ -1,43 +1,44 @@
tracking_issue: 24422
ciflow_tracking_issue: 64124
ciflow_push_tags:
- ciflow/b200
- ciflow/b200-symm-mem
- ciflow/binaries
- ciflow/binaries_libtorch
- ciflow/binaries_wheel
- ciflow/triton_binaries
- ciflow/h100
- ciflow/h100-cutlass-backend
- ciflow/h100-distributed
- ciflow/h100-symm-mem
- ciflow/inductor
- ciflow/inductor-periodic
- ciflow/inductor-rocm
- ciflow/inductor-perf-test-nightly-rocm
- ciflow/inductor-perf-compare
- ciflow/inductor-cu126
- ciflow/inductor-micro-benchmark
- ciflow/inductor-micro-benchmark-cpu-x86
- ciflow/inductor-perf-compare
- ciflow/inductor-perf-test-nightly-rocm
- ciflow/inductor-perf-test-nightly-x86-zen
- ciflow/inductor-cu126
- ciflow/inductor-periodic
- ciflow/inductor-rocm
- ciflow/linux-aarch64
- ciflow/mps
- ciflow/nightly
- ciflow/op-benchmark
- ciflow/periodic
- ciflow/periodic-rocm-mi300
- ciflow/pull
- ciflow/quantization-periodic
- ciflow/riscv64
- ciflow/rocm
- ciflow/rocm-mi300
- ciflow/s390
- ciflow/riscv64
- ciflow/slow
- ciflow/torchbench
- ciflow/triton_binaries
- ciflow/trunk
- ciflow/unstable
- ciflow/xpu
- ciflow/vllm
- ciflow/torchbench
- ciflow/op-benchmark
- ciflow/pull
- ciflow/h100
- ciflow/h100-distributed
- ciflow/win-arm64
- ciflow/h100-symm-mem
- ciflow/h100-cutlass-backend
- ciflow/b200
- ciflow/xpu
retryable_workflows:
- pull
- trunk
@ -46,4 +47,4 @@ retryable_workflows:
- inductor-A100-perf-nightly
labeler_config: labeler.yml
label_to_label_config: label_to_label.yml
mergebot: True
mergebot: true

View File

@ -30,7 +30,7 @@ CUDA_ARCHES_CUDNN_VERSION = {
}
# NOTE: Please also update the ROCm sources in `PIP_SOURCES` in tools/nightly.py when changing this
ROCM_ARCHES = ["6.3", "6.4"]
ROCM_ARCHES = ["6.4", "7.0"]
XPU_ARCHES = ["xpu"]
@ -53,7 +53,7 @@ PYTORCH_EXTRA_INSTALL_REQUIREMENTS = {
"nvidia-cusolver-cu12==11.7.1.2; platform_system == 'Linux' | "
"nvidia-cusparse-cu12==12.5.4.2; platform_system == 'Linux' | "
"nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' | "
"nvidia-nccl-cu12==2.27.5; platform_system == 'Linux' | "
"nvidia-nccl-cu12==2.28.3; platform_system == 'Linux' | "
"nvidia-nvshmem-cu12==3.3.24; platform_system == 'Linux' | "
"nvidia-nvtx-cu12==12.6.77; platform_system == 'Linux' | "
"nvidia-nvjitlink-cu12==12.6.85; platform_system == 'Linux' | "
@ -70,7 +70,7 @@ PYTORCH_EXTRA_INSTALL_REQUIREMENTS = {
"nvidia-cusolver-cu12==11.7.3.90; platform_system == 'Linux' | "
"nvidia-cusparse-cu12==12.5.8.93; platform_system == 'Linux' | "
"nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' | "
"nvidia-nccl-cu12==2.27.5; platform_system == 'Linux' | "
"nvidia-nccl-cu12==2.28.3; platform_system == 'Linux' | "
"nvidia-nvshmem-cu12==3.3.24; platform_system == 'Linux' | "
"nvidia-nvtx-cu12==12.8.90; platform_system == 'Linux' | "
"nvidia-nvjitlink-cu12==12.8.93; platform_system == 'Linux' | "
@ -87,7 +87,7 @@ PYTORCH_EXTRA_INSTALL_REQUIREMENTS = {
"nvidia-cusolver==12.0.3.29; platform_system == 'Linux' | "
"nvidia-cusparse==12.6.2.49; platform_system == 'Linux' | "
"nvidia-cusparselt-cu13==0.8.0; platform_system == 'Linux' | "
"nvidia-nccl-cu13==2.27.7; platform_system == 'Linux' | "
"nvidia-nccl-cu13==2.28.3; platform_system == 'Linux' | "
"nvidia-nvshmem-cu13==3.3.24; platform_system == 'Linux' | "
"nvidia-nvtx==13.0.39; platform_system == 'Linux' | "
"nvidia-nvjitlink==13.0.39; platform_system == 'Linux' | "

View File

@ -67,7 +67,7 @@ jobs:
# an OOM issue when running the job, so this upgrades the runner from 4xlarge
# to the next available tier of 12xlarge. So much memory just to generate cpp
# doc
runner: ${{ inputs.runner_prefix }}linux.12xlarge
runner: ${{ inputs.runner_prefix }}linux.12xlarge.memory
# TODO: Nightly cpp docs take longer and longer to finish (more than 3h now)
# Let's try to figure out how this can be improved
timeout-minutes: 360

.github/workflows/b200-symm-mem.yml vendored Normal file
View File

@ -0,0 +1,60 @@
name: Limited CI for symmetric memory tests on B200
on:
pull_request:
paths:
- .github/workflows/b200-symm-mem.yml
workflow_dispatch:
push:
tags:
- ciflow/b200-symm-mem/*
schedule:
- cron: 22 8 * * * # about 1:22am PDT
concurrency:
group: ${{ github.workflow }}-${{ github.event.pull_request.number || github.sha }}-${{ github.event_name == 'workflow_dispatch' }}-${{ github.event_name == 'schedule' }}
cancel-in-progress: true
permissions:
id-token: write
contents: read
jobs:
get-label-type:
if: github.repository_owner == 'pytorch'
name: get-label-type
uses: pytorch/pytorch/.github/workflows/_runner-determinator.yml@main
with:
triggering_actor: ${{ github.triggering_actor }}
issue_owner: ${{ github.event.pull_request.user.login || github.event.issue.user.login }}
curr_branch: ${{ github.head_ref || github.ref_name }}
curr_ref_type: ${{ github.ref_type }}
linux-jammy-cuda12_8-py3_10-gcc11-sm100-build-symm:
name: linux-jammy-cuda12.8-py3.10-gcc11-sm100-symm
uses: ./.github/workflows/_linux-build.yml
needs: get-label-type
with:
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
runner: linux.12xlarge.memory
build-environment: linux-jammy-cuda12.8-py3.10-gcc11-sm100-symm
docker-image-name: ci-image:pytorch-linux-jammy-cuda12.8-cudnn9-py3-gcc11
cuda-arch-list: '10.0'
test-matrix: |
{ include: [
{ config: "b200-symm-mem", shard: 1, num_shards: 1, runner: "linux.dgx.b200.8" },
]}
secrets: inherit
linux-jammy-cuda12_8-py3_10-gcc11-sm100-test:
name: linux-jammy-cuda12.8-py3.10-gcc11-sm100-symm
uses: ./.github/workflows/_linux-test.yml
needs:
- linux-jammy-cuda12_8-py3_10-gcc11-sm100-build-symm
with:
build-environment: linux-jammy-cuda12.8-py3.10-gcc11-sm100-symm
docker-image: ${{ needs.linux-jammy-cuda12_8-py3_10-gcc11-sm100-build-symm.outputs.docker-image }}
test-matrix: ${{ needs.linux-jammy-cuda12_8-py3_10-gcc11-sm100-build-symm.outputs.test-matrix }}
aws-role-to-assume: arn:aws:iam::308535385114:role/gha_workflow_s3_and_ecr_read_only
secrets: inherit

View File

@ -36,7 +36,7 @@ jobs:
runs-on: linux.9xlarge.ephemeral
strategy:
matrix:
tag: ["cuda12.6", "cuda12.8", "cuda12.9", "cuda13.0", "rocm6.3", "rocm6.4", "cpu"]
tag: ["cuda12.6", "cuda12.8", "cuda12.9", "cuda13.0", "rocm6.3", "rocm6.4", "rocm7.0", "cpu"]
steps:
- name: Build docker image
uses: pytorch/pytorch/.github/actions/binary-docker-build@main

View File

@ -52,8 +52,8 @@ jobs:
{ tag: "cuda12.9" },
{ tag: "cuda12.8" },
{ tag: "cuda12.6" },
{ tag: "rocm6.3" },
{ tag: "rocm6.4" },
{ tag: "rocm7.0" },
{ tag: "cpu" },
]
steps:

View File

@ -34,7 +34,7 @@ jobs:
id-token: write
strategy:
matrix:
rocm_version: ["64", "63"]
rocm_version: ["70", "64"]
steps:
- name: Checkout PyTorch
uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2


@ -52,8 +52,8 @@ jobs:
{ name: "manylinuxaarch64-builder", tag: "cuda13.0", runner: "linux.arm64.2xlarge.ephemeral" },
{ name: "manylinuxaarch64-builder", tag: "cuda12.8", runner: "linux.arm64.2xlarge.ephemeral" },
{ name: "manylinuxaarch64-builder", tag: "cuda12.6", runner: "linux.arm64.2xlarge.ephemeral" },
{ name: "manylinux2_28-builder", tag: "rocm6.3", runner: "linux.9xlarge.ephemeral" },
{ name: "manylinux2_28-builder", tag: "rocm6.4", runner: "linux.9xlarge.ephemeral" },
{ name: "manylinux2_28-builder", tag: "rocm7.0", runner: "linux.9xlarge.ephemeral" },
{ name: "manylinux2_28-builder", tag: "cpu", runner: "linux.9xlarge.ephemeral" },
{ name: "manylinux2_28_aarch64-builder", tag: "cpu-aarch64", runner: "linux.arm64.2xlarge.ephemeral" },
{ name: "manylinuxcxx11-abi-builder", tag: "cpu-cxx11-abi", runner: "linux.9xlarge.ephemeral" },


@ -55,7 +55,7 @@ jobs:
docker-image: ["pytorch/manylinux2_28-builder:cpu"]
include:
- device: "rocm"
rocm_version: "6.4"
rocm_version: "7.0"
runs_on: "${{ needs.get-label-type.outputs.label-type }}linux.4xlarge"
- device: "cuda"
rocm_version: ""


@ -132,7 +132,7 @@ jobs:
ALPINE_IMAGE: "arm64v8/alpine"
build_name: manywheel-py3_10-cuda-aarch64-12_6
build_environment: linux-aarch64-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.6.77; platform_system == 'Linux' | nvidia-cuda-runtime-cu12==12.6.77; platform_system == 'Linux' | nvidia-cuda-cupti-cu12==12.6.80; platform_system == 'Linux' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' | nvidia-cublas-cu12==12.6.4.1; platform_system == 'Linux' | nvidia-cufft-cu12==11.3.0.4; platform_system == 'Linux' | nvidia-curand-cu12==10.3.7.77; platform_system == 'Linux' | nvidia-cusolver-cu12==11.7.1.2; platform_system == 'Linux' | nvidia-cusparse-cu12==12.5.4.2; platform_system == 'Linux' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' | nvidia-nccl-cu12==2.27.5; platform_system == 'Linux' | nvidia-nvshmem-cu12==3.3.24; platform_system == 'Linux' | nvidia-nvtx-cu12==12.6.77; platform_system == 'Linux' | nvidia-nvjitlink-cu12==12.6.85; platform_system == 'Linux' | nvidia-cufile-cu12==1.11.1.6; platform_system == 'Linux'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.6.77; platform_system == 'Linux' | nvidia-cuda-runtime-cu12==12.6.77; platform_system == 'Linux' | nvidia-cuda-cupti-cu12==12.6.80; platform_system == 'Linux' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' | nvidia-cublas-cu12==12.6.4.1; platform_system == 'Linux' | nvidia-cufft-cu12==11.3.0.4; platform_system == 'Linux' | nvidia-curand-cu12==10.3.7.77; platform_system == 'Linux' | nvidia-cusolver-cu12==11.7.1.2; platform_system == 'Linux' | nvidia-cusparse-cu12==12.5.4.2; platform_system == 'Linux' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' | nvidia-nccl-cu12==2.28.3; platform_system == 'Linux' | nvidia-nvshmem-cu12==3.3.24; platform_system == 'Linux' | nvidia-nvtx-cu12==12.6.77; platform_system == 'Linux' | nvidia-nvjitlink-cu12==12.6.85; platform_system == 'Linux' | nvidia-cufile-cu12==1.11.1.6; platform_system == 'Linux'
timeout-minutes: 420
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
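Note: this hunk, and the matching hunks in the remaining manywheel jobs below, all bump the pinned NCCL wheel inside PYTORCH_EXTRA_INSTALL_REQUIREMENTS, which is a single string of PEP 508 requirements (with platform_system markers) joined by ' | '. A minimal sketch of splitting and evaluating such a string, assuming the third-party packaging library (illustrative only, not the actual builder tooling):

```python
# Illustrative only: split a " | "-joined requirements string and evaluate its
# environment markers with the `packaging` library (an assumption, not the
# tooling PyTorch CI actually uses).
from packaging.requirements import Requirement

extra = (
    "nvidia-nccl-cu13==2.28.3; platform_system == 'Linux' | "
    "nvidia-nvshmem-cu13==3.3.24; platform_system == 'Linux'"
)

for entry in extra.split(" | "):
    entry = entry.strip()
    if not entry:
        continue
    req = Requirement(entry)
    applies = req.marker.evaluate() if req.marker else True
    print(f"{req.name}{req.specifier}  (applies on this platform: {applies})")
```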
@ -178,7 +178,7 @@ jobs:
ALPINE_IMAGE: "arm64v8/alpine"
build_name: manywheel-py3_10-cuda-aarch64-12_8
build_environment: linux-aarch64-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.8.93; platform_system == 'Linux' | nvidia-cuda-runtime-cu12==12.8.90; platform_system == 'Linux' | nvidia-cuda-cupti-cu12==12.8.90; platform_system == 'Linux' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' | nvidia-cublas-cu12==12.8.4.1; platform_system == 'Linux' | nvidia-cufft-cu12==11.3.3.83; platform_system == 'Linux' | nvidia-curand-cu12==10.3.9.90; platform_system == 'Linux' | nvidia-cusolver-cu12==11.7.3.90; platform_system == 'Linux' | nvidia-cusparse-cu12==12.5.8.93; platform_system == 'Linux' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' | nvidia-nccl-cu12==2.27.5; platform_system == 'Linux' | nvidia-nvshmem-cu12==3.3.24; platform_system == 'Linux' | nvidia-nvtx-cu12==12.8.90; platform_system == 'Linux' | nvidia-nvjitlink-cu12==12.8.93; platform_system == 'Linux' | nvidia-cufile-cu12==1.13.1.3; platform_system == 'Linux'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.8.93; platform_system == 'Linux' | nvidia-cuda-runtime-cu12==12.8.90; platform_system == 'Linux' | nvidia-cuda-cupti-cu12==12.8.90; platform_system == 'Linux' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' | nvidia-cublas-cu12==12.8.4.1; platform_system == 'Linux' | nvidia-cufft-cu12==11.3.3.83; platform_system == 'Linux' | nvidia-curand-cu12==10.3.9.90; platform_system == 'Linux' | nvidia-cusolver-cu12==11.7.3.90; platform_system == 'Linux' | nvidia-cusparse-cu12==12.5.8.93; platform_system == 'Linux' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' | nvidia-nccl-cu12==2.28.3; platform_system == 'Linux' | nvidia-nvshmem-cu12==3.3.24; platform_system == 'Linux' | nvidia-nvtx-cu12==12.8.90; platform_system == 'Linux' | nvidia-nvjitlink-cu12==12.8.93; platform_system == 'Linux' | nvidia-cufile-cu12==1.13.1.3; platform_system == 'Linux'
timeout-minutes: 420
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
@ -224,7 +224,7 @@ jobs:
ALPINE_IMAGE: "arm64v8/alpine"
build_name: manywheel-py3_10-cuda-aarch64-13_0
build_environment: linux-aarch64-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc==13.0.48; platform_system == 'Linux' | nvidia-cuda-runtime==13.0.48; platform_system == 'Linux' | nvidia-cuda-cupti==13.0.48; platform_system == 'Linux' | nvidia-cudnn-cu13==9.13.0.50; platform_system == 'Linux' | nvidia-cublas==13.0.0.19; platform_system == 'Linux' | nvidia-cufft==12.0.0.15; platform_system == 'Linux' | nvidia-curand==10.4.0.35; platform_system == 'Linux' | nvidia-cusolver==12.0.3.29; platform_system == 'Linux' | nvidia-cusparse==12.6.2.49; platform_system == 'Linux' | nvidia-cusparselt-cu13==0.8.0; platform_system == 'Linux' | nvidia-nccl-cu13==2.27.7; platform_system == 'Linux' | nvidia-nvshmem-cu13==3.3.24; platform_system == 'Linux' | nvidia-nvtx==13.0.39; platform_system == 'Linux' | nvidia-nvjitlink==13.0.39; platform_system == 'Linux' | nvidia-cufile==1.15.0.42; platform_system == 'Linux'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc==13.0.48; platform_system == 'Linux' | nvidia-cuda-runtime==13.0.48; platform_system == 'Linux' | nvidia-cuda-cupti==13.0.48; platform_system == 'Linux' | nvidia-cudnn-cu13==9.13.0.50; platform_system == 'Linux' | nvidia-cublas==13.0.0.19; platform_system == 'Linux' | nvidia-cufft==12.0.0.15; platform_system == 'Linux' | nvidia-curand==10.4.0.35; platform_system == 'Linux' | nvidia-cusolver==12.0.3.29; platform_system == 'Linux' | nvidia-cusparse==12.6.2.49; platform_system == 'Linux' | nvidia-cusparselt-cu13==0.8.0; platform_system == 'Linux' | nvidia-nccl-cu13==2.28.3; platform_system == 'Linux' | nvidia-nvshmem-cu13==3.3.24; platform_system == 'Linux' | nvidia-nvtx==13.0.39; platform_system == 'Linux' | nvidia-nvjitlink==13.0.39; platform_system == 'Linux' | nvidia-cufile==1.15.0.42; platform_system == 'Linux'
timeout-minutes: 420
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
@ -335,7 +335,7 @@ jobs:
ALPINE_IMAGE: "arm64v8/alpine"
build_name: manywheel-py3_11-cuda-aarch64-12_6
build_environment: linux-aarch64-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.6.77; platform_system == 'Linux' | nvidia-cuda-runtime-cu12==12.6.77; platform_system == 'Linux' | nvidia-cuda-cupti-cu12==12.6.80; platform_system == 'Linux' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' | nvidia-cublas-cu12==12.6.4.1; platform_system == 'Linux' | nvidia-cufft-cu12==11.3.0.4; platform_system == 'Linux' | nvidia-curand-cu12==10.3.7.77; platform_system == 'Linux' | nvidia-cusolver-cu12==11.7.1.2; platform_system == 'Linux' | nvidia-cusparse-cu12==12.5.4.2; platform_system == 'Linux' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' | nvidia-nccl-cu12==2.27.5; platform_system == 'Linux' | nvidia-nvshmem-cu12==3.3.24; platform_system == 'Linux' | nvidia-nvtx-cu12==12.6.77; platform_system == 'Linux' | nvidia-nvjitlink-cu12==12.6.85; platform_system == 'Linux' | nvidia-cufile-cu12==1.11.1.6; platform_system == 'Linux'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.6.77; platform_system == 'Linux' | nvidia-cuda-runtime-cu12==12.6.77; platform_system == 'Linux' | nvidia-cuda-cupti-cu12==12.6.80; platform_system == 'Linux' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' | nvidia-cublas-cu12==12.6.4.1; platform_system == 'Linux' | nvidia-cufft-cu12==11.3.0.4; platform_system == 'Linux' | nvidia-curand-cu12==10.3.7.77; platform_system == 'Linux' | nvidia-cusolver-cu12==11.7.1.2; platform_system == 'Linux' | nvidia-cusparse-cu12==12.5.4.2; platform_system == 'Linux' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' | nvidia-nccl-cu12==2.28.3; platform_system == 'Linux' | nvidia-nvshmem-cu12==3.3.24; platform_system == 'Linux' | nvidia-nvtx-cu12==12.6.77; platform_system == 'Linux' | nvidia-nvjitlink-cu12==12.6.85; platform_system == 'Linux' | nvidia-cufile-cu12==1.11.1.6; platform_system == 'Linux'
timeout-minutes: 420
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
@ -381,7 +381,7 @@ jobs:
ALPINE_IMAGE: "arm64v8/alpine"
build_name: manywheel-py3_11-cuda-aarch64-12_8
build_environment: linux-aarch64-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.8.93; platform_system == 'Linux' | nvidia-cuda-runtime-cu12==12.8.90; platform_system == 'Linux' | nvidia-cuda-cupti-cu12==12.8.90; platform_system == 'Linux' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' | nvidia-cublas-cu12==12.8.4.1; platform_system == 'Linux' | nvidia-cufft-cu12==11.3.3.83; platform_system == 'Linux' | nvidia-curand-cu12==10.3.9.90; platform_system == 'Linux' | nvidia-cusolver-cu12==11.7.3.90; platform_system == 'Linux' | nvidia-cusparse-cu12==12.5.8.93; platform_system == 'Linux' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' | nvidia-nccl-cu12==2.27.5; platform_system == 'Linux' | nvidia-nvshmem-cu12==3.3.24; platform_system == 'Linux' | nvidia-nvtx-cu12==12.8.90; platform_system == 'Linux' | nvidia-nvjitlink-cu12==12.8.93; platform_system == 'Linux' | nvidia-cufile-cu12==1.13.1.3; platform_system == 'Linux'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.8.93; platform_system == 'Linux' | nvidia-cuda-runtime-cu12==12.8.90; platform_system == 'Linux' | nvidia-cuda-cupti-cu12==12.8.90; platform_system == 'Linux' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' | nvidia-cublas-cu12==12.8.4.1; platform_system == 'Linux' | nvidia-cufft-cu12==11.3.3.83; platform_system == 'Linux' | nvidia-curand-cu12==10.3.9.90; platform_system == 'Linux' | nvidia-cusolver-cu12==11.7.3.90; platform_system == 'Linux' | nvidia-cusparse-cu12==12.5.8.93; platform_system == 'Linux' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' | nvidia-nccl-cu12==2.28.3; platform_system == 'Linux' | nvidia-nvshmem-cu12==3.3.24; platform_system == 'Linux' | nvidia-nvtx-cu12==12.8.90; platform_system == 'Linux' | nvidia-nvjitlink-cu12==12.8.93; platform_system == 'Linux' | nvidia-cufile-cu12==1.13.1.3; platform_system == 'Linux'
timeout-minutes: 420
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
@ -427,7 +427,7 @@ jobs:
ALPINE_IMAGE: "arm64v8/alpine"
build_name: manywheel-py3_11-cuda-aarch64-13_0
build_environment: linux-aarch64-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc==13.0.48; platform_system == 'Linux' | nvidia-cuda-runtime==13.0.48; platform_system == 'Linux' | nvidia-cuda-cupti==13.0.48; platform_system == 'Linux' | nvidia-cudnn-cu13==9.13.0.50; platform_system == 'Linux' | nvidia-cublas==13.0.0.19; platform_system == 'Linux' | nvidia-cufft==12.0.0.15; platform_system == 'Linux' | nvidia-curand==10.4.0.35; platform_system == 'Linux' | nvidia-cusolver==12.0.3.29; platform_system == 'Linux' | nvidia-cusparse==12.6.2.49; platform_system == 'Linux' | nvidia-cusparselt-cu13==0.8.0; platform_system == 'Linux' | nvidia-nccl-cu13==2.27.7; platform_system == 'Linux' | nvidia-nvshmem-cu13==3.3.24; platform_system == 'Linux' | nvidia-nvtx==13.0.39; platform_system == 'Linux' | nvidia-nvjitlink==13.0.39; platform_system == 'Linux' | nvidia-cufile==1.15.0.42; platform_system == 'Linux'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc==13.0.48; platform_system == 'Linux' | nvidia-cuda-runtime==13.0.48; platform_system == 'Linux' | nvidia-cuda-cupti==13.0.48; platform_system == 'Linux' | nvidia-cudnn-cu13==9.13.0.50; platform_system == 'Linux' | nvidia-cublas==13.0.0.19; platform_system == 'Linux' | nvidia-cufft==12.0.0.15; platform_system == 'Linux' | nvidia-curand==10.4.0.35; platform_system == 'Linux' | nvidia-cusolver==12.0.3.29; platform_system == 'Linux' | nvidia-cusparse==12.6.2.49; platform_system == 'Linux' | nvidia-cusparselt-cu13==0.8.0; platform_system == 'Linux' | nvidia-nccl-cu13==2.28.3; platform_system == 'Linux' | nvidia-nvshmem-cu13==3.3.24; platform_system == 'Linux' | nvidia-nvtx==13.0.39; platform_system == 'Linux' | nvidia-nvjitlink==13.0.39; platform_system == 'Linux' | nvidia-cufile==1.15.0.42; platform_system == 'Linux'
timeout-minutes: 420
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
@ -538,7 +538,7 @@ jobs:
ALPINE_IMAGE: "arm64v8/alpine"
build_name: manywheel-py3_12-cuda-aarch64-12_6
build_environment: linux-aarch64-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.6.77; platform_system == 'Linux' | nvidia-cuda-runtime-cu12==12.6.77; platform_system == 'Linux' | nvidia-cuda-cupti-cu12==12.6.80; platform_system == 'Linux' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' | nvidia-cublas-cu12==12.6.4.1; platform_system == 'Linux' | nvidia-cufft-cu12==11.3.0.4; platform_system == 'Linux' | nvidia-curand-cu12==10.3.7.77; platform_system == 'Linux' | nvidia-cusolver-cu12==11.7.1.2; platform_system == 'Linux' | nvidia-cusparse-cu12==12.5.4.2; platform_system == 'Linux' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' | nvidia-nccl-cu12==2.27.5; platform_system == 'Linux' | nvidia-nvshmem-cu12==3.3.24; platform_system == 'Linux' | nvidia-nvtx-cu12==12.6.77; platform_system == 'Linux' | nvidia-nvjitlink-cu12==12.6.85; platform_system == 'Linux' | nvidia-cufile-cu12==1.11.1.6; platform_system == 'Linux'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.6.77; platform_system == 'Linux' | nvidia-cuda-runtime-cu12==12.6.77; platform_system == 'Linux' | nvidia-cuda-cupti-cu12==12.6.80; platform_system == 'Linux' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' | nvidia-cublas-cu12==12.6.4.1; platform_system == 'Linux' | nvidia-cufft-cu12==11.3.0.4; platform_system == 'Linux' | nvidia-curand-cu12==10.3.7.77; platform_system == 'Linux' | nvidia-cusolver-cu12==11.7.1.2; platform_system == 'Linux' | nvidia-cusparse-cu12==12.5.4.2; platform_system == 'Linux' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' | nvidia-nccl-cu12==2.28.3; platform_system == 'Linux' | nvidia-nvshmem-cu12==3.3.24; platform_system == 'Linux' | nvidia-nvtx-cu12==12.6.77; platform_system == 'Linux' | nvidia-nvjitlink-cu12==12.6.85; platform_system == 'Linux' | nvidia-cufile-cu12==1.11.1.6; platform_system == 'Linux'
timeout-minutes: 420
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
@ -584,7 +584,7 @@ jobs:
ALPINE_IMAGE: "arm64v8/alpine"
build_name: manywheel-py3_12-cuda-aarch64-12_8
build_environment: linux-aarch64-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.8.93; platform_system == 'Linux' | nvidia-cuda-runtime-cu12==12.8.90; platform_system == 'Linux' | nvidia-cuda-cupti-cu12==12.8.90; platform_system == 'Linux' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' | nvidia-cublas-cu12==12.8.4.1; platform_system == 'Linux' | nvidia-cufft-cu12==11.3.3.83; platform_system == 'Linux' | nvidia-curand-cu12==10.3.9.90; platform_system == 'Linux' | nvidia-cusolver-cu12==11.7.3.90; platform_system == 'Linux' | nvidia-cusparse-cu12==12.5.8.93; platform_system == 'Linux' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' | nvidia-nccl-cu12==2.27.5; platform_system == 'Linux' | nvidia-nvshmem-cu12==3.3.24; platform_system == 'Linux' | nvidia-nvtx-cu12==12.8.90; platform_system == 'Linux' | nvidia-nvjitlink-cu12==12.8.93; platform_system == 'Linux' | nvidia-cufile-cu12==1.13.1.3; platform_system == 'Linux'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.8.93; platform_system == 'Linux' | nvidia-cuda-runtime-cu12==12.8.90; platform_system == 'Linux' | nvidia-cuda-cupti-cu12==12.8.90; platform_system == 'Linux' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' | nvidia-cublas-cu12==12.8.4.1; platform_system == 'Linux' | nvidia-cufft-cu12==11.3.3.83; platform_system == 'Linux' | nvidia-curand-cu12==10.3.9.90; platform_system == 'Linux' | nvidia-cusolver-cu12==11.7.3.90; platform_system == 'Linux' | nvidia-cusparse-cu12==12.5.8.93; platform_system == 'Linux' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' | nvidia-nccl-cu12==2.28.3; platform_system == 'Linux' | nvidia-nvshmem-cu12==3.3.24; platform_system == 'Linux' | nvidia-nvtx-cu12==12.8.90; platform_system == 'Linux' | nvidia-nvjitlink-cu12==12.8.93; platform_system == 'Linux' | nvidia-cufile-cu12==1.13.1.3; platform_system == 'Linux'
timeout-minutes: 420
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
@ -630,7 +630,7 @@ jobs:
ALPINE_IMAGE: "arm64v8/alpine"
build_name: manywheel-py3_12-cuda-aarch64-13_0
build_environment: linux-aarch64-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc==13.0.48; platform_system == 'Linux' | nvidia-cuda-runtime==13.0.48; platform_system == 'Linux' | nvidia-cuda-cupti==13.0.48; platform_system == 'Linux' | nvidia-cudnn-cu13==9.13.0.50; platform_system == 'Linux' | nvidia-cublas==13.0.0.19; platform_system == 'Linux' | nvidia-cufft==12.0.0.15; platform_system == 'Linux' | nvidia-curand==10.4.0.35; platform_system == 'Linux' | nvidia-cusolver==12.0.3.29; platform_system == 'Linux' | nvidia-cusparse==12.6.2.49; platform_system == 'Linux' | nvidia-cusparselt-cu13==0.8.0; platform_system == 'Linux' | nvidia-nccl-cu13==2.27.7; platform_system == 'Linux' | nvidia-nvshmem-cu13==3.3.24; platform_system == 'Linux' | nvidia-nvtx==13.0.39; platform_system == 'Linux' | nvidia-nvjitlink==13.0.39; platform_system == 'Linux' | nvidia-cufile==1.15.0.42; platform_system == 'Linux'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc==13.0.48; platform_system == 'Linux' | nvidia-cuda-runtime==13.0.48; platform_system == 'Linux' | nvidia-cuda-cupti==13.0.48; platform_system == 'Linux' | nvidia-cudnn-cu13==9.13.0.50; platform_system == 'Linux' | nvidia-cublas==13.0.0.19; platform_system == 'Linux' | nvidia-cufft==12.0.0.15; platform_system == 'Linux' | nvidia-curand==10.4.0.35; platform_system == 'Linux' | nvidia-cusolver==12.0.3.29; platform_system == 'Linux' | nvidia-cusparse==12.6.2.49; platform_system == 'Linux' | nvidia-cusparselt-cu13==0.8.0; platform_system == 'Linux' | nvidia-nccl-cu13==2.28.3; platform_system == 'Linux' | nvidia-nvshmem-cu13==3.3.24; platform_system == 'Linux' | nvidia-nvtx==13.0.39; platform_system == 'Linux' | nvidia-nvjitlink==13.0.39; platform_system == 'Linux' | nvidia-cufile==1.15.0.42; platform_system == 'Linux'
timeout-minutes: 420
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
@ -741,7 +741,7 @@ jobs:
ALPINE_IMAGE: "arm64v8/alpine"
build_name: manywheel-py3_13-cuda-aarch64-12_6
build_environment: linux-aarch64-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.6.77; platform_system == 'Linux' | nvidia-cuda-runtime-cu12==12.6.77; platform_system == 'Linux' | nvidia-cuda-cupti-cu12==12.6.80; platform_system == 'Linux' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' | nvidia-cublas-cu12==12.6.4.1; platform_system == 'Linux' | nvidia-cufft-cu12==11.3.0.4; platform_system == 'Linux' | nvidia-curand-cu12==10.3.7.77; platform_system == 'Linux' | nvidia-cusolver-cu12==11.7.1.2; platform_system == 'Linux' | nvidia-cusparse-cu12==12.5.4.2; platform_system == 'Linux' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' | nvidia-nccl-cu12==2.27.5; platform_system == 'Linux' | nvidia-nvshmem-cu12==3.3.24; platform_system == 'Linux' | nvidia-nvtx-cu12==12.6.77; platform_system == 'Linux' | nvidia-nvjitlink-cu12==12.6.85; platform_system == 'Linux' | nvidia-cufile-cu12==1.11.1.6; platform_system == 'Linux'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.6.77; platform_system == 'Linux' | nvidia-cuda-runtime-cu12==12.6.77; platform_system == 'Linux' | nvidia-cuda-cupti-cu12==12.6.80; platform_system == 'Linux' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' | nvidia-cublas-cu12==12.6.4.1; platform_system == 'Linux' | nvidia-cufft-cu12==11.3.0.4; platform_system == 'Linux' | nvidia-curand-cu12==10.3.7.77; platform_system == 'Linux' | nvidia-cusolver-cu12==11.7.1.2; platform_system == 'Linux' | nvidia-cusparse-cu12==12.5.4.2; platform_system == 'Linux' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' | nvidia-nccl-cu12==2.28.3; platform_system == 'Linux' | nvidia-nvshmem-cu12==3.3.24; platform_system == 'Linux' | nvidia-nvtx-cu12==12.6.77; platform_system == 'Linux' | nvidia-nvjitlink-cu12==12.6.85; platform_system == 'Linux' | nvidia-cufile-cu12==1.11.1.6; platform_system == 'Linux'
timeout-minutes: 420
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
@ -787,7 +787,7 @@ jobs:
ALPINE_IMAGE: "arm64v8/alpine"
build_name: manywheel-py3_13-cuda-aarch64-12_8
build_environment: linux-aarch64-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.8.93; platform_system == 'Linux' | nvidia-cuda-runtime-cu12==12.8.90; platform_system == 'Linux' | nvidia-cuda-cupti-cu12==12.8.90; platform_system == 'Linux' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' | nvidia-cublas-cu12==12.8.4.1; platform_system == 'Linux' | nvidia-cufft-cu12==11.3.3.83; platform_system == 'Linux' | nvidia-curand-cu12==10.3.9.90; platform_system == 'Linux' | nvidia-cusolver-cu12==11.7.3.90; platform_system == 'Linux' | nvidia-cusparse-cu12==12.5.8.93; platform_system == 'Linux' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' | nvidia-nccl-cu12==2.27.5; platform_system == 'Linux' | nvidia-nvshmem-cu12==3.3.24; platform_system == 'Linux' | nvidia-nvtx-cu12==12.8.90; platform_system == 'Linux' | nvidia-nvjitlink-cu12==12.8.93; platform_system == 'Linux' | nvidia-cufile-cu12==1.13.1.3; platform_system == 'Linux'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.8.93; platform_system == 'Linux' | nvidia-cuda-runtime-cu12==12.8.90; platform_system == 'Linux' | nvidia-cuda-cupti-cu12==12.8.90; platform_system == 'Linux' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' | nvidia-cublas-cu12==12.8.4.1; platform_system == 'Linux' | nvidia-cufft-cu12==11.3.3.83; platform_system == 'Linux' | nvidia-curand-cu12==10.3.9.90; platform_system == 'Linux' | nvidia-cusolver-cu12==11.7.3.90; platform_system == 'Linux' | nvidia-cusparse-cu12==12.5.8.93; platform_system == 'Linux' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' | nvidia-nccl-cu12==2.28.3; platform_system == 'Linux' | nvidia-nvshmem-cu12==3.3.24; platform_system == 'Linux' | nvidia-nvtx-cu12==12.8.90; platform_system == 'Linux' | nvidia-nvjitlink-cu12==12.8.93; platform_system == 'Linux' | nvidia-cufile-cu12==1.13.1.3; platform_system == 'Linux'
timeout-minutes: 420
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
@ -833,7 +833,7 @@ jobs:
ALPINE_IMAGE: "arm64v8/alpine"
build_name: manywheel-py3_13-cuda-aarch64-13_0
build_environment: linux-aarch64-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc==13.0.48; platform_system == 'Linux' | nvidia-cuda-runtime==13.0.48; platform_system == 'Linux' | nvidia-cuda-cupti==13.0.48; platform_system == 'Linux' | nvidia-cudnn-cu13==9.13.0.50; platform_system == 'Linux' | nvidia-cublas==13.0.0.19; platform_system == 'Linux' | nvidia-cufft==12.0.0.15; platform_system == 'Linux' | nvidia-curand==10.4.0.35; platform_system == 'Linux' | nvidia-cusolver==12.0.3.29; platform_system == 'Linux' | nvidia-cusparse==12.6.2.49; platform_system == 'Linux' | nvidia-cusparselt-cu13==0.8.0; platform_system == 'Linux' | nvidia-nccl-cu13==2.27.7; platform_system == 'Linux' | nvidia-nvshmem-cu13==3.3.24; platform_system == 'Linux' | nvidia-nvtx==13.0.39; platform_system == 'Linux' | nvidia-nvjitlink==13.0.39; platform_system == 'Linux' | nvidia-cufile==1.15.0.42; platform_system == 'Linux'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc==13.0.48; platform_system == 'Linux' | nvidia-cuda-runtime==13.0.48; platform_system == 'Linux' | nvidia-cuda-cupti==13.0.48; platform_system == 'Linux' | nvidia-cudnn-cu13==9.13.0.50; platform_system == 'Linux' | nvidia-cublas==13.0.0.19; platform_system == 'Linux' | nvidia-cufft==12.0.0.15; platform_system == 'Linux' | nvidia-curand==10.4.0.35; platform_system == 'Linux' | nvidia-cusolver==12.0.3.29; platform_system == 'Linux' | nvidia-cusparse==12.6.2.49; platform_system == 'Linux' | nvidia-cusparselt-cu13==0.8.0; platform_system == 'Linux' | nvidia-nccl-cu13==2.28.3; platform_system == 'Linux' | nvidia-nvshmem-cu13==3.3.24; platform_system == 'Linux' | nvidia-nvtx==13.0.39; platform_system == 'Linux' | nvidia-nvjitlink==13.0.39; platform_system == 'Linux' | nvidia-cufile==1.15.0.42; platform_system == 'Linux'
timeout-minutes: 420
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
@ -944,7 +944,7 @@ jobs:
ALPINE_IMAGE: "arm64v8/alpine"
build_name: manywheel-py3_13t-cuda-aarch64-12_6
build_environment: linux-aarch64-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.6.77; platform_system == 'Linux' | nvidia-cuda-runtime-cu12==12.6.77; platform_system == 'Linux' | nvidia-cuda-cupti-cu12==12.6.80; platform_system == 'Linux' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' | nvidia-cublas-cu12==12.6.4.1; platform_system == 'Linux' | nvidia-cufft-cu12==11.3.0.4; platform_system == 'Linux' | nvidia-curand-cu12==10.3.7.77; platform_system == 'Linux' | nvidia-cusolver-cu12==11.7.1.2; platform_system == 'Linux' | nvidia-cusparse-cu12==12.5.4.2; platform_system == 'Linux' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' | nvidia-nccl-cu12==2.27.5; platform_system == 'Linux' | nvidia-nvshmem-cu12==3.3.24; platform_system == 'Linux' | nvidia-nvtx-cu12==12.6.77; platform_system == 'Linux' | nvidia-nvjitlink-cu12==12.6.85; platform_system == 'Linux' | nvidia-cufile-cu12==1.11.1.6; platform_system == 'Linux'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.6.77; platform_system == 'Linux' | nvidia-cuda-runtime-cu12==12.6.77; platform_system == 'Linux' | nvidia-cuda-cupti-cu12==12.6.80; platform_system == 'Linux' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' | nvidia-cublas-cu12==12.6.4.1; platform_system == 'Linux' | nvidia-cufft-cu12==11.3.0.4; platform_system == 'Linux' | nvidia-curand-cu12==10.3.7.77; platform_system == 'Linux' | nvidia-cusolver-cu12==11.7.1.2; platform_system == 'Linux' | nvidia-cusparse-cu12==12.5.4.2; platform_system == 'Linux' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' | nvidia-nccl-cu12==2.28.3; platform_system == 'Linux' | nvidia-nvshmem-cu12==3.3.24; platform_system == 'Linux' | nvidia-nvtx-cu12==12.6.77; platform_system == 'Linux' | nvidia-nvjitlink-cu12==12.6.85; platform_system == 'Linux' | nvidia-cufile-cu12==1.11.1.6; platform_system == 'Linux'
timeout-minutes: 420
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
@ -990,7 +990,7 @@ jobs:
ALPINE_IMAGE: "arm64v8/alpine"
build_name: manywheel-py3_13t-cuda-aarch64-12_8
build_environment: linux-aarch64-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.8.93; platform_system == 'Linux' | nvidia-cuda-runtime-cu12==12.8.90; platform_system == 'Linux' | nvidia-cuda-cupti-cu12==12.8.90; platform_system == 'Linux' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' | nvidia-cublas-cu12==12.8.4.1; platform_system == 'Linux' | nvidia-cufft-cu12==11.3.3.83; platform_system == 'Linux' | nvidia-curand-cu12==10.3.9.90; platform_system == 'Linux' | nvidia-cusolver-cu12==11.7.3.90; platform_system == 'Linux' | nvidia-cusparse-cu12==12.5.8.93; platform_system == 'Linux' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' | nvidia-nccl-cu12==2.27.5; platform_system == 'Linux' | nvidia-nvshmem-cu12==3.3.24; platform_system == 'Linux' | nvidia-nvtx-cu12==12.8.90; platform_system == 'Linux' | nvidia-nvjitlink-cu12==12.8.93; platform_system == 'Linux' | nvidia-cufile-cu12==1.13.1.3; platform_system == 'Linux'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.8.93; platform_system == 'Linux' | nvidia-cuda-runtime-cu12==12.8.90; platform_system == 'Linux' | nvidia-cuda-cupti-cu12==12.8.90; platform_system == 'Linux' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' | nvidia-cublas-cu12==12.8.4.1; platform_system == 'Linux' | nvidia-cufft-cu12==11.3.3.83; platform_system == 'Linux' | nvidia-curand-cu12==10.3.9.90; platform_system == 'Linux' | nvidia-cusolver-cu12==11.7.3.90; platform_system == 'Linux' | nvidia-cusparse-cu12==12.5.8.93; platform_system == 'Linux' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' | nvidia-nccl-cu12==2.28.3; platform_system == 'Linux' | nvidia-nvshmem-cu12==3.3.24; platform_system == 'Linux' | nvidia-nvtx-cu12==12.8.90; platform_system == 'Linux' | nvidia-nvjitlink-cu12==12.8.93; platform_system == 'Linux' | nvidia-cufile-cu12==1.13.1.3; platform_system == 'Linux'
timeout-minutes: 420
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
@ -1036,7 +1036,7 @@ jobs:
ALPINE_IMAGE: "arm64v8/alpine"
build_name: manywheel-py3_13t-cuda-aarch64-13_0
build_environment: linux-aarch64-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc==13.0.48; platform_system == 'Linux' | nvidia-cuda-runtime==13.0.48; platform_system == 'Linux' | nvidia-cuda-cupti==13.0.48; platform_system == 'Linux' | nvidia-cudnn-cu13==9.13.0.50; platform_system == 'Linux' | nvidia-cublas==13.0.0.19; platform_system == 'Linux' | nvidia-cufft==12.0.0.15; platform_system == 'Linux' | nvidia-curand==10.4.0.35; platform_system == 'Linux' | nvidia-cusolver==12.0.3.29; platform_system == 'Linux' | nvidia-cusparse==12.6.2.49; platform_system == 'Linux' | nvidia-cusparselt-cu13==0.8.0; platform_system == 'Linux' | nvidia-nccl-cu13==2.27.7; platform_system == 'Linux' | nvidia-nvshmem-cu13==3.3.24; platform_system == 'Linux' | nvidia-nvtx==13.0.39; platform_system == 'Linux' | nvidia-nvjitlink==13.0.39; platform_system == 'Linux' | nvidia-cufile==1.15.0.42; platform_system == 'Linux'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc==13.0.48; platform_system == 'Linux' | nvidia-cuda-runtime==13.0.48; platform_system == 'Linux' | nvidia-cuda-cupti==13.0.48; platform_system == 'Linux' | nvidia-cudnn-cu13==9.13.0.50; platform_system == 'Linux' | nvidia-cublas==13.0.0.19; platform_system == 'Linux' | nvidia-cufft==12.0.0.15; platform_system == 'Linux' | nvidia-curand==10.4.0.35; platform_system == 'Linux' | nvidia-cusolver==12.0.3.29; platform_system == 'Linux' | nvidia-cusparse==12.6.2.49; platform_system == 'Linux' | nvidia-cusparselt-cu13==0.8.0; platform_system == 'Linux' | nvidia-nccl-cu13==2.28.3; platform_system == 'Linux' | nvidia-nvshmem-cu13==3.3.24; platform_system == 'Linux' | nvidia-nvtx==13.0.39; platform_system == 'Linux' | nvidia-nvjitlink==13.0.39; platform_system == 'Linux' | nvidia-cufile==1.15.0.42; platform_system == 'Linux'
timeout-minutes: 420
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
@ -1147,7 +1147,7 @@ jobs:
ALPINE_IMAGE: "arm64v8/alpine"
build_name: manywheel-py3_14-cuda-aarch64-12_6
build_environment: linux-aarch64-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.6.77; platform_system == 'Linux' | nvidia-cuda-runtime-cu12==12.6.77; platform_system == 'Linux' | nvidia-cuda-cupti-cu12==12.6.80; platform_system == 'Linux' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' | nvidia-cublas-cu12==12.6.4.1; platform_system == 'Linux' | nvidia-cufft-cu12==11.3.0.4; platform_system == 'Linux' | nvidia-curand-cu12==10.3.7.77; platform_system == 'Linux' | nvidia-cusolver-cu12==11.7.1.2; platform_system == 'Linux' | nvidia-cusparse-cu12==12.5.4.2; platform_system == 'Linux' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' | nvidia-nccl-cu12==2.27.5; platform_system == 'Linux' | nvidia-nvshmem-cu12==3.3.24; platform_system == 'Linux' | nvidia-nvtx-cu12==12.6.77; platform_system == 'Linux' | nvidia-nvjitlink-cu12==12.6.85; platform_system == 'Linux' | nvidia-cufile-cu12==1.11.1.6; platform_system == 'Linux'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.6.77; platform_system == 'Linux' | nvidia-cuda-runtime-cu12==12.6.77; platform_system == 'Linux' | nvidia-cuda-cupti-cu12==12.6.80; platform_system == 'Linux' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' | nvidia-cublas-cu12==12.6.4.1; platform_system == 'Linux' | nvidia-cufft-cu12==11.3.0.4; platform_system == 'Linux' | nvidia-curand-cu12==10.3.7.77; platform_system == 'Linux' | nvidia-cusolver-cu12==11.7.1.2; platform_system == 'Linux' | nvidia-cusparse-cu12==12.5.4.2; platform_system == 'Linux' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' | nvidia-nccl-cu12==2.28.3; platform_system == 'Linux' | nvidia-nvshmem-cu12==3.3.24; platform_system == 'Linux' | nvidia-nvtx-cu12==12.6.77; platform_system == 'Linux' | nvidia-nvjitlink-cu12==12.6.85; platform_system == 'Linux' | nvidia-cufile-cu12==1.11.1.6; platform_system == 'Linux'
timeout-minutes: 420
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
@ -1193,7 +1193,7 @@ jobs:
ALPINE_IMAGE: "arm64v8/alpine"
build_name: manywheel-py3_14-cuda-aarch64-12_8
build_environment: linux-aarch64-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.8.93; platform_system == 'Linux' | nvidia-cuda-runtime-cu12==12.8.90; platform_system == 'Linux' | nvidia-cuda-cupti-cu12==12.8.90; platform_system == 'Linux' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' | nvidia-cublas-cu12==12.8.4.1; platform_system == 'Linux' | nvidia-cufft-cu12==11.3.3.83; platform_system == 'Linux' | nvidia-curand-cu12==10.3.9.90; platform_system == 'Linux' | nvidia-cusolver-cu12==11.7.3.90; platform_system == 'Linux' | nvidia-cusparse-cu12==12.5.8.93; platform_system == 'Linux' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' | nvidia-nccl-cu12==2.27.5; platform_system == 'Linux' | nvidia-nvshmem-cu12==3.3.24; platform_system == 'Linux' | nvidia-nvtx-cu12==12.8.90; platform_system == 'Linux' | nvidia-nvjitlink-cu12==12.8.93; platform_system == 'Linux' | nvidia-cufile-cu12==1.13.1.3; platform_system == 'Linux'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.8.93; platform_system == 'Linux' | nvidia-cuda-runtime-cu12==12.8.90; platform_system == 'Linux' | nvidia-cuda-cupti-cu12==12.8.90; platform_system == 'Linux' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' | nvidia-cublas-cu12==12.8.4.1; platform_system == 'Linux' | nvidia-cufft-cu12==11.3.3.83; platform_system == 'Linux' | nvidia-curand-cu12==10.3.9.90; platform_system == 'Linux' | nvidia-cusolver-cu12==11.7.3.90; platform_system == 'Linux' | nvidia-cusparse-cu12==12.5.8.93; platform_system == 'Linux' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' | nvidia-nccl-cu12==2.28.3; platform_system == 'Linux' | nvidia-nvshmem-cu12==3.3.24; platform_system == 'Linux' | nvidia-nvtx-cu12==12.8.90; platform_system == 'Linux' | nvidia-nvjitlink-cu12==12.8.93; platform_system == 'Linux' | nvidia-cufile-cu12==1.13.1.3; platform_system == 'Linux'
timeout-minutes: 420
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
@ -1239,7 +1239,7 @@ jobs:
ALPINE_IMAGE: "arm64v8/alpine"
build_name: manywheel-py3_14-cuda-aarch64-13_0
build_environment: linux-aarch64-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc==13.0.48; platform_system == 'Linux' | nvidia-cuda-runtime==13.0.48; platform_system == 'Linux' | nvidia-cuda-cupti==13.0.48; platform_system == 'Linux' | nvidia-cudnn-cu13==9.13.0.50; platform_system == 'Linux' | nvidia-cublas==13.0.0.19; platform_system == 'Linux' | nvidia-cufft==12.0.0.15; platform_system == 'Linux' | nvidia-curand==10.4.0.35; platform_system == 'Linux' | nvidia-cusolver==12.0.3.29; platform_system == 'Linux' | nvidia-cusparse==12.6.2.49; platform_system == 'Linux' | nvidia-cusparselt-cu13==0.8.0; platform_system == 'Linux' | nvidia-nccl-cu13==2.27.7; platform_system == 'Linux' | nvidia-nvshmem-cu13==3.3.24; platform_system == 'Linux' | nvidia-nvtx==13.0.39; platform_system == 'Linux' | nvidia-nvjitlink==13.0.39; platform_system == 'Linux' | nvidia-cufile==1.15.0.42; platform_system == 'Linux'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc==13.0.48; platform_system == 'Linux' | nvidia-cuda-runtime==13.0.48; platform_system == 'Linux' | nvidia-cuda-cupti==13.0.48; platform_system == 'Linux' | nvidia-cudnn-cu13==9.13.0.50; platform_system == 'Linux' | nvidia-cublas==13.0.0.19; platform_system == 'Linux' | nvidia-cufft==12.0.0.15; platform_system == 'Linux' | nvidia-curand==10.4.0.35; platform_system == 'Linux' | nvidia-cusolver==12.0.3.29; platform_system == 'Linux' | nvidia-cusparse==12.6.2.49; platform_system == 'Linux' | nvidia-cusparselt-cu13==0.8.0; platform_system == 'Linux' | nvidia-nccl-cu13==2.28.3; platform_system == 'Linux' | nvidia-nvshmem-cu13==3.3.24; platform_system == 'Linux' | nvidia-nvtx==13.0.39; platform_system == 'Linux' | nvidia-nvjitlink==13.0.39; platform_system == 'Linux' | nvidia-cufile==1.15.0.42; platform_system == 'Linux'
timeout-minutes: 420
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
@ -1350,7 +1350,7 @@ jobs:
ALPINE_IMAGE: "arm64v8/alpine"
build_name: manywheel-py3_14t-cuda-aarch64-12_6
build_environment: linux-aarch64-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.6.77; platform_system == 'Linux' | nvidia-cuda-runtime-cu12==12.6.77; platform_system == 'Linux' | nvidia-cuda-cupti-cu12==12.6.80; platform_system == 'Linux' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' | nvidia-cublas-cu12==12.6.4.1; platform_system == 'Linux' | nvidia-cufft-cu12==11.3.0.4; platform_system == 'Linux' | nvidia-curand-cu12==10.3.7.77; platform_system == 'Linux' | nvidia-cusolver-cu12==11.7.1.2; platform_system == 'Linux' | nvidia-cusparse-cu12==12.5.4.2; platform_system == 'Linux' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' | nvidia-nccl-cu12==2.27.5; platform_system == 'Linux' | nvidia-nvshmem-cu12==3.3.24; platform_system == 'Linux' | nvidia-nvtx-cu12==12.6.77; platform_system == 'Linux' | nvidia-nvjitlink-cu12==12.6.85; platform_system == 'Linux' | nvidia-cufile-cu12==1.11.1.6; platform_system == 'Linux'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.6.77; platform_system == 'Linux' | nvidia-cuda-runtime-cu12==12.6.77; platform_system == 'Linux' | nvidia-cuda-cupti-cu12==12.6.80; platform_system == 'Linux' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' | nvidia-cublas-cu12==12.6.4.1; platform_system == 'Linux' | nvidia-cufft-cu12==11.3.0.4; platform_system == 'Linux' | nvidia-curand-cu12==10.3.7.77; platform_system == 'Linux' | nvidia-cusolver-cu12==11.7.1.2; platform_system == 'Linux' | nvidia-cusparse-cu12==12.5.4.2; platform_system == 'Linux' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' | nvidia-nccl-cu12==2.28.3; platform_system == 'Linux' | nvidia-nvshmem-cu12==3.3.24; platform_system == 'Linux' | nvidia-nvtx-cu12==12.6.77; platform_system == 'Linux' | nvidia-nvjitlink-cu12==12.6.85; platform_system == 'Linux' | nvidia-cufile-cu12==1.11.1.6; platform_system == 'Linux'
timeout-minutes: 420
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
@ -1396,7 +1396,7 @@ jobs:
ALPINE_IMAGE: "arm64v8/alpine"
build_name: manywheel-py3_14t-cuda-aarch64-12_8
build_environment: linux-aarch64-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.8.93; platform_system == 'Linux' | nvidia-cuda-runtime-cu12==12.8.90; platform_system == 'Linux' | nvidia-cuda-cupti-cu12==12.8.90; platform_system == 'Linux' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' | nvidia-cublas-cu12==12.8.4.1; platform_system == 'Linux' | nvidia-cufft-cu12==11.3.3.83; platform_system == 'Linux' | nvidia-curand-cu12==10.3.9.90; platform_system == 'Linux' | nvidia-cusolver-cu12==11.7.3.90; platform_system == 'Linux' | nvidia-cusparse-cu12==12.5.8.93; platform_system == 'Linux' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' | nvidia-nccl-cu12==2.27.5; platform_system == 'Linux' | nvidia-nvshmem-cu12==3.3.24; platform_system == 'Linux' | nvidia-nvtx-cu12==12.8.90; platform_system == 'Linux' | nvidia-nvjitlink-cu12==12.8.93; platform_system == 'Linux' | nvidia-cufile-cu12==1.13.1.3; platform_system == 'Linux'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.8.93; platform_system == 'Linux' | nvidia-cuda-runtime-cu12==12.8.90; platform_system == 'Linux' | nvidia-cuda-cupti-cu12==12.8.90; platform_system == 'Linux' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' | nvidia-cublas-cu12==12.8.4.1; platform_system == 'Linux' | nvidia-cufft-cu12==11.3.3.83; platform_system == 'Linux' | nvidia-curand-cu12==10.3.9.90; platform_system == 'Linux' | nvidia-cusolver-cu12==11.7.3.90; platform_system == 'Linux' | nvidia-cusparse-cu12==12.5.8.93; platform_system == 'Linux' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' | nvidia-nccl-cu12==2.28.3; platform_system == 'Linux' | nvidia-nvshmem-cu12==3.3.24; platform_system == 'Linux' | nvidia-nvtx-cu12==12.8.90; platform_system == 'Linux' | nvidia-nvjitlink-cu12==12.8.93; platform_system == 'Linux' | nvidia-cufile-cu12==1.13.1.3; platform_system == 'Linux'
timeout-minutes: 420
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
@ -1442,7 +1442,7 @@ jobs:
ALPINE_IMAGE: "arm64v8/alpine"
build_name: manywheel-py3_14t-cuda-aarch64-13_0
build_environment: linux-aarch64-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc==13.0.48; platform_system == 'Linux' | nvidia-cuda-runtime==13.0.48; platform_system == 'Linux' | nvidia-cuda-cupti==13.0.48; platform_system == 'Linux' | nvidia-cudnn-cu13==9.13.0.50; platform_system == 'Linux' | nvidia-cublas==13.0.0.19; platform_system == 'Linux' | nvidia-cufft==12.0.0.15; platform_system == 'Linux' | nvidia-curand==10.4.0.35; platform_system == 'Linux' | nvidia-cusolver==12.0.3.29; platform_system == 'Linux' | nvidia-cusparse==12.6.2.49; platform_system == 'Linux' | nvidia-cusparselt-cu13==0.8.0; platform_system == 'Linux' | nvidia-nccl-cu13==2.27.7; platform_system == 'Linux' | nvidia-nvshmem-cu13==3.3.24; platform_system == 'Linux' | nvidia-nvtx==13.0.39; platform_system == 'Linux' | nvidia-nvjitlink==13.0.39; platform_system == 'Linux' | nvidia-cufile==1.15.0.42; platform_system == 'Linux'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc==13.0.48; platform_system == 'Linux' | nvidia-cuda-runtime==13.0.48; platform_system == 'Linux' | nvidia-cuda-cupti==13.0.48; platform_system == 'Linux' | nvidia-cudnn-cu13==9.13.0.50; platform_system == 'Linux' | nvidia-cublas==13.0.0.19; platform_system == 'Linux' | nvidia-cufft==12.0.0.15; platform_system == 'Linux' | nvidia-curand==10.4.0.35; platform_system == 'Linux' | nvidia-cusolver==12.0.3.29; platform_system == 'Linux' | nvidia-cusparse==12.6.2.49; platform_system == 'Linux' | nvidia-cusparselt-cu13==0.8.0; platform_system == 'Linux' | nvidia-nccl-cu13==2.28.3; platform_system == 'Linux' | nvidia-nvshmem-cu13==3.3.24; platform_system == 'Linux' | nvidia-nvtx==13.0.39; platform_system == 'Linux' | nvidia-nvjitlink==13.0.39; platform_system == 'Linux' | nvidia-cufile==1.15.0.42; platform_system == 'Linux'
timeout-minutes: 420
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}


@ -316,121 +316,6 @@ jobs:
github-token: ${{ secrets.GITHUB_TOKEN }}
uses: ./.github/workflows/_binary-upload.yml
libtorch-rocm6_3-shared-with-deps-release-build:
if: ${{ github.repository_owner == 'pytorch' }}
uses: ./.github/workflows/_binary-build-linux.yml
needs: get-label-type
with:
PYTORCH_ROOT: /pytorch
PACKAGE_TYPE: libtorch
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: rocm6.3
GPU_ARCH_VERSION: "6.3"
GPU_ARCH_TYPE: rocm
DOCKER_IMAGE: libtorch-cxx11-builder
DOCKER_IMAGE_TAG_PREFIX: rocm6.3
LIBTORCH_CONFIG: release
LIBTORCH_VARIANT: shared-with-deps
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
timeout-minutes: 300
build_name: libtorch-rocm6_3-shared-with-deps-release
build_environment: linux-binary-libtorch
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
libtorch-rocm6_3-shared-with-deps-release-test: # Testing
if: ${{ github.repository_owner == 'pytorch' }}
needs:
- libtorch-rocm6_3-shared-with-deps-release-build
- get-label-type
runs-on: linux.rocm.gpu.mi250
timeout-minutes: 240
env:
PYTORCH_ROOT: /pytorch
PACKAGE_TYPE: libtorch
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: rocm6.3
GPU_ARCH_VERSION: "6.3"
GPU_ARCH_TYPE: rocm
SKIP_ALL_TESTS: 1
DOCKER_IMAGE: libtorch-cxx11-builder
DOCKER_IMAGE_TAG_PREFIX: rocm6.3
LIBTORCH_CONFIG: release
LIBTORCH_VARIANT: shared-with-deps
steps:
- name: Setup ROCm
uses: ./.github/actions/setup-rocm
- uses: actions/download-artifact@v4.1.7
name: Download Build Artifacts
with:
name: libtorch-rocm6_3-shared-with-deps-release
path: "${{ runner.temp }}/artifacts/"
- name: Checkout PyTorch
uses: actions/checkout@v4
with:
ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }}
submodules: recursive
path: pytorch
show-progress: false
- name: Clean PyTorch checkout
run: |
# Remove any artifacts from the previous checkouts
git clean -fxd
working-directory: pytorch
- name: ROCm set GPU_FLAG
run: |
echo "GPU_FLAG=--device=/dev/mem --device=/dev/kfd --device=/dev/dri --group-add video --group-add daemon" >> "${GITHUB_ENV}"
- name: configure aws credentials
id: aws_creds
if: ${{ startsWith(github.event.ref, 'refs/tags/ciflow/') }}
uses: aws-actions/configure-aws-credentials@v4
with:
role-to-assume: arn:aws:iam::308535385114:role/gha_workflow_s3_and_ecr_read_only
aws-region: us-east-1
role-duration-seconds: 18000
- name: Calculate docker image
id: calculate-docker-image
uses: pytorch/test-infra/.github/actions/calculate-docker-image@main
with:
docker-registry: ${{ startsWith(github.event.ref, 'refs/tags/ciflow/') && '308535385114.dkr.ecr.us-east-1.amazonaws.com' || 'docker.io' }}
docker-image-name: libtorch-cxx11-builder
custom-tag-prefix: rocm6.3
docker-build-dir: .ci/docker
working-directory: pytorch
- name: Pull Docker image
uses: pytorch/test-infra/.github/actions/pull-docker-image@main
with:
docker-image: ${{ steps.calculate-docker-image.outputs.docker-image }}
- name: Test Pytorch binary
uses: ./pytorch/.github/actions/test-pytorch-binary
env:
DOCKER_IMAGE: ${{ steps.calculate-docker-image.outputs.docker-image }}
- name: Teardown ROCm
uses: ./.github/actions/teardown-rocm
libtorch-rocm6_3-shared-with-deps-release-upload: # Uploading
if: ${{ github.repository_owner == 'pytorch' }}
permissions:
id-token: write
contents: read
needs: libtorch-rocm6_3-shared-with-deps-release-test
with:
PYTORCH_ROOT: /pytorch
PACKAGE_TYPE: libtorch
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: rocm6.3
GPU_ARCH_VERSION: "6.3"
GPU_ARCH_TYPE: rocm
DOCKER_IMAGE: libtorch-cxx11-builder
DOCKER_IMAGE_TAG_PREFIX: rocm6.3
LIBTORCH_CONFIG: release
LIBTORCH_VARIANT: shared-with-deps
build_name: libtorch-rocm6_3-shared-with-deps-release
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
uses: ./.github/workflows/_binary-upload.yml
libtorch-rocm6_4-shared-with-deps-release-build:
if: ${{ github.repository_owner == 'pytorch' }}
uses: ./.github/workflows/_binary-build-linux.yml
@ -545,3 +430,118 @@ jobs:
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
uses: ./.github/workflows/_binary-upload.yml
libtorch-rocm7_0-shared-with-deps-release-build:
if: ${{ github.repository_owner == 'pytorch' }}
uses: ./.github/workflows/_binary-build-linux.yml
needs: get-label-type
with:
PYTORCH_ROOT: /pytorch
PACKAGE_TYPE: libtorch
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: rocm7.0
GPU_ARCH_VERSION: "7.0"
GPU_ARCH_TYPE: rocm
DOCKER_IMAGE: libtorch-cxx11-builder
DOCKER_IMAGE_TAG_PREFIX: rocm7.0
LIBTORCH_CONFIG: release
LIBTORCH_VARIANT: shared-with-deps
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
timeout-minutes: 300
build_name: libtorch-rocm7_0-shared-with-deps-release
build_environment: linux-binary-libtorch
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
libtorch-rocm7_0-shared-with-deps-release-test: # Testing
if: ${{ github.repository_owner == 'pytorch' }}
needs:
- libtorch-rocm7_0-shared-with-deps-release-build
- get-label-type
runs-on: linux.rocm.gpu.mi250
timeout-minutes: 240
env:
PYTORCH_ROOT: /pytorch
PACKAGE_TYPE: libtorch
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: rocm7.0
GPU_ARCH_VERSION: "7.0"
GPU_ARCH_TYPE: rocm
SKIP_ALL_TESTS: 1
DOCKER_IMAGE: libtorch-cxx11-builder
DOCKER_IMAGE_TAG_PREFIX: rocm7.0
LIBTORCH_CONFIG: release
LIBTORCH_VARIANT: shared-with-deps
steps:
- name: Setup ROCm
uses: ./.github/actions/setup-rocm
- uses: actions/download-artifact@v4.1.7
name: Download Build Artifacts
with:
name: libtorch-rocm7_0-shared-with-deps-release
path: "${{ runner.temp }}/artifacts/"
- name: Checkout PyTorch
uses: actions/checkout@v4
with:
ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }}
submodules: recursive
path: pytorch
show-progress: false
- name: Clean PyTorch checkout
run: |
# Remove any artifacts from the previous checkouts
git clean -fxd
working-directory: pytorch
- name: ROCm set GPU_FLAG
run: |
echo "GPU_FLAG=--device=/dev/mem --device=/dev/kfd --device=/dev/dri --group-add video --group-add daemon" >> "${GITHUB_ENV}"
- name: configure aws credentials
id: aws_creds
if: ${{ startsWith(github.event.ref, 'refs/tags/ciflow/') }}
uses: aws-actions/configure-aws-credentials@v4
with:
role-to-assume: arn:aws:iam::308535385114:role/gha_workflow_s3_and_ecr_read_only
aws-region: us-east-1
role-duration-seconds: 18000
- name: Calculate docker image
id: calculate-docker-image
uses: pytorch/test-infra/.github/actions/calculate-docker-image@main
with:
docker-registry: ${{ startsWith(github.event.ref, 'refs/tags/ciflow/') && '308535385114.dkr.ecr.us-east-1.amazonaws.com' || 'docker.io' }}
docker-image-name: libtorch-cxx11-builder
custom-tag-prefix: rocm7.0
docker-build-dir: .ci/docker
working-directory: pytorch
- name: Pull Docker image
uses: pytorch/test-infra/.github/actions/pull-docker-image@main
with:
docker-image: ${{ steps.calculate-docker-image.outputs.docker-image }}
- name: Test Pytorch binary
uses: ./pytorch/.github/actions/test-pytorch-binary
env:
DOCKER_IMAGE: ${{ steps.calculate-docker-image.outputs.docker-image }}
- name: Teardown ROCm
uses: ./.github/actions/teardown-rocm
libtorch-rocm7_0-shared-with-deps-release-upload: # Uploading
if: ${{ github.repository_owner == 'pytorch' }}
permissions:
id-token: write
contents: read
needs: libtorch-rocm7_0-shared-with-deps-release-test
with:
PYTORCH_ROOT: /pytorch
PACKAGE_TYPE: libtorch
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: rocm7.0
GPU_ARCH_VERSION: "7.0"
GPU_ARCH_TYPE: rocm
DOCKER_IMAGE: libtorch-cxx11-builder
DOCKER_IMAGE_TAG_PREFIX: rocm7.0
LIBTORCH_CONFIG: release
LIBTORCH_VARIANT: shared-with-deps
build_name: libtorch-rocm7_0-shared-with-deps-release
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
uses: ./.github/workflows/_binary-upload.yml


@ -60,7 +60,7 @@ jobs:
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
build_name: manywheel-py3_12-cuda13_0
build_environment: linux-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc==13.0.48; platform_system == 'Linux' | nvidia-cuda-runtime==13.0.48; platform_system == 'Linux' | nvidia-cuda-cupti==13.0.48; platform_system == 'Linux' | nvidia-cudnn-cu13==9.13.0.50; platform_system == 'Linux' | nvidia-cublas==13.0.0.19; platform_system == 'Linux' | nvidia-cufft==12.0.0.15; platform_system == 'Linux' | nvidia-curand==10.4.0.35; platform_system == 'Linux' | nvidia-cusolver==12.0.3.29; platform_system == 'Linux' | nvidia-cusparse==12.6.2.49; platform_system == 'Linux' | nvidia-cusparselt-cu13==0.8.0; platform_system == 'Linux' | nvidia-nccl-cu13==2.27.7; platform_system == 'Linux' | nvidia-nvshmem-cu13==3.3.24; platform_system == 'Linux' | nvidia-nvtx==13.0.39; platform_system == 'Linux' | nvidia-nvjitlink==13.0.39; platform_system == 'Linux' | nvidia-cufile==1.15.0.42; platform_system == 'Linux'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc==13.0.48; platform_system == 'Linux' | nvidia-cuda-runtime==13.0.48; platform_system == 'Linux' | nvidia-cuda-cupti==13.0.48; platform_system == 'Linux' | nvidia-cudnn-cu13==9.13.0.50; platform_system == 'Linux' | nvidia-cublas==13.0.0.19; platform_system == 'Linux' | nvidia-cufft==12.0.0.15; platform_system == 'Linux' | nvidia-curand==10.4.0.35; platform_system == 'Linux' | nvidia-cusolver==12.0.3.29; platform_system == 'Linux' | nvidia-cusparse==12.6.2.49; platform_system == 'Linux' | nvidia-cusparselt-cu13==0.8.0; platform_system == 'Linux' | nvidia-nccl-cu13==2.28.3; platform_system == 'Linux' | nvidia-nvshmem-cu13==3.3.24; platform_system == 'Linux' | nvidia-nvtx==13.0.39; platform_system == 'Linux' | nvidia-nvjitlink==13.0.39; platform_system == 'Linux' | nvidia-cufile==1.15.0.42; platform_system == 'Linux'
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_12-cuda13_0-test: # Testing

File diff suppressed because it is too large


@ -1453,7 +1453,7 @@ init_command = [
'--dry-run={{DRYRUN}}',
'usort==1.0.8.post1',
'isort==6.0.1',
'ruff==0.12.9', # sync with RUFF
'ruff==0.13.1', # sync with RUFF
]
is_formatter = true
@ -1587,7 +1587,7 @@ init_command = [
'python3',
'tools/linter/adapters/pip_init.py',
'--dry-run={{DRYRUN}}',
'ruff==0.12.9', # sync with PYFMT
'ruff==0.13.1', # sync with PYFMT
]
is_formatter = true


@ -442,7 +442,7 @@ if(WIN32)
message(
WARNING
"Libuv is not installed in current conda env. Set USE_DISTRIBUTED to OFF. "
"Please run command 'conda install -c conda-forge libuv=1.39' to install libuv."
"Please run command 'conda install -c conda-forge libuv=1.51' to install libuv."
)
else()
set(ENV{libuv_ROOT} ${libuv_tmp_LIBRARY}/../../)


@ -275,7 +275,7 @@ conda install pkg-config libuv
pip install mkl-static mkl-include
# Add these packages if torch.distributed is needed.
# Distributed package support on Windows is a prototype feature and is subject to changes.
conda install -c conda-forge libuv
conda install -c conda-forge libuv=1.51
```
#### Install PyTorch


@ -468,7 +468,7 @@ inline Tensor _sum_to(
// if we assume no reduction due to unbacked we ensure that at runtime.
TORCH_MAYBE_SYM_CHECK(
sym_eq(shape[i - leading_dims], sizes[i]),
"non-reduction path was assumed due to unabcked symbols expected those two sizes to be the same:",
"non-reduction path was assumed due to unbacked symbols expected those two sizes to be the same:",
shape[i - leading_dims],
", ",
sizes[i])


@ -45,7 +45,39 @@ inline void infer_size_impl(
}
}
auto set_infer_dim = [&]() {
if (infer_dim) {
// numel is the product of the known sizes, so it has to be divisible by
// newsize, and newsize should be positive unless newsize == numel (we
// throw a different error message in that case).
if constexpr (std::is_same_v<NumelType, c10::SymInt>) {
auto v = newsize.maybe_as_int();
if (v and *v == 0) {
// Avoid div by 0 when sym_eq(numel % newsize, 0) is constructed,
// which may happen when newsize is not a symbol; if it is a symbol,
// the division won't happen during compile anyway.
TORCH_MAYBE_SYM_CHECK(
numel == newsize,
"shape '",
shape,
"' is invalid for input of size ",
numel);
} else {
auto cond = sym_gt(newsize, 0)
.sym_and(sym_eq(numel % newsize, 0))
.sym_or(sym_eq(numel, newsize));
TORCH_MAYBE_SYM_CHECK(
cond, "shape '", shape, "' is invalid for input of size ", numel);
}
} else {
TORCH_CHECK(
(newsize > 0 && (numel % newsize == 0)) || numel == newsize,
"shape '",
shape,
"' is invalid for input of size ",
numel);
}
// We have a degree of freedom here to select the dimension size; follow
// NumPy semantics and just bail. However, a nice error message is needed
// because users often use `view` as a way to flatten & unflatten
@ -54,18 +86,14 @@ inline void infer_size_impl(
// works yet
// empty_tensor.view(-1, 0)
// doesn't.
TORCH_CHECK(
TORCH_MAYBE_SYM_CHECK(
newsize != 0,
"cannot reshape tensor of 0 elements into shape ",
shape,
" because the unspecified dimension size -1 can be any "
"value and is ambiguous");
res[*infer_dim] = numel / newsize;
return;
};
if (infer_dim && newsize > 0 && numel % newsize == 0) {
set_infer_dim();
res[*infer_dim] = numel / newsize;
return;
}
@ -75,9 +103,6 @@ inline void infer_size_impl(
shape,
"' is invalid for input of size ",
numel);
if (infer_dim) {
set_infer_dim();
}
}
inline std::vector<int64_t> infer_size(IntArrayRef shape, int64_t numel) {
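For reference, the user-visible behavior these checks enforce can be exercised through `Tensor.view`; a minimal Python sketch (plain eager mode, no symbolic shapes) of the three cases the hunk above distinguishes:

```python
import torch

x = torch.arange(12)

# -1 is inferred only when the known sizes divide the element count.
print(x.view(3, -1).shape)  # torch.Size([3, 4])

# Non-divisible shapes are rejected with the "invalid for input of size" error.
try:
    x.view(5, -1)
except RuntimeError as e:
    print("rejected:", e)

# With zero elements, -1 is ambiguous (any size would "work"), so view refuses,
# matching the empty_tensor.view(-1, 0) case called out in the comment above.
try:
    torch.empty(0).view(-1, 0)
except RuntimeError as e:
    print("ambiguous:", e)
```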

View File

@ -1,32 +1,22 @@
#include <ATen/core/PythonOpRegistrationTrampoline.h>
#include <c10/core/impl/PyInterpreterHooks.h>
// TODO: delete this
namespace at::impl {
// The strategy is that all python interpreters attempt to register themselves
// as the main interpreter, but only one wins. Only that interpreter is
// allowed to interact with the C++ dispatcher. Furthermore, when we execute
// logic on that interpreter, we do so hermetically, never setting pyobj field
// on Tensor.
std::atomic<c10::impl::PyInterpreter*>
PythonOpRegistrationTrampoline::interpreter_{nullptr};
c10::impl::PyInterpreter* PythonOpRegistrationTrampoline::interpreter_ = nullptr;
c10::impl::PyInterpreter* PythonOpRegistrationTrampoline::getInterpreter() {
return PythonOpRegistrationTrampoline::interpreter_.load();
return c10::impl::getGlobalPyInterpreter();
}
bool PythonOpRegistrationTrampoline::registerInterpreter(
c10::impl::PyInterpreter* interp) {
c10::impl::PyInterpreter* expected = nullptr;
interpreter_.compare_exchange_strong(expected, interp);
if (expected != nullptr) {
// This is the second (or later) Python interpreter, which means we need
// non-trivial hermetic PyObject TLS
c10::impl::HermeticPyObjectTLS::init_state();
if (interpreter_ != nullptr) {
return false;
} else {
return true;
}
interpreter_ = interp;
return true;
}
} // namespace at::impl

View File

@ -2,19 +2,21 @@
#include <ATen/core/dispatch/Dispatcher.h>
// TODO: this can probably live in c10
// TODO: We can get rid of this
namespace at::impl {
// Manages the single Python interpreter instance for PyTorch.
class TORCH_API PythonOpRegistrationTrampoline final {
static std::atomic<c10::impl::PyInterpreter*> interpreter_;
static c10::impl::PyInterpreter* interpreter_;
public:
// Returns true if you successfully registered yourself (that means
// you are in the hot seat for doing the operator registrations!)
// Register the Python interpreter. Returns true on first registration,
// false if an interpreter was already registered.
static bool registerInterpreter(c10::impl::PyInterpreter*);
// Returns the registered interpreter via the global PyInterpreter hooks.
// Returns nullptr if no interpreter has been registered yet.
static c10::impl::PyInterpreter* getInterpreter();
};

View File

@ -149,5 +149,105 @@ static inline void pack_vnni4(
#endif
}
// This is a helper function for transpose_pack_vnni4
// Transform a [4, 16] block (with incontiguous output)
// Src:
// a1 a2 a3 a4 a5 a6 a7 a8 a9 a10 a11 a12 a13 a14 a15 a16
// b1 b2 b3 b4 b5 b6 b7 b8 b9 b10 b11 b12 b13 b14 b15 b16
// c1 c2 c3 c4 c5 c6 c7 c8 c9 c10 c11 c12 c13 c14 c15 c16
// d1 d2 d3 d4 d5 d6 d7 d8 d9 d10 d11 d12 d13 d14 d15 d16
// Dst:
// a1 a2 a3 a4 b1 b2 b3 b4 c1 c2 c3 c4 d1 d2 d3 d4
// a5 a6 a7 a8 b5 b6 b7 b8 c5 c6 c7 c8 d5 d6 d7 d8
// a9 a10 a11 a12 b9 b10 b11 b12 c9 c10 c11 c12 d9 d10 d11 d12
// a13 a14 a15 a16 b13 b14 b15 b16 c13 c14 c15 c16 d13 d14 d15 d16
template <typename scalar_t, typename = std::enable_if_t<sizeof(scalar_t) == 1>>
static inline void transpose_vnni4_pad_4x16_block(
const scalar_t* src,
scalar_t* dst,
int64_t ld_src,
int64_t ld_dst,
int krem = 4) {
#if defined(CPU_CAPABILITY_AVX512)
__m128i r[4];
for (int i = 0; i < krem; ++i) {
r[i] = _mm_loadu_si128(reinterpret_cast<const __m128i*>(src + i * ld_src));
}
for (int i = krem; i < 4; ++i) {
r[i] = _mm_setzero_si128();
}
// Transpose 4x16 bytes using unpack and shuffle
__m128i t0 = _mm_unpacklo_epi32(r[0], r[1]);
__m128i t1 = _mm_unpackhi_epi32(r[0], r[1]);
__m128i t2 = _mm_unpacklo_epi32(r[2], r[3]);
__m128i t3 = _mm_unpackhi_epi32(r[2], r[3]);
__m128i r0 = _mm_unpacklo_epi64(t0, t2);
__m128i r1 = _mm_unpackhi_epi64(t0, t2);
__m128i r2 = _mm_unpacklo_epi64(t1, t3);
__m128i r3 = _mm_unpackhi_epi64(t1, t3);
// Store output
if (krem == 4) {
// normal case
_mm_storeu_si128(reinterpret_cast<__m128i*>(dst), r0);
_mm_storeu_si128(reinterpret_cast<__m128i*>(dst + ld_dst), r1);
_mm_storeu_si128(reinterpret_cast<__m128i*>(dst + ld_dst * 2), r2);
_mm_storeu_si128(reinterpret_cast<__m128i*>(dst + ld_dst * 3), r3);
} else {
// masked case
__mmask16 mask = (1ULL << (krem * 4)) - 1;
_mm_mask_storeu_epi8(dst, mask, r0);
_mm_mask_storeu_epi8(reinterpret_cast<__m128i*>(dst + ld_dst), mask, r1);
_mm_mask_storeu_epi8(
reinterpret_cast<__m128i*>(dst + ld_dst * 2), mask, r2);
_mm_mask_storeu_epi8(
reinterpret_cast<__m128i*>(dst + ld_dst * 3), mask, r3);
}
#else
TORCH_CHECK(
false,
"transpose_vnni4_pad_4x16_block is only supported when AVX-512 is supported")
#endif
}
// Do the transpose packing fusion with VNNI4
// Reorder [K, N] → [N/4, K, 4] (VNNI4-style layout for bit8)
template <typename scalar_t, typename = std::enable_if_t<sizeof(scalar_t) == 1>>
static inline void transpose_pack_vnni4(
const scalar_t* src,
scalar_t* dst,
int64_t ld_src,
int64_t K,
int64_t N) {
#if defined(CPU_CAPABILITY_AVX512)
TORCH_CHECK(
N % 16 == 0, "N needs to be multiple of 16 for transpose_pack_vnni4");
int64_t bk = 0;
int64_t _K = K / 4 * 4;
for (; bk < _K; bk += 4) {
int64_t bn = 0;
for (; bn < N; bn += 16) {
transpose_vnni4_pad_4x16_block(
src + bk * ld_src + bn, dst + bn * K + bk * 4, ld_src, K * 4);
}
}
// Handle leftover K rows (< 4)
if (K % 4 != 0) {
int krem = K - bk;
int64_t bn = 0;
for (; bn < N; bn += 16) {
transpose_vnni4_pad_4x16_block(
src + bk * ld_src + bn, dst + bn * K + bk * 4, ld_src, K * 4, krem);
}
}
#else
TORCH_CHECK(
false, "transpose_pack_vnni4 is only supported when AVX-512 is supported")
#endif
}
} // namespace CPU_CAPABILITY
} // namespace at::vec
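As a cross-check on the layout described in the comments above, here is a NumPy sketch of the same [K, N] → [N/4, K, 4] repack (indices only; it does not model the AVX-512 masking or the kernel's N % 16 requirement):

```python
import numpy as np

def transpose_pack_vnni4_ref(src: np.ndarray) -> np.ndarray:
    # Group every 4 consecutive columns, then move that group axis ahead of K,
    # so dst[n, k, l] == src[k, n * 4 + l].
    K, N = src.shape
    assert N % 4 == 0
    return src.reshape(K, N // 4, 4).transpose(1, 0, 2)

K, N = 7, 32
x = np.random.randint(-100, 100, size=(K, N), dtype=np.int8)
y = transpose_pack_vnni4_ref(x)
assert y[3, 5, 2] == x[5, 3 * 4 + 2]  # spot-check the index mapping
```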

View File

@ -281,6 +281,9 @@ bool CUDAHooks::compiledWithMIOpen() const {
bool CUDAHooks::supportsDilatedConvolutionWithCuDNN() const {
#if AT_CUDNN_ENABLED()
if (!hasCUDA()) {
return false;
}
// NOTE: extra parenthesis around numbers disable clang warnings about
// dead code
return true;
@ -291,6 +294,9 @@ bool CUDAHooks::supportsDilatedConvolutionWithCuDNN() const {
bool CUDAHooks::supportsDepthwiseConvolutionWithCuDNN() const {
#if AT_CUDNN_ENABLED()
if (!hasCUDA()) {
return false;
}
cudaDeviceProp* prop = at::cuda::getCurrentDeviceProperties();
// Check for Volta cores
if (prop->major >= 7) {
@ -305,6 +311,9 @@ bool CUDAHooks::supportsDepthwiseConvolutionWithCuDNN() const {
bool CUDAHooks::supportsBFloat16ConvolutionWithCuDNNv8() const {
#if AT_CUDNN_ENABLED()
if (!hasCUDA()) {
return false;
}
cudaDeviceProp* prop = at::cuda::getCurrentDeviceProperties();
// Check for Volta cores
if (prop->major >= 8) {

View File

@ -465,8 +465,11 @@ inline bool mps_conv_use_channels_last(const at::Tensor& input, const at::Tensor
return false;
}
auto fmt = input.suggest_memory_format();
return fmt == at::MemoryFormat::ChannelsLast || fmt == at::MemoryFormat::ChannelsLast3d;
auto is_channel_last = [](const at::Tensor& t) {
auto fmt = t.suggest_memory_format();
return fmt == at::MemoryFormat::ChannelsLast || fmt == at::MemoryFormat::ChannelsLast3d;
};
return is_channel_last(input) || is_channel_last(weight);
}
} // namespace at::native
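There is no direct Python binding for `suggest_memory_format`, but the closest public check for the rule above (either operand being channels-last now triggers the channels-last path) is `is_contiguous` with an explicit memory format; a small sketch:

```python
import torch

t = torch.randn(2, 3, 8, 8).to(memory_format=torch.channels_last)
print(t.is_contiguous(memory_format=torch.channels_last))  # True
print(t.is_contiguous())                                   # False (not NCHW-contiguous)
```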

View File

@ -32,10 +32,6 @@
#include <ATen/native/mkldnn/Utils.h>
#endif
#ifdef USE_MPS
#include <ATen/mps/MPSDevice.h>
#endif
#ifndef AT_PER_OPERATOR_HEADERS
#include <ATen/Functions.h>
#include <ATen/NativeFunctions.h>
@ -410,11 +406,23 @@ struct ConvParams {
// cudnn and miopen are guaranteed not to be on mobile, and T102591915 / T110194934 suggest
// that maybe the compiledWithCuDNN() check sometimes segfaults (though I can't imagine how)
#if !defined(C10_MOBILE)
if (!detail::getCUDAHooks().compiledWithCuDNN()) {
if (!detail::getCUDAHooks().compiledWithCuDNN() || !input.is_cuda() || !cudnn_enabled) {
return false;
}
static long cudnn_version = detail::getCUDAHooks().versionCuDNN();
// broken on cuDNN 9.8
if (cudnn_version >= 90800) {
if (cudnn_conv_suggest_memory_format(input, weight) == at::MemoryFormat::Contiguous &&
(input.scalar_type() == at::kBFloat16 || input.scalar_type() == at::kHalf) &&
weight.dim() == 5) {
for (int i = 2; i < weight.dim(); i++) {
if (weight.size(i) != 1) {
return false;
}
}
}
}
if (needs_64bit_indexing_no_split(input, weight)) {
static long cudnn_version = detail::getCUDAHooks().versionCuDNN();
if (!(cudnn_version >= 90300 && at::native::cudnnv8_enabled_check_debug())) {
TORCH_WARN_ONCE("cuDNN cannot be used for large non-batch-splittable convolutions"
" if the V8 API is not enabled or before cuDNN version 9.3+."
@ -422,9 +430,6 @@ struct ConvParams {
return false;
}
}
if (!input.is_cuda() || !cudnn_enabled) {
return false;
}
if (input.scalar_type() == at::kBFloat16 || weight.scalar_type() == at::kBFloat16) {
if (!(detail::getCUDAHooks().supportsBFloat16ConvolutionWithCuDNNv8() && at::native::cudnnv8_enabled_check_debug())) {
return false;
@ -443,16 +448,19 @@ struct ConvParams {
// Use cudnn for FP16 depthwise convolutions
bool use_cudnn_depthwise(const at::Tensor& input, const at::Tensor& weight) const {
if (!detail::getCUDAHooks().compiledWithCuDNN()) {
if (!cudnn_enabled || !detail::getCUDAHooks().compiledWithCuDNN() || !input.is_cuda()) {
return false;
}
if (cudnn_conv_suggest_memory_format(input, weight) != at::MemoryFormat::Contiguous && use_cudnn(input, weight)) {
// always use cudnn_depthwise for channels_last format
return true;
}
// native kernel doesn't support 64-bit non-splittable case
if (cudnn_enabled && !(canUse32BitIndexMath(input) && canUse32BitIndexMath(weight))) {
if (!(canUse32BitIndexMath(input) && canUse32BitIndexMath(weight))) {
static long cudnn_version = detail::getCUDAHooks().compiledWithCuDNN() ? detail::getCUDAHooks().versionCuDNN() : -1;
// TODO(eqy): remove this once cuDNN fixes 64-bit depthwise support, first broken in 9.11x
if (cudnn_conv_suggest_memory_format(input, weight) != at::MemoryFormat::Contiguous) {
if (cudnn_version < 0 || cudnn_version > 91000) {
return false;
}
}
if (!(cudnn_version >= 90300 && at::native::cudnnv8_enabled_check_debug())) {
TORCH_WARN_ONCE("cuDNN cannot be used for large non-batch-splittable convolutions"
" if the V8 API is not enabled or before cuDNN version 9.3+."
@ -462,6 +470,10 @@ struct ConvParams {
return true;
}
}
if (cudnn_conv_suggest_memory_format(input, weight) != at::MemoryFormat::Contiguous) {
// always use cudnn_depthwise for channels_last format
return true;
}
if (detail::getCUDAHooks().supportsDepthwiseConvolutionWithCuDNN()) {
bool kernel_cond = (use_cudnn(input, weight) &&
input.scalar_type() == kHalf && // only for FP16
@ -1429,12 +1441,8 @@ static inline at::MemoryFormat determine_backend_memory_format(
}
break;
case ConvBackend::Mps:
case ConvBackend::MpsTranspose:
if (mps_conv_use_channels_last(input, weight)) {
#ifdef USE_MPS
if (!mps::is_macos_13_or_newer(mps::MacOSVersion::MACOS_VER_15_0_PLUS)) {
break;
}
#endif
backend_memory_format = (k == 5) ? MemoryFormat::ChannelsLast3d : MemoryFormat::ChannelsLast;
}
break;
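The `use_cudnn()` / `use_cudnn_depthwise()` gates above are internal, but they roughly track these user-visible knobs; a hedged sketch (assuming a CUDA build with cuDNN) of the fp16 depthwise, channels-last case the reordered checks favor:

```python
import torch

if torch.cuda.is_available() and torch.backends.cudnn.is_available():
    print("cuDNN version:", torch.backends.cudnn.version())
    conv = torch.nn.Conv2d(32, 32, 3, padding=1, groups=32, bias=False)
    conv = conv.to("cuda", torch.half)
    x = torch.randn(8, 32, 64, 64, device="cuda", dtype=torch.half).to(
        memory_format=torch.channels_last
    )
    y = conv(x)  # depthwise, fp16, channels-last: the path steered toward cuDNN above
    print(y.shape)
```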

View File

@ -9,6 +9,7 @@
#include <ATen/native/TransposeType.h>
#include <ATen/native/Unfold3d.h>
#include <c10/util/irange.h>
#include <c10/util/safe_numerics.h>
#ifndef AT_PER_OPERATOR_HEADERS
#include <ATen/Functions.h>
@ -174,6 +175,23 @@ static inline void slow_conv3d_shape_check(
const int64_t input_height = input.size(dim_height);
const int64_t input_width = input.size(dim_width);
constexpr int64_t MAX_SAFE_PAD = (1LL << 61);
TORCH_CHECK_VALUE(
pad_height <= MAX_SAFE_PAD,
"Padding height too large: pad_height=",
pad_height);
TORCH_CHECK_VALUE(
pad_width <= MAX_SAFE_PAD,
"Padding width too large: pad_width=",
pad_width);
TORCH_CHECK_VALUE(
pad_depth <= MAX_SAFE_PAD,
"Padding depth too large: pad_depth=",
pad_depth);
const int64_t exact_input_depth = input_depth + 2 * pad_depth;
const int64_t exact_input_height = input_height + 2 * pad_height;
const int64_t exact_input_width = input_width + 2 * pad_width;
@ -221,6 +239,14 @@ static inline void slow_conv3d_shape_check(
output_width,
"). Output size is too small");
uint64_t kernel_product;
TORCH_CHECK(
!c10::mul_overflows(kernel_height, kernel_width, &kernel_product),
"Kernel height x width product is too large: kernel_height=",
kernel_height,
", kernel_width=",
kernel_width);
if (weight.defined()) {
int64_t n_input_plane = weight.size(1);
if (weight.dim() == 2) {
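The new guards rely on `c10::mul_overflows` and a 2^61 padding cap; a small Python sketch of the same idea (illustrative only, not the c10 API, and assuming non-negative operands):

```python
INT64_MAX = (1 << 63) - 1
MAX_SAFE_PAD = 1 << 61  # leaves headroom for input_size + 2 * pad in int64

def mul_overflows_i64(a: int, b: int) -> bool:
    # Python ints are unbounded, so the product can be compared directly.
    return a * b > INT64_MAX

print(mul_overflows_i64(1 << 32, 1 << 32))  # True: kernel area would overflow
print(mul_overflows_i64(3, 3))              # False
print((1 << 62) > MAX_SAFE_PAD)             # True: such a pad is now rejected
```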

View File

@ -23,6 +23,7 @@
#include <ATen/ops/linspace.h>
#endif
#include <cmath>
#include <numeric>
#include <tuple>
#include <vector>
@ -202,6 +203,46 @@ select_outer_bin_edges(const Tensor& input, std::optional<c10::ArrayRef<double>>
return std::make_pair(leftmost_edges, rightmost_edges);
}
/* Bin edge correction based on the precision of the representation.
 * To maintain backward compatibility, for float/double we take
 * min(std::nexttoward, edge - 1) for the left edge and
 * max(std::nexttoward, edge + 1) for the right edge.
 * Other types keep the plain +/- 1 adjustment.
 */
void bins_edges_correction(const ScalarType& t, double &leftmost_edge, double &rightmost_edge)
{
#define UPDATE_WITH_LIMIT(real_type, scalartype) \
case ScalarType::scalartype: \
leftmost_edge = std::min( \
static_cast<double>( \
std::nexttoward( \
static_cast<real_type>(leftmost_edge), \
std::numeric_limits<real_type>::lowest() \
) \
), \
leftmost_edge - 1. \
); \
rightmost_edge = std::max( \
static_cast<double>( \
std::nexttoward( \
static_cast<real_type>(rightmost_edge), \
std::numeric_limits<real_type>::max() \
) \
), \
rightmost_edge + 1. \
); \
break;
switch (t) {
UPDATE_WITH_LIMIT(double, Double)
UPDATE_WITH_LIMIT(float, Float)
default:
// Fallback to the default behavior for other types
leftmost_edge -= 1;
rightmost_edge += 1;
}
#undef UPDATE_WITH_LIMIT
}
/* histc's version of the logic for outermost bin edges.
*/
std::pair<double, double> histc_select_outer_bin_edges(const Tensor& input,
@ -216,8 +257,7 @@ std::pair<double, double> histc_select_outer_bin_edges(const Tensor& input,
}
if (leftmost_edge == rightmost_edge) {
leftmost_edge -= 1;
rightmost_edge += 1;
bins_edges_correction(input.dtype().toScalarType(), leftmost_edge, rightmost_edge);
}
TORCH_CHECK(!(std::isinf(leftmost_edge) || std::isinf(rightmost_edge) ||
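A Python sketch of the widening rule introduced above for float/double inputs (other dtypes keep the plain ±1 fallback); `math.nextafter` plays the role of `std::nexttoward`:

```python
import math

def widen_degenerate_range(lo: float, hi: float) -> tuple[float, float]:
    # If every input shares one value, widen the range so at least one bin exists.
    if lo == hi:
        lo = min(math.nextafter(lo, -math.inf), lo - 1.0)
        hi = max(math.nextafter(hi, math.inf), hi + 1.0)
    return lo, hi

print(widen_degenerate_range(0.5, 0.5))      # small values: the +/-1 term wins
print(widen_degenerate_range(1e300, 1e300))  # huge values: nextafter still widens
```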

View File

@ -42,6 +42,19 @@ void bfloat16_copy_kernel_cuda(TensorIteratorBase &iter) {
});
}
#ifdef USE_ROCM
void bfloat16tofloat32_copy_kernel_cuda(TensorIteratorBase &iter) {
gpu_kernel_nocast(iter, [] GPU_LAMBDA(at::BFloat16 value) {
return static_cast<float>(value);
});
}
void float16tofloat32_copy_kernel_cuda(TensorIteratorBase &iter) {
gpu_kernel_nocast(iter, [] GPU_LAMBDA(at::Half value) {
return static_cast<float>(value);
});
}
#endif
void float8_copy_kernel_cuda(TensorIteratorBase &iter) {
ScalarType dtype = iter.dtype(0);
ScalarType other_dtype = iter.dtype(1);
@ -187,7 +200,17 @@ void direct_copy_kernel_cuda(TensorIteratorBase &iter) {
} else {
float16_copy_kernel_cuda(iter);
}
} else if (isBitsType(dtype)) {
}
#ifdef USE_ROCM
else if ((iter.dtype(1) == kBFloat16 || iter.dtype(1) == kHalf) && dtype == kFloat) {
if (iter.dtype(1) == kBFloat16) {
bfloat16tofloat32_copy_kernel_cuda(iter);
} else {
float16tofloat32_copy_kernel_cuda(iter);
}
}
#endif
else if (isBitsType(dtype)) {
TORCH_CHECK(dtype == iter.dtype(1), "copy_() does not support casting "
"bits types to different bits types. Source dtype is ", iter.dtype(1), "target dtype is ", dtype);
AT_DISPATCH_BIT_TYPES(dtype, "copy_", [&] {
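The pattern the new ROCm fast path targets is an ordinary `copy_` from a bf16/fp16 source into an fp32 destination; the call itself looks the same on any device (shown on CPU here for portability):

```python
import torch

src = torch.randn(1024, dtype=torch.bfloat16)
dst = torch.empty(1024, dtype=torch.float32)
dst.copy_(src)  # upcast happens inside the copy kernel
assert torch.equal(dst, src.to(torch.float32))  # bf16 -> fp32 is exact
```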

View File

@ -1238,7 +1238,7 @@ Tensor _cholesky_solve_helper_cuda_magma(const Tensor& self, const Tensor& A, bo
// Todo: cusolverDn<T>potrsBatched only supports nrhs == 1 and does not have good performance.
// Batched cholesky_solve is dispatched to magma.
Tensor _cholesky_solve_helper_cuda(const Tensor& self, const Tensor& A, bool upper) {
#if defined(USE_LINALG_SOLVER) && !defined(USE_ROCM)
#if defined(USE_LINALG_SOLVER)
auto preferred_backend = at::globalContext().linalgPreferredBackend();
switch (preferred_backend) {
case at::LinalgBackend::Cusolver:
@ -1352,7 +1352,7 @@ void cholesky_helper_magma(const Tensor& input, bool upper, const Tensor& info)
}
static void cholesky_kernel(const Tensor& input, const Tensor& info, bool upper) {
#if defined(USE_LINALG_SOLVER) && !defined(USE_ROCM)
#if defined(USE_LINALG_SOLVER)
auto preferred_backend = at::globalContext().linalgPreferredBackend();
switch (preferred_backend) {
case at::LinalgBackend::Cusolver:
@ -2709,7 +2709,7 @@ void linalg_lstsq_gels(const Tensor& A, const Tensor& B, const Tensor& /*infos*/
}
void gels_looped(const Tensor& a, Tensor& b, Tensor& infos) {
#if defined(USE_LINALG_SOLVER) && !defined(USE_ROCM)
#if defined(USE_LINALG_SOLVER)
auto preferred_backend = at::globalContext().linalgPreferredBackend();
switch (preferred_backend) {
case at::LinalgBackend::Magma:
@ -2733,7 +2733,7 @@ void lstsq_kernel(const Tensor& a, Tensor& b, Tensor& /*rank*/, Tensor& /*singul
// first handle the underdetermined case (m < n)
// this case is not supported by MAGMA or cuBLAS
if (m < n) {
#if defined(USE_LINALG_SOLVER) && !defined(USE_ROCM)
#if defined(USE_LINALG_SOLVER)
linalg_lstsq_gels(a, b, infos);
#else
TORCH_CHECK(
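From Python, the backend choice above is steered by `torch.backends.cuda.preferred_linalg_library`; a hedged sketch (assuming a CUDA or ROCm build, where the same "cusolver" selector is expected to map to hipSOLVER):

```python
import torch

if torch.cuda.is_available():
    torch.backends.cuda.preferred_linalg_library("cusolver")
    a = torch.randn(64, 64, device="cuda", dtype=torch.float64)
    spd = a @ a.mT + 64 * torch.eye(64, device="cuda", dtype=torch.float64)
    L = torch.linalg.cholesky(spd)
    print(torch.allclose(L @ L.mT, spd, atol=1e-8))
```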

View File

@ -223,9 +223,6 @@ void grid_sampler_single_element(
auto input_size = input_sizes[input_dim];
auto coord = static_cast<opmath_t<T>>(coords[coord_dim]);
// Interpret nan as -1
coord = isnan(coord) ? -1 : coord;
if (!align_corners) {
// Map unaligned grid space to aligned grid space
auto corner_alignment_factor = static_cast<opmath_t<T>>(input_size) /

View File

@ -52,9 +52,7 @@ static void fill_depthwise_conv_desc(MPSGraphDepthwiseConvolution3DOpDescriptor*
NSUInteger dilationRateInX,
NSUInteger dilationRateInY,
NSUInteger paddingHorizontal,
NSUInteger paddingVertical,
c10::MemoryFormat memory_format,
NSUInteger groups) {
NSUInteger paddingVertical) {
descriptor_.strides =
@[ @1, [[NSNumber alloc] initWithInteger:strideInY], [[NSNumber alloc] initWithInteger:strideInX] ];
descriptor_.dilationRates =
@ -103,7 +101,7 @@ static void fill_conv_desc(MPSGraphConvolution2DOpDescriptor* descriptor_,
descriptor_.groups = groups;
}
static Tensor _mps_convolution_impl(const Tensor& input_t_,
static Tensor _mps_convolution_impl(const Tensor& input_t,
const Tensor& weight_t,
const std::optional<Tensor>& bias_opt,
IntArrayRef padding,
@ -111,12 +109,15 @@ static Tensor _mps_convolution_impl(const Tensor& input_t_,
IntArrayRef dilation,
int64_t groups,
std::optional<IntArrayRef> input_shape) {
const bool is_macOS_15_0_or_newer = is_macos_13_or_newer(MacOSVersion::MACOS_VER_15_0_PLUS);
Tensor input_t = input_t_;
bool is3DConv = input_t.dim() == 5;
if (!is_macOS_15_0_or_newer || is3DConv) {
input_t = input_t.contiguous();
}
constexpr auto kChannelsLast = MemoryFormat::ChannelsLast;
constexpr auto kContiguous = MemoryFormat::Contiguous;
const bool is_macos_15_plus = is_macos_13_or_newer(MacOSVersion::MACOS_VER_15_0_PLUS);
const bool is3DConv = input_t.dim() == 5;
const auto memory_format = input_t.suggest_memory_format();
const auto input_suggested_layout = memory_format == kChannelsLast && is_macos_15_plus ? kChannelsLast : kContiguous;
const bool is_channels_last = mps_conv_use_channels_last(input_t, weight_t) && !is3DConv;
const bool bias_defined = bias_opt ? bias_opt->defined() : false;
TORCH_CHECK(isFloatingType(input_t.scalar_type()), "Convolution is supported only for Floating types");
@ -126,15 +127,6 @@ static Tensor _mps_convolution_impl(const Tensor& input_t_,
checkAllSameType(c, {input, weight});
checkAllSameGPU(c, {input, weight});
bool bias_defined;
if (bias_opt == std::nullopt)
bias_defined = false;
else
bias_defined = bias_opt->defined();
auto memory_format = input_t.suggest_memory_format();
bool is_channels_last = (memory_format == at::MemoryFormat::ChannelsLast) && !is3DConv;
auto output_t =
at::empty(input_shape.has_value() ? input_shape.value()
: conv_output_size(input->sizes(), weight->sizes(), padding, stride, dilation),
@ -142,12 +134,18 @@ static Tensor _mps_convolution_impl(const Tensor& input_t_,
std::nullopt,
kMPS,
std::nullopt,
is_macOS_15_0_or_newer ? memory_format : MemoryFormat::Contiguous);
is_channels_last ? kChannelsLast : kContiguous);
if (output_t.numel() == 0) {
return output_t;
}
TensorArg output{output_t, "result", 0};
// TODO: Remove me when MacOS-14 is no longer supported
std::optional<Tensor> output_c;
if (!is_macos_15_plus && is_channels_last) {
output_c = at::empty_like(output_t, output_t.options().memory_format(kContiguous));
}
if (!is_macos_13_or_newer(MacOSVersion::MACOS_VER_15_1_PLUS)) {
// On macOS < 15.1, MPS convolution kernel does not support output channels > 2^16
for (auto elem : output_t.sizes()) {
@ -186,32 +184,22 @@ static Tensor _mps_convolution_impl(const Tensor& input_t_,
getArrayRefString(dilation),
getArrayRefString(padding),
groups,
is_channels_last,
input_suggested_layout == kChannelsLast,
mps::getTensorsStringKey({input_t, weight_t}),
bias_defined,
bias_shape_key);
MPSShape* inputShape = mps::getMPSShape(input_t, memory_format);
MPSShape* outputShape = mps::getMPSShape(output_t, memory_format);
MPSNDArray* inputNDArray = nil;
MPSNDArray* outputNDArray = nil;
if (input_t.is_contiguous(memory_format) && output_t.is_contiguous(memory_format) && is_macOS_15_0_or_newer) {
inputNDArray = getMPSNDArray(input_t, inputShape);
outputNDArray = getMPSNDArray(output_t, outputShape);
}
auto inputShape = mps::getMPSShape(input_t, input_suggested_layout);
auto outputShape = mps::getMPSShape(output_t, input_suggested_layout);
auto cachedGraph = LookUpOrCreateCachedGraph<CachedGraph>(key, [&](auto mpsGraph, auto newCachedGraph) {
MPSShape* weightShape = mps::getMPSShape(weight_t);
bool isDepthwiseConv = ((groups > 1 && (weightShape[1].intValue == 1)) && inputShape.count >= 4 &&
weightShape.count >= 4 && !is_channels_last);
bool isDepthwiseConv =
(groups > 1 && weight_t.size(1) == 1) && input_t.dim() >= 4 && weight_t.dim() >= 4 && !is_channels_last;
MPSGraphTensor* inputTensor =
mpsGraphRankedPlaceHolder(mpsGraph, getMPSScalarType(input_t.scalar_type()), inputShape);
MPSGraphTensor* weightTensor = mpsGraphRankedPlaceHolder(mpsGraph, weight_t);
MPSGraphTensor* outputTensor;
auto inputTensor = mpsGraphRankedPlaceHolder(mpsGraph, getMPSScalarType(input_t), inputShape);
auto weightTensor = mpsGraphRankedPlaceHolder(mpsGraph, weight_t);
MPSGraphTensor* outputTensor = nil;
if (is3DConv) {
MPSGraphConvolution3DOpDescriptor* conv3dDescriptor_ = [[MPSGraphConvolution3DOpDescriptor new] autorelease];
auto conv3dDescriptor_ = [[MPSGraphConvolution3DOpDescriptor new] autorelease];
fill_conv3d_desc(conv3dDescriptor_,
stride[2],
stride[1],
@ -229,17 +217,9 @@ static Tensor _mps_convolution_impl(const Tensor& input_t_,
descriptor:conv3dDescriptor_
name:nil];
} else if (isDepthwiseConv) {
MPSGraphDepthwiseConvolution3DOpDescriptor* depthWiseConv3dDescriptor_ =
[[MPSGraphDepthwiseConvolution3DOpDescriptor new] autorelease];
fill_depthwise_conv_desc(depthWiseConv3dDescriptor_,
stride[1],
stride[0],
dilation[1],
dilation[0],
padding[1],
padding[0],
memory_format,
groups);
auto depthWiseConv3dDescriptor_ = [[MPSGraphDepthwiseConvolution3DOpDescriptor new] autorelease];
fill_depthwise_conv_desc(
depthWiseConv3dDescriptor_, stride[1], stride[0], dilation[1], dilation[0], padding[1], padding[0]);
MPSGraphTensor* weightTransposeTensor = [mpsGraph transposeTensor:weightTensor
dimension:-3
@ -258,7 +238,7 @@ static Tensor _mps_convolution_impl(const Tensor& input_t_,
dilation[0],
padding[1],
padding[0],
memory_format,
input_suggested_layout,
groups);
outputTensor = [mpsGraph convolution2DWithSourceTensor:inputTensor
@ -270,13 +250,6 @@ static Tensor _mps_convolution_impl(const Tensor& input_t_,
MPSGraphTensor* biasTensor = nil;
if (bias_defined) {
biasTensor = mpsGraphUnrankedPlaceHolder(mpsGraph, getMPSDataType(bias_opt.value()));
}
if (is_channels_last && !is_macOS_15_0_or_newer) {
outputTensor = mps::convertNHWCtoNCHW(mpsGraph, outputTensor);
}
if (bias_defined) {
outputTensor = [mpsGraph additionWithPrimaryTensor:outputTensor secondaryTensor:biasTensor name:nil];
}
newCachedGraph->inputTensor_ = inputTensor;
@ -285,27 +258,26 @@ static Tensor _mps_convolution_impl(const Tensor& input_t_,
newCachedGraph->outputTensor_ = outputTensor;
});
auto inputPlaceholder = inputNDArray ? Placeholder(cachedGraph->inputTensor_, inputNDArray)
: Placeholder(cachedGraph->inputTensor_, input_t, inputShape);
auto weightsPlaceholder = Placeholder(cachedGraph->weightTensor_, weight_t);
auto inputPlaceholder = input_suggested_layout == kContiguous
? Placeholder(cachedGraph->inputTensor_, output_c || is3DConv ? input_t.contiguous() : input_t)
: Placeholder(cachedGraph->inputTensor_, getMPSNDArray(input_t, inputShape));
auto outputPlaceholder = input_suggested_layout == kContiguous
? Placeholder(cachedGraph->outputTensor_, output_c ? *output_c : output_t)
: Placeholder(cachedGraph->outputTensor_, getMPSNDArray(output_t, outputShape));
auto weightsPlaceholder = Placeholder(cachedGraph->weightTensor_, output_c ? weight_t.contiguous() : weight_t);
auto biasPlaceholder = Placeholder();
// Reshape the bias to be broadcastable with output of conv2d or conv3d
if (bias_defined) {
if (is3DConv) {
biasPlaceholder = Placeholder(cachedGraph->biasTensor_, (bias_opt.value()).view({1, bias_shape[0], 1, 1, 1}));
biasPlaceholder = Placeholder(cachedGraph->biasTensor_, bias_opt->view({1, bias_shape[0], 1, 1, 1}));
} else if (input_suggested_layout == kChannelsLast) {
biasPlaceholder = Placeholder(cachedGraph->biasTensor_, bias_opt->view({1, 1, 1, bias_shape[0]}));
} else {
if (is_channels_last && is_macOS_15_0_or_newer) {
biasPlaceholder = Placeholder(cachedGraph->biasTensor_, (bias_opt.value()).view({1, 1, 1, bias_shape[0]}));
} else {
biasPlaceholder = Placeholder(cachedGraph->biasTensor_, (bias_opt.value()).view({1, bias_shape[0], 1, 1}));
}
biasPlaceholder = Placeholder(cachedGraph->biasTensor_, bias_opt->view({1, bias_shape[0], 1, 1}));
}
}
auto outputPlaceholder = outputNDArray ? Placeholder(cachedGraph->outputTensor_, outputNDArray)
: Placeholder(cachedGraph->outputTensor_, output_t);
NSMutableDictionary<MPSGraphTensor*, MPSGraphTensorData*>* feeds =
[[[NSMutableDictionary alloc] initWithCapacity:3] autorelease];
auto feeds = [[[NSMutableDictionary alloc] initWithCapacity:3] autorelease];
feeds[inputPlaceholder.getMPSGraphTensor()] = inputPlaceholder.getMPSGraphTensorData();
feeds[weightsPlaceholder.getMPSGraphTensor()] = weightsPlaceholder.getMPSGraphTensorData();
if (bias_defined) {
@ -315,6 +287,10 @@ static Tensor _mps_convolution_impl(const Tensor& input_t_,
runMPSGraph(stream, cachedGraph->graph(), feeds, outputPlaceholder);
}
if (output_c) {
output_t.copy_(*output_c);
}
return output_t;
}
@ -351,14 +327,21 @@ static Tensor mps_convolution_backward_input(IntArrayRef input_size,
TensorArg grad_output{grad_output_t, "grad_output", 1}, weight{weight_t, "weight", 2};
checkAllSameType(c, {grad_output, weight});
checkAllSameGPU(c, {grad_output, weight});
auto memory_format = grad_output_t.suggest_memory_format();
bool is_channels_last = (memory_format == at::MemoryFormat::ChannelsLast) && !is3DConv;
auto grad_input_t = at::empty(input_size, grad_output_t.options(), std::nullopt);
constexpr auto kChannelsLast = at::MemoryFormat::ChannelsLast;
bool is_channels_last = mps_conv_use_channels_last(grad_output_t, weight_t) && !is3DConv;
auto grad_input_t =
at::empty(input_size, grad_output_t.options(), is_channels_last ? std::optional(kChannelsLast) : std::nullopt);
// Avoid "grad_input" when this is being used as transposed convolution
TensorArg grad_input{grad_input_t, "result", 0};
convolution_shape_check(c, grad_input, weight, grad_output, padding, stride, dilation, groups);
// TODO: Remove me when MacOS-14 is no longer supported
std::optional<Tensor> grad_input_c;
if (!is_macos_13_or_newer(MacOSVersion::MACOS_VER_15_0_PLUS) && is_channels_last) {
grad_input_c = at::empty_like(grad_input_t, grad_input_t.options().memory_format(MemoryFormat::Contiguous));
}
// Derive from MPSCachedGraph
struct CachedGraph : public MPSCachedGraph {
CachedGraph(MPSGraph* graph) : MPSCachedGraph(graph) {}
@ -370,7 +353,6 @@ static Tensor mps_convolution_backward_input(IntArrayRef input_size,
// Add backward with input
@autoreleasepool {
MPSStream* stream = getCurrentMPSStream();
MPSShape* mps_input_shape = getMPSShape(input_size);
std::string key = fmt::format("mps_{}_convolution_backward_input:{}:{}:{}:{}:{}:{}",
is3DConv ? "3d_" : "",
@ -411,15 +393,8 @@ static Tensor mps_convolution_backward_input(IntArrayRef input_size,
} else if (isDepthwiseConv) {
MPSGraphDepthwiseConvolution3DOpDescriptor* depthWiseConv3dDescriptor_ =
[[MPSGraphDepthwiseConvolution3DOpDescriptor new] autorelease];
fill_depthwise_conv_desc(depthWiseConv3dDescriptor_,
stride[1],
stride[0],
dilation[1],
dilation[0],
padding[1],
padding[0],
at::MemoryFormat::Contiguous,
groups);
fill_depthwise_conv_desc(
depthWiseConv3dDescriptor_, stride[1], stride[0], dilation[1], dilation[0], padding[1], padding[0]);
MPSGraphTensor* weightTransposeTensor = [mpsGraph transposeTensor:weightTensor
dimension:-3
withDimension:-4
@ -454,14 +429,18 @@ static Tensor mps_convolution_backward_input(IntArrayRef input_size,
newCachedGraph->gradInputTensor_ = gradInputTensor;
});
auto gradOutputPlaceholder = Placeholder(cachedGraph->gradOutputTensor_, grad_output_t);
auto weightsPlaceholder = Placeholder(cachedGraph->weightTensor_, weight_t);
auto outputPlaceholder = Placeholder(cachedGraph->gradInputTensor_, *grad_input);
auto gradOutputPlaceholder =
Placeholder(cachedGraph->gradOutputTensor_, grad_input_c ? grad_output_t.contiguous() : grad_output_t);
auto weightsPlaceholder = Placeholder(cachedGraph->weightTensor_, grad_input_c ? weight_t.contiguous() : weight_t);
auto outputPlaceholder = Placeholder(cachedGraph->gradInputTensor_, grad_input_c ? *grad_input_c : grad_input_t);
auto feeds = dictionaryFromPlaceholders(gradOutputPlaceholder, weightsPlaceholder);
runMPSGraph(stream, cachedGraph->graph(), feeds, outputPlaceholder);
}
return *grad_input;
if (grad_input_c) {
grad_input_t.copy_(*grad_input_c);
}
return grad_input_t;
}
static Tensor mps_convolution_backward_weights(IntArrayRef weight_size,
@ -474,9 +453,11 @@ static Tensor mps_convolution_backward_weights(IntArrayRef weight_size,
bool bias_defined) {
using namespace at::native::mps;
using namespace mps;
bool is3DConv = input_t.dim() == 5;
const bool is3DConv = input_t.dim() == 5;
TORCH_CHECK(isFloatingType(grad_output_t.scalar_type()), "Convolution is supported only for Floating types");
CheckedFrom c = "mps_convolution_backward_weights";
constexpr auto kChannelsLast = at::MemoryFormat::ChannelsLast;
bool is_channels_last = mps_conv_use_channels_last(input_t, grad_output_t) && !is3DConv;
// For uniformity with everything else, although it seems grad_weight
// would be unambiguous too.
@ -487,7 +468,8 @@ static Tensor mps_convolution_backward_weights(IntArrayRef weight_size,
checkAllSameGPU(c, {grad_output, input});
auto grad_weight_t =
at::empty(weight_size, grad_output_t.scalar_type(), std::nullopt, kMPS, std::nullopt, std::nullopt);
at::empty(weight_size, grad_output_t.options(), is_channels_last ? std::optional(kChannelsLast) : std::nullopt);
TensorArg grad_weight{grad_weight_t, "result", 0};
convolution_shape_check(c, input, grad_weight, grad_output, padding, stride, dilation, groups);
@ -500,16 +482,23 @@ static Tensor mps_convolution_backward_weights(IntArrayRef weight_size,
MPSGraphTensor* gradWeightTensor_ = nil;
};
// TODO: Remove me when MacOS-14 is no longer supported
std::optional<Tensor> grad_weight_c;
if (!is_macos_13_or_newer(MacOSVersion::MACOS_VER_15_0_PLUS) && is_channels_last) {
grad_weight_c = at::empty_like(grad_weight_t, grad_weight_t.options().memory_format(MemoryFormat::Contiguous));
}
@autoreleasepool {
MPSStream* stream = getCurrentMPSStream();
MPSShape* mps_weight_shape = getMPSShape(weight_size);
std::string key = fmt::format("mps_{}convolution_backward_weights:{}:{}:{}:{}:{}",
std::string key = fmt::format("mps_{}convolution_backward_weights:{}:{}:{}:{}:{}:{}",
is3DConv ? "3d_" : "",
getArrayRefString(stride),
getArrayRefString(dilation),
getArrayRefString(padding),
groups,
is_channels_last,
getTensorsStringKey({grad_output_t, input_t, grad_weight_t}));
auto cachedGraph = LookUpOrCreateCachedGraph<CachedGraph>(key, [&](auto mpsGraph, auto newCachedGraph) {
MPSShape* inputShape = getMPSShape(input_t);
@ -541,15 +530,8 @@ static Tensor mps_convolution_backward_weights(IntArrayRef weight_size,
} else if (isDepthwiseConv) {
MPSGraphDepthwiseConvolution3DOpDescriptor* depthWiseConv3dDescriptor_ =
[[MPSGraphDepthwiseConvolution3DOpDescriptor new] autorelease];
fill_depthwise_conv_desc(depthWiseConv3dDescriptor_,
stride[1],
stride[0],
dilation[1],
dilation[0],
padding[1],
padding[0],
at::MemoryFormat::Contiguous,
groups);
fill_depthwise_conv_desc(
depthWiseConv3dDescriptor_, stride[1], stride[0], dilation[1], dilation[0], padding[1], padding[0]);
NSNumber* outputFeatChannelDim = mps_weight_shape[0];
MPSShape* weightShapeTranspose = @[ @1, outputFeatChannelDim, mps_weight_shape[2], mps_weight_shape[3] ];
MPSGraphTensor* gradWeightTensorTranspose =
@ -583,14 +565,19 @@ static Tensor mps_convolution_backward_weights(IntArrayRef weight_size,
newCachedGraph->gradWeightTensor_ = gradWeightTensor;
});
auto gradOutputPlaceholder = Placeholder(cachedGraph->gradOutputTensor_, grad_output_t);
auto inputPlaceholder = Placeholder(cachedGraph->inputTensor_, input_t);
auto outputPlaceholder = Placeholder(cachedGraph->gradWeightTensor_, grad_weight_t);
auto gradOutputPlaceholder =
Placeholder(cachedGraph->gradOutputTensor_, grad_weight_c ? grad_output_t.contiguous() : grad_output_t);
auto inputPlaceholder = Placeholder(cachedGraph->inputTensor_, grad_weight_c ? input_t.contiguous() : input_t);
auto outputPlaceholder =
Placeholder(cachedGraph->gradWeightTensor_, grad_weight_c ? *grad_weight_c : grad_weight_t);
auto feeds = dictionaryFromPlaceholders(gradOutputPlaceholder, inputPlaceholder);
runMPSGraph(stream, cachedGraph->graph(), feeds, outputPlaceholder);
}
if (grad_weight_c) {
grad_weight_t.copy_(*grad_weight_c);
}
return grad_weight_t;
}
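A minimal Python repro of the path these MPS hunks touch: a channels-last `conv2d`, which on macOS 15+ is meant to stay in NHWC without the extra contiguous copies (sketch assumes an MPS-enabled build):

```python
import torch

if torch.backends.mps.is_available():
    conv = torch.nn.Conv2d(8, 16, 3, padding=1).to("mps")
    x = torch.randn(2, 8, 32, 32, device="mps").to(
        memory_format=torch.channels_last
    )
    y = conv(x)
    print(y.shape, y.is_contiguous(memory_format=torch.channels_last))
```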

View File

@ -9,11 +9,22 @@
#else
#include <ATen/ops/_unique2.h>
#include <ATen/ops/_unique2_native.h>
#include <ATen/ops/arange.h>
#include <ATen/ops/argsort.h>
#include <ATen/ops/cat.h>
#include <ATen/ops/cumsum.h>
#include <ATen/ops/full.h>
#include <ATen/ops/masked_select.h>
#include <ATen/ops/nonzero.h>
#include <ATen/ops/ones.h>
#include <ATen/ops/ones_like.h>
#include <ATen/ops/slice.h>
#include <ATen/ops/unique_consecutive.h>
#include <ATen/ops/unique_consecutive_native.h>
#include <ATen/ops/unique_dim_consecutive.h>
#include <ATen/ops/unique_dim_consecutive_native.h>
#include <ATen/ops/unique_dim_native.h>
#include <ATen/ops/zeros.h>
#endif
namespace at::native {
@ -305,4 +316,85 @@ std::tuple<Tensor, Tensor, Tensor> _unique2_mps(const Tensor& self,
return _unique_impl_mps(self, return_inverse, return_counts, false, std::nullopt);
}
static Tensor lexsort_rows_perm_mps(const Tensor& mat_2d) {
const auto rows = mat_2d.size(0), cols = mat_2d.size(1);
if (rows <= 1 || cols == 0) {
return arange(rows, mat_2d.options().dtype(kLong));
}
auto perm = arange(rows, mat_2d.options().dtype(kLong));
for (auto c = cols - 1; c >= 0; --c) {
auto keys = mat_2d.select(1, c).index_select(0, perm);
const auto idx = argsort(keys, /*dim=*/0, /*descending=*/false);
perm = perm.index_select(0, idx);
}
return perm;
}
static std::tuple<Tensor, Tensor, Tensor> unique_dim_sorted_mps_impl(const Tensor& self,
int64_t dim,
bool return_inverse,
bool return_counts) {
dim = maybe_wrap_dim(dim, self.dim());
auto sizes = self.sizes().vec();
auto num_zero_dims = std::count(sizes.begin(), sizes.end(), (int64_t)0);
if (self.size(dim) == 0) {
auto output = at::empty(sizes, self.options());
auto inverse_indices = at::empty({0}, self.options().dtype(kLong));
auto counts = at::empty({0}, self.options().dtype(kLong));
return {output, inverse_indices, counts};
}
auto transposed = self.moveaxis(dim, 0);
auto orig_sizes = transposed.sizes().vec();
auto rows = transposed.size(0);
auto input_flat = transposed.contiguous().view({rows, -1});
auto perm = lexsort_rows_perm_mps(input_flat);
auto input_sorted = input_flat.index_select(0, perm);
Tensor is_unique = at::zeros({rows}, self.options().dtype(kBool));
if (rows > 0) {
is_unique.narrow(0, 0, 1).fill_(true);
}
if (rows > 1) {
auto a = input_sorted.narrow(0, 1, rows - 1);
auto b = input_sorted.narrow(0, 0, rows - 1);
auto row_changed = a.ne(b).any(1);
is_unique.narrow(0, 1, rows - 1).copy_(row_changed);
}
auto unique_pos = nonzero(is_unique).squeeze(1);
auto group_id = cumsum(is_unique.to(kLong), 0).sub(1);
auto unique_rows_2d = input_sorted.index_select(0, unique_pos);
Tensor inverse_indices = empty({0}, self.options().dtype(kLong));
if (return_inverse) {
inverse_indices = empty({rows}, self.options().dtype(kLong));
inverse_indices.index_copy_(0, perm, group_id);
}
Tensor counts = empty({0}, self.options().dtype(kLong));
if (return_counts) {
const auto num_unique = unique_pos.size(0);
counts = zeros({num_unique}, self.options().dtype(kLong));
counts.scatter_add_(0, group_id, ones_like(group_id, group_id.options().dtype(kLong)));
}
orig_sizes[0] = unique_rows_2d.size(0);
auto output = unique_rows_2d.view(orig_sizes).moveaxis(0, dim);
return std::make_tuple(std::move(output), std::move(inverse_indices), std::move(counts));
}
std::tuple<Tensor, Tensor, Tensor> unique_dim_mps(const Tensor& self,
int64_t dim,
const bool /*sorted*/,
const bool return_inverse,
const bool return_counts) {
return unique_dim_sorted_mps_impl(self, dim, return_inverse, return_counts);
}
} // namespace at::native
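The lexsort-based dedup above can be summarized with a short NumPy sketch (same idea: stable row sort, compare neighbours, then scatter group ids back through the permutation):

```python
import numpy as np

def unique_rows_via_lexsort(mat: np.ndarray):
    perm = np.lexsort(mat.T[::-1])            # primary key = first column
    srt = mat[perm]
    is_unique = np.ones(len(srt), dtype=bool)
    is_unique[1:] = np.any(srt[1:] != srt[:-1], axis=1)
    group_id = np.cumsum(is_unique) - 1
    inverse = np.empty(len(mat), dtype=np.int64)
    inverse[perm] = group_id                  # undo the sort permutation
    counts = np.bincount(group_id)
    return srt[is_unique], inverse, counts

rows = np.array([[1, 2], [3, 4], [1, 2], [0, 0]])
u, inv, cnt = unique_rows_via_lexsort(rows)
print(u.tolist(), inv.tolist(), cnt.tolist())
# [[0, 0], [1, 2], [3, 4]] [1, 2, 1, 0] [1, 2, 1]
```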

View File

@ -1409,7 +1409,7 @@
- func: _sparse_broadcast_to(Tensor(a) self, int[] size) -> Tensor(a)
variants: function
dispatch:
SparseCPU, SparseCUDA: sparse_broadcast_to
SparseCPU, SparseCUDA, SparseMPS: sparse_broadcast_to
- func: cat(Tensor[] tensors, int dim=0) -> Tensor
structured_delegate: cat.out
@ -6450,6 +6450,7 @@
dispatch:
CPU: unique_dim_cpu
CUDA: unique_dim_cuda
MPS: unique_dim_mps
tags: dynamic_output_shape
autogen: unique_dim.out

View File

@ -176,6 +176,28 @@ bool check_head_dim_size_flash(sdp_params const& params, bool debug) {
}
return false;
}
if constexpr(caller_is_meff) {
bool is_half = (params.query.dtype() == at::kHalf) ||
(params.query.dtype() == at::kBFloat16);
const int64_t alignment = is_half ? 8 : 4;
if (!(query_size_last % alignment == 0 && query_size_last > 0 &&
value_size_last % alignment == 0 && value_size_last > 0)) {
if (debug) {
TORCH_WARN(
"Mem efficient attention requires last dimension of inputs to be divisible by ",
alignment,
". ",
"Got Query.size(-1): ",
query_size_last,
", Key.size(-1): ",
params.key.sym_size(-1),
", Value.size(-1): ",
params.value.sym_size(-1),
" instead.");
}
return false;
}
}
return true;
}
@ -666,6 +688,15 @@ bool can_use_cudnn_attention(const sdp_params& params, bool debug) {
TORCH_WARN(CUDNN_VERSION, " cuDNN version too old to use cuDNN Attention (< v9.0.0)");
}
return false;
#endif
#if defined(CUDNN_VERSION)
static auto cudnn_version = cudnnGetVersion();
if (params.dropout > 0.0 && cudnn_version > 91100 && cudnn_version < 91400) {
if (debug) {
TORCH_WARN(CUDNN_VERSION, " cuDNN version does not support dropout in SDPA (9.11 - 9.13).");
}
return false;
}
#endif
// Define gate functions that determine if a flash kernel can be ran
// Replace with std::to_array when we migrate to c++20
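The new alignment gate means the memory-efficient backend only accepts head dimensions divisible by 8 for fp16/bf16 (4 for fp32); a hedged usage sketch (assumes a recent PyTorch 2.3+ CUDA/ROCm build where that backend is available):

```python
import torch
from torch.nn.attention import SDPBackend, sdpa_kernel

if torch.cuda.is_available():
    # head_dim = 72 satisfies 72 % 8 == 0, so the check above passes for fp16.
    q, k, v = (
        torch.randn(1, 4, 128, 72, device="cuda", dtype=torch.float16)
        for _ in range(3)
    )
    with sdpa_kernel(SDPBackend.EFFICIENT_ATTENTION):
        out = torch.nn.functional.scaled_dot_product_attention(q, k, v)
    print(out.shape)
```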

View File

@ -462,10 +462,11 @@ mha_varlen_fwd_aot(const at::Tensor &q, // total_q x num_heads x head_size, tot
using sdp::aotriton_adapter::mk_aotensor;
using sdp::aotriton_adapter::mk_aoscalartensor;
using sdp::aotriton_adapter::mk_philoxtensor;
using sdp::aotriton_adapter::mk_atomictensor;
using sdp::aotriton_adapter::cast_dtype;
at::Tensor atomic_counter;
if (is_causal) {
atomic_counter = at::zeros({1}, q.options());
atomic_counter = at::zeros({1}, q.options().dtype(at::kInt));
}
aotriton::TensorView<4> empty_bias(0, {0,0,0,0}, {0,0,0,0}, cast_dtype(q.dtype()));
auto seed = use_philox_state ? mk_philoxtensor(philox_state.seed_.ptr) : mk_aoscalartensor(seed_t);
@ -474,7 +475,7 @@ mha_varlen_fwd_aot(const at::Tensor &q, // total_q x num_heads x head_size, tot
auto nullscalar = mk_philoxtensor(nullptr);
auto seed_output = use_philox_state ? mk_philoxtensor(seed_t.data_ptr<int64_t>()) : nullscalar;
auto offset_output = use_philox_state ? mk_philoxtensor(offset_t.data_ptr<int64_t>()) : nullscalar;
auto persistent_counter = is_causal ? mk_philoxtensor(atomic_counter.data_ptr<int64_t>()) : nullscalar;
auto persistent_counter = mk_atomictensor(is_causal ? atomic_counter.data_ptr<int32_t>() : nullptr);
if (uses_swa || AOTRITON_ALWAYS_V3_API) {
#if AOTRITON_V3_API
using aotriton::v3::flash::CausalType;

View File

@ -1561,6 +1561,38 @@ namespace {
<< "Failure Details:\nTest Seed to reproduce: " << seed;
}
}
#endif
#if defined(CPU_CAPABILITY_AVX512)
TYPED_TEST(Quantization8BitTests, TransposePackVNNI4) {
using VT = ValueType<TypeParam>;
constexpr auto K = 197;
constexpr auto N = 64;
constexpr auto L = K * N;
constexpr auto ld_src = N;
constexpr auto ld_dst = K * 4;
CACHE_ALIGN VT x[L];
CACHE_ALIGN VT y[L];
CACHE_ALIGN VT ref[L];
auto seed = TestSeed();
ValueGen<VT> generator(VT(-100), VT(100), seed);
for (const auto i : c10::irange(L)) {
x[i] = generator.get();
}
at::vec::transpose_pack_vnni4(x, y, ld_src, K, N);
int64_t _N = N / 4;
for (int64_t k = 0; k < K; k++) {
for(int64_t n = 0; n < _N; n++) {
for(int64_t l = 0; l < 4; l++) {
ref[n * ld_dst + k * 4 + l] =
c10::load(&(x[k * ld_src + n * 4 + l]));
}
}
}
for (const auto i : c10::irange(L)) {
ASSERT_EQ(y[i], ref[i])
<< "Failure Details:\nTest Seed to reproduce: " << seed;
}
}
#endif
TYPED_TEST(FunctionalTests, Map) {
using vec = TypeParam;

View File

@ -218,7 +218,7 @@ maml_omniglot,pass,0
microbench_unbacked_tolist_sum,pass,1
microbench_unbacked_tolist_sum,pass,2


View File

@ -146,7 +146,7 @@ maml_omniglot,pass,7
microbench_unbacked_tolist_sum,pass,8
microbench_unbacked_tolist_sum,pass,9


View File

@ -182,7 +182,7 @@ maml_omniglot,pass,0
microbench_unbacked_tolist_sum,pass,0
microbench_unbacked_tolist_sum,pass,1


View File

@ -182,7 +182,7 @@ maml_omniglot,pass,0
microbench_unbacked_tolist_sum,pass,0
microbench_unbacked_tolist_sum,pass,1


View File

@ -198,7 +198,7 @@ maml_omniglot,pass,0
microbench_unbacked_tolist_sum,pass,1
microbench_unbacked_tolist_sum,pass,2


View File

@ -198,7 +198,7 @@ maml_omniglot,pass,0
microbench_unbacked_tolist_sum,pass,1
microbench_unbacked_tolist_sum,pass,2


View File

@ -198,7 +198,7 @@ maml_omniglot,pass,0
microbench_unbacked_tolist_sum,pass,1
microbench_unbacked_tolist_sum,pass,2


View File

@ -218,7 +218,7 @@ maml_omniglot,pass,0
microbench_unbacked_tolist_sum,pass,1
microbench_unbacked_tolist_sum,pass,2


View File

@ -146,7 +146,7 @@ maml_omniglot,pass,7
microbench_unbacked_tolist_sum,pass,8
microbench_unbacked_tolist_sum,pass,9


View File

@ -166,7 +166,7 @@ maml_omniglot,pass,0
microbench_unbacked_tolist_sum,pass,0
microbench_unbacked_tolist_sum,pass,1


View File

@ -166,7 +166,7 @@ maml_omniglot,pass,0
microbench_unbacked_tolist_sum,pass,0
microbench_unbacked_tolist_sum,pass,1


View File

@ -182,7 +182,7 @@ maml_omniglot,pass,0
microbench_unbacked_tolist_sum,pass,1
microbench_unbacked_tolist_sum,pass,2
@ -318,7 +318,7 @@ timm_vovnet,pass,0
torch_multimodal_clip,pass,3
torch_multimodal_clip,pass,0


View File

@ -198,7 +198,7 @@ maml_omniglot,pass,0
microbench_unbacked_tolist_sum,pass,1
microbench_unbacked_tolist_sum,pass,2


View File

@ -218,7 +218,7 @@ maml_omniglot,pass,0
microbench_unbacked_tolist_sum,pass,1
microbench_unbacked_tolist_sum,pass,2


View File

@ -146,7 +146,7 @@ maml_omniglot,pass,7
microbench_unbacked_tolist_sum,pass,8
microbench_unbacked_tolist_sum,pass,9


View File

@ -218,7 +218,7 @@ maml_omniglot,pass,0
microbench_unbacked_tolist_sum,pass,1
microbench_unbacked_tolist_sum,pass,2


View File

@ -146,7 +146,7 @@ maml_omniglot,pass,7
microbench_unbacked_tolist_sum,pass,8
microbench_unbacked_tolist_sum,pass,9


View File

@ -218,7 +218,7 @@ maml_omniglot,pass,0
microbench_unbacked_tolist_sum,pass,1
microbench_unbacked_tolist_sum,pass,2


View File

@ -146,7 +146,7 @@ maml_omniglot,pass,7
microbench_unbacked_tolist_sum,pass,8
microbench_unbacked_tolist_sum,pass,9


View File

@ -221,7 +221,7 @@ maml_omniglot,pass,0
microbench_unbacked_tolist_sum,pass,1
microbench_unbacked_tolist_sum,pass,2


View File

@ -154,7 +154,7 @@ maml_omniglot,pass,7
microbench_unbacked_tolist_sum,pass,8
microbench_unbacked_tolist_sum,pass,9


View File

@ -221,7 +221,7 @@ maml_omniglot,pass,0
microbench_unbacked_tolist_sum,pass,1
microbench_unbacked_tolist_sum,pass,2


View File

@ -154,7 +154,7 @@ maml_omniglot,pass,7
microbench_unbacked_tolist_sum,pass,8
microbench_unbacked_tolist_sum,pass,9


View File

@ -142,7 +142,7 @@ maml_omniglot,pass,7
microbench_unbacked_tolist_sum,pass,8
microbench_unbacked_tolist_sum,pass,9


View File

@ -221,7 +221,7 @@ maml_omniglot,pass,0
microbench_unbacked_tolist_sum,pass,1
microbench_unbacked_tolist_sum,pass,2


View File

@ -154,7 +154,7 @@ maml_omniglot,pass,7
microbench_unbacked_tolist_sum,pass,8
microbench_unbacked_tolist_sum,pass,9


View File

@ -142,7 +142,7 @@ maml_omniglot,pass,7
microbench_unbacked_tolist_sum,pass,8
microbench_unbacked_tolist_sum,pass,9


View File

@ -6,4 +6,4 @@
4. (Optional) flip a flag that you know will change the benchmark and run again with b.txt `PYTHONPATH=./ python benchmarks/[YOUR_BENCHMARK].py a.txt`
5. Compare `a.txt` and `b.txt` located within the `benchmarks/dynamo/pr_time_benchmarks` folder to make sure things look as you expect
6. Check in your new benchmark file and submit a new PR
7. In a few days, if your benchmark is stable, bug Laith Sakka to enable running your benchmark on all PRs. If your a meta employee, you can find the dashboard here: internalfb.com/intern/unidash/dashboard/pt2_diff_time_metrics
7. In a few days, if your benchmark is stable, bug Laith Sakka to enable running your benchmark on all PRs. If you are a meta employee, you can find the dashboard here: https://internalfb.com/intern/unidash/dashboard/pt2_diff_time_metrics

View File

@ -0,0 +1,111 @@
import sys
from benchmark_base import BenchmarkBase
import torch
from torch.autograd.grad_mode import inference_mode
class Benchmark(BenchmarkBase):
def __init__(self, requires_grad, inference_mode, backward, dynamic):
assert not (inference_mode and backward), (
"inference_mode and backward cannot be both True"
)
self._requires_grad = requires_grad
self._inference_mode = inference_mode
self._backward = backward
super().__init__(
category="runtime_overhead",
backend="inductor",
device="cuda",
dynamic=dynamic,
)
def name(self):
prefix = f"{self.category()}_{self.backend()}"
if self._requires_grad:
prefix += "_requires_grad"
if self._inference_mode:
prefix += "_inference_mode"
if self._backward:
prefix += "_backward"
if self.is_dynamic():
prefix += "_dynamic"
return prefix
def description(self):
return "runtime of a compiled add1 op small input"
def _prepare_once(self):
torch._dynamo.reset()
self.a = torch.ones(2, device=self.device(), requires_grad=self._requires_grad)
@torch.compile(
backend=self.backend(),
fullgraph=True,
dynamic=self.is_dynamic(),
)
def add1(a):
return a + 1
self._add1 = add1
# warmup
for _ in range(10):
if self._backward:
self.forward_val = self._add1(self.a).sum()
self.forward_val.backward()
else:
self._work()
def _prepare(self):
if self._backward:
self.forward_val = self._add1(self.a).sum()
def _work(self):
if self._inference_mode:
with inference_mode():
self._add1(self.a)
elif self._backward:
self.forward_val.backward()
else:
self._add1(self.a)
def main():
result_path = sys.argv[1]
all = [
Benchmark(
requires_grad=False, inference_mode=False, backward=False, dynamic=False
),
Benchmark(
requires_grad=False, inference_mode=True, backward=False, dynamic=False
),
Benchmark(
requires_grad=True, inference_mode=False, backward=False, dynamic=False
),
Benchmark(
requires_grad=True, inference_mode=False, backward=True, dynamic=False
),
Benchmark(
requires_grad=False, inference_mode=False, backward=False, dynamic=True
),
Benchmark(
requires_grad=False, inference_mode=True, backward=False, dynamic=True
),
Benchmark(
requires_grad=True, inference_mode=False, backward=False, dynamic=True
),
Benchmark(
requires_grad=True, inference_mode=False, backward=True, dynamic=True
),
]
for benchmark in all:
benchmark.enable_instruction_count().collect_all().append_results(result_path)
if __name__ == "__main__":
main()

View File

@ -50,7 +50,7 @@ bool SymInt::has_hint() const {
return toSymNodeImplUnowned()->has_hint();
}
#define DEFINE_BINARY(API, OP, METHOD, RET) \
#define DEFINE_BINARY(API, METHOD, RET) \
RET SymInt::API(const SymInt& sci) const { \
if (auto ma = maybe_as_int()) { \
TORCH_INTERNAL_ASSERT_DEBUG_ONLY( \
@ -68,19 +68,19 @@ bool SymInt::has_hint() const {
} \
}
DEFINE_BINARY(operator_add_slow_path, std::plus<>(), add, SymInt)
DEFINE_BINARY(operator_sub_slow_path, std::minus<>(), sub, SymInt)
DEFINE_BINARY(operator_mul_slow_path, std::multiplies<>(), mul, SymInt)
DEFINE_BINARY(operator_div_slow_path, std::divides<>(), floordiv, SymInt)
DEFINE_BINARY(operator_mod_slow_path, std::modulus<>(), mod, SymInt)
DEFINE_BINARY(sym_eq_slow_path, std::equal_to<>(), eq, SymBool)
DEFINE_BINARY(sym_ne_slow_path, std::not_equal_to<>(), ne, SymBool)
DEFINE_BINARY(sym_lt_slow_path, std::less<>(), lt, SymBool)
DEFINE_BINARY(sym_le_slow_path, std::less_equal<>(), le, SymBool)
DEFINE_BINARY(sym_gt_slow_path, std::greater<>(), gt, SymBool)
DEFINE_BINARY(sym_ge_slow_path, std::greater_equal<>(), ge, SymBool)
DEFINE_BINARY(min_slow_path, std::min, sym_min, SymInt)
DEFINE_BINARY(max_slow_path, std::max, sym_max, SymInt)
DEFINE_BINARY(operator_add_slow_path, add, SymInt)
DEFINE_BINARY(operator_sub_slow_path, sub, SymInt)
DEFINE_BINARY(operator_mul_slow_path, mul, SymInt)
DEFINE_BINARY(operator_div_slow_path, floordiv, SymInt)
DEFINE_BINARY(operator_mod_slow_path, mod, SymInt)
DEFINE_BINARY(sym_eq_slow_path, eq, SymBool)
DEFINE_BINARY(sym_ne_slow_path, ne, SymBool)
DEFINE_BINARY(sym_lt_slow_path, lt, SymBool)
DEFINE_BINARY(sym_le_slow_path, le, SymBool)
DEFINE_BINARY(sym_gt_slow_path, gt, SymBool)
DEFINE_BINARY(sym_ge_slow_path, ge, SymBool)
DEFINE_BINARY(min_slow_path, sym_min, SymInt)
DEFINE_BINARY(max_slow_path, sym_max, SymInt)
SymInt::operator SymFloat() const {
if (auto ma = maybe_as_int()) {

View File

@ -1,21 +0,0 @@
#include <c10/core/impl/HermeticPyObjectTLS.h>
namespace c10::impl {
thread_local static std::atomic<bool> hermeticPyObjectState{false};
std::atomic<bool> HermeticPyObjectTLS::haveState_{false};
void HermeticPyObjectTLS::set_state(bool state) {
hermeticPyObjectState = state;
}
bool HermeticPyObjectTLS::get_tls_state() {
return hermeticPyObjectState;
}
void HermeticPyObjectTLS::init_state() {
haveState_ = true;
}
} // namespace c10::impl

View File

@ -1,62 +0,0 @@
#pragma once
#include <c10/macros/Export.h>
#include <atomic>
namespace c10::impl {
// This TLS controls whether or not we permanently associate PyObject
// with Tensor the first time it is allocated. When hermetic PyObject
// TLS is enabled (state is true), we DO NOT save PyObjects to Tensor,
// meaning you get a distinct PyObject whenever you execute the code in
// question.
struct C10_API HermeticPyObjectTLS {
static void set_state(bool state);
static bool get_state() {
// Hypothetical fastpath if torchdeploy/multipy // codespell:ignore multipy
// isn't used. Per
// https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2020/p2055r0.pdf
// this qualifies relaxed access because it is a single-location data
// structure (only the boolean here).
//
// Forgetting about data races for a moment, is there a logical race?
//
// - Boolean only ever transitions from false to true. So the
// critical situation is when one interpreter is already running
// when a second interpreter switches haveState from false to true.
//
// - The first interpreter is indifferent whether or not it sees
// hasState true/false; obviously false works (this is what the
// interpreter was previously using; more directly, the interpreter
// calls into itself as the handler, so being hermetic is not
// required), and true simply means serviced python operator calls will
// be hermetic; in these cases it is expected to be functionally
// equivalent.
//
// - The second interpreter MUST see hasState true (as its requests will
// be forwarded to the first interpreter), but it is assumed that there
// is a synchronization between the interpreter initialization, and
// when we actually perform operations, so it is guaranteed to see
// hasState true.
//
// QED.
//
// This fastpath is currently disabled so that we can more easily test that
// hermetic mode works correctly even on stock build of PyTorch.
if (false && !haveState_.load(std::memory_order_relaxed))
return false;
return get_tls_state();
}
// Call this from the multipy/torchdeploy // codespell:ignore multipy
// top level
static void init_state();
private:
// This only flipped once from false to true during
// torchdeploy/multipy initialization, // codespell:ignore multipy
// and never again.
static std::atomic<bool> haveState_;
static bool get_tls_state();
};
} // namespace c10::impl

View File

@ -1,6 +1,5 @@
#pragma once
#include <c10/core/impl/HermeticPyObjectTLS.h>
#include <c10/core/impl/PyInterpreter.h>
#include <c10/core/impl/PyInterpreterHooks.h>
#include <c10/util/python_stub.h>
@ -42,32 +41,15 @@ struct C10_API PyObjectSlot {
PyObject* _unchecked_untagged_pyobj() const;
// Test the interpreter tag. If tagged for the current interpreter, return
// a non-nullopt (but possibly null) PyObject. If (possibly) untagged,
// returns a nullopt. If it is definitely invalid, raises an error.
//
// If `ignore_hermetic_tls` is false and this function is called from a
// hermetic context (ie, `HermeticPyObjectTLS::get_state()` is true), then
// nullopt is returned. If `ignore_hermetic_tls` is true, then the hermetic
// context is ignored, allowing you to check the interpreter tag of a
// nonhermetic PyObject from within a hermetic context. This is necessary
// because there are some cases where the deallocator function of a
// nonhermetic PyObject is called from within a hermetic context, so it must
// be properly treated as a nonhermetic PyObject.
//
// Test the interpreter / PyObject since either may be null
// NB: this lives in the header so that we can avoid actually creating the
// std::optional
// @todo alban: I'm not too sure what's going on here; we can probably delete
// it, but it's worth making sure
std::optional<PyObject*> check_pyobj() const {
impl::PyInterpreter* interpreter = getGlobalPyInterpreter();
if (interpreter == nullptr || pyobj_ == nullptr) {
return std::nullopt;
}
if (c10::impl::HermeticPyObjectTLS::get_state()) {
return std::nullopt;
}
return _unchecked_untagged_pyobj();
}

View File

@ -38,7 +38,7 @@ def unroll(num_unrolls, IndexType, InType, OutType):
code = []
if num_unrolls == 1:
code.append(f" // tail loop")
code.append(" // tail loop")
code.append(" if (j < end_offset) {")
else:
code.append(f" // unrolling {num_unrolls} times")

View File

@ -153,7 +153,6 @@ _ZN3c104impl12PyObjectSlot10owns_pyobjEv
_ZN3c104impl12PyObjectSlot19maybe_destroy_pyobjEv
_ZN3c104impl12PyObjectSlotC1Ev
_ZN3c104impl12PyObjectSlotD2Ev
_ZN3c104impl19HermeticPyObjectTLS13get_tls_stateEv
_ZN3c104impl20TorchDispatchModeTLS13any_modes_setEb
_ZN3c104impl23ExcludeDispatchKeyGuardC1ENS_14DispatchKeySetE
_ZN3c104impl23ExcludeDispatchKeyGuardD2Ev

View File

@ -40,7 +40,34 @@ extensions = [
"sphinx.ext.intersphinx",
] + (["breathe", "exhale"] if run_doxygen else [])
intersphinx_mapping = {"pytorch": ("https://pytorch.org/docs/main", None)}
intersphinx_mapping = {"pytorch": ("https://docs.pytorch.org/docs/main", None)}
# Configure Sphinx warnings and error handling
suppress_warnings = [
"ref.citation",
"ref.footnote",
"ref.doc",
"toc.excluded",
"toc.not_readable",
"misc.highlighting_failure",
]
# Configure Breathe
breathe_show_define_initializer = True
breathe_show_enumvalue_initializer = True
breathe_default_members = ("members", "undoc-members")
# Fix for Python 3.10+ compatibility with exhale 2.3.0
# The collections.MutableMapping alias was removed in Python 3.10; it is
# now only available from collections.abc
try:
import collections
from collections.abc import MutableMapping
if not hasattr(collections, "MutableMapping"):
collections.MutableMapping = MutableMapping
except ImportError:
pass
# Set up absolute paths for communicating with breathe / exhale: where
# items are expected and the prefix they should be trimmed by.
@ -101,6 +128,21 @@ exhale_args = {
Welcome to the developer reference for the PyTorch C++ API.
"""
),
############################################################################
# Duplicate handling and error management. #
############################################################################
# Note: Using Doxyfile instead of stdin configuration
# "exhaleDoxygenStdin" is not compatible with "exhaleUseDoxyfile"
# Handle unresolved references more gracefully
"unabridgedOrphanKinds": {
"function",
"define",
"enum",
"enumvalue",
"typedef",
"variable",
},
"fullToctreeMaxDepth": 2,
}
# Tell sphinx what the primary language being documented is.
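As a quick check of why the MutableMapping shim above is needed (exhale 2.3.0 still looks up `collections.MutableMapping`, which a clean Python 3.10+ interpreter no longer provides):

```python
import collections
import collections.abc

# On an unpatched Python 3.10+ interpreter the old alias is gone; the shim in
# this conf.py restores it before exhale runs.
print(hasattr(collections, "MutableMapping"))      # False without the shim
print(hasattr(collections.abc, "MutableMapping"))  # True
```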

View File

@ -2,11 +2,6 @@
# AOTInductor: Ahead-Of-Time Compilation for Torch.Export-ed Models
```{warning}
AOTInductor and its related features are in prototype status and are
subject to backwards compatibility breaking changes.
```
AOTInductor is a specialized version of
[TorchInductor](https://dev-discuss.pytorch.org/t/torchinductor-a-pytorch-native-compiler-with-define-by-run-ir-and-symbolic-shapes/747),
designed to process exported PyTorch models, optimize them, and produce shared libraries as well
@ -73,6 +68,10 @@ with torch.no_grad():
# [Optional] Specify the generated shared library path. If not specified,
# the generated artifact is stored in your system temp directory.
package_path=os.path.join(os.getcwd(), "model.pt2"),
# [Optional] Specify Inductor configs
# This specific max_autotune option will turn on more extensive kernel autotuning for
# better performance.
inductor_configs={"max_autotune": True,},
)
```
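Loading and running the resulting `model.pt2` package could then look like the sketch below; the input shape, dtype, and device are placeholders for whatever the model was exported with:

```python
import os

import torch

# Load the compiled artifact produced by aoti_compile_and_package and call it
# like a regular function; no Python-side model definition is required.
compiled_model = torch._inductor.aoti_load_package(
    os.path.join(os.getcwd(), "model.pt2")
)
with torch.no_grad():
    # Placeholder input: match the shapes/dtypes/device used at export time.
    output = compiled_model(torch.randn(8, 10, device="cuda"))
```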

View File

@ -182,7 +182,6 @@ ignore = [
"SIM117",
"SIM118",
"UP007", # keep-runtime-typing
"UP038", # Was removed from newer versions, results in slower code
"UP045", # keep-runtime-typing
"TC006",
# TODO: Remove Python-3.10 specific suppressions

View File

@ -1,9 +1,7 @@
if(WIN32)
set(TORCH_PYTHON_IMPORTED_LOCATION "${PYTORCH_INSTALL_DIR}/lib/torch_python.lib")
elseif(APPLE)
set(TORCH_PYTHON_IMPORTED_LOCATION "${PYTORCH_INSTALL_DIR}/lib/libtorch_python.dylib")
set(TORCH_PYTHON_IMPORTED_LOCATION "${PYTORCH_INSTALL_DIR}/lib/${CMAKE_IMPORT_LIBRARY_PREFIX}torch_python${CMAKE_IMPORT_LIBRARY_SUFFIX}")
else()
set(TORCH_PYTHON_IMPORTED_LOCATION "${PYTORCH_INSTALL_DIR}/lib/libtorch_python.so")
set(TORCH_PYTHON_IMPORTED_LOCATION "${PYTORCH_INSTALL_DIR}/lib/${CMAKE_SHARED_LIBRARY_PREFIX}torch_python${CMAKE_SHARED_LIBRARY_SUFFIX}")
endif()
add_library(torch_python SHARED IMPORTED)

View File

@ -11,7 +11,12 @@ from typing import Union
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.distributed._composable import checkpoint
from torch.distributed._composable.replicate_with_fsdp import replicate
from torch.distributed.algorithms._checkpoint.checkpoint_wrapper import (
_CHECKPOINT_PREFIX,
apply_activation_checkpointing,
)
from torch.distributed.fsdp import CPUOffloadPolicy, FSDPModule, OffloadPolicy
from torch.distributed.tensor import DTensor, init_device_mesh
from torch.testing._internal.common_distributed import skip_if_lt_x_gpu
@ -652,5 +657,190 @@ class TestReplicate1DTrainingCore(FSDPTest):
self.assertEqual(ref_loss, loss)
class TestReplicateTrainingCompose(FSDPTest):
@property
def world_size(self) -> int:
# Since these tests run with a larger transformer model, they may see
# some numeric drift with >2 GPUs
return min(torch.get_device_module(device_type).device_count(), 2)
@skip_if_lt_x_gpu(2)
@compiled_fsdp_test(compile_compute_on_module=Transformer)
def test_train_parity_with_activation_checkpointing(self):
"""
Tests train parity against DDP when composing with activation
checkpointing.
"""
self.run_subtests(
{
"reshard_after_forward": [True, False],
"checkpoint_impl": ["composable", "utils", "wrapper"],
"module_grouping": ["block", "mem_eff", "mem_eff_weight_tied"],
"test_device_type": [device_type.type],
},
self._test_train_parity_with_activation_checkpointing,
)
def _test_train_parity_with_activation_checkpointing(
self,
reshard_after_forward: Union[bool, int],
checkpoint_impl: str,
module_grouping: str,
test_device_type: str,
):
assert checkpoint_impl in ("composable", "utils", "wrapper")
testing_compile = replicate != torch.distributed._composable.replicate_with_fsdp.replicate
if testing_compile and checkpoint_impl == "composable":
return
torch.manual_seed(42)
vocab_size = 1024
with torch.device(device_type):
model_args = ModelArgs(
n_layers=3,
n_heads=4,
vocab_size=vocab_size,
max_seq_len=64,
dropout_p=0,
checkpoint_activations=(checkpoint_impl == "utils"),
# For the mem-efficient module grouping, we separate the
# embeddings from the output projection; this grouping does not
# support weight tying
weight_tying=module_grouping != "mem_eff",
)
model = Transformer(model_args)
ref_model = copy.deepcopy(model).to(device_type)
ref_optim = torch.optim.Adam(ref_model.parameters(), lr=1e-2)
# Apply activation checkpointing
prefixes_to_ignore = ()
if checkpoint_impl == "wrapper":
prefixes_to_ignore = (_CHECKPOINT_PREFIX,)
apply_activation_checkpointing(
model, check_fn=lambda m: isinstance(m, TransformerBlock)
)
elif checkpoint_impl == "composable":
for module in model.modules():
if isinstance(module, TransformerBlock):
checkpoint(module)
# Apply Replicate
device_mesh = init_device_mesh(
test_device_type,
(self.world_size, 1),
mesh_dim_names=("replicate", "shard"),
)
fsdp_kwargs = {
"reshard_after_forward": reshard_after_forward,
"device_mesh": device_mesh,
}
if module_grouping == "mem_eff":
assert model_args.n_layers == 3
replicate(model.layers[0], **fsdp_kwargs)
replicate([model.layers[1], model.layers[2]], **fsdp_kwargs)
replicate([model.tok_embeddings, model.pos_embeddings], **fsdp_kwargs)
# Embedding weights are not needed for embedding backward
model.tok_embeddings.set_unshard_in_backward(False)
replicate([model.norm, model.output], **fsdp_kwargs)
elif module_grouping == "mem_eff_weight_tied":
replicate([model.tok_embeddings, model.output], **fsdp_kwargs)
for layer in model.layers:
replicate(layer, **fsdp_kwargs)
elif module_grouping == "block":
for layer in model.layers:
replicate(layer, **fsdp_kwargs)
else:
raise NotImplementedError(f"Unknown module grouping: {module_grouping}")
replicate(model, **fsdp_kwargs)
optim = torch.optim.Adam(model.parameters(), lr=1e-2)
torch.manual_seed(42 + self.rank)
# Reuse the same input across iterations to avoid loss explosion from
# trying to learn from random inputs
inp = torch.randint(0, vocab_size, (3, 64), device=device_type.type)
check_sharded_parity(
self, ref_model, model, prefixes_to_ignore=prefixes_to_ignore
)
for iter_idx in range(10):
losses: list[torch.Tensor] = []
for _model in (ref_model, model):
torch.manual_seed(iter_idx + 1) # for dropout determinism
losses.append(_model(inp).sum())
losses[-1].backward()
for param in ref_model.parameters():
if param.grad is not None:
dist.all_reduce(param.grad)
param.grad.div_(self.world_size)
if not testing_compile:
check_sharded_parity(
self, ref_model, model, prefixes_to_ignore=prefixes_to_ignore
)
self.assertEqual(losses[0], losses[1])
for _optim in (ref_optim, optim):
_optim.step()
_optim.zero_grad(set_to_none=(iter_idx % 2 == 0))
if not testing_compile:
check_sharded_parity(
self, ref_model, model, prefixes_to_ignore=prefixes_to_ignore
)
class TestReplicateSharedParams(FSDPTest):
@property
def world_size(self) -> int:
return min(4, torch.get_device_module(device_type).device_count())
@skip_if_lt_x_gpu(2)
def test_train_parity_with_shared_params(self):
self.run_subtests(
{
"reshard_after_forward": [False, True],
"use_activation_checkpointing": [False, True],
},
self._test_train_shared_params,
)
def _test_train_shared_params(
self,
reshard_after_forward: bool,
use_activation_checkpointing: bool,
):
torch.manual_seed(42)
model_args = ModelArgs(n_layers=3, dropout_p=0.0, weight_tying=True)
model = Transformer(model_args)
ref_model = copy.deepcopy(model).to(device_type)
ref_optim = torch.optim.Adam(ref_model.parameters(), lr=1e-2)
for module in model.modules():
if isinstance(module, TransformerBlock):
if use_activation_checkpointing:
checkpoint(module)
replicate(module, reshard_after_forward=reshard_after_forward)
replicate(model, reshard_after_forward=reshard_after_forward)
optim = torch.optim.Adam(model.parameters(), lr=1e-2)
torch.manual_seed(42 + self.rank + 1)
for iter_idx in range(10):
inp = torch.randint(
0, model_args.vocab_size, (2, 16), device=device_type.type
)
losses: list[torch.Tensor] = []
for _model in (ref_model, model):
losses.append(_model(inp).sum())
losses[-1].backward()
for param in ref_model.parameters():
if param.grad is not None:
dist.all_reduce(param.grad)
param.grad.div_(self.world_size)
for _optim in (ref_optim, optim):
_optim.zero_grad(set_to_none=(iter_idx % 2 == 0))
_optim.step()
self.assertEqual(losses[0], losses[1])
if __name__ == "__main__":
run_tests()
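For reference, a minimal standalone sketch of the "wrapper" checkpointing branch exercised above, using only `apply_activation_checkpointing` (no process group or `replicate` step; the `Block` class and sizes here are made up for illustration):

```python
import torch
import torch.nn as nn
from torch.distributed.algorithms._checkpoint.checkpoint_wrapper import (
    apply_activation_checkpointing,
)


class Block(nn.Sequential):
    """Stand-in for TransformerBlock so check_fn has a type to match on."""


model = nn.Sequential(*[Block(nn.Linear(32, 32), nn.ReLU()) for _ in range(3)])

# Wrap every Block in a CheckpointWrapper, mirroring
# check_fn=lambda m: isinstance(m, TransformerBlock) in the test above.
apply_activation_checkpointing(model, check_fn=lambda m: isinstance(m, Block))

loss = model(torch.randn(8, 32, requires_grad=True)).sum()
loss.backward()  # activations inside each Block are recomputed here
```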

View File

@ -1,4 +1,4 @@
# Owner(s): ["module: unknown"]
# Owner(s): ["oncall: distributed"]
import unittest
import torch

View File

@ -1,4 +1,4 @@
# Owner(s): ["module: unknown"]
# Owner(s): ["module: fsdp"]
import functools
import gc
from typing import Union

View File

@ -1,4 +1,4 @@
# Owner(s): ["module: unknown"]
# Owner(s): ["oncall: distributed"]
import gc
import unittest

View File

@ -1,4 +1,4 @@
# Owner(s): ["module: unknown"]
# Owner(s): ["oncall: distributed"]
from copy import copy

View File

@ -1,4 +1,4 @@
# Owner(s): ["module: unknown"]
# Owner(s): ["oncall: distributed"]
import unittest
from dataclasses import dataclass
from typing import Any, Callable, cast, Union

View File

@ -1,4 +1,4 @@
# Owner(s): ["module: unknown"]
# Owner(s): ["oncall: distributed"]
import unittest
import torch

View File

@ -1,4 +1,4 @@
# Owner(s): ["module: unknown"]
# Owner(s): ["oncall: distributed"]
import copy
import unittest

View File

@ -248,6 +248,16 @@ class TestDTensorDebugMode(TestCase):
"redistribute_input(1, [S(0)] -> [R])" in debug_mode.debug_string()
)
def test_debug_mode_higher_order_cond(self):
"""Test DebugMode with higher order operation."""
x = torch.randn(1, 8, requires_grad=True)
with DebugMode(record_torchfunction=True) as debug_mode:
torch.cond(torch.tensor(True), lambda x: x + 1, lambda x: x - 1, [x])
# Verify that cond operations are captured in debug mode
self.assertIn("torch.ops.higher_order.cond", debug_mode.debug_string())
instantiate_parametrized_tests(TestDTensorDebugMode)

Some files were not shown because too many files have changed in this diff.