Compare commits

497 Commits

Author SHA1 Message Date
f1f9683409 [export] Preserve nn_module_stack for aliased nn modules 2025-09-30 18:02:43 -07:00
ad7e3c93b1 [ROCm][CD] librocroller.so missing from ROCm 7 wheel (#164244)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164244
Approved by: https://github.com/jeffdaily, https://github.com/Skylion007

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-10-01 00:02:34 +00:00
7f3dc45300 Migrate DeviceType to torch/headeronly (#163999)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163999
Approved by: https://github.com/mikaylagawarecki
2025-09-30 23:13:27 +00:00
ff715366aa [vllm hash update] update the pinned vllm hash (#164190)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned vllm hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164190
Approved by: https://github.com/pytorchbot
2025-09-30 22:43:49 +00:00
60a4961ff4 [DTensor] Allow redistribute to Partial if src matches (#164253)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164253
Approved by: https://github.com/zpcore
2025-09-30 22:42:49 +00:00
bec6541d84 [CUDA][CUDAGraph] Reduce capture overhead in CUDA Graph memory reuse (#162186)
Previous work #158352 delivered CUDAGraph memory footprint reduction with no replay-time impact, but capture time regressed (up to 20× slower) due to repeated full-graph traversals. See previous benchmark results [here](https://github.com/pytorch/pytorch/pull/158352#issuecomment-3215947565)

This PR removes capture/replay overhead while preserving the memory savings:

1. **Terminals as free markers**
   We stop inserting empty nodes and instead record the current stream terminals as free markers. This avoids mutating the user’s graph and keeps semantics unchanged.

2. **Incremental, cached reachability**
   We add a **per-graph reuse context** that caches reverse-traversal state:

   * `graph_reuse_context[graph].visited[stream]` tracks nodes already seen from that stream’s terminal frontier.
   * On each allocation during capture, we resume traversal from the latest terminals and only visit unseen nodes.
   * A block is freed when all its recorded markers are in the visited set of its allocation stream—i.e., all markers are proven predecessors of future work.
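
   A minimal Python sketch of the cached-reachability bookkeeping described above (illustrative only; the actual logic lives in the C++ CUDACachingAllocator, and all names here are assumptions):

```python
from collections import defaultdict

class GraphReuseContext:
    """Per-graph cache of reverse-traversal state, keyed by allocation stream."""

    def __init__(self):
        self.visited = defaultdict(set)  # stream -> nodes proven to precede future work

    def advance(self, stream, terminals, predecessors):
        # Resume traversal from the latest terminals, visiting only unseen nodes.
        stack = [t for t in terminals if t not in self.visited[stream]]
        while stack:
            node = stack.pop()
            if node in self.visited[stream]:
                continue
            self.visited[stream].add(node)
            stack.extend(predecessors(node))

    def block_is_reusable(self, stream, free_markers):
        # All recorded markers must be in the visited set of the allocation
        # stream, i.e. every marker is a proven predecessor of future work.
        return all(marker in self.visited[stream] for marker in free_markers)
```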

See [the performance results here](https://docs.google.com/spreadsheets/d/e/2PACX-1vRPvdd9Xa8W87ixbiA0da_qvOhrUAjUpFz0G-_j-MsDnoeRyhEa4_ut_W3rqcg1VVZVFJ-gucwov-3b/pubhtml?gid=1468302443&single=true), we sweep synthetic multi-stream CUDA Graphs built by `capture_benchmark.py` (same as before, we generate random interleaving of alloc/free/join with given probabilities, see [gist here](https://gist.github.com/eee4017/e2092d215b1d4bd46534148939af39e3)), and we compare median capture/replay times and memory. On an NVIDIA H100 PCIe across 24 configs, the optimization preserves reserved memory reduction at ~24–98%, leaves allocated memory unchanged, and brings capture time back to baseline (range 0.96–1.04× vs. baseline) with replay time unchanged (range 0.97–1.11×).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162186
Approved by: https://github.com/eqy, https://github.com/ngimel
2025-09-30 22:28:46 +00:00
1f1de20ba9 [c10d][BE][ez] Update tensor ptr inside nccl.cpp (#164276)
This is mostly a cosmetic change which replaces the deprecated `data_ptr` API with the mutable or const one.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164276
Approved by: https://github.com/Skylion007, https://github.com/eqy, https://github.com/kwen2501
2025-09-30 22:05:12 +00:00
2810977d3a [FSDP][Replicate] tests replicate type casting behavior and edge cases in mixed precision (#162861)
**Summary:** Ensures that replicate can handle the same type casting behavior and edge cases that fully shard can when mixed precision is used

**Test Cases**
1. pytest test/distributed/_composable/test_replicate_mixed_precision.py -k test_float16_on_one_submodule
2. pytest test/distributed/_composable/test_replicate_mixed_precision.py -k test_submodules_with_external_inputs
3. pytest test/distributed/_composable/test_replicate_mixed_precision.py -k test_norm_modules_bf16
4. pytest test/distributed/_composable/test_replicate_mixed_precision.py -k test_norm_modules_fp16
5. pytest test/distributed/_composable/test_replicate_mixed_precision.py -k test_clamp_reduce_dtype
6. pytest test/distributed/_composable/test_replicate_mixed_precision.py -k test_dataclass_input

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162861
Approved by: https://github.com/mori360
ghstack dependencies: #162830, #162836, #162839, #162851, #162853, #162855
2025-09-30 22:03:23 +00:00
ae4fd4ea75 [FSDP2] support AC(FSDP) for torchtitan's MOE (#164009)
for fsdp2 + EP, titan has fully_shard(AC(layer)) and fully_shard(layer.moe.experts): https://github.com/pytorch/torchtitan/issues/1624
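
A rough sketch of that wrapping pattern (assuming the FSDP2 `fully_shard` API and the composable `checkpoint_wrapper`; the real setup lives in torchtitan):

```python
import torch.nn as nn
from torch.distributed.algorithms._checkpoint.checkpoint_wrapper import checkpoint_wrapper
from torch.distributed.fsdp import fully_shard

def shard_moe_layer(layer: nn.Module) -> nn.Module:
    # Shard the experts separately so they can be unsharded/resharded independently...
    fully_shard(layer.moe.experts)
    # ...then wrap the whole layer in activation checkpointing and shard it,
    # i.e. fully_shard(AC(layer)) as described above.
    layer = checkpoint_wrapper(layer)
    fully_shard(layer)
    return layer
```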

for implicit prefetching, backward order is
* _pre_backward unshard (norm, output)
* _backward_prefetch unshard layers.6
* post_backward reshard (norm, output)
* _pre_backward unshard layers.6 (no-op, unsharded already)
* _backward_prefetch unshard layers.6.moe.experts
* recompute_fn pre_forward unshard layers.6.moe.experts (no-op, unsharded already)
* ~~recompute_fn post_forward reshard layers.6.moe.experts~~ <----- this PR makes it a no-op
* _pre_backward unshard layers.6.moe.experts (no-op, unsharded already)
* _backward_prefetch unshard layers.5
* post_backward reshard layers.6.moe.experts
* post_backward reshard layers.6

unit test: `pytest -s test/distributed/_composable/fsdp/test_fully_shard_comm.py -k test_set_modules_to_backward_prefetch_inside_ac`

before fix: `NGPU=4 CONFIG_FILE="./torchtitan/models/deepseek_v3/train_configs/deepseek_v3_16b.toml" ./run_train.sh --parallelism.expert_parallel_degree=2`
```
[rank0]:[titan] 2025-09-30 11:43:01,714 - root - INFO - step:  1  loss: 12.0162  grad_norm:  1.7315  memory: 45.64GiB(48.05%)  tps: 1,028  tflops: 10.87  mfu: 1.10%
[rank0]:[titan] 2025-09-30 11:43:01,714 - root - INFO - Synchronizing and adjusting timeout for all ProcessGroups to 0:01:40
[rank0]:[titan] 2025-09-30 11:43:35,233 - root - INFO - [GC] Performing periodical GC collection 0.06 seconds
[rank0]:[titan] 2025-09-30 11:43:35,987 - root - INFO - step: 50  loss:  6.9302  grad_norm:  0.9985  memory: 59.66GiB(62.80%)  tps: 11,712  tflops: 123.89  mfu: 12.53%
```

after fix: `NGPU=4 CONFIG_FILE="./torchtitan/models/deepseek_v3/train_configs/deepseek_v3_16b.toml" ./run_train.sh --parallelism.expert_parallel_degree=2`
```
[rank0]:[titan] 2025-09-30 11:38:57,377 - root - INFO - step:  1  loss: 12.0134  grad_norm:  1.6916  memory: 38.42GiB(40.45%)  tps: 805  tflops: 8.51  mfu: 0.86%
[rank0]:[titan] 2025-09-30 11:38:57,377 - root - INFO - Synchronizing and adjusting timeout for all ProcessGroups to 0:01:40
[rank0]:[titan] 2025-09-30 11:39:28,541 - root - INFO - [GC] Performing periodical GC collection 0.06 seconds
[rank0]:[titan] 2025-09-30 11:39:29,279 - root - INFO - step: 50  loss:  6.9346  grad_norm:  1.1875  memory: 52.58GiB(55.36%)  tps: 12,583  tflops: 133.10  mfu: 13.46%
```

for explicit prefetching, layers.6 backward prefetches layers.5 and layers.5.moe.experts; layers.6.moe.experts does not have an explicit prefetch (a sketch of this setup follows the list below). The backward order looks like this:
* _pre_backward unshard (norm, output)
* _prefetch_unshard layers.6
* post_backward reshard (norm, output)
* _pre_backward unshard layers.6 (no-op, unsharded already)
* _prefetch_unshard layers.5
* _prefetch_unshard layers.5.moe.experts
* recompute_fn pre_forward unshard layers.6.moe.experts
* ~~recompute_fn post_forward reshard layers.6.moe.experts~~ <----- this PR makes it a no-op
* _pre_backward unshard layers.6.moe.expert (no-op, unsharded already)
* post_backward reshard layers.6.moe.expert
* post_backward reshard layers.6
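
A hedged sketch of the explicit-prefetch setup described above, using FSDP2's `set_modules_to_backward_prefetch` (module names are illustrative):

```python
import torch.nn as nn

def set_explicit_prefetch(layer6: nn.Module, layer5: nn.Module) -> None:
    # Both layers (and layer5.moe.experts) are assumed to already be wrapped
    # with fully_shard, so they expose set_modules_to_backward_prefetch.
    # During layers.6's backward, prefetch (unshard) layers.5 and its experts.
    layer6.set_modules_to_backward_prefetch([layer5, layer5.moe.experts])
```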

before fix: `NGPU=4 CONFIG_FILE="./torchtitan/models/deepseek_v3/train_configs/deepseek_v3_16b.toml" ./run_train.sh --parallelism.expert_parallel_degree=2`
```
[rank0]:[titan] 2025-09-30 11:53:24,574 - root - INFO - step:  1  loss: 12.0180  grad_norm:  1.6948  memory: 45.77GiB(48.18%)  tps: 849  tflops: 8.98  mfu: 0.91%
[rank0]:[titan] 2025-09-30 11:53:24,574 - root - INFO - Synchronizing and adjusting timeout for all ProcessGroups to 0:01:40
[rank0]:[titan] 2025-09-30 11:53:57,768 - root - INFO - [GC] Performing periodical GC collection 0.07 seconds
[rank0]:[titan] 2025-09-30 11:53:58,515 - root - INFO - step: 50  loss:  6.9358  grad_norm:  1.0528  memory: 59.80GiB(62.95%)  tps: 11,827  tflops: 125.10  mfu: 12.65%
```

after fix: `NGPU=4 CONFIG_FILE="./torchtitan/models/deepseek_v3/train_configs/deepseek_v3_16b.toml" ./run_train.sh --parallelism.expert_parallel_degree=2`
```
[rank0]:[titan] 2025-09-30 12:08:39,404 - root - INFO - step:  1  loss: 12.0143  grad_norm:  1.7030  memory: 38.55GiB(40.58%)  tps: 988  tflops: 10.45  mfu: 1.06%
[rank0]:[titan] 2025-09-30 12:08:39,404 - root - INFO - Synchronizing and adjusting timeout for all ProcessGroups to 0:01:40
[rank0]:[titan] 2025-09-30 12:09:10,482 - root - INFO - [GC] Performing periodical GC collection 0.06 seconds
[rank0]:[titan] 2025-09-30 12:09:11,168 - root - INFO - step: 50  loss:  6.9356  grad_norm:  0.9911  memory: 52.81GiB(55.59%)  tps: 12,637  tflops: 133.68  mfu: 13.52%
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164009
Approved by: https://github.com/soulitzer
2025-09-30 22:02:24 +00:00
adc11a7634 [export] avoid checks during tracing of export verification (#164219)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164219
Approved by: https://github.com/Lucaskabela
2025-09-30 21:46:59 +00:00
99e28ffab3 [FSDP][Replicate] tests replicate core functionality with mixed precision (#162855)
**Summary:** Ensures that replicate functionality works the same as fully shard's when mixed precision is used

**Test Cases**
1. pytest test/distributed/_composable/test_replicate_mixed_precision.py -k TestReplicateMixedPrecisionTraining

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162855
Approved by: https://github.com/mori360
ghstack dependencies: #162830, #162836, #162839, #162851, #162853
2025-09-30 21:45:58 +00:00
01dd2c2b42 [FSDP][Replicate] tests replicate is composable with tp (#162853)
**Summary:** Proof that the new replicate API is composable with TP

**Test Case**
1. pytest test/distributed/_composable/test_replicate_training.py -k test_replicate_tp

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162853
Approved by: https://github.com/mori360
ghstack dependencies: #162830, #162836, #162839, #162851
2025-09-30 21:29:54 +00:00
d3bdf8c32e [FSDP][Replicate] tests replicate with custom forward method (#162851)
**Summary:** Tests that replicate works when users use custom forward methods

**Test Cases**
1. pytest test/distributed/_composable/test_replicate_training.py -k test_register_fsdp_forward_method

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162851
Approved by: https://github.com/mori360
ghstack dependencies: #162830, #162836, #162839
2025-09-30 21:15:34 +00:00
1ce9563ff6 [FSDP][Replicate] tests replicate gradient accumulation and 1f1b microbatching (#162839)
**Summary:** In order to ensure that replicate acts as intended (a specialized version of hsdp) we need to make sure that it can pass the same tests that fully_shard can for training. The first test verifies Replicate works with gradient accumulation properly. The second verifies that replicate works correctly with a One-Forward-One-Backward (1F1B) pipeline parallelism schedule

**Test Cases**
1. pytest test/distributed/_composable/test_replicate_training.py -k test_gradient_accumulation
2. pytest test/distributed/_composable/test_replicate_training.py -k test_1f1b_microbatching

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162839
Approved by: https://github.com/mori360
ghstack dependencies: #162830, #162836
2025-09-30 21:00:16 +00:00
9e631392dc Missing lambda in torch._check (#164225)
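For context, `torch._check` expects its message argument to be a callable, so a lazily evaluated lambda is the right form (an illustrative example, not the exact call site fixed here):

```python
import torch

def check_2d(x: torch.Tensor) -> None:
    # Passing a plain (f-)string here is incorrect: torch._check wants a
    # callable that builds the message only when the check actually fails.
    torch._check(x.dim() == 2, lambda: f"expected a 2D tensor, got {x.dim()}D")

check_2d(torch.randn(3, 4))
```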
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164225
Approved by: https://github.com/Skylion007
2025-09-30 20:32:38 +00:00
1cce6efdb8 Fix silent incorrectness for bmm/baddmm out_dtype overload (#164095)
Add input checks like meta functions for standard ops in `ATen/native/LinearAlgebra.cpp` for the `out_dtype` variants. Fixes silent incorrectness in https://github.com/pytorch/pytorch/issues/163816

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164095
Approved by: https://github.com/ngimel
2025-09-30 20:13:13 +00:00
5a93f00c79 [CI] Delete binary smoke workflows (#164260)
Those were very useful in the past, because:
- CI builder jobs did not generate wheels, but rather ran `python setup.py develop` and shared docker layers; this is no longer the case, and all CI jobs now produce wheels
- CD jobs were targeting pre-CXX11 ABI, but this is no longer the case after manylinux2_28 migration

Existing, but acceptable gaps:
 - Windows libtorch debug builds sometimes might fail, but IMO it's ok not to be able to produce those for a few days, as the number of libtorch users is somewhat small
 - All CD jobs are based on AlmaLinux, while CI jobs are based on Ubuntu, but this could be adjusted if needed; besides, AlmaLinux-9 and Ubuntu-22.04 are pretty close in terms of glibc and gcc versions
 - CD jobs build for all GPU architectures, while CI builds only for the one being tested, but there are now periodic H100 and B200 jobs, and not a lot of development happens for Voltas or Pascals

Besides, there are better tools to alert about nightly failures.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164260
Approved by: https://github.com/seemethere, https://github.com/atalman
2025-09-30 20:00:07 +00:00
e30f01b5b5 [1/N] Simplify "in" operation for containers of a single item (#164224)
These issues are detected by ruff [FURB171](https://docs.astral.sh/ruff/rules/single-item-membership-test/#single-item-membership-test-furb171).
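
Typical shape of the FURB171 cleanup (illustrative, not a specific call site from this PR):

```python
device = "cuda"

# Before: a single-item membership test.
if device in ("cuda",):
    pass

# After: a plain equality check.
if device == "cuda":
    pass
```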

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164224
Approved by: https://github.com/rec, https://github.com/Skylion007
2025-09-30 19:59:43 +00:00
ffc645c870 half support for fused_moving_avg_obs_fake_quant() op (#164175)
Follow-up to https://github.com/pytorch/pytorch/pull/162620. Add half support as well. This fixes some failures in inductor benchmarks, such as those in this log https://github.com/pytorch/pytorch/actions/runs/18051942373/job/51376749459.

`NotImplementedError: "aminmax_kernel" not implemented for 'Half'`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164175
Approved by: https://github.com/malfet, https://github.com/jerryzh168
2025-09-30 19:35:17 +00:00
60f0a356fd Update persons of interest for XLA. The previous one is out of date. (#158652)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158652
Approved by: https://github.com/JackCaoG, https://github.com/albanD
2025-09-30 19:21:18 +00:00
d2c5f231f6 Fix the shape check inside gnll loss (#147522)
Fixes #147521
This modification allows users to put any size of var in GaussianNLLLoss, provided the var is broadcastable (to input/target's size).

Therefore, the demo code in #147521 will result in the expected behaviour and correct output.

This allows all input sizes that match:
`input.size = (..., n, ...), var.size = (..., 1, ...)`
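
A small illustrative example of the now-accepted broadcastable var (shapes chosen to match the pattern above, not taken from the test suite):

```python
import torch
import torch.nn as nn

loss = nn.GaussianNLLLoss()
input = torch.randn(4, 3, 10, requires_grad=True)
target = torch.randn(4, 3, 10)
var = torch.ones(4, 1, 10)  # size-1 interior dim, broadcastable to input/target

out = loss(input, target, var)  # rejected by the shape check before this fix
out.backward()
```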

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147522
Approved by: https://github.com/mikaylagawarecki
2025-09-30 18:40:15 +00:00
cc5d74c366 Revert "[BE] Remove HermeticPyObjectTLS and Simplify PythonOpRegistrationTrampoline (#163464)"
This reverts commit 94195a37ae4eae9c486a81b0f67725c8970f74d6.

Reverted https://github.com/pytorch/pytorch/pull/163464 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/163464#issuecomment-3353307034))
2025-09-30 18:20:20 +00:00
a707042353 fix: inductor non_blocking test - warmup events to make test pass whether it is the first run or not (#164188)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164188
Approved by: https://github.com/williamwen42
2025-09-30 18:20:17 +00:00
d615f6b935 [inductor] use hint_override in kernel benchmark args (#164207)
Summary: forward fix T239259207

Test Plan: test_multi_kernel

Differential Revision: D83539263

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164207
Approved by: https://github.com/bobrenjc93, https://github.com/mlazos
2025-09-30 18:09:29 +00:00
719b64ee8b Fix TMA transpose logic to handle 1D shapes + string differences (#163966)
Fixes #163702.

This fixes 2 issues:
1. The value may inconsistently be a shape or string. This normalizes to handle both of these.
2. 1D shapes should not transpose data. This fixes the order of operations to prevent this.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163966
Approved by: https://github.com/eellison
2025-09-30 17:51:37 +00:00
1cf1b9138d [inductor][templates] Template hooks should be finalised inside a kernel context (#164229)
The prologue buffer introduced in https://github.com/pytorch/pytorch/pull/160480 is added to the template code in the DEF_KERNEL [hook](29221b9828/torch/_inductor/select_algorithm.py (L742)). The lines in this buffer may be of type `DeferredLine`, and so require the correct kernel context to determine whether they should be added or removed.

Test plan:

Tested with a custom template using tensor descriptors for prologue fused inputs, whose tensor descriptors need to be hoisted to the top of the kernel.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164229
Approved by: https://github.com/njriasan
2025-09-30 17:50:59 +00:00
5ed4672477 [dynamo, 3.14] fix _detect_and_normalize_assert_statement for 3.14 (#164005)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164005
Approved by: https://github.com/anijain2305
ghstack dependencies: #161838, #161555, #161839, #163009, #163109, #163110, #163191, #163292, #163796, #163818, #163919, #163920, #164004
2025-09-30 17:43:03 +00:00
2600f8b3d1 [dynamo, 3.14] fix tracing typing.Union (#164004)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164004
Approved by: https://github.com/anijain2305, https://github.com/mlazos
ghstack dependencies: #161838, #161555, #161839, #163009, #163109, #163110, #163191, #163292, #163796, #163818, #163919, #163920
2025-09-30 17:43:03 +00:00
9ce31e4278 [3.14] make unbacked_sym[int/float]_counter integers (#163920)
3.14 removed copy/deepcopy/pickle support for `itertools` iterators: https://docs.python.org/3.14/whatsnew/3.14.html#itertools

Change unbacked_sym[int/float]_counter from `itertools.count` to regular integers.
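
A minimal illustration of the change (assumed shape of the code; the actual fields live in PyTorch's symbolic-shapes state):

```python
import itertools

# Before: iterator-based counter; copy/deepcopy/pickle of itertools.count
# is no longer supported on Python 3.14.
counter_iter = itertools.count()
u0 = next(counter_iter)

# After: a plain integer counter, trivially copyable and picklable.
counter_int = 0
u1, counter_int = counter_int, counter_int + 1
```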

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163920
Approved by: https://github.com/ezyang
ghstack dependencies: #161838, #161555, #161839, #163009, #163109, #163110, #163191, #163292, #163796, #163818, #163919
2025-09-30 17:42:55 +00:00
0657de9c61 [dynamo, 3.14] support LOAD_COMMON_CONSTANT (#163919)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163919
Approved by: https://github.com/anijain2305, https://github.com/mlazos
ghstack dependencies: #161838, #161555, #161839, #163009, #163109, #163110, #163191, #163292, #163796, #163818
2025-09-30 17:42:47 +00:00
4ead8ebf70 [dynamo, 3.14] fix BUILD_TUPLE with 0 args (#163818)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163818
Approved by: https://github.com/anijain2305
ghstack dependencies: #161838, #161555, #161839, #163009, #163109, #163110, #163191, #163292, #163796
2025-09-30 17:42:40 +00:00
d4b785a6a7 [dynamo, 3.14] fix stack ref copy error (#163796)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163796
Approved by: https://github.com/anijain2305
ghstack dependencies: #161838, #161555, #161839, #163009, #163109, #163110, #163191, #163292
2025-09-30 17:42:33 +00:00
9278b18ec0 [dynamo, 3.14] fix WITH_EXCEPT_START (#163292)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163292
Approved by: https://github.com/anijain2305
ghstack dependencies: #161838, #161555, #161839, #163009, #163109, #163110, #163191
2025-09-30 17:42:26 +00:00
008b0a9425 [dynamo, 3.14] fix inactive ctx handling in resume functions (#163191)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163191
Approved by: https://github.com/anijain2305
ghstack dependencies: #161838, #161555, #161839, #163009, #163109, #163110
2025-09-30 17:42:19 +00:00
44677ad917 [dynamo, 3.14] support LOAD_CONST on slice, codegen LOAD_CONST slice instead of BINARY/STORE_SLICE (#163110)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163110
Approved by: https://github.com/anijain2305
ghstack dependencies: #161838, #161555, #161839, #163009, #163109
2025-09-30 17:42:11 +00:00
1c9987fdf4 [dynamo, 3.14] fix context managers (#163109)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163109
Approved by: https://github.com/anijain2305, https://github.com/mlazos
ghstack dependencies: #161838, #161555, #161839, #163009
2025-09-30 17:42:03 +00:00
7cbc011700 [dynamo, 3.14] support some bytecodes, fix CALL_FUNCTION_EX (#163009)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163009
Approved by: https://github.com/anijain2305
ghstack dependencies: #161838, #161555, #161839
2025-09-30 17:41:56 +00:00
09c774145e [dynamo, 3.14] Python dynamo changes to get basic programs working (#161839)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161839
Approved by: https://github.com/Lucaskabela, https://github.com/anijain2305
ghstack dependencies: #161838, #161555
2025-09-30 17:41:49 +00:00
763ab2a6ed [dynamo, 3.14] compile actual code in C dynamo (#161555)
No 3.14 CI tests are enabled yet, but this was enough to get Dynamo compiling locally, with Python Dynamo at least being called.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161555
Approved by: https://github.com/anijain2305
ghstack dependencies: #161838
2025-09-30 17:41:42 +00:00
4b8fe795f8 [dynamo] format cpython_defs.c (#161838)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161838
Approved by: https://github.com/Skylion007, https://github.com/anijain2305
2025-09-30 17:41:35 +00:00
84e1cd7392 [inductor] fx comm overlap: align runtime estimations across dist ranks (#164226)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164226
Approved by: https://github.com/eellison
2025-09-30 17:29:18 +00:00
937869657e Exporting aten.sdpa with cuda under fake mode on a cuda-less machine (#164162)
Summary:
As titled.

sdpa selects a backend based on a hardware check, which fails when exporting with cuda under fake mode on a cuda-less machine.

We guard the `at::cuda::getCurrentDeviceProperties()` call with an `at::cuda::is_available()` check and emit warnings.

Test Plan: buck2 run mode/dev-nosan caffe2/test:test_export -- -r nn_functional_scaled_dot_product_attention

Differential Revision: D83496154

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164162
Approved by: https://github.com/SherlockNoMad
2025-09-30 17:21:31 +00:00
7d7ae4d7b2 [submodule] upgrade cutlass version to 4.2.1 and completely resolved python/cutlass name collision (#164156)
Differential Revision: D83489362

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164156
Approved by: https://github.com/Skylion007, https://github.com/mlazos
2025-09-30 17:04:57 +00:00
906fe7b120 [ROCm][CI] no longer build almalinux image for ROCm 6.3 (#164201)
Missed during ROCm 7 upgrades.  We only build N and N-1.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164201
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-09-30 16:59:31 +00:00
7edd18f0fd [Inductor-FX] Generalize FloorDiv conversion to handle more complex launch grids. Remove python_slow grid mode. (#163828)
# Problem
Inductor's FX backend receives sympy expressions for Triton launch grids, and passes these to a tracer to generate equivalent FX IR. However, the tracer does not support all possible sympy expressions. In particular, it can't handle ops like `floor` and `Pow` which would be found in an expression like `floor(x / y)`. Instead, it expects `FloorDiv(x, y)`, which has the advantage that all intermediate values are integers, unlike `x / y`.

Inductor's Python backend uses a trick where `ceil(x / y)` is computed in Python as `-(x // -y)`, which is faster when evaluating Python launch grids at runtime. However, this trick generates more complex sympy expressions, so the FX backend introduced a `"python_slow"` mode using a more familiar form of ceil division. Unfortunately, that mode is slower to evaluate, which increased production CPU usage. (Internal reviewers see T237853632.)

# Solution
To get the best of both worlds, this PR removes `"python_slow"` mode, and generalizes the `replace_floor_div` function  to handle the more complex expressions resulting from the `"python"` grid mode. The new algorithm is conceptually similar to the existing one, except instead of analyzing only the first argument to a `sympy.Mul` op, it checks all factors, so it can handle expressions containing both `Rational` and `Pow` ops, among other cases. It also uses `Mul.make_args` to handle the case when the argument to `floor` is not a `Mul`. Finally, it uses `expr.is_positive` to check the sign of symbolic exponents.

This new algorithm is guaranteed to convert all `floor` ops to an equivalent expression using `FloorDiv`. (To see this, consider that `floor(x) == FloorDiv(x, 1)`.) Note it may not remove all `Pow` ops, with a counterexample being `floor(x / (2 + z ** y))`, but it covers everything we've seen in practice for symbolic launch grids. In particular, it covers the typical case where `Pow` is a factor of the argument to `floor`, and the exponent is `-1`. In this situation, we move the `Pow` to the denominator of `FloorDiv` and the exponent becomes `1`, eliminating the `Pow` op.
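
A condensed sketch of the factor-splitting idea (assumptions: sympy and `torch.utils._sympy.functions.FloorDiv`; this is not the actual `replace_floor_div` implementation):

```python
import sympy
from torch.utils._sympy.functions import FloorDiv

def floor_to_floordiv(expr: sympy.Expr) -> sympy.Expr:
    def rewrite(f: sympy.Expr) -> sympy.Expr:
        numer, denom = [], []
        # Mul.make_args also covers the case where the argument is not a Mul.
        for factor in sympy.Mul.make_args(f.args[0]):
            if isinstance(factor, sympy.Rational) and factor.q != 1:
                numer.append(sympy.Integer(factor.p))
                denom.append(sympy.Integer(factor.q))
            elif isinstance(factor, sympy.Pow) and factor.exp.is_negative:
                denom.append(sympy.Pow(factor.base, -factor.exp))
            else:
                numer.append(factor)
        # floor(x) == FloorDiv(x, 1) handles the no-denominator case.
        return FloorDiv(sympy.Mul(*numer), sympy.Mul(*denom) if denom else sympy.Integer(1))

    return expr.replace(lambda e: isinstance(e, sympy.floor), rewrite)

x, y = sympy.symbols("x y", positive=True, integer=True)
print(floor_to_floordiv(-sympy.floor(x / -y)))  # the "-(x // -y)" ceil-division trick
```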

# Test plan
This PR adds an end-to-end test for static padding with dynamic outer dimensions, which creates a difficult sympy expression that the existing algorithm would not be able to handle.

This PR also adds some unit tests for the `replace_floor_div` function. It can be difficult to construct end-to-end tests that expose all the trickiest expressions, as those tests have to pass through a number of other systems handling dynamic shapes. Therefore, it's easier to expose the edge cases with these new unit tests. The tests check that we can replace all `floor` ops in the input expression with `FloorDiv`, then they expand `FloorDiv` back to `floor` and check equality with the original expression.

Note this PR also requires some MTIA changes to pass internal tests. Those will be stacked onto the imported diff.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163828
Approved by: https://github.com/nandesuka, https://github.com/angelayi, https://github.com/jansel
2025-09-30 16:47:49 +00:00
3564cd294c Fix TestExportOpInfo (#164184)
Fixes https://github.com/pytorch/pytorch/issues/163699

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164184
Approved by: https://github.com/yiming0416, https://github.com/tugsbayasgalan
2025-09-30 16:12:39 +00:00
1412a4a42f [precompile] Add option to disable guard check on aot-compiled function. (#163432)
Summary:
Under some circumstances it seems reasonable to return a callable directly, without a guard check, when the user uses aot_compile on a function with a single compilation result.

When there are multiple entries (aot_compile_module), we should start enabling guard checks to differentiate the compiled functions.

Test Plan: CI

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163432
Approved by: https://github.com/dolpm, https://github.com/mlazos
2025-09-30 16:10:15 +00:00
96330f490d [testing] Add upload for test status during test stat uploads (#164189)
Add test status (flaky, success, skipped, failure) upload for easier comparison between test status on two commits

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164189
Approved by: https://github.com/huydhn, https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2025-09-30 15:53:53 +00:00
66abba8f49 [CUDA][Expandable Segments] Follow-up cleanups for even more expandable segments tests (#163297)
Gets the original setting even earlier in case of crashes, and fixes a previous `get` call where a `set` should have been used.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163297
Approved by: https://github.com/Skylion007
2025-09-30 15:39:14 +00:00
e88cca0691 Update Sphinx theme (#164147)
Fix links in the top nav bar: 71e55749be

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164147
Approved by: https://github.com/albanD
2025-09-30 15:35:58 +00:00
5c020beba4 Update LPPool docs to clarify ceil_mode padding semantics when ceil_mode=True (#163186)
# Summary

- Add a note to each `nn.LPPool*d` docstring explaining how `ceil_mode=True` interacts with right padding.
- Mirror the same clarification in the `torch.nn.functional.lp_pool*` docstrings so the rendered functional docs stay in sync.

# Motivation

The current PyTorch spec for **LPPool** does not fully match runtime behavior, which has led to downstream confusion in other specs (e.g., ONNX) and runtimes (e.g., [onnxruntime issue #25848](https://github.com/microsoft/onnxruntime/issues/25848)). A corresponding clarification was also made in the ONNX spec: [onnx/onnx#5741](https://github.com/onnx/onnx/pull/5741).

PyTorch’s **LPPool** implementation calls into **AvgPool**, which enforces the rule that windows starting entirely in the right padded region are ignored when `ceil_mode=True`. As a result, **LPPool** inherits the same behavior.

This is an edge case where the output size formula shown in the LPPool docs/spec is not sufficient on its own. Without the added caveat, the documentation is technically incorrect. This PR brings the LPPool docs in line with actual behavior.

Note that this is a trivial fix to the spec as all major implementers of the spec adhere to this caveat.

For comparison, both **MaxPool** and **AvgPool** already include this clarification in their spec. Their docstrings explicitly state:

> *When `ceil_mode=True`, sliding windows are allowed to go off-bounds if they start within the left padding or the input. Sliding windows that would start in the right padded region are ignored.*

Adding the same note to LPPool ensures consistency across all pooling operators.
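
A small example of the documented edge case (shapes and output per the behavior described above):

```python
import torch
import torch.nn as nn

x = torch.arange(3, dtype=torch.float32).reshape(1, 1, 3)

# The size formula alone gives ceil((3 - 2) / 3) + 1 = 2, but the second
# window would start at index 3, entirely in the implicit right padding,
# so it is ignored and the output has length 1.
pool = nn.LPPool1d(norm_type=2, kernel_size=2, stride=3, ceil_mode=True)
print(pool(x).shape)  # torch.Size([1, 1, 1])
```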

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163186
Approved by: https://github.com/mikaylagawarecki
2025-09-30 15:22:46 +00:00
edd9e07aff [BE] Remove not existing mnist mirror (#164238)
Looks like the original source is now empty:
http://yann.lecun.com/exdb/mnist/

A PyTorch-hosted mirror exists, hence leaving it as the only option:
https://ossci-datasets.s3.amazonaws.com/mnist/

Fixes these errors in pytorch/ci:
```
C:\actions-runner\_work\pytorch\pytorch>python tools\download_mnist.py --quiet -d C:\actions-runner\_work\pytorch\pytorch\test\cpp\api\mnist
Downloading http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz ...
Failed to download (trying next):
HTTP Error 404: Not Found
Downloading https://ossci-datasets.s3.amazonaws.com/mnist/train-images-idx3-ubyte.gz ...
Downloading http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz ...
Failed to download (trying next):
HTTP Error 404: Not Found
Downloading https://ossci-datasets.s3.amazonaws.com/mnist/train-labels-idx1-ubyte.gz ...
Downloading http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz ...
Failed to download (trying next):
HTTP Error 404: Not Found
Downloading https://ossci-datasets.s3.amazonaws.com/mnist/t10k-images-idx3-ubyte.gz ...
Downloading http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz ...
Failed to download (trying next):
HTTP Error 404: Not Found
Downloading https://ossci-datasets.s3.amazonaws.com/mnist/t10k-labels-idx1-ubyte.gz ...
```

Link to workflow with example:
https://github.com/pytorch/pytorch/actions/runs/18109150240/job/51542177282#step:15:2335
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164238
Approved by: https://github.com/jeanschmidt
2025-09-30 15:15:13 +00:00
0fb89b84b9 Revert "Consistently use c10_ovrsource in arvr mode everywhere (#164128)"
This reverts commit efd7fd5ed5ac7ec03201a546a09fb19ec59de431.

Reverted https://github.com/pytorch/pytorch/pull/164128 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/164128#issuecomment-3352544006))
2025-09-30 14:43:52 +00:00
79fcfd49d6 Revert "[CI] Push viable/strict/${time} tags (#164183)"
This reverts commit 9f27b0c24515d9cf319d9a728d5009bf9ed035cf.

Reverted https://github.com/pytorch/pytorch/pull/164183 on behalf of https://github.com/malfet due to Hmm, didn't work that way ([comment](https://github.com/pytorch/pytorch/pull/164183#issuecomment-3352494098))
2025-09-30 14:32:46 +00:00
71b4fada57 Revert "Add less warps config to inner reductions (#162447)"
This reverts commit 84d673ef577d42d6ec20c6c9f09863583c3111f5.

Reverted https://github.com/pytorch/pytorch/pull/162447 on behalf of https://github.com/PaulZhang12 due to internal failure ([comment](https://github.com/pytorch/pytorch/pull/162447#issuecomment-3352474768))
2025-09-30 14:28:19 +00:00
46ec0664e3 Remove unused PyIntXXX, THPUtils_newReal_BOOL, THPQXXX macros (#164056)
The removed macros are not used in other places of the `pytorch` GitHub org.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164056
Approved by: https://github.com/albanD
2025-09-30 13:48:25 +00:00
410ed3006b Revert "Add functions to setup PrivateUse1 as a python backend device. (#157859)"
This reverts commit 1310d6a1f9194ddcf6753f7e12fb78f278451f8a.

Reverted https://github.com/pytorch/pytorch/pull/157859 on behalf of https://github.com/jeanschmidt due to introduce linting errors ([comment](https://github.com/pytorch/pytorch/pull/157859#issuecomment-3352140098))
2025-09-30 13:24:37 +00:00
77354e22e1 [OpenReg] Add AMP Integration guide for accelerators (#162050)
Fix part of #158917

Add AMP integration document and OpenReg code as example to explain steps of integration.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162050
Approved by: https://github.com/albanD

Co-authored-by: FFFrog <ljw1101.vip@gmail.com>
2025-09-30 12:27:11 +00:00
7f29c47a4f Fix cdist export compute mode validation (#161724)
Fixes #161089. Added '0' as an acceptable value for compute mode in _meta_registrations.py. Also added a test case in the test_export.py file.
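
A hedged sketch of the kind of case the new test covers (the actual test in test_export.py may differ): exporting a module that calls `torch.cdist` with the default compute mode, which corresponds to compute mode 0 at the ATen level.

```python
import torch

class CDist(torch.nn.Module):
    def forward(self, x, y):
        return torch.cdist(x, y, compute_mode="use_mm_for_euclid_dist_if_necessary")

ep = torch.export.export(CDist(), (torch.randn(4, 3), torch.randn(5, 3)))
print(ep.graph_module.graph)  # previously failed validation on compute_mode=0
```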

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161724
Approved by: https://github.com/albanD, https://github.com/angelayi
2025-09-30 12:23:20 +00:00
ace6c76103 [inductor] Small refactor of CachingAutotuner (#162406)
This is a simple refactor that just moves some logic in `_precompile_config` to two new functions for separation of concerns. This will allow subclasses e.g. out of tree to configure options and metadata for triton.compile.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162406
Approved by: https://github.com/exclamaforte
2025-09-30 11:29:15 +00:00
1310d6a1f9 Add functions to setup PrivateUse1 as a python backend device. (#157859)
Fixes #156052 and #156444.

This PR sets up the privateuseone key in Python to be used as a Python backend for pytorch.
Meaning that, after calling `setup_privateuseone_for_python_backend('npy')`, one can use a subclass with that device to hold arbitrary python data as "device data" and use `torch.library` to register ops that take that Tensor.
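
A hedged sketch of that usage pattern (the exact entry point added by this PR is not shown; `rename_privateuse1_backend` and `torch.library.Library` are existing APIs):

```python
import torch

torch.utils.rename_privateuse1_backend("npy")  # expose PrivateUse1 under a custom name

lib = torch.library.Library("aten", "IMPL")

def npy_add(a, b, alpha=1):
    # In the real backend, a and b would be tensor subclasses whose "device
    # data" is arbitrary Python data; this stub just marks the dispatch point.
    raise NotImplementedError("illustrative stub")

lib.impl("add.Tensor", npy_add, "PrivateUse1")
```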

Changes done in this PR:

1. Register a vanilla Device Guard: I extended NoOpDeviceGuard to allow a device index of 0 and to not raise errors when event-related functions are accessed. Without those changes, calling backward would produce errors. (The CPU backend uses NoOpDeviceGuard just fine, although there seems to be special treatment of CPU in the autograd engine.)
2. Tensor subclass allows not having `__torch_dispatch__` if the device is not CUDA or CPU. The comment of the check suggests it was to avoid segfaults when calling into ops that expect a storage. Here we have a different device, so we will not call into those ops.
3. A python function that invokes the other incantations to set up the privateuseone backend.

This took inspiration from https://github.com/bdhirsh/pytorch_open_registration_example and https://github.com/tinygrad/tinygrad/blob/master/extra/torch_backend/wrapped_tensor.cpp; great thanks to @bdhirsh and @geohot.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157859
Approved by: https://github.com/albanD
2025-09-30 08:39:36 +00:00
7f4c3e7d2f distributed/serialization: support zero sized tensors (#164198)
Fixes
```
[4] ValueError: both buffer length (0) and count (-1) must not be 0
```
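
A hedged sketch of the guard implied by that error (not the actual serialization code): `torch.frombuffer` rejects zero-length buffers, so zero-sized tensors need an explicit path.

```python
import torch

def tensor_from_bytes(data: bytes, dtype: torch.dtype) -> torch.Tensor:
    if len(data) == 0:
        # torch.frombuffer would raise "both buffer length (0) and count (-1)..."
        return torch.empty(0, dtype=dtype)
    return torch.frombuffer(bytearray(data), dtype=dtype)

print(tensor_from_bytes(b"", torch.float32).shape)  # torch.Size([0])
```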

Test plan:

```
pytest test/distributed/test_serialization.py
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164198
Approved by: https://github.com/amirafzali
2025-09-30 08:11:29 +00:00
6e5b4249a5 [DTensor][Export] Supporting exporting a model with DTensor params/inputs (#163609)
I experimented with 3 paths to get a joint graph for a DTensor-ized module and inputs:

1. strict_export + aot_export_joint_with_descriptors
2. graph_capture + aot_export_joint_with_descriptors
3. aot_export_joint_with_descriptors alone

Added test to guard them.

Path 1 doesn't work, as the backward graph region is missing from the joint graph.
I am leaning towards making path 2 the recommended path.
If 2 doesn't work going forward, we can fall back to 3.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163609
Approved by: https://github.com/tugsbayasgalan

Co-authored-by: suo <suo@fb.com>
2025-09-30 07:54:13 +00:00
5274753873 [dynamo][device_mesh] Support mesh_dim_names (#164200)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164200
Approved by: https://github.com/SherlockNoMad, https://github.com/jansel
2025-09-30 07:16:28 +00:00
7afcb030d8 Back out "Revert D81959389" (#163905)
Summary:
Original commit changeset: 06888d7ebff0

Original Phabricator Diff: D82932788

Restricted the test to SM90 for scaled_grouped_mm

Test Plan: TBD (will share the linux CI results)

Differential Revision: D83283991

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163905
Approved by: https://github.com/angelayi
2025-09-30 07:05:13 +00:00
bbf6816f35 [dynamo] Special path for cloning of torch dispatch tensors (#164081)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164081
Approved by: https://github.com/tugsbayasgalan, https://github.com/mlazos
2025-09-30 05:15:56 +00:00
ace89350fc better error handling for rrelu when lower or upper range is infinite (#160965)
Fixes #153281

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160965
Approved by: https://github.com/janeyx99
2025-09-30 05:01:32 +00:00
7d59e37434 Add Comm-Compute Preserving Bucketer (#163960)
tl;dr performs bucketing while preserving comm-compute overlap.

In comm-compute overlap we will have a graph with:

```
def foo(...):
     ag = all_gather(...)
     hiding_compute = mm(...)
     wait(ag)
```

There is no explicit dependency between the hiding compute and the collectives, but we want to add implicit dependencies from wait->hiding_compute, and from hiding_compute->all_gather to preserve overlap.

Additionally, while bucketing, we will merge collective starts and collective waits together. In this case, we will want to treat the two nodes as a single subgraph - each node in the merged set will have the union of all deps in the set.

We perform bucketing while augmenting the graph with these relationships. This can be done separably from comm-compute overlap, so long as the hiding compute relationships are passed in.

TODO:
- need to instrument fx graph so inductor respects these relationships.
- the compile time of the bucketing search can be sped up significantly by limiting what portion of the graph we traverse through
- more memory aware handling

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163960
Approved by: https://github.com/ruisizhang123, https://github.com/v0i0, https://github.com/IvanKobzarev
ghstack dependencies: #163215, #163754, #163959
2025-09-30 04:53:58 +00:00
92108f4abd Helper to augment graph with additional deps (#163959)
In comm-compute overlap we will have a graph with:

```
def foo(...):
     ag = all_gather(...)
     hiding_compute = mm(...)
     wait(ag)
```

There is no explicit dependency between the hiding compute and the collectives, but we want to add implicit dependencies from wait->hiding_compute, and from hiding_compute->all_gather to preserve overlap.

Additionally, while bucketing, we will merge collective starts and collective waits together. In this case, we will want to treat the two nodes as a single subgraph - each node in the merged set will have the union of all deps in the set.

This pr adds `AugmentedGraphHelper` that adds the apis, and allows querying for dependency with this augmented graph.
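
A minimal sketch of the idea (assumed names and simplified bookkeeping; not the actual `AugmentedGraphHelper`):

```python
from collections import defaultdict

class AugmentedDeps:
    """Tracks extra (implicit) dependencies on top of a graph, and lets merged
    nodes act as one subgraph sharing the union of their deps."""

    def __init__(self):
        self.extra_deps = defaultdict(set)  # node -> nodes it must run after
        self.merged_into = {}               # node -> representative of its merged set

    def add_extra_dep(self, node, dep):
        self.extra_deps[self._rep(node)].add(dep)

    def merge(self, a, b):
        # Treat a and b as a single subgraph under a's representative.
        rep = self._rep(a)
        self.merged_into[b] = rep
        self.extra_deps[rep] |= self.extra_deps.pop(b, set())

    def deps(self, node, explicit_inputs=()):
        # Union of the node's explicit graph inputs and the augmented deps.
        return set(explicit_inputs) | self.extra_deps[self._rep(node)]

    def _rep(self, node):
        return self.merged_into.get(node, node)
```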

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163959
Approved by: https://github.com/v0i0, https://github.com/IvanKobzarev
ghstack dependencies: #163215, #163754
2025-09-30 04:53:58 +00:00
0b2fdc30a2 refactor bucketing (#163754)
Preparatory refactor.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163754
Approved by: https://github.com/IvanKobzarev
ghstack dependencies: #163215
2025-09-30 04:53:58 +00:00
0d7994ca97 [inductor] do comm compute overlap at aten fx level (#163215)
This is first part of the stack that does comm/compute reordering, and then uses the exposure analysis to do bucketing.

Subsequent prs will handle:
- use of exposure analysis to do bucketing
- make sure inductor respects comm/compute overlapping done at fx level
- non-profiling mm estimation/rank broadcasting of profile results

Other mis:
- Validate accuracy of nccl estimations (use ruisi's profiling instead?)

For a llama 2d parallelism test, on forward, we overlap all but 2 of potentially hidden collectives. For backward, we overlap 217/269 of potentially hidden collectives. If you increase `compute_overlap_multipler` (for fudge factor of inaccurate comms estimation), that goes down to all but 16 of potentially hidden collectives.

fwd example: https://gist.github.com/eellison/76209c49d8829c5f1e323d34a3f040c3

bwd example: https://gist.github.com/eellison/6cfc2285df53a94cfa4012f5fdae5c51

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163215
Approved by: https://github.com/IvanKobzarev
2025-09-30 04:53:58 +00:00
c39357bab6 [torchfuzz] Make scalar and tensor distribution configurable (#164034)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164034
Approved by: https://github.com/pianpwk
2025-09-30 04:50:54 +00:00
a293206bd5 Fix invalid f-strings (#164112)
Fixes invalid f-strings detected by `ruff`.
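
Likely shape of the fix (an assumption about the specific pattern; the usual culprit is a string with `{}` placeholders that is missing the `f` prefix):

```python
name = "relu_1"

msg_bug = "unsupported node {name}"   # braces are kept literally, never interpolated
msg_fix = f"unsupported node {name}"  # a real f-string

assert msg_fix == "unsupported node relu_1"
```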

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164112
Approved by: https://github.com/Skylion007, https://github.com/mlazos
2025-09-30 04:17:13 +00:00
9f27b0c245 [CI] Push viable/strict/${time} tags (#164183)
Every time viable strict is updated
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164183
Approved by: https://github.com/seemethere
2025-09-30 04:00:22 +00:00
85012fe167 Remove unnecessary list comprehensions (#164103)
Fixes #ISSUE_NUMBER
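
Typical shape of the cleanup (illustrative, not a call site from this PR):

```python
items = [1, 2, 3]

# Before: builds an intermediate list only to iterate it once.
total_before = sum([x * x for x in items])

# After: a generator expression avoids the extra list.
total_after = sum(x * x for x in items)

assert total_before == total_after == 14
```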

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164103
Approved by: https://github.com/Lucaskabela, https://github.com/mlazos
2025-09-30 03:56:54 +00:00
ca19815e3c Revert "Enable outer reductions in fbcode (#163884)"
This reverts commit 872edd89d62f0095d3fbd8ae9204d7c8bd980460.

Reverted https://github.com/pytorch/pytorch/pull/163884 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/163884#issuecomment-3349822031))
2025-09-30 03:42:24 +00:00
0b0ed6fd33 [doc] Add AOTInductor intermediate debug printer OSS user manual (#163794)
Summary: Add an OSS user manual for the AOTI intermediate debug printer so we can link it in the PyTorch conference poster.

Test Plan: N/A

Differential Revision: D83171374

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163794
Approved by: https://github.com/yushangdi
2025-09-30 03:01:03 +00:00
55840fb4bb [CMake] Fix USE_FBGEMM_GENAI option (#164165)
----

- `cmake_dependent_option` condition should be `USE_ROCM OR (USE_CUDA AND NOT MSVC)` (similar to the one for flash attention)
- Default settings should be user overridable, i.e. even if one builds for SM_10, they should be able to pass `USE_FBGEMM_GENAI=0` and skip the build

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164165
Approved by: https://github.com/Skylion007
2025-09-30 02:38:03 +00:00
b7419b920d [ROCm][CI] Upgrade ROCm to 7.0 (#163140)
Upgrade all the ROCm docker image to ROCm 7.0 release version.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163140
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-09-30 02:23:26 +00:00
3b4ad4a17d [AARCH64][CD][CUDA13][Triton][PTXAS] Turn on BUILD_BUNDLE_PTXAS=1 (#163988)
See also #163972, which was intended to be this PR.

Triton (release/3.5.x) by default ships CUDA12.8 ptxas.
This PR tries to bundle a ptxas version for cuda13, so that it can help https://github.com/pytorch/pytorch/issues/163801 when users run on new devices like THOR and Spark.

Fixes https://github.com/pytorch/pytorch/issues/163801

Test Plan:

Check binary size increase against nightly or v2.9RC
Install the binary into a working THOR and GB200/GH100 machine (reproduce the original issue on THOR first), then install the binary built from this PR; we expect the issue to be gone without any additional user settings. Testing on GB200 is to ensure no regression.
Reference: https://github.com/pytorch/pytorch/pull/119750 and 5c814e2527

Note: with this PR, the pytorch world's torch.compile is supposed to find ptxas via "torch/_inductor/runtime/compile_tasks.py" and "_set_triton_ptxas_path". Use cases that do not go through "_set_triton_ptxas_path" may not be able to use the cuda13 ptxas binary.
However, as is, the Triton world does not know about the existence of this new cuda13 ptxas. So if a user assumes there is already a pytorch/bin/ptxas and deletes the ptxas shipped with Triton, then c6ad34f7eb/python/triton/knobs.py (L216) would still complain that ptxas is not found (if removed, Triton won't know this new one is available).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163988
Approved by: https://github.com/atalman
2025-09-30 01:56:12 +00:00
4cf2900474 CUDACachingHostAllocatorImpl skip event query during capture (#164001)
The CUDACachingAllocator already does this, so there is precedent.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164001
Approved by: https://github.com/eqy
2025-09-30 01:19:53 +00:00
474d07554a [dynamic shapes] unbacked-safe slicing (#161414)
Summary:
Generates new unbacked symbols for slice output size & storage offset, when appropriate semantics are unclear. Teaches inductor to codegen the slice with flexible semantics.

Test Plan:
contbuild & OSS CI, see 56218d85e2

Rollback Plan:

Differential Revision: D80948073

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161414
Approved by: https://github.com/laithsakka
2025-09-30 01:15:19 +00:00
089f9130ed Install fmtlib headers. (#164139)
`fmtlib` version was updated to 12.0.0 in #163441.

In this new version, due to https://github.com/fmtlib/fmt/pull/4536, PyTorch stopped installing `fmtlib` headers. Because of that, the PyTorch/XLA build CI started to fail https://github.com/pytorch/xla/issues/9653. While we did fix it internally https://github.com/pytorch/xla/pull/9650, I believe that PyTorch should continue installing the `fmtlib` headers, since they are a dependency of its C API [`python_arg_parser.h`][1].

PyTorch/XLA CI was moved to `unstable.yml` in #159272, and later removed in #163564. This PyTorch/XLA build failure went under the radar, since the `fmtlib` update only landed on September 22.

[1]: 84d673ef57/torch/csrc/utils/python_arg_parser.h (L42)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164139
Approved by: https://github.com/Skylion007, https://github.com/malfet
2025-09-30 01:10:13 +00:00
da003d7b95 [3/N] Import Callable from collections.abc in torch/distributed (#164104)
This is the result of applying the ruff `UP035` check.
`Callable` is imported from `collections.abc` instead of `typing`.
This PR is the follow-up of #164054.
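
The UP035 rewrite in a nutshell (illustrative):

```python
# Before: from typing import Callable
from collections.abc import Callable  # After: the non-deprecated location

def apply(fn: Callable[[int], int], x: int) -> int:
    return fn(x)

assert apply(lambda v: v + 1, 1) == 2
```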

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164104
Approved by: https://github.com/Skylion007
2025-09-30 00:28:53 +00:00
cee4e36f9a [BE] remove manylinuxcxx11-abi-builder:cpu-cxx11-abi docker image (#164187)
I believe this image is not used anywhere anymore.

Test:
```
git grep manylinuxcxx11-abi-builder
git grep manylinuxcxx11
```
Both return no results.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164187
Approved by: https://github.com/izaitsevfb, https://github.com/malfet, https://github.com/seemethere
2025-09-30 00:26:20 +00:00
704cd771f6 [PP] Customize pipeline's submod name (#164037)
Change PP submodules' names from `submod_i` to `submod_pp_i` to distinguish them from the submodules created by HOP.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164037
Approved by: https://github.com/H-Huang
ghstack dependencies: #164045, #164035
2025-09-29 23:29:52 +00:00
d58f7c3ad1 [Easy] Add pointwise tag to fma (#164149)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164149
Approved by: https://github.com/fmassa
2025-09-29 22:40:04 +00:00
170e0309ca Bump protobuf from 5.29.4 to 5.29.5 in /.ci/docker (#156157)
* Bump protobuf from 5.29.4 to 5.29.5 in /.ci/docker

Bumps [protobuf](https://github.com/protocolbuffers/protobuf) from 5.29.4 to 5.29.5.
- [Release notes](https://github.com/protocolbuffers/protobuf/releases)
- [Changelog](https://github.com/protocolbuffers/protobuf/blob/main/protobuf_release.bzl)
- [Commits](https://github.com/protocolbuffers/protobuf/compare/v5.29.4...v5.29.5)

---
updated-dependencies:
- dependency-name: protobuf
  dependency-version: 5.29.5
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>

* Update .ci/docker/requirements-ci.txt

---------

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2025-09-29 15:20:44 -07:00
0f619c1f89 Revert "[inductor] do comm compute overlap at aten fx level (#163215)"
This reverts commit c9b5af9a384e7ef5f95613abe1622f5f55133c3a.

Reverted https://github.com/pytorch/pytorch/pull/163215 on behalf of https://github.com/yangw-dev due to seems fails inductor/test_aten_comm_compute_reordering for macos test, see c9b5af9a38 (51526707590-box) ([comment](https://github.com/pytorch/pytorch/pull/163215#issuecomment-3349177940))
2025-09-29 21:53:42 +00:00
b28e4f1f87 Revert "refactor bucketing (#163754)"
This reverts commit e1bd5b60cf243d3a026a6c89733488a6d9d4b33d.

Reverted https://github.com/pytorch/pytorch/pull/163754 on behalf of https://github.com/yangw-dev due to seems fails inductor/test_aten_comm_compute_reordering for macos test, see c9b5af9a38 (51526707590-box) ([comment](https://github.com/pytorch/pytorch/pull/163215#issuecomment-3349177940))
2025-09-29 21:53:42 +00:00
84dc54ae5e Revert "Helper to augment graph with additional deps (#163959)"
This reverts commit b5d4d350f573db12b8181ee13f9386d6ef8a1e57.

Reverted https://github.com/pytorch/pytorch/pull/163959 on behalf of https://github.com/yangw-dev due to seems fails inductor/test_aten_comm_compute_reordering for macos test, see c9b5af9a38 (51526707590-box) ([comment](https://github.com/pytorch/pytorch/pull/163215#issuecomment-3349177940))
2025-09-29 21:53:42 +00:00
50d418f69f Replace setup.py bdist_wheel with python -m build --wheel (#156712)
Previously we already replaced most use of `python setup.py develop/install`.

This PR also replaces the use of `setup.py bdist_wheel` with the modern `python -m build --wheel` alternative.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156712
Approved by: https://github.com/atalman
ghstack dependencies: #156711
2025-09-29 21:51:32 +00:00
c332d58184 [testing] upload test stats: Add info to the invoking file summary and some other changes (#164016)
* Changes some internal logic for grouping so it's hopefully slightly less annoying to write code for
* Changes the invoking file summary to just use the file, which I think is correct most of the time
* Adds some fields to the file summary, like skips, errors, etc., so I can reuse it for file report regression things

Output should be the same, maybe with slightly more fields since I got rid of some of the pops

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164016
Approved by: https://github.com/huydhn
2025-09-29 21:20:18 +00:00
efd7fd5ed5 Consistently use c10_ovrsource in arvr mode everywhere (#164128)
Summary:
Previously, many arvr targets transitively depended on c10, not c10_ovrsource,
because they either explicitly depended on c10 (because they didn't know
better) or they depended on legacy Caffe2, which never got the ovrsource
treatment.  So we found all these spots (driven by D82283623) and forced them
to query arvr mode to figure out which one they should use.  The goal is you
NEVER have both targets in the same build rule at the same time.

This diff could be reverted if D82224960 works out but I haven't gotten it to work yet.

Test Plan: sandcastle

Reviewed By: EscapeZero

Differential Revision: D82390436

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164128
Approved by: https://github.com/albanD, https://github.com/malfet
2025-09-29 20:47:20 +00:00
b5d4d350f5 Helper to augment graph with additional deps (#163959)
In comm-compute overlap we will have a graph with:

```
def foo(...):
     ag = all_gather(...)
     hiding_compute = mm(...)
     wait(ag)
```

There is no explicit dependency between the hiding compute and the collectives, but we want to add implicit dependencies from wait->hiding_compute, and from hiding_compute->all_gather to preserve overlap.

Additionally, while bucketing, we will merge collective starts and collective waits together. In this case, we will want to treat the two nodes as a single subgraph - each node in the merged set will have the union of all deps in the set.

This pr adds `AugmentedGraphHelper` that adds the apis, and allows querying for dependency with this augmented graph.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163959
Approved by: https://github.com/v0i0, https://github.com/IvanKobzarev
ghstack dependencies: #163215, #163754
2025-09-29 20:43:12 +00:00
6db1b9dd21 [MPS] Chunk fillBuffer into 4Gb slices (#164108)
To avoid regression on MacOS 26, which one could observe by running the following script
```swift
import Metal

let bufferSize = 1<<32 + 4

guard let device = MTLCreateSystemDefaultDevice() else { fatalError("No Metal device found") }
guard let buffer = device.makeBuffer(length: bufferSize, options: .storageModeShared) else { fatalError("Failed to create buffer") }

guard let cmdQueue = device.makeCommandQueue() else { fatalError("Failed to create command queue") }
guard let cmdBuffer = cmdQueue.makeCommandBuffer() else { fatalError("Failed to create command buffer") }
guard let blitEncoder = cmdBuffer.makeBlitCommandEncoder() else { fatalError("Failed to create blit encoder") }

blitEncoder.fill(buffer: buffer, range: 0..<bufferSize, value: 0x42)
blitEncoder.endEncoding()

cmdBuffer.commit()
cmdBuffer.waitUntilCompleted()

let tailOffs = 8
let hostPtr = buffer.contents().bindMemory(to: UInt8.self, capacity: bufferSize)
let tail = Array(UnsafeBufferPointer(start: hostPtr + (bufferSize - tailOffs), count: tailOffs))

for (idx, val) in tail.enumerated() {
    print("Offs 0x\(String(bufferSize - tailOffs + idx, radix: 16)): 0x\(String(val, radix: 16))")
}
```

Test plan: run `test_indexing.py` on MacOS-26

Fixes https://github.com/pytorch/pytorch/issues/161265
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164108
Approved by: https://github.com/Skylion007
2025-09-29 20:19:29 +00:00
9e792f583a Revert "[export] Skip the check instead of disable (#164084)"
This reverts commit c2768d0f5af840a94c342ed9eac3e26c819aa3f0.

Reverted https://github.com/pytorch/pytorch/pull/164084 on behalf of https://github.com/yangw-dev due to broke internal tests ([comment](https://github.com/pytorch/pytorch/pull/164084#issuecomment-3348862668))
2025-09-29 20:09:13 +00:00
6650f5af74 Revert "[dynamo] Special path for cloning of torch dispatch tensors (#164081)"
This reverts commit 811c693c49f7cd3da2ea174955d12f2f8780bd46.

Reverted https://github.com/pytorch/pytorch/pull/164081 on behalf of https://github.com/yangw-dev due to broke internal tests ([comment](https://github.com/pytorch/pytorch/pull/164084#issuecomment-3348862668))
2025-09-29 20:09:13 +00:00
349c960970 Use linux.g4dn.4xlarge.nvidia.gpu for cuda 12.4 legacy driver tests (#163956)
Workaround for https://github.com/pytorch/pytorch/issues/163658

Looks like the workflow passes on 12.8 builds that use linux.g4dn.4xlarge.nvidia.gpu but it's failing on 12.6 builds that use linux.4xlarge.nvidia.gpu: https://github.com/pytorch/pytorch/actions/runs/17953843505/job/51080623612#step:13:470

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163956
Approved by: https://github.com/malfet

Co-authored-by: Mark Saroufim <marksaroufim@meta.com>
2025-09-29 19:38:17 +00:00
f090818a40 Rename remaining periodic and xpu workflows py3.9->py3.10 (#164127)
Fix naming: py3.9 should be py3.10.
These jobs were already migrated to 3.10.
Please see: https://github.com/pytorch/pytorch/actions/runs/18091356163/job/51472526131#step:16:224

```
Python version:
+ python --version
Python 3.10.18
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164127
Approved by: https://github.com/malfet
2025-09-29 19:26:21 +00:00
e1bd5b60cf refactor bucketing (#163754)
Preparatory refactor.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163754
Approved by: https://github.com/IvanKobzarev
ghstack dependencies: #163215
2025-09-29 18:32:41 +00:00
c9b5af9a38 [inductor] do comm compute overlap at aten fx level (#163215)
This is first part of the stack that does comm/compute reordering, and then uses the exposure analysis to do bucketing.

Subsequent prs will handle:
- use of exposure analysis to do bucketing
- make sure inductor respects comm/compute overlapping done at fx level
- non-profiling mm estimation/rank broadcasting of profile results

Other misc:
- Validate accuracy of nccl estimations (use ruisi's profiling instead?)

For a llama 2d parallelism test, on forward, we overlap all but 2 of potentially hidden collectives. For backward, we overlap 217/269 of potentially hidden collectives. If you increase `compute_overlap_multipler` (for fudge factor of inaccurate comms estimation), that goes down to all but 16 of potentially hidden collectives.

fwd example: https://gist.github.com/eellison/76209c49d8829c5f1e323d34a3f040c3

bwd example: https://gist.github.com/eellison/6cfc2285df53a94cfa4012f5fdae5c51

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163215
Approved by: https://github.com/IvanKobzarev
2025-09-29 18:18:03 +00:00
604da4bb9a [Inductor-FX] Support unbacked symbol definitions (#163729)
# Problem
Inductor sometimes generates unbacked symints to handle things like mismatched branches of `torch.cond`. This code is represented by `pytree.KeyPath`, with special codegen logic to convert it to Python and C++. This was not previously supported by the FX backend.

# Feature
This PR adds support for unbacked symbol declarations to the FX backend. The implementation is fairly straightforward.
1. Instead of raw Python/C++, update the wrapper codegen method to emit a new Wrapper IR line called `UnbackedSymbolDefsLine`. This contains all the information needed to  generate the Python and C++ code.
2. Move the existing Python/C++ codegen to a private method, which is invoked by `UnbackedSymbolDefsLine.codegen()`.
3. Implement a method to generate FX IR from unbacked symbol definitions. The implementation is based on recursive descent, consuming some keypath entries, emitting an FX IR node, and recursing to the rest of the keypath. It is conceptually identical to the existing algorithm for Python and C++, except it generates FX nodes.
4. The FX backend currently relies on size hints to generate autotuning arguments, and consequently autotuning does not support unbacked SymInts. At some point, we would like to generalize the autotuning logic to support these. But for now, simply emit a warning and skip autotuning when we see them.
5. The new test case exposed some tricky issues reconciling Triton call args with constants stored in `triton_meta`. This PR rewrites the relevant helper function to do this in a more principled way.
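
A rough, self-contained sketch of the recursive-descent idea in step 3 (illustrative only, not the FX backend's actual code): consume one keypath entry, emit one indexing step, and recurse on the remainder.

```python
# Illustrative only: `emit` stands in for "create an FX node"; keypath entries are
# plain keys here rather than real pytree KeyPath entries.
def codegen_keypath(value, keypath, emit):
    if not keypath:
        return value                       # base case: reached the defining value
    entry, *rest = keypath
    child = emit("getitem", value, entry)  # one node per keypath entry
    return codegen_keypath(child, rest, emit)

trace = []
def fake_emit(op, value, key):
    trace.append((op, key))
    return value[key]

result = codegen_keypath({"cond_out": [10, 20]}, ["cond_out", 1], fake_emit)
print(result)  # 20
print(trace)   # [('getitem', 'cond_out'), ('getitem', 1)]
```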

# Test plan
This PR imports an existing control flow test to the FX backend's test suite. The test uses unbacked symbol definitions to handle mismatched dynamic shapes coming from `torch.cond` branches.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163729
Approved by: https://github.com/jansel
2025-09-29 18:10:37 +00:00
8f32adc90a [MPSHooks] Release pending command encoder (#164093)
Release the pending command encoder before returning a command buffer, as subsequent callers are very likely to allocate their own encoder, which otherwise results in the following runtime error:
```
 tryCoalescingPreviousComputeCommandEncoderWithConfig:nextEncoderClass:]:1090: failed assertion `A command encoder is already encoding to this command buffer'
```

Added regression test to `test_mps_extension`

Please note that `torch::mps::get_command_buffer()` should be called with the dispatch_queue held, both before and after this change, but many implementations skip that.

Fixes https://github.com/pytorch/pytorch/issues/163721
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164093
Approved by: https://github.com/atalman, https://github.com/Skylion007
2025-09-29 17:50:12 +00:00
3fa3bfbfda [EZ][BE] Fix unused parameter warnings in EmbeddingBag (#164135)
Before this change following were emitted during compilation
```
[7/31] Compiling /Users/malfet/git/pytorch/pytorch/aten/src/ATen/native/mps/kernels/EmbeddingBag.metal to EmbeddingBag_31.air
/Users/malfet/git/pytorch/pytorch/aten/src/ATen/native/mps/kernels/EmbeddingBag.metal:28:12: warning: unused parameter 'is_first' [-Wunused-parameter]
      bool is_first) {
           ^
/Users/malfet/git/pytorch/pytorch/aten/src/ATen/native/mps/kernels/EmbeddingBag.metal:47:16: warning: unused parameter 'per_sample_weights_index' [-Wunused-parameter]
      uint32_t per_sample_weights_index,
               ^
/Users/malfet/git/pytorch/pytorch/aten/src/ATen/native/mps/kernels/EmbeddingBag.metal:48:19: warning: unused parameter 'per_sample_weights' [-Wunused-parameter]
      constant T* per_sample_weights,
                  ^
/Users/malfet/git/pytorch/pytorch/aten/src/ATen/native/mps/kernels/EmbeddingBag.metal:49:16: warning: unused parameter 'per_sample_weights_stride' [-Wunused-parameter]
      uint32_t per_sample_weights_stride) {
               ^
/Users/malfet/git/pytorch/pytorch/aten/src/ATen/native/mps/kernels/EmbeddingBag.metal:74:19: warning: unused parameter 'weight_val' [-Wunused-parameter]
      opmath_t<T> weight_val,
                  ^
/Users/malfet/git/pytorch/pytorch/aten/src/ATen/native/mps/kernels/EmbeddingBag.metal:75:19: warning: unused parameter 'out_val' [-Wunused-parameter]
      opmath_t<T> out_val,
                  ^
/Users/malfet/git/pytorch/pytorch/aten/src/ATen/native/mps/kernels/EmbeddingBag.metal:76:12: warning: unused parameter 'is_first' [-Wunused-parameter]
      bool is_first,
           ^
/Users/malfet/git/pytorch/pytorch/aten/src/ATen/native/mps/kernels/EmbeddingBag.metal:77:17: warning: unused parameter 'max_idx' [-Wunused-parameter]
      thread I& max_idx,
                ^
/Users/malfet/git/pytorch/pytorch/aten/src/ATen/native/mps/kernels/EmbeddingBag.metal:78:9: warning: unused parameter 'weight_idx' [-Wunused-parameter]
      I weight_idx,
        ^
/Users/malfet/git/pytorch/pytorch/aten/src/ATen/native/mps/kernels/EmbeddingBag.metal:79:12: warning: unused parameter 'pad' [-Wunused-parameter]
      bool pad) {}
           ^
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164135
Approved by: https://github.com/Skylion007
2025-09-29 17:44:09 +00:00
8701f18bc0 Adjust ...mark_unbacked() -> ...decorators.mark_unbacked() in logs. (#164131)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164131
Approved by: https://github.com/albanD, https://github.com/Skylion007
2025-09-29 17:44:00 +00:00
a56e7a1920 [Max Autotune][B200] Add addmm config to avoid test OOM (#164020)
Summary: Add a new `addmm` config that is small enough to not cause an OOM (out of memory error), since the configs for `blackwell_persistent_mm_configs`, which `addmm` used, are too large.

Test Plan: `test_max_autotune.py`

Differential Revision: D83378477

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164020
Approved by: https://github.com/coconutruben, https://github.com/njriasan
2025-09-29 17:38:46 +00:00
e2c894c97d [Inductor][ATen][FP8] Relax stride check for block-wise scaling when scaling dimension is 1 (#163829)
Summary: Relax stride check for block-wise scaling (1x128, 128x128) when a dimension of the scaling factor is 1. When the scaling tensor has a dimension of size 1, the stride is effectively "meaningless" to PyTorch, i.e. PyTorch decides to replace its stride with a default of `[1, 1]`. However, the old stride check required the stride to match one of the scaling dimensions. Here, we relax the stride check when the effective stride is 1 in order to allow for cases in which `K <= 128` and `N <= 128`.
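
A hedged sketch of the relaxation described above (illustrative only, not ATen's actual check): when a scaling dimension has size 1, its default stride of 1 is accepted even though it does not match the nominal block-scaling stride.

```python
# Illustrative only: accept the default stride when the dimension has size 1.
def scale_stride_ok(dim_size: int, stride: int, expected_stride: int) -> bool:
    if dim_size == 1 and stride == 1:
        return True   # the stride carries no information for a size-1 dimension
    return stride == expected_stride

print(scale_stride_ok(dim_size=1, stride=1, expected_stride=128))  # True (relaxed)
print(scale_stride_ok(dim_size=4, stride=1, expected_stride=128))  # False (still enforced)
```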

Test Plan:
```
pytest -s -v test/test_matmul_cuda.py::TestFP8MatmulCUDA::test_scaled_mm_vs_emulated_block_wise_float32_lhs_block_1_rhs_block_128_cuda   2>&1 | tee ~/personal/stride_check.log
```

Differential Revision: D83023706

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163829
Approved by: https://github.com/lw, https://github.com/eqy
2025-09-29 17:28:26 +00:00
6b473c90cf Revert "[inductor] require shape in TritonCSEVariable (#162275)"
This reverts commit c257570e6cd25753f9f0a640b965148ead2cf918.

Reverted https://github.com/pytorch/pytorch/pull/162275 on behalf of https://github.com/jeffdaily due to sorry this broke rocm CI; inductor/test_select_algorithm.py::TestTemplateRender::test_finalized_subclass_hooks [GH job link](https://github.com/pytorch/pytorch/actions/runs/18048893250/job/51366715091) [HUD commit link](c257570e6c) ([comment](https://github.com/pytorch/pytorch/pull/162275#issuecomment-3348159095))
2025-09-29 17:26:54 +00:00
6bcc6bbc85 [Inductor][FP8] Add op_name for ScaledMM TMA template heuristic (#164019)
Summary: For H100s and below, add `op_name="scaled_mm"` to the template heuristic for `CUDAScaledTMATemplateConfigHeuristic` such that `scaled_mm` persistent + TMA tests do not default to the "mm" heuristics.

Test Plan: `test_max_autotune.py`

Differential Revision: D83390775

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164019
Approved by: https://github.com/njriasan
2025-09-29 17:24:26 +00:00
95be302889 Skip test_conv3d_cudnn_broken on ROCM (#164138)
Follow-up after https://github.com/pytorch/pytorch/pull/163903. Fixes https://github.com/pytorch/pytorch/issues/164137

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164138
Approved by: https://github.com/Camyll
2025-09-29 16:56:51 +00:00
f433e681b9 Remove export of slice_in_dim (#164117)
Cannot find `slice_in_dim` in OSS.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164117
Approved by: https://github.com/soulitzer
2025-09-29 16:56:14 +00:00
5ff2387dbe Fix comment on broadcasting example to clarify dimension mismatch (#162177)
Fixes #162116

Updated the comment in the broadcasting example to clarify that tensors with mismatched dimension sizes (0 vs 2) are not broadcastable. Removed incorrect reference to missing dimensions.
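
A quick illustration of the rule the comment now states (standard broadcasting semantics, not text from the PR):

```python
import torch

x = torch.empty(5, 0)
y = torch.empty(5, 2)
try:
    x + y                      # sizes 0 and 2 in the last dim: not broadcastable
except RuntimeError as e:
    print("not broadcastable:", e)

a = torch.empty(5, 1)
b = torch.empty(5, 2)
print((a + b).shape)           # torch.Size([5, 2]); a size-1 dim does broadcast
```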

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162177
Approved by: https://github.com/soulitzer
2025-09-29 16:47:48 +00:00
84b57c93db [MPSInductor] Unskip test_repeat_interleave_Tensor_decomp (#164136)
Not sure what the problem was, but it passes for me locally.

Fixes https://github.com/pytorch/pytorch/issues/159408

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164136
Approved by: https://github.com/v0i0
2025-09-29 16:20:34 +00:00
069ccf5f1e [inductor] pdl: enable launch and deduplicate waits (#162014)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162014
Approved by: https://github.com/eellison
2025-09-29 16:10:26 +00:00
1c12d7416b [SDPA] [MPS] Fixes regression in 2.8.0 for scaled_dot_product_attention using mps (#163598)
Fixes #163597

- Updates fast SDPA implementations to take in query tensor stride info similar to key and value instead of assuming stride.
- Updated tests with additional transpose/permutation layouts. New tests catch the regression.

### Benchmarking with script found in [implementation PR](https://github.com/pytorch/pytorch/pull/152781#:~:text=19.8%25%20speed%20improvement-,Script%20to%20get%20perf%3A,-import%20torch%0Aimport)

Times are averaged over 100000 iterations. This change should not have any significant performance difference. Tested on an M3 Pro

### Vector Fast Path (q_len=1, k_len=256)

- Before: 0.160 ms
- After: 0.157 ms

### Vector 2-pass (q_len=1, k_len=4096)

- Before: 0.342 ms
- After: 0.339 ms

### Vector Fast Path (q_len=8, k_len=256)

- Before: 0.228 ms
- After: 0.231 ms

### Vector 2-pass (q_len=8, k_len=4096)

- Before: 0.432 ms
- After:  0.436 ms

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163598
Approved by: https://github.com/malfet
2025-09-29 16:09:46 +00:00
3746039b47 [inductor] fix: 'get_raw_stream' undefined (#163707)
Summary:
ran into this when precompiling baidu/ERNIE-4.5-21B-A3B-PT

codegen after fix:
```py
import triton
import triton.language as tl
from torch._inductor.runtime.triton_heuristics import start_graph, end_graph
from torch._C import _cuda_getCurrentRawStream as get_raw_stream
with torch.cuda._DeviceGuard(0):
    stream0 = get_raw_stream(0)
...
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163707
Approved by: https://github.com/jamesjwu
2025-09-29 15:48:16 +00:00
872edd89d6 Enable outer reductions in fbcode (#163884)
Summary: Enabling the outer reduction optimization in fbcode

Test Plan: Evals in https://docs.google.com/document/d/1-tcItRsyEaibaXL56Zq2-CWh5wCmHXDDgDQT_9uOvXE/edit?tab=t.0#bookmark=id.tkgzaitxacg0

Differential Revision: D81948542

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163884
Approved by: https://github.com/Skylion007
2025-09-29 15:25:17 +00:00
47ed41109f Fix PgNccl coalesced profiling (#160680)
Admittedly I'm a noob when looking at traces, but this looked pretty off to me:
<img width="1528" height="824" alt="Screenshot 2025-08-14 at 5 27 49 PM" src="https://github.com/user-attachments/assets/871e7b4c-0e47-4c84-97cc-8198b7b76d4b" />
1. Why are there so many "nccl:coalesced" on the CPU thread
2. Why is there "nccl:coalesced" on compute stream (stream 7)

Here is what is happening:

**CPU side**: In `endCoalescing`, we create a [work object ](3be70dc30e/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp (L3473)) with the profiling title "nccl:coalesced"
**GPU side**: The CUDA kernels will inherit this profiling title

What is missing:

We forgot to call the record function [callback](3be70dc30e/torch/csrc/distributed/c10d/Work.cpp (L35-L38)). With this change we finish immediately on the CPU side, but the ncclDevKernel_SendRecv still has the coalesced title. The new trace looks like this:

<img width="1123" height="637" alt="image" src="https://github.com/user-attachments/assets/f015fd64-85cd-452a-be24-3e7724f84e44" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160680
Approved by: https://github.com/fegin, https://github.com/kwen2501
2025-09-29 15:21:55 +00:00
fa54b08cd5 Replace setup.py install with pip install (#156711)
#156027 already replaced most use of `python setup.py install`.
This PR only adds a few more occurrences and adds `--no-build-isolation` in a few places.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156711
Approved by: https://github.com/atalman
2025-09-29 15:15:10 +00:00
92284fb2ff Add SVE128 ISA (#158932)
Summary: Partly Importing and adapting https://github.com/pytorch/pytorch/pull/138388, adding SVE128 as ISA.

The intention is to add SVE128 translation layers for Vectorized data types.
The idea is to have 1 PR per file, aside from the current one, plus a last one modifying cmake files to enable the new ISA selectively.

Tested current changes on a nightly run, to verify no regressions occur on systems leveraging SVE256.

No regressions spotted when running test_ops.py, a set of 34k unit tests. A machine leveraging SVE128 was used towards this testing.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158932
Approved by: https://github.com/malfet
2025-09-29 14:49:19 +00:00
84d673ef57 Add less warps config to inner reductions (#162447)
Add configs with fewer warps to ensure proper vectorization + memory coalescing for inner reductions, preferring more work per thread.

<img width="1717" height="731" alt="Screenshot 2025-09-17 at 10 03 25 AM" src="https://github.com/user-attachments/assets/7b1f4a30-62f2-4bee-bb9c-122501bde63e" />

Differential Revision: [D83343892](https://our.internmc.facebook.com/intern/diff/D83343892)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162447
Approved by: https://github.com/v0i0, https://github.com/eellison, https://github.com/shunting314
2025-09-29 13:48:36 +00:00
d633bac252 Update issue templates adding a DISABLE AUTOREVERT option (#163858)
This should be used to disable autorevert functionality if users feel the need to.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163858
Approved by: https://github.com/izaitsevfb
2025-09-29 13:10:05 +00:00
d81476e211 [xla hash update] update the pinned xla hash (#163494)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned xla hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163494
Approved by: https://github.com/pytorchbot
2025-09-29 12:31:16 +00:00
a0ae2f9aa0 Update slow tests (#163493)
This PR is auto-generated weekly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/weekly.yml).
Update the list of slow tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163493
Approved by: https://github.com/pytorchbot
2025-09-29 11:58:17 +00:00
615da7b95e [fx] Allow customization of submod name in split graph (#164035)
Fixes #164030 (HOP and pipelining both name things submod_i) by adding an optional argument `partition_affix` to the `split_module` API.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164035
Approved by: https://github.com/ezyang
ghstack dependencies: #164045
2025-09-29 09:16:36 +00:00
4fd70d4e7b [1/N]Enable some tests in test_ops.TestCommon on Intel GPU (#159944)
For https://github.com/pytorch/pytorch/issues/114850, we will port ATen unit tests to Intel GPU. This PR works on some test cases in test/test_ops.py. We enable Intel GPU with the following methods and try our best to keep the original code style:

1. Extended XPUTestBase.get_all_devices to support multiple devices
2. Added skipXPU decorator
3. Extended onlyOn to support device list
4. Enabled 'xpu' for some test paths
5. Added allow_xpu=True for supported test class.
6. Replaced onlyCUDA with onlyOn(['cuda', 'xpu']) for supported tests
7. Used skipIfXpu and skipXPU to disable unsupported tests.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159944
Approved by: https://github.com/guangyey, https://github.com/EikanWang, https://github.com/albanD
2025-09-29 09:08:04 +00:00
e1e5e040cd [dynamo][export] Add some missing trace rules (#164080)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164080
Approved by: https://github.com/tugsbayasgalan
2025-09-29 08:47:24 +00:00
5ddad22196 [PP] Use default export mode (non-strict) (#164045)
export's default mode has switched from strict to non-strict. We just follow suit in PP.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164045
Approved by: https://github.com/H-Huang
2025-09-29 06:31:06 +00:00
90512fa5bd [Quant] extend the op list for quant lift up (#163621)
Add `aten.reshape.default` into the op list of quant lift up, in order to fuse more potential quantized kernels.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163621
Approved by: https://github.com/mingfeima, https://github.com/Xia-Weiwen, https://github.com/jansel
2025-09-29 06:14:45 +00:00
48a5470cf8 [CUDA] fix indexing on large tensor causing invalid configuration argument (#164049)
Fixes #164048

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164049
Approved by: https://github.com/eqy
2025-09-29 06:07:35 +00:00
b9854c9d89 [Inductor][CPP] Fix the test case of test_linear_reuse_kernels (#163723)
Fixes #163491.
Add tolerances to make `test_linear_reuse_kernels` more stable.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163723
Approved by: https://github.com/leslie-fang-intel
2025-09-29 05:29:01 +00:00
eb4361a801 [Fix] Adding missing f prefixes to formatted strings [1/N] (#164065)
As stated in the title.
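
The class of bug being fixed, for reference (a generic illustration, not a diff from this PR):

```python
name = "dynamo"
print("compiling {name}")   # missing f prefix: prints the literal "compiling {name}"
print(f"compiling {name}")  # with the prefix: prints "compiling dynamo"
```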

* #164068
* #164067
* #164066
* __->__ #164065

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164065
Approved by: https://github.com/Skylion007
2025-09-29 04:53:00 +00:00
d131f213ac [vllm hash update] update the pinned vllm hash (#164092)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned vllm hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164092
Approved by: https://github.com/pytorchbot
2025-09-29 04:41:06 +00:00
7c7ae86991 [Fix] Adding missing f prefixes to formatted strings [2/N] (#164066)
As stated in the title.

* #164068
* #164067
* __->__ #164066
* #164065

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164066
Approved by: https://github.com/Skylion007
2025-09-29 04:40:44 +00:00
ad32ed83b3 [Fix] Adding missing f prefixes to formatted strings [3/N] (#164067)
As stated in the title.

* #164068
* __->__ #164067
* #164066
* #164065

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164067
Approved by: https://github.com/Skylion007
2025-09-29 04:35:23 +00:00
d8becd1cf4 [dynamo][export] Make the source_stack and fqn info same between dynamo and export (#164085)
preparing for landing the install_free_tensors flag

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164085
Approved by: https://github.com/tugsbayasgalan
2025-09-29 04:35:13 +00:00
e64dd8c694 [Fix] Adding missing f prefixes to formatted strings [4/N] (#164068)
As stated in the title.

* __->__ #164068
* #164067
* #164066
* #164065

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164068
Approved by: https://github.com/Skylion007
2025-09-29 04:07:07 +00:00
047ae24e34 Eliminate setup.py install/develop in the codebase (#162329)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162329
Approved by: https://github.com/ezyang
2025-09-29 03:54:28 +00:00
3cda34ebde [2/N] Apply ruff UP035 check in torch files (#164054)
This is the result of applying the ruff `UP035` check.
`Callable` is imported from `collections.abc` instead of `typing`.
`TypeAlias` is also imported from `typing`.
This PR is the follow-up of #163947.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164054
Approved by: https://github.com/ezyang, https://github.com/Skylion007
2025-09-29 03:35:32 +00:00
352197c508 Remove old ROCm skip conditions in tests (#164058)
This PR removes skip conditions for ROCM <= 3.5.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164058
Approved by: https://github.com/kwen2501
2025-09-29 03:00:58 +00:00
811c693c49 [dynamo] Special path for cloning of torch dispatch tensors (#164081)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164081
Approved by: https://github.com/tugsbayasgalan
ghstack dependencies: #164084
2025-09-29 01:44:44 +00:00
c2768d0f5a [export] Skip the check instead of disable (#164084)
It's unclear why we had the disable in the first place. With install_free_tensors, we are tracing into this hook. A better way would be to place the tracer without any hook. For now, skip the check while dynamo is tracing.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164084
Approved by: https://github.com/tugsbayasgalan
2025-09-29 01:44:44 +00:00
a8c528c105 [1/N] Apply UP035 rule in tests (#163947)
Apply UP035 `ruff` rule in tests, but some tests for `fx` and `dynamo` are excluded in case the old typing is the test target.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163947
Approved by: https://github.com/ezyang
2025-09-29 01:42:01 +00:00
dc54ce7554 [hops] Support unspecialized nn module for export hops (#164082)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164082
Approved by: https://github.com/tugsbayasgalan
ghstack dependencies: #164079
2025-09-29 01:34:10 +00:00
1981ed4f60 [dynamo][logging] Add to param_count only if metrics_count is active (#164079)
This is rare but happens with executorch tests.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164079
Approved by: https://github.com/tugsbayasgalan
2025-09-29 00:59:18 +00:00
54b38f3b46 Add operator benchmarking run to CI nightly (#162530)
This PR introduces a new "operator microbenchmark" CI workflow and GitHub Actions for operator microbenchmarks, updating test scripts and job matrices to support new parameters, and broadening the operator benchmark tests to include more data types, larger shapes, and gradient tests. The benchmark configurations now focus more on different cuda hardware and multiple dtypes (bf16, fp16, fp32), for both compile and eager mode.

**Benchmark Configuration and Coverage:**

* Expanded operator benchmark configurations in `addmm_test.py`, `bmm_test.py`, `matmul_test.py`, and `mm_test.py` to benchmark multiple dtypes on CUDA devices, in eager and compile mode, for forward and backward run. The configs with tag "long" for the above mentioned files are being run in CI.
* The CI benchmarking is running on various hardwares: H100, A100.
* The CI job also uploads the microbenchmarking outputs to a [HUD](https://hud.pytorch.org/benchmark/llms?repoName=pytorch%2Fpytorch&benchmarkName=PyTorch+operator+microbenchmark) dashboard.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162530
Approved by: https://github.com/huydhn

Co-authored-by: Huy Do <huydhn@gmail.com>
2025-09-29 00:46:38 +00:00
bc5a072ebf fixes import error 'functionalize' from functorch (#163746)
Fixes #163637

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163746
Approved by: https://github.com/malfet
2025-09-28 23:16:45 +00:00
d1b3481131 registraion replaced with registration in jit_type.h file comment (#164072)
Fixes #164071

typo correction done
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164072
Approved by: https://github.com/Skylion007
2025-09-28 22:55:24 +00:00
3766513d25 Remove C++ workarounds for Python < 3.10 (#164055)
Remove two unnecessary `PY_VERSION_HEX` branches.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164055
Approved by: https://github.com/ezyang
2025-09-28 20:00:02 +00:00
ea6846b231 [CI] Remove the unnecessary workflow related functorch (#162581)
The [docs](https://docs.pytorch.org/functorch/stable/) about `functorch` have been migrated into the [PyTorch docs](https://docs.pytorch.org/docs/stable/func.html) since PyTorch 2.0, so I think we can remove this workflow now to reduce compute resource usage.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162581
Approved by: https://github.com/ezyang
2025-09-28 19:56:20 +00:00
f6537d9616 Move control flow export tests to new tracer (#163259)
Differential Revision: [D82732614](https://our.internmc.facebook.com/intern/diff/D82732614)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163259
Approved by: https://github.com/avikchaudhuri
ghstack dependencies: #163136, #163137, #163258
2025-09-28 19:56:09 +00:00
cc0332563e Use new_tracer_experimental for torchao strict export (#163258)
The export team is fixing up the old strict export implementation; as a result, it fails a check where we proxy the whole module under given directories. _WrapperModule is a way for torchao to work around export requiring an nn.Module to trace, so it should never get proxied in the graph.

Differential Revision: [D82732613](https://our.internmc.facebook.com/intern/diff/D82732613)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163258
Approved by: https://github.com/anijain2305
ghstack dependencies: #163136, #163137
2025-09-28 19:55:54 +00:00
8239ba4087 Fix various bugs in subclass input in export (#163770)
This adds basic support for subclass inputs in export (specifically for non-strict). I had to make fakify a little more complicated, which risks further divergence from dynamo fakification. But the dynamo one is so complex that I feel it is better to do it this way. Also improved the fake mode detection logic to recursively look into subclass inner tensors.

Differential Revision: [D83156489](https://our.internmc.facebook.com/intern/diff/D83156489)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163770
Approved by: https://github.com/avikchaudhuri
2025-09-28 18:03:32 +00:00
1fdd99de71 Building guards should be under metrics_context (#163967)
Differential Revision: [D83354042](https://our.internmc.facebook.com/intern/diff/D83354042)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163967
Approved by: https://github.com/avikchaudhuri
2025-09-28 16:28:34 +00:00
38ed608956 Better error handling in torch/nativert/* (#163308)
Replace the **runtime_error** of vanilla C++ exceptions with **TORCH_CHECK** in **torch/nativert/***.

Vanilla C++ exceptions should not exist in the core part of PyTorch because of its cross-language nature. Compared with vanilla C++ exceptions, TORCH_CHECK has richer error context and a unified error handling mechanism. This commit replaces runtime_error with TORCH_CHECK in the files under torch/nativert/*.

Fixes part of #148114

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163308
Approved by: https://github.com/dolpm
2025-09-28 14:23:44 +00:00
238dc65368 [ROCm] use hipSolver instead of MAGMA for Cholesky (#163977)
Currently, the Cholesky factorization and least squares operations default to MAGMA when PyTorch is compiled for ROCm, which shows suboptimal performance.
This change allows PyTorch to rely on hipSOLVER instead of MAGMA.
@jeffdaily

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163977
Approved by: https://github.com/Skylion007
2025-09-28 06:53:06 +00:00
7bbde0c094 Remove unused argument from DEFINE_BINARY macro. (#163868)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163868
Approved by: https://github.com/Skylion007
ghstack dependencies: #163822
2025-09-28 06:32:41 +00:00
dfcab0e7e1 Handle DDE in infer_size_impl (#163822)
Hit this while running vLLM with unbacked symints for the model Qwen/Qwen2-1.5B-Instruct.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163822
Approved by: https://github.com/bobrenjc93, https://github.com/Skylion007
2025-09-28 06:32:41 +00:00
1cc9263f52 [vllm hash update] update the pinned vllm hash (#164053)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned vllm hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164053
Approved by: https://github.com/pytorchbot
2025-09-28 04:35:17 +00:00
c2862c8e66 [distributed] Remove python code older than 3.10 (#163613)
Because the minimum Python version is now 3.10.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163613
Approved by: https://github.com/XuehaiPan, https://github.com/kwen2501
2025-09-28 04:15:24 +00:00
b377c9e365 graph break on tolist if capture_scalar_outputs is false (#163807)
address https://github.com/pytorch/pytorch/issues/163798

It's problematic to not graph break because:
1. It breaks the current contract.
2. If dynamo traces through, we end up with an .item() call; if we ever re-trace later (in autograd, for example) we hit a failure (we do not know where to graph break at that point). See the added unit test.
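
A small sketch of the behavior (the flag is the `capture_scalar_outputs` config named in the title; the point is where the graph break lands, not the exact counts):

```python
import torch
import torch._dynamo

torch._dynamo.config.capture_scalar_outputs = False

@torch.compile
def f(x):
    vals = x.tolist()      # graph breaks here when capture_scalar_outputs is False
    return sum(vals) + 1

print(f(torch.arange(4)))  # 7
```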

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163807
Approved by: https://github.com/bobrenjc93
2025-09-28 04:02:52 +00:00
3059b08012 [inductor] add subsystem to pattern matcher (#163922)
Summary:
Running a toy example through `torch.compile(fullgraph=True, backend="inductor")` with default inductor config, I tried to see what passes are run in each of pre-grad, joint-graph, and post-grad phases by printing out the subsystem in `GraphTransformObserver`. However the subsystem showed up as None in a bunch of transforms that were run in each of those phases, so this PR adds some additional annotations.

Note that these annotations are probably not a complete set, since other transforms may run based on changes to the config that are not covered here.

Hopefully this doesn't change behavior. However, I did notice that bisecting relies on disabling various phases, which means that while before some passes would *not* be disabled (because their subsystem was `None`), now they would.

Test Plan: existing tests + manual test described in summary

Differential Revision: D83306676

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163922
Approved by: https://github.com/jansel
2025-09-28 03:15:23 +00:00
5504a06e01 [BE]: Update NCCL to 2.28.3 (#162351)
@eqy The new NCCL has a bunch of bugfixes and features, including reducing the number of SMs needed by NVLink collectives, as well as some very useful new APIs for SymmetricMemory.  It also allows FP8 support for non-reductive operations on pre-sm90 devices.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162351
Approved by: https://github.com/ezyang, https://github.com/malfet, https://github.com/atalman
2025-09-28 01:38:59 +00:00
1ad491dd88 Better error handling in torch/csrc/jit/ir/* (#163757)
Refactor error handling to use TORCH_CHECK for improved clarity in constants and scope management

Fixes some parts of ISSUE #148114

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163757
Approved by: https://github.com/albanD
2025-09-28 01:18:24 +00:00
fd20889d0b Add type annotations to MPS profiler utilities (#163486)
## Summary
- drop the local mypy allow-untyped-defs escape hatch in the MPS profiler helpers
- annotate the context managers and bool helpers so they type-check cleanly

## Testing
- python -m mypy torch/mps/profiler.py --config-file mypy-strict.ini

------
https://chatgpt.com/codex/tasks/task_e_68d0ce4df2e483268d06673b65ef7745
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163486
Approved by: https://github.com/Skylion007
2025-09-27 23:00:53 +00:00
2ce2e48a05 [WIP][symm_mem] Add a wait for signal and put signal for one side API (#159837)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159837
Approved by: https://github.com/kwen2501
2025-09-27 21:20:13 +00:00
1d98be6abf [NFC] fixed typo in sparse semi structured filename (#163904)
Make sure all semi structured files use "SparseSemiStructured"

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163904
Approved by: https://github.com/Skylion007
2025-09-27 21:19:48 +00:00
dfda239cce [DTensor] Raise an RuntimeError when checkpointing APIs are used with Partial placement (#163941)
A DTensor that contains partial placement shouldn't be checkpointed (DCP.save) -- the result is not correct and DCP doesn't know how to handle it.

There are several APIs that are only used by checkpointing, e.g.,`__create_write_items__`. These APIs should raise an exception if the DTensor, `self`, has Partial placement.

Ideally, we want to add the following test:

```
        with self.assertRaisesRegex(
            RuntimeError, "Any checkpointing related operations are not supported for"
        ):

            dcp.save({"dtensor": dtensor}, checkpoint_id=tempfile.gettempdir())
```

While we do see the RuntimeError raised, it is raised in another thread because the DTensor checkpoint APIs are called by DCP in a separate thread, which assertRaisesRegex cannot capture.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163941
Approved by: https://github.com/tianyu-l
2025-09-27 19:50:16 +00:00
991e3d0d16 [dynamo][guards] Revert introduction of different types of lambda_guards (#163385)
Given the issue in
https://fb.workplace.com/groups/260102303573409/permalink/787294574187510/,
it might be a better idea to just speed up _realize_dict and keep the changes very local. So reverting this PR as well, to return to a clean slate.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163385
Approved by: https://github.com/jansel
2025-09-27 18:20:48 +00:00
8f6dbc0ba8 [scan] create fw and bw graphs via partitioning (#162754)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162754
Approved by: https://github.com/zou3519
ghstack dependencies: #161557, #161664, #161808, #162025, #161732
2025-09-27 18:13:15 +00:00
3413490f53 [scan] materialize combine_fn in forward add more autograd tests (#161732)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161732
Approved by: https://github.com/zou3519
ghstack dependencies: #161557, #161664, #161808, #162025
2025-09-27 18:13:15 +00:00
b85bee3bbb [hop] refactor check input alias and mutation to be a graph pass (#162025)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162025
Approved by: https://github.com/zou3519
ghstack dependencies: #161557, #161664, #161808
2025-09-27 18:13:15 +00:00
66dbf2c9f5 [scan][autograd] clone outputs that's aliasing with inputs or outputs in bw (#161808)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161808
Approved by: https://github.com/zou3519
ghstack dependencies: #161557, #161664
2025-09-27 18:13:15 +00:00
f5d85874dd [scan][be] remove unnecessary tensor checks (#161664)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161664
Approved by: https://github.com/Skylion007, https://github.com/zou3519
ghstack dependencies: #161557
2025-09-27 18:13:14 +00:00
8f15d6a0c9 [test][scan] refactor inductor test and prepare for adding bw tests (#161557)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161557
Approved by: https://github.com/zou3519
2025-09-27 18:13:14 +00:00
e78792a70d Update ctc loss docs float32 input required for CuDNN (#162042)
Discovered while working on https://github.com/pytorch/pytorch/pull/159106 the non-obvious requirement that inputs must be float32 to use CuDNN (https://github.com/pytorch/pytorch/pull/159106#issuecomment-3189981705), otherwise the native CUDA implementation is called.

Updates the docs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162042
Approved by: https://github.com/mikaylagawarecki

Co-authored-by: mikaylagawarecki <mikaylagawarecki@gmail.com>
2025-09-27 18:10:17 +00:00
d9db838f58 [CI] Re-enable test_all_to_all_vdev_2d_offset (#163985)
Fixes https://github.com/pytorch/pytorch/issues/163847
Moving allocations upfront and collectives later. The hang goes away.

My investigation indicates that the hang is inside the last call `torch.testing.assert_close(out_expected, out[:out_numel])`. Rank 3 calls into it, but never gets out. Don't know why yet. I will investigate more.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163985
Approved by: https://github.com/fegin
2025-09-27 16:56:25 +00:00
6ba83e06a5 [AMP] Add deprecated decorator for torch.xxx.amp.autocast class (#163654)
As the title stated.

**Changes:**
- torch.cuda.amp.autocast
- torch.cpu.amp.autocast
- add explicit `__new__` and `__init_subclass__` for those classes above so inspect.signature retrieves the correct signature
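
A small illustration of the change in user-facing terms (using the CPU variant so it runs anywhere; `torch.amp.autocast` with an explicit `device_type` is the non-deprecated spelling):

```python
import torch

# Deprecated spelling: now emits a deprecation warning
with torch.cpu.amp.autocast(dtype=torch.bfloat16):
    pass

# Preferred spelling
with torch.amp.autocast(device_type="cpu", dtype=torch.bfloat16):
    pass
```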

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163654
Approved by: https://github.com/Skylion007
2025-09-27 14:37:12 +00:00
960290d629 [Docs] Add standard-imghdr for PyTorch Doc (#163944)
As the title stated.

Python [PEP 594](https://peps.python.org/pep-0594) removed imghdr from the Python standard library; the older version of Sphinx doesn't list it as an installation dependency, so we need to add it to the requirements as a temporary dependency.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163944
Approved by: https://github.com/albanD, https://github.com/svekars
2025-09-27 08:14:51 +00:00
b1a4efc302 [amd] Add cudaHostFn_t to cuda_to_hip_mappings (#164007)
Summary: See title

Test Plan:
```
buck build --flagfile fbcode//mode/opt-amd-gpu fbcode//comms/ctran/algos/common/tests:ctran_algo_gpe_kernel_sync_test
```
After fix: https://www.internalfb.com/buck2/362ff91e-53f2-4b82-9536-cb84c91384a2

Before fix: failed in D83294731 (version 1):
https://www.internalfb.com/sandcastle/workflow/1792432651703947243

Differential Revision: D83375414

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164007
Approved by: https://github.com/llxxee
2025-09-27 06:09:50 +00:00
96182faf96 [CI][Distributed][CUDA][Symm-Mem] Enable B200 Symm Mem Test (#162988)
Inspired by https://github.com/pytorch/pytorch/pull/162981 and motivated by https://github.com/pytorch/pytorch/pull/159323 taking a total of 20 hours to finish (and unlikely to make it in short time due to https://github.com/pytorch/pytorch/issues/162178 )

Creating this subtest to get *something distributed* on B200.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162988
Approved by: https://github.com/malfet
2025-09-27 05:12:05 +00:00
dcb8af7501 [torchfuzz] fix bool propagation (#164003)
Bools can't propagate through the current pointwise ops such as add/mul. Once we add more ops that can, we'll probably want to add an additional subclass that supports pointwise bools, but for now just don't allow it.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164003
Approved by: https://github.com/pianpwk
ghstack dependencies: #163743, #163812, #163890, #164002
2025-09-27 04:51:29 +00:00
280e712c13 [vllm hash update] update the pinned vllm hash (#164029)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned vllm hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164029
Approved by: https://github.com/pytorchbot
2025-09-27 04:34:57 +00:00
254d2864d6 Add runtime_overhead PR Time Benchmark (#163866)
This adds a PR time benchmark that checks for runtime overhead on a very small graph. This will help track regressions in runtime overhead.

Example Results:
```
runtime_overhead_inductor,instruction_count,222645
runtime_overhead_inductor_inference_mode,instruction_count,234998
runtime_overhead_inductor_requires_grad,instruction_count,293556
runtime_overhead_inductor_requires_grad_backward,instruction_count,78181
runtime_overhead_inductor_dynamic,instruction_count,234870
runtime_overhead_inductor_inference_mode_dynamic,instruction_count,248711
runtime_overhead_inductor_requires_grad_dynamic,instruction_count,309979
runtime_overhead_inductor_requires_grad_backward_dynamic,instruction_count,77599
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163866
Approved by: https://github.com/jansel, https://github.com/mlazos, https://github.com/anijain2305
2025-09-27 03:26:59 +00:00
9dac6437da lint: Filter out /usr/include from results (#164012)
Signed-off-by: Eli Uriegas <eliuriegas@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164012
Approved by: https://github.com/ZainRizvi
ghstack dependencies: #164008
2025-09-27 00:54:07 +00:00
8a0e8cad5f lint: Only include files in pytorch (#164008)
We were seeing instances of stdlib files in clang-tidy output, so this essentially just removes them from the things that lintrunner will report. The longer-term fix here would be to modify the clang-tidy configuration to do the correct thing, but that requires a bit more investigation as to why this is only happening in CI and is not reproducible locally.

Signed-off-by: Eli Uriegas <eliuriegas@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164008
Approved by: https://github.com/ZainRizvi
2025-09-27 00:54:07 +00:00
3a115da3e6 [torchfuzz] ones over zero (#164002)
Reduces the likelihood of divide-by-zero errors. Long term, we'll probably want to just fuzz these values entirely.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164002
Approved by: https://github.com/pianpwk
ghstack dependencies: #163743, #163812, #163890
2025-09-27 00:53:02 +00:00
b48a3d0a38 [CuTe] Add layout overlap checking util function in _MeshLayout (#163367)
While refactoring the bookkeeping for DeviceMesh to leverage CuTe layouts, we found that we need two more util functions. One is to check whether a layout has overlap inside it or not. For example, (2,2):(2,1) has no overlap while (2,2):(2,2) has overlap.
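
A minimal sketch of what "overlap" means here (hypothetical helper, not the actual `_MeshLayout` code): a layout shape:stride overlaps if two distinct index tuples map to the same linear offset.

```python
from itertools import product

# Hypothetical helper: brute-force check whether any two indices collide.
def has_overlap(shape, strides):
    seen = set()
    for idx in product(*(range(s) for s in shape)):
        offset = sum(i * st for i, st in zip(idx, strides))
        if offset in seen:
            return True
        seen.add(offset)
    return False

print(has_overlap((2, 2), (2, 1)))  # False: offsets 0, 1, 2, 3 are all distinct
print(has_overlap((2, 2), (2, 2)))  # True: (0, 1) and (1, 0) both map to offset 2
```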

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163367
Approved by: https://github.com/fegin
ghstack dependencies: #163212, #163288, #163928, #163930
2025-09-27 00:22:14 +00:00
8d474bdc14 Change python grid calc for MTIA back to python mode (#163601)
Differential Revision: D83000165

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163601
Approved by: https://github.com/blaine-rister
2025-09-27 00:12:53 +00:00
008051b13c [Dynamic Shape][BE] trim _DimHint serialization (#163891)
Summary:
The current serialization is a bit hard to read:
```
Exporting with the dynamic shape spec: {getitem_123: (_DimHint(type=<_DimHintType.DYNAMIC: 3>, min=1, max=64, _factory=False)), getitem_118: (_DimHint(type=<_DimHintType.DYNAMIC: 3>,
min=489, max=31232, _factory=False)), getitem_117: (_DimHint(type=<_DimHintType.DYNAMIC: 3>, min=489, max=31232, _factory=False)), getitem_116: (_DimHint(type=<_DimHintType.DYNAMIC: 3>, min=489, max=31232, _factory=False)), getitem_115: (
_DimHint(type=<_DimHintType.STATIC: 2>, min=None, max=None, _factory=True), _DimHint(type=<_DimHintType.DYNAMIC: 3>, min=1, max=64, _factory=False)), getitem_46: (_DimHint(type=<_DimHintType.DYNAMIC: 3>, min=29, max=1792, _factory=False),
 _DimHint(type=<_DimHintType.STATIC: 2>, min=None, max=None, _factory=True)), _predict_module__base_model_model_ro_sparse_arch_ebc__output_dists_0__dist: (_DimHint(type=<_DimHintType.DYNAMIC: 3>, min=1, max=64, _factory=False), _DimHint(t
ype=<_DimHintType.STATIC: 2>, min=None, max=None, _factory=True)), _predict_module__base_model_model_nro_sparse_arch_ebc__output_dists_0__dist: (_DimHint(type=<_DimHintType.DYNAMIC: 3>, min=29, max=1792, _factory=False)...
```

Test Plan: UT

Differential Revision: D83175131

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163891
Approved by: https://github.com/pianpwk
2025-09-27 00:08:01 +00:00
e4ffd718ec Fix setting of memory fraction in test_garbage_collect_expandable (#164000)
Fixes #160598
Fixes #160551
Fixes #160507

This PR fixes a bug in the `test_garbage_collect_expandable` unit test where the finally block incorrectly re-reads the current per-process memory fraction instead of setting the original value. Without the fix, the other tests in the `test/test_cuda.py` test suite were impacted and failed with an OOM error on ROCm.

This ensures proper cleanup and isolation of test state, maintaining test correctness and avoiding side effects like the OOM error below.
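
A minimal sketch of the save/restore pattern the fix enforces (illustrative names and values; the real test captures the original fraction up front): the `finally` block must set back the value captured before the test changed it, not whatever the current fraction happens to be.

```python
import torch

def run_with_memory_fraction(body, new_fraction=0.5, device=0, original_fraction=1.0):
    # `original_fraction` stands in for the value the real test reads beforehand.
    torch.cuda.set_per_process_memory_fraction(new_fraction, device)
    try:
        body()
    finally:
        # Restore the captured original value, not the current (modified) setting.
        torch.cuda.set_per_process_memory_fraction(original_fraction, device)
```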

For example, `test_autocast_checkpointing`  failed with the below error https://github.com/pytorch/pytorch/actions/runs/17982223758/job/51153974194 on ROCm

`torch.OutOfMemoryError: HIP out of memory. Tried to allocate 76.00 MiB. GPU 0 has a total capacity of 255.69 GiB of which 252.97 GiB is free. 1.20 GiB allowed; Of the allocated memory 1.14 GiB is allocated by PyTorch, with 17.00 MiB allocated in private pools (e.g., HIP Graphs), and 18.63 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164000
Approved by: https://github.com/jeffdaily
2025-09-26 23:57:32 +00:00
ed3085814a [cuDNN][SDPA] Disable dropout for cuDNN SDPA on 9.11 - 9.13 (#163903)
cuDNN introduced some broken heuristics for these cases so we need to disable dropout to avoid unexpected crashes due to heuristics refusing to proceed

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163903
Approved by: https://github.com/ngimel, https://github.com/malfet, https://github.com/atalman
2025-09-26 23:50:09 +00:00
e2817ac204 [cuDNN][Convolution] Disable cuDNN for 3D convolutions with kernel size != 1 for cuDNN 9.8+ (#163581)
To workaround #163539

Still confirming whether 9.10 is affected. The original test states that the convolution is "large," but note that the input size does not appear to require 64-bit indexing.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163581
Approved by: https://github.com/ngimel, https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2025-09-26 23:47:29 +00:00
1d138e658d [AOTI] log error triton kernel name during autotune (#163889)
Summary: Can't tell from the current error message which kernel got the exception.

Test Plan: lint & pyre

Reviewed By: muchulee8

Differential Revision: D83246522

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163889
Approved by: https://github.com/jansel
2025-09-26 23:29:49 +00:00
f9095fb285 [Windows] Update libuv version from 1.39 to 1.51 (#160318)
Fixes: [#148315](https://github.com/pytorch/pytorch/issues/148315)

The PR updates `libuv` version as `conda-forge` channel doesn't contain `libuv=1.39` for Windows.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160318
Approved by: https://github.com/iremyux, https://github.com/malfet
2025-09-26 23:29:21 +00:00
a0136f149c [MPS] Fix nan behavior in grid_sampler_3d (#163881)
Fixes #163851
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163881
Approved by: https://github.com/malfet
2025-09-26 23:08:00 +00:00
62b0ebd8f9 [MPS] [Sparse] unique_dim and sparse broadcast (#163694)
Implements unique_dim and sparse broadcast ops, and adds MPS dtypes for tests where we expect to fail; otherwise they would always fail due to being run in double precision.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163694
Approved by: https://github.com/malfet
2025-09-26 23:03:13 +00:00
19f16a65b4 [torchfuzz] Add support for fuzz templates (#163890)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163890
Approved by: https://github.com/pianpwk
ghstack dependencies: #163743, #163812
2025-09-26 22:51:45 +00:00
0ebfa3d7d2 Avoid fast path mask left-align check in compiled TransformerEncoder (#163773)
Fixes #163640

This PR avoids a mask left align check in the case that we're operating under torch.compile / torch.export. Originally, I planned to make a more invasive change to auto-disable the fast path entirely underneath torch.compile / torch.export, but I realized during testing that the fast path wasn't actually causing compile issues outside of the narrow issue identified here.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163773
Approved by: https://github.com/mikaylagawarecki
2025-09-26 22:29:37 +00:00
0ea10f9912 [cuDNN][conv][64-bit] Disable cuDNN for 64-bit depthwise convs again (#163171)
The test is breaking; will check if there's an older version that we can enable this on to avoid completely dropping support.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163171
Approved by: https://github.com/ngimel, https://github.com/malfet
2025-09-26 22:12:17 +00:00
48a852b7ae [AOTI] Update AOTInductor tutorial (#163808)
Summary: Remove the BC breaking warning. Add inductor_config to the example code.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163808
Approved by: https://github.com/yushangdi
2025-09-26 22:01:31 +00:00
f1260c9b9a [ROCm][CI/CD] upgrade nightly wheels to ROCm 7.0 (#163937)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163937
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-09-26 21:42:09 +00:00
28c7d11428 [AOTI] Pass in shape_env for get_stride_order (#163925)
Summary:
As titled.
Without the diff, we got P1963055009

With the diff passing in the environment, we can do correct sym_int deduction:
https://fburl.com/mlhub/p5zy7o28

Test Plan:
```
buck2 test 'fbcode//mode/opt' fbcode//caffe2/test/inductor:unbacked_symints -- test_sdfpa_unbacked_strides --print-passing-details --env TORCHDYNAMO_EXTENDED_DEBUG_CPP=1 --env TORCHDYNAMO_EXTENDED_DEBUG_GUARD_ADDED="Eq(u0, 0)"
```
Without the fix: P1964887260
With the fix: P1964888579

Differential Revision: D83211018

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163925
Approved by: https://github.com/ColinPeppler
2025-09-26 21:10:03 +00:00
a60c6ed99f [DeviceMesh][ez] Extract the pg creation as a util function (#163930)
This is just to extract common logic into a util function because we will use it many times for the following stack of Device Mesh refactoring.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163930
Approved by: https://github.com/fegin
ghstack dependencies: #163212, #163288, #163928
2025-09-26 20:42:58 +00:00
c257570e6c [inductor] require shape in TritonCSEVariable (#162275)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162275
Approved by: https://github.com/mlazos
2025-09-26 20:41:12 +00:00
2f85de0b42 Fix preserve annotation with decomp (#163896)
If we use `fx_traceback.preserve_node_meta()`, we will have a few extra node.meta fields on nodes, such as "seq_nr", added from `fx/proxy.py`. As a result, there might be non-empty node.meta on graph nodes.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163896
Approved by: https://github.com/SherlockNoMad, https://github.com/ydwu4
2025-09-26 20:28:47 +00:00
e21b037756 Add tests for aot_export_joint_with_descriptors annotation (#163893)
As title, test

1) Annotation works with aot_export_joint_with_descriptor API
2) Annotation works with the 2 step "strict export.export + aot_export_joint_with_descriptor"
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163893
Approved by: https://github.com/SherlockNoMad
2025-09-26 19:25:44 +00:00
f8c7505855 [inductor] Fix unbounded number of substitutions when equality checks contain Max expr (#163685)
## Issue

From an internal use case, we found that if we have an equality rule like:

```
Max(15, u0) == s0 * Max(15, u0)
```

This would lead to a wrong substitution rule being generated in the substitution table; as a result, the process gets stuck in the substitution loop as if hanging indefinitely, doing the following substitutions:

```
Max(15, u0)
--> s0 * Max(15, u0)
--> s0 ** 2 * Max(15, u0)
--> s0 ** 3 * Max(15, u0)
--> s0 ** 4 * Max(15, u0)
...
```

The root cause is in SymPy expression comparison: as `Max` is [not inside the op class table](https://github.com/sympy/sympy/blob/1.14/sympy/core/basic.py#L50-L86), it takes the [UNKNOWN](https://github.com/sympy/sympy/blob/1.14/sympy/core/basic.py#L120) order and is considered bigger than any other type of expression.
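
A small SymPy illustration of the runaway rewrite described above (not Inductor's code): because the replacement still contains the pattern being replaced, repeated substitution keeps growing the expression.

```python
import sympy

s0, u0 = sympy.symbols("s0 u0", positive=True)
expr = sympy.Max(15, u0)
for _ in range(4):
    expr = expr.subs(sympy.Max(15, u0), s0 * sympy.Max(15, u0))
    print(expr)
# s0*Max(15, u0), s0**2*Max(15, u0), s0**3*Max(15, u0), s0**4*Max(15, u0)
```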

## Fix
1. Added a break-out from the substitution while-loop to warn about excessive substitutions; what threshold should be used here and how to pass it are open to suggestion, so a hard-coded static value is used for simplicity for now.
2. Enhanced the SymPy expression comparison logic so that we first check whether one expr "has" the other, to help work around the issue with `Max` here.

## Testing

- With the unit test alone --> the unit test gets stuck.
- With the unit test and the while-loop breakout, we see the tests finish with the warning "**Substitution limit reached**":
```
test/inductor/test_aot_inductor.py::AOTInductorTestABICompatibleCpu::test_unbounded_expr_substitutions_cpu W0923 13:00:37.864000 46140 /data/users/q1l1/pytorch/torch/_export/__init__.py:70] +============================+
W0923 13:00:37.864000 46140 /data/users/q1l1/pytorch/torch/_export/__init__.py:71] |     !!!   WARNING   !!!    |
W0923 13:00:37.865000 46140 /data/users/q1l1/pytorch/torch/_export/__init__.py:72] +============================+
W0923 13:00:37.865000 46140 /data/users/q1l1/pytorch/torch/_export/__init__.py:73] torch._export.aot_compile()/torch._export.aot_load() is being deprecated, please switch to directly calling torch._inductor.aoti_compile_and_package(torch.export.export())/torch._inductor.aoti_load_package() instead.
stats [('calls_captured', 5), ('unique_graphs', 1)]
inductor [('extern_calls', 2)]
graph_break []
aten_mm_info [('aten.mm_Max(15, u0)_16_64', 1)]
PASSED [5.6947s]
test/inductor/test_aot_inductor.py::AOTInductorTestABICompatibleGpu::test_unbounded_expr_substitutions_cuda W0923 13:00:39.633000 46140 /data/users/q1l1/pytorch/torch/_inductor/sizevars.py:765] [0/0] Substitution limit (30) reached w/ u1**30*Max(15, u0)
W0923 13:00:39.679000 46140 /data/users/q1l1/pytorch/torch/_inductor/sizevars.py:765] [0/0] Substitution limit (30) reached w/ 64*u1**30*Max(15, u0)
stats [('calls_captured', 5), ('unique_graphs', 1)]
inductor [('extern_calls', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('async_compile_cache_miss', 1)]
graph_break []
aten_mm_info [('aten.mm_Max(15, u0)_16_64', 1)]
PASSED [5.6278s]
test/inductor/test_aot_inductor.py::AOTInductorTestABICompatibleMps::test_unbounded_expr_substitutions_mps SKIPPED [0.0002s]

============================ 2 passed, 1 skipped, 870 deselected in 19.66s ============================
```

- with the unittest + comparison logic enhanced, we don't see the warning any more:
```
Running 3 items in this shard

test/inductor/test_aot_inductor.py::AOTInductorTestABICompatibleCpu::test_unbounded_expr_substitutions_cpu W0923 13:15:39.560000 290812 /data/users/q1l1/pytorch/torch/_export/__init__.py:70] +============================+
W0923 13:15:39.561000 290812 /data/users/q1l1/pytorch/torch/_export/__init__.py:71] |     !!!   WARNING   !!!    |
W0923 13:15:39.561000 290812 /data/users/q1l1/pytorch/torch/_export/__init__.py:72] +============================+
W0923 13:15:39.562000 290812 /data/users/q1l1/pytorch/torch/_export/__init__.py:73] torch._export.aot_compile()/torch._export.aot_load() is being deprecated, please switch to directly calling torch._inductor.aoti_compile_and_package(torch.export.export())/torch._inductor.aoti_load_package() instead.
stats [('calls_captured', 5), ('unique_graphs', 1)]
inductor [('extern_calls', 2)]
graph_break []
aten_mm_info [('aten.mm_Max(15, u0)_16_64', 1)]
PASSED [6.6093s]
test/inductor/test_aot_inductor.py::AOTInductorTestABICompatibleGpu::test_unbounded_expr_substitutions_cuda stats [('calls_captured', 5), ('unique_graphs', 1)]
inductor [('extern_calls', 2), ('benchmarking.InductorBenchmarker.benchmark_gpu', 2), ('async_compile_cache_miss', 1)]
graph_break []
aten_mm_info [('aten.mm_Max(15, u0)_16_64', 1)]
PASSED [6.0502s]
test/inductor/test_aot_inductor.py::AOTInductorTestABICompatibleMps::test_unbounded_expr_substitutions_mps SKIPPED [0.0002s]

============================ 2 passed, 1 skipped, 870 deselected in 21.99s ============================
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163685
Approved by: https://github.com/jansel
2025-09-26 18:46:36 +00:00
425ea90f95 [testing] Add test owner labels for some cuda? tests (#163296)
I am trying to give some test files better owner labels than `module: unknown`. I am not sure about them, but they seem pretty reasonable.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163296
Approved by: https://github.com/eqy, https://github.com/msaroufim
2025-09-26 18:26:56 +00:00
5b764267f4 [testing] Add test owner labels for some distributed tests (#163174)
I am trying to give some test files better owner labels than `module: unknown`. I am not sure about them, but they seem pretty reasonable.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163174
Approved by: https://github.com/ezyang
2025-09-26 18:19:04 +00:00
50c0550f5a Add magic TORCH_MAKE_PYBIND_ENUM_FASTER macro (#163527)
See comment on the macro definition. In short, pybind11 3.x
added `py::native_enum`, and also had to add overhead for that new way
to bind enums on the critical path for calling functions that take
regular old `py::enum_`s as arguments (for example, `__eq__`).

Differential Revision: [D82873169](https://our.internmc.facebook.com/intern/diff/D82873169/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163527
Approved by: https://github.com/ezyang
2025-09-26 17:59:22 +00:00
d7491fb1c1 Fix tensor creation with empty names crash (#163957)
Partially fixes #148324

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163957
Approved by: https://github.com/malfet, https://github.com/janeyx99
2025-09-26 17:41:00 +00:00
9534c59311 [Inductor] address comments from https://github.com/pytorch/pytorch/pull/163803 (#163901)
Summary: address comments from https://github.com/pytorch/pytorch/pull/163803

Differential Revision: D83291637

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163901
Approved by: https://github.com/desertfire
2025-09-26 17:18:44 +00:00
5880996b4c Expose torch.nn.utils.parametrize (#163835)
`torch.nn.utils.parametrize` is not imported from `torch/nn/utils/__init__.py`, so it is not exposed, which makes it hard for code editors to statically analyze the code and provide auto-completion based on the function signatures.
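
For reference, a minimal usage sketch of the now-exposed submodule (the parametrization class here is just an example):

```
import torch
from torch.nn.utils import parametrize

class Symmetric(torch.nn.Module):
    def forward(self, W):
        # Build a symmetric matrix from the upper triangle
        return W.triu() + W.triu(1).transpose(-1, -2)

linear = torch.nn.Linear(4, 4)
parametrize.register_parametrization(linear, "weight", Symmetric())
print(torch.allclose(linear.weight, linear.weight.T))  # True
```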

<img width="615" height="292" alt="Screenshot 2025-09-25 at 12 01 52 PM" src="https://github.com/user-attachments/assets/a276f6f0-87f3-4732-943d-2a92ea871974" />

after the fix:

<img width="964" height="393" alt="Screenshot 2025-09-25 at 12 02 16 PM" src="https://github.com/user-attachments/assets/ca47f09e-dc4e-4420-a2d2-11669e07471a" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163835
Approved by: https://github.com/albanD
2025-09-26 16:38:18 +00:00
1d26eb0fcc Move inductor.aot_compile to use new tracer (#163137)
Differential Revision: [D82603768](https://our.internmc.facebook.com/intern/diff/D82603768)

I suspect no one uses this API anymore, but it is still a useful path for more test cases.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163137
Approved by: https://github.com/avikchaudhuri
ghstack dependencies: #163136
2025-09-26 15:54:24 +00:00
a05f6ecfec Fix bug with renaming submodules in dynamo for new tracer (#163136)
Differential Revision: [D82603767](https://our.internmc.facebook.com/intern/diff/D82603767)

Previously, I forgot to handle the call_module case, which now will have export_root prepended to its names. Basically I want to clean up something like:
```
graph():
      %l_self_export_root_sub_mod = call_module[target=l_self_export_root_sub_mod](%x, %y)
      %l_self_export_root_sub_mod_1 = call_module[target=l_self_export_root_sub_mod](%x, %y)
  ```

A Dynamo graph can have call_module nodes whose names are messed up due to our wrapper.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163136
Approved by: https://github.com/avikchaudhuri
2025-09-26 15:54:24 +00:00
c106ee8515 [FakeTensor] Supplement the relevant logic for converting conv1d to conv2d in meta_conv (#160408)
## Fixes https://github.com/pytorch/pytorch/issues/159462 also fixes #163569 , #163604

## Summary
The issue is caused by the wrong stride of conv1d's result generated by meta_conv:
4d5b3f2d5a/torch/_meta_registrations.py (L2453-L2471)

and the wrong stride will be used to codegen size assert in inductor:
4d5b3f2d5a/torch/_inductor/ir.py (L6152-L6163)

## Reason
So why is the computed stride wrong in the meta_conv function? Because the corresponding backend converts conv1d to conv2d and changes the input tensor's size and memory_format (channels last), but meta_conv does not do this transformation, so a mismatch happens.
4d5b3f2d5a/aten/src/ATen/native/Convolution.cpp (L1502-L1510)
This PR just adds the corresponding logic in meta_conv.
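
A minimal repro sketch of the kind of mismatch described above (shapes are illustrative; whether the strides actually diverge depends on the backend in use):

```
import torch
from torch._subclasses.fake_tensor import FakeTensorMode

conv = torch.nn.Conv1d(16, 33, kernel_size=3)
x = torch.randn(20, 16, 50)
eager_stride = conv(x).stride()

with FakeTensorMode(allow_non_fake_inputs=True):
    fake_stride = conv(x).stride()

# Before the fix these could disagree when the backend internally runs
# conv1d as a channels-last conv2d, tripping Inductor's size/stride asserts.
print(eager_stride, fake_stride)
```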

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160408
Approved by: https://github.com/eellison, https://github.com/jansel, https://github.com/mlazos
2025-09-26 15:45:02 +00:00
8aba513506 [MPS] test sparse add MPS dtypes so we get proper expected failure (#163951)
Adds dtypeIfMPS so that if the op is supported we get a proper error like an unexpected success. Before, we would never get an unexpected success because tests were run in the torch.double dtype, which always fails on MPS since it does not support that dtype.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163951
Approved by: https://github.com/malfet
2025-09-26 14:48:58 +00:00
8c194a367e [DeviceMesh][ez] Add a type alias for backend config (#163928)
Create a type alias for `tuple[Optional[str], Optional[C10dBackend.Options]]` since it is too long.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163928
Approved by: https://github.com/fegin
ghstack dependencies: #163212, #163288
2025-09-26 14:46:53 +00:00
33f3413bd3 [WIP][precompile] Set fake_mode of base tensor in fx graph pickler (#163738)
Summary:
When unpickling a fake tensor in the fx graph pickler, it only sets the fake mode of the current tensor's metadata to the one consistent with the pickler's `unpickle_state`. However, it doesn't set the fake mode of a tensor's base tensor when that tensor is a view.

This will cause an issue when dumping and loading the following graph
```
class GraphModule(torch.nn.Module):
    def forward(self, s77: "Sym(s77)", L_x_: "f32[s77, 8]"):
        l_x_ = L_x_
        chunk = l_x_.chunk(2, dim = -1);  l_x_ = None
        y: "f32[s77, 4]" = chunk[0];  chunk = None
        y_repeat: "f32[s77, 8]" = y.repeat_interleave(2, dim = -1);  y = None
        return (y_repeat,)
```
because `repeat_interleave` will create an intermediate fake tensor of size `[s77, 2, 4]` and it will become the base of the node `y_repeat`'s `meta['val']`.

This causes issues during the deserialization phase when applying AOT precompile to DeepSeek in vLLM.

Test Plan:
This has been tested in vLLM with DeepSeek.

As for unittest, ideally it should be `test_aot_compile_repeat_interleave` with mark_dynamic turned on. However, that's leading to some other pickle issues.

```
python test/dynamo/test_aot_compile.py -k test_aot_compile_repeat_interleave
```

I have yet to figure out a more appropriate unittest. But a proof-of-concept demo would be the following:
```
import inspect
import sympy
import torch
from torch.fx._graph_pickler import GraphPickler
from torch.fx.experimental.symbolic_shapes import ShapeEnv
from torch._subclasses import FakeTensorMode
from torch.fx._graph_pickler import GraphPickler, Options
from unittest.mock import patch

class M(torch.nn.Module):
    def forward(self, x):
        chunk = x.chunk(2, dim=-1)
        y = chunk[0]
        y_repeat = y.repeat_interleave(2, dim=-1)
        return y_repeat

def my_custom_backend(gm, example_inputs):
    global gm_global
    gm_global = gm
    return gm.forward

m = M()
m_opt = torch.compile(m, backend=my_custom_backend, fullgraph=True)

sample_inputs = (torch.randn(2, 8),)
torch._dynamo.mark_dynamic(sample_inputs[0], [0])
opt_out = m_opt(*sample_inputs)

graph_reducer_override = GraphPickler.reducer_override

def _graph_reducer_override(self, obj):
    if (inspect.isclass(obj) and issubclass(obj, sympy.Function)
            and hasattr(obj, "_torch_unpickler")):
        return obj._torch_unpickler, (obj._torch_handler_name, )
    if isinstance(obj, FakeTensorMode):
        return type(None), ()
    return graph_reducer_override(self, obj)

with patch.object(GraphPickler, "reducer_override", _graph_reducer_override):
    pickled_gm = GraphPickler.dumps(gm_global, Options(ops_filter=None))

fake_mode = FakeTensorMode(shape_env=ShapeEnv())
loaded_gm = GraphPickler.loads(pickled_gm, fake_mode)
```

Differential Revision: D83112599

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163738
Approved by: https://github.com/zhxchen17
2025-09-26 14:36:37 +00:00
d4e4f70768 Fix overflow in slow_conv3d when kernel size is too large. (#162718)
Also adds a check for padding to avoid a segmentation fault caused by overflow.

Fixes #141846

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162718
Approved by: https://github.com/jgong5, https://github.com/Skylion007
2025-09-26 13:39:29 +00:00
bfd21cd3e6 Revert "Add less warps config to inner reductions (#162447)"
This reverts commit 768361e67f0eb36491d7b763ef38d7c928ebefe6.

Reverted https://github.com/pytorch/pytorch/pull/162447 on behalf of https://github.com/PaulZhang12 due to failed to land internally ([comment](https://github.com/pytorch/pytorch/pull/162447#issuecomment-3338680532))
2025-09-26 13:16:04 +00:00
7441a1b9b1 Update ruff to 0.13.1 (#163744)
Update ruff to 0.13.1 so that we can remove `UP038` from `pyproject.toml`, because it has been removed from ruff's supported rules.
There are some fixes, the most notable one is [(PYI059)](https://docs.astral.sh/ruff/rules/generic-not-last-base-class/#generic-not-last-base-class-pyi059)
```
Checks for classes inheriting from typing.Generic[] where Generic[] is not the last base class in the bases tuple.

```
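
A minimal illustration of what PYI059 flags (class names are hypothetical):

```
from typing import Generic, TypeVar

T = TypeVar("T")

class Bad(Generic[T], dict):  # PYI059: Generic[] should be the last base class
    pass

class Good(dict, Generic[T]):  # preferred ordering
    pass
```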

A BC-breaking change is introduced to change the typing of `OrderedSet.storage`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163744
Approved by: https://github.com/Skylion007, https://github.com/jingsh
2025-09-26 10:12:21 +00:00
6a2bd1f4ee [inductor] skip bmm when converting channel last (#159459)
Workaround for #159458 by removing the channels-last output setting for some nodes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159459
Approved by: https://github.com/etaf, https://github.com/eellison, https://github.com/shunting314
2025-09-26 09:11:40 +00:00
4783e3ff49 Update torch-xpu-ops commit pin (#163758)
Update the torch-xpu-ops commit to [intel/torch-xpu-ops@229e8b](229e8ba104), includes:

- Revert tracking of Work status for FlightRecorder in ProcessGroupXCCL to fix memory leak
- Enable SYCL warnings on Linux
- Fix accuracy issues with CTC loss
- Enable aten::nonzero_static on XPU backend
- Stop recursive calculations in polynomial kernels if tensor has NaNs
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163758
Approved by: https://github.com/EikanWang
2025-09-26 09:05:08 +00:00
c8e5b7dabb Add SDPA patterns for T5 variants when batch size is 1 (#163252)
As mentioned in
https://github.com/pytorch/pytorch/blob/main/torch/_inductor/fx_passes/fuse_attention.py#L838, this PR generates patterns for the case batch size == 1.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163252
Approved by: https://github.com/Valentine233, https://github.com/jansel
2025-09-26 08:50:06 +00:00
04b51499f7 [CPU] Support transpose and packing fusion for bit8 (#163233)
To be used by CPU INT8 SDPA in TorchAO https://github.com/pytorch/ao/pull/3025. This change yields a kernel improvement of about 9%.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163233
Approved by: https://github.com/mingfeima, https://github.com/jansel
2025-09-26 07:15:04 +00:00
54461a53bd [Inductor] Check if profiling before using record_function in CompiledFxGraph (#163747)
The call to `record_function` adds overhead even if profiling is disabled, which can as much as double the total runtime overhead of a compiled function. #163566 aims to make `record_function` more efficient, but doesn't fully eliminate overhead. This change adds a check if profiling is active before using `record_function`, which avoids this issue all together.

`TestExecutionTrace.test_execution_trace_with_pt2` in https://github.com/pytorch/pytorch/blob/main/test/profiler/test_execution_trace.py#L372 already checks that the `record_function` region is tracked during profiling.
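
A minimal sketch of the guard pattern (the actual check used inside CompiledFxGraph may differ; `profiling_active` stands in for it):

```
from torch.profiler import record_function

def call_compiled(compiled_fn, inputs, profiling_active: bool):
    # Only pay the record_function context-manager cost when a profiler is
    # actually attached; otherwise call straight through.
    if profiling_active:
        with record_function("CompiledFxGraph"):
            return compiled_fn(inputs)
    return compiled_fn(inputs)
```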

Comparison of the `benchmarks/dynamo/microbenchmarks/overheads.py ` results:

Before Change:
```
requires_grad=False
compiled 56.9us (warmup=10.7s)

requires_grad=True
compiled 99.4us (warmup=0.2s)

inference_mode()
compiled 55.7us (warmup=0.1s)
```

After Change:
```
requires_grad=False
eager    6.9us (warmup=0.0s)
compiled 23.9us (warmup=22.3s)

requires_grad=True
eager    8.7us (warmup=0.0s)
compiled 56.8us (warmup=0.1s)

inference_mode()
eager    6.3us (warmup=0.0s)
compiled 22.2us (warmup=0.1s)
```

Additionally, #163866 introduces an instruction count benchmark. Because that is not merged and activated yet, here is a comparison:

Before Change:
```
runtime_overhead_inductor,instruction_count,222645
runtime_overhead_inductor_inference_mode,instruction_count,234998
runtime_overhead_inductor_requires_grad,instruction_count,293556
runtime_overhead_inductor_requires_grad_backward,instruction_count,78181
runtime_overhead_inductor_dynamic,instruction_count,234870
runtime_overhead_inductor_inference_mode_dynamic,instruction_count,248711
runtime_overhead_inductor_requires_grad_dynamic,instruction_count,309979
runtime_overhead_inductor_requires_grad_backward_dynamic,instruction_count,77599
```

After Change:
```
runtime_overhead_inductor,instruction_count,149997
runtime_overhead_inductor_inference_mode,instruction_count,163397
runtime_overhead_inductor_requires_grad,instruction_count,220722
runtime_overhead_inductor_requires_grad_backward,instruction_count,78276
runtime_overhead_inductor_dynamic,instruction_count,161177
runtime_overhead_inductor_inference_mode_dynamic,instruction_count,175495
runtime_overhead_inductor_requires_grad_dynamic,instruction_count,235674
runtime_overhead_inductor_requires_grad_backward_dynamic,instruction_count,77475
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163747
Approved by: https://github.com/mlazos, https://github.com/anijain2305
2025-09-26 06:49:40 +00:00
d1403250c9 Fix specialize_impl from triton.runtime.jit (#163844)
Summary:
In https://github.com/triton-lang/triton/pull/7771/, create_specialize_impl was removed, so we extend support using native_specialize_impl.

Otherwise, PyTorch won't work with trunk triton.
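
A hedged sketch of the import fallback (the exact names and module used in `triton_kernel_wrap.py` may differ):

```
try:
    # Older Triton releases
    from triton.runtime.jit import specialize_impl as specialize_impl_orig
except ImportError:
    # Trunk Triton, after triton-lang/triton#7771
    from triton.runtime.jit import native_specialize_impl as specialize_impl_orig
```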

Test Plan:
scripts/lufang/llm/launch_qwen3_vl_235b_a22b_thinking_2507_h100.sh

No more error message like
```
(Worker_TP0_EP0 pid=190353) [rank0]:W0924 23:24:48.190000 190353 /data/users/lufang/fbsource/fbcode/caffe2/torch/_higher_order_ops/triton_kernel_wrap.py:948] [0/0] Encountered an exception in identify_mutated_tensors, assuming every input is mutated
(Worker_TP0_EP0 pid=190353) [rank0]:W0924 23:24:48.190000 190353 /data/users/lufang/fbsource/fbcode/caffe2/torch/_higher_order_ops/triton_kernel_wrap.py:948] [0/0] Traceback (most recent call last):
(Worker_TP0_EP0 pid=190353) [rank0]:W0924 23:24:48.190000 190353 /data/users/lufang/fbsource/fbcode/caffe2/torch/_higher_order_ops/triton_kernel_wrap.py:948] [0/0]   File "/data/users/lufang/fbsource/buck-out/v2/gen/fbcode/4e83bca020adbfd7/smart/inference_platform_sp/llm_predictor_gpu/__service__/service#link-tree/to
rch/_higher_order_ops/triton_kernel_wrap.py", line 924, in identify_mutated_tensors
(Worker_TP0_EP0 pid=190353) [rank0]:W0924 23:24:48.190000 190353 /data/users/lufang/fbsource/fbcode/caffe2/torch/_higher_order_ops/triton_kernel_wrap.py:948] [0/0]     ttir_module, ordered_tensor_names = generate_ttir(
(Worker_TP0_EP0 pid=190353) [rank0]:W0924 23:24:48.190000 190353 /data/users/lufang/fbsource/fbcode/caffe2/torch/_higher_order_ops/triton_kernel_wrap.py:948] [0/0]   File "/data/users/lufang/fbsource/buck-out/v2/gen/fbcode/4e83bca020adbfd7/smart/inference_platform_sp/llm_predictor_gpu/__service__/service#link-tree/to
rch/_higher_order_ops/triton_kernel_wrap.py", line 419, in generate_ttir
(Worker_TP0_EP0 pid=190353) [rank0]:W0924 23:24:48.190000 190353 /data/users/lufang/fbsource/fbcode/caffe2/torch/_higher_order_ops/triton_kernel_wrap.py:948] [0/0]     specialization = _get_specialization(ordered_args.values())
(Worker_TP0_EP0 pid=190353) [rank0]:W0924 23:24:48.190000 190353 /data/users/lufang/fbsource/fbcode/caffe2/torch/_higher_order_ops/triton_kernel_wrap.py:948] [0/0]   File "/data/users/lufang/fbsource/buck-out/v2/gen/fbcode/4e83bca020adbfd7/smart/inference_platform_sp/llm_predictor_gpu/__service__/service#link-tree/to
rch/_higher_order_ops/triton_kernel_wrap.py", line 390, in _get_specialization
(Worker_TP0_EP0 pid=190353) [rank0]:W0924 23:24:48.190000 190353 /data/users/lufang/fbsource/fbcode/caffe2/torch/_higher_order_ops/triton_kernel_wrap.py:948] [0/0]     from triton.runtime.jit import specialize_impl as specialize_impl_orig
(Worker_TP0_EP0 pid=190353) [rank0]:W0924 23:24:48.190000 190353 /data/users/lufang/fbsource/fbcode/caffe2/torch/_higher_order_ops/triton_kernel_wrap.py:948] [0/0] ImportError: cannot import name 'specialize_impl' from 'triton.runtime.jit' (/data/users/lufang/fbsource/buck-out/v2/gen/fbcode/4e83bca020adbfd7/smart/inf
erence_platform_sp/llm_predictor_gpu/__service__/service#link-tree/triton/runtime/jit.py)
(Worker_TP1_EP1 pid=190354) [rank1]:W0924 23:24:48.210000 190354 /data/users/lufang/fbsource/fbcode/caffe2/torch/_higher_order_ops/triton_kernel_wrap.py:948] [0/0] Encountered an exception in identify_mutated_tensors, assuming every input is mutated
(Worker_TP1_EP1 pid=190354) [rank1]:W0924 23:24:48.210000 190354 /data/users/lufang/fbsource/fbcode/caffe2/torch/_higher_order_ops/triton_kernel_wrap.py:948] [0/0] Traceback (most recent call last):
(Worker_TP1_EP1 pid=190354) [rank1]:W0924 23:24:48.210000 190354 /data/users/lufang/fbsource/fbcode/caffe2/torch/_higher_order_ops/triton_kernel_wrap.py:948] [0/0]   File "/data/users/lufang/fbsource/buck-out/v2/gen/fbcode/4e83bca020adbfd7/smart/inference_platform_sp/llm_predictor_gpu/__service__/service#link-tree/to
rch/_higher_order_ops/triton_kernel_wrap.py", line 924, in identify_mutated_tensors
(Worker_TP1_EP1 pid=190354) [rank1]:W0924 23:24:48.210000 190354 /data/users/lufang/fbsource/fbcode/caffe2/torch/_higher_order_ops/triton_kernel_wrap.py:948] [0/0]     ttir_module, ordered_tensor_names = generate_ttir(
(Worker_TP1_EP1 pid=190354) [rank1]:W0924 23:24:48.210000 190354 /data/users/lufang/fbsource/fbcode/caffe2/torch/_higher_order_ops/triton_kernel_wrap.py:948] [0/0]   File "/data/users/lufang/fbsource/buck-out/v2/gen/fbcode/4e83bca020adbfd7/smart/inference_platform_sp/llm_predictor_gpu/__service__/service#link-tree/to
rch/_higher_order_ops/triton_kernel_wrap.py", line 419, in generate_ttir
(Worker_TP1_EP1 pid=190354) [rank1]:W0924 23:24:48.210000 190354 /data/users/lufang/fbsource/fbcode/caffe2/torch/_higher_order_ops/triton_kernel_wrap.py:948] [0/0]     specialization = _get_specialization(ordered_args.values())
(Worker_TP1_EP1 pid=190354) [rank1]:W0924 23:24:48.210000 190354 /data/users/lufang/fbsource/fbcode/caffe2/torch/_higher_order_ops/triton_kernel_wrap.py:948] [0/0]   File "/data/users/lufang/fbsource/buck-out/v2/gen/fbcode/4e83bca020adbfd7/smart/inference_platform_sp/llm_predictor_gpu/__service__/service#link-tree/to
rch/_higher_order_ops/triton_kernel_wrap.py", line 390, in _get_specialization
(Worker_TP1_EP1 pid=190354) [rank1]:W0924 23:24:48.210000 190354 /data/users/lufang/fbsource/fbcode/caffe2/torch/_higher_order_ops/triton_kernel_wrap.py:948] [0/0]     from triton.runtime.jit import specialize_impl as specialize_impl_orig
(Worker_TP1_EP1 pid=190354) [rank1]:W0924 23:24:48.210000 190354 /data/users/lufang/fbsource/fbcode/caffe2/torch/_higher_order_ops/triton_kernel_wrap.py:948] [0/0] ImportError: cannot import name 'specialize_impl' from 'triton.runtime.jit' (/data/users/lufang/fbsource/buck-out/v2/gen/fbcode/4e83bca020adbfd7/smart/inf
erence_platform_sp/llm_predictor_gpu/__service__/service#link-tree/triton/runtime/jit.py)
(Worker_TP5_EP5 pid=190359) [rank5]:W0924 23:24:48.216000 190359 /data/users/lufang/fbsource/fbcode/caffe2/torch/_higher_order_ops/triton_kernel_wrap.py:948] [0/0] Encountered an exception in identify_mutated_tensors, assuming every input is mutated
(Worker_TP5_EP5 pid=190359) [rank5]:W0924 23:24:48.216000 190359 /data/users/lufang/fbsource/fbcode/caffe2/torch/_higher_order_ops/triton_kernel_wrap.py:948] [0/0] Traceback (most recent call last):
(Worker_TP5_EP5 pid=190359) [rank5]:W0924 23:24:48.216000 190359 /data/users/lufang/fbsource/fbcode/caffe2/torch/_higher_order_ops/triton_kernel_wrap.py:948] [0/0]   File "/data/users/lufang/fbsource/buck-out/v2/gen/fbcode/4e83bca020adbfd7/smart/inference_platform_sp/llm_predictor_gpu/__service__/service#link-tree/to
rch/_higher_order_ops/triton_kernel_wrap.py", line 924, in identify_mutated_tensors
```

Differential Revision: D83229128

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163844
Approved by: https://github.com/henryoier, https://github.com/davidberard98, https://github.com/BoyuanFeng
2025-09-26 06:37:26 +00:00
b42e81def5 Allow unbacked to unbacked replacements if rhs unbacked symbols are all inputs (#163652)
This partially solve the issue https://github.com/pytorch/pytorch/issues/163641. We do not need to ban unbacked to unbacked replacement if all rhs symbols are inputs since we know those symbols are seen by the whole program.

This issue was found as I was tracing some vLLM models with unbacked symbols, namely Qwen/Qwen2-1.5B-Instruct; making these replacements makes the reasoning logic easier.

As for the similar data-dependent pattern, I am thinking of creating a set of replacements that we apply only during static eval, instead of none, to make reasoning better.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163652
Approved by: https://github.com/bobrenjc93
2025-09-26 06:23:22 +00:00
2a45f30ae7 Exporting aten.conv with cuda under fake mode on a cuda-less machine (#163912)
Summary:
Improve op coverage of exporting a CUDA model on a CPU-only machine under fake tensor mode.

For `torch.nn.functional.conv2d`, it will call `_select_conv_backend` based on input and weight shapes.

When calling into `supportsDepthwiseConvolutionWithCuDNN()`, it calls `at::cuda::getCurrentDeviceProperties()` and fails on a CPU-only machine.

So we check if CUDA is actually enabled first.
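
A minimal repro sketch of the scenario (run on a CPU-only machine; under fake tensor mode no real CUDA allocation happens, and the exact failing path depends on which conv backend the shapes select):

```
import torch
from torch._subclasses.fake_tensor import FakeTensorMode

with FakeTensorMode():
    x = torch.randn(1, 3, 224, 224, device="cuda")
    w = torch.randn(3, 1, 3, 3, device="cuda")
    # Depthwise conv: backend selection may consult the cuDNN depthwise check,
    # which used to query CUDA device properties even without CUDA available.
    out = torch.nn.functional.conv2d(x, w, groups=3)

print(out.shape)
```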

Test Plan: TORCH_SHOW_CPP_STACKTRACES=1 buck2 run fbcode//caffe2/test:test_export -- --r nn_functional_conv2d

Reviewed By: angelayi, henryoier

Differential Revision: D80562984

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163912
Approved by: https://github.com/SherlockNoMad
2025-09-26 06:04:20 +00:00
11b4c0eb9e [aoti] Save compute information (#163792)
Metadata looks like:
```
{
  'AOTI_DEVICE_KEY': 'cpu',
  'AOTI_PLATFORM': 'linux',
  'AOTI_MACHINE': 'x86_64',
  'AOTI_CPU_ISA': 'AVX512',
  'AOTI_COMPUTE_CAPABILITY': '90'
}
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163792
Approved by: https://github.com/yushangdi, https://github.com/desertfire
ghstack dependencies: #163779
2025-09-26 05:40:44 +00:00
fb93491ddc [aoti] Load metadata w/o loading package (#163779)
Add a function to load the metadata stored in AOTI without needing to load the .so. This can be used to record which platform we compiled the .so on, which we can then check before loading the .so.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163779
Approved by: https://github.com/yushangdi, https://github.com/desertfire
2025-09-26 05:40:44 +00:00
39df24fe04 [Code Clean] Replace std::runtime_error with TORCH_CHECK (#163610)
Including:
- `torch/csrc/instruction_counter`
- `torch/csrc/lazy`
- `torch/csrc/monitor`
- `torch/csrc/profiler`
- `torch/csrc/dynamo`

Fixes part of #148114

Due to a personal mistake with PR #163317, this PR does the same thing, **and PR #163317 has already been approved by @albanD.**

This is a personal mistake on my part, and I'm so sorry about that. Hope you won't mind @albanD. 🥹

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163610
Approved by: https://github.com/albanD, https://github.com/Skylion007
2025-09-26 04:52:48 +00:00
bbde16fe98 [vllm hash update] update the pinned vllm hash (#163823)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned vllm hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163823
Approved by: https://github.com/pytorchbot
2025-09-26 04:29:52 +00:00
1b78ca2ef5 [Triton] [Inductor] Prune template selection based on decompose_k (#163781)
Summary:

Triton templates tend to perform very poorly on large K, hence the introduction of decompose_k. As a result, when decompose_k is selected will disable exploring the Triton templates. We may want to consider an override in the future.

Note: Based on the timing results it may be desirable to better refine/prune the decompose k decisions.

Testing:

Tested by looking at the autotune/compilation time using a single shape in TritonBench.
`TORCHINDUCTOR_FORCE_DISABLE_CACHES=1 ENABLE_PERSISTENT_TMA_MATMUL=1 python run --op gemm --rep 1000 --sleep 1.0 --m 512 --n 512 --k 300000 --only pt2_matmul_maxautotune`
Before this change:
`SingleProcess AUTOTUNE benchmarking takes 13.5368 seconds and 0.1595 seconds precompiling for 38 choices`
With this change:
`SingleProcess AUTOTUNE benchmarking takes 9.9626 seconds and 0.0020 seconds precompiling for 11 choices`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163781
Approved by: https://github.com/eellison, https://github.com/PaulZhang12
2025-09-26 04:09:35 +00:00
082eaf4aae [DeviceMesh] Add extra check in flatten result cache lookup (#163288)
While refactoring DeviceMesh bookkeeping, we found one corner case where we don't check whether the dims being flattened are the same as the dims that an existing flattened name maps to. So we add extra cases to the unit test and extra check logic to the code.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163288
Approved by: https://github.com/wz337, https://github.com/ezyang, https://github.com/fegin
ghstack dependencies: #163212
2025-09-26 03:41:58 +00:00
f1f2e3e4da [DeviceMesh] Introduce CuTe layout into devicemesh code base for internal bookkeeping (#163212)
DeviceMesh is essentially a way to specify how devices interact with each other, i.e. the device layout. The entries are all integers, but because meshes can have various shapes, internal bookkeeping becomes much more challenging. Our current internal bookkeeping inside DeviceMesh is not scalable, so in order to support new functions like `_unflatten` we would need to introduce very complicated logic inside DeviceMesh, as pointed out in this comment (https://github.com/pytorch/pytorch/pull/159482/files#r2256025452). Thanks to @lw 's suggestion and PoC PR (https://github.com/pytorch/pytorch/pull/160429), we realized that leveraging the CuTe layout algebra ([ref](https://docs.nvidia.com/cutlass/media/docs/cpp/cute/02_layout_algebra.html)) from Cutlass greatly simplifies our internal mechanical bookkeeping and makes the abstraction ops much easier to build on top of it. To make things go incrementally, we proposed a couple of steps here: https://github.com/pytorch/pytorch/issues/160337#issuecomment-3195106243.

On top of what we have been doing with PyCute, we want to keep adding methods to the wrapper class so that we can get the rank indexes needed for ProcessGroup creation from a layout object. We also added detailed explanations and comments (thanks to an LLM) and a unit test to show that the code indeed works as expected.

More PRs are on the way.

This is a continue of https://github.com/pytorch/pytorch/pull/161016 (originally messed with EasyCLA)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163212
Approved by: https://github.com/ezyang, https://github.com/fegin, https://github.com/lw
2025-09-26 03:32:19 +00:00
67cc0e0ac9 Add Static Dispatch Kernels (#163676) (#163870)
Summary:
X-link: https://github.com/facebookresearch/FBGEMM/pull/1951

X-link: https://github.com/pytorch/FBGEMM/pull/4927

Add a few missing static dispatch kernels for remote_ro.

Test Plan: Tested with scripts in D83028841.

Differential Revision: D83258808

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163870
Approved by: https://github.com/henryoier
2025-09-26 03:00:07 +00:00
bbf8aa43ef [a2av] Separate in/out splits into two tensors (#163837)
Old signature:
`all_to_all_vdev(Tensor input, Tensor(a!) out, Tensor(a!) in_out_splits, str group_name)`
New signature:
`all_to_all_vdev(Tensor input, Tensor(a!) out, Tensor in_splits, Tensor(a!) out_splits_offsets, str group_name)`

i.e. split `in_out_splits` into IN tensor and OUT tensor so that we can define the TORCH_LIBRARY signature better.
Also to be in line with the 2D version.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163837
Approved by: https://github.com/fduwjj
ghstack dependencies: #163886
2025-09-26 01:03:54 +00:00
5daa79fd6e Remove dataclass_slots (#163623)
`dataclass` now has a `slots` kwarg (Python 3.10+).
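
For reference, the stdlib feature this relies on:

```
from dataclasses import dataclass

@dataclass(slots=True)  # replaces the custom dataclass_slots helper
class Point:
    x: int
    y: int

p = Point(1, 2)  # slotted instances skip the per-instance __dict__
```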

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163623
Approved by: https://github.com/Skylion007
2025-09-26 00:54:42 +00:00
b776e0c71e [ROCm][CI/CD] create ROCm 7.0 magma tarball (#163883)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163883
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-09-26 00:51:17 +00:00
5c2f09d1f9 [export] _detect_attribute_assignment gives warning instead of raising ValueError (#163809)
Summary:
LSTM was not exportable with non-strict export as it failed at `_detect_attribute_assignment`

This is because the `_flat_weights` attribute in LSTM is a list of registered parameters and will be updated by the `_update_flat_weights` method in `forward`.

However, in `_detect_attribute_assignment`, we manually restore the state of the module by `mod.__dict__.update(snapshot)`. Therefore, it should be fine to turn the `ValueError` into a warning so that RNN models are exportable with non-strict export.

Added test to verify that there is no lifted tensor constant and no fake tensor leakage.

Test Plan: buck2 run mode/dev-nosan caffe2/test:test_export -- -r test_export_rnn_variants_with_warning

Differential Revision: D83196971

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163809
Approved by: https://github.com/tugsbayasgalan
2025-09-26 00:43:29 +00:00
b4be380480 [ROCm] Implement float32 copy kernel (#163869)
* Add `float32_copy_kernel` for vectorizing float16/bfloat16 to float32 conversion

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163869
Approved by: https://github.com/jeffdaily
2025-09-26 00:39:30 +00:00
5b8fef3f17 Extend triton_mm auto-tune options for HIM shapes (#163273)
Summary:
Add an option to auto-tune for shape:
```
M=1024 N=171712 K=1024
```

Test Plan:
```
TRITON_PRINT_AUTOTUNING=1 buck2 run mode/opt-amd-gpu -c fbcode.enable_gpu_sections=true //pytorch/tritonbench:run -- --op fp8_gemm_rowwise --no_use_tma --no_use_persistent --m 1024 --n 171712 --k 1024 --bias
```
Before:
 {F1982074581}
After, saw a ~10% boost:
{F1982074585}

Differential Revision: D82687336

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163273
Approved by: https://github.com/jananisriram, https://github.com/Camyll
2025-09-26 00:05:57 +00:00
ff2f319e6e [MPS] Fix conv layout handling (#162776)
What started as simple fix for `mps_convolution_backward_input` resulted in a pretty significant refactor/fixes:
- Updated `mps_conv_use_channels_last` to return channels last output if either input or weights are channels last
- Use the same primitive throughout `Convolution.mm` to determine wether output should be allocated in channels last format or not

But doing only those two, resulted in crash in `test_memory_format_nn_Conv2d_mps_float32`, when weights were backward, and bias is present:
```
% python -c "import torch;print(torch.nn.functional.conv2d(torch.rand(2, 4, 3, 4,device='mps'), torch.rand(5, 4, 3, 3,device='mps').to(memory_format=torch.channels_last), torch.rand(5,device='mps')))"
/AppleInternal/Library/BuildRoots/4~B5E4ugDCh2RsPWAjMEoPu8LC5w1yXEwd7XweDhg/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShadersGraph/mpsgraph/MetalPerformanceShadersGraph/Core/Files/MPSGraphExecutable.mm:3619: failed assertion `Error: MLIR pass manager failed'
zsh: abort      python -c
```

Which requires a more thorough redesign/cleanup, namely:
- Do not alter the layout based on MacOS version, but rather do additional copies on MacOS-14 if inputs/output or weight are in channels-last format (done by defining `std::optional<Tensor> output_c;` that contains a contiguous copy of the output tensor)
- Introduced `input_suggested_layout` which is set to ChannelsLast if and only if input is channels last and is running on MacOS-15+
- Delete unused `memory_layout` and `group` arguments from `fill_depthwise_conv_desc`
- Fix bias broadcasting logic for channels last

As a result, in addition to adding one more regression test, this change removes `expectedFailures` from:
- `TestModule.test_memory_format` for `Conv2d`, `ConvTranspose2d`, `LazyConv1d`, `LazyConvTranspose1d`
- `test_require_stride_expanded_dynamic_shapes`
-  `test_mutable_custom_op_fixed_layout2` for MacOS-14

Fixes https://github.com/pytorch/pytorch/issues/161905

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162776
Approved by: https://github.com/Skylion007
2025-09-25 23:41:34 +00:00
94195a37ae [BE] Remove HermeticPyObjectTLS and Simplify PythonOpRegistrationTrampoline (#163464)
Removes HermeticPyObjectTLS as we no longer need it since torch::deploy is no longer supported. PythonOpRegistrationTrampoline is also drastically simplified and is being prepped for removal in a future PR.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163464
Approved by: https://github.com/albanD, https://github.com/Skylion007
2025-09-25 23:30:50 +00:00
suo
c58e096cd0 [DTensor] implement logsumexp (#163879)
As titled; mostly copied from an internal implementation. I am a DTensor noob, so please scrutinize my added test.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163879
Approved by: https://github.com/XilunWu
2025-09-25 23:08:30 +00:00
2a6e6a9e3b [FSDP][Replicate] tests replicate parity for shared parameters (#162836)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162836
Approved by: https://github.com/mori360
ghstack dependencies: #162830
2025-09-25 23:08:22 +00:00
6e6c899347 [Reland][163423] Promote @requires_nvshmem instead of enable_triton (#163549)
#163423 was approved but reverted due to a revert of base.
Relanding without base.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163549
Approved by: https://github.com/wdvr

Co-authored-by: Wouter Devriendt <wouterdevriendt@meta.com>
2025-09-25 23:02:00 +00:00
366961df78 [FSDP][Replicate] tests replicate parity with activation checkpointing (#162830)
**Summary:** In order to ensure that replicate acts as intended (as a specialized version of HSDP), we need to make sure it can pass the same tests that fully_shard can for training. This tests that the replicate function works correctly when combined with activation checkpointing.

**Test Case**
1. pytest test/distributed/_composable/test_replicate_training.py -k test_train_parity_with_activation_checkpointing

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162830
Approved by: https://github.com/mori360
2025-09-25 22:57:00 +00:00
520fca82c8 Refactor Provenance Tracking (#163378)
Summary:
- Move the `provenance_level` flag check to inside the `set_kernel_post_grad_provenance_tracing` call to simplify the code

- Move the `set_kernel_post_grad_provenance_tracing` call and `write_provenance_debug_handle` call to `codegen_comment`.

- If some `call_kernel` call sites don't have a preceding `codegen_comment` call, add one. Now all `call_kernel` call sites are accompanied by a `codegen_comment` call.

- Add a `codegen_comment` method to BaseScheduling and remove the noop `codegen_comment` method in Scheduling

- Remove `debug_handle` from `call_kernel`.

Test Plan:
CI

```
buck run @//mode/opt-split-dwarf fbcode//caffe2/test/inductor:provenance_tracing
```

Differential Revision: D82839271

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163378
Approved by: https://github.com/angelayi
2025-09-25 22:55:59 +00:00
908bcfd403 [AOTInductor] Add input information for Triton Kernels in AOTI (#160380)
Summary:
We use record_function to pass input information so that Kineto can display it.
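
A hedged Python-level illustration of the idea: `record_function` accepts an optional args string that profiler traces display next to the range name (the kernel name and shape string below are made up):

```
from torch.profiler import record_function

with record_function("triton_poi_fused_add_0", "input_shapes=[(1024, 1024)]"):
    pass  # kernel launch would go here
```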

Test Plan:
Before:
<img width="459" height="582" alt="Screenshot 2025-09-19 at 10 45 10 AM" src="https://github.com/user-attachments/assets/baa0c251-86e9-49ca-8c6c-fcd2619f7f48" />

After:
<img width="473" height="1130" alt="Screenshot 2025-09-19 at 10 44 53 AM" src="https://github.com/user-attachments/assets/b7942d84-0362-4b9e-9232-14de92bbdd00" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160380
Approved by: https://github.com/desertfire
ghstack dependencies: #163593
2025-09-25 22:41:04 +00:00
96275dbf88 [CI] Fix test_triton_wait_until hang (#163886)
I don't know why `nvshmem_barrier_all_kernel` causes the test to hang; I will investigate.
But since it is an unnecessary call here, I am removing it to unblock other PRs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163886
Approved by: https://github.com/fegin
2025-09-25 22:22:16 +00:00
b14a14a662 [torchfuzz] make generated code much more concise and cleaner (#163812)
```
import torch

torch._dynamo.config.capture_scalar_outputs = True
torch.manual_seed(42)

def fuzzed_program(arg_0, arg_1, arg_2):
    var_node_3 = arg_0 # size=(1,), stride=(1,), dtype=complex128, device=cuda
    var_node_4 = torch.full((1,), (-0.29262632146522655-0.7687848816195035j), dtype=torch.complex128) # size=(1,), stride=(1,), dtype=complex128, device=cuda
    var_node_2 = torch.ops.aten.add(var_node_3, var_node_4) # size=(1,), stride=(1,), dtype=complex128, device=cuda
    var_node_6 = arg_1 # size=(1,), stride=(1,), dtype=complex128, device=cuda
    var_node_7 = arg_2 # size=(1,), stride=(1,), dtype=complex128, device=cuda
    var_node_5 = torch.ops.aten.add(var_node_6, var_node_7) # size=(1,), stride=(1,), dtype=complex128, device=cuda
    var_node_1 = torch.ops.aten.add(var_node_2, var_node_5) # size=(1,), stride=(1,), dtype=complex128, device=cuda
    var_node_0 = var_node_1.item() # dtype=complex128
    return var_node_0

arg_0 = torch.as_strided(torch.randn(1).to(torch.complex128), (1,), (1,))
arg_1 = torch.as_strided(torch.randn(1).to(torch.complex128), (1,), (1,))
arg_2 = torch.as_strided(torch.randn(1).to(torch.complex128), (1,), (1,))

args = (arg_0, arg_1, arg_2)
result_original = fuzzed_program(*args)
print(' eager success')
compiled_program = torch.compile(fuzzed_program, fullgraph=False, dynamic=True)
result_compiled = compiled_program(*args)
print(' compile success')
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163812
Approved by: https://github.com/pianpwk
ghstack dependencies: #163743
2025-09-25 22:12:33 +00:00
92f7361e27 [DTensor] fix uneven _StridedShard (#163843)
The previous uneven `_StridedShard` from https://github.com/pytorch/pytorch/pull/150490 seems to fail on cases like sharding `tensor = torch.arange(6)` with FSDP 2, TP 2.

This PR attempts to reinvent `_StridedShard`.

I didn't test nested `_StridedShard`, because there shouldn't be any use cases. I think it will become quite messy when it comes to **nested uneven** `_StridedShard`. We are probably going to deprecate it anyway after @zpcore 's work https://github.com/pytorch/pytorch/pull/160266 on ordered sharding, so IMO not worth it to make it too general.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163843
Approved by: https://github.com/ezyang
2025-09-25 22:12:29 +00:00
6a6d838832 Add H100 runner to be recognized in actionlint (#163795)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163795
Approved by: https://github.com/huydhn, https://github.com/seemethere
2025-09-25 22:09:11 +00:00
183dca423f [Inductor] add a new config fallback_embedding_bag_byte_unpack (#163803)
Differential Revision: D82988783

Introduce an inductor config, `fallback_embedding_bag_byte_unpack`, so we have the option to not let Inductor decompose the op.
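
A hedged usage sketch (assuming the flag lives under `torch._inductor.config` like other Inductor fallback options):

```
import torch._inductor.config as inductor_config

# Keep embedding_bag_byte_unpack as a fallback/extern kernel instead of
# letting Inductor decompose it (flag location is an assumption).
inductor_config.fallback_embedding_bag_byte_unpack = True
```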

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163803
Approved by: https://github.com/henryoier
2025-09-25 22:07:04 +00:00
b8efa336d2 [torchfuzz] simplify codegen and runner (#163743)
Much less code. A follow-up PR will make these repro files even smaller; keeping them small is important since it reduces the time users need to understand what the repro is doing. Here's a sample:

```
(/home/bobren/local/a/pytorch-env) [21:34] devgpu009:/home/bobren/local/a/pytorch/tools/experimental/dynamic_shapes/torchfuzz [130] python fuzzer.py --seed 42
Running single fuzz_and_execute...
Using seed: 42, max_depth: 10
Running generated program...
Selected CUDA_VISIBLE_DEVICES=2
=== Program Output ===
 eager success
 compile success

===============================
=== Program Source ===
import torch
import sys
import os
fuzzer_dir = r'/home/bobren/local/a/pytorch/tools/experimental/dynamic_shapes/torchfuzz'
if fuzzer_dir not in sys.path:
    sys.path.insert(0, fuzzer_dir)
from tensor_fuzzer import fuzz_scalar, fuzz_tensor_simple, ScalarSpec, TensorSpec

def fuzzed_program(arg_0, arg_1, arg_2, arg_3, arg_4, arg_5, arg_6, arg_7, arg_8, arg_9, arg_10, arg_11, arg_12, arg_13, arg_14, arg_15, arg_16, arg_17, arg_18, arg_19, arg_20, arg_21, arg_22, arg_23, arg_24, arg_25, arg_26):
    # Node node_4: arg (depth 6)
    var_node_4 = arg_0 # size=(1,), stride=(1,), dtype=complex128, device=cuda

    # Node node_7: constant (depth 4)
    var_node_7 = torch.full((1,), (-0.8353595860703585-0.8384634248041143j), dtype=torch.complex128) # size=(1,), stride=(1,), dtype=complex128, device=cuda

    # Node node_8: arg (depth 4)
    var_node_8 = arg_1 # size=(1,), stride=(1,), dtype=complex128, device=cuda

    # Node node_6: tensor_pointwise (depth 5)
    var_node_6 = torch.ops.aten.mul(var_node_7, var_node_8) # size=(1,), stride=(1,), dtype=complex128, device=cuda

    # Node node_9: constant (depth 5)
    var_node_9 = torch.full((1,), (-0.32478860712861235+0.033909682598544454j), dtype=torch.complex128) # size=(1,), stride=(1,), dtype=complex128, device=cuda

    # Node node_5: tensor_pointwise (depth 6)
    var_node_5 = torch.ops.aten.mul(var_node_6, var_node_9) # size=(1,), stride=(1,), dtype=complex128, device=cuda

    # Node node_3: tensor_pointwise (depth 7)
    var_node_3 = torch.ops.aten.sub(var_node_4, var_node_5) # size=(1,), stride=(1,), dtype=complex128, device=cuda

    # Node node_11: arg (depth 6)
    var_node_11 = arg_2 # size=(1,), stride=(1,), dtype=complex128, device=cuda

    # Node node_18: constant (depth 0)
    var_node_18 = torch.full((1,), (0.12855308616305575+1.5268033634325642j), dtype=torch.complex128) # size=(1,), stride=(1,), dtype=complex128, device=cuda

    # Node node_19: arg (depth 0)
    var_node_19 = arg_3 # size=(1,), stride=(1,), dtype=complex128, device=cuda

    # Node node_17: tensor_pointwise (depth 1)
    var_node_17 = torch.ops.aten.mul(var_node_18, var_node_19) # size=(1,), stride=(1,), dtype=complex128, device=cuda

    # Node node_21: arg (depth 0)
    var_node_21 = arg_4 # size=(1,), stride=(1,), dtype=complex128, device=cuda

    # Node node_22: arg (depth 0)
    var_node_22 = arg_5 # size=(1,), stride=(1,), dtype=complex128, device=cuda

    # Node node_20: tensor_pointwise (depth 1)
    var_node_20 = torch.ops.aten.sub(var_node_21, var_node_22) # size=(1,), stride=(1,), dtype=complex128, device=cuda

    # Node node_16: tensor_pointwise (depth 2)
    var_node_16 = torch.ops.aten.add(var_node_17, var_node_20) # size=(1,), stride=(1,), dtype=complex128, device=cuda

    # Node node_25: arg (depth 0)
    var_node_25 = arg_6 # size=(1,), stride=(1,), dtype=complex128, device=cuda

    # Node node_26: arg (depth 0)
    var_node_26 = arg_7 # size=(1,), stride=(1,), dtype=complex128, device=cuda

    # Node node_24: tensor_pointwise (depth 1)
    var_node_24 = torch.ops.aten.add(var_node_25, var_node_26) # size=(1,), stride=(1,), dtype=complex128, device=cuda

    # Node node_27: constant (depth 1)
    var_node_27 = torch.full((1,), (-0.6315711191260084+1.342004076501214j), dtype=torch.complex128) # size=(1,), stride=(1,), dtype=complex128, device=cuda

    # Node node_23: tensor_pointwise (depth 2)
    var_node_23 = torch.ops.aten.mul(var_node_24, var_node_27) # size=(1,), stride=(1,), dtype=complex128, device=cuda

    # Node node_15: tensor_pointwise (depth 3)
    var_node_15 = torch.ops.aten.mul(var_node_16, var_node_23) # size=(1,), stride=(1,), dtype=complex128, device=cuda

    # Node node_28: constant (depth 3)
    var_node_28 = torch.full((1,), (1.064498531874825-0.37289464356501284j), dtype=torch.complex128) # size=(1,), stride=(1,), dtype=complex128, device=cuda

    # Node node_14: tensor_pointwise (depth 4)
    var_node_14 = torch.ops.aten.mul(var_node_15, var_node_28) # size=(1,), stride=(1,), dtype=complex128, device=cuda

    # Node node_30: arg (depth 3)
    var_node_30 = arg_8 # size=(1,), stride=(1,), dtype=complex128, device=cuda

    # Node node_32: arg (depth 2)
    var_node_32 = arg_9 # size=(1,), stride=(1,), dtype=complex128, device=cuda

    # Node node_33: constant (depth 2)
    var_node_33 = torch.full((1,), (1.5815627438573372+0.5124667911691704j), dtype=torch.complex128) # size=(1,), stride=(1,), dtype=complex128, device=cuda

    # Node node_31: tensor_pointwise (depth 3)
    var_node_31 = torch.ops.aten.div(var_node_32, var_node_33) # size=(1,), stride=(1,), dtype=complex128, device=cuda

    # Node node_29: tensor_pointwise (depth 4)
    var_node_29 = torch.ops.aten.div(var_node_30, var_node_31) # size=(1,), stride=(1,), dtype=complex128, device=cuda

    # Node node_13: tensor_pointwise (depth 5)
    var_node_13 = torch.ops.aten.div(var_node_14, var_node_29) # size=(1,), stride=(1,), dtype=complex128, device=cuda

    # Node node_39: arg (depth 0)
    var_node_39 = arg_10 # size=(1,), stride=(1,), dtype=complex128, device=cuda

    # Node node_40: constant (depth 0)
    var_node_40 = torch.full((1,), (-0.5987350493494642-0.5711360569376475j), dtype=torch.complex128) # size=(1,), stride=(1,), dtype=complex128, device=cuda

    # Node node_38: tensor_pointwise (depth 1)
    var_node_38 = torch.ops.aten.mul(var_node_39, var_node_40) # size=(1,), stride=(1,), dtype=complex128, device=cuda

    # Node node_41: arg (depth 1)
    var_node_41 = arg_11 # size=(1,), stride=(1,), dtype=complex128, device=cuda

    # Node node_37: tensor_pointwise (depth 2)
    var_node_37 = torch.ops.aten.add(var_node_38, var_node_41) # size=(1,), stride=(1,), dtype=complex128, device=cuda

    # Node node_42: constant (depth 2)
    var_node_42 = torch.full((1,), (0.7246044564672116-0.5930730980273312j), dtype=torch.complex128) # size=(1,), stride=(1,), dtype=complex128, device=cuda

    # Node node_36: tensor_pointwise (depth 3)
    var_node_36 = torch.ops.aten.mul(var_node_37, var_node_42) # size=(1,), stride=(1,), dtype=complex128, device=cuda

    # Node node_43: constant (depth 3)
    var_node_43 = torch.full((1,), (-0.7582976293117148+1.1880929376258396j), dtype=torch.complex128) # size=(1,), stride=(1,), dtype=complex128, device=cuda

    # Node node_35: tensor_pointwise (depth 4)
    var_node_35 = torch.ops.aten.mul(var_node_36, var_node_43) # size=(1,), stride=(1,), dtype=complex128, device=cuda

    # Node node_45: constant (depth 3)
    var_node_45 = torch.full((1,), (1.0896212896322774+0.3124038130417098j), dtype=torch.complex128) # size=(1,), stride=(1,), dtype=complex128, device=cuda

    # Node node_46: arg (depth 3)
    var_node_46 = arg_12 # size=(1,), stride=(1,), dtype=complex128, device=cuda

    # Node node_44: tensor_pointwise (depth 4)
    var_node_44 = torch.ops.aten.add(var_node_45, var_node_46) # size=(1,), stride=(1,), dtype=complex128, device=cuda

    # Node node_34: tensor_pointwise (depth 5)
    var_node_34 = torch.ops.aten.div(var_node_35, var_node_44) # size=(1,), stride=(1,), dtype=complex128, device=cuda

    # Node node_12: tensor_pointwise (depth 6)
    var_node_12 = torch.ops.aten.div(var_node_13, var_node_34) # size=(1,), stride=(1,), dtype=complex128, device=cuda

    # Node node_10: tensor_pointwise (depth 7)
    var_node_10 = torch.ops.aten.mul(var_node_11, var_node_12) # size=(1,), stride=(1,), dtype=complex128, device=cuda

    # Node node_2: tensor_pointwise (depth 8)
    var_node_2 = torch.ops.aten.div(var_node_3, var_node_10) # size=(1,), stride=(1,), dtype=complex128, device=cuda

    # Node node_48: constant (depth 7)
    var_node_48 = torch.full((1,), (-1.047745491289218+0.279447315087422j), dtype=torch.complex128) # size=(1,), stride=(1,), dtype=complex128, device=cuda

    # Node node_54: arg (depth 2)
    var_node_54 = arg_13 # size=(1,), stride=(1,), dtype=complex128, device=cuda

    # Node node_55: arg (depth 2)
    var_node_55 = arg_14 # size=(1,), stride=(1,), dtype=complex128, device=cuda

    # Node node_53: tensor_pointwise (depth 3)
    var_node_53 = torch.ops.aten.div(var_node_54, var_node_55) # size=(1,), stride=(1,), dtype=complex128, device=cuda

    # Node node_56: arg (depth 3)
    var_node_56 = arg_15 # size=(1,), stride=(1,), dtype=complex128, device=cuda

    # Node node_52: tensor_pointwise (depth 4)
    var_node_52 = torch.ops.aten.div(var_node_53, var_node_56) # size=(1,), stride=(1,), dtype=complex128, device=cuda

    # Node node_59: arg (depth 2)
    var_node_59 = arg_16 # size=(1,), stride=(1,), dtype=complex128, device=cuda

    # Node node_60: arg (depth 2)
    var_node_60 = arg_17 # size=(1,), stride=(1,), dtype=complex128, device=cuda

    # Node node_58: tensor_pointwise (depth 3)
    var_node_58 = torch.ops.aten.div(var_node_59, var_node_60) # size=(1,), stride=(1,), dtype=complex128, device=cuda

    # Node node_61: constant (depth 3)
    var_node_61 = torch.full((1,), (-0.7386327586576402-0.027025998767172658j), dtype=torch.complex128) # size=(1,), stride=(1,), dtype=complex128, device=cuda

    # Node node_57: tensor_pointwise (depth 4)
    var_node_57 = torch.ops.aten.add(var_node_58, var_node_61) # size=(1,), stride=(1,), dtype=complex128, device=cuda

    # Node node_51: tensor_pointwise (depth 5)
    var_node_51 = torch.ops.aten.sub(var_node_52, var_node_57) # size=(1,), stride=(1,), dtype=complex128, device=cuda

    # Node node_64: arg (depth 3)
    var_node_64 = arg_18 # size=(1,), stride=(1,), dtype=complex128, device=cuda

    # Node node_67: arg (depth 1)
    var_node_67 = arg_19 # size=(1,), stride=(1,), dtype=complex128, device=cuda

    # Node node_68: constant (depth 1)
    var_node_68 = torch.full((1,), (-0.6840241429755998+1.327637020136433j), dtype=torch.complex128) # size=(1,), stride=(1,), dtype=complex128, device=cuda

    # Node node_66: tensor_pointwise (depth 2)
    var_node_66 = torch.ops.aten.mul(var_node_67, var_node_68) # size=(1,), stride=(1,), dtype=complex128, device=cuda

    # Node node_69: arg (depth 2)
    var_node_69 = arg_20 # size=(1,), stride=(1,), dtype=complex128, device=cuda

    # Node node_65: tensor_pointwise (depth 3)
    var_node_65 = torch.ops.aten.sub(var_node_66, var_node_69) # size=(1,), stride=(1,), dtype=complex128, device=cuda

    # Node node_63: tensor_pointwise (depth 4)
    var_node_63 = torch.ops.aten.sub(var_node_64, var_node_65) # size=(1,), stride=(1,), dtype=complex128, device=cuda

    # Node node_70: arg (depth 4)
    var_node_70 = arg_21 # size=(1,), stride=(1,), dtype=complex128, device=cuda

    # Node node_62: tensor_pointwise (depth 5)
    var_node_62 = torch.ops.aten.sub(var_node_63, var_node_70) # size=(1,), stride=(1,), dtype=complex128, device=cuda

    # Node node_50: tensor_pointwise (depth 6)
    var_node_50 = torch.ops.aten.mul(var_node_51, var_node_62) # size=(1,), stride=(1,), dtype=complex128, device=cuda

    # Node node_76: constant (depth 1)
    var_node_76 = torch.full((1,), (1.864651314238342+0.27066487315113186j), dtype=torch.complex128) # size=(1,), stride=(1,), dtype=complex128, device=cuda

    # Node node_77: arg (depth 1)
    var_node_77 = arg_22 # size=(1,), stride=(1,), dtype=complex128, device=cuda

    # Node node_75: tensor_pointwise (depth 2)
    var_node_75 = torch.ops.aten.mul(var_node_76, var_node_77) # size=(1,), stride=(1,), dtype=complex128, device=cuda

    # Node node_78: arg (depth 2)
    var_node_78 = arg_23 # size=(1,), stride=(1,), dtype=complex128, device=cuda

    # Node node_74: tensor_pointwise (depth 3)
    var_node_74 = torch.ops.aten.add(var_node_75, var_node_78) # size=(1,), stride=(1,), dtype=complex128, device=cuda

    # Node node_79: arg (depth 3)
    var_node_79 = arg_24 # size=(1,), stride=(1,), dtype=complex128, device=cuda

    # Node node_73: tensor_pointwise (depth 4)
    var_node_73 = torch.ops.aten.mul(var_node_74, var_node_79) # size=(1,), stride=(1,), dtype=complex128, device=cuda

    # Node node_80: arg (depth 4)
    var_node_80 = arg_25 # size=(1,), stride=(1,), dtype=complex128, device=cuda

    # Node node_72: tensor_pointwise (depth 5)
    var_node_72 = torch.ops.aten.mul(var_node_73, var_node_80) # size=(1,), stride=(1,), dtype=complex128, device=cuda

    # Node node_82: constant (depth 4)
    var_node_82 = torch.full((1,), (1.6341547018841247+0.3096989611326181j), dtype=torch.complex128) # size=(1,), stride=(1,), dtype=complex128, device=cuda

    # Node node_84: constant (depth 3)
    var_node_84 = torch.full((1,), (0.9609065596935821+0.2920229825681946j), dtype=torch.complex128) # size=(1,), stride=(1,), dtype=complex128, device=cuda

    # Node node_85: arg (depth 3)
    var_node_85 = arg_26 # size=(1,), stride=(1,), dtype=complex128, device=cuda

    # Node node_83: tensor_pointwise (depth 4)
    var_node_83 = torch.ops.aten.add(var_node_84, var_node_85) # size=(1,), stride=(1,), dtype=complex128, device=cuda

    # Node node_81: tensor_pointwise (depth 5)
    var_node_81 = torch.ops.aten.sub(var_node_82, var_node_83) # size=(1,), stride=(1,), dtype=complex128, device=cuda

    # Node node_71: tensor_pointwise (depth 6)
    var_node_71 = torch.ops.aten.sub(var_node_72, var_node_81) # size=(1,), stride=(1,), dtype=complex128, device=cuda

    # Node node_49: tensor_pointwise (depth 7)
    var_node_49 = torch.ops.aten.mul(var_node_50, var_node_71) # size=(1,), stride=(1,), dtype=complex128, device=cuda

    # Node node_47: tensor_pointwise (depth 8)
    var_node_47 = torch.ops.aten.add(var_node_48, var_node_49) # size=(1,), stride=(1,), dtype=complex128, device=cuda

    # Node node_1: tensor_pointwise (depth 9)
    var_node_1 = torch.ops.aten.add(var_node_2, var_node_47) # size=(1,), stride=(1,), dtype=complex128, device=cuda

    # Node node_0: torch.ops.aten.item (depth 10)
    var_node_0 = var_node_1.item() # dtype=complex128

    # Final result from root node
    return var_node_0

arg_0 = fuzz_tensor_simple((1,), (1,), torch.complex128, seed=10042)
arg_1 = fuzz_tensor_simple((1,), (1,), torch.complex128, seed=10043)
arg_2 = fuzz_tensor_simple((1,), (1,), torch.complex128, seed=10044)
arg_3 = fuzz_tensor_simple((1,), (1,), torch.complex128, seed=10045)
arg_4 = fuzz_tensor_simple((1,), (1,), torch.complex128, seed=10046)
arg_5 = fuzz_tensor_simple((1,), (1,), torch.complex128, seed=10047)
arg_6 = fuzz_tensor_simple((1,), (1,), torch.complex128, seed=10048)
arg_7 = fuzz_tensor_simple((1,), (1,), torch.complex128, seed=10049)
arg_8 = fuzz_tensor_simple((1,), (1,), torch.complex128, seed=10050)
arg_9 = fuzz_tensor_simple((1,), (1,), torch.complex128, seed=10051)
arg_10 = fuzz_tensor_simple((1,), (1,), torch.complex128, seed=10052)
arg_11 = fuzz_tensor_simple((1,), (1,), torch.complex128, seed=10053)
arg_12 = fuzz_tensor_simple((1,), (1,), torch.complex128, seed=10054)
arg_13 = fuzz_tensor_simple((1,), (1,), torch.complex128, seed=10055)
arg_14 = fuzz_tensor_simple((1,), (1,), torch.complex128, seed=10056)
arg_15 = fuzz_tensor_simple((1,), (1,), torch.complex128, seed=10057)
arg_16 = fuzz_tensor_simple((1,), (1,), torch.complex128, seed=10058)
arg_17 = fuzz_tensor_simple((1,), (1,), torch.complex128, seed=10059)
arg_18 = fuzz_tensor_simple((1,), (1,), torch.complex128, seed=10060)
arg_19 = fuzz_tensor_simple((1,), (1,), torch.complex128, seed=10061)
arg_20 = fuzz_tensor_simple((1,), (1,), torch.complex128, seed=10062)
arg_21 = fuzz_tensor_simple((1,), (1,), torch.complex128, seed=10063)
arg_22 = fuzz_tensor_simple((1,), (1,), torch.complex128, seed=10064)
arg_23 = fuzz_tensor_simple((1,), (1,), torch.complex128, seed=10065)
arg_24 = fuzz_tensor_simple((1,), (1,), torch.complex128, seed=10066)
arg_25 = fuzz_tensor_simple((1,), (1,), torch.complex128, seed=10067)
arg_26 = fuzz_tensor_simple((1,), (1,), torch.complex128, seed=10068)
import torch
import sys
torch._dynamo.config.capture_scalar_outputs = True

args = (arg_0, arg_1, arg_2, arg_3, arg_4, arg_5, arg_6, arg_7, arg_8, arg_9, arg_10, arg_11, arg_12, arg_13, arg_14, arg_15, arg_16, arg_17, arg_18, arg_19, arg_20, arg_21, arg_22, arg_23, arg_24, arg_25, arg_26)
result_original = fuzzed_program(*args)
print(' eager success')
sys.exit(1)
compiled_program = torch.compile(fuzzed_program, fullgraph=False, dynamic=True)
result_compiled = compiled_program(*args)
print(' compile success')

======================
Program exited with code: 1
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163743
Approved by: https://github.com/pianpwk
2025-09-25 21:42:22 +00:00
1cffa42d4d PyTorch histc fix for values with large magnitudes (#163506)
Summary:
The current implementation of the `histc` function on CPU doesn't take into account the nature of floating point precision representation when two numbers have very different magnitudes.

In the code of `histc` there is the following logic, which tries to fix the case where the automatically calculated `min` and `max` are identical:
```
if (leftmost_edge == rightmost_edge) {
    leftmost_edge -= 1;
    rightmost_edge += 1;
}

...

TORCH_CHECK(leftmost_edge < rightmost_edge, "torch.histc: max must be larger than min");
```

However, not for all floating point values does expanding the range by exactly 1 produce a representable result that differs from the original value.

The test code:

```
import torch as th

info = th.finfo(th.float32)
f_min = info.min

test_tensor = th.ones((224, 224), dtype=th.float64) * f_min
res = th.histc(test_tensor, bins=10)
```

Actual result:
```
RuntimeError: torch.histc: max must be larger than min
```

Expected result:
Everything should work fine.

NOTICE: If we set `f_min` to a small enough magnitude, the code works, which demonstrates the intended purpose of the range correction.

In short, `f_min + 1 == f_min` evaluates to true, since we hit the limit of floating point precision.
Please note, this is not a limitation of the float32 data type, since all computations happen in float64 (C++ data type `double`). The magnitudes are just different enough that the simple `+/-1` approach runs into the limits of the representation.

Interestingly, the `histogram` function doesn't throw an exception, because its edge range selection is implemented differently.

The fix we propose is to use `std::nextafter`, which returns the next representable floating point value starting from the current one in the direction of the lowest or highest representable number. In theory, the mathematically correct approach is to use this function without constraints, but to maintain backward compatibility, in case there is code that relies on the current `+/-1` offset logic, we call `std::min` and `std::max` to pick the right representable value (i.e. for small floating point values the next representable value is less than 1 away, while for large values it is more than 1 away).
We could have reused the `histogram` implementation, but again, to avoid possible backward compatibility breaks, we decided to use the fix presented in this change.
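
To illustrate the precision issue, a minimal Python sketch (independent of the C++ fix itself):

```python
import math
import torch

f_min = torch.finfo(torch.float32).min  # about -3.4e38
# At this magnitude the spacing between adjacent doubles is far larger than 1,
# so the +/-1 widening is absorbed by rounding and the range stays degenerate:
assert f_min + 1 == f_min
# nextafter always yields a distinct representable value:
assert math.nextafter(f_min, -math.inf) < f_min
```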

*The real use case scenario:*
In our project we use the well-known transformer from HuggingFace, which fills a buffer with the float32 minimum (please note this is not the smallest value close to 0, it is the most negative representable value, effectively `-max`).
The code where it sits is here:
https://github.com/huggingface/transformers/blob/v4.51.1/src/transformers/models/mimi/modeling_mimi.py#L1159

Switching to another version of the transformer would lead to other issues in our project, and the bug we fix here may appear in other projects and scenarios.

The real world problem appears when the CPU version of `histc` is called on such a tensor. In our use case, it happens because this tensor is an input to the softmax activation function, and as part of quantization the input parameter goes through the observer as well. In our case the default Histogram observer is selected, which calls `histc`.

Test Plan:
The following simple test code snippet no longer produces a failure:
```
import torch as th

f_min = th.finfo(th.float32).min
test_tensor = th.ones((224, 224), dtype=th.float32) * f_min
th.histc(test_tensor, bins=10)
```

**Testing update:**
The `test_histc` has been updated accordingly.
Now, when all values of the tensor are +INF, the previous representable floating point value is <max_float>, hence the assert message is changed from `[inf, inf]` to `[<max_float>|inf, inf]`.
The test is also extended to check the assert message when the tensor is filled with -INF and with a combination of (-INF, +INF).
The new regexp assert accepts either `inf` or any floating point number in scientific notation for one of the bin edges. We left `inf` as a possible value due to possible differences in implementation between CPU and CUDA.

Differential Revision: D82955597

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163506
Approved by: https://github.com/jermenkoo, https://github.com/malfet
2025-09-25 20:55:25 +00:00
ebfc87e303 Always produce kernel_info.json (#163715)
Summary: Always produce kernel_info.json so zoomer can use this json to populate GPU traces

Test Plan: CI

Differential Revision: D82762435

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163715
Approved by: https://github.com/angelayi
2025-09-25 19:38:49 +00:00
21a41edd4f Add fake_impl for _native_multi_head_attention (#163700)
Test Plan: See added test in test_export.py

Differential Revision: D83099187

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163700
Approved by: https://github.com/angelayi
2025-09-25 19:01:27 +00:00
7bad9c5a64 Revert "Update ruff to 0.13.1 (#163744)"
This reverts commit 3dd89a079f2b0c1d39351f98ff5d5ca882523152.

Reverted https://github.com/pytorch/pytorch/pull/163744 on behalf of https://github.com/malfet due to Broke lint, see https://github.com/pytorch/pytorch/actions/runs/18016220484/job/51261729375 looks like a landrace with PR that updated min-version to 3.10 ([comment](https://github.com/pytorch/pytorch/pull/163744#issuecomment-3335534084))
2025-09-25 18:54:03 +00:00
151e66e50d Update documentation for torch.index_select (#163616)
Description said "entries in index which is a LongTensor" but index_select can accept an IntTensor as the parameter
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163616
Approved by: https://github.com/jbschlosser

Co-authored-by: Joel Schlosser <75754324+jbschlosser@users.noreply.github.com>
2025-09-25 18:29:17 +00:00
b61bdc7cc4 Fix cpp build (#162774)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162774
Approved by: https://github.com/malfet, https://github.com/atalman
2025-09-25 18:21:45 +00:00
3dd89a079f Update ruff to 0.13.1 (#163744)
Update ruff to 0.13.1 so that we can remove `UP038` from `pyproject.toml` because it has been removed from supported rules of ruff.
There are some fixes, the most notable one is [(PYI059)](https://docs.astral.sh/ruff/rules/generic-not-last-base-class/#generic-not-last-base-class-pyi059)
```
Checks for classes inheriting from typing.Generic[] where Generic[] is not the last base class in the bases tuple.

```
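
For example, a minimal sketch of what the rule flags (illustrative class names):

```python
from typing import Generic, TypeVar

T = TypeVar("T")

class BadContainer(Generic[T], dict):   # PYI059: Generic[] should be the last base
    ...

class GoodContainer(dict, Generic[T]):  # compliant: Generic[] comes last
    ...
```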

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163744
Approved by: https://github.com/Skylion007, https://github.com/jingsh
2025-09-25 17:52:35 +00:00
6539537a59 [ROCm][CD] create ROCm 7.0 images for binary builds (#163860)
Adds gfx950.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163860
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-09-25 17:26:40 +00:00
3cbfbbd691 [ROCm] Transformer/SDPA unit test parity (#163745)
## Major Changes

* Efficient Attention on ROCM requires last dimensions of input tensors align with 16 bytes.
  - Unlike FA, ME does not pad input tensors in `scaled_dot_product_attention` and hence this is required.
* Fix `atomic_counter` handling in varlen FA API
* Unskips a few unit tests.

Fixes #157120
Fixes #157121
Fixes #157122
Fixes #157167
Fixes #155217
Fixes #157043
Fixes #157060

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163745
Approved by: https://github.com/jeffdaily
2025-09-25 17:14:19 +00:00
112e204797 Revert "[CUDA] Compare major version of the runtime device arch against the built version of the pytorch binary (#161299)"
This reverts commit 7163dce1e091cb5564c723110314bb372b5e81a8.

Reverted https://github.com/pytorch/pytorch/pull/161299 on behalf of https://github.com/nWEIdia due to Incorrectly suppressing useful warnings when running sm89 binary on sm86 ([comment](https://github.com/pytorch/pytorch/pull/161299#issuecomment-3335127621))
2025-09-25 17:13:32 +00:00
f9821b1be7 DebugMode supports_higher_order_operators=True (#163824)
Make DebugMode supports HOP

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163824
Approved by: https://github.com/ydwu4
2025-09-25 17:11:43 +00:00
c4312b443f [Tools] Adapting the Hypothesis library (version 5.x) for use with the PyTorch framework (#163748)
Starting from version 5.x, the Hypothesis library removed the timeout setting and only retained the deadline.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163748
Approved by: https://github.com/albanD, https://github.com/Skylion007
2025-09-25 16:41:50 +00:00
7194d77550 Revert "enable test_sampled_addmm_zero_sized_cuda for rocm (#121940)" (#163848)
This reverts commit 5494b2a8d38c3ddbeb2d96a5ac990e20ec4c48fd.

Need to skip `test_sparse_csr.py::TestSparseCSRCUDA::test_sampled_addmm_zero_sized_cuda_*` again. Tests are failing now with "core dumped" error
```
python test_sparse_csr.py -v -k test_sampled_addmm_zero_sized_cuda_float64

  test_sampled_addmm_zero_sized_cuda_float64 (__main__.TestSparseCSRCUDA) ... /tmp/pytorch/test/test_sparse_csr.py:2503:   c = torch.empty(m, n, dtype=dtype, device=device, layout=torch.sparse_csr)
GPU core dump created: gpucore.186789
:0:rocdevice.cpp            :2992: 4701819131755 us:  Callback: Queue 0x760cdcd00000 aborting with error : HSA_STATUS_ERROR_EXCEPTION: An HSAIL operation resulted in a hardware exception. code: 0x1016
Aborted (core dumped)
```
These failures are linked to `test_sparse_csr.py::TestSparseCSRCUDA::test_select_SparseBSC_int32_cuda_*` due to incorrect test log parsing. We will be able to close these issues also:

- Fixes https://github.com/pytorch/pytorch/issues/163663
- Fixes https://github.com/pytorch/pytorch/issues/160786
- Fixes https://github.com/pytorch/pytorch/issues/160785
- Fixes https://github.com/pytorch/pytorch/issues/160784

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163848
Approved by: https://github.com/jeffdaily
2025-09-25 16:38:00 +00:00
22d5f5ff94 [OpenReg][BE] Replacing explicit prefix/suffix with CMake variables (#163850)
As the title states, suffixes like `.dylib` and `lib` can be replaced by `CMAKE_SHARED_LIBRARY_SUFFIX`, and prefixes like `lib` can be replaced by `CMAKE_SHARED_LIBRARY_PREFIX` on Unix or `CMAKE_IMPORT_LIBRARY_PREFIX` on Windows.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163850
Approved by: https://github.com/albanD
2025-09-25 16:33:16 +00:00
c8e75c48b9 [fr] Skip the dtype check for some one to all or all to one collective (#163839)
As titled: in practice we found that sometimes the output dtype of gather does not match across all ranks, which is undefined behavior. The same holds for broadcast and scatter. Since these collectives all completed, we should not treat the mismatch as an error, and we can skip the check.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163839
Approved by: https://github.com/VieEeEw
2025-09-25 16:02:06 +00:00
e8f5f1b1a2 [NFC] fixed mistake in comment (#163697)
I used "floor" instead of "ceil", so fix it. Also fixed other typo.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163697
Approved by: https://github.com/jcaip
2025-09-25 15:53:51 +00:00
10e69a6e17 Preserve user annotation in graph (#163673)
```
import torch
import torch.fx.traceback as fx_traceback
import torch.export

class M(torch.nn.Module):
    def forward(self, x):
        with fx_traceback.annotate({"pp_stage": 0}):
            with fx_traceback.annotate({"fdsp_bucket": 0}):
                x = x + 1
            x = x - 2
            with fx_traceback.annotate({"cuda_stream": 2, "fsdp_bucket": 1}):
                x = x * 2
        x = x / 3
        return x

m = M()

with fx_traceback.preserve_node_meta():
    ep = torch.export.export(m, (torch.randn(10),))

for node in ep.graph.nodes:
    if node.op == "call_function":
        print(f"{node.target}, {node.meta.get("custom", {})}")

```

prints

```
aten.add.Tensor, {'pp_stage': 0, 'fdsp_bucket': 0}
aten.sub.Tensor, {'pp_stage': 0}
aten.mul.Tensor, {'pp_stage': 0, 'cuda_stream': 2, 'fsdp_bucket': 1}
aten.div.Tensor, {}
```

TODOs:
- run_decomposition is failing
- Need to test with the new full graph capture + aot_export_joint apis
- Need to make the annotation propagate through autograd engine to reach the bw nodes. Sample impl here: https://github.com/pytorch/pytorch/pull/83558
- Edward wants to restrict the keys in the custom field to be top-level singleton objects only
- Also need to take care of metadata merging when passes fuse nodes

Thanks @angelayi  for contributing the dynamo fixes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163673
Approved by: https://github.com/albanD, https://github.com/angelayi
2025-09-25 15:50:15 +00:00
5fcde74aed Fix pipeline parallelism not correctly initializing backwards stages when evaluating before training. (#162823)
Previously, an eval() call before a training step() would not correctly initialize the backward pass of the pipeline stages, leading to errors during the subsequent training step. This PR ensures that the backward stages can still be initialized after an eval() call.

Fixes #162822

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162823
Approved by: https://github.com/dcci, https://github.com/H-Huang
2025-09-25 15:13:19 +00:00
6fa3715c12 Expose Kineto event metadata in PyTorch Profiler events (#161624)
## Overview
This PR allows profiler users to access `Kineto` and `TorchOp` metadata in JSON string format through a new `metadata_json` attribute on `FunctionEvent` objects, which is enabled through a new `expose_kineto_event_metadata` flag in `ExperimentalConfig`.

## Testing
A unit test was added to validate functionality.

## Documentation
Added/updated function doc strings where appropriate.

## Example output
```python
import torch
from torch.profiler import profile

with profile(experimental_config=torch._C._profiler._ExperimentalConfig(expose_kineto_event_metadata=True)) as prof:
    res = torch.mm(torch.rand(1024, 1024), torch.rand(1024, 1024))

for event in prof.events():
    print(f'name: {event.key}, metadata: {event.metadata_json}')
```

```
name: aten::rand, metadata: "Ev Idx": 0
name: aten::empty, metadata: "Ev Idx": 1
name: aten::uniform_, metadata: "Ev Idx": 2
name: aten::rand, metadata: "Ev Idx": 3
name: aten::empty, metadata: "Ev Idx": 4
name: aten::uniform_, metadata: "Ev Idx": 5
name: aten::mm, metadata: "Ev Idx": 6
name: aten::resolve_conj, metadata: "Ev Idx": 7
name: aten::resolve_conj, metadata: "Ev Idx": 8
name: aten::resolve_conj, metadata: "Ev Idx": 9
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161624
Approved by: https://github.com/sraikund16
2025-09-25 14:58:30 +00:00
98c4e35f14 [CD] Add statically linked windows libraries to exclude list (#163768)
Fixes: https://github.com/pytorch/pytorch/issues/159514

We are seeing the following in the wheel build logs:
```
Linking CXX static library lib\kineto.lib
Linking CXX static library lib\dnnl.lib
....
```

These files are around 800MB uncompressed and 109MB compressed, hence excluding them provides a ~50% size reduction for Windows CPU builds.

Test Plan: Build Pytorch Windows binary. Build vision, audio and torchcodec with this binary. Smoke test.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163768
Approved by: https://github.com/albanD, https://github.com/malfet
2025-09-25 14:03:14 +00:00
00059db034 Revert "[RELAND] Always build USE_DISTRIBUTED (#160449) and Make distributed modules importable even when backend not built (#159889) (#162594)"
This reverts commit 09cb34c1dce8fe1b880bbf3115d8ddad3401d871.

Reverted https://github.com/pytorch/pytorch/pull/162594 on behalf of https://github.com/malfet due to reverted internally and now can be safely reverted in OSS ([comment](https://github.com/pytorch/pytorch/pull/162594#issuecomment-3334176367))
2025-09-25 13:47:46 +00:00
22fcc8b76b [async_tp] Support mm+rs with scatter_dim matmul K by sharding B (#162794)
Current state: shape mismatch failure when fusing mm+rs with the scatter dim on the last mm dim.

Adds a separate path to handle the last dim for aten.mm; scaled_mm should be handled similarly but needs an additional PR,
so the scaled_mm case is disabled via the matmul filter function.

Adds an inductor config for this change, True by default, for fast debuggability of the new path.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162794
Approved by: https://github.com/fegin
2025-09-25 12:18:39 +00:00
ab2ce3c50e [Code Clean] Replace std::runtime_error with TORCH_CHECK (#163264)
Related ISSUE: https://github.com/pytorch/pytorch/issues/148114
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163264
Approved by: https://github.com/albanD, https://github.com/cyyever
2025-09-25 11:28:51 +00:00
7d710403b0 Reapply "Make functionalization ViewMeta serializable with pickle. (#143712)" (#163769)
### Summary:
NOTE: This is a re-export of https://github.com/pytorch/pytorch/pull/161994 ; the changes between these two PRs is exclusively to the buck/build files

(Summary from #161994 )
Attempted rebase of https://github.com/pytorch/pytorch/pull/143712.

This reverts commit 6c713ccb5e0df227dd5b630057cbccd373cbe7d6.

cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx chenyang78 kadeng chauhang amjames Lucaskabela

imported-using-ghimport

Test Plan: Imported from OSS

Differential Revision: D81524507

Pulled By: Lucaskabela

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163769
Approved by: https://github.com/dolpm

Co-authored-by: Brian Hirsh <hirsheybar@fb.com>
2025-09-25 10:27:37 +00:00
29cbcbac42 [BE] Make PyObjectSlot use a global PyInterpreter (#162659)
This pr gets rid of the pyobj_interpreter_ variable from PyObjectSlot and saves a word in the process

Gonna ask for review from @huydhn as there are some changes to CI.

Testing: imported internally and the failed android build seems to work now!

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162659
Approved by: https://github.com/albanD, https://github.com/huydhn
2025-09-25 08:53:19 +00:00
5f90e8c7ae [PGO] ignore extra PGO key if warm/cold cache present (#163810)
Summary: avoids PGO profile merges

Test Plan: test_pgo

Differential Revision: D83200714

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163810
Approved by: https://github.com/bobrenjc93
2025-09-25 07:16:05 +00:00
eb7f4e0004 Add PEP 517 compliant Python source distribution to release process (#157815)
This adds the actual creation of a standards compliant sdist along with its upload to s3 to the create release workflow.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157815
Approved by: https://github.com/malfet, https://github.com/atalman
ghstack dependencies: #157814, #160315
2025-09-25 07:15:52 +00:00
42928876eb Add sdist handling to version finding (#160315)
The version finding logic triggered from `setup.py` generally tries to take the git information into account.
This is fine for most situations where we are building from a checkout, but it creates a problem in the case of sdists, as here the version is determined at the time of sdist creation, taking the git information into account, but then later recalculated when building wheels or installing from the sdist, now with the git information missing.

The solution is to take the version information directly from the sdist, which this PR adds by means of parsing the `PKG-INFO` which marks an unpacked sdist.
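
A minimal sketch of the idea (function name and structure are illustrative, not the exact code in this PR):

```python
from email.parser import Parser
from pathlib import Path

def sdist_version(root: str = ".") -> str | None:
    # PKG-INFO only exists in an unpacked sdist; its Version field records the
    # version computed at sdist-creation time, when the git info was available.
    pkg_info = Path(root) / "PKG-INFO"
    if not pkg_info.exists():
        return None
    return Parser().parsestr(pkg_info.read_text())["Version"]
```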

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160315
Approved by: https://github.com/atalman
ghstack dependencies: #157814
2025-09-25 07:15:51 +00:00
c44ec9f4c2 Improve MANIFEST.in for source distribution (#157814)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157814
Approved by: https://github.com/XuehaiPan, https://github.com/atalman
2025-09-25 07:15:42 +00:00
353991dd92 [PGO] distinguish sticky PGO put (#163799)
Summary: put_remote_code_state vs. put_extra_remote_code_state

Test Plan: test_pgo

Differential Revision: D83195687

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163799
Approved by: https://github.com/bobrenjc93
2025-09-25 06:59:25 +00:00
2b6a74abf1 [optim] prevent unintended aliasing in lr_scheduler; update type annotations/docs (#163120)
1. Prevents unintended aliasing of `self._last_lr`/`get_last_lr(...)` with `group["lr"]` when `group["lr"]` is a tensor.
2. Prevents unintended aliasing of `LRScheduler.base_lrs` with the `group["initial_lr"]`s.
3. Updates `test/optim/test_lrscheduler.py` to test tensor LRs.
4. Changes type annotations for `_last_lr`, `get_last_lr()`, `base_lrs`, `get_lr()`, and `_get_closed_form_lr()` from `list[float]` to `list[float | Tensor]`; adds documentation.

Fixes #163103

LR schedulers can behave in unexpected ways when using a tensor LR due to patterns like this:
```python
self._last_lr: list[float] = [group["lr"] for group in self.optimizer.param_groups]
```

This PR adds a helper to address this:
```python
def _param_groups_val_list(optimizer: Optimizer, key: str) -> list[Any]:
    """Create a list containing group[key] for each optimizer param_group.
    Prevents aliasing when group[key] could be a Tensor.
    Raises a KeyError when group[key] does not exist.
    """
    return [
        group[key].clone() if isinstance(group[key], Tensor) else group[key]
        for group in optimizer.param_groups
    ]
```
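As a usage-level illustration, a minimal sketch (assuming a tensor LR, which recent optimizers accept; not a test from this PR):

```python
import torch

p = torch.nn.Parameter(torch.zeros(1))
opt = torch.optim.SGD([p], lr=torch.tensor(0.1))
sched = torch.optim.lr_scheduler.StepLR(opt, step_size=1, gamma=0.5)

snapshot = sched.get_last_lr()  # with the clone, this snapshot stays independent
sched.step()                    # without it, snapshot[0] would track group["lr"] in place
```
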
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163120
Approved by: https://github.com/janeyx99
2025-09-25 06:58:58 +00:00
ad869c58f5 remove allow-untyped-defs from ./torch/utils/benchmark/op_fuzzers/sparse_unary.py (#163476)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163476
Approved by: https://github.com/ezyang, https://github.com/Skylion007
ghstack dependencies: #163478, #163475, #163471
2025-09-25 06:48:44 +00:00
d5afb9e31a remove allow-untyped-defs from ./torch/ao/quantization/quantizer/utils.py (#163471)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163471
Approved by: https://github.com/Skylion007
ghstack dependencies: #163478, #163475
2025-09-25 06:48:44 +00:00
e7d6ea65ca remove allow-untyped-defs from ./torch/nn/utils/_expanded_weights/embedding_expanded_weights.py (#163475)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163475
Approved by: https://github.com/ezyang, https://github.com/Skylion007
ghstack dependencies: #163478
2025-09-25 06:48:44 +00:00
a6974195da remove allow-untyped-defs from ./torch/fx/experimental/unification/multipledispatch/core.py (#163478)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163478
Approved by: https://github.com/ezyang
2025-09-25 06:48:44 +00:00
a213848703 [Code Clean] Remove deadcodes about Python3.9 [8/N] (#163728)
As the title stated.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163728
Approved by: https://github.com/albanD, https://github.com/cyyever
ghstack dependencies: #163626, #163627, #163629, #163643, #163644, #163645, #163646
2025-09-25 05:12:46 +00:00
cde5c9aebd fix pickling for BitwiseFn (#163571)
Summary:
ran into AttributeError: Can't get local object 'make_opaque_bitwise_fn.<locals>.BitwiseFn'

looks like it was fixed for UnaryFn but not BitwiseFn in https://github.com/pytorch/pytorch/pull/138395
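
For context, a minimal sketch of why locally defined classes break pickling (names are illustrative, not the actual fuzzer code):

```python
import pickle

def make_opaque_fn():
    class LocalFn:  # defined inside a function, so its qualname contains '<locals>'
        pass
    return LocalFn()

try:
    pickle.dumps(make_opaque_fn())
except Exception as e:
    print(type(e).__name__, e)  # pickle cannot resolve the '<locals>' qualname
```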

Fixes #147841

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163571
Approved by: https://github.com/jamesjwu
2025-09-25 04:52:11 +00:00
783a9dcb6d [6/n] Quantization with min & max bounds support - using fbgemm changes in ATen (#162924)
Summary:
This diff uses the FBGEMM changes made in D78181177 & D81858256 to support using the provided per-row min/max values while quantizing float/half to 8-bit, 4-bit & 2-bit in the ATen library.

Please find more context on this here: https://fburl.com/gdoc/yutf32a0

Test Plan:
```
buck test mode/opt caffe2/torch/fb/model_transform/splitting/tests:split_dispatcher_test
```
https://www.internalfb.com/intern/testinfra/testrun/7881299640979446

Please refer to D80905814's test plan for integration testing.

Rollback Plan:

Differential Revision: D81327342

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162924
Approved by: https://github.com/jerryzh168
2025-09-25 02:52:04 +00:00
ad2f7315ca [torchfuzz] print out tensor descriptor as comments in codegen (#163739)
eg.

```
    # Node node_12: tensor_pointwise (depth 6)
    var_node_12 = torch.ops.aten.mul(var_node_13, var_node_34) # size=(1,), stride=(1,), dtype=complex128, device=cuda

    # Node node_10: tensor_pointwise (depth 7)
    var_node_10 = torch.ops.aten.div(var_node_11, var_node_12) # size=(1,), stride=(1,), dtype=complex128, device=cuda

    # Node node_2: tensor_pointwise (depth 8)
    var_node_2 = torch.ops.aten.div(var_node_3, var_node_10) # size=(1,), stride=(1,), dtype=complex128, device=cuda
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163739
Approved by: https://github.com/pianpwk
ghstack dependencies: #163547, #163553, #163554, #163555, #163556, #163557, #163558, #163560, #163698
2025-09-25 01:29:29 +00:00
cc660d38ac [CI] Install libuv for Win testing (#163797)
The current working theory for why f0078941cf caused a regression is that Windows CI could no longer be built with distributed enabled, as it could not find libuv
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163797
Approved by: https://github.com/wdvr
2025-09-25 01:10:14 +00:00
00f96dd84d [CI] Run CUDA-13 binary builds on trunk (#163787)
There are numerous other workflows that could be used to catch CUDA-12
build regressions (our CI builds are almost identical to CD ones), but there are not many CUDA-13 builds around, so issues like https://github.com/pytorch/pytorch/issues/163342 are really hard to detect in CI
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163787
Approved by: https://github.com/atalman, https://github.com/huydhn
2025-09-25 00:58:17 +00:00
77b9aac6c2 Add rule for typechecking maintainers (#161307)
Allow the following people merge rights on type checking configs:
  - @lolpack
  - @maggiemoss
  - @ndmitchell
  - @kinto0

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161307
Approved by: https://github.com/albanD, https://github.com/ezyang
2025-09-25 00:14:31 +00:00
7163dce1e0 [CUDA] Compare major version of the runtime device arch against the built version of the pytorch binary (#161299)
Fixes misleading warning messages when running on sm12x devices using binaries built with sm120.
PyTorch binary built with sm120 is compatible with e.g. sm121, so no need for the warning of incompatibility.

Also allow the 'matched_cuda_warn' message to show when e.g. the user is running a binary built with only sm90 on sm12x, so that the user would be prompted to get a build which supports e.g. sm120.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161299
Approved by: https://github.com/eqy, https://github.com/atalman
2025-09-24 23:59:19 +00:00
4ac4a7351e Shortcut redistribution when num_shards == 1 (#163742)
Redistribution doesn't need collectives when num_shards == 1 on a mesh dimension.
Only a placement update is needed; the local_tensor remains unchanged.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163742
Approved by: https://github.com/tianyu-l

Co-authored-by: tianyu-l <150487191+tianyu-l@users.noreply.github.com>
2025-09-24 23:49:08 +00:00
65ddd91421 Fix redundant H2D/D2H memcpy in cpp_wrapper by creating scalar tensors on CPU (#160584)
Fixes #160520

Summary:
When running Inductor with cpp_wrapper under a DeviceContext, non-tensor arguments were being wrapped with torch.tensor(arg) without specifying the device.

This created the tensor on the currently active device (e.g. CUDA), and it was later fetched back to the CPU via .item(), causing unnecessary host-device-host memory transfers.

This PR fixes the issue by explicitly creating scalar tensors on the CPU:

```
input_tensors = [
    arg if isinstance(arg, torch.Tensor) else torch.tensor(arg, device='cpu')
    for arg in args
]
```

impact: inductor, codegen

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160584
Approved by: https://github.com/benjaminglass1, https://github.com/desertfire, https://github.com/mlazos, https://github.com/jeffdaily
2025-09-24 23:40:37 +00:00
8c98aee436 [Inductor] Update DeviceAssert op to behave like store (#163696)
Updated the DeviceAssert operation to match the behavior of Store; this fixes the issue mentioned in [this PR](https://github.com/pytorch/pytorch/pull/163023) and updates the test cases as Elias [suggested](https://github.com/pytorch/pytorch/pull/160677#discussion_r2353834646).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163696
Approved by: https://github.com/mlazos
2025-09-24 23:35:56 +00:00
d927e55498 [torchfuzz] refactor multi_process_fuzzer to be more readable (#163698)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163698
Approved by: https://github.com/pianpwk
ghstack dependencies: #163547, #163553, #163554, #163555, #163556, #163557, #163558, #163560
2025-09-24 23:32:34 +00:00
754c7e2e88 Update pyrefly configuration file (#163775)
Related to: https://github.com/pytorch/pytorch/issues/163283

This simply updates the existing pyrefly configuration and opts out additional directories. Running `pyrefly check` with this setup will result in ~100 errors reported.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163775
Approved by: https://github.com/ezyang, https://github.com/Skylion007
2025-09-24 23:14:39 +00:00
0ec946a052 [ROCm] Increase binary build timeout to 5 hours (300 minutes) (#163776)
Despite narrowing down the [FBGEMM_GENAI build to gfx942](https://github.com/pytorch/pytorch/pull/162648), the nightly builds still timed out because they [didn't get enough time to finish the post-PyTorch-build steps](https://github.com/pytorch/pytorch/actions/runs/17969771026/job/51109432897).

This PR increases timeout for ROCm builds for both [libtorch ](https://github.com/pytorch/pytorch/actions/runs/17969771026)and [manywheel](https://github.com/pytorch/pytorch/actions/runs/17969771041), because both of those are close to the 4hr mark currently.

This PR is a more ROCm-targeted version of https://github.com/pytorch/pytorch/pull/162880 (which is for release/2.9 branch).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163776
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-09-24 23:02:08 +00:00
2b1236de61 [dynamo] Fix handling of kwargs in exception constructor (#163390)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163390
Approved by: https://github.com/guilhermeleobas
2025-09-24 22:44:14 +00:00
bc8680c298 Avoid at::alias in the repeat op implementation (#163455)
Avoid `at::alias` in the `repeat` op implementation

## Summary

This PR removes the usage of `at::alias` in the implementation and instead `permute`s + `reshape`s the tensor to fit the specs of the result.
This is a less hacky and more readable way of implementing the op.
All the new ops we are using are view-only ops, which do not introduce the overhead of changing the storage.
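
A rough sketch of the general idea, using only view ops (`reshape`/`expand`) before a final copying `reshape`; the exact decomposition in this PR may differ:

```python
import torch

def repeat_via_views(x: torch.Tensor, repeats: list[int]) -> torch.Tensor:
    pad = len(repeats) - x.dim()
    sizes = [1] * pad + list(x.shape)
    x = x.reshape(sizes)
    # Interleave a length-1 dim before every original dim, expand it to the
    # repeat count (a view), then collapse each (repeat, size) pair.
    interleaved, expanded = [], []
    for rep, size in zip(repeats, sizes):
        interleaved += [1, size]
        expanded += [rep, size]
    out = x.reshape(interleaved).expand(expanded)
    return out.reshape([rep * size for rep, size in zip(repeats, sizes)])

assert torch.equal(repeat_via_views(torch.arange(6).reshape(2, 3), [2, 2]),
                   torch.arange(6).reshape(2, 3).repeat(2, 2))
```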

## Who wants this

We are using `PrivateUse1` and accelerator, but this request to avoid `at::alias` in any op should be general enough for any backend who is using XLA, or who do not have explicit control over the memory allocation on the devices.

## Why we/they need this

As we support TPU, we are overriding some ATen ops by binding them to PrivateUse1.
However, it is not recommended to override the `repeat` op directly as we saw the following in `RegistrationDeclaration.h`.

```
at::Tensor repeat(const at::Tensor & self, c10::SymIntArrayRef repeats); // {"schema": "aten::repeat(Tensor self, SymInt[] repeats) -> Tensor", "dispatch": "True", "default": "True"}
```

We had to reuse the existing implementation of `repeat` to decompose it into other ops.
However, we are unable to support the current implementation, which uses `at::alias`.
It has two tensors share the same storage, modifies one of them, and returns the other assuming it has changed too.

This does not work for us, as we do not have explicit control over the memory allocation of the tensors when using XLA/PJRT.

## Alternatives

We are open to alternative solutions that work for us if this PR is not favored by the PyTorch community.
For example, we may just bind our version of the `repeat` op implementation to both `PrivateUse1` and `AutogradPrivateUse1`.
However, to my understanding, this would not work well with torch dynamo and `torch.compile`.

Would you mind guiding us on how to solve this?

Thanks!

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163455
Approved by: https://github.com/Skylion007
2025-09-24 22:28:24 +00:00
1495b35d29 Remove Python 3.9 for Triton builds (#163778)
Related to https://github.com/pytorch/pytorch/issues/161167

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163778
Approved by: https://github.com/malfet
2025-09-24 20:19:43 +00:00
90a282504e Add inference_mode hint message to use eval with inference. (#163619)
Fixes #162923

## Test Result

### Before

<img width="985" height="889" alt="image" src="https://github.com/user-attachments/assets/41de5cfa-7b25-4ba4-ade8-a6df745dcb30" />

### After

<img width="913" height="977" alt="image" src="https://github.com/user-attachments/assets/b6c06860-8db3-4b5d-9d46-31ece01fb04d" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163619
Approved by: https://github.com/jbschlosser
2025-09-24 20:07:14 +00:00
0dce2afd44 [ROCm][CI] adjust tf32 tolerance for test_compile_kernel_advanced (#163783)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163783
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-09-24 19:39:15 +00:00
71eec6a0bf [dist] handle discontiguous allgather/reducescatter inputs (#163712)
Fixes #163483

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163712
Approved by: https://github.com/ezyang, https://github.com/kwen2501
2025-09-24 19:38:44 +00:00
0456b23b77 [AOTI] Add verbose error information for extract file (#163718)
This PR optimizes the `extract_file` functions:
1. `normalize_path_separator` the dest path for Windows.
2. Add verbose error messages:
a. On Linux, add the mz_zip error string.
b. On Windows, add the mz_zip error string and the Windows error code.

For the UT `test_package_user_managed_weight`:
<img width="1910" height="442" alt="image" src="https://github.com/user-attachments/assets/6a63eda1-70ce-40fb-9681-adc955463884" />

It still has an issue with error code `32`; checking https://learn.microsoft.com/en-us/windows/win32/debug/system-error-codes--0-499- shows the error is `ERROR_SHARING_VIOLATION`.

It is a little complex to debug; I will continue working on it in a further PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163718
Approved by: https://github.com/desertfire
2025-09-24 19:27:30 +00:00
c414f75c8b [WOQ][Inductor] Enable CUDA coverage for _weight_int8pack_mm (#163461)
Summary:
What: Unskip the CUDA path for test_int8_weight_only_quant in test_torchinductor.py as the kernel was added by #159325.

Why: Confirm CUDA backend for _weight_int8pack_mm is registered.

Test Plan:
```
buck2 test 'fbcode//mode/opt' fbcode//caffe2/test/inductor:test_inductor_cuda
```
https://www.internalfb.com/intern/testinfra/testrun/2533275104869494

Differential Revision: D82926440

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163461
Approved by: https://github.com/jerryzh168
2025-09-24 19:20:38 +00:00
768361e67f Add less warps config to inner reductions (#162447)
Add configs with fewer warps to ensure proper vectorization + memory coalescing for inner reductions, preferring more work per thread

<img width="1717" height="731" alt="Screenshot 2025-09-17 at 10 03 25 AM" src="https://github.com/user-attachments/assets/7b1f4a30-62f2-4bee-bb9c-122501bde63e" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162447
Approved by: https://github.com/v0i0, https://github.com/eellison, https://github.com/shunting314
2025-09-24 19:09:02 +00:00
9341ede617 Revert to old behaviour of not padding strides if shape or stride is dynamic (#163639)
Differential Revision: D83053287

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163639
Approved by: https://github.com/blaine-rister
2025-09-24 18:31:01 +00:00
4c2c401ccf Record redistribute_local_tensor in DebugMode (#163704)
An explicit redistribute_local_tensor API call can also result in communication, so record it!

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163704
Approved by: https://github.com/ezyang
2025-09-24 16:11:26 +00:00
5d0f639234 Make Tensor.__dlpack__(stream=None) capture-safe during CUDA Graph capture (#163242)
Many extensions (including pybind helpers) call `Tensor.__dlpack__()` without a stream argument. Before #150217, `stream=None` behaved like “no cross-stream sync” and was safe inside CUDA Graph capture. After #150217, `stream=None` maps to the legacy default stream, adding a cross-stream wait that invalidates capture when running on a non-default stream.

See this example

```
import torch
s = torch.cuda.Stream()
x = torch.randn(8, device="cuda")
g = torch.cuda.CUDAGraph()

with torch.cuda.stream(s):
    with torch.cuda.graph(g):
        _ = x + 1
        cap = x.__dlpack__()
        _ = torch.utils.dlpack.from_dlpack(cap)
```

This PR partially reverts #150217 so that stream=None defaults to no sync.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163242
Approved by: https://github.com/ngimel
2025-09-24 16:04:19 +00:00
9d0d98acfe Use cuda nvrtc so file based on cuda version used by torch (#163642)
Fixes https://github.com/pytorch/pytorch/issues/162367

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163642
Approved by: https://github.com/msaroufim
2025-09-24 14:23:39 +00:00
3b73841f43 update test_quantization tests to run weekly (#163077)
Fixes #162854

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163077
Approved by: https://github.com/huydhn
2025-09-24 11:31:11 +00:00
141fc7276e [CD] CUDA 13.0 fix preload logic to include nvidia/cu13/lib/ (#163661)
Preload logic no longer works with CUDA 13.0
See the installation path:
```
ls /home/ubuntu/.venv/lib/python3.10/site-packages/nvidia/cu13/lib/
libcheckpoint.so   libcudadevrt.a      libcufft.so.12   libcufile_rdma.so.1  libcusolver.so.12    libnvJitLink.so.13  libnvperf_target.so            libnvrtc.alt.so.13    libpcsamplingutil.so
libcublas.so.13    libcudart.so.13     libcufftw.so.12  libcupti.so.13       libcusolverMg.so.12  libnvblas.so.13     libnvrtc-builtins.alt.so.13.0  libnvrtc.so.13
libcublasLt.so.13  libcudart_static.a  libcufile.so.0   libcurand.so.10      libcusparse.so.12    libnvperf_host.so   libnvrtc-builtins.so.13.0      libnvtx3interop.so.1

ls /home/ubuntu/.venv/lib/python3.10/site-packages/nvidia/
cu13  cudnn  cusparselt  nccl  nvshmem
```

Test using script from : https://github.com/pytorch/pytorch/issues/162367
```
Kernel test passed!
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163661
Approved by: https://github.com/nWEIdia, https://github.com/tinglvv, https://github.com/Camyll
2025-09-24 11:27:05 +00:00
b66aa1ade1 [ARM] Add test_memory_profiler to aarch64 tests (#145260)
TestMemoryProfilerE2E.test_memory_timeline was failing on AArch64; this fixes it and enables it in the opt-in list of tests for AArch64.

Fixes #142371

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145260
Approved by: https://github.com/fadara01, https://github.com/sraikund16
2025-09-24 09:29:13 +00:00
207f104594 [Triton] [Inductor] Set default configs for Blackwell Matmul Template (#163740)
Summary: Sets the default configs for the Blackwell Matmul Templates.

Test Plan: NFC

Differential Revision: D83116342

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163740
Approved by: https://github.com/jananisriram
2025-09-24 08:17:35 +00:00
3e1b1a30f2 Revert "[inductor] Fix issue with scalar arg handling" (#163737)
This reverts commit a8cd437183142e17ba6fc8d7b5e9dcee462d7904.

See https://github.com/pytorch/pytorch/pull/163481#issuecomment-3326310774

This PR might also cause issues with cudagraphs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163737
Approved by: https://github.com/ezyang
ghstack dependencies: #163386, #163398, #163387, #163414, #163415, #163419, #163434, #163393, #163412, #163422, #163481, #163520, #163482
2025-09-24 07:33:12 +00:00
2390d34c9b [Code Clean] Remove deadcodes about Python3.9 [7/N] (#163646)
As the title stated.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163646
Approved by: https://github.com/jansel
ghstack dependencies: #163626, #163627, #163629, #163643, #163644, #163645
2025-09-24 07:30:50 +00:00
a635505a99 [Code Clean] Remove deadcodes about Python3.9 [6/N] (#163645)
As the title stated.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163645
Approved by: https://github.com/albanD
ghstack dependencies: #163626, #163627, #163629, #163643, #163644
2025-09-24 07:30:50 +00:00
6f34cc040f [Code Clean] Remove deadcodes about Python3.9 [5/N] (#163644)
As the title stated.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163644
Approved by: https://github.com/jansel
ghstack dependencies: #163626, #163627, #163629, #163643
2025-09-24 07:30:50 +00:00
ec0cd81c38 [Code Clean] Remove deadcodes about Python3.9 [4/N] (#163643)
As the title stated.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163643
Approved by: https://github.com/albanD
ghstack dependencies: #163626, #163627, #163629
2025-09-24 07:30:50 +00:00
33aabdd8ac [Code Clean] Remove deadcodes about Python3.9 [3/N] (#163629)
As the title stated.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163629
Approved by: https://github.com/albanD
ghstack dependencies: #163626, #163627
2025-09-24 07:30:50 +00:00
0bca77951d [Code Clean] Remove deadcodes about Python3.9 [2/N] (#163627)
As the title stated.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163627
Approved by: https://github.com/jansel
ghstack dependencies: #163626
2025-09-24 07:30:50 +00:00
bf0747c6c6 [Code Clean] Remove deadcodes about Python3.9 [1/N] (#163626)
As the title stated.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163626
Approved by: https://github.com/Skylion007, https://github.com/albanD
2025-09-24 07:30:50 +00:00
11a231ef52 [c10d] P2P tensors must be dense (#163719)
Fixes #161324
by adding an `is_non_overlapping_and_dense` check.
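
For illustration, a minimal sketch (not from the PR) of a tensor that is non-overlapping but not dense:

```python
import torch

base = torch.empty(4, 4)
view = base[:, ::2]          # strides (4, 2): non-overlapping, but leaves gaps
print(view.is_contiguous())  # False; such strided views fail the density check
```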

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163719
Approved by: https://github.com/ngimel
2025-09-24 06:58:03 +00:00
dad54ca7c0 Add mistral/gpt-oss to benchmarks (#163565)
Potential issues
* gpt-oss-20b is probably too big (I can't run on my devserver)
* Mistral requires HF authentication
* Mistral also takes a while to run the performance checks (need to wait for CI)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163565
Approved by: https://github.com/huydhn
2025-09-24 06:12:36 +00:00
2c5a3d7e60 Delete functorch C extension entirely. (#163340)
Signed-off-by: Edward Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163340
Approved by: https://github.com/aorenste, https://github.com/wdvr, https://github.com/albanD, https://github.com/malfet
2025-09-24 06:08:58 +00:00
f68de58c9d [Inductor-FX] Support symbol and dynamic scalar graph inputs and outputs (#163596)
# Problems
This PR fixes a few edge cases that the FX converter missed related to dynamic shapes.

1. Inductor graphs can sometimes take `sympy.Symbol` inputs. We have logic to convert these to FX placeholder nodes. However, this logic did not update the `self.expr_to_proxy` table mapping symbols to proxy nodes. (There was existing logic to do this for `ir.TensorBox` inputs, but not `sympy.Symbol`.) This caused sympy tracing to fail when these symbol inputs were used in other expressions.

2. We lacked codegen for `ShapeAsConstantBuffer`. This IR node is seen when the graph input or output is a scalar computed from dynamic shapes.

# Fixes
a. Update `self.expr_to_proxy` when generating placeholders for `sympy.Symbol` inputs. Change `SymbolBuffer.get_example` to convert the symbol to a `torch.SymInt`, so we can populate `meta["val"]` correctly and use the value in other computations.
b. Support `ShapeAsConstantBuffer` by tracing the sympy expression.
c. Move output generation inside the metadata hook, allowing us to populate `meta["val"]` for the nodes computing `ShapeAsConstantBuffer`.

# Test plan
Added several new CI tests:
 1. `torch.cond` with dynamic shapes (a rough sketch follows below). This exposes both issues, as the predicate is a `ShapeAsConstantBuffer` and one of the subgraphs uses a symbol input, due to the closure. Also tests when the parent and subgraphs have different input shapes.
 2. Output dynamic shape scalar. This tests `ShapeAsConstantBuffer` as an output.
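
A rough sketch of the shape of the first test case (simplified and illustrative, not the exact test added in this PR):

```python
import torch

class CondModel(torch.nn.Module):
    def forward(self, x):
        n = x.shape[0]                      # symbolic under dynamic shapes
        return torch.cond(n > 4,            # predicate derived from a dynamic shape
                          lambda t: t + n,  # closure over the symbolic size
                          lambda t: t - 1,
                          (x,))

# e.g. torch.compile(CondModel(), dynamic=True)(torch.randn(8, 3))
```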

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163596
Approved by: https://github.com/angelayi, https://github.com/jansel
2025-09-24 06:08:14 +00:00
a8e9ed2407 [inductor] turn on loaf (for oss) by default (#162030)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162030
Approved by: https://github.com/eellison, https://github.com/jansel
2025-09-24 06:02:02 +00:00
0390798dad [Triton] [Inductor] Enable Epilogue Subtiling in the blackwell ws template (#163145)
Summary: Enables support for epilogue subtiling in the blackwell ws template. This requires the ability to call `store_output` twice in the same kernel and reuse the same tensor descriptor across allocations.

Test Plan:
Tested with test_max_autotune.py on a Blackwell server.

Rollback Plan:

Differential Revision: D82610077

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163145
Approved by: https://github.com/eellison
2025-09-24 05:38:02 +00:00
124dd364e9 [hop] support local_map + SAC (#163322)
Some ops, like the local_map HOP's deferred mode, are not desugared by make_fx. This means that when we apply SAC tags, we need to define dispatch rules for the SAC torch dispatch modes, as pointed out here: https://github.com/pytorch/pytorch/issues/162246#issuecomment-3259176721. This PR adds those rules.

Additionally, it fixes a pre-existing issue where we weren't coercing tangent layout (as AOTAutograd typically does) when partitioning the HOP joint.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163322
Approved by: https://github.com/ezyang
2025-09-24 04:57:40 +00:00
20eeb54814 Add api info for torch._C._nn.pyi (#162936)
Fix part of #148404

APIs involved are as follows:

- silu
- silu_
- smooth_l1_loss
- soft_margin_loss
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162936
Approved by: https://github.com/FFFrog, https://github.com/ezyang
2025-09-24 04:55:57 +00:00
6f1d962d5b [vllm hash update] update the pinned vllm hash (#163711)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned vllm hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163711
Approved by: https://github.com/pytorchbot
2025-09-24 04:31:37 +00:00
42e9902a0f cd: Move arm64 to linux.arm64.r7g.12xlarge.memory (#163681)
This should reduce the amount of build time we have by a lot by just
throwing more hardware at the problem.

Signed-off-by: Eli Uriegas <eliuriegas@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163681
Approved by: https://github.com/huydhn, https://github.com/atalman, https://github.com/malfet
2025-09-24 04:06:09 +00:00
d746b987d8 [inductor] Fix divmod error in decomp (#163482)
Fixes #163457

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163482
Approved by: https://github.com/eellison
ghstack dependencies: #163386, #163398, #163387, #163414, #163415, #163419, #163434, #163393, #163412, #163422, #163481, #163520
2025-09-24 02:52:36 +00:00
6fa972796e [inductor] Fix bugs in emulate_precision_casts (#163520)
Fixes #163449
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163520
Approved by: https://github.com/eellison
ghstack dependencies: #163386, #163398, #163387, #163414, #163415, #163419, #163434, #163393, #163412, #163422, #163481
2025-09-24 02:52:36 +00:00
ca512af3e7 [inductor] Fix issue with scalar arg handling (#163481)
Fixes #163420

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163481
Approved by: https://github.com/eellison
ghstack dependencies: #163386, #163398, #163387, #163414, #163415, #163419, #163434, #163393, #163412, #163422
2025-09-24 02:52:36 +00:00
c261c71f3e Simplify _compute_local_shape_and_global_offset and make it SPMD. (#163344)
There is only one substantive change: the branch on
`global_offset[shard_dim] <= local_offset[shard_dim]`
is removed because it is unnecessary: you can always treat the
first shard uniformly with the rest of the shards, because your
global offset is guaranteed to be zero in this case anyway.

I also switch the shard_size case to sym_ite, to make it possible
for LocalTensor to deal with the MPMD-ness here, but it's equivalent
to the old if-then-else.

I tried to rewrite the comments to be more clear what is going on
algorithmically here.
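
For intuition, a small sketch of the uniform per-rank arithmetic (assuming the even chunking used by Shard placements; not the exact DTensor code):

```python
import math

def local_shape_and_offset(dim_size: int, num_shards: int, rank: int) -> tuple[int, int]:
    chunk = math.ceil(dim_size / num_shards)   # even chunking; last shards may be short or empty
    offset = min(rank * chunk, dim_size)
    size = max(min(chunk, dim_size - offset), 0)
    # rank 0 always gets offset 0, so the first shard needs no special case.
    return size, offset
```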

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163344
Approved by: https://github.com/albanD, https://github.com/zpcore, https://github.com/tianyu-l
2025-09-24 02:24:09 +00:00
e2ce79e4cc [Flex] Fix silent correctness w/ backpropping grads (#163677)
Fixes #https://github.com/pytorch/pytorch/issues/162228

# Summary

The majority of our tests only compile flex-attention in isolation. This means that for fake tensor propagation, the input primals and all captured buffers don't do any intermediate computation below autograd. As a result they, by happenstance, match the `requires_grad`-ness of the eager implementation, and this check will pass. However, if score_mod is the result of some other intermediate fake tensor prop, then it is not guaranteed to have accurate `requires_grad`-ness, which was happening here.

TL;DR: this was a belt-and-suspenders check that was actually harmful, and we should just let the joint graph machinery handle creating the correct joint graph.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163677
Approved by: https://github.com/ydwu4
2025-09-24 02:12:19 +00:00
be6c127927 [AOTI] Pass comments from metadata to the autotune block (#163600)
Summary: When generating Triton kernels in compile-time autotune blocks, it is useful to emit source information as code comments. Previously we ignored these comments for autotune code blocks because the generated main output code contains the same information, but that does not help if the generated autotune code crashes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163600
Approved by: https://github.com/yushangdi
2025-09-24 02:01:59 +00:00
1e754d5a80 docs and optional kwargs for full graph capture (#163550)
Test Plan: existing tests

Differential Revision: D82995546

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163550
Approved by: https://github.com/tugsbayasgalan
2025-09-24 01:20:27 +00:00
dc9352938b [Triton] [Inductor] Restrict subprocess autotuning to just Triton (#162688)
Summary: Restricts subprocess benchmarking to only `TritonTemplateCaller`, which is what the underlying `target` method expects. This previously triggered a bug with large K shapes because the decompose-k choice is a `SubgraphChoiceCaller`.

Test Plan:
mm autotuning with a large k and `TORCHINDUCTOR_AUTOTUNE_IN_SUBPROC=1`

Rollback Plan:

Differential Revision: D82181924

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162688
Approved by: https://github.com/PaulZhang12, https://github.com/eellison, https://github.com/mlazos
2025-09-24 01:03:40 +00:00
4535254c28 [3/N] Use std::filesystem in inductor (#163632)
Continued work to use std::fs in inductor.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163632
Approved by: https://github.com/Skylion007
2025-09-24 00:23:34 +00:00
eb3fbf5b08 [inductor] in emulate_precision_casts, disable fma fusion in triton (#163073)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163073
Approved by: https://github.com/eellison, https://github.com/jansel
2025-09-23 23:59:17 +00:00
ee75c3d91f Support for amin, amax, and aminmax (#163669)
Support for amin, amax, and aminmax

Test Plan: E2E tests in the stack with benchmark suite passes.

Differential Revision: D83016894

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163669
Approved by: https://github.com/albanD, https://github.com/malfet
2025-09-23 23:45:43 +00:00
f9fa138a39 [BE] Delete all pre py-3.10 checks (#163653)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163653
Approved by: https://github.com/jansel
ghstack dependencies: #163648, #163649
2025-09-23 23:22:53 +00:00
f3f67ff43a Fix warn message (#163578)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163578
Approved by: https://github.com/albanD, https://github.com/Skylion007, https://github.com/atalman, https://github.com/v0i0
2025-09-23 22:46:51 +00:00
6b5ad5f211 [Kineto] Add list of string parsing for profiler (#163593)
Summary:
We add parsing for lists of strings. This is needed by AOTInductor
profiling for the input information of Triton kernels.

Test Plan:
Included in commit.
test_profiler_op_event_kwargs_list_of_strings

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163593
Approved by: https://github.com/sraikund16
2025-09-23 22:45:49 +00:00
20149080f2 [MPS] Compute offset2bag/bag_size/max_indices in _embedding_bag (#163281)
Part of #162270

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163281
Approved by: https://github.com/malfet
2025-09-23 22:30:48 +00:00
b879ef7c0d [ROCm][CI] skip TestCudaPrimaryCtx.test_set_device_0 (#163693)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163693
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-09-23 22:15:10 +00:00
c63e417c79 use reduction hint for aggressive rblock (#163371)
I had been using tiling scores to essentially check whether this is an inner reduction. Since that is not fully rolled out for dynamic shapes, use the reduction hint when tiling scores are not available.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163371
Approved by: https://github.com/PaulZhang12
2025-09-23 22:04:22 +00:00
c3d9f089d9 [torchfuzz] introduce multi process fuzzer (#163560)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163560
Approved by: https://github.com/laithsakka
ghstack dependencies: #163547, #163553, #163554, #163555, #163556, #163557, #163558
2025-09-23 22:00:51 +00:00
29af25844b Less aggressive persistent reduction when it could induce large masking with dynamic shapes (#163365)
As per comment in source code:
```
            # If we are are coalescing on xblock (not ReductionHint.INNER) and this is not a tiny kernel
            # (not ReductionHint.OUTER_TINY), do not use persistent reduction if it induces tile
            # quantization. Peristent reduction forces rblock == rnumel, if the bounds between lower
            # and upper are large, for the lower values we will be masking off large % of read/writes,
            # when we could expand the coalescing xblock instead.
```

For the test case in question, this PR improves perf from 0.8573521325143717 to 0.043151492193814305, because we were egregiously masking out rblock values (58/64 values).
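
A rough sketch of the decision the comment above describes (names and threshold are made up; this is not Inductor's actual code):

```python
# Hypothetical illustration: skip persistent reduction when the dynamic rnumel
# range would force masking off a large fraction of the fixed rblock.
def use_persistent_reduction(hint: str, rnumel_lo: int, rnumel_hi: int,
                             max_masked_frac: float = 0.5) -> bool:
    if hint in ("INNER", "OUTER_TINY"):
        return True  # coalescing on the reduction dim / tiny kernel: keep persistent
    # Persistent reduction forces rblock == rnumel (sized for the upper bound),
    # so at the lower bound we would mask off (hi - lo) / hi of the reads/writes.
    masked_frac = 1.0 - rnumel_lo / rnumel_hi
    return masked_frac <= max_masked_frac

print(use_persistent_reduction("DEFAULT", 6, 64))   # False: 58/64 lanes masked
```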

Differential Revision: [D82853279](https://our.internmc.facebook.com/intern/diff/D82853279)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163365
Approved by: https://github.com/shunting314, https://github.com/PaulZhang12, https://github.com/jansel, https://github.com/v0i0
2025-09-23 21:58:57 +00:00
8c8416b021 Update pytorch.org links in docs/conf.py (#163682)
Update links in conf.py to docs.pytorch.org

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163682
Approved by: https://github.com/sekyondaMeta, https://github.com/albanD
2025-09-23 21:40:11 +00:00
b182365660 [ez] use list initializer syntax in fill_diagonal_ (#163607)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163607
Approved by: https://github.com/Skylion007
ghstack dependencies: #163485
2025-09-23 21:27:12 +00:00
5ca563ea09 symintify fill_diagonol_ (#163485)
Fixes #162271

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163485
Approved by: https://github.com/Skylion007
2025-09-23 21:27:12 +00:00
e671dcc969 Update tests to check for more robust pattern (#163107)
Landing this instead of https://github.com/pytorch/pytorch/pull/162994.

Here is how I think the whole dynamo + frame construction logic works:
1) There is no way to create a frame object in Python land, as this is created at runtime by CPython. That's why aot_compile creates FrameInfo this way (kind of like simulating the runtime). I guess you could write your own very simple eval_frame.c where you intercept the frame construction, but we probably don't want that.
2) When there is no wrapper (the old export or aot_compile), we first assign sources by iterating over f_locals, which contains both local args and closure variables (this is an implementation detail of CPython frame construction). That's why closure variables end up getting LocalSource names, as can be seen in this test case (f6ea41ead2/test/export/test_export.py (L1369)). Note that L["self"] here means we are referring to the local object self. The important thing to keep in mind is that this self is not actually the model's self, but the outer self.
3) When we switch to the wrapper case, we end up trying to inline the original inner module. When doing so, we need to track all locals and closures for this inner module, as can be seen here (f6ea41ead2/torch/_dynamo/variables/functions.py (L463)). Here we are not looking into the inner frame's f_locals but directly at closures. I guess this is because we are one more frame up, so there is no access to the frame's f_locals at this point, and it is probably not a good idea to change dynamo's logic here. As a result, I get the following error message, which is different from old export:
"While exporting, we found certain side effects happened in the model.forward. Here are the list of potential sources you can double check: ["L['self']._export_root.forward.__func__.__closure__[1].cell_contents.bank", "L['self']._export_root.forward.__func__.__closure__[1].cell_contents.bank_dict", "L['self']._export_root.forward.__func__.__closure__[0].cell_contents"]"

My initial attempt at solving this was to take the inner closures and put them into f_locals for the frame I am constructing, which turned out to be too complicated because we would need to muck around with bytecode instructions as well. So I am thinking we should just update the test to reflect the new names and follow up with a better post-processing step to produce better names.
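
For context on why the reported sources look like `__closure__[i].cell_contents`, here is a tiny standalone illustration (toy names, unrelated to the actual model):

```python
class Bank:
    pass

def make_forward():
    bank = Bank()              # captured by the closure below
    def forward(x):
        bank.touched = True    # side effect on a closed-over object
        return x
    return forward

fwd = make_forward()
# The captured object is only reachable through the function's closure cells,
# which is why the sources are reported as ...__func__.__closure__[0].cell_contents:
print(fwd.__closure__[0].cell_contents)   # <__main__.Bank object at 0x...>
```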

Differential Revision: [D82582029](https://our.internmc.facebook.com/intern/diff/D82582029)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163107
Approved by: https://github.com/avikchaudhuri
2025-09-23 21:11:48 +00:00
fc84743707 Implement CUDA stream protocol (#163614)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163614
Approved by: https://github.com/eqy
2025-09-23 21:02:08 +00:00
2a9745de3c [multi-kernel] shape-similarity kernel selection (#163090)
Introduces a variant of size-hint multi-kernel where, for novel runtime shapes, instead of performing full benchmarking to determine the optimal kernel, we select one of many kernels pre-generated from multi-kernel hints, based on the similarity between the hint and runtime input/output shapes (L1 distance in log2 space).
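
A rough sketch of that selection rule (toy code under the assumptions above, not Inductor's implementation):

```python
import math

def log2_l1(shape_a, shape_b):
    # L1 distance between shapes in log2 space
    return sum(abs(math.log2(a) - math.log2(b)) for a, b in zip(shape_a, shape_b))

def select_kernel(kernels, runtime_shapes):
    """kernels: list of (tuning_shapes, compiled_kernel) pairs, where the shape
    tuples are the flattened input + output dims used at pre-generation time."""
    return min(kernels, key=lambda kv: log2_l1(kv[0], runtime_shapes))[1]

# e.g. two kernels pre-generated from hints 64 and 256:
kernels = [((64, 64, 64, 64), "kernel_hint_64"),
           ((256, 256, 256, 256), "kernel_hint_256")]
print(select_kernel(kernels, (96, 96, 96, 96)))   # -> "kernel_hint_64"
```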

Some caveats/changes:
- Size-hint multi-kernel now only kicks in if the kernel has dynamic shapes
- Pre-generation still only does a 1-d search over the specified hints, e.g. `matmul([s0, s1], [s1, s2])` with size-hints `[64, 256]` only generates 2 kernels - based on tuning shapes ([64, 64], [64, 64]) and ([256, 256], [256, 256]). Extending this to a reasonable n-d search (via a user API?) is left as a follow-up

Benchmarking results, compared to multi-kernel w/ full benchmarking (hints 64, 4096), and compiling with the ground truth hint:
<img width="1902" height="1222" alt="550541081_1088709150049684_6528797079439730237_n" src="https://github.com/user-attachments/assets/056cca48-c16a-4451-9b4a-fa13a7a058a9" />

Full benchmarking doing worse is extremely weird, but we did see similar spikes in #156628

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163090
Approved by: https://github.com/bobrenjc93
2025-09-23 21:00:47 +00:00
22c5e8c17c Add num_store to inductor_meta and use it to scale persistent reduction x block (#162446)
Scale up XBLOCK for contiguous persistent reductions based on rnumel and number of loads + stores

<img width="928" height="656" alt="Screenshot 2025-09-18 at 5 02 57 PM" src="https://github.com/user-attachments/assets/ec3c561f-2a3f-4459-9e14-653715898da3" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162446
Approved by: https://github.com/v0i0, https://github.com/eellison, https://github.com/shunting314
ghstack dependencies: #162296
2025-09-23 20:36:39 +00:00
bcb893acb0 [ROCm] Build FBGEMM_GENAI for gfx942 only (#162648)
Fixes build timeouts >4h on libtorch build jobs: 75e7f49f9c/1

Brings back code to narrow down CK compilation targets from 69a25f6888 (diff-ce80f3115ab2f6be5142f0678a1fc92c6b2d7727766ce44f48726c99e720f777)

gfx942 supports fp8

Don't enable gfx950 for now, until more optimizations are in place as per https://github.com/pytorch/pytorch/pull/162648/files#r2369588738

Validation:
[rocm6.4](https://github.com/pytorch/pytorch/actions/runs/17944766350/job/51028483128) and [rocm6.3](https://github.com/pytorch/pytorch/actions/runs/17944766350/job/51028483093) libtorch builds finished within 3.9h.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162648
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-09-23 18:55:35 +00:00
8e6b0c71fb [Inductor] Remove no_type_check annotation on properties (#163570)
Some properties with `cache_on_self` were previously annotated with `no_type_check` to get around mypy limitations. This PR replaces both annotations with `cache_property_on_self` to enable type checking.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163570
Approved by: https://github.com/mlazos, https://github.com/PaulZhang12, https://github.com/Skylion007
2025-09-23 18:20:04 +00:00
0696a4b0b8 [EZ] Perma-ignore UP038 (#163649)
As it has been removed, see https://docs.astral.sh/ruff/rules/non-pep604-isinstance/
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163649
Approved by: https://github.com/Skylion007
ghstack dependencies: #163648
2025-09-23 17:58:18 +00:00
ca35dc2fdd [EZ] Fix UP041 violations (#163648)
I.e. use `TimeoutError` instead of `socket.timeout`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163648
Approved by: https://github.com/cyyever, https://github.com/Skylion007
2025-09-23 17:58:18 +00:00
649ceda8a5 [export] handling NamedTuple inputs (#162959)
Fixes #160547
### Summary:
The bug:
```
    def test_namedtuple(self):
        from collections import namedtuple
        Point = namedtuple('Point', 'x y')

        class M(torch.nn.Module):
            def forward(self, x, y):
                return x + y

        inp = Point(torch.ones(3), torch.ones(3))
        print(M()(*inp))

        # errors
        ep = torch.export.export(M(), inp, strict=False)
        print(ep)

        # succeeds
        ep = torch.export.export(M(), inp, strict=True)
        print(ep)

        # workaround could be to convert namedtuple to a kwarg
        inp_kwargs =  {field: getattr(inp, field) for field in inp._fields}
        ep = torch.export.export(M(), (), inp_kwargs)
        print(ep)
```
Fix: namedtuple is a subclass of tuple, but namedtuple inputs were not expected, so this change handles the namedtuple case.

I have added a 🧪 test case for this as well.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162959
Approved by: https://github.com/angelayi

Co-authored-by: Angela Yi <angelayi@meta.com>
2025-09-23 17:43:50 +00:00
2aadcea05c [ROCm] Improve perf for elementwise broadcast with mixed dtype (#163562)
* Unroll loops manually to hide memory access latency

Co-author: @amd-hhashemi

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163562
Approved by: https://github.com/jeffdaily
2025-09-23 17:42:48 +00:00
fde929c8a8 [AOTI] Fix model_package_loader get_cpp_compile_command (#163561)
This should fix the AOTI UTs in `test_aot_inductor_package.py`; these cases were failing at `compile_so`.

reproducer:
```cmd
pytest test\inductor\test_aot_inductor_package.py -v -k test_multiple_methods
```
<img width="1262" height="95" alt="image" src="https://github.com/user-attachments/assets/49458536-1cfe-498e-a12a-2bfd8da67a9e" />

The major fix is in `get_cpp_compile_command`. The code is aligned with the cpp_builder frontend code:  3ef1bef36c/torch/_inductor/cpp_builder.py (L1780-L1790)
3ef1bef36c/torch/_inductor/cpp_builder.py (L1959-L1976)

Fixed on Windows:
<img width="1261" height="89" alt="Image" src="https://github.com/user-attachments/assets/9bf43b11-aac1-4161-a625-e602e313a299" />

Also validated on Linux:
<img width="1039" height="81" alt="Image" src="https://github.com/user-attachments/assets/46063e16-6cf1-4a28-8466-0496871b8619" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163561
Approved by: https://github.com/jansel
2025-09-23 17:38:18 +00:00
134dfbeaef [DCP] DTensor slice dequantization with proper block alignment (#163532)
Summary:
When loading quantized tensors with DTensor slicing, the dequantization process was producing numerically incorrect results due to improper block-to-slice coordinate mapping. The previous implementation calculated block boundaries relative to the sliced tensor dimensions instead of the original full tensor dimensions, causing scale factors to be applied to wrong tensor regions.

This fix addresses the issue by:

1. **Proper coordinate mapping**: Added `_get_slice_to_block_mapping()` to correctly map tensor slices to quantization blocks using global coordinates from the full tensor shape.

2. **Block-aligned dequantization**: Updated `_dequantize_tensor()` to use proper block intersection logic, ensuring scale factors are applied to the correct portions of sliced tensors.

The fix ensures that when DTensor requests a slice of a quantized tensor, the dequantization correctly identifies which quantization blocks intersect with the requested slice and applies the appropriate scale factors to the right tensor regions.
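
A simplified 1-D sketch of the block-intersection idea (hypothetical helper names; the real code operates on DTensor metadata):

```python
import torch

def dequantize_slice(q_slice, scales, block_size, global_start):
    """q_slice: a 1-D quantized slice; scales[b] applies to global elements
    [b*block_size, (b+1)*block_size); global_start is the slice's offset in
    the full, unsliced tensor."""
    out = q_slice.to(torch.float32).clone()
    global_end = global_start + q_slice.numel()
    first_block = global_start // block_size
    last_block = (global_end - 1) // block_size
    for b in range(first_block, last_block + 1):
        # Intersect block b with the requested slice, then shift to local coords.
        lo = max(b * block_size, global_start) - global_start
        hi = min((b + 1) * block_size, global_end) - global_start
        out[lo:hi] *= scales[b]
    return out

q = torch.ones(6, dtype=torch.int8)
print(dequantize_slice(q, scales=[0.5, 2.0, 4.0], block_size=4, global_start=5))
# tensor([2., 2., 2., 4., 4., 4.])  -- blocks are chosen by *global* position
```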

Test Plan:
Tested with DTensor configurations where quantized tensors are sliced across different dimensions. Verified that:
1. Dequantized tensor values are numerically correct
2. Block boundaries are properly calculated relative to full tensor shape
3. Scale factors are applied to correct tensor regions
4. Tensor shapes map is built efficiently using only metadata

Correctness validation using https://github.com/wwwjn/torchtitan/blob/dsv3-sd-test/tests/fsdp_dequantized_load.py
```
{
  "model.layers.0.mlp.gate_proj.weight": {
    "mse": 4.30626645453458e-11,
    "mae": 9.98388827611052e-07,
    "max_abs_diff": 0.0009703934192657471,
    "cosine_similarity": 1.010810375213623,
    "relative_error": 0.001330620958469808,
    "kl_divergence_1_to_2": "6.563401e-08",
    "kl_divergence_2_to_1": "-6.522914e-08",
    "js_divergence": 1.3711876079014476e-10,
    "shape": [
      18432,
      7168
    ],
    "t1_stats": {
      "min": -0.4453125,
      "max": 0.30859375,
      "mean": -1.2592146958922967e-05
    },
    "t2_stats": {
      "min": -0.44529813528060913,
      "max": 0.3085886240005493,
      "mean": -1.2624391274584923e-05
    }
  },
  "model.layers.0.mlp.up_proj.weight": {
    "mse": 2.5534721906361746e-11,
    "mae": 3.118609583907528e-06,
    "max_abs_diff": 0.00047551095485687256,
    "cosine_similarity": 1.038962483406067,
    "relative_error": 0.0013681650161743164,
    "kl_divergence_1_to_2": "-5.8253768e-08",
    "kl_divergence_2_to_1": "5.8747577e-08",
    "js_divergence": NaN,
    "shape": [
      18432,
      7168
    ],
    "t1_stats": {
      "min": -0.228515625,
      "max": 0.2333984375,
      "mean": 8.862222955485777e-08
    },
    "t2_stats": {
      "min": -0.2285017967224121,
      "max": 0.23338991403579712,
      "mean": 8.824501662729745e-08
    }
  },
  "model.layers.0.mlp.down_proj.weight": {
    "mse": 2.2803769289536646e-11,
    "mae": 2.8916260816913564e-06,
    "max_abs_diff": 0.0008973777294158936,
    "cosine_similarity": 1.0376262664794922,
    "relative_error": 0.001346255769021809,
    "kl_divergence_1_to_2": "1.2744896e-07",
    "kl_divergence_2_to_1": "-1.2736885e-07",
    "js_divergence": 5.992362162032805e-11,
    "shape": [
      7168,
      18432
    ],
    "t1_stats": {
      "min": -0.54296875,
      "max": 0.546875,
      "mean": -2.9487239316949854e-07
    },
    "t2_stats": {
      "min": -0.5429964661598206,
      "max": 0.5469087362289429,
      "mean": -2.9507478416235244e-07
    }
  }
}
```

https://www.internalfb.com/intern/testinfra/testrun/3940649985202645

Differential Revision: D82975005

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163532
Approved by: https://github.com/wwwjn
2025-09-23 16:48:16 +00:00
221ac81043 Revert "[precompile] Add option to disable guard check on aot-compiled function. (#163432)"
This reverts commit 539e84e289fa7563032410706ede50a4eaa7a15d.

Reverted https://github.com/pytorch/pytorch/pull/163432 on behalf of https://github.com/Camyll due to breaking internal tests ([comment](https://github.com/pytorch/pytorch/pull/163432#issuecomment-3324757069))
2025-09-23 16:31:30 +00:00
6e5dddba64 Use accelerator API in common_dtensor (#163498)
Fixes #ISSUE_NUMBER

Try to unify the device checking in common_dtensor (the testing module) via the accelerator API.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163498
Approved by: https://github.com/albanD, https://github.com/H-Huang
2025-09-23 16:30:20 +00:00
ebddbe787a [ROCm][CI] skip test_sparse_triangular_solve (#163651)
Need more time to debug, but we also need a clean CI signal. The test was unskipped by #163495, but had been skipped on ROCm prior to that.

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163651
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-09-23 15:55:51 +00:00
5f0c7cb4aa Add B200 smoke test (#159494)
Running test_max_autotune locally on B200 is horrible; for now, to get something landed, I am focusing on test_matmul_cuda.py and test_fp8.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159494
Approved by: https://github.com/nWEIdia, https://github.com/huydhn
ghstack dependencies: #163460, #163537, #163552
2025-09-23 15:45:05 +00:00
b3cf5c79dd Skip on sm100 later since Tests are non determinisitic (#163552)
This is tracked in https://github.com/pytorch/pytorch/issues/163462

Skipping since we are seeing sporadic errors locally and on CI.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163552
Approved by: https://github.com/eqy, https://github.com/Skylion007
ghstack dependencies: #163460, #163537
2025-09-23 15:45:05 +00:00
0f674077f4 Large tests failing on bfloat16 (#163537)
# Summary

I ran these tests locally; each 10k test takes over 5 minutes to run on an extremely beefy CPU. I think that this is overkill, but feel free to disagree. Also, the one test that failed earlier up in the stack failed with a 1-ULP difference, so I think this is kind of an edge case in how we do testing (I will write up an issue with my thoughts later).

``` Shell
==================================================================================================== FAILURES =====================================================================================================
_________________________________________________________ TestMatmulCudaCUDA.test_cublas_addmm_reduced_precision_size_10000_backend_cublas_cuda_bfloat16 __________________________________________________________
Traceback (most recent call last):
  File "/home/dev/.conda/envs/nightly/lib/python3.12/unittest/case.py", line 58, in testPartExecutor
    yield
  File "/home/dev/.conda/envs/nightly/lib/python3.12/unittest/case.py", line 634, in run
    self._callTestMethod(testMethod)
  File "/home/dev/.conda/envs/nightly/lib/python3.12/unittest/case.py", line 589, in _callTestMethod
    if method() is not None:
       ^^^^^^^^
  File "/home/dev/.conda/envs/nightly/lib/python3.12/site-packages/torch/testing/_internal/common_utils.py", line 3223, in wrapper
    method(*args, **kwargs)
  File "/home/dev/.conda/envs/nightly/lib/python3.12/site-packages/torch/testing/_internal/common_utils.py", line 3223, in wrapper
    method(*args, **kwargs)
  File "/home/dev/.conda/envs/nightly/lib/python3.12/site-packages/torch/testing/_internal/common_device_type.py", line 426, in instantiated_test
    result = test(self, **param_kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/dev/.conda/envs/nightly/lib/python3.12/site-packages/torch/testing/_internal/common_device_type.py", line 1408, in only_fn
    return fn(slf, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/dev/.conda/envs/nightly/lib/python3.12/site-packages/torch/testing/_internal/common_utils.py", line 2024, in wrap_fn
    return fn(self, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/dev/meta/pytorch/test/test_matmul_cuda.py", line 190, in test_cublas_addmm_reduced_precision
    self.cublas_addmm(size, dtype, True)
  File "/home/dev/meta/pytorch/test/test_matmul_cuda.py", line 162, in cublas_addmm
    assert_close_with_ulp(res_cpu, res_cuda, atol=tolerance.atol, rtol=tolerance.rtol)
  File "/home/dev/meta/transformer_nuggets/transformer_nuggets/numerics/__init__.py", line 222, in assert_close_with_ulp
    raise AssertionError("\n".join(error_parts))
AssertionError: Tensor-likes are not close!

Mismatched elements: 425 / 100030002 (0.0%)
Greatest absolute difference: 16 at index (2176, 9325) (up to 10 allowed)
Greatest relative difference: 3984 at index (376, 3754) (up to 0.2 allowed)

============================================================
ULP Analysis of Failures:
============================================================

Total failures: 425
ULP distances: min=-32761, max=32763, mean=-11513.7

Top 10 failures by absolute difference:
  #  | Index                      | Abs Diff    | Rel Diff    | ULP  | Expected     | Actual
----------------------------------------------------------------------------------------------------
   1 | (6923, 1580)               | 1.600000e+01 | 5.390625e-01 |  146 |    29.750000 |    13.750000
   2 | (4677, 420)                | 1.600000e+01 | 6.601562e-01 |   95 |    24.250000 |    40.250000
   3 | (2176, 9325)               | 1.600000e+01 | 6.875000e-01 |  210 |    23.250000 |     7.250000
   4 | (5119, 7865)               | 1.600000e+01 | 1.164062e+00 |  146 |   -13.750000 |   -29.750000
   5 | (3218, 8334)               | 1.600000e+01 | 2.593750e+00 |  236 |     6.156250 |    22.125000
   6 | (5245, 241)                | 1.600000e+01 | 5.468750e-01 |   75 |    29.250000 |    45.250000
   7 | (7666, 6549)               | 1.600000e+01 | 1.640000e+03 | 1376 |    -0.009766 |   -16.000000
   8 | (1663, 1115)               | 1.593750e+01 | 8.375000e+00 | -32427 |     1.898438 |   -14.062500
   9 | (3967, 7708)               | 1.593750e+01 | 1.368750e+01 | -32510 |     1.164062 |   -14.750000
  10 | (2874, 2038)               | 1.593750e+01 | 1.710938e+00 |  181 |     9.312500 |    25.250000

Note: Maximum absolute and relative errors occur at different locations
  Max abs diff location (2176, 9325): 210 ULP
  Max rel diff location (376, 3754): 31868 ULP

To execute this test, run the following from the base repo dir:
    python test/test_matmul_cuda.py TestMatmulCudaCUDA.test_cublas_addmm_reduced_precision_size_10000_backend_cublas_cuda_bfloat16

This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
________________________________________________________ TestMatmulCudaCUDA.test_cublas_addmm_reduced_precision_size_10000_backend_cublaslt_cuda_bfloat16 _________________________________________________________
Traceback (most recent call last):
  File "/home/dev/.conda/envs/nightly/lib/python3.12/unittest/case.py", line 58, in testPartExecutor
    yield
  File "/home/dev/.conda/envs/nightly/lib/python3.12/unittest/case.py", line 634, in run
    self._callTestMethod(testMethod)
  File "/home/dev/.conda/envs/nightly/lib/python3.12/unittest/case.py", line 589, in _callTestMethod
    if method() is not None:
       ^^^^^^^^
  File "/home/dev/.conda/envs/nightly/lib/python3.12/site-packages/torch/testing/_internal/common_utils.py", line 3223, in wrapper
    method(*args, **kwargs)
  File "/home/dev/.conda/envs/nightly/lib/python3.12/site-packages/torch/testing/_internal/common_utils.py", line 3223, in wrapper
    method(*args, **kwargs)
  File "/home/dev/.conda/envs/nightly/lib/python3.12/site-packages/torch/testing/_internal/common_device_type.py", line 426, in instantiated_test
    result = test(self, **param_kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/dev/.conda/envs/nightly/lib/python3.12/site-packages/torch/testing/_internal/common_device_type.py", line 1408, in only_fn
    return fn(slf, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/dev/.conda/envs/nightly/lib/python3.12/site-packages/torch/testing/_internal/common_utils.py", line 2024, in wrap_fn
    return fn(self, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/dev/meta/pytorch/test/test_matmul_cuda.py", line 190, in test_cublas_addmm_reduced_precision
    self.cublas_addmm(size, dtype, True)
  File "/home/dev/meta/pytorch/test/test_matmul_cuda.py", line 162, in cublas_addmm
    assert_close_with_ulp(res_cpu, res_cuda, atol=tolerance.atol, rtol=tolerance.rtol)
  File "/home/dev/meta/transformer_nuggets/transformer_nuggets/numerics/__init__.py", line 222, in assert_close_with_ulp
    raise AssertionError("\n".join(error_parts))
AssertionError: Tensor-likes are not close!

Mismatched elements: 425 / 100030002 (0.0%)
Greatest absolute difference: 16 at index (2176, 9325) (up to 10 allowed)
Greatest relative difference: 3984 at index (376, 3754) (up to 0.2 allowed)

============================================================
ULP Analysis of Failures:
============================================================

Total failures: 425
ULP distances: min=-32761, max=32763, mean=-11513.7

Top 10 failures by absolute difference:
  #  | Index                      | Abs Diff    | Rel Diff    | ULP  | Expected     | Actual
----------------------------------------------------------------------------------------------------
   1 | (6923, 1580)               | 1.600000e+01 | 5.390625e-01 |  146 |    29.750000 |    13.750000
   2 | (4677, 420)                | 1.600000e+01 | 6.601562e-01 |   95 |    24.250000 |    40.250000
   3 | (2176, 9325)               | 1.600000e+01 | 6.875000e-01 |  210 |    23.250000 |     7.250000
   4 | (5119, 7865)               | 1.600000e+01 | 1.164062e+00 |  146 |   -13.750000 |   -29.750000
   5 | (3218, 8334)               | 1.600000e+01 | 2.593750e+00 |  236 |     6.156250 |    22.125000
   6 | (5245, 241)                | 1.600000e+01 | 5.468750e-01 |   75 |    29.250000 |    45.250000
   7 | (7666, 6549)               | 1.600000e+01 | 1.640000e+03 | 1376 |    -0.009766 |   -16.000000
   8 | (1663, 1115)               | 1.593750e+01 | 8.375000e+00 | -32427 |     1.898438 |   -14.062500
   9 | (3967, 7708)               | 1.593750e+01 | 1.368750e+01 | -32510 |     1.164062 |   -14.750000
  10 | (2874, 2038)               | 1.593750e+01 | 1.710938e+00 |  181 |     9.312500 |    25.250000

Note: Maximum absolute and relative errors occur at different locations
  Max abs diff location (2176, 9325): 210 ULP
  Max rel diff location (376, 3754): 31868 ULP

To execute this test, run the following from the base repo dir:
    python test/test_matmul_cuda.py TestMatmulCudaCUDA.test_cublas_addmm_reduced_precision_size_10000_backend_cublaslt_cuda_bfloat16

This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
```
The bfloat16 failures are for sure real. cc @eqy
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163537
Approved by: https://github.com/Skylion007, https://github.com/malfet, https://github.com/eqy
ghstack dependencies: #163460
2025-09-23 15:45:05 +00:00
720a7b2887 [export] Remove .contiguous() when saving weights to raw bytes (#163587)
Summary: `.contiguous()` will discard the original storage size of the tensor, and could lead to issues during loading.
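
A small illustration of why that matters when dumping raw bytes (plain PyTorch, not the export code):

```python
import torch

full = torch.randn(8, 8)                 # 8*8*4 = 256 bytes of storage
view = full[:, :4]                       # non-contiguous view into `full`
print(view.untyped_storage().nbytes())              # 256: the original storage size
print(view.contiguous().untyped_storage().nbytes()) # 128: a fresh, smaller storage
```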

Test Plan:
buck2 run mode/dev-nosan caffe2/test:test_export -- -r test_1D_tensor_slicing
buck2 run mode/dev-nosan caffe2/test:test_export -- -r test_2D_tensor_slicing

Differential Revision: D83016250

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163587
Approved by: https://github.com/angelayi
2025-09-23 15:44:56 +00:00
49e7b2f69d [inductor] Fix error from custom CUDA allocators (#163422)
Fixes #163257

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163422
Approved by: https://github.com/eellison
ghstack dependencies: #163386, #163398, #163387, #163414, #163415, #163419, #163434, #163393, #163412
2025-09-23 15:37:45 +00:00
6ef74879f6 [dynamo] Fix TorchFunctionMode handling with get_rng_state (#163412)
Fixes #162624
Fixes #162586

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163412
Approved by: https://github.com/eellison
ghstack dependencies: #163386, #163398, #163387, #163414, #163415, #163419, #163434, #163393
2025-09-23 15:37:45 +00:00
9c4d9f940b [inductor] Support out_dtype arg to matmul (#163393)
Fixes #163275

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163393
Approved by: https://github.com/eellison, https://github.com/coconutruben
ghstack dependencies: #163386, #163398, #163387, #163414, #163415, #163419, #163434
2025-09-23 15:37:38 +00:00
ed84e808f0 [inductor] Freeze layouts in FlexAttention (#163434)
Fixes #163300

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163434
Approved by: https://github.com/drisspg
ghstack dependencies: #163386, #163398, #163387, #163414, #163415, #163419
2025-09-23 15:37:29 +00:00
518c320676 [inductor] libdevice.sqrt => tl.sqrt_rn (#163419)
Fixes #163082

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163419
Approved by: https://github.com/Skylion007, https://github.com/mlazos
ghstack dependencies: #163386, #163398, #163387, #163414, #163415
2025-09-23 15:37:21 +00:00
4264fd34ec Add basic tests for torch.distributed.tensor._utils.compute_global_tensor_info (#162968)
Next PR writes a C++ implementation. Seems good to have tests first.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162968
Approved by: https://github.com/ezyang
ghstack dependencies: #161695, #162508
2025-09-23 14:56:32 +00:00
e05c9c0c84 [ROCm][CI] cudagraph trees ut fixes (#163592)
Fixes #162125.
Fixes #160719.
Fixes #157901.
Fixes #157871.
Fixes #157761.
Fixes #157723.
Fixes #157643.
Fixes #157616.
Fixes #157556.
Fixes #157533.
Fixes #157449.
Fixes #157428.
Fixes #157413.
Fixes #157367.
Fixes #157350.
Fixes #157339.
Fixes #157312.
Fixes #157280.
Fixes #157258.
Fixes #157173.
Fixes #157143.
Fixes #157112.
Fixes #157086.
Fixes #157058.
Fixes #157035.
Fixes #156984.
Fixes #156957.
Fixes #156954.
Fixes #156922.
Fixes #156886.
Fixes #156838.
Fixes #156808.
Fixes #156801.
Fixes #156778.
Fixes #156755.
Fixes #156735.
Fixes #156693.
Fixes #152561.
Fixes #130749.
Fixes #100074.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163592
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-09-23 14:45:00 +00:00
aff76c046d Revert "Add fake_impl for _native_multi_head_attention (#163167)"
This reverts commit 27164b6788cab6e6d8095012839e51c958a819d6.

Reverted https://github.com/pytorch/pytorch/pull/163167 on behalf of https://github.com/malfet due to This broke in inductor-cpu-test, see 1a42656d6c/1 ([comment](https://github.com/pytorch/pytorch/pull/163167#issuecomment-3324302026))
2025-09-23 14:36:45 +00:00
1a42656d6c [Flex attention] Fix flex attention head broadcast (#163426)
Fixes part of #163314

In particular, this fixes **Bug 1: H=None Broadcasting Produces Incorrect Results**.

This fixes a shape bug when slicing BlockMask on the Q-tile axis with an int (**mask[:, :, i]**). That form of indexing collapses the Q dimension, so kv_num_blocks/kv_indices lose their expected [B, H, Q_tiles, …] shape. Because they lose this shape, even though the mask_mod remains "interpretable", the kernel's stride math reads wrong offsets. As a result, we get silent numerical mismatches compared to regular SDPA, especially with single-position decoding / H broadcasting.

The fact that the B=None, H=None case works is accidental: with a singleton batch/head, the kernel maps to index 0 via `sparse_idx_z = off_zq % 1` and `sparse_idx_hq = off_hq % 1`, and with a single Q tile `q_start // SPARSE_Q_MULTIPLE = 0`. The missing Q-tiles stride is multiplied by 0, so the bad offset from the collapsed Q axis doesn't move the pointer, and it happens to read the first tile correctly. Once H > 1 or there are multiple Q tiles, those terms become nonzero and the kernel indexes with the wrong strides, which causes silent errors.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163426
Approved by: https://github.com/drisspg
2025-09-23 13:01:51 +00:00
bda9ab291d [inductor] fix as_strided lowering with .view(dtype) inputs (#163319)
FIXES https://github.com/pytorch/pytorch/issues/163286

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163319
Approved by: https://github.com/eellison
2025-09-23 12:50:57 +00:00
3c64b2abab CUDA 13.0 Warning update for supported architectures (#163585)
Please see build script: 8da008678f/.ci/manywheel/build_cuda.sh (L69-L71)

This should display correct warning:
``
Please install PyTorch with a following CUDA
configurations: 12.6 12.8 13.0 following instructions at
https://pytorch.org/get-started/locally/
``
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163585
Approved by: https://github.com/malfet
2025-09-23 11:27:11 +00:00
5d749ceb92 Remove test conditions for CUDA<12 (#163495)
Because it required that CUDA >=12.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163495
Approved by: https://github.com/janeyx99
2025-09-23 07:52:00 +00:00
8d81564df5 [pt2][cache] rework cache for true generic usage + better tests (#163488)
Differential Revision: D82933509

Over the weekend I realized that some of the cache implementation was a bit silly, and too constrained to be actually generic. For example, InMemoryCache[str, bytes] was odd, since we'd probably want to be able to store more than just str keys with bytes values. So, tl;dr: everything is now generic, with the one constraint being that Key and Value must both be pickle-able types. This makes things a lot simpler for us, since all caches can now be str -> bytes caches under the hood if we'd like, and Key/Value just get pickled on the way in and out.

With this change, there were also some improvements made to the testing: mainly better coverage, but now we also test each cache across every combination of Key/Value types to ensure that they will work with the types we might specify later.

I also hardened some things here and there; for example, we now use literal_eval (I forgot who mentioned this on the first PR, but thank you for the suggestion!), and all errors coming from the caching will be wrapped in CacheError from now on (although we still raise from the original error context where possible).

Putting this PR up now for feedback. In the process of generalizing the code I removed the documentation since it was becoming outdated, but I will add it back once the PR is green.

I have the next PR ready as well (it implements a fresh-cache context manager); I will export it once this lands.
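
A minimal sketch of the idea (toy code, not the actual implementation; in particular, keying on a raw `pickle.dumps` is only for illustration):

```python
import pickle
from typing import Generic, Optional, TypeVar

K = TypeVar("K")
V = TypeVar("V")

class InMemoryCache(Generic[K, V]):
    """Any pickle-able Key/Value on top of a bytes -> bytes store."""

    def __init__(self) -> None:
        self._store: dict[bytes, bytes] = {}

    def put(self, key: K, value: V) -> None:
        self._store[pickle.dumps(key)] = pickle.dumps(value)

    def get(self, key: K) -> Optional[V]:
        blob = self._store.get(pickle.dumps(key))
        return None if blob is None else pickle.loads(blob)

cache: InMemoryCache[tuple, dict] = InMemoryCache()
cache.put(("mm", 64, 64), {"block": 128})
print(cache.get(("mm", 64, 64)))   # {'block': 128}
```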

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163488
Approved by: https://github.com/aorenste, https://github.com/masnesral
2025-09-23 07:31:48 +00:00
b426ba1d5e [torchfuzz] introduce tensor and scalar pointwise ops (#163558)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163558
Approved by: https://github.com/laithsakka
ghstack dependencies: #163547, #163553, #163554, #163555, #163556, #163557
2025-09-23 06:20:13 +00:00
375f3e3a61 [OpenReg][Docs] Correct docs about openreg usage example. (#163235)
## Why this PR?
I've tried to follow the guidance of the `OpenReg` [usage example](https://github.com/pytorch/pytorch/tree/main/test/cpp_extensions/open_registration_extension/torch_openreg/third_party/openreg) and found that the command for compiling `example.cpp` (`g++ -o out example/example.cpp -L ./build -lopenreg`) is not compatible with my `gcc` (v11.4).

Since I installed my `gcc` through `apt install build-essential`, which I think is a common way for developers to install `gcc`, I believe it's necessary to slightly modify the command by adding `-I ./` to explicitly indicate the header file search path.

## What I've changed?
- I added `-I ./` to correctly search for `./include/openreg.h`.
- I also added a `pwd` comment for better readability and removed unused imports in `example/example.cpp`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163235
Approved by: https://github.com/FFFrog, https://github.com/albanD

Co-authored-by: Jiawei Li <ljw1101.vip@gmail.com>
2025-09-23 06:16:45 +00:00
45d9dcccc5 Update Kineto Submodule (#162222)
Summary: Update

Test Plan:
CI

Differential Revision: D81727392

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162222
Approved by: https://github.com/sanrise
2025-09-23 06:08:55 +00:00
309fe03f4b [torchfuzz] remove unneeded try catch (#163557)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163557
Approved by: https://github.com/laithsakka
ghstack dependencies: #163547, #163553, #163554, #163555, #163556
2025-09-23 06:05:08 +00:00
1545bb1c00 [torchfuzz] shuffle compatible ops (#163556)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163556
Approved by: https://github.com/laithsakka
ghstack dependencies: #163547, #163553, #163554, #163555
2025-09-23 05:53:44 +00:00
d5e51d34f7 [torchfuzz] decompose -> fuzz_inputs_specs (#163555)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163555
Approved by: https://github.com/laithsakka
ghstack dependencies: #163547, #163553, #163554
2025-09-23 05:44:59 +00:00
08c5efde5f [torchfuzz] cache operators (#163554)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163554
Approved by: https://github.com/laithsakka
ghstack dependencies: #163547, #163553
2025-09-23 05:28:07 +00:00
19b754dff8 Revert "Update cutlass version for fbcode (#163091)"
This reverts commit 509c4e86270cc4decca58905d0f446e1fc0cf618.

Reverted https://github.com/pytorch/pytorch/pull/163091 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/163091#issuecomment-3322428791))
2025-09-23 05:08:42 +00:00
d3a1345ed8 Use functools.cache on has_efa (#163439)
Cache the result of `has_efa` using `functools.cache`.
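
A tiny illustration of the pattern (the body is a made-up stand-in for the real check):

```python
import functools

@functools.cache
def has_efa() -> bool:
    # Imagine an expensive probe here (e.g., inspecting network interfaces);
    # functools.cache memoizes the result after the first call.
    print("probing...")
    return False

has_efa()   # prints "probing..." and computes the result
has_efa()   # returns the cached result without re-running the body
```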

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163439
Approved by: https://github.com/janeyx99
2025-09-23 05:03:03 +00:00
e3b392bdfd [BC breaking] Remove deprecated imports for torch.utils.data.datapipes.iter.grouping (#163438)
This PR removes the import tricks for `SHARDING_PRIORITIES` and `ShardingFilterIterDataPipe` from `torch.utils.data.datapipes.iter.grouping`. They were declared to be removed in PyTorch 2.1, but never were.
Before change:
```
import torch.utils.data.datapipes.iter.grouping.SHARDING_PRIORITIES
import torch.utils.data.datapipes.iter.grouping.ShardingFilterIterDataPipe
```
these imports work.
After change:
these imports raise an ImportError.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163438
Approved by: https://github.com/janeyx99
2025-09-23 05:02:06 +00:00
bb5be56619 [torch][cuda][device_limits] Library for querying device hardware limits for flops and bandwidth (#162942)
In various benchmarks scattered across the repo, the limits for flops/second and memory bandwidth are usually hardcoded for a single device. This utility could help in providing a more structured way to query the device capabilities. If this is approved, we can use it when reporting flops efficiency and bandwidth relative to peak in the benchmarks and tests. The intent is to add more devices, more parameters (e.g. L2 cache bandwidth, NVLink, etc.) for both CPUs and accelerators.

Testing:

```
import torch

if torch.cuda.is_available():
    device = torch.cuda.current_device()
    mod = torch.get_device_module('cuda')
    hw = mod._device_limits.GPULimits(device)

    print(hw.get_tflops_per_second(torch.float16))
    print(hw.get_tflops_per_second(torch.float32))
    print(hw.get_tflops_per_second(torch.float64))
    print(hw.get_tflops_per_second(torch.bfloat16))
    print(hw.get_tflops_per_second(torch.int8))
    print(hw.get_memory_bandwidth_Bps() / 1e9)
    print(hw.get_shared_memory_bandwidth_Bps() / 1e9)

# Output on an H100 GPU
1070.53056
535.26528
66.90816
1070.53056
2141.06112
4893.696
33454.08
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162942
Approved by: https://github.com/ngimel, https://github.com/albanD
2025-09-23 04:48:19 +00:00
0e122380c2 [torchfuzz] remove supports_variable_inputs for now (#163553)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163553
Approved by: https://github.com/laithsakka
ghstack dependencies: #163547
2025-09-23 04:44:54 +00:00
fcd79d5228 [vllm hash update] update the pinned vllm hash (#163590)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned vllm hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163590
Approved by: https://github.com/pytorchbot
2025-09-23 04:44:15 +00:00
95ac7d724e Rename to _debug_mode.py to make it private (#163534)
Rename debug_mode.py to _debug_mode.py to make it private, per @alban's request.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163534
Approved by: https://github.com/albanD
2025-09-23 04:27:10 +00:00
0b75a16200 [torchfuzz] Encapsulate fuzzing and codegen logic into ops (#163547)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163547
Approved by: https://github.com/laithsakka
2025-09-23 04:26:00 +00:00
27164b6788 Add fake_impl for _native_multi_head_attention (#163167)
Test Plan:
See added test in test_export.py

Reviewed By: henryoier

Differential Revision: D77747446

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163167
Approved by: https://github.com/angelayi
2025-09-23 04:02:20 +00:00
cyy
447b8fc56d [2/N] Use filesystem in inductor (#163465)
Use std::filesystem in most inductor code. This is follow-up of https://github.com/pytorch/pytorch/pull/152288 .
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163465
Approved by: https://github.com/Skylion007
2025-09-23 03:56:16 +00:00
6a48f57d2f [1/N] Remove 'type: ignore' suppressions (#163468)
Remove some unnecessary 'type: ignore' suppressions from python code.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163468
Approved by: https://github.com/Skylion007, https://github.com/janeyx99
2025-09-23 03:53:11 +00:00
e9300b2b7c remove allow-untyped-defs from ./torch/onnx/_internal/torchscript_exporter/_globals.py (#163472)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163472
Approved by: https://github.com/Skylion007
ghstack dependencies: #163246, #163469, #163470
2025-09-23 03:50:29 +00:00
8f30a8dc47 [AOTInductor] Add grid information for Triton Kernels (#160131)
Summary:
Add grid information for Triton Kernels for profiling in Kineto.

Test Plan:
Before change:
<img width="539" height="625" alt="Screenshot 2025-08-07 at 1 09 07 PM" src="https://github.com/user-attachments/assets/dd0778a9-2ff3-4819-acd3-de585cf7f9d1" />

After change:
<img width="550" height="898" alt="Screenshot 2025-08-07 at 1 05 49 PM" src="https://github.com/user-attachments/assets/d84988df-bb83-41ed-80ac-8a6d843a1a9d" />

*Note: we can extract grid size etc. from the device-side trace, but we're focusing on the host side specifically for this PR, mainly so we can add more host-side information needed for performance profiling in the future.


Pull Request resolved: https://github.com/pytorch/pytorch/pull/160131
Approved by: https://github.com/desertfire
2025-09-23 02:15:24 +00:00
2c7959eee9 [ignore][codex-test] Add typing to simple library registry (#161367)
## Summary
- add type annotations for simple library registry and dispatch rule holder
- remove allow-untyped-defs directive

## Testing
- `python -m mypy torch/_library/simple_registry.py` *(fails: repo expects mypy==1.16.0)*
- `lintrunner -a torch/_library/simple_registry.py` *(fails: attr-defined error in torchgen/gen_schema_utils.py)*
- `python test/test_torch.py TestTorch.test_dir` *(fails: ModuleNotFoundError: No module named 'torch')*

------
https://chatgpt.com/codex/tasks/task_e_68aa3cc210488326befdd992c79115a0
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161367
Approved by: https://github.com/Skylion007
2025-09-23 02:08:55 +00:00
3ef1bef36c [sdpa] make sure to recompile if alignment is different than before (#163083)
## Context
An example from Qwen2-7B
- This comes from running torch.compile with a sequence length that is
divisible by 8 (no padding needed). Call this `Run1`.
- If we then run the compiled model with a different length that isn't
divisible by 8 (which requires padding), call this `Run2`.
- Then we'll see this error.
```
File "/var/tmp/torchinductor_nobody/2w/c2wby7ilxbna45xrtrrfjqpeutwouruviu2742ockunnd2bleeiz.py", line 1963, in call
    buf24 = torch.ops.aten._scaled_dot_product_efficient_attention_backward.default(reinterpret_tensor(buf18, (s85, 3584 // s19, s48, 512 // (512 // s19)), (s48*(512 // (512 // s19))*(3584 // s19), 512 // (512 // s19), (512 // (512 // s19))*(3584 // s19), 1), 0), buf20, buf21, buf22, buf23, getitem, getitem_1, getitem_2, getitem_3, 0.0, [True, True, True, False], scale=0.08838834764831845)
File "torch/_ops.py", line 841, in __call__
    return self._op(*args, **kwargs)
RuntimeError: attn_bias is not correctly aligned (strideM). attn_bias.stride(2) = 6102, and should be a multiple of 4.
```
- We only see the error because we did not recompile on `Run2`. Instead we ran the inputs on the same graph as `Run1`.

### A bit more on why.
Here we check whether to realize the unpadded buffer (unwrapped slice) which we want for `Run1` but not for `Run2`.
0897affcd5/torch/_inductor/lowering.py (L2687-L2694)

## Fix
Size hint doesn't guard, so the fix is to use `guard_or*` to guard.
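
An illustrative sketch of the two runs (made-up shapes; assumes a CUDA device, and may or may not reproduce the exact error depending on which SDPA backend gets picked):

```python
import torch
import torch.nn.functional as F

def attn(q, k, v, bias):
    return F.scaled_dot_product_attention(q, k, v, attn_mask=bias)

compiled = torch.compile(attn, dynamic=True)

def run(seq_len):
    q = k = v = torch.randn(1, 8, seq_len, 64, device="cuda", requires_grad=True)
    bias = torch.randn(1, 8, seq_len, seq_len, device="cuda")
    compiled(q, k, v, bias).sum().backward()

run(64)   # "Run1": seq_len % 8 == 0, attn_bias needs no padding
run(70)   # "Run2": seq_len % 8 != 0; without a recompile, the backward hit the
          #         misaligned attn_bias stride error shown above
```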

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163083
Approved by: https://github.com/eellison
2025-09-23 01:33:33 +00:00
539e84e289 [precompile] Add option to disable guard check on aot-compiled function. (#163432)
Summary:
Under some circumstances it seems reasonable to return a callable directly, without a guard check, when the user uses aot_compile on a function with a single compilation result.

When there are multiple entries (aot_compile_module), we should start enabling guard checks to tell the different compiled functions apart.

Test Plan: CI

Differential Revision: D82904540

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163432
Approved by: https://github.com/dolpm
2025-09-23 01:00:05 +00:00
68e75be86a Update pytorch_sphinx_theme2 to latest hash (#163269)
The updated theme:
- Fixes articleBody in the json+ld that caused previous Google Search issues
- Other minor fixes
- 404.html fixes
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163269
Approved by: https://github.com/albanD
2025-09-22 23:20:23 +00:00
8da008678f Remove outdated commented CMake code (#163442)
Policies `CMP0023` and `CMP0022` have been removed in CMake 4.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163442
Approved by: https://github.com/janeyx99
2025-09-22 23:07:36 +00:00
fa15fb01ab [EZ] Remove XLA from unstable.yml (#163564)
It runs for 30 min on linux.12xlarge and then fails, and it has been like
that since Aug 7th.

Besides, there are no more python-3.9 builds left.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163564
Approved by: https://github.com/seemethere, https://github.com/atalman, https://github.com/huydhn
2025-09-22 22:11:50 +00:00
clr
33daaad7d0 dynamo: Handle objects in graph that do not support weakref (#163168)
We are seeing crashes of the form
```
Traceback (most recent call last):
  File "/packages/aps_ads_vm/launcher_multiapp-inplace#link-tree/torch/_dynamo/symbolic_convert.py", line 1487, in run
    while self.step():
  File "/packages/aps_ads_vm/launcher_multiapp-inplace#link-tree/torch/_dynamo/symbolic_convert.py", line 1348, in step
    self.dispatch_table[inst.opcode](self, inst)
  File "/packages/aps_ads_vm/launcher_multiapp-inplace#link-tree/torch/_dynamo/symbolic_convert.py", line 2437, in LOAD_ATTR
    self._load_attr(inst)
  File "/packages/aps_ads_vm/launcher_multiapp-inplace#link-tree/torch/_dynamo/symbolic_convert.py", line 2425, in _load_attr
    result = BuiltinVariable(getattr).call_function(
  File "/packages/aps_ads_vm/launcher_multiapp-inplace#link-tree/torch/_dynamo/variables/builtin.py", line 1347, in call_function
    return handler(tx, args, kwargs)
  File "/packages/aps_ads_vm/launcher_multiapp-inplace#link-tree/torch/_dynamo/variables/builtin.py", line 967, in <lambda>
    tx, [v.realize() for v in args], kwargs
  File "/packages/aps_ads_vm/launcher_multiapp-inplace#link-tree/torch/_dynamo/variables/builtin.py", line 967, in <listcomp>
    tx, [v.realize() for v in args], kwargs
  File "/packages/aps_ads_vm/launcher_multiapp-inplace#link-tree/torch/_dynamo/variables/lazy.py", line 72, in realize
    self._cache.realize()
  File "/packages/aps_ads_vm/launcher_multiapp-inplace#link-tree/torch/_dynamo/variables/lazy.py", line 33, in realize
    self.vt = builder.VariableBuilder(tx, self.source)(self.value)
  File "/packages/aps_ads_vm/launcher_multiapp-inplace#link-tree/torch/_dynamo/variables/builder.py", line 445, in __call__
    vt = self._wrap(value)
  File "/packages/aps_ads_vm/launcher_multiapp-inplace#link-tree/torch/_dynamo/variables/builder.py", line 1043, in _wrap
    torch._dynamo.utils.store_user_object_weakref(value)
  File "/packages/aps_ads_vm/launcher_multiapp-inplace#link-tree/torch/_dynamo/utils.py", line 4694, in store_user_object_weakref
    user_obj_id_to_weakref[obj_id] = weakref.ref(obj)
torch._dynamo.exc.InternalTorchDynamoError: TypeError: cannot create weak reference to 'torch.Event' object
```

This pull request makes us gracefully graph break, vs explicitly crashing.

I've added a test which reproduces the issue. There is a side discussion re:
how torch.Event support ever worked here, since it appears you cannot take a
weakref to a torch.Event.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163168
Approved by: https://github.com/Lucaskabela, https://github.com/jansel
2025-09-22 22:11:09 +00:00
60c2bdedcd Replace Literal[None] with None in typing (#163489)
This PR replaces Literal[None] with None in typing.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163489
Approved by: https://github.com/Skylion007, https://github.com/mlazos
2025-09-22 22:10:08 +00:00
b756b580fb Improve fake tensor leakage detection in export by not relying on gc too much (#163516)
Previously we relied on gc to take a snapshot of fake tensors before and after export, in order to get the list of fake tensors created during export. This caused some flakiness in our test suite (https://github.com/pytorch/pytorch/issues/162232). It seems super hard to make gc deterministic, so we just instrument fake tensor creation instead, which seems a lot better. In addition, it is also quite a bit faster than the previous approach because we are no longer manually triggering the garbage collector.
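
A toy illustration of the difference (a plain Python class stands in for FakeTensor; this is not the export code):

```python
import gc

class Fake:  # stand-in for FakeTensor
    created = []

    def __init__(self):
        Fake.created.append(self)   # creation-time instrumentation

def gc_snapshot():
    # What this returns depends on what the collector happens to be tracking and
    # on when it last ran, which is what made the snapshot diff flaky.
    return {id(o) for o in gc.get_objects() if isinstance(o, Fake)}

before = gc_snapshot()
made_during_export = [Fake() for _ in range(3)]
after = gc_snapshot()

print(len(after - before))   # gc-based diff of snapshots
print(len(Fake.created))     # instrumented count: exactly the objects created
```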

Differential Revision: [D82966648](https://our.internmc.facebook.com/intern/diff/D82966648)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163516
Approved by: https://github.com/ezyang
2025-09-22 22:04:24 +00:00
e0cbab46ad [Inductor] avoid CUDA__equal when constant tensors are from different device (#163529)
Summary:
Otherwise, we may hit:
```
Exception: Expected all tensors to be on the same device, but got other is on cuda:0, different from other tensors on cpu (when checking argument in method wrapper_CUDA__equal)
```

Test Plan: UTs

Reviewed By: yushangdi

Differential Revision: D82974062

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163529
Approved by: https://github.com/yushangdi, https://github.com/Skylion007
2025-09-22 22:04:11 +00:00
4fc271e559 [inductor] Don't require_dense for grid_sampler_2d_backward (#163415)
Fixes #163372

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163415
Approved by: https://github.com/Skylion007
ghstack dependencies: #163386, #163398, #163387, #163414
2025-09-22 21:53:01 +00:00
c8fd2b45e5 [inductor] Skip test_baddmm on XPU (#163414)
Fixes #161484
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163414
Approved by: https://github.com/Skylion007
ghstack dependencies: #163386, #163398, #163387
2025-09-22 21:53:01 +00:00
a1bd9248eb [inductor] Fallback on strided complex add (#163387)
Fixes #163243
Fixes #162561

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163387
Approved by: https://github.com/eellison
ghstack dependencies: #163386, #163398
2025-09-22 21:52:53 +00:00
36c2a1325c [inductor] Fix bug where viewed outputs get padded (#163398)
Fixes #163328

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163398
Approved by: https://github.com/eellison
ghstack dependencies: #163386
2025-09-22 21:52:45 +00:00
7ea8998c0b Better decomp for torch.eye (#163386)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163386
Approved by: https://github.com/eellison
2025-09-22 21:52:37 +00:00
2b036632ca Allow add_persistent_r_block to scale up rblock up to a limit (#162296)
<img width="654" height="392" alt="Screenshot 2025-09-18 at 4 22 53 PM" src="https://github.com/user-attachments/assets/975650ec-f769-43a6-bdf5-2885a8d40d3c" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162296
Approved by: https://github.com/eellison
2025-09-22 21:41:46 +00:00
0256f91558 [BUG] MaxUnpool2d/3d should check output dim before accessing its elements (#163507)
Fixes #163409
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163507
Approved by: https://github.com/malfet, https://github.com/Skylion007
2025-09-22 21:36:48 +00:00
da05aa7a9d [BE] Use output_t directly (#163518)
Rather than deref the safe tensor wrapped in `TensorArg`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163518
Approved by: https://github.com/Skylion007
2025-09-22 21:33:42 +00:00
e558f7a222 [vllm hash update] update the pinned vllm hash (#163463)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned vllm hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163463
Approved by: https://github.com/pytorchbot

Co-authored-by: Huy Do <huydhn@gmail.com>
2025-09-22 21:24:56 +00:00
09cb34c1dc [RELAND] Always build USE_DISTRIBUTED (#160449) and Make distributed modules importable even when backend not built (#159889) (#162594)
Summary:
Original: D81957844 and D81957923

Also, https://github.com/pytorch/pytorch/pull/162142 is patched in as well

#buildall

Test Plan:
sandcastle and oss ci

Rollback Plan:

Reviewed By: H-Huang

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162594
Approved by: https://github.com/H-Huang, https://github.com/dcci
2025-09-22 21:12:18 +00:00
4027e97791 [BE] Delete skipIfMPSOnMacOS13 (#163515)
As PyTorch needs MacOS-14 or newer to use MPS
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163515
Approved by: https://github.com/Skylion007
2025-09-22 21:10:22 +00:00
8e62d01f7a Add dynamic shapes doc (#159428)
This PR adds new Dynamic Shapes documentation and expands on the existing one.
- Adds a new structure with Intro, Core Concepts, Troubleshooting

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159428
Approved by: https://github.com/bobrenjc93

Co-authored-by: bobrenjc93 <bobren@meta.com>
2025-09-22 21:01:27 +00:00
8abc2af9b9 [STABLE ABI] Add clone method to torch::stable::Tensor (#161896)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161896
Approved by: https://github.com/janeyx99
2025-09-22 20:39:24 +00:00
02da4753f5 Triton template IMA reads on B200 (#163460)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163460
Approved by: https://github.com/eqy, https://github.com/alexsamardzic
2025-09-22 20:34:39 +00:00
cf28ab2c88 remove allow-untyped-defs from ./torch/ao/quantization/pt2e/duplicate_dq_pass.py (#163470)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163470
Approved by: https://github.com/aorenste
ghstack dependencies: #163246, #163469
2025-09-22 20:29:09 +00:00
46e1b7d70b remove allow-untyped-defs from ./torch/utils/data/datapipes/iter/fileopener.py (#163469)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163469
Approved by: https://github.com/aorenste, https://github.com/Skylion007
ghstack dependencies: #163246
2025-09-22 20:29:09 +00:00
e065d35fd3 [BE]: Add a few more missing move from return indices (#163456)
@ezyang A follow-up where I found a few more missing returns of this style in the codebase. Follow-up to #163416

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163456
Approved by: https://github.com/cyyever, https://github.com/albanD
2025-09-22 20:24:23 +00:00
fd785b1762 Add NestedTensor dispatch for _is_any_true/_is_all_true (#162096)
Fixes: https://github.com/pytorch/pytorch/issues/161818

### Summary
Add NestedTensor support for `_is_any_true` and `_is_all_true`.

### Changes
- Register dispatch for `aten._is_any_true.default` and
  `aten._is_all_true.default`
- Add CPU tests:
  - `test_is_any_true_jagged`: dispatch_matches_values_buffer,
    all_false_returns_false, one_true_returns_true
  - `test_is_all_true_jagged`: dispatch_matches_values_buffer,
    all_true_returns_true, any_false_returns_false

### Testing

Before Fix:

`pytest -q test/test_nestedtensor.py -k "test_is_any_true_jagged or test_is_all_true_jagged" -v`

Output:
```
FAILED [0.0129s] test/test_nestedtensor.py::TestNestedTensorDeviceTypeCPU::test_is_all_true_jagged_cpu - NotImplementedError: aten._is_all_true.default
FAILED [0.0007s] test/test_nestedtensor.py::TestNestedTensorDeviceTypeCPU::test_is_any_true_jagged_cpu - NotImplementedError: aten._is_any_true.default
```

After Fix:

`pytest -q test/test_nestedtensor.py -k "test_is_any_true_jagged or test_is_all_true_jagged" -v`

Output:

```
Running 2 items in this shard

test/test_nestedtensor.py::TestNestedTensorDeviceTypeCPU::test_is_all_true_jagged_cpu PASSED [0.0277s]                                                                                                                               [ 50%]
test/test_nestedtensor.py::TestNestedTensorDeviceTypeCPU::test_is_any_true_jagged_cpu PASSED [0.0013s]
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162096
Approved by: https://github.com/jbschlosser
2025-09-22 20:22:44 +00:00
d0086708dd [triton] update 3.5 pin to bbb06c0334a6772b92d24bde54956e675c8c6604 (#163382)
Includes:
* https://github.com/triton-lang/triton/pull/8211 to work around a PTXAS bug that was causing 03-matrix-multiplication tutorial matmuls to underperform due to excessive WGMMA waits
* https://github.com/triton-lang/triton/pull/8157 to fix a convert_layout bug

Verified that this passes Triton CI in https://github.com/pytorch/pytorch/pull/159158 and improves gemm perf (see https://github.com/pytorch/pytorch/issues/159704)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163382
Approved by: https://github.com/Camyll, https://github.com/atalman
2025-09-22 20:20:59 +00:00
6f9aef5fef [2/n] Support module.to("cuda:0") in FakeTensorMode on cuda-less machine (#163433)
Summary:
To support exporting a CUDA model on a CPU-only machine under fake tensor mode,
users commonly need to move sample inputs to the CUDA device with a .to("cuda:0") or .to("cuda") call.
This diff supports that.

I expect the following pattern to work

```
with FakeTensorMode(allow_non_fake_inputs=True):
    cuda_module = module.to("cuda:0")
    cuda_sample_inputs = tuple([x.to("cuda:0") for x in sample_inputs])

    with torch.no_grad():
        ep = torch.export.export(cuda_module, cuda_sample_inputs)

```

Before:
Moving module.to("cuda:0") under fake tensor mode left parameters on the `meta` device.

After:
Parameters are on "cuda:0".

Test Plan: buck2 run  fbcode//caffe2/test:fake_tensor -- --r test_move_module

Reviewed By: mikaylagawarecki

Differential Revision: D80102876

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163433
Approved by: https://github.com/albanD
2025-09-22 20:16:32 +00:00
d15048493c [opaque_obj] Add set_payload + docs (#163276)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163276
Approved by: https://github.com/zou3519
ghstack dependencies: #162660
2025-09-22 20:02:29 +00:00
bf28990c3d Add support for NestedTensor share_memory_ (#162272)
Fixes: https://github.com/pytorch/pytorch/issues/161915

### Summary

Implements share_memory_() support for NestedTensor!

### Changes

- Added share_memory_() method to NestedTensor class.
  - Shares storage for all NestedTensor components: _values, _offsets, _lengths, and cached seqlen tensors.
  - Guard for CUDA Tensors.
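
A minimal sketch of the idea, using the component names listed above (the cached seqlen tensors are omitted and the CUDA guard is simplified; this is not the actual method):

```python
# Illustrative sketch only: move each jagged component's storage into shared memory.
def share_memory_sketch(nt):
    if nt.is_cuda:
        return nt  # guard: CUDA tensors are handled separately (simplification)
    for component in (nt._values, nt._offsets, nt._lengths):
        if component is not None:
            component.share_memory_()
    return nt
```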

### Testing

Before Fix:

`pytest -q test/test_nestedtensor.py -k "test_share_memory" -v`

Output:

```
Running 1 items in this shard

test/test_nestedtensor.py Fatal Python error: Segmentation fault
```

After Fix:

`pytest -q test/test_nestedtensor.py -k "test_share_memory" -v`

Output:

```
Running 1 items in this shard

test/test_nestedtensor.py::TestNestedTensorDeviceTypeCPU::test_share_memory_cpu PASSED [0.0753s]
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162272
Approved by: https://github.com/jbschlosser
2025-09-22 19:59:58 +00:00
eaa613bf66 Revert "[opaque_obj] Add set_payload + docs (#163276)"
This reverts commit dd30667f6c2204a15e91eaeb61c84f9080be7748.

Reverted https://github.com/pytorch/pytorch/pull/163276 on behalf of https://github.com/ZainRizvi due to Sorry but this fails lint on trunk: [GH job link](https://github.com/pytorch/pytorch/actions/runs/17924886989/job/50968430537) [HUD commit link](dd30667f6c) ([comment](https://github.com/pytorch/pytorch/pull/163276#issuecomment-3321054061))
2025-09-22 19:32:30 +00:00
1818c36d6e [Fix] Restrict stride normalization to 1D tensors on export (#163282)
This change restricts the DLPack stride normalization to apply only to 1D tensors of shape (1,).

### Rationale
The previous implementation normalized the strides for any multi-dimensional tensor containing a dimension of size 1. While well-intentioned, this "over-normalization" discards critical memory layout information, causing issues for downstream consumers who rely on strides to infer alignment and contiguity.

For example:

* A row-major tensor with `shape=(1, 128)` and `stride=(128, 1)` would be incorrectly normalized to `stride=(1, 1)`.

* A column-major tensor with `shape=(1024, 1)` and `stride=(1, 1024)` would also be normalized to `stride=(1, 1)`.

This loss of stride information makes it impossible for consumers to detect the original memory layout (e.g., row-major vs. column-major) and breaks assumptions about memory alignment needed for optimized indexing or specialized hardware APIs like GPU TMA.

The original intent of the normalization was to handle the simple case of a 1D tensor with shape=(1,) and a non-standard stride. This fix reverts to that specific, non-problematic behavior, ensuring that multi-dimensional tensors retain their precise stride information during DLPack export.
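
As a rough illustration of the narrowed rule (a minimal sketch over plain shape/stride tuples, not the exporter code itself):

```python
# Illustrative sketch only: normalize strides solely for the 1-D shape-(1,) case,
# and leave multi-dimensional tensors untouched so the layout stays recoverable.
def normalize_strides(shape, strides):
    if len(shape) == 1 and shape[0] == 1:
        return (1,)
    return strides

assert normalize_strides((1,), (4,)) == (1,)                  # 1-D special case
assert normalize_strides((1, 128), (128, 1)) == (128, 1)      # row-major preserved
assert normalize_strides((1024, 1), (1, 1024)) == (1, 1024)   # column-major preserved
```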

### Related Issues
#163274

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163282
Approved by: https://github.com/eqy
2025-09-22 19:10:05 +00:00
7e9781174c Fix lint (#163542)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163542
Approved by: https://github.com/malfet
2025-09-22 19:10:00 +00:00
4941719061 Enable logging for absolute memory estimation (#158799)
Summary: Update the Auto AC logging so that it also provides the *absolute* memory estimations for each node.

Test Plan:
(aps-gem_omnifm_v2_mwb_dynamic_005_budget-f23a84c3d8): https://fburl.com/ai_infra/0r738h5r

{F1980393481}

* Memory Recorded in bytes

---

```
buck2 test //caffe2/test/functorch:test_ac_logging
```
https://www.internalfb.com/intern/testinfra/testrun/14918173863021573

Rollback Plan:

Differential Revision: D78580107

Pull Request resolved: https://github.com/pytorch/pytorch/pull/158799
Approved by: https://github.com/jansel
2025-09-22 18:36:49 +00:00
dd30667f6c [opaque_obj] Add set_payload + docs (#163276)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163276
Approved by: https://github.com/zou3519
ghstack dependencies: #162660
2025-09-22 18:30:28 +00:00
3be9c86c74 [opaque obj] Initial OpaqueObject (#162660)
A big pain point people have with custom ops is that they do not accept arbitrary inputs/outputs. In this PR we create the concept of an "OpaqueObject", which allows users to pass arbitrary Python objects into custom operators.

Some still slightly annoying parts with this implementation:
- The schema of the operator is `__torch__.torch.classes.aten.OpaqueObject` instead of whatever python type
- `@torch.library.custom_op` doesn't work.. yet?

UX:
```python
from torch._library.opaque_object import make_opaque, get_payload

# your custom python class
class OpaqueQueue:
    def __init__(self, queue: list[torch.Tensor], init_tensor_: torch.Tensor) -> None:
        super().__init__()
        self.queue = queue
        self.init_tensor_ = init_tensor_

    def push(self, tensor: torch.Tensor) -> None:
        self.queue.append(tensor)

    def pop(self) -> torch.Tensor:
        if len(self.queue) > 0:
            return self.queue.pop(0)
        return self.init_tensor_

    def size(self) -> int:
        return len(self.queue)

queue = OpaqueQueue([], torch.zeros(3))
obj: torch._C.ScriptObject = make_opaque(queue)

# obj.payload stores a direct reference to this python queue object
self.assertEqual(get_payload(obj), queue)

# This is able to be passed through the dispatcher
torch.ops._TestOpaqueObject.queue_push(obj, torch.ones(3))
self.assertTrue(queue.size(), 1)
```

Authoring a custom op:

```python
lib = torch.library.Library("_TestOpaqueObject", "FRAGMENT")

torch.library.define(
    f"_TestOpaqueObject::queue_push",
    "(__torch__.torch.classes.aten.OpaqueObject a, Tensor b) -> ()",
    tags=torch.Tag.pt2_compliant_tag,
    lib=lib,
)

@torch.library.impl(f"{libname}::queue_push", "CompositeExplicitAutograd", lib=lib)
def push_impl(q: torch._C.ScriptObject, b: torch.Tensor) -> None:
    # We can get the payload directly by get_payload(q)
    queue = get_payload(q)
    assert isinstance(queue, OpaqueQueue)
    queue.push(b)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162660
Approved by: https://github.com/zou3519
2025-09-22 18:30:28 +00:00
bec967eaa4 Remove C++ and test branches for CUDA<12 (#163443)
Remove conditional branches for CUDA<12.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163443
Approved by: https://github.com/eqy
2025-09-22 18:20:08 +00:00
d279a6a6f1 ci: Add a way to lint all files in a PR from label (#163525)
Signed-off-by: Eli Uriegas <eliuriegas@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163525
Approved by: https://github.com/ZainRizvi
2025-09-22 18:06:39 +00:00
281f8f407e Combine strong and weak refcounts in intrusive_ptr in a single refcount (#163394)
Summary:
Currently, we assume that refcount_ and weakcount_ are always stored at an 8-byte-aligned address right next to each other. Based on this assumption, we load 8 bytes in intrusive_ptr::reset_ to check the values of both counts. However, that assumption is not part of the C++ language standard, so it's essentially undefined behavior.

This change eliminates that assumption by combining refcount_ and weakcount_ into a single 64-bit count, using the lower 32 bits for refcount_ and the upper 32 bits for weakcount_.

In addition to eliminating the undefined behavior, the change also eliminates the read of weakcount_ after decrementing refcount_ in intrusive_ptr::reset_. This claws back lost performance introduced in https://github.com/pytorch/pytorch/pull/162784 for non-final refcount_ decrementing.
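
The packing can be sketched in plain Python (illustration only; the actual change uses a single atomic 64-bit counter in the C++ intrusive_ptr implementation):

```python
REF_MASK = (1 << 32) - 1

def pack(refcount, weakcount):
    # refcount in the lower 32 bits, weakcount in the upper 32 bits
    return (weakcount << 32) | (refcount & REF_MASK)

def unpack(combined):
    return combined & REF_MASK, combined >> 32

assert unpack(pack(3, 1)) == (3, 1)
```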

Reviewed By: yfeldblum

Differential Revision: D82869192

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163394
Approved by: https://github.com/Skylion007
2025-09-22 17:53:28 +00:00
5e7be98800 [BE] Update Python min version to 3.10 (#162310)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162310
Approved by: https://github.com/atalman, https://github.com/Skylion007, https://github.com/ZainRizvi
2025-09-22 17:04:21 +00:00
06fe5b9025 [AOTI] fix TestAOTInductorPackage temp file locked handler. (#163499)
Fix the `check_model` function of the common class `TestAOTInductorPackage` in `test\inductor\test_aot_inductor_package.py`: the temp-file handle stayed locked on Windows, which caused the C++ backend to fail to open the file:
```cmd
FAILED [4.5918s] test/inductor/test_aot_inductor_package.py::TestAOTInductorPackage_cpu::test_add - RuntimeError: File C:/Users/Xuhan/AppData/Local/Temp/tmp21sjnnhl.pt2 cannot be opened.
FAILED [4.1703s] test/inductor/test_aot_inductor_package.py::TestAOTInductorPackage_cpu::test_bool_input - RuntimeError: File C:/Users/Xuhan/AppData/Local/Temp/tmp5kd3apub.pt2 cannot be opened.
FAILED [4.2266s] test/inductor/test_aot_inductor_package.py::TestAOTInductorPackage_cpu::test_linear - RuntimeError: File C:/Users/Xuhan/AppData/Local/Temp/tmpkyy3pxow.pt2 cannot be opened.
FAILED [4.2134s] test/inductor/test_aot_inductor_package.py::TestAOTInductorPackage_cpu::test_metadata - RuntimeError: File C:/Users/Xuhan/AppData/Local/Temp/tmphyer7wi9.pt2 cannot be opened.
......
```

Fix it via `WritableTempFile`, which releases the file handle so the backend can use the file.
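
A rough sketch of the pattern (the helper name and signature here are illustrative, not the actual `WritableTempFile` implementation): the Python handle is closed before yielding the path, so the C++ backend can open the file on Windows, and the file is removed afterwards.

```python
import contextlib
import os
import tempfile

@contextlib.contextmanager
def writable_temp_file(suffix=".pt2"):
    # NamedTemporaryFile keeps the file locked on Windows while the handle is open,
    # so create it with delete=False and close the handle before handing out the path.
    f = tempfile.NamedTemporaryFile(suffix=suffix, delete=False)
    try:
        f.close()
        yield f.name
    finally:
        os.unlink(f.name)
```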

After the fix:

<img width="1904" height="176" alt="image" src="https://github.com/user-attachments/assets/e71b3182-0204-497b-9aca-cbbb33bc4687" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163499
Approved by: https://github.com/jansel, https://github.com/desertfire
2025-09-22 16:54:18 +00:00
9ca183e933 switch from stack based to graph based approach (#163459)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163459
Approved by: https://github.com/bobrenjc93
ghstack dependencies: #163417
2025-09-22 16:41:35 +00:00
e310cc5e06 Update fbgemm submodule (#163411)
Test Plan:

As titled; includes some new fbgemm changes to see if the CUDA 13 breakage is fixed.

Reviewers:

Subscribers:

Tasks:

Tags:

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163411
Approved by: https://github.com/Skylion007
2025-09-22 15:46:11 +00:00
eaac218b64 [ROCm] Fix environment variable AOTRITON_INSTALLED_PREFIX (#163373)
Early assignment of `__AOTRITON_LIB` breaks the usage of environment variable `$AOTRITON_INSTALLED_PREFIX`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163373
Approved by: https://github.com/pruthvistony, https://github.com/jeffdaily
2025-09-22 15:01:18 +00:00
509c4e8627 Update cutlass version for fbcode (#163091)
Differential Revision: [D82567751](https://our.internmc.facebook.com/intern/diff/D82567751/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163091
Approved by: https://github.com/drisspg
2025-09-22 14:31:11 +00:00
10adeb9044 Revert "[BE] Update Python min version to 3.10 (#162310)"
This reverts commit 9f5a644f0768258bc81f8b38492754d297399f74.

Reverted https://github.com/pytorch/pytorch/pull/162310 on behalf of https://github.com/malfet due to Broke lint, but to the best of my knowledge it's no longer possible to run lint for all files on PRs ([comment](https://github.com/pytorch/pytorch/pull/162310#issuecomment-3319289031))
2025-09-22 14:13:59 +00:00
9f5a644f07 [BE] Update Python min version to 3.10 (#162310)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162310
Approved by: https://github.com/atalman, https://github.com/Skylion007, https://github.com/ZainRizvi
2025-09-22 13:37:02 +00:00
60b4791d08 [MPS] Fix compile linalg inv (#163452)
Fixes #161969

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163452
Approved by: https://github.com/Skylion007
2025-09-22 10:36:52 +00:00
96a3afb8ec Simplify BFLOAT16_AVAILABLE (#163445)
Simplify `BFLOAT16_AVAILABLE` by using `torch.cuda.is_bf16_supported()`  and `torch.xpu.is_bf16_supported()`. Outdated comments are also removed.
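
A minimal sketch of the simplified check (assumed form, not necessarily the exact helper in the test suite):

```python
import torch

# Sketch: rely on the built-in capability queries instead of hand-rolled checks.
BFLOAT16_AVAILABLE = (
    torch.cuda.is_available() and torch.cuda.is_bf16_supported()
) or (
    hasattr(torch, "xpu")
    and torch.xpu.is_available()
    and getattr(torch.xpu, "is_bf16_supported", lambda: False)()
)
```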

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163445
Approved by: https://github.com/Skylion007, https://github.com/kwen2501
2025-09-22 07:31:46 +00:00
edafc902d7 Revert "[BE] Make PyObjectSlot use a global PyInterpreter (#162659)"
This reverts commit d1993c27ae59842c887d549a3f8936fbcd769498.

Reverted https://github.com/pytorch/pytorch/pull/162659 on behalf of https://github.com/wdvr due to reverted internally, please see D82771705 @PaliC ([comment](https://github.com/pytorch/pytorch/pull/162659#issuecomment-3317110247))
2025-09-22 06:22:37 +00:00
ae5be038a6 Revert "Delete functorch C extension entirely. (#163340)"
This reverts commit 1faf6367e396b1d0894e8735912a47ac465f469d.

Reverted https://github.com/pytorch/pytorch/pull/163340 on behalf of https://github.com/wdvr due to temporary revert to pull out #162659 ([comment](https://github.com/pytorch/pytorch/pull/163340#issuecomment-3317105243))
2025-09-22 06:20:04 +00:00
f0078941cf Revert "[RELAND] Always build USE_DISTRIBUTED (#160449) and Make distributed modules importable even when backend not built (#159889) (#162594)"
This reverts commit 6c334885d48725197b5d35e2c1543efc0f4198d0.

Reverted https://github.com/pytorch/pytorch/pull/162594 on behalf of https://github.com/wdvr due to reverted internally - @ezyang see D82281294 ([comment](https://github.com/pytorch/pytorch/pull/162594#issuecomment-3317017530))
2025-09-22 05:39:07 +00:00
3a7db34cf9 Revert "[SymmMem] Promote @requires_nvshmem instead of enable_triton (#163423)"
This reverts commit 5d8a226e23339e7243a2a84afd174f685f145b68.

Reverted https://github.com/pytorch/pytorch/pull/163423 on behalf of https://github.com/wdvr due to temporary reverting to back out #162594 ([comment](https://github.com/pytorch/pytorch/pull/163423#issuecomment-3317011500))
2025-09-22 05:35:41 +00:00
281bb56cc5 Enable half precision types on test_conv_cudnn_nhwc_support (#163444)
This PR adds float16 and bfloat16 cases to `test_conv_cudnn_nhwc_support` and removes outdated comments.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163444
Approved by: https://github.com/Skylion007
2025-09-22 04:11:20 +00:00
01f927eb40 Remove workarounds for Python 3.6 (#163440)
This PR removes tuple unpacking workarounds for Py 3.6 from two distributed files.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163440
Approved by: https://github.com/ezyang
2025-09-22 04:08:04 +00:00
0b59492853 [export] Fix wrap_with_set_grad_enabled retracing (#163295)
Fixes https://github.com/pytorch/pytorch/issues/163294

The code `with torch.set_grad_enabled(enable_grad)` calls `torch._C._set_grad_enabled` three times -- (1) when [initializing set_grad_enabled](bb7c9a2d41/torch/autograd/grad_mode.py (L187C9-L187C35)), (2) when [entering the context](bb7c9a2d41/torch/autograd/grad_mode.py (L194)), and (3) when [exiting the context](bb7c9a2d41/torch/autograd/grad_mode.py (L197)).

This results in the retraced export module having a duplicate `torch._C._set_grad_enabled`, like:
```
def forward(self, arg0_1):
    add = torch.ops.aten.add.Tensor(arg0_1, 1);  arg0_1 = None
    _set_grad_enabled = torch._C._set_grad_enabled(False);  _set_grad_enabled = None
    _set_grad_enabled = torch._C._set_grad_enabled(False);  _set_grad_enabled = None
    add_1 = torch.ops.aten.add.Tensor(add, 2);  add = None
    _set_grad_enabled_1 = torch._C._set_grad_enabled(True);  _set_grad_enabled_1 = None
    add_2 = torch.ops.aten.add.Tensor(add_1, 3);  add_1 = None
    return (add_2,)
```

When export runs the `replace_set_grad_with_hop_pass`, it will look through the graph for `torch._C._set_grad_enabled` and create subgraphs. The duplicate `torch._C._set_grad_enabled` results in an empty submod in the graph, which resulted in an error in [this post](https://fb.workplace.com/groups/1028545332188949/posts/1844720036398281/?comment_id=1862175381319413).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163295
Approved by: https://github.com/yushangdi
2025-09-21 22:54:40 +00:00
8a281d7214 [submodule] Bump libfmt to 12.0.0 (#163441)
libfmt 12.0 brings new optimisations and fixes some compilation issues for clang 21 (https://github.com/fmtlib/fmt/pull/4477).
For a detailed release log, see https://github.com/fmtlib/fmt/releases/tag/12.0.0
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163441
Approved by: https://github.com/Skylion007
2025-09-21 22:37:25 +00:00
6ac2b3ae35 [BE] Adding aliases for CUDA and XPU API documentation (#162984)
This PR reorganizes CUDA and XPU API documentation with additional aliases pages. Multiple entries of APIs under torch.cuda are thus removed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162984
Approved by: https://github.com/janeyx99
2025-09-21 22:28:27 +00:00
8b14f43da9 [torch] DRY a couple of lines in unpickler (#163447)
Test Plan: CI.

Reviewed By: dolpm

Differential Revision: D82660989

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163447
Approved by: https://github.com/Skylion007
2025-09-21 20:29:33 +00:00
4d3d32f14c Add torchfuzz initial impl. (#163417)
All details are in readme.md.
Note: one thing I want to do soon is to switch to a graph representation instead of a stack representation
for the fuzzed ops; it should make things easier as things get more complicated.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163417
Approved by: https://github.com/bobrenjc93
2025-09-21 19:17:54 +00:00
5599f487ef Fully native DTensor.__new__ (#162508)
Move the entirety of `__new__` into C++, saving a layer of disable_dynamo and making progress toward all-C++.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162508
Approved by: https://github.com/ezyang
ghstack dependencies: #161695
2025-09-21 18:36:05 +00:00
51152efa67 Remove autograd code for Python < 3.9 (#163313)
As PyTorch is moving to Python 3.10, it is safe to remove code for Python < 3.9.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163313
Approved by: https://github.com/ezyang
2025-09-21 15:35:06 +00:00
f34744d2a5 [inductor] bugfix: keep WeakDeps (WAR deps) during fusion (#162316)
Fixes #159855. The bug was not triggered in other tests since it took
more than one round of fusion to reach the problematic code
that prunes WeakDeps. WeakDeps are important to inhibit
fusion of kernels that read/write data into mutated buffers
with different indexing.

We modify the code to a) always prune before fusion, rather
than after, which improves its coverage and makes our basic
vertical fusion tests surface this issue as well and b)
check whether the weak dep is fusable before eliminating it
(which basically means checking that the producing code and
the consuming code are sufficiently compatible).

The test that triggers this with change (a) is
test_fusing_write_into_disjoint_read, introduced in #118210.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162316
Approved by: https://github.com/eellison, https://github.com/mlazos, https://github.com/shunting314
2025-09-21 13:08:11 +00:00
5d8a226e23 [SymmMem] Promote @requires_nvshmem instead of enable_triton (#163423)
### Issue
The previous `enable_triton` UI required the user-defined Triton kernel to have "nvshmem" in its name.
If users did not do so, the kernel would miss the NVSHMEM init and silently hit CUDA IMA.

The `@requires_nvshmem` decorator eliminates the above name requirement (and the `enable_triton` call).

### Usage:
```
@requires_nvshmem
@triton.jit
def foo(...):
    ...

foo[(1, 1)](...)
```
It also removes the need to pass `extern_lib` to `foo` (this is handled by the decorator now).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163423
Approved by: https://github.com/ngimel
ghstack dependencies: #163025, #163152, #163194
2025-09-21 10:03:20 +00:00
d8cbbc0f70 [Easy][AMP] Refactor the AMP logic for getting dtype (#162796)
As the title stated.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162796
Approved by: https://github.com/ezyang
2025-09-21 06:32:35 +00:00
9ba918082a Add api info for torch._C._nn.pyi (#162707)
Fix part of #148404

APis involved are as followed:

- multilabel_margin_loss
- multi_margin_loss
- nll_loss_nd
- relu6
- relu6_

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162707
Approved by: https://github.com/ezyang
2025-09-21 06:17:15 +00:00
1faf6367e3 Delete functorch C extension entirely. (#163340)
Signed-off-by: Edward Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163340
Approved by: https://github.com/aorenste
ghstack dependencies: #160236
2025-09-21 06:02:21 +00:00
4a96a6fa4a [Docs] Fix indentations in cond.md (#156147)
This is a follow-up PR to fix indentations mentioned by https://github.com/pytorch/pytorch/pull/155653#issuecomment-2971660356

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156147
Approved by: https://github.com/svekars, https://github.com/cyyever
2025-09-21 05:50:50 +00:00
f591bb5056 Remove data_source argument from Sampler (#163134)
`data_source` was declared to be removed in PT 2.2 but never actually was.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163134
Approved by: https://github.com/ezyang
2025-09-21 05:44:41 +00:00
1ca9445229 [BE][Ez]: Prevent copies of std::vector in CUDA ForeachOps (#163416)
No need for unnecessary copies of std::vector. This tensor list is copied throughout the foreach paths, and this code is on a hot path for torch optimizers. Automatic move elision will not happen on the return statement since the value is a subelement of a vector that needs to be copied out before the std::vector is destroyed. This should eliminate quite a few list copies along this path.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163416
Approved by: https://github.com/ezyang
2025-09-21 05:24:13 +00:00
5b386ee16e [vllm hash update] update the pinned vllm hash (#163392)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned vllm hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163392
Approved by: https://github.com/pytorchbot
2025-09-21 04:34:14 +00:00
97eb7a281d torchdim Python port (#160236)
The big semantic change (and the reason for this port) is that we no longer monkeypatch Tensor with torchdim's special methods. The new algorithm for handling dispatch is that we first land in `__torch_function__` and see whether a special FCD implementation needs to be dispatched to first; if there is none, we fall back to the standard level strategy.
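
A minimal sketch of that dispatch shape (illustrative only, not the torchdim code; the override table here is a stand-in):

```python
import torch

_FCD_OVERRIDES = {}  # hypothetical table mapping torch ops to first-class-dim impls

class FCDTensorSketch(torch.Tensor):
    @classmethod
    def __torch_function__(cls, func, types, args=(), kwargs=None):
        kwargs = kwargs or {}
        override = _FCD_OVERRIDES.get(func)
        if override is not None:
            return override(*args, **kwargs)
        # no special implementation registered: fall back to the standard strategy
        return super().__torch_function__(func, types, args, kwargs)
```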

Because there is no longer a C binding equivalent of the classes, we've condensed _C.Dim and Dim together, and similarly for Tensor. This resulted in some bugs, as the Python API is sometimes different from the C API. I've attempted to disambiguate these, but there may still be mistakes (many early bugs were due to this problem). Dim and DimEntry are especially painful, as Dim must abide by Tensor equality semantics but uses pointer equality in C (DimEntry doesn't have this problem). Another subtle difference between C and Python is that we no longer get implicit conversions from Dim to DimEntry, which also caused some bugs.

Much of the mechanical porting work was done by claude code. I have a separate PR that deletes functorch._C, but it was useful to have dim.cpp to point claude at, so I haven't done it in this PR. From a reviewing perspective, I need to re-check that I didn't forget to port anything; some noticeably missing "small" things are patched_dim_method. I am still in the process of carefully doing a side-by-side review of the ports; "simplifications" from claude code were also a major source of bugs.

There are two major feature gaps in the implementation:

- DelayedTensor and dot handling are not implemented yet. This should be reasonably easy; it just needs to be done. However, for the purposes of sharded propagation it is actually better not to reconstruct matmuls.
- Splitting dimensions with an index like `[x, y]` doesn't work. The problem is that `__getitem__` interprets this as advanced indexing and sends the list to torch.tensor to turn into a tensor, instead of being eligible for `__torch_function__`. I think I might need to hard code a special case for this or something?

Signed-off-by: Edward Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160236
Approved by: https://github.com/zdevito, https://github.com/albanD
2025-09-21 03:01:04 +00:00
2887f3fde4 [BE] Slight improvements to documentation in python_dispatch (#162963)
I was briefly confused which way I should iterate stack, here's the
comments I wanted.

Signed-off-by: Edward Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162963
Approved by: https://github.com/albanD, https://github.com/SherlockNoMad
2025-09-21 01:45:46 +00:00
e37b600007 [CUDA][cuBLAS][FP8] Forward-fix #162022 (#163354)
@ngimel is right, `ciflow/h100` doesn't actually appear to test the PR :(

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163354
Approved by: https://github.com/ngimel, https://github.com/Skylion007
2025-09-21 00:55:12 +00:00
8e3fd3d4f9 [AI Codemod][DevmatePerfOptimizationVectorReallocation] fbcode/caffe2/torch/csrc/jit/serialization/unpickler.cpp (#163240)
Reviewed By: marksantaniello, yfeldblum

Differential Revision: D82140619

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163240
Approved by: https://github.com/Skylion007
2025-09-20 23:26:24 +00:00
9e3725e8e5 make fullgraph_capture work on mod, args, kwargs (#162849)
Summary:
Today `fullgraph_capture` takes a frame, but clients usually have a callable (`nn.Module`, function, or method) and example inputs (args and kwargs) and must then explicitly set up the frame to pass. This is boilerplate, and potentially tricky to get right, that can be hidden inside the API.

The original `fullgraph_capture` now becomes `_fullgraph_capture_frame`.

Test Plan:
existing tests

Rollback Plan:

Differential Revision: D82339400

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162849
Approved by: https://github.com/zhxchen17
2025-09-20 22:48:06 +00:00
3938175ec1 [1/n] Support cpu_tensor.to("cuda:0") in FakeTensorMode on cuda-less machine (#160431)
Summary:
To support exporting a CUDA model on a CPU-only machine under fake tensor mode,
users commonly need to move sample inputs to the CUDA device with a .to("cuda:0") call.
This diff supports that.

Notice that .to("cuda") doesn't work yet, as it queries the current device index by calling the CUDA API.

I expect the following pattern to work

```
with FakeTensorMode(allow_non_fake_inputs=True):
    cuda_module = module.to("cuda:0")
    cuda_sample_inputs = tuple([x.to("cuda:0") for x in sample_inputs])

    with torch.no_grad():
        ep = torch.export.export(cuda_module, cuda_sample_inputs)

```

Test Plan:
buck2 run  fbcode//caffe2/test:fake_tensor -- --r test_fake_gpu_no_init

Rollback Plan:

Differential Revision: D80101283

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160431
Approved by: https://github.com/henryoier, https://github.com/ezyang
2025-09-20 21:33:53 +00:00
1056 changed files with 35106 additions and 17678 deletions

View File

@ -15,6 +15,8 @@ fi
# Compress the fatbin with -compress-mode=size for CUDA 13
if [[ "$DESIRED_CUDA" == *"13"* ]]; then
export TORCH_NVCC_FLAGS="-compress-mode=size"
# Bundle ptxas into the cu13 wheel, see https://github.com/pytorch/pytorch/issues/163801
export BUILD_BUNDLE_PTXAS=1
fi
SCRIPTPATH="$( cd -- "$(dirname "$0")" >/dev/null 2>&1 ; pwd -P )"

View File

@ -372,7 +372,7 @@ if __name__ == "__main__":
else:
print("build pytorch without mkldnn backend")
os.system(f"cd /pytorch; {build_vars} python3 setup.py bdist_wheel")
os.system(f"cd /pytorch; {build_vars} python3 -m build --wheel --no-isolation")
if enable_cuda:
print("Updating Cuda Dependency")
filename = os.listdir("/pytorch/dist/")

View File

@ -241,7 +241,7 @@ def wait_for_connection(addr, port, timeout=15, attempt_cnt=5):
try:
with socket.create_connection((addr, port), timeout=timeout):
return
except (ConnectionRefusedError, socket.timeout): # noqa: PERF203
except (ConnectionRefusedError, TimeoutError): # noqa: PERF203
if i == attempt_cnt - 1:
raise
time.sleep(timeout)
@ -442,7 +442,7 @@ def build_torchvision(
if host.using_docker():
build_vars += " CMAKE_SHARED_LINKER_FLAGS=-Wl,-z,max-page-size=0x10000"
host.run_cmd(f"cd vision && {build_vars} python3 setup.py bdist_wheel")
host.run_cmd(f"cd vision && {build_vars} python3 -m build --wheel --no-isolation")
vision_wheel_name = host.list_dir("vision/dist")[0]
embed_libgomp(host, use_conda, os.path.join("vision", "dist", vision_wheel_name))
@ -497,7 +497,7 @@ def build_torchdata(
if host.using_docker():
build_vars += " CMAKE_SHARED_LINKER_FLAGS=-Wl,-z,max-page-size=0x10000"
host.run_cmd(f"cd data && {build_vars} python3 setup.py bdist_wheel")
host.run_cmd(f"cd data && {build_vars} python3 -m build --wheel --no-isolation")
wheel_name = host.list_dir("data/dist")[0]
embed_libgomp(host, use_conda, os.path.join("data", "dist", wheel_name))
@ -553,7 +553,7 @@ def build_torchtext(
if host.using_docker():
build_vars += " CMAKE_SHARED_LINKER_FLAGS=-Wl,-z,max-page-size=0x10000"
host.run_cmd(f"cd text && {build_vars} python3 setup.py bdist_wheel")
host.run_cmd(f"cd text && {build_vars} python3 -m build --wheel --no-isolation")
wheel_name = host.list_dir("text/dist")[0]
embed_libgomp(host, use_conda, os.path.join("text", "dist", wheel_name))
@ -614,7 +614,7 @@ def build_torchaudio(
host.run_cmd(
f"cd audio && export FFMPEG_ROOT=$(pwd)/third_party/ffmpeg && export USE_FFMPEG=1 \
&& ./packaging/ffmpeg/build.sh \
&& {build_vars} python3 setup.py bdist_wheel"
&& {build_vars} python3 -m build --wheel --no-isolation"
)
wheel_name = host.list_dir("audio/dist")[0]
@ -726,7 +726,7 @@ def start_build(
print("Building PyTorch wheel")
build_opts = ""
if pytorch_build_number is not None:
build_opts += f" --build-number {pytorch_build_number}"
build_opts += f" -C--build-option=--build-number={pytorch_build_number}"
# Breakpad build fails on aarch64
build_vars = "USE_BREAKPAD=0 "
if branch == "nightly":
@ -747,7 +747,8 @@ def start_build(
print("build pytorch with mkldnn+acl backend")
build_vars += " USE_MKLDNN=ON USE_MKLDNN_ACL=ON"
host.run_cmd(
f"cd $HOME/pytorch && export ACL_ROOT_DIR=$HOME/ComputeLibrary && {build_vars} python3 setup.py bdist_wheel{build_opts}"
f"cd $HOME/pytorch && export ACL_ROOT_DIR=$HOME/ComputeLibrary && "
f"{build_vars} python3 -m build --wheel --no-isolation{build_opts}"
)
print("Repair the wheel")
pytorch_wheel_name = host.list_dir("pytorch/dist")[0]
@ -763,7 +764,7 @@ def start_build(
else:
print("build pytorch without mkldnn backend")
host.run_cmd(
f"cd pytorch && {build_vars} python3 setup.py bdist_wheel{build_opts}"
f"cd pytorch && {build_vars} python3 -m build --wheel --no-isolation{build_opts}"
)
print("Deleting build folder")
@ -1004,7 +1005,7 @@ if __name__ == "__main__":
install_condaforge_python(host, args.python_version)
sys.exit(0)
python_version = args.python_version if args.python_version is not None else "3.9"
python_version = args.python_version if args.python_version is not None else "3.10"
if args.use_torch_from_pypi:
configure_system(host, compiler=args.compiler, python_version=python_version)

View File

@ -69,7 +69,8 @@ RUN bash ./install_cuda.sh 13.0
ENV DESIRED_CUDA=13.0
FROM ${ROCM_IMAGE} as rocm
ENV PYTORCH_ROCM_ARCH="gfx900;gfx906;gfx908;gfx90a;gfx942;gfx1030;gfx1100;gfx1101;gfx1102;gfx1200;gfx1201"
ARG PYTORCH_ROCM_ARCH
ENV PYTORCH_ROCM_ARCH ${PYTORCH_ROCM_ARCH}
ADD ./common/install_mkl.sh install_mkl.sh
RUN bash ./install_mkl.sh && rm install_mkl.sh
ENV MKLROOT /opt/intel

View File

@ -36,6 +36,12 @@ case ${DOCKER_TAG_PREFIX} in
;;
rocm*)
BASE_TARGET=rocm
PYTORCH_ROCM_ARCH="gfx900;gfx906;gfx908;gfx90a;gfx942;gfx1030;gfx1100;gfx1101;gfx1102;gfx1200;gfx1201"
# add gfx950 conditionally starting in ROCm 7.0
if [[ "$ROCM_VERSION" == *"7.0"* ]]; then
PYTORCH_ROCM_ARCH="${PYTORCH_ROCM_ARCH};gfx950"
fi
EXTRA_BUILD_ARGS="${EXTRA_BUILD_ARGS} --build-arg PYTORCH_ROCM_ARCH=${PYTORCH_ROCM_ARCH}"
;;
*)
echo "ERROR: Unknown docker tag ${DOCKER_TAG_PREFIX}"

View File

@ -84,8 +84,8 @@ fi
_UCX_COMMIT=7836b165abdbe468a2f607e7254011c07d788152
_UCC_COMMIT=430e241bf5d38cbc73fc7a6b89155397232e3f96
if [[ "$image" == *rocm* ]]; then
_UCX_COMMIT=cc312eaa4655c0cc5c2bcd796db938f90563bcf6
_UCC_COMMIT=0c0fc21559835044ab107199e334f7157d6a0d3d
_UCX_COMMIT=29831d319e6be55cb8c768ca61de335c934ca39e
_UCC_COMMIT=9f4b242cbbd8b1462cbc732eb29316cdfa124b77
fi
tag=$(echo $image | awk -F':' '{print $2}')
@ -175,20 +175,6 @@ case "$tag" in
fi
GCC_VERSION=11
VISION=yes
ROCM_VERSION=6.4
NINJA_VERSION=1.9.0
TRITON=yes
KATEX=yes
UCX_COMMIT=${_UCX_COMMIT}
UCC_COMMIT=${_UCC_COMMIT}
if [[ $tag =~ "benchmarks" ]]; then
INDUCTOR_BENCHMARKS=yes
fi
;;
pytorch-linux-noble-rocm-alpha-py3)
ANACONDA_PYTHON_VERSION=3.12
GCC_VERSION=11
VISION=yes
ROCM_VERSION=7.0
NINJA_VERSION=1.9.0
TRITON=yes
@ -196,6 +182,9 @@ case "$tag" in
UCX_COMMIT=${_UCX_COMMIT}
UCC_COMMIT=${_UCC_COMMIT}
PYTORCH_ROCM_ARCH="gfx90a;gfx942;gfx950"
if [[ $tag =~ "benchmarks" ]]; then
INDUCTOR_BENCHMARKS=yes
fi
;;
pytorch-linux-jammy-xpu-n-1-py3)
ANACONDA_PYTHON_VERSION=3.10
@ -262,13 +251,10 @@ case "$tag" in
TRITON_CPU=yes
;;
pytorch-linux-jammy-linter)
# TODO: Use 3.9 here because of this issue https://github.com/python/mypy/issues/13627.
# We will need to update mypy version eventually, but that's for another day. The task
# would be to upgrade mypy to 1.0.0 with Python 3.11
PYTHON_VERSION=3.9
PYTHON_VERSION=3.10
;;
pytorch-linux-jammy-cuda12.8-cudnn9-py3.9-linter)
PYTHON_VERSION=3.9
pytorch-linux-jammy-cuda12.8-cudnn9-py3.10-linter)
PYTHON_VERSION=3.10
CUDA_VERSION=12.8.1
;;
pytorch-linux-jammy-aarch64-py3.10-gcc11)
@ -455,12 +441,3 @@ elif [ "$HAS_TRITON" = "yes" ]; then
echo "expecting triton to not be installed, but it is"
exit 1
fi
# Sanity check cmake version. Executorch reinstalls cmake and I'm not sure if
# they support 4.0.0 yet, so exclude them from this check.
CMAKE_VERSION=$(drun cmake --version)
if [[ "$EXECUTORCH" != *yes* && "$CMAKE_VERSION" != *4.* ]]; then
echo "CMake version is not 4.0.0:"
drun cmake --version
exit 1
fi

View File

@ -1 +1 @@
v2.27.5-1
v2.28.3-1

View File

@ -1 +1 @@
v2.27.7-1
v2.28.3-1

View File

@ -1 +1 @@
5ae38bdb0dc066c5823e34dc9797afb9de42c866
bbb06c0334a6772b92d24bde54956e675c8c6604

View File

@ -42,12 +42,6 @@ EOF
rocm_baseurl="http://repo.radeon.com/rocm/apt/${ROCM_VERSION}"
amdgpu_baseurl="https://repo.radeon.com/amdgpu/${ROCM_VERSION}/ubuntu"
# Special case for ROCM_VERSION == 7.0
if [[ $(ver "$ROCM_VERSION") -eq $(ver 7.0) ]]; then
rocm_baseurl="https://repo.radeon.com/rocm/apt/7.0_alpha2"
amdgpu_baseurl="https://repo.radeon.com/amdgpu/30.10_alpha2/ubuntu"
fi
# Add amdgpu repository
UBUNTU_VERSION_NAME=`cat /etc/os-release | grep UBUNTU_CODENAME | awk -F= '{print $2}'`
echo "deb [arch=amd64] ${amdgpu_baseurl} ${UBUNTU_VERSION_NAME} main" > /etc/apt/sources.list.d/amdgpu.list

View File

@ -12,8 +12,8 @@ function do_install() {
rocm_version_nodot=${rocm_version//./}
# Version 2.7.2 + ROCm related updates
MAGMA_VERSION=a1625ff4d9bc362906bd01f805dbbe12612953f6
# https://github.com/icl-utk-edu/magma/pull/65
MAGMA_VERSION=d6e4117bc88e73f06d26c6c2e14f064e8fc3d1ec
magma_archive="magma-rocm${rocm_version_nodot}-${MAGMA_VERSION}-1.tar.bz2"
rocm_dir="/opt/rocm"

View File

@ -66,15 +66,15 @@ if [ -n "${UBUNTU_VERSION}" ] && [ -n "${GCC_VERSION}" ] && [[ "${GCC_VERSION}"
# Triton needs at least gcc-9 to build
apt-get install -y g++-9
CXX=g++-9 conda_run python setup.py bdist_wheel
CXX=g++-9 conda_run python -m build --wheel --no-isolation
elif [ -n "${UBUNTU_VERSION}" ] && [ -n "${CLANG_VERSION}" ]; then
# Triton needs <filesystem> which surprisingly is not available with clang-9 toolchain
add-apt-repository -y ppa:ubuntu-toolchain-r/test
apt-get install -y g++-9
CXX=g++-9 conda_run python setup.py bdist_wheel
CXX=g++-9 conda_run python -m build --wheel --no-isolation
else
conda_run python setup.py bdist_wheel
conda_run python -m build --wheel --no-isolation
fi
# Copy the wheel to /opt for multi stage docker builds

View File

@ -40,12 +40,16 @@ case ${DOCKER_TAG_PREFIX} in
;;
rocm*)
# we want the patch version of 6.4 instead
if [[ $(ver $GPU_ARCH_VERSION) -eq $(ver 6.4) ]]; then
if [[ "$GPU_ARCH_VERSION" == *"6.4"* ]]; then
GPU_ARCH_VERSION="${GPU_ARCH_VERSION}.2"
fi
BASE_TARGET=rocm
GPU_IMAGE=rocm/dev-ubuntu-22.04:${GPU_ARCH_VERSION}-complete
PYTORCH_ROCM_ARCH="gfx900;gfx906;gfx908;gfx90a;gfx942;gfx1030;gfx1100;gfx1101;gfx1102;gfx1200;gfx1201"
# add gfx950 conditionally starting in ROCm 7.0
if [[ "$GPU_ARCH_VERSION" == *"7.0"* ]]; then
PYTORCH_ROCM_ARCH="${PYTORCH_ROCM_ARCH};gfx950"
fi
DOCKER_GPU_BUILD_ARG="--build-arg PYTORCH_ROCM_ARCH=${PYTORCH_ROCM_ARCH} --build-arg ROCM_VERSION=${GPU_ARCH_VERSION}"
;;
*)

View File

@ -1,71 +0,0 @@
FROM centos:8 as base
ENV LC_ALL en_US.UTF-8
ENV LANG en_US.UTF-8
ENV LANGUAGE en_US.UTF-8
ENV PATH /opt/rh/gcc-toolset-11/root/bin/:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
# change to a valid repo
RUN sed -i 's|#baseurl=http://mirror.centos.org|baseurl=http://vault.centos.org|g' /etc/yum.repos.d/CentOS-Linux-*.repo
# enable to install ninja-build
RUN sed -i 's|enabled=0|enabled=1|g' /etc/yum.repos.d/CentOS-Linux-PowerTools.repo
RUN yum -y update
RUN yum install -y wget curl perl util-linux xz bzip2 git patch which zlib-devel sudo
RUN yum install -y autoconf automake make cmake gdb gcc-toolset-11-gcc-c++
FROM base as openssl
ADD ./common/install_openssl.sh install_openssl.sh
RUN bash ./install_openssl.sh && rm install_openssl.sh
# Install python
FROM base as python
RUN yum install -y openssl-devel zlib-devel bzip2-devel ncurses-devel sqlite-devel readline-devel tk-devel gdbm-devel libpcap-devel xz-devel libffi-devel
ADD common/install_cpython.sh install_cpython.sh
RUN bash ./install_cpython.sh && rm install_cpython.sh
FROM base as conda
ADD ./common/install_conda_docker.sh install_conda.sh
RUN bash ./install_conda.sh && rm install_conda.sh
RUN /opt/conda/bin/conda install -y cmake
FROM base as intel
# Install MKL
COPY --from=python /opt/python /opt/python
COPY --from=python /opt/_internal /opt/_internal
COPY --from=conda /opt/conda /opt/conda
ENV PATH=/opt/conda/bin:$PATH
ADD ./common/install_mkl.sh install_mkl.sh
RUN bash ./install_mkl.sh && rm install_mkl.sh
FROM base as patchelf
ADD ./common/install_patchelf.sh install_patchelf.sh
RUN bash ./install_patchelf.sh && rm install_patchelf.sh
RUN cp $(which patchelf) /patchelf
FROM base as jni
ADD ./common/install_jni.sh install_jni.sh
ADD ./java/jni.h jni.h
RUN bash ./install_jni.sh && rm install_jni.sh
FROM base as libpng
ADD ./common/install_libpng.sh install_libpng.sh
RUN bash ./install_libpng.sh && rm install_libpng.sh
FROM base as final
COPY --from=openssl /opt/openssl /opt/openssl
COPY --from=python /opt/python /opt/python
COPY --from=python /opt/_internal /opt/_internal
COPY --from=intel /opt/intel /opt/intel
COPY --from=conda /opt/conda /opt/conda
COPY --from=patchelf /usr/local/bin/patchelf /usr/local/bin/patchelf
COPY --from=jni /usr/local/include/jni.h /usr/local/include/jni.h
COPY --from=libpng /usr/local/bin/png* /usr/local/bin/
COPY --from=libpng /usr/local/bin/libpng* /usr/local/bin/
COPY --from=libpng /usr/local/include/png* /usr/local/include/
COPY --from=libpng /usr/local/include/libpng* /usr/local/include/
COPY --from=libpng /usr/local/lib/libpng* /usr/local/lib/
COPY --from=libpng /usr/local/lib/pkgconfig /usr/local/lib/pkgconfig
RUN yum install -y ninja-build

View File

@ -43,12 +43,6 @@ case ${image} in
MANY_LINUX_VERSION="2_28_aarch64"
OPENBLAS_VERSION="v0.3.30"
;;
manylinuxcxx11-abi-builder:cpu-cxx11-abi)
TARGET=final
GPU_IMAGE=""
DOCKER_GPU_BUILD_ARG=" --build-arg DEVTOOLSET_VERSION=9"
MANY_LINUX_VERSION="cxx11-abi"
;;
manylinuxs390x-builder:cpu-s390x)
TARGET=final
GPU_IMAGE=s390x/almalinux:8
@ -82,7 +76,7 @@ case ${image} in
;;
manylinux2_28-builder:rocm*)
# we want the patch version of 6.4 instead
if [[ $(ver $GPU_ARCH_VERSION) -eq $(ver 6.4) ]]; then
if [[ "$GPU_ARCH_VERSION" == *"6.4"* ]]; then
GPU_ARCH_VERSION="${GPU_ARCH_VERSION}.2"
fi
TARGET=rocm_final
@ -90,6 +84,10 @@ case ${image} in
DEVTOOLSET_VERSION="11"
GPU_IMAGE=rocm/dev-almalinux-8:${GPU_ARCH_VERSION}-complete
PYTORCH_ROCM_ARCH="gfx900;gfx906;gfx908;gfx90a;gfx942;gfx1030;gfx1100;gfx1101;gfx1102;gfx1200;gfx1201"
# add gfx950 conditionally starting in ROCm 7.0
if [[ "$GPU_ARCH_VERSION" == *"7.0"* ]]; then
PYTORCH_ROCM_ARCH="${PYTORCH_ROCM_ARCH};gfx950"
fi
DOCKER_GPU_BUILD_ARG="--build-arg ROCM_VERSION=${GPU_ARCH_VERSION} --build-arg PYTORCH_ROCM_ARCH=${PYTORCH_ROCM_ARCH} --build-arg DEVTOOLSET_VERSION=${DEVTOOLSET_VERSION}"
;;
manylinux2_28-builder:xpu)

View File

@ -10,6 +10,11 @@ boto3==1.35.42
#Pinned versions: 1.19.12, 1.16.34
#test that import:
build==1.3.0
#Description: A simple, correct Python build frontend.
#Pinned versions: 1.3.0
#test that import:
click
#Description: Command Line Interface Creation Kit
#Pinned versions:
@ -106,14 +111,12 @@ networkx==2.8.8
#Pinned versions: 2.8.8
#test that import: functorch
ninja==1.11.1.3
ninja==1.11.1.4
#Description: build system. Used in some tests. Used in build to generate build
#time tracing information
#Pinned versions: 1.11.1.3
#Pinned versions: 1.11.1.4
#test that import: run_test.py, test_cpp_extensions_aot.py,test_determination.py
numba==0.49.0 ; python_version < "3.9" and platform_machine != "s390x"
numba==0.55.2 ; python_version == "3.9" and platform_machine != "s390x"
numba==0.55.2 ; python_version == "3.10" and platform_machine != "s390x"
numba==0.60.0 ; python_version == "3.12" and platform_machine != "s390x"
#Description: Just-In-Time Compiler for Numerical Functions
@ -134,7 +137,7 @@ numba==0.60.0 ; python_version == "3.12" and platform_machine != "s390x"
#test_nn.py, test_namedtensor.py, test_linalg.py, test_jit_cuda_fuser.py,
#test_jit.py, test_indexing.py, test_datapipe.py, test_dataloader.py,
#test_binary_ufuncs.py
numpy==1.22.4; python_version == "3.9" or python_version == "3.10"
numpy==1.22.4; python_version == "3.10"
numpy==1.26.2; python_version == "3.11" or python_version == "3.12"
numpy==2.1.2; python_version >= "3.13"
@ -169,9 +172,9 @@ pillow==11.0.0
#Pinned versions: 10.3.0
#test that import:
protobuf==5.29.4
protobuf==5.29.5
#Description: Google's data interchange format
#Pinned versions: 5.29.4
#Pinned versions: 5.29.5
#test that import: test_tensorboard.py, test/onnx/*
psutil
@ -326,8 +329,6 @@ pywavelets==1.7.0 ; python_version >= "3.12"
lxml==5.3.0
#Description: This is a requirement of unittest-xml-reporting
# Python-3.9 binaries
PyGithub==2.3.0
sympy==1.13.3
@ -377,7 +378,7 @@ dataclasses_json==0.6.7
#Pinned versions: 0.6.7
#test that import:
cmake==4.0.0
cmake==3.31.6
#Description: required for building
tlparse==0.4.0

View File

@ -1,8 +1,15 @@
sphinx==5.3.0
#Description: This is used to generate PyTorch docs
#Pinned versions: 5.3.0
-e git+https://github.com/pytorch/pytorch_sphinx_theme.git@1657ad2fc1acdc98aa719eebecbb0128a7c13ce4#egg=pytorch_sphinx_theme2
standard-imghdr==3.13.0; python_version >= "3.13"
#Description: This is needed by Sphinx, so it needs to be added here.
# The reasons are as follows:
# 1) This module has been removed from the Python standard library since Python 3.13(https://peps.python.org/pep-0594/#imghdr);
# 2) The current version of Sphinx (5.3.0) is not compatible with Python 3.13.
# Once Sphinx is upgraded to a version compatible with Python 3.13 or later, we can remove this dependency.
-e git+https://github.com/pytorch/pytorch_sphinx_theme.git@71e55749be14ceb56e7f8211a9fb649866b87ad4#egg=pytorch_sphinx_theme2
# TODO: sphinxcontrib.katex 0.9.0 adds a local KaTeX server to speed up pre-rendering
# but it doesn't seem to work and hangs around idly. The initial thought that it is probably
# something related to Docker setup. We can investigate this later.

View File

@ -72,7 +72,7 @@ def sample_vllm_test_library():
]
),
"pytest -v -s entrypoints/llm/test_generate.py",
"VLLM_USE_V1=0 pytest -v -s entrypoints/offline_mode",
"pytest -v -s entrypoints/offline_mode",
],
},
"vllm_regression_test": {

View File

@ -1,11 +1,11 @@
SHELL=/usr/bin/env bash
DOCKER_CMD ?= docker
DESIRED_ROCM ?= 6.4
DESIRED_ROCM ?= 7.0
DESIRED_ROCM_SHORT = $(subst .,,$(DESIRED_ROCM))
PACKAGE_NAME = magma-rocm
# inherit this from underlying docker image, do not pass this env var to docker
#PYTORCH_ROCM_ARCH ?= gfx900;gfx906;gfx908;gfx90a;gfx942;gfx1030;gfx1100;gfx1101;gfx1102;gfx1200;gfx1201
#PYTORCH_ROCM_ARCH ?= gfx900;gfx906;gfx908;gfx90a;gfx942;gfx950;gfx1030;gfx1100;gfx1101;gfx1102;gfx1200;gfx1201
DOCKER_RUN = set -eou pipefail; ${DOCKER_CMD} run --rm -i \
-v $(shell git rev-parse --show-toplevel)/.ci:/builder \
@ -16,6 +16,7 @@ DOCKER_RUN = set -eou pipefail; ${DOCKER_CMD} run --rm -i \
magma-rocm/build_magma.sh
.PHONY: all
all: magma-rocm70
all: magma-rocm64
all: magma-rocm63
@ -24,6 +25,11 @@ clean:
$(RM) -r magma-*
$(RM) -r output
.PHONY: magma-rocm70
magma-rocm70: DESIRED_ROCM := 7.0
magma-rocm70:
$(DOCKER_RUN)
.PHONY: magma-rocm64
magma-rocm64: DESIRED_ROCM := 6.4
magma-rocm64:

View File

@ -6,8 +6,8 @@ set -eou pipefail
# The script expects DESIRED_CUDA and PACKAGE_NAME to be set
ROOT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")/.." && pwd)"
# Version 2.7.2 + ROCm related updates
MAGMA_VERSION=a1625ff4d9bc362906bd01f805dbbe12612953f6
# https://github.com/icl-utk-edu/magma/pull/65
MAGMA_VERSION=d6e4117bc88e73f06d26c6c2e14f064e8fc3d1ec
# Folders for the build
PACKAGE_FILES=${ROOT_DIR}/magma-rocm/package_files # metadata
@ -20,7 +20,7 @@ mkdir -p ${PACKAGE_DIR} ${PACKAGE_OUTPUT}/linux-64 ${PACKAGE_BUILD} ${PACKAGE_RE
# Fetch magma sources and verify checksum
pushd ${PACKAGE_DIR}
git clone https://bitbucket.org/icl/magma.git
git clone https://github.com/jeffdaily/magma
pushd magma
git checkout ${MAGMA_VERSION}
popd

View File

@ -142,7 +142,7 @@ time CMAKE_ARGS=${CMAKE_ARGS[@]} \
EXTRA_CAFFE2_CMAKE_FLAGS=${EXTRA_CAFFE2_CMAKE_FLAGS[@]} \
BUILD_LIBTORCH_CPU_WITH_DEBUG=$BUILD_DEBUG_INFO \
USE_NCCL=${USE_NCCL} USE_RCCL=${USE_RCCL} USE_KINETO=${USE_KINETO} \
python setup.py bdist_wheel -d /tmp/$WHEELHOUSE_DIR
python -m build --wheel --no-isolation --outdir /tmp/$WHEELHOUSE_DIR
echo "Finished setup.py bdist at $(date)"
# Build libtorch packages

View File

@ -104,7 +104,7 @@ if [[ "$DESIRED_CUDA" == *"rocm"* ]]; then
export ROCclr_DIR=/opt/rocm/rocclr/lib/cmake/rocclr
fi
echo "Calling 'python -m pip install .' at $(date)"
echo "Calling -m pip install . -v --no-build-isolation at $(date)"
if [[ $LIBTORCH_VARIANT = *"static"* ]]; then
STATIC_CMAKE_FLAG="-DTORCH_STATIC=1"

View File

@ -107,6 +107,10 @@ if [[ $ROCM_INT -ge 60200 ]]; then
ROCM_SO_FILES+=("librocm-core.so")
fi
if [[ $ROCM_INT -ge 70000 ]]; then
ROCM_SO_FILES+=("librocroller.so")
fi
OS_NAME=`awk -F= '/^NAME/{print $2}' /etc/os-release`
if [[ "$OS_NAME" == *"CentOS Linux"* || "$OS_NAME" == *"AlmaLinux"* ]]; then
LIBGOMP_PATH="/usr/lib64/libgomp.so.1"

View File

@ -290,13 +290,13 @@ else
WERROR=1 python setup.py clean
WERROR=1 python setup.py bdist_wheel
WERROR=1 python -m build --wheel --no-isolation
else
python setup.py clean
if [[ "$BUILD_ENVIRONMENT" == *xla* ]]; then
source .ci/pytorch/install_cache_xla.sh
fi
python setup.py bdist_wheel
python -m build --wheel --no-isolation
fi
pip_install_whl "$(echo dist/*.whl)"

View File

@ -58,7 +58,7 @@ time python tools/setup_helpers/generate_code.py \
# Build the docs
pushd docs/cpp
time make VERBOSE=1 html -j
time make VERBOSE=1 html
popd
popd

View File

@ -35,11 +35,12 @@ fi
print_cmake_info
if [[ ${BUILD_ENVIRONMENT} == *"distributed"* ]]; then
USE_OPENMP=1 WERROR=1 python setup.py bdist_wheel
# Needed for inductor benchmarks, as lots of HF networks make `torch.distribtued` calls
USE_DISTRIBUTED=1 USE_OPENMP=1 WERROR=1 python -m build --wheel --no-isolation
else
# NB: we always build with distributed; USE_DISTRIBUTED turns off all
# backends (specifically the gloo backend), so test that this case works too
USE_DISTRIBUTED=0 USE_OPENMP=1 MACOSX_DEPLOYMENT_TARGET=11.0 WERROR=1 BUILD_TEST=OFF USE_PYTORCH_METAL=1 python setup.py bdist_wheel --plat-name macosx_11_0_arm64
# Explicitly set USE_DISTRIBUTED=0 to align with the default build config on mac. This also serves as the sole CI config that tests
# that building with USE_DISTRIBUTED=0 works at all. See https://github.com/pytorch/pytorch/issues/86448
USE_DISTRIBUTED=0 USE_OPENMP=1 MACOSX_DEPLOYMENT_TARGET=11.0 WERROR=1 BUILD_TEST=OFF USE_PYTORCH_METAL=1 python -m build --wheel --no-isolation -C--build-option=--plat-name=macosx_11_0_arm64
fi
if which sccache > /dev/null; then
print_sccache_stats

View File

@ -13,13 +13,9 @@ if [[ ! $(python -c "import torch; print(int(torch.backends.openmp.is_available(
fi
popd
python -mpip install -r requirements.txt
# enable debug asserts in serialization
export TORCH_SERIALIZATION_DEBUG=1
python -mpip install --no-input -r requirements.txt
setup_test_python() {
# The CircleCI worker hostname doesn't resolve to an address.
# This environment variable makes ProcessGroupGloo default to
@ -59,7 +55,7 @@ test_python_shard() {
setup_test_python
time python test/run_test.py --verbose --exclude-jit-executor --exclude-distributed-tests --shard "$1" "$NUM_TEST_SHARDS"
time python test/run_test.py --verbose --exclude-jit-executor --exclude-distributed-tests --exclude-quantization-tests --shard "$1" "$NUM_TEST_SHARDS"
assert_git_not_dirty
}

View File

@ -26,6 +26,7 @@ if [[ "${SHARD_NUMBER:-2}" == "2" ]]; then
time python test/run_test.py --verbose -i distributed/test_c10d_spawn_gloo
time python test/run_test.py --verbose -i distributed/test_c10d_spawn_nccl
time python test/run_test.py --verbose -i distributed/test_compute_comm_reordering
time python test/run_test.py --verbose -i distributed/test_aten_comm_compute_reordering
time python test/run_test.py --verbose -i distributed/test_store
time python test/run_test.py --verbose -i distributed/test_symmetric_memory
time python test/run_test.py --verbose -i distributed/test_pg_wrapper

View File

@ -322,23 +322,29 @@ test_python_shard() {
# modify LD_LIBRARY_PATH to ensure it has the conda env.
# This set of tests has been shown to be buggy without it for the split-build
time python test/run_test.py --exclude-jit-executor --exclude-distributed-tests $INCLUDE_CLAUSE --shard "$1" "$NUM_TEST_SHARDS" --verbose $PYTHON_TEST_EXTRA_OPTION --upload-artifacts-while-running
time python test/run_test.py --exclude-jit-executor --exclude-distributed-tests --exclude-quantization-tests $INCLUDE_CLAUSE --shard "$1" "$NUM_TEST_SHARDS" --verbose $PYTHON_TEST_EXTRA_OPTION --upload-artifacts-while-running
assert_git_not_dirty
}
test_python() {
# shellcheck disable=SC2086
time python test/run_test.py --exclude-jit-executor --exclude-distributed-tests $INCLUDE_CLAUSE --verbose $PYTHON_TEST_EXTRA_OPTION
time python test/run_test.py --exclude-jit-executor --exclude-distributed-tests --exclude-quantization-tests $INCLUDE_CLAUSE --verbose $PYTHON_TEST_EXTRA_OPTION
assert_git_not_dirty
}
test_python_smoke() {
# Smoke tests for H100
# Smoke tests for H100/B200
time python test/run_test.py --include test_matmul_cuda inductor/test_fp8 inductor/test_max_autotune $PYTHON_TEST_EXTRA_OPTION --upload-artifacts-while-running
assert_git_not_dirty
}
test_python_smoke_b200() {
# Targeted smoke tests for B200 - staged approach to avoid too many failures
time python test/run_test.py --include test_matmul_cuda inductor/test_fp8 $PYTHON_TEST_EXTRA_OPTION --upload-artifacts-while-running
assert_git_not_dirty
}
test_h100_distributed() {
# Distributed tests at H100
time python test/run_test.py --include distributed/_composable/test_composability/test_pp_composability.py $PYTHON_TEST_EXTRA_OPTION --upload-artifacts-while-running
@ -384,6 +390,7 @@ test_dynamo_wrapped_shard() {
--exclude-distributed-tests \
--exclude-torch-export-tests \
--exclude-aot-dispatch-tests \
--exclude-quantization-tests \
--shard "$1" "$NUM_TEST_SHARDS" \
--verbose \
--upload-artifacts-while-running
@ -428,7 +435,7 @@ test_inductor_distributed() {
# this runs on both single-gpu and multi-gpu instances. It should be smart about skipping tests that aren't supported
# if the required number of GPUs isn't available (see the sketch below this function)
python test/run_test.py --include distributed/test_dynamo_distributed distributed/test_inductor_collectives distributed/test_compute_comm_reordering --verbose
python test/run_test.py --include distributed/test_dynamo_distributed distributed/test_inductor_collectives distributed/test_aten_comm_compute_reordering distributed/test_compute_comm_reordering --verbose
assert_git_not_dirty
}
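The "smart skipping" referred to in the comment above normally lives inside the individual test files rather than in this script. The following is only a minimal sketch of such a guard, assuming plain unittest and torch.cuda rather than PyTorch's internal test helpers; NUM_REQUIRED_GPUS and CollectiveSmokeTest are hypothetical names used purely for illustration.

import unittest
import torch

# Hypothetical constant: the number of GPUs this test needs.
NUM_REQUIRED_GPUS = 2

@unittest.skipIf(
    torch.cuda.device_count() < NUM_REQUIRED_GPUS,
    f"requires at least {NUM_REQUIRED_GPUS} GPUs",
)
class CollectiveSmokeTest(unittest.TestCase):
    def test_runs_on_multi_gpu(self):
        # Placeholder body; a real test would exercise collectives here.
        self.assertTrue(torch.cuda.is_available())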
@ -1156,6 +1163,12 @@ test_distributed() {
fi
}
test_quantization() {
echo "Testing quantization"
python test/test_quantization.py
}
test_rpc() {
echo "Testing RPC C++ tests"
# NB: the ending test_rpc must match the current function name for the current
@ -1402,7 +1415,7 @@ EOF
pip3 install -r requirements.txt
# shellcheck source=./common-build.sh
source "$(dirname "${BASH_SOURCE[0]}")/common-build.sh"
python setup.py bdist_wheel --bdist-dir="base_bdist_tmp" --dist-dir="base_dist"
python -m build --wheel --no-isolation -C--build-option=--bdist-dir="base_bdist_tmp" --outdir "base_dist"
python -mpip install base_dist/*.whl
echo "::endgroup::"
@ -1573,7 +1586,7 @@ test_executorch() {
test_linux_aarch64() {
python test/run_test.py --include test_modules test_mkldnn test_mkldnn_fusion test_openmp test_torch test_dynamic_shapes \
test_transformers test_multiprocessing test_numpy_interop test_autograd test_binary_ufuncs test_complex test_spectral_ops \
test_foreach test_reductions test_unary_ufuncs test_tensor_creation_ops test_ops \
test_foreach test_reductions test_unary_ufuncs test_tensor_creation_ops test_ops profiler/test_memory_profiler \
distributed/elastic/timer/api_test distributed/elastic/timer/local_timer_example distributed/elastic/timer/local_timer_test \
--shard "$SHARD_NUMBER" "$NUM_TEST_SHARDS" --verbose
@ -1604,7 +1617,7 @@ test_operator_benchmark() {
test_inductor_set_cpu_affinity
cd benchmarks/operator_benchmark/pt_extension
python -m pip install .
python -m pip install . -v --no-build-isolation
cd "${TEST_DIR}"/benchmarks/operator_benchmark
$TASKSET python -m benchmark_all_test --device "$1" --tag-filter "$2" \
@ -1617,6 +1630,25 @@ test_operator_benchmark() {
--expected "expected_ci_operator_benchmark_eager_float32_cpu.csv"
}
test_operator_microbenchmark() {
TEST_REPORTS_DIR=$(pwd)/test/test-reports
mkdir -p "$TEST_REPORTS_DIR"
TEST_DIR=$(pwd)
cd benchmarks/operator_benchmark/pt_extension
python -m pip install .
cd "${TEST_DIR}"/benchmarks/operator_benchmark
for OP_BENCHMARK_TESTS in matmul mm addmm bmm; do
$TASKSET python -m pt.${OP_BENCHMARK_TESTS}_test --tag-filter long \
--output-json-for-dashboard "${TEST_REPORTS_DIR}/operator_microbenchmark_${OP_BENCHMARK_TESTS}_compile.json" \
--benchmark-name "PyTorch operator microbenchmark" --use-compile
$TASKSET python -m pt.${OP_BENCHMARK_TESTS}_test --tag-filter long \
--output-json-for-dashboard "${TEST_REPORTS_DIR}/operator_microbenchmark_${OP_BENCHMARK_TESTS}.json" \
--benchmark-name "PyTorch operator microbenchmark"
done
}
if ! [[ "${BUILD_ENVIRONMENT}" == *libtorch* || "${BUILD_ENVIRONMENT}" == *-bazel-* ]]; then
(cd test && python -c "import torch; print(torch.__config__.show())")
@ -1649,6 +1681,8 @@ elif [[ "${TEST_CONFIG}" == *executorch* ]]; then
test_executorch
elif [[ "$TEST_CONFIG" == 'jit_legacy' ]]; then
test_python_legacy_jit
elif [[ "$TEST_CONFIG" == 'quantization' ]]; then
test_quantization
elif [[ "${BUILD_ENVIRONMENT}" == *libtorch* ]]; then
# TODO: run some C++ tests
echo "no-op at the moment"
@ -1671,6 +1705,8 @@ elif [[ "${TEST_CONFIG}" == *operator_benchmark* ]]; then
test_operator_benchmark cpu ${TEST_MODE}
fi
elif [[ "${TEST_CONFIG}" == *operator_microbenchmark* ]]; then
test_operator_microbenchmark
elif [[ "${TEST_CONFIG}" == *inductor_distributed* ]]; then
test_inductor_distributed
elif [[ "${TEST_CONFIG}" == *inductor-halide* ]]; then
@ -1773,10 +1809,14 @@ elif [[ "${BUILD_ENVIRONMENT}" == *xpu* ]]; then
test_xpu_bin
elif [[ "${TEST_CONFIG}" == smoke ]]; then
test_python_smoke
elif [[ "${TEST_CONFIG}" == smoke_b200 ]]; then
test_python_smoke_b200
elif [[ "${TEST_CONFIG}" == h100_distributed ]]; then
test_h100_distributed
elif [[ "${TEST_CONFIG}" == "h100-symm-mem" ]]; then
test_h100_symm_mem
elif [[ "${TEST_CONFIG}" == "b200-symm-mem" ]]; then
test_h100_symm_mem
elif [[ "${TEST_CONFIG}" == h100_cutlass_backend ]]; then
test_h100_cutlass_backend
else

View File

@ -70,7 +70,7 @@ sccache --zero-stats
sccache --show-stats
# Build the wheel
python setup.py bdist_wheel
python -m build --wheel --no-build-isolation
if ($LASTEXITCODE -ne 0) { exit 1 }
# Install the wheel locally

View File

@ -130,7 +130,7 @@ if "%USE_CUDA%"=="1" (
:: Print all existing environment variable for debugging
set
python setup.py bdist_wheel
python -m build --wheel --no-isolation
if errorlevel 1 goto fail
if not errorlevel 0 goto fail
sccache --show-stats

View File

@ -25,7 +25,7 @@ echo Copying over test times file
robocopy /E "%PYTORCH_FINAL_PACKAGE_DIR_WIN%\.additional_ci_files" "%PROJECT_DIR_WIN%\.additional_ci_files"
echo Run nn tests
python run_test.py --exclude-jit-executor --exclude-distributed-tests --shard "%SHARD_NUMBER%" "%NUM_TEST_SHARDS%" --verbose
python run_test.py --exclude-jit-executor --exclude-distributed-tests --exclude-quantization-tests --shard "%SHARD_NUMBER%" "%NUM_TEST_SHARDS%" --verbose
if ERRORLEVEL 1 goto fail
popd

View File

@ -48,7 +48,7 @@ sccache --zero-stats
sccache --show-stats
:: Call PyTorch build script
python setup.py bdist_wheel -d "%PYTORCH_FINAL_PACKAGE_DIR%"
python -m build --wheel --no-isolation --outdir "%PYTORCH_FINAL_PACKAGE_DIR%"
:: show sccache stats
sccache --show-stats

View File

@ -28,5 +28,5 @@ start /wait "" python-amd64.exe /quiet InstallAllUsers=1 PrependPath=0 Include_t
if errorlevel 1 exit /b 1
set "PATH=%CD%\Python\Scripts;%CD%\Python;%PATH%"
%PYTHON_EXEC% -m pip install --upgrade pip setuptools packaging wheel
%PYTHON_EXEC% -m pip install --upgrade pip setuptools packaging wheel build
if errorlevel 1 exit /b 1

View File

@ -86,7 +86,7 @@ copy /Y "%LIBTORCH_PREFIX%-%PYTORCH_BUILD_VERSION%.zip" "%PYTORCH_FINAL_PACKAGE_
goto build_end
:pytorch
%PYTHON_EXEC% setup.py bdist_wheel -d "%PYTORCH_FINAL_PACKAGE_DIR%"
%PYTHON_EXEC% -m build --wheel --no-isolation --outdir "%PYTORCH_FINAL_PACKAGE_DIR%"
:build_end
IF ERRORLEVEL 1 exit /b 1

View File

@ -63,7 +63,7 @@ if errorlevel 1 exit /b 1
call %CONDA_HOME%\condabin\activate.bat testenv
if errorlevel 1 exit /b 1
call conda install -y -q -c conda-forge libuv=1.39
call conda install -y -q -c conda-forge libuv=1.51
call conda install -y -q intel-openmp
echo "install and test libtorch"

View File

@ -18,7 +18,7 @@ if "%DESIRED_PYTHON%" == "3.9" %PYTHON_EXEC% -m pip install numpy==2.0.2 cmake
%PYTHON_EXEC% -m pip install pyyaml
%PYTHON_EXEC% -m pip install mkl-include mkl-static
%PYTHON_EXEC% -m pip install boto3 ninja typing_extensions setuptools==72.1.0
%PYTHON_EXEC% -m pip install boto3 requests ninja typing_extensions setuptools==72.1.0
where cmake.exe

View File

@ -143,7 +143,8 @@ case $desired_python in
RENAME_WHEEL=false
;;
3.13t)
echo "Using 3.13 deps"
echo "Using 3.13t deps"
mac_version='macosx-11.0-arm64'
NUMPY_PINNED_VERSION="==2.1.0"
RENAME_WHEEL=false
;;
@ -177,8 +178,7 @@ source ~/${desired_python}-build/bin/activate
retry pip install "${PINNED_PACKAGES[@]}" -r "${pytorch_rootdir}/requirements.txt"
retry brew install libomp
# For USE_DISTRIBUTED=1 on macOS, this enables gloo, which needs libuv, which
# is build as part of tensorpipe submodule
# For USE_DISTRIBUTED=1 on macOS, we need libuv, which is built as part of the tensorpipe submodule
export USE_DISTRIBUTED=1
export USE_MKLDNN=OFF
@ -186,11 +186,11 @@ export USE_QNNPACK=OFF
export BUILD_TEST=OFF
pushd "$pytorch_rootdir"
echo "Calling setup.py bdist_wheel at $(date)"
echo "Calling -m build --wheel --no-isolation at $(date)"
_PYTHON_HOST_PLATFORM=${mac_version} ARCHFLAGS="-arch arm64" python setup.py bdist_wheel -d "$whl_tmp_dir" --plat-name "${mac_version//[-.]/_}"
_PYTHON_HOST_PLATFORM=${mac_version} ARCHFLAGS="-arch arm64" python -m build --wheel --no-isolation --outdir "$whl_tmp_dir" -C--plat-name="${mac_version//[-.]/_}"
echo "Finished setup.py bdist_wheel at $(date)"
echo "Finished -m build --wheel --no-isolation at $(date)"
if [[ $package_type != 'libtorch' ]]; then
echo "delocating wheel dependencies"

View File

@ -1,47 +0,0 @@
#!/bin/bash
# =================== The following code **should** be executed inside Docker container ===================
# Install dependencies
sudo apt-get -y update
sudo apt-get -y install expect-dev
# This is where the local pytorch install in the docker image is located
pt_checkout="/var/lib/jenkins/workspace"
source "$pt_checkout/.ci/pytorch/common_utils.sh"
echo "functorch_doc_push_script.sh: Invoked with $*"
set -ex
version=${DOCS_VERSION:-nightly}
echo "version: $version"
# Build functorch docs
pushd $pt_checkout/functorch/docs
pip -q install -r requirements.txt
make html
popd
git clone https://github.com/pytorch/functorch -b gh-pages --depth 1 functorch_ghpages
pushd functorch_ghpages
if [ $version == "main" ]; then
version=nightly
fi
git rm -rf "$version" || true
mv "$pt_checkout/functorch/docs/build/html" "$version"
git add "$version" || true
git status
git config user.email "soumith+bot@pytorch.org"
git config user.name "pytorchbot"
# If there aren't changes, don't make a commit; push is no-op
git commit -m "Generate Python docs from pytorch/pytorch@${GITHUB_SHA}" || true
git status
if [[ "${WITH_PUSH:-}" == true ]]; then
git push -u origin gh-pages
fi
popd
# =================== The above code **should** be executed inside Docker container ===================

View File

@ -69,6 +69,8 @@ readability-string-compare,
'
HeaderFilterRegex: '^(aten/|c10/|torch/).*$'
WarningsAsErrors: '*'
LineFilter:
- name: '/usr/include/.*'
CheckOptions:
cppcoreguidelines-special-member-functions.AllowSoleDefaultDtor: true
cppcoreguidelines-special-member-functions.AllowImplicitlyDeletedCopyOrMove: true

View File

@ -1,6 +1,10 @@
---
name: "⚠️ CI SEV"
about: Tracking incidents for PyTorch's CI infra.
title: ''
labels: ''
assignees: ''
---
> NOTE: Remember to label this issue with "`ci: sev`"

View File

@ -0,0 +1,18 @@
---
name: DISABLE AUTOREVERT
about: Disables autorevert when open
title: "❌​\U0001F519 [DISABLE AUTOREVERT]"
labels: 'ci: disable-autorevert'
assignees: ''
---
This issue, while open, disables the autorevert functionality.
More details can be found [here](https://github.com/pytorch/test-infra/blob/main/aws/lambda/pytorch-auto-revert/README.md)
## Why are you disabling autorevert?
## Links to any issues/commits/errors that show the source of the problem

View File

@ -1,8 +1,10 @@
---
name: Disable CI jobs (PyTorch Dev Infra only)
about: Use this template to disable CI jobs
title: "DISABLED [WORKFLOW_NAME] / [PLATFORM_NAME] / [JOB_NAME]"
labels: "module: ci"
title: DISABLED [WORKFLOW_NAME] / [PLATFORM_NAME] / [JOB_NAME]
labels: 'module: ci'
assignees: ''
---
> For example, DISABLED pull / win-vs2022-cpu-py3 / test (default). Once

View File

@ -22,6 +22,9 @@ self-hosted-runner:
- linux.arm64.m7g.4xlarge
- linux.arm64.m7g.4xlarge.ephemeral
- linux.arm64.r7g.12xlarge.memory
- linux.aws.h100
- linux.aws.h100.4
- linux.aws.h100.8
- linux.4xlarge.nvidia.gpu
- linux.8xlarge.nvidia.gpu
- linux.16xlarge.nvidia.gpu

View File

@ -59,7 +59,7 @@ runs:
set -x
# Create new py_tmp env with python-version
${CONDA} create -y -n py_tmp python=${PYTHON_VERSION} intel-openmp
${CONDA} create -y -n py_tmp python=${PYTHON_VERSION} intel-openmp libuv
PYTHON3=$(${CONDA_RUN} -n py_tmp which python3)
EXIT_CODE=$?

View File

@ -1 +1 @@
9d1c50a5ac8726f4af0d4a4e85ad4d26a674ad26
78a47f87ce259a48f0391fa9ae15add05ea7432b

View File

@ -1 +1 @@
c77852e117bdf056c8e9a087e51d6f65cf6ba53d
0fc62aa26a30ed7ca419d285f285cb5ba02c4394

View File

@ -525,6 +525,21 @@
- Lint
- pull
- name: typechecking
patterns:
- 'pyrefly.toml'
- 'mypy.ini'
- 'mypy-strict.ini'
approved_by:
- lolpack
- maggiemoss
- ndmitchell
- kinto0
mandatory_checks_name:
- EasyCLA
- Lint
- pull
- name: superuser
patterns:
- '*'

View File

@ -1,41 +1,44 @@
tracking_issue: 24422
ciflow_tracking_issue: 64124
ciflow_push_tags:
- ciflow/b200
- ciflow/b200-symm-mem
- ciflow/binaries
- ciflow/binaries_libtorch
- ciflow/binaries_wheel
- ciflow/triton_binaries
- ciflow/h100
- ciflow/h100-cutlass-backend
- ciflow/h100-distributed
- ciflow/h100-symm-mem
- ciflow/inductor
- ciflow/inductor-periodic
- ciflow/inductor-rocm
- ciflow/inductor-perf-test-nightly-rocm
- ciflow/inductor-perf-compare
- ciflow/inductor-cu126
- ciflow/inductor-micro-benchmark
- ciflow/inductor-micro-benchmark-cpu-x86
- ciflow/inductor-perf-compare
- ciflow/inductor-perf-test-nightly-rocm
- ciflow/inductor-perf-test-nightly-x86-zen
- ciflow/inductor-cu126
- ciflow/inductor-periodic
- ciflow/inductor-rocm
- ciflow/linux-aarch64
- ciflow/mps
- ciflow/nightly
- ciflow/op-benchmark
- ciflow/periodic
- ciflow/periodic-rocm-mi300
- ciflow/pull
- ciflow/quantization-periodic
- ciflow/riscv64
- ciflow/rocm
- ciflow/rocm-mi300
- ciflow/s390
- ciflow/riscv64
- ciflow/slow
- ciflow/torchbench
- ciflow/triton_binaries
- ciflow/trunk
- ciflow/unstable
- ciflow/xpu
- ciflow/vllm
- ciflow/torchbench
- ciflow/op-benchmark
- ciflow/pull
- ciflow/h100
- ciflow/h100-distributed
- ciflow/win-arm64
- ciflow/h100-symm-mem
- ciflow/h100-cutlass-backend
- ciflow/xpu
retryable_workflows:
- pull
- trunk
@ -44,4 +47,4 @@ retryable_workflows:
- inductor-A100-perf-nightly
labeler_config: labeler.yml
label_to_label_config: label_to_label.yml
mergebot: True
mergebot: true

View File

@ -1,4 +1,5 @@
boto3==1.35.42
build==1.2.2.post1
cmake==3.27.*
expecttest==0.3.0
fbscribelogger==0.1.7

View File

@ -30,7 +30,7 @@ CUDA_ARCHES_CUDNN_VERSION = {
}
# NOTE: Please also update the ROCm sources in `PIP_SOURCES` in tools/nightly.py when changing this
ROCM_ARCHES = ["6.3", "6.4"]
ROCM_ARCHES = ["6.4", "7.0"]
XPU_ARCHES = ["xpu"]
@ -53,7 +53,7 @@ PYTORCH_EXTRA_INSTALL_REQUIREMENTS = {
"nvidia-cusolver-cu12==11.7.1.2; platform_system == 'Linux' | "
"nvidia-cusparse-cu12==12.5.4.2; platform_system == 'Linux' | "
"nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' | "
"nvidia-nccl-cu12==2.27.5; platform_system == 'Linux' | "
"nvidia-nccl-cu12==2.28.3; platform_system == 'Linux' | "
"nvidia-nvshmem-cu12==3.3.24; platform_system == 'Linux' | "
"nvidia-nvtx-cu12==12.6.77; platform_system == 'Linux' | "
"nvidia-nvjitlink-cu12==12.6.85; platform_system == 'Linux' | "
@ -70,7 +70,7 @@ PYTORCH_EXTRA_INSTALL_REQUIREMENTS = {
"nvidia-cusolver-cu12==11.7.3.90; platform_system == 'Linux' | "
"nvidia-cusparse-cu12==12.5.8.93; platform_system == 'Linux' | "
"nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' | "
"nvidia-nccl-cu12==2.27.5; platform_system == 'Linux' | "
"nvidia-nccl-cu12==2.28.3; platform_system == 'Linux' | "
"nvidia-nvshmem-cu12==3.3.24; platform_system == 'Linux' | "
"nvidia-nvtx-cu12==12.8.90; platform_system == 'Linux' | "
"nvidia-nvjitlink-cu12==12.8.93; platform_system == 'Linux' | "
@ -87,7 +87,7 @@ PYTORCH_EXTRA_INSTALL_REQUIREMENTS = {
"nvidia-cusolver==12.0.3.29; platform_system == 'Linux' | "
"nvidia-cusparse==12.6.2.49; platform_system == 'Linux' | "
"nvidia-cusparselt-cu13==0.8.0; platform_system == 'Linux' | "
"nvidia-nccl-cu13==2.27.7; platform_system == 'Linux' | "
"nvidia-nccl-cu13==2.28.3; platform_system == 'Linux' | "
"nvidia-nvshmem-cu13==3.3.24; platform_system == 'Linux' | "
"nvidia-nvtx==13.0.39; platform_system == 'Linux' | "
"nvidia-nvjitlink==13.0.39; platform_system == 'Linux' | "

View File

@ -127,53 +127,6 @@ LINUX_BINARY_BUILD_WORFKLOWS = [
),
]
ROCM_SMOKE_WORKFLOWS = [
BinaryBuildWorkflow(
os=OperatingSystem.LINUX,
package_type="manywheel",
build_variant="rocm",
build_configs=generate_binary_build_matrix.generate_wheels_matrix(
OperatingSystem.LINUX,
arches=["6.4"],
python_versions=["3.10"],
),
ciflow_config=CIFlowConfig(
labels={
LABEL_CIFLOW_BINARIES,
LABEL_CIFLOW_BINARIES_WHEEL,
LABEL_CIFLOW_ROCM,
},
isolated_workflow=True,
),
branches="main",
),
]
LINUX_BINARY_SMOKE_WORKFLOWS = [
BinaryBuildWorkflow(
os=OperatingSystem.LINUX,
package_type="manywheel",
build_configs=generate_binary_build_matrix.generate_wheels_matrix(
OperatingSystem.LINUX,
arches=["12.8"],
python_versions=["3.12"],
),
branches="main",
),
BinaryBuildWorkflow(
os=OperatingSystem.LINUX,
package_type="libtorch",
build_variant=generate_binary_build_matrix.RELEASE,
build_configs=generate_binary_build_matrix.generate_libtorch_matrix(
OperatingSystem.LINUX,
generate_binary_build_matrix.RELEASE,
arches=["cpu"],
libtorch_variants=["shared-with-deps"],
),
branches="main",
),
]
WINDOWS_BINARY_BUILD_WORKFLOWS = [
BinaryBuildWorkflow(
os=OperatingSystem.WINDOWS,
@ -259,39 +212,6 @@ WINDOWS_BINARY_BUILD_WORKFLOWS = [
),
]
WINDOWS_BINARY_SMOKE_WORKFLOWS = [
BinaryBuildWorkflow(
os=OperatingSystem.WINDOWS,
package_type="libtorch",
build_variant=generate_binary_build_matrix.RELEASE,
build_configs=generate_binary_build_matrix.generate_libtorch_matrix(
OperatingSystem.WINDOWS,
generate_binary_build_matrix.RELEASE,
arches=["cpu"],
libtorch_variants=["shared-with-deps"],
),
branches="main",
ciflow_config=CIFlowConfig(
isolated_workflow=True,
),
),
BinaryBuildWorkflow(
os=OperatingSystem.WINDOWS,
package_type="libtorch",
build_variant=generate_binary_build_matrix.DEBUG,
build_configs=generate_binary_build_matrix.generate_libtorch_matrix(
OperatingSystem.WINDOWS,
generate_binary_build_matrix.DEBUG,
arches=["cpu"],
libtorch_variants=["shared-with-deps"],
),
branches="main",
ciflow_config=CIFlowConfig(
isolated_workflow=True,
),
),
]
MACOS_BINARY_BUILD_WORKFLOWS = [
BinaryBuildWorkflow(
os=OperatingSystem.MACOS_ARM64,
@ -372,23 +292,10 @@ def main() -> None:
jinja_env.get_template("linux_binary_build_workflow.yml.j2"),
S390X_BINARY_BUILD_WORKFLOWS,
),
(
# Give rocm it's own workflow file
jinja_env.get_template("linux_binary_build_workflow.yml.j2"),
ROCM_SMOKE_WORKFLOWS,
),
(
jinja_env.get_template("linux_binary_build_workflow.yml.j2"),
LINUX_BINARY_SMOKE_WORKFLOWS,
),
(
jinja_env.get_template("windows_binary_build_workflow.yml.j2"),
WINDOWS_BINARY_BUILD_WORKFLOWS,
),
(
jinja_env.get_template("windows_binary_build_workflow.yml.j2"),
WINDOWS_BINARY_SMOKE_WORKFLOWS,
),
(
jinja_env.get_template("macos_binary_build_workflow.yml.j2"),
MACOS_BINARY_BUILD_WORKFLOWS,

View File

@ -71,12 +71,15 @@ jobs:
with:!{{ upload.binary_env_as_input(config) }}
{%- if "aarch64" in build_environment %}
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
runs_on: linux.arm64.m7g.4xlarge.ephemeral
runs_on: linux.arm64.r7g.12xlarge.memory
ALPINE_IMAGE: "arm64v8/alpine"
{%- elif "s390x" in build_environment %}
runs_on: linux.s390x
ALPINE_IMAGE: "docker.io/s390x/alpine"
timeout-minutes: 420
{%- elif config["gpu_arch_type"] == "rocm" %}
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
timeout-minutes: 300
{%- elif "conda" in build_environment and config["gpu_arch_type"] == "cuda" %}
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
runs_on: linux.24xlarge.ephemeral

View File

@ -67,7 +67,7 @@ jobs:
# an OOM issue when running the job, so this upgrades the runner from 4xlarge
# to the next available tier of 12xlarge. So much memory just to generate cpp
# doc
runner: ${{ inputs.runner_prefix }}linux.12xlarge
runner: ${{ inputs.runner_prefix }}linux.12xlarge.memory
# TODO: Nightly cpp docs take longer and longer to finish (more than 3h now)
# Let's try to figure out how this can be improved
timeout-minutes: 360

View File

@ -2,6 +2,12 @@ name: Get Changed Files
on:
workflow_call:
inputs:
all_files:
description: "Whether to return all files instead of just changed files"
required: false
type: boolean
default: false
outputs:
changed-files:
description: "List of changed files (space-separated) or '*' if not in a PR"
@ -26,17 +32,23 @@ jobs:
# Get the PR number from the github context
PR_NUMBER="${{ github.event.number }}"
# Use gh CLI to get changed files in the PR with explicit repo
CHANGED_FILES=$(gh api repos/${{ github.repository }}/pulls/$PR_NUMBER/files --paginate --jq '.[] | select(.status != "removed") | .filename' | tr '\n' ' ' | sed 's/ $//')
# Check if all_files is requested
if [ "${{ inputs.all_files }}" = "true" ]; then
echo "all_files input is true, returning all files"
echo "changed-files=*" >> "$GITHUB_OUTPUT"
else
# Use gh CLI to get changed files in the PR with explicit repo
CHANGED_FILES=$(gh api repos/${{ github.repository }}/pulls/$PR_NUMBER/files --paginate --jq '.[] | select(.status != "removed") | .filename' | tr '\n' ' ' | sed 's/ $//')
if [ -z "$CHANGED_FILES" ]; then
echo "No changed files found, setting to '*'"
CHANGED_FILES="*"
if [ -z "$CHANGED_FILES" ]; then
echo "No changed files found, setting to '*'"
CHANGED_FILES="*"
fi
echo "Changed files: $CHANGED_FILES"
echo "changed-files=$CHANGED_FILES" >> "$GITHUB_OUTPUT"
fi
echo "Changed files: $CHANGED_FILES"
echo "changed-files=$CHANGED_FILES" >> "$GITHUB_OUTPUT"
else
echo "Not in PR context, setting changed files to '*'"
echo "changed-files=*" >> "$GITHUB_OUTPUT"

View File

@ -273,6 +273,8 @@ jobs:
TEST_CONFIG: ${{ matrix.config }}
SHARD_NUMBER: ${{ matrix.shard }}
NUM_TEST_SHARDS: ${{ matrix.num_shards }}
EXTRA_FLAGS: ${{ matrix.extra_flags || '' }}
OP_BENCHMARK_TESTS: ${{ matrix.op_benchmark_tests }}
REENABLED_ISSUES: ${{ steps.keep-going.outputs.reenabled-issues }}
CONTINUE_THROUGH_ERROR: ${{ steps.keep-going.outputs.keep-going }}
VERBOSE_TEST_LOGS: ${{ steps.keep-going.outputs.ci-verbose-test-logs }}

.github/workflows/b200-symm-mem.yml
View File

@ -0,0 +1,60 @@
name: Limited CI for symmetric memory tests on B200
on:
pull_request:
paths:
- .github/workflows/b200-symm-mem.yml
workflow_dispatch:
push:
tags:
- ciflow/b200-symm-mem/*
schedule:
- cron: 22 8 * * * # about 1:22am PDT
concurrency:
group: ${{ github.workflow }}-${{ github.event.pull_request.number || github.sha }}-${{ github.event_name == 'workflow_dispatch' }}-${{ github.event_name == 'schedule' }}
cancel-in-progress: true
permissions:
id-token: write
contents: read
jobs:
get-label-type:
if: github.repository_owner == 'pytorch'
name: get-label-type
uses: pytorch/pytorch/.github/workflows/_runner-determinator.yml@main
with:
triggering_actor: ${{ github.triggering_actor }}
issue_owner: ${{ github.event.pull_request.user.login || github.event.issue.user.login }}
curr_branch: ${{ github.head_ref || github.ref_name }}
curr_ref_type: ${{ github.ref_type }}
linux-jammy-cuda12_8-py3_10-gcc11-sm100-build-symm:
name: linux-jammy-cuda12.8-py3.10-gcc11-sm100-symm
uses: ./.github/workflows/_linux-build.yml
needs: get-label-type
with:
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
runner: linux.12xlarge.memory
build-environment: linux-jammy-cuda12.8-py3.10-gcc11-sm100-symm
docker-image-name: ci-image:pytorch-linux-jammy-cuda12.8-cudnn9-py3-gcc11
cuda-arch-list: '10.0'
test-matrix: |
{ include: [
{ config: "b200-symm-mem", shard: 1, num_shards: 1, runner: "linux.dgx.b200.8" },
]}
secrets: inherit
linux-jammy-cuda12_8-py3_10-gcc11-sm100-test:
name: linux-jammy-cuda12.8-py3.10-gcc11-sm100-symm
uses: ./.github/workflows/_linux-test.yml
needs:
- linux-jammy-cuda12_8-py3_10-gcc11-sm100-build-symm
with:
build-environment: linux-jammy-cuda12.8-py3.10-gcc11-sm100-symm
docker-image: ${{ needs.linux-jammy-cuda12_8-py3_10-gcc11-sm100-build-symm.outputs.docker-image }}
test-matrix: ${{ needs.linux-jammy-cuda12_8-py3_10-gcc11-sm100-build-symm.outputs.test-matrix }}
aws-role-to-assume: arn:aws:iam::308535385114:role/gha_workflow_s3_and_ecr_read_only
secrets: inherit

View File

@ -36,7 +36,7 @@ jobs:
runs-on: linux.9xlarge.ephemeral
strategy:
matrix:
tag: ["cuda12.6", "cuda12.8", "cuda12.9", "cuda13.0", "rocm6.3", "rocm6.4", "cpu"]
tag: ["cuda12.6", "cuda12.8", "cuda12.9", "cuda13.0", "rocm6.4", "rocm7.0", "cpu"]
steps:
- name: Build docker image
uses: pytorch/pytorch/.github/actions/binary-docker-build@main

View File

@ -52,8 +52,8 @@ jobs:
{ tag: "cuda12.9" },
{ tag: "cuda12.8" },
{ tag: "cuda12.6" },
{ tag: "rocm6.3" },
{ tag: "rocm6.4" },
{ tag: "rocm7.0" },
{ tag: "cpu" },
]
steps:

View File

@ -34,7 +34,7 @@ jobs:
id-token: write
strategy:
matrix:
rocm_version: ["64", "63"]
rocm_version: ["70", "64"]
steps:
- name: Checkout PyTorch
uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2

View File

@ -52,11 +52,10 @@ jobs:
{ name: "manylinuxaarch64-builder", tag: "cuda13.0", runner: "linux.arm64.2xlarge.ephemeral" },
{ name: "manylinuxaarch64-builder", tag: "cuda12.8", runner: "linux.arm64.2xlarge.ephemeral" },
{ name: "manylinuxaarch64-builder", tag: "cuda12.6", runner: "linux.arm64.2xlarge.ephemeral" },
{ name: "manylinux2_28-builder", tag: "rocm6.3", runner: "linux.9xlarge.ephemeral" },
{ name: "manylinux2_28-builder", tag: "rocm6.4", runner: "linux.9xlarge.ephemeral" },
{ name: "manylinux2_28-builder", tag: "rocm7.0", runner: "linux.9xlarge.ephemeral" },
{ name: "manylinux2_28-builder", tag: "cpu", runner: "linux.9xlarge.ephemeral" },
{ name: "manylinux2_28_aarch64-builder", tag: "cpu-aarch64", runner: "linux.arm64.2xlarge.ephemeral" },
{ name: "manylinuxcxx11-abi-builder", tag: "cpu-cxx11-abi", runner: "linux.9xlarge.ephemeral" },
{ name: "manylinux2_28-builder", tag: "xpu", runner: "linux.9xlarge.ephemeral" },
]
runs-on: ${{ needs.get-label-type.outputs.label-type }}${{ matrix.runner }}

View File

@ -50,12 +50,12 @@ jobs:
strategy:
fail-fast: false
matrix:
py_vers: [ "3.9", "3.10", "3.11", "3.12", "3.13", "3.13t", "3.14", "3.14t" ]
py_vers: [ "3.10", "3.11", "3.12", "3.13", "3.13t", "3.14", "3.14t" ]
device: ["cuda", "rocm", "xpu", "aarch64"]
docker-image: ["pytorch/manylinux2_28-builder:cpu"]
include:
- device: "rocm"
rocm_version: "6.4"
rocm_version: "7.0"
runs_on: "${{ needs.get-label-type.outputs.label-type }}linux.4xlarge"
- device: "cuda"
rocm_version: ""
@ -108,9 +108,6 @@ jobs:
# Determine python executable for given version
case $PY_VERS in
3.9)
PYTHON_EXECUTABLE=/opt/python/cp39-cp39/bin/python
;;
3.10)
PYTHON_EXECUTABLE=/opt/python/cp310-cp310/bin/python
;;
@ -194,7 +191,7 @@ jobs:
strategy:
fail-fast: false
matrix:
py_vers: [ "3.9", "3.10", "3.11", "3.12", "3.13", "3.13t", "3.14", "3.14t" ]
py_vers: [ "3.10", "3.11", "3.12", "3.13", "3.13t", "3.14", "3.14t" ]
device: ["xpu"]
timeout-minutes: 40
env:

View File

@ -35,6 +35,7 @@ jobs:
contents: write
outputs:
pt_release_name: ${{ steps.release_name.outputs.pt_release_name }}
pt_pep517_release_name: ${{ steps.release_name.outputs.pt_pep517_release_name }}
steps:
- uses: actions/checkout@11bd71901bbe5b1630ceea73d27597364c9af683 # v4.2.2
with:
@ -53,8 +54,12 @@ jobs:
tag_or_branch="${tag_or_branch#refs/heads/}"
# replace directory separators with _ in branch name
tag_or_branch="${tag_or_branch//\//_}"
echo "PT_RELEASE_NAME=pytorch-$tag_or_branch" >> "$GITHUB_ENV"
echo "PT_RELEASE_FILE=pytorch-$tag_or_branch.tar.gz" >> "$GITHUB_ENV"
torch_version="$(python -c 'from tools.generate_torch_version import get_torch_version; print(get_torch_version())')"
{
echo "PT_RELEASE_NAME=pytorch-$tag_or_branch";
echo "PT_RELEASE_FILE=pytorch-$tag_or_branch.tar.gz";
echo "PT_PEP517_RELEASE_FILE=torch-${torch_version}.tar.gz";
} >> "$GITHUB_ENV"
- name: Checkout optional submodules
run: python3 tools/optional_submodules.py
- name: Copy docs requirements for inclusion
@ -64,30 +69,47 @@ jobs:
cp .ci/docker/requirements-docs.txt docs/requirements.txt
- name: Create source distribution
run: |
# Create new folder with specified name so extracting the archive yields that
rm -rf "/tmp/$PT_RELEASE_NAME"
cp -r "$PWD" "/tmp/$PT_RELEASE_NAME"
mv "/tmp/$PT_RELEASE_NAME" .
# Cleanup
rm -rf "$PT_RELEASE_NAME"/{.circleci,.ci}
find "$PT_RELEASE_NAME" -name '.git*' -exec rm -rv {} \; || true
# Create archive
tar -czf "$PT_RELEASE_FILE" "$PT_RELEASE_NAME"
echo "Created source archive $PT_RELEASE_FILE with content: $(ls -a "$PT_RELEASE_NAME")"
# Create new folder with specified name so extracting the archive yields that
rm -rf "/tmp/$PT_RELEASE_NAME"
cp -r "$PWD" "/tmp/$PT_RELEASE_NAME"
mv "/tmp/$PT_RELEASE_NAME" .
# Cleanup
rm -rf "$PT_RELEASE_NAME"/{.circleci,.ci}
find "$PT_RELEASE_NAME" -name '.git*' -exec rm -rv {} \; || true
# Create archive
tar -czf "$PT_RELEASE_FILE" "$PT_RELEASE_NAME"
echo "Created source archive $PT_RELEASE_FILE with content: $(ls -a "$PT_RELEASE_NAME")"
- name: Create PEP 517 compatible source distribution
run: |
pip install build==1.2.2.post1 || exit 1
python -m build --sdist || exit 1
cd dist || exit 1
- name: Upload source distribution for release
if: ${{ github.event_name == 'release' }}
uses: softprops/action-gh-release@da05d552573ad5aba039eaac05058a918a7bf631 # v2.2.2
with:
files: ${{env.PT_RELEASE_FILE}}
- name: Upload source distribution to GHA artifacts for release tags
files: |
${{ env.PT_RELEASE_FILE }}
${{ env.PT_PEP517_RELEASE_FILE }}
- name: Upload source distribution to GHA artifacts # for release tags
if: ${{ github.event_name == 'push' && startsWith(github.ref, 'refs/tags/v') && contains(github.ref, 'rc') }}
uses: actions/upload-artifact@50769540e7f4bd5e21e526ee35c689e35e0d6874 # v4.4.0
with:
name: ${{ env.PT_RELEASE_FILE }}
path: ${{ env.PT_RELEASE_FILE }}
- name: Upload PEP 517 source distribution to GHA artifacts # for release tags
if: ${{ github.event_name == 'push' && startsWith(github.ref, 'refs/tags/v') && contains(github.ref, 'rc') }}
uses: actions/upload-artifact@50769540e7f4bd5e21e526ee35c689e35e0d6874 # v4.4.0
with:
name: ${{ env.PT_PEP517_RELEASE_FILE }}
path: dist/${{ env.PT_PEP517_RELEASE_FILE }}
- name: Set output
id: release_name
run: echo "pt_release_name=${{ env.PT_RELEASE_NAME }}.tar.gz" >> "${GITHUB_OUTPUT}"
run: |
{
echo "pt_release_name=${{ env.PT_RELEASE_FILE }}";
echo "pt_pep517_release_name=${{ env.PT_PEP517_RELEASE_FILE }}";
} >> "${GITHUB_OUTPUT}"
upload_source_code_to_s3:
if: ${{ github.repository == 'pytorch/pytorch' && github.event_name == 'push' && startsWith(github.ref, 'refs/tags/v') && contains(github.ref, 'rc') }}
@ -103,6 +125,9 @@ jobs:
- uses: actions/download-artifact@65a9edc5881444af0b9093a5e628f2fe47ea3b2e # v4.1.7
with:
name: ${{ needs.release.outputs.pt_release_name }}
- uses: actions/download-artifact@65a9edc5881444af0b9093a5e628f2fe47ea3b2e # v4.1.7
with:
name: ${{ needs.release.outputs.pt_pep517_release_name }}
- name: Configure AWS credentials(PyTorch account)
uses: aws-actions/configure-aws-credentials@ececac1a45f3b08a01d2dd070d28d111c5fe6722 # v4.1.0
with:
@ -113,7 +138,9 @@ jobs:
s3-bucket: pytorch
s3-prefix: source_code/test
if-no-files-found: warn
path: ${{ needs.release.outputs.pt_release_name }}
path: |
${{ needs.release.outputs.pt_release_name }}
${{ needs.release.outputs.pt_pep517_release_name }}
concurrency:
group: ${{ github.workflow }}-${{ github.event.pull_request.number || github.sha }}-${{ github.event_name }}

View File

@ -59,7 +59,6 @@ jobs:
pytorch-linux-jammy-py3.13-clang12,
pytorch-linux-jammy-rocm-n-py3,
pytorch-linux-noble-rocm-n-py3,
pytorch-linux-noble-rocm-alpha-py3,
pytorch-linux-jammy-rocm-n-py3-benchmarks,
pytorch-linux-jammy-cuda12.8-cudnn9-py3.10-clang12,
pytorch-linux-jammy-py3.10-gcc11,
@ -70,7 +69,7 @@ jobs:
pytorch-linux-jammy-py3-clang18-asan,
pytorch-linux-jammy-py3-clang12-onnx,
pytorch-linux-jammy-linter,
pytorch-linux-jammy-cuda12.8-cudnn9-py3.9-linter,
pytorch-linux-jammy-cuda12.8-cudnn9-py3.10-linter,
pytorch-linux-jammy-py3-clang12-executorch,
pytorch-linux-jammy-py3.12-triton-cpu,
pytorch-linux-noble-riscv64-py3.12-gcc14

View File

@ -62,7 +62,7 @@ jobs:
DOCKER_IMAGE_TAG_PREFIX: cpu-aarch64
DESIRED_PYTHON: "3.10"
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
runs_on: linux.arm64.m7g.4xlarge.ephemeral
runs_on: linux.arm64.r7g.12xlarge.memory
ALPINE_IMAGE: "arm64v8/alpine"
build_name: manywheel-py3_10-cpu-aarch64
build_environment: linux-aarch64-binary-manywheel
@ -128,11 +128,11 @@ jobs:
DOCKER_IMAGE_TAG_PREFIX: cuda12.6
DESIRED_PYTHON: "3.10"
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
runs_on: linux.arm64.m7g.4xlarge.ephemeral
runs_on: linux.arm64.r7g.12xlarge.memory
ALPINE_IMAGE: "arm64v8/alpine"
build_name: manywheel-py3_10-cuda-aarch64-12_6
build_environment: linux-aarch64-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.6.77; platform_system == 'Linux' | nvidia-cuda-runtime-cu12==12.6.77; platform_system == 'Linux' | nvidia-cuda-cupti-cu12==12.6.80; platform_system == 'Linux' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' | nvidia-cublas-cu12==12.6.4.1; platform_system == 'Linux' | nvidia-cufft-cu12==11.3.0.4; platform_system == 'Linux' | nvidia-curand-cu12==10.3.7.77; platform_system == 'Linux' | nvidia-cusolver-cu12==11.7.1.2; platform_system == 'Linux' | nvidia-cusparse-cu12==12.5.4.2; platform_system == 'Linux' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' | nvidia-nccl-cu12==2.27.5; platform_system == 'Linux' | nvidia-nvshmem-cu12==3.3.24; platform_system == 'Linux' | nvidia-nvtx-cu12==12.6.77; platform_system == 'Linux' | nvidia-nvjitlink-cu12==12.6.85; platform_system == 'Linux' | nvidia-cufile-cu12==1.11.1.6; platform_system == 'Linux'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.6.77; platform_system == 'Linux' | nvidia-cuda-runtime-cu12==12.6.77; platform_system == 'Linux' | nvidia-cuda-cupti-cu12==12.6.80; platform_system == 'Linux' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' | nvidia-cublas-cu12==12.6.4.1; platform_system == 'Linux' | nvidia-cufft-cu12==11.3.0.4; platform_system == 'Linux' | nvidia-curand-cu12==10.3.7.77; platform_system == 'Linux' | nvidia-cusolver-cu12==11.7.1.2; platform_system == 'Linux' | nvidia-cusparse-cu12==12.5.4.2; platform_system == 'Linux' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' | nvidia-nccl-cu12==2.28.3; platform_system == 'Linux' | nvidia-nvshmem-cu12==3.3.24; platform_system == 'Linux' | nvidia-nvtx-cu12==12.6.77; platform_system == 'Linux' | nvidia-nvjitlink-cu12==12.6.85; platform_system == 'Linux' | nvidia-cufile-cu12==1.11.1.6; platform_system == 'Linux'
timeout-minutes: 420
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
@ -174,11 +174,11 @@ jobs:
DOCKER_IMAGE_TAG_PREFIX: cuda12.8
DESIRED_PYTHON: "3.10"
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
runs_on: linux.arm64.m7g.4xlarge.ephemeral
runs_on: linux.arm64.r7g.12xlarge.memory
ALPINE_IMAGE: "arm64v8/alpine"
build_name: manywheel-py3_10-cuda-aarch64-12_8
build_environment: linux-aarch64-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.8.93; platform_system == 'Linux' | nvidia-cuda-runtime-cu12==12.8.90; platform_system == 'Linux' | nvidia-cuda-cupti-cu12==12.8.90; platform_system == 'Linux' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' | nvidia-cublas-cu12==12.8.4.1; platform_system == 'Linux' | nvidia-cufft-cu12==11.3.3.83; platform_system == 'Linux' | nvidia-curand-cu12==10.3.9.90; platform_system == 'Linux' | nvidia-cusolver-cu12==11.7.3.90; platform_system == 'Linux' | nvidia-cusparse-cu12==12.5.8.93; platform_system == 'Linux' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' | nvidia-nccl-cu12==2.27.5; platform_system == 'Linux' | nvidia-nvshmem-cu12==3.3.24; platform_system == 'Linux' | nvidia-nvtx-cu12==12.8.90; platform_system == 'Linux' | nvidia-nvjitlink-cu12==12.8.93; platform_system == 'Linux' | nvidia-cufile-cu12==1.13.1.3; platform_system == 'Linux'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.8.93; platform_system == 'Linux' | nvidia-cuda-runtime-cu12==12.8.90; platform_system == 'Linux' | nvidia-cuda-cupti-cu12==12.8.90; platform_system == 'Linux' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' | nvidia-cublas-cu12==12.8.4.1; platform_system == 'Linux' | nvidia-cufft-cu12==11.3.3.83; platform_system == 'Linux' | nvidia-curand-cu12==10.3.9.90; platform_system == 'Linux' | nvidia-cusolver-cu12==11.7.3.90; platform_system == 'Linux' | nvidia-cusparse-cu12==12.5.8.93; platform_system == 'Linux' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' | nvidia-nccl-cu12==2.28.3; platform_system == 'Linux' | nvidia-nvshmem-cu12==3.3.24; platform_system == 'Linux' | nvidia-nvtx-cu12==12.8.90; platform_system == 'Linux' | nvidia-nvjitlink-cu12==12.8.93; platform_system == 'Linux' | nvidia-cufile-cu12==1.13.1.3; platform_system == 'Linux'
timeout-minutes: 420
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
@ -220,11 +220,11 @@ jobs:
DOCKER_IMAGE_TAG_PREFIX: cuda13.0
DESIRED_PYTHON: "3.10"
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
runs_on: linux.arm64.m7g.4xlarge.ephemeral
runs_on: linux.arm64.r7g.12xlarge.memory
ALPINE_IMAGE: "arm64v8/alpine"
build_name: manywheel-py3_10-cuda-aarch64-13_0
build_environment: linux-aarch64-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc==13.0.48; platform_system == 'Linux' | nvidia-cuda-runtime==13.0.48; platform_system == 'Linux' | nvidia-cuda-cupti==13.0.48; platform_system == 'Linux' | nvidia-cudnn-cu13==9.13.0.50; platform_system == 'Linux' | nvidia-cublas==13.0.0.19; platform_system == 'Linux' | nvidia-cufft==12.0.0.15; platform_system == 'Linux' | nvidia-curand==10.4.0.35; platform_system == 'Linux' | nvidia-cusolver==12.0.3.29; platform_system == 'Linux' | nvidia-cusparse==12.6.2.49; platform_system == 'Linux' | nvidia-cusparselt-cu13==0.8.0; platform_system == 'Linux' | nvidia-nccl-cu13==2.27.7; platform_system == 'Linux' | nvidia-nvshmem-cu13==3.3.24; platform_system == 'Linux' | nvidia-nvtx==13.0.39; platform_system == 'Linux' | nvidia-nvjitlink==13.0.39; platform_system == 'Linux' | nvidia-cufile==1.15.0.42; platform_system == 'Linux'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc==13.0.48; platform_system == 'Linux' | nvidia-cuda-runtime==13.0.48; platform_system == 'Linux' | nvidia-cuda-cupti==13.0.48; platform_system == 'Linux' | nvidia-cudnn-cu13==9.13.0.50; platform_system == 'Linux' | nvidia-cublas==13.0.0.19; platform_system == 'Linux' | nvidia-cufft==12.0.0.15; platform_system == 'Linux' | nvidia-curand==10.4.0.35; platform_system == 'Linux' | nvidia-cusolver==12.0.3.29; platform_system == 'Linux' | nvidia-cusparse==12.6.2.49; platform_system == 'Linux' | nvidia-cusparselt-cu13==0.8.0; platform_system == 'Linux' | nvidia-nccl-cu13==2.28.3; platform_system == 'Linux' | nvidia-nvshmem-cu13==3.3.24; platform_system == 'Linux' | nvidia-nvtx==13.0.39; platform_system == 'Linux' | nvidia-nvjitlink==13.0.39; platform_system == 'Linux' | nvidia-cufile==1.15.0.42; platform_system == 'Linux'
timeout-minutes: 420
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
@ -265,7 +265,7 @@ jobs:
DOCKER_IMAGE_TAG_PREFIX: cpu-aarch64
DESIRED_PYTHON: "3.11"
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
runs_on: linux.arm64.m7g.4xlarge.ephemeral
runs_on: linux.arm64.r7g.12xlarge.memory
ALPINE_IMAGE: "arm64v8/alpine"
build_name: manywheel-py3_11-cpu-aarch64
build_environment: linux-aarch64-binary-manywheel
@ -331,11 +331,11 @@ jobs:
DOCKER_IMAGE_TAG_PREFIX: cuda12.6
DESIRED_PYTHON: "3.11"
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
runs_on: linux.arm64.m7g.4xlarge.ephemeral
runs_on: linux.arm64.r7g.12xlarge.memory
ALPINE_IMAGE: "arm64v8/alpine"
build_name: manywheel-py3_11-cuda-aarch64-12_6
build_environment: linux-aarch64-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.6.77; platform_system == 'Linux' | nvidia-cuda-runtime-cu12==12.6.77; platform_system == 'Linux' | nvidia-cuda-cupti-cu12==12.6.80; platform_system == 'Linux' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' | nvidia-cublas-cu12==12.6.4.1; platform_system == 'Linux' | nvidia-cufft-cu12==11.3.0.4; platform_system == 'Linux' | nvidia-curand-cu12==10.3.7.77; platform_system == 'Linux' | nvidia-cusolver-cu12==11.7.1.2; platform_system == 'Linux' | nvidia-cusparse-cu12==12.5.4.2; platform_system == 'Linux' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' | nvidia-nccl-cu12==2.27.5; platform_system == 'Linux' | nvidia-nvshmem-cu12==3.3.24; platform_system == 'Linux' | nvidia-nvtx-cu12==12.6.77; platform_system == 'Linux' | nvidia-nvjitlink-cu12==12.6.85; platform_system == 'Linux' | nvidia-cufile-cu12==1.11.1.6; platform_system == 'Linux'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.6.77; platform_system == 'Linux' | nvidia-cuda-runtime-cu12==12.6.77; platform_system == 'Linux' | nvidia-cuda-cupti-cu12==12.6.80; platform_system == 'Linux' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' | nvidia-cublas-cu12==12.6.4.1; platform_system == 'Linux' | nvidia-cufft-cu12==11.3.0.4; platform_system == 'Linux' | nvidia-curand-cu12==10.3.7.77; platform_system == 'Linux' | nvidia-cusolver-cu12==11.7.1.2; platform_system == 'Linux' | nvidia-cusparse-cu12==12.5.4.2; platform_system == 'Linux' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' | nvidia-nccl-cu12==2.28.3; platform_system == 'Linux' | nvidia-nvshmem-cu12==3.3.24; platform_system == 'Linux' | nvidia-nvtx-cu12==12.6.77; platform_system == 'Linux' | nvidia-nvjitlink-cu12==12.6.85; platform_system == 'Linux' | nvidia-cufile-cu12==1.11.1.6; platform_system == 'Linux'
timeout-minutes: 420
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
@ -377,11 +377,11 @@ jobs:
DOCKER_IMAGE_TAG_PREFIX: cuda12.8
DESIRED_PYTHON: "3.11"
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
runs_on: linux.arm64.m7g.4xlarge.ephemeral
runs_on: linux.arm64.r7g.12xlarge.memory
ALPINE_IMAGE: "arm64v8/alpine"
build_name: manywheel-py3_11-cuda-aarch64-12_8
build_environment: linux-aarch64-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.8.93; platform_system == 'Linux' | nvidia-cuda-runtime-cu12==12.8.90; platform_system == 'Linux' | nvidia-cuda-cupti-cu12==12.8.90; platform_system == 'Linux' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' | nvidia-cublas-cu12==12.8.4.1; platform_system == 'Linux' | nvidia-cufft-cu12==11.3.3.83; platform_system == 'Linux' | nvidia-curand-cu12==10.3.9.90; platform_system == 'Linux' | nvidia-cusolver-cu12==11.7.3.90; platform_system == 'Linux' | nvidia-cusparse-cu12==12.5.8.93; platform_system == 'Linux' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' | nvidia-nccl-cu12==2.27.5; platform_system == 'Linux' | nvidia-nvshmem-cu12==3.3.24; platform_system == 'Linux' | nvidia-nvtx-cu12==12.8.90; platform_system == 'Linux' | nvidia-nvjitlink-cu12==12.8.93; platform_system == 'Linux' | nvidia-cufile-cu12==1.13.1.3; platform_system == 'Linux'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.8.93; platform_system == 'Linux' | nvidia-cuda-runtime-cu12==12.8.90; platform_system == 'Linux' | nvidia-cuda-cupti-cu12==12.8.90; platform_system == 'Linux' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' | nvidia-cublas-cu12==12.8.4.1; platform_system == 'Linux' | nvidia-cufft-cu12==11.3.3.83; platform_system == 'Linux' | nvidia-curand-cu12==10.3.9.90; platform_system == 'Linux' | nvidia-cusolver-cu12==11.7.3.90; platform_system == 'Linux' | nvidia-cusparse-cu12==12.5.8.93; platform_system == 'Linux' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' | nvidia-nccl-cu12==2.28.3; platform_system == 'Linux' | nvidia-nvshmem-cu12==3.3.24; platform_system == 'Linux' | nvidia-nvtx-cu12==12.8.90; platform_system == 'Linux' | nvidia-nvjitlink-cu12==12.8.93; platform_system == 'Linux' | nvidia-cufile-cu12==1.13.1.3; platform_system == 'Linux'
timeout-minutes: 420
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
@ -423,11 +423,11 @@ jobs:
DOCKER_IMAGE_TAG_PREFIX: cuda13.0
DESIRED_PYTHON: "3.11"
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
runs_on: linux.arm64.m7g.4xlarge.ephemeral
runs_on: linux.arm64.r7g.12xlarge.memory
ALPINE_IMAGE: "arm64v8/alpine"
build_name: manywheel-py3_11-cuda-aarch64-13_0
build_environment: linux-aarch64-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc==13.0.48; platform_system == 'Linux' | nvidia-cuda-runtime==13.0.48; platform_system == 'Linux' | nvidia-cuda-cupti==13.0.48; platform_system == 'Linux' | nvidia-cudnn-cu13==9.13.0.50; platform_system == 'Linux' | nvidia-cublas==13.0.0.19; platform_system == 'Linux' | nvidia-cufft==12.0.0.15; platform_system == 'Linux' | nvidia-curand==10.4.0.35; platform_system == 'Linux' | nvidia-cusolver==12.0.3.29; platform_system == 'Linux' | nvidia-cusparse==12.6.2.49; platform_system == 'Linux' | nvidia-cusparselt-cu13==0.8.0; platform_system == 'Linux' | nvidia-nccl-cu13==2.27.7; platform_system == 'Linux' | nvidia-nvshmem-cu13==3.3.24; platform_system == 'Linux' | nvidia-nvtx==13.0.39; platform_system == 'Linux' | nvidia-nvjitlink==13.0.39; platform_system == 'Linux' | nvidia-cufile==1.15.0.42; platform_system == 'Linux'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc==13.0.48; platform_system == 'Linux' | nvidia-cuda-runtime==13.0.48; platform_system == 'Linux' | nvidia-cuda-cupti==13.0.48; platform_system == 'Linux' | nvidia-cudnn-cu13==9.13.0.50; platform_system == 'Linux' | nvidia-cublas==13.0.0.19; platform_system == 'Linux' | nvidia-cufft==12.0.0.15; platform_system == 'Linux' | nvidia-curand==10.4.0.35; platform_system == 'Linux' | nvidia-cusolver==12.0.3.29; platform_system == 'Linux' | nvidia-cusparse==12.6.2.49; platform_system == 'Linux' | nvidia-cusparselt-cu13==0.8.0; platform_system == 'Linux' | nvidia-nccl-cu13==2.28.3; platform_system == 'Linux' | nvidia-nvshmem-cu13==3.3.24; platform_system == 'Linux' | nvidia-nvtx==13.0.39; platform_system == 'Linux' | nvidia-nvjitlink==13.0.39; platform_system == 'Linux' | nvidia-cufile==1.15.0.42; platform_system == 'Linux'
timeout-minutes: 420
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
@ -468,7 +468,7 @@ jobs:
DOCKER_IMAGE_TAG_PREFIX: cpu-aarch64
DESIRED_PYTHON: "3.12"
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
runs_on: linux.arm64.m7g.4xlarge.ephemeral
runs_on: linux.arm64.r7g.12xlarge.memory
ALPINE_IMAGE: "arm64v8/alpine"
build_name: manywheel-py3_12-cpu-aarch64
build_environment: linux-aarch64-binary-manywheel
@ -534,11 +534,11 @@ jobs:
DOCKER_IMAGE_TAG_PREFIX: cuda12.6
DESIRED_PYTHON: "3.12"
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
runs_on: linux.arm64.m7g.4xlarge.ephemeral
runs_on: linux.arm64.r7g.12xlarge.memory
ALPINE_IMAGE: "arm64v8/alpine"
build_name: manywheel-py3_12-cuda-aarch64-12_6
build_environment: linux-aarch64-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.6.77; platform_system == 'Linux' | nvidia-cuda-runtime-cu12==12.6.77; platform_system == 'Linux' | nvidia-cuda-cupti-cu12==12.6.80; platform_system == 'Linux' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' | nvidia-cublas-cu12==12.6.4.1; platform_system == 'Linux' | nvidia-cufft-cu12==11.3.0.4; platform_system == 'Linux' | nvidia-curand-cu12==10.3.7.77; platform_system == 'Linux' | nvidia-cusolver-cu12==11.7.1.2; platform_system == 'Linux' | nvidia-cusparse-cu12==12.5.4.2; platform_system == 'Linux' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' | nvidia-nccl-cu12==2.27.5; platform_system == 'Linux' | nvidia-nvshmem-cu12==3.3.24; platform_system == 'Linux' | nvidia-nvtx-cu12==12.6.77; platform_system == 'Linux' | nvidia-nvjitlink-cu12==12.6.85; platform_system == 'Linux' | nvidia-cufile-cu12==1.11.1.6; platform_system == 'Linux'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.6.77; platform_system == 'Linux' | nvidia-cuda-runtime-cu12==12.6.77; platform_system == 'Linux' | nvidia-cuda-cupti-cu12==12.6.80; platform_system == 'Linux' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' | nvidia-cublas-cu12==12.6.4.1; platform_system == 'Linux' | nvidia-cufft-cu12==11.3.0.4; platform_system == 'Linux' | nvidia-curand-cu12==10.3.7.77; platform_system == 'Linux' | nvidia-cusolver-cu12==11.7.1.2; platform_system == 'Linux' | nvidia-cusparse-cu12==12.5.4.2; platform_system == 'Linux' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' | nvidia-nccl-cu12==2.28.3; platform_system == 'Linux' | nvidia-nvshmem-cu12==3.3.24; platform_system == 'Linux' | nvidia-nvtx-cu12==12.6.77; platform_system == 'Linux' | nvidia-nvjitlink-cu12==12.6.85; platform_system == 'Linux' | nvidia-cufile-cu12==1.11.1.6; platform_system == 'Linux'
timeout-minutes: 420
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
@ -580,11 +580,11 @@ jobs:
DOCKER_IMAGE_TAG_PREFIX: cuda12.8
DESIRED_PYTHON: "3.12"
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
runs_on: linux.arm64.m7g.4xlarge.ephemeral
runs_on: linux.arm64.r7g.12xlarge.memory
ALPINE_IMAGE: "arm64v8/alpine"
build_name: manywheel-py3_12-cuda-aarch64-12_8
build_environment: linux-aarch64-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.8.93; platform_system == 'Linux' | nvidia-cuda-runtime-cu12==12.8.90; platform_system == 'Linux' | nvidia-cuda-cupti-cu12==12.8.90; platform_system == 'Linux' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' | nvidia-cublas-cu12==12.8.4.1; platform_system == 'Linux' | nvidia-cufft-cu12==11.3.3.83; platform_system == 'Linux' | nvidia-curand-cu12==10.3.9.90; platform_system == 'Linux' | nvidia-cusolver-cu12==11.7.3.90; platform_system == 'Linux' | nvidia-cusparse-cu12==12.5.8.93; platform_system == 'Linux' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' | nvidia-nccl-cu12==2.27.5; platform_system == 'Linux' | nvidia-nvshmem-cu12==3.3.24; platform_system == 'Linux' | nvidia-nvtx-cu12==12.8.90; platform_system == 'Linux' | nvidia-nvjitlink-cu12==12.8.93; platform_system == 'Linux' | nvidia-cufile-cu12==1.13.1.3; platform_system == 'Linux'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.8.93; platform_system == 'Linux' | nvidia-cuda-runtime-cu12==12.8.90; platform_system == 'Linux' | nvidia-cuda-cupti-cu12==12.8.90; platform_system == 'Linux' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' | nvidia-cublas-cu12==12.8.4.1; platform_system == 'Linux' | nvidia-cufft-cu12==11.3.3.83; platform_system == 'Linux' | nvidia-curand-cu12==10.3.9.90; platform_system == 'Linux' | nvidia-cusolver-cu12==11.7.3.90; platform_system == 'Linux' | nvidia-cusparse-cu12==12.5.8.93; platform_system == 'Linux' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' | nvidia-nccl-cu12==2.28.3; platform_system == 'Linux' | nvidia-nvshmem-cu12==3.3.24; platform_system == 'Linux' | nvidia-nvtx-cu12==12.8.90; platform_system == 'Linux' | nvidia-nvjitlink-cu12==12.8.93; platform_system == 'Linux' | nvidia-cufile-cu12==1.13.1.3; platform_system == 'Linux'
timeout-minutes: 420
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
@ -626,11 +626,11 @@ jobs:
DOCKER_IMAGE_TAG_PREFIX: cuda13.0
DESIRED_PYTHON: "3.12"
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
runs_on: linux.arm64.m7g.4xlarge.ephemeral
runs_on: linux.arm64.r7g.12xlarge.memory
ALPINE_IMAGE: "arm64v8/alpine"
build_name: manywheel-py3_12-cuda-aarch64-13_0
build_environment: linux-aarch64-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc==13.0.48; platform_system == 'Linux' | nvidia-cuda-runtime==13.0.48; platform_system == 'Linux' | nvidia-cuda-cupti==13.0.48; platform_system == 'Linux' | nvidia-cudnn-cu13==9.13.0.50; platform_system == 'Linux' | nvidia-cublas==13.0.0.19; platform_system == 'Linux' | nvidia-cufft==12.0.0.15; platform_system == 'Linux' | nvidia-curand==10.4.0.35; platform_system == 'Linux' | nvidia-cusolver==12.0.3.29; platform_system == 'Linux' | nvidia-cusparse==12.6.2.49; platform_system == 'Linux' | nvidia-cusparselt-cu13==0.8.0; platform_system == 'Linux' | nvidia-nccl-cu13==2.27.7; platform_system == 'Linux' | nvidia-nvshmem-cu13==3.3.24; platform_system == 'Linux' | nvidia-nvtx==13.0.39; platform_system == 'Linux' | nvidia-nvjitlink==13.0.39; platform_system == 'Linux' | nvidia-cufile==1.15.0.42; platform_system == 'Linux'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc==13.0.48; platform_system == 'Linux' | nvidia-cuda-runtime==13.0.48; platform_system == 'Linux' | nvidia-cuda-cupti==13.0.48; platform_system == 'Linux' | nvidia-cudnn-cu13==9.13.0.50; platform_system == 'Linux' | nvidia-cublas==13.0.0.19; platform_system == 'Linux' | nvidia-cufft==12.0.0.15; platform_system == 'Linux' | nvidia-curand==10.4.0.35; platform_system == 'Linux' | nvidia-cusolver==12.0.3.29; platform_system == 'Linux' | nvidia-cusparse==12.6.2.49; platform_system == 'Linux' | nvidia-cusparselt-cu13==0.8.0; platform_system == 'Linux' | nvidia-nccl-cu13==2.28.3; platform_system == 'Linux' | nvidia-nvshmem-cu13==3.3.24; platform_system == 'Linux' | nvidia-nvtx==13.0.39; platform_system == 'Linux' | nvidia-nvjitlink==13.0.39; platform_system == 'Linux' | nvidia-cufile==1.15.0.42; platform_system == 'Linux'
timeout-minutes: 420
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
@ -671,7 +671,7 @@ jobs:
DOCKER_IMAGE_TAG_PREFIX: cpu-aarch64
DESIRED_PYTHON: "3.13"
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
runs_on: linux.arm64.m7g.4xlarge.ephemeral
runs_on: linux.arm64.r7g.12xlarge.memory
ALPINE_IMAGE: "arm64v8/alpine"
build_name: manywheel-py3_13-cpu-aarch64
build_environment: linux-aarch64-binary-manywheel
@ -737,11 +737,11 @@ jobs:
DOCKER_IMAGE_TAG_PREFIX: cuda12.6
DESIRED_PYTHON: "3.13"
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
runs_on: linux.arm64.m7g.4xlarge.ephemeral
runs_on: linux.arm64.r7g.12xlarge.memory
ALPINE_IMAGE: "arm64v8/alpine"
build_name: manywheel-py3_13-cuda-aarch64-12_6
build_environment: linux-aarch64-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.6.77; platform_system == 'Linux' | nvidia-cuda-runtime-cu12==12.6.77; platform_system == 'Linux' | nvidia-cuda-cupti-cu12==12.6.80; platform_system == 'Linux' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' | nvidia-cublas-cu12==12.6.4.1; platform_system == 'Linux' | nvidia-cufft-cu12==11.3.0.4; platform_system == 'Linux' | nvidia-curand-cu12==10.3.7.77; platform_system == 'Linux' | nvidia-cusolver-cu12==11.7.1.2; platform_system == 'Linux' | nvidia-cusparse-cu12==12.5.4.2; platform_system == 'Linux' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' | nvidia-nccl-cu12==2.27.5; platform_system == 'Linux' | nvidia-nvshmem-cu12==3.3.24; platform_system == 'Linux' | nvidia-nvtx-cu12==12.6.77; platform_system == 'Linux' | nvidia-nvjitlink-cu12==12.6.85; platform_system == 'Linux' | nvidia-cufile-cu12==1.11.1.6; platform_system == 'Linux'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.6.77; platform_system == 'Linux' | nvidia-cuda-runtime-cu12==12.6.77; platform_system == 'Linux' | nvidia-cuda-cupti-cu12==12.6.80; platform_system == 'Linux' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' | nvidia-cublas-cu12==12.6.4.1; platform_system == 'Linux' | nvidia-cufft-cu12==11.3.0.4; platform_system == 'Linux' | nvidia-curand-cu12==10.3.7.77; platform_system == 'Linux' | nvidia-cusolver-cu12==11.7.1.2; platform_system == 'Linux' | nvidia-cusparse-cu12==12.5.4.2; platform_system == 'Linux' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' | nvidia-nccl-cu12==2.28.3; platform_system == 'Linux' | nvidia-nvshmem-cu12==3.3.24; platform_system == 'Linux' | nvidia-nvtx-cu12==12.6.77; platform_system == 'Linux' | nvidia-nvjitlink-cu12==12.6.85; platform_system == 'Linux' | nvidia-cufile-cu12==1.11.1.6; platform_system == 'Linux'
timeout-minutes: 420
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
@ -783,11 +783,11 @@ jobs:
DOCKER_IMAGE_TAG_PREFIX: cuda12.8
DESIRED_PYTHON: "3.13"
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
runs_on: linux.arm64.m7g.4xlarge.ephemeral
runs_on: linux.arm64.r7g.12xlarge.memory
ALPINE_IMAGE: "arm64v8/alpine"
build_name: manywheel-py3_13-cuda-aarch64-12_8
build_environment: linux-aarch64-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.8.93; platform_system == 'Linux' | nvidia-cuda-runtime-cu12==12.8.90; platform_system == 'Linux' | nvidia-cuda-cupti-cu12==12.8.90; platform_system == 'Linux' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' | nvidia-cublas-cu12==12.8.4.1; platform_system == 'Linux' | nvidia-cufft-cu12==11.3.3.83; platform_system == 'Linux' | nvidia-curand-cu12==10.3.9.90; platform_system == 'Linux' | nvidia-cusolver-cu12==11.7.3.90; platform_system == 'Linux' | nvidia-cusparse-cu12==12.5.8.93; platform_system == 'Linux' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' | nvidia-nccl-cu12==2.27.5; platform_system == 'Linux' | nvidia-nvshmem-cu12==3.3.24; platform_system == 'Linux' | nvidia-nvtx-cu12==12.8.90; platform_system == 'Linux' | nvidia-nvjitlink-cu12==12.8.93; platform_system == 'Linux' | nvidia-cufile-cu12==1.13.1.3; platform_system == 'Linux'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.8.93; platform_system == 'Linux' | nvidia-cuda-runtime-cu12==12.8.90; platform_system == 'Linux' | nvidia-cuda-cupti-cu12==12.8.90; platform_system == 'Linux' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' | nvidia-cublas-cu12==12.8.4.1; platform_system == 'Linux' | nvidia-cufft-cu12==11.3.3.83; platform_system == 'Linux' | nvidia-curand-cu12==10.3.9.90; platform_system == 'Linux' | nvidia-cusolver-cu12==11.7.3.90; platform_system == 'Linux' | nvidia-cusparse-cu12==12.5.8.93; platform_system == 'Linux' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' | nvidia-nccl-cu12==2.28.3; platform_system == 'Linux' | nvidia-nvshmem-cu12==3.3.24; platform_system == 'Linux' | nvidia-nvtx-cu12==12.8.90; platform_system == 'Linux' | nvidia-nvjitlink-cu12==12.8.93; platform_system == 'Linux' | nvidia-cufile-cu12==1.13.1.3; platform_system == 'Linux'
timeout-minutes: 420
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
@@ -829,11 +829,11 @@ jobs:
DOCKER_IMAGE_TAG_PREFIX: cuda13.0
DESIRED_PYTHON: "3.13"
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
runs_on: linux.arm64.m7g.4xlarge.ephemeral
runs_on: linux.arm64.r7g.12xlarge.memory
ALPINE_IMAGE: "arm64v8/alpine"
build_name: manywheel-py3_13-cuda-aarch64-13_0
build_environment: linux-aarch64-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc==13.0.48; platform_system == 'Linux' | nvidia-cuda-runtime==13.0.48; platform_system == 'Linux' | nvidia-cuda-cupti==13.0.48; platform_system == 'Linux' | nvidia-cudnn-cu13==9.13.0.50; platform_system == 'Linux' | nvidia-cublas==13.0.0.19; platform_system == 'Linux' | nvidia-cufft==12.0.0.15; platform_system == 'Linux' | nvidia-curand==10.4.0.35; platform_system == 'Linux' | nvidia-cusolver==12.0.3.29; platform_system == 'Linux' | nvidia-cusparse==12.6.2.49; platform_system == 'Linux' | nvidia-cusparselt-cu13==0.8.0; platform_system == 'Linux' | nvidia-nccl-cu13==2.27.7; platform_system == 'Linux' | nvidia-nvshmem-cu13==3.3.24; platform_system == 'Linux' | nvidia-nvtx==13.0.39; platform_system == 'Linux' | nvidia-nvjitlink==13.0.39; platform_system == 'Linux' | nvidia-cufile==1.15.0.42; platform_system == 'Linux'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc==13.0.48; platform_system == 'Linux' | nvidia-cuda-runtime==13.0.48; platform_system == 'Linux' | nvidia-cuda-cupti==13.0.48; platform_system == 'Linux' | nvidia-cudnn-cu13==9.13.0.50; platform_system == 'Linux' | nvidia-cublas==13.0.0.19; platform_system == 'Linux' | nvidia-cufft==12.0.0.15; platform_system == 'Linux' | nvidia-curand==10.4.0.35; platform_system == 'Linux' | nvidia-cusolver==12.0.3.29; platform_system == 'Linux' | nvidia-cusparse==12.6.2.49; platform_system == 'Linux' | nvidia-cusparselt-cu13==0.8.0; platform_system == 'Linux' | nvidia-nccl-cu13==2.28.3; platform_system == 'Linux' | nvidia-nvshmem-cu13==3.3.24; platform_system == 'Linux' | nvidia-nvtx==13.0.39; platform_system == 'Linux' | nvidia-nvjitlink==13.0.39; platform_system == 'Linux' | nvidia-cufile==1.15.0.42; platform_system == 'Linux'
timeout-minutes: 420
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
@@ -874,7 +874,7 @@ jobs:
DOCKER_IMAGE_TAG_PREFIX: cpu-aarch64
DESIRED_PYTHON: "3.13t"
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
runs_on: linux.arm64.m7g.4xlarge.ephemeral
runs_on: linux.arm64.r7g.12xlarge.memory
ALPINE_IMAGE: "arm64v8/alpine"
build_name: manywheel-py3_13t-cpu-aarch64
build_environment: linux-aarch64-binary-manywheel
@@ -940,11 +940,11 @@ jobs:
DOCKER_IMAGE_TAG_PREFIX: cuda12.6
DESIRED_PYTHON: "3.13t"
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
runs_on: linux.arm64.m7g.4xlarge.ephemeral
runs_on: linux.arm64.r7g.12xlarge.memory
ALPINE_IMAGE: "arm64v8/alpine"
build_name: manywheel-py3_13t-cuda-aarch64-12_6
build_environment: linux-aarch64-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.6.77; platform_system == 'Linux' | nvidia-cuda-runtime-cu12==12.6.77; platform_system == 'Linux' | nvidia-cuda-cupti-cu12==12.6.80; platform_system == 'Linux' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' | nvidia-cublas-cu12==12.6.4.1; platform_system == 'Linux' | nvidia-cufft-cu12==11.3.0.4; platform_system == 'Linux' | nvidia-curand-cu12==10.3.7.77; platform_system == 'Linux' | nvidia-cusolver-cu12==11.7.1.2; platform_system == 'Linux' | nvidia-cusparse-cu12==12.5.4.2; platform_system == 'Linux' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' | nvidia-nccl-cu12==2.27.5; platform_system == 'Linux' | nvidia-nvshmem-cu12==3.3.24; platform_system == 'Linux' | nvidia-nvtx-cu12==12.6.77; platform_system == 'Linux' | nvidia-nvjitlink-cu12==12.6.85; platform_system == 'Linux' | nvidia-cufile-cu12==1.11.1.6; platform_system == 'Linux'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.6.77; platform_system == 'Linux' | nvidia-cuda-runtime-cu12==12.6.77; platform_system == 'Linux' | nvidia-cuda-cupti-cu12==12.6.80; platform_system == 'Linux' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' | nvidia-cublas-cu12==12.6.4.1; platform_system == 'Linux' | nvidia-cufft-cu12==11.3.0.4; platform_system == 'Linux' | nvidia-curand-cu12==10.3.7.77; platform_system == 'Linux' | nvidia-cusolver-cu12==11.7.1.2; platform_system == 'Linux' | nvidia-cusparse-cu12==12.5.4.2; platform_system == 'Linux' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' | nvidia-nccl-cu12==2.28.3; platform_system == 'Linux' | nvidia-nvshmem-cu12==3.3.24; platform_system == 'Linux' | nvidia-nvtx-cu12==12.6.77; platform_system == 'Linux' | nvidia-nvjitlink-cu12==12.6.85; platform_system == 'Linux' | nvidia-cufile-cu12==1.11.1.6; platform_system == 'Linux'
timeout-minutes: 420
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
@@ -986,11 +986,11 @@ jobs:
DOCKER_IMAGE_TAG_PREFIX: cuda12.8
DESIRED_PYTHON: "3.13t"
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
runs_on: linux.arm64.m7g.4xlarge.ephemeral
runs_on: linux.arm64.r7g.12xlarge.memory
ALPINE_IMAGE: "arm64v8/alpine"
build_name: manywheel-py3_13t-cuda-aarch64-12_8
build_environment: linux-aarch64-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.8.93; platform_system == 'Linux' | nvidia-cuda-runtime-cu12==12.8.90; platform_system == 'Linux' | nvidia-cuda-cupti-cu12==12.8.90; platform_system == 'Linux' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' | nvidia-cublas-cu12==12.8.4.1; platform_system == 'Linux' | nvidia-cufft-cu12==11.3.3.83; platform_system == 'Linux' | nvidia-curand-cu12==10.3.9.90; platform_system == 'Linux' | nvidia-cusolver-cu12==11.7.3.90; platform_system == 'Linux' | nvidia-cusparse-cu12==12.5.8.93; platform_system == 'Linux' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' | nvidia-nccl-cu12==2.27.5; platform_system == 'Linux' | nvidia-nvshmem-cu12==3.3.24; platform_system == 'Linux' | nvidia-nvtx-cu12==12.8.90; platform_system == 'Linux' | nvidia-nvjitlink-cu12==12.8.93; platform_system == 'Linux' | nvidia-cufile-cu12==1.13.1.3; platform_system == 'Linux'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.8.93; platform_system == 'Linux' | nvidia-cuda-runtime-cu12==12.8.90; platform_system == 'Linux' | nvidia-cuda-cupti-cu12==12.8.90; platform_system == 'Linux' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' | nvidia-cublas-cu12==12.8.4.1; platform_system == 'Linux' | nvidia-cufft-cu12==11.3.3.83; platform_system == 'Linux' | nvidia-curand-cu12==10.3.9.90; platform_system == 'Linux' | nvidia-cusolver-cu12==11.7.3.90; platform_system == 'Linux' | nvidia-cusparse-cu12==12.5.8.93; platform_system == 'Linux' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' | nvidia-nccl-cu12==2.28.3; platform_system == 'Linux' | nvidia-nvshmem-cu12==3.3.24; platform_system == 'Linux' | nvidia-nvtx-cu12==12.8.90; platform_system == 'Linux' | nvidia-nvjitlink-cu12==12.8.93; platform_system == 'Linux' | nvidia-cufile-cu12==1.13.1.3; platform_system == 'Linux'
timeout-minutes: 420
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
@@ -1032,11 +1032,11 @@ jobs:
DOCKER_IMAGE_TAG_PREFIX: cuda13.0
DESIRED_PYTHON: "3.13t"
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
runs_on: linux.arm64.m7g.4xlarge.ephemeral
runs_on: linux.arm64.r7g.12xlarge.memory
ALPINE_IMAGE: "arm64v8/alpine"
build_name: manywheel-py3_13t-cuda-aarch64-13_0
build_environment: linux-aarch64-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc==13.0.48; platform_system == 'Linux' | nvidia-cuda-runtime==13.0.48; platform_system == 'Linux' | nvidia-cuda-cupti==13.0.48; platform_system == 'Linux' | nvidia-cudnn-cu13==9.13.0.50; platform_system == 'Linux' | nvidia-cublas==13.0.0.19; platform_system == 'Linux' | nvidia-cufft==12.0.0.15; platform_system == 'Linux' | nvidia-curand==10.4.0.35; platform_system == 'Linux' | nvidia-cusolver==12.0.3.29; platform_system == 'Linux' | nvidia-cusparse==12.6.2.49; platform_system == 'Linux' | nvidia-cusparselt-cu13==0.8.0; platform_system == 'Linux' | nvidia-nccl-cu13==2.27.7; platform_system == 'Linux' | nvidia-nvshmem-cu13==3.3.24; platform_system == 'Linux' | nvidia-nvtx==13.0.39; platform_system == 'Linux' | nvidia-nvjitlink==13.0.39; platform_system == 'Linux' | nvidia-cufile==1.15.0.42; platform_system == 'Linux'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc==13.0.48; platform_system == 'Linux' | nvidia-cuda-runtime==13.0.48; platform_system == 'Linux' | nvidia-cuda-cupti==13.0.48; platform_system == 'Linux' | nvidia-cudnn-cu13==9.13.0.50; platform_system == 'Linux' | nvidia-cublas==13.0.0.19; platform_system == 'Linux' | nvidia-cufft==12.0.0.15; platform_system == 'Linux' | nvidia-curand==10.4.0.35; platform_system == 'Linux' | nvidia-cusolver==12.0.3.29; platform_system == 'Linux' | nvidia-cusparse==12.6.2.49; platform_system == 'Linux' | nvidia-cusparselt-cu13==0.8.0; platform_system == 'Linux' | nvidia-nccl-cu13==2.28.3; platform_system == 'Linux' | nvidia-nvshmem-cu13==3.3.24; platform_system == 'Linux' | nvidia-nvtx==13.0.39; platform_system == 'Linux' | nvidia-nvjitlink==13.0.39; platform_system == 'Linux' | nvidia-cufile==1.15.0.42; platform_system == 'Linux'
timeout-minutes: 420
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
@@ -1077,7 +1077,7 @@ jobs:
DOCKER_IMAGE_TAG_PREFIX: cpu-aarch64
DESIRED_PYTHON: "3.14"
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
runs_on: linux.arm64.m7g.4xlarge.ephemeral
runs_on: linux.arm64.r7g.12xlarge.memory
ALPINE_IMAGE: "arm64v8/alpine"
build_name: manywheel-py3_14-cpu-aarch64
build_environment: linux-aarch64-binary-manywheel
@@ -1143,11 +1143,11 @@ jobs:
DOCKER_IMAGE_TAG_PREFIX: cuda12.6
DESIRED_PYTHON: "3.14"
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
runs_on: linux.arm64.m7g.4xlarge.ephemeral
runs_on: linux.arm64.r7g.12xlarge.memory
ALPINE_IMAGE: "arm64v8/alpine"
build_name: manywheel-py3_14-cuda-aarch64-12_6
build_environment: linux-aarch64-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.6.77; platform_system == 'Linux' | nvidia-cuda-runtime-cu12==12.6.77; platform_system == 'Linux' | nvidia-cuda-cupti-cu12==12.6.80; platform_system == 'Linux' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' | nvidia-cublas-cu12==12.6.4.1; platform_system == 'Linux' | nvidia-cufft-cu12==11.3.0.4; platform_system == 'Linux' | nvidia-curand-cu12==10.3.7.77; platform_system == 'Linux' | nvidia-cusolver-cu12==11.7.1.2; platform_system == 'Linux' | nvidia-cusparse-cu12==12.5.4.2; platform_system == 'Linux' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' | nvidia-nccl-cu12==2.27.5; platform_system == 'Linux' | nvidia-nvshmem-cu12==3.3.24; platform_system == 'Linux' | nvidia-nvtx-cu12==12.6.77; platform_system == 'Linux' | nvidia-nvjitlink-cu12==12.6.85; platform_system == 'Linux' | nvidia-cufile-cu12==1.11.1.6; platform_system == 'Linux'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.6.77; platform_system == 'Linux' | nvidia-cuda-runtime-cu12==12.6.77; platform_system == 'Linux' | nvidia-cuda-cupti-cu12==12.6.80; platform_system == 'Linux' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' | nvidia-cublas-cu12==12.6.4.1; platform_system == 'Linux' | nvidia-cufft-cu12==11.3.0.4; platform_system == 'Linux' | nvidia-curand-cu12==10.3.7.77; platform_system == 'Linux' | nvidia-cusolver-cu12==11.7.1.2; platform_system == 'Linux' | nvidia-cusparse-cu12==12.5.4.2; platform_system == 'Linux' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' | nvidia-nccl-cu12==2.28.3; platform_system == 'Linux' | nvidia-nvshmem-cu12==3.3.24; platform_system == 'Linux' | nvidia-nvtx-cu12==12.6.77; platform_system == 'Linux' | nvidia-nvjitlink-cu12==12.6.85; platform_system == 'Linux' | nvidia-cufile-cu12==1.11.1.6; platform_system == 'Linux'
timeout-minutes: 420
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
@@ -1189,11 +1189,11 @@ jobs:
DOCKER_IMAGE_TAG_PREFIX: cuda12.8
DESIRED_PYTHON: "3.14"
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
runs_on: linux.arm64.m7g.4xlarge.ephemeral
runs_on: linux.arm64.r7g.12xlarge.memory
ALPINE_IMAGE: "arm64v8/alpine"
build_name: manywheel-py3_14-cuda-aarch64-12_8
build_environment: linux-aarch64-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.8.93; platform_system == 'Linux' | nvidia-cuda-runtime-cu12==12.8.90; platform_system == 'Linux' | nvidia-cuda-cupti-cu12==12.8.90; platform_system == 'Linux' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' | nvidia-cublas-cu12==12.8.4.1; platform_system == 'Linux' | nvidia-cufft-cu12==11.3.3.83; platform_system == 'Linux' | nvidia-curand-cu12==10.3.9.90; platform_system == 'Linux' | nvidia-cusolver-cu12==11.7.3.90; platform_system == 'Linux' | nvidia-cusparse-cu12==12.5.8.93; platform_system == 'Linux' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' | nvidia-nccl-cu12==2.27.5; platform_system == 'Linux' | nvidia-nvshmem-cu12==3.3.24; platform_system == 'Linux' | nvidia-nvtx-cu12==12.8.90; platform_system == 'Linux' | nvidia-nvjitlink-cu12==12.8.93; platform_system == 'Linux' | nvidia-cufile-cu12==1.13.1.3; platform_system == 'Linux'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.8.93; platform_system == 'Linux' | nvidia-cuda-runtime-cu12==12.8.90; platform_system == 'Linux' | nvidia-cuda-cupti-cu12==12.8.90; platform_system == 'Linux' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' | nvidia-cublas-cu12==12.8.4.1; platform_system == 'Linux' | nvidia-cufft-cu12==11.3.3.83; platform_system == 'Linux' | nvidia-curand-cu12==10.3.9.90; platform_system == 'Linux' | nvidia-cusolver-cu12==11.7.3.90; platform_system == 'Linux' | nvidia-cusparse-cu12==12.5.8.93; platform_system == 'Linux' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' | nvidia-nccl-cu12==2.28.3; platform_system == 'Linux' | nvidia-nvshmem-cu12==3.3.24; platform_system == 'Linux' | nvidia-nvtx-cu12==12.8.90; platform_system == 'Linux' | nvidia-nvjitlink-cu12==12.8.93; platform_system == 'Linux' | nvidia-cufile-cu12==1.13.1.3; platform_system == 'Linux'
timeout-minutes: 420
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
@@ -1235,11 +1235,11 @@ jobs:
DOCKER_IMAGE_TAG_PREFIX: cuda13.0
DESIRED_PYTHON: "3.14"
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
runs_on: linux.arm64.m7g.4xlarge.ephemeral
runs_on: linux.arm64.r7g.12xlarge.memory
ALPINE_IMAGE: "arm64v8/alpine"
build_name: manywheel-py3_14-cuda-aarch64-13_0
build_environment: linux-aarch64-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc==13.0.48; platform_system == 'Linux' | nvidia-cuda-runtime==13.0.48; platform_system == 'Linux' | nvidia-cuda-cupti==13.0.48; platform_system == 'Linux' | nvidia-cudnn-cu13==9.13.0.50; platform_system == 'Linux' | nvidia-cublas==13.0.0.19; platform_system == 'Linux' | nvidia-cufft==12.0.0.15; platform_system == 'Linux' | nvidia-curand==10.4.0.35; platform_system == 'Linux' | nvidia-cusolver==12.0.3.29; platform_system == 'Linux' | nvidia-cusparse==12.6.2.49; platform_system == 'Linux' | nvidia-cusparselt-cu13==0.8.0; platform_system == 'Linux' | nvidia-nccl-cu13==2.27.7; platform_system == 'Linux' | nvidia-nvshmem-cu13==3.3.24; platform_system == 'Linux' | nvidia-nvtx==13.0.39; platform_system == 'Linux' | nvidia-nvjitlink==13.0.39; platform_system == 'Linux' | nvidia-cufile==1.15.0.42; platform_system == 'Linux'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc==13.0.48; platform_system == 'Linux' | nvidia-cuda-runtime==13.0.48; platform_system == 'Linux' | nvidia-cuda-cupti==13.0.48; platform_system == 'Linux' | nvidia-cudnn-cu13==9.13.0.50; platform_system == 'Linux' | nvidia-cublas==13.0.0.19; platform_system == 'Linux' | nvidia-cufft==12.0.0.15; platform_system == 'Linux' | nvidia-curand==10.4.0.35; platform_system == 'Linux' | nvidia-cusolver==12.0.3.29; platform_system == 'Linux' | nvidia-cusparse==12.6.2.49; platform_system == 'Linux' | nvidia-cusparselt-cu13==0.8.0; platform_system == 'Linux' | nvidia-nccl-cu13==2.28.3; platform_system == 'Linux' | nvidia-nvshmem-cu13==3.3.24; platform_system == 'Linux' | nvidia-nvtx==13.0.39; platform_system == 'Linux' | nvidia-nvjitlink==13.0.39; platform_system == 'Linux' | nvidia-cufile==1.15.0.42; platform_system == 'Linux'
timeout-minutes: 420
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
@@ -1280,7 +1280,7 @@ jobs:
DOCKER_IMAGE_TAG_PREFIX: cpu-aarch64
DESIRED_PYTHON: "3.14t"
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
runs_on: linux.arm64.m7g.4xlarge.ephemeral
runs_on: linux.arm64.r7g.12xlarge.memory
ALPINE_IMAGE: "arm64v8/alpine"
build_name: manywheel-py3_14t-cpu-aarch64
build_environment: linux-aarch64-binary-manywheel
@@ -1346,11 +1346,11 @@ jobs:
DOCKER_IMAGE_TAG_PREFIX: cuda12.6
DESIRED_PYTHON: "3.14t"
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
runs_on: linux.arm64.m7g.4xlarge.ephemeral
runs_on: linux.arm64.r7g.12xlarge.memory
ALPINE_IMAGE: "arm64v8/alpine"
build_name: manywheel-py3_14t-cuda-aarch64-12_6
build_environment: linux-aarch64-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.6.77; platform_system == 'Linux' | nvidia-cuda-runtime-cu12==12.6.77; platform_system == 'Linux' | nvidia-cuda-cupti-cu12==12.6.80; platform_system == 'Linux' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' | nvidia-cublas-cu12==12.6.4.1; platform_system == 'Linux' | nvidia-cufft-cu12==11.3.0.4; platform_system == 'Linux' | nvidia-curand-cu12==10.3.7.77; platform_system == 'Linux' | nvidia-cusolver-cu12==11.7.1.2; platform_system == 'Linux' | nvidia-cusparse-cu12==12.5.4.2; platform_system == 'Linux' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' | nvidia-nccl-cu12==2.27.5; platform_system == 'Linux' | nvidia-nvshmem-cu12==3.3.24; platform_system == 'Linux' | nvidia-nvtx-cu12==12.6.77; platform_system == 'Linux' | nvidia-nvjitlink-cu12==12.6.85; platform_system == 'Linux' | nvidia-cufile-cu12==1.11.1.6; platform_system == 'Linux'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.6.77; platform_system == 'Linux' | nvidia-cuda-runtime-cu12==12.6.77; platform_system == 'Linux' | nvidia-cuda-cupti-cu12==12.6.80; platform_system == 'Linux' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' | nvidia-cublas-cu12==12.6.4.1; platform_system == 'Linux' | nvidia-cufft-cu12==11.3.0.4; platform_system == 'Linux' | nvidia-curand-cu12==10.3.7.77; platform_system == 'Linux' | nvidia-cusolver-cu12==11.7.1.2; platform_system == 'Linux' | nvidia-cusparse-cu12==12.5.4.2; platform_system == 'Linux' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' | nvidia-nccl-cu12==2.28.3; platform_system == 'Linux' | nvidia-nvshmem-cu12==3.3.24; platform_system == 'Linux' | nvidia-nvtx-cu12==12.6.77; platform_system == 'Linux' | nvidia-nvjitlink-cu12==12.6.85; platform_system == 'Linux' | nvidia-cufile-cu12==1.11.1.6; platform_system == 'Linux'
timeout-minutes: 420
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
@@ -1392,11 +1392,11 @@ jobs:
DOCKER_IMAGE_TAG_PREFIX: cuda12.8
DESIRED_PYTHON: "3.14t"
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
runs_on: linux.arm64.m7g.4xlarge.ephemeral
runs_on: linux.arm64.r7g.12xlarge.memory
ALPINE_IMAGE: "arm64v8/alpine"
build_name: manywheel-py3_14t-cuda-aarch64-12_8
build_environment: linux-aarch64-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.8.93; platform_system == 'Linux' | nvidia-cuda-runtime-cu12==12.8.90; platform_system == 'Linux' | nvidia-cuda-cupti-cu12==12.8.90; platform_system == 'Linux' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' | nvidia-cublas-cu12==12.8.4.1; platform_system == 'Linux' | nvidia-cufft-cu12==11.3.3.83; platform_system == 'Linux' | nvidia-curand-cu12==10.3.9.90; platform_system == 'Linux' | nvidia-cusolver-cu12==11.7.3.90; platform_system == 'Linux' | nvidia-cusparse-cu12==12.5.8.93; platform_system == 'Linux' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' | nvidia-nccl-cu12==2.27.5; platform_system == 'Linux' | nvidia-nvshmem-cu12==3.3.24; platform_system == 'Linux' | nvidia-nvtx-cu12==12.8.90; platform_system == 'Linux' | nvidia-nvjitlink-cu12==12.8.93; platform_system == 'Linux' | nvidia-cufile-cu12==1.13.1.3; platform_system == 'Linux'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.8.93; platform_system == 'Linux' | nvidia-cuda-runtime-cu12==12.8.90; platform_system == 'Linux' | nvidia-cuda-cupti-cu12==12.8.90; platform_system == 'Linux' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' | nvidia-cublas-cu12==12.8.4.1; platform_system == 'Linux' | nvidia-cufft-cu12==11.3.3.83; platform_system == 'Linux' | nvidia-curand-cu12==10.3.9.90; platform_system == 'Linux' | nvidia-cusolver-cu12==11.7.3.90; platform_system == 'Linux' | nvidia-cusparse-cu12==12.5.8.93; platform_system == 'Linux' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' | nvidia-nccl-cu12==2.28.3; platform_system == 'Linux' | nvidia-nvshmem-cu12==3.3.24; platform_system == 'Linux' | nvidia-nvtx-cu12==12.8.90; platform_system == 'Linux' | nvidia-nvjitlink-cu12==12.8.93; platform_system == 'Linux' | nvidia-cufile-cu12==1.13.1.3; platform_system == 'Linux'
timeout-minutes: 420
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
@@ -1438,11 +1438,11 @@ jobs:
DOCKER_IMAGE_TAG_PREFIX: cuda13.0
DESIRED_PYTHON: "3.14t"
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
runs_on: linux.arm64.m7g.4xlarge.ephemeral
runs_on: linux.arm64.r7g.12xlarge.memory
ALPINE_IMAGE: "arm64v8/alpine"
build_name: manywheel-py3_14t-cuda-aarch64-13_0
build_environment: linux-aarch64-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc==13.0.48; platform_system == 'Linux' | nvidia-cuda-runtime==13.0.48; platform_system == 'Linux' | nvidia-cuda-cupti==13.0.48; platform_system == 'Linux' | nvidia-cudnn-cu13==9.13.0.50; platform_system == 'Linux' | nvidia-cublas==13.0.0.19; platform_system == 'Linux' | nvidia-cufft==12.0.0.15; platform_system == 'Linux' | nvidia-curand==10.4.0.35; platform_system == 'Linux' | nvidia-cusolver==12.0.3.29; platform_system == 'Linux' | nvidia-cusparse==12.6.2.49; platform_system == 'Linux' | nvidia-cusparselt-cu13==0.8.0; platform_system == 'Linux' | nvidia-nccl-cu13==2.27.7; platform_system == 'Linux' | nvidia-nvshmem-cu13==3.3.24; platform_system == 'Linux' | nvidia-nvtx==13.0.39; platform_system == 'Linux' | nvidia-nvjitlink==13.0.39; platform_system == 'Linux' | nvidia-cufile==1.15.0.42; platform_system == 'Linux'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc==13.0.48; platform_system == 'Linux' | nvidia-cuda-runtime==13.0.48; platform_system == 'Linux' | nvidia-cuda-cupti==13.0.48; platform_system == 'Linux' | nvidia-cudnn-cu13==9.13.0.50; platform_system == 'Linux' | nvidia-cublas==13.0.0.19; platform_system == 'Linux' | nvidia-cufft==12.0.0.15; platform_system == 'Linux' | nvidia-curand==10.4.0.35; platform_system == 'Linux' | nvidia-cusolver==12.0.3.29; platform_system == 'Linux' | nvidia-cusparse==12.6.2.49; platform_system == 'Linux' | nvidia-cusparselt-cu13==0.8.0; platform_system == 'Linux' | nvidia-nccl-cu13==2.28.3; platform_system == 'Linux' | nvidia-nvshmem-cu13==3.3.24; platform_system == 'Linux' | nvidia-nvtx==13.0.39; platform_system == 'Linux' | nvidia-nvjitlink==13.0.39; platform_system == 'Linux' | nvidia-cufile==1.15.0.42; platform_system == 'Linux'
timeout-minutes: 420
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}


@@ -316,120 +316,6 @@ jobs:
github-token: ${{ secrets.GITHUB_TOKEN }}
uses: ./.github/workflows/_binary-upload.yml
libtorch-rocm6_3-shared-with-deps-release-build:
if: ${{ github.repository_owner == 'pytorch' }}
uses: ./.github/workflows/_binary-build-linux.yml
needs: get-label-type
with:
PYTORCH_ROOT: /pytorch
PACKAGE_TYPE: libtorch
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: rocm6.3
GPU_ARCH_VERSION: "6.3"
GPU_ARCH_TYPE: rocm
DOCKER_IMAGE: libtorch-cxx11-builder
DOCKER_IMAGE_TAG_PREFIX: rocm6.3
LIBTORCH_CONFIG: release
LIBTORCH_VARIANT: shared-with-deps
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
build_name: libtorch-rocm6_3-shared-with-deps-release
build_environment: linux-binary-libtorch
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
libtorch-rocm6_3-shared-with-deps-release-test: # Testing
if: ${{ github.repository_owner == 'pytorch' }}
needs:
- libtorch-rocm6_3-shared-with-deps-release-build
- get-label-type
runs-on: linux.rocm.gpu.mi250
timeout-minutes: 240
env:
PYTORCH_ROOT: /pytorch
PACKAGE_TYPE: libtorch
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: rocm6.3
GPU_ARCH_VERSION: "6.3"
GPU_ARCH_TYPE: rocm
SKIP_ALL_TESTS: 1
DOCKER_IMAGE: libtorch-cxx11-builder
DOCKER_IMAGE_TAG_PREFIX: rocm6.3
LIBTORCH_CONFIG: release
LIBTORCH_VARIANT: shared-with-deps
steps:
- name: Setup ROCm
uses: ./.github/actions/setup-rocm
- uses: actions/download-artifact@v4.1.7
name: Download Build Artifacts
with:
name: libtorch-rocm6_3-shared-with-deps-release
path: "${{ runner.temp }}/artifacts/"
- name: Checkout PyTorch
uses: actions/checkout@v4
with:
ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }}
submodules: recursive
path: pytorch
show-progress: false
- name: Clean PyTorch checkout
run: |
# Remove any artifacts from the previous checkouts
git clean -fxd
working-directory: pytorch
- name: ROCm set GPU_FLAG
run: |
echo "GPU_FLAG=--device=/dev/mem --device=/dev/kfd --device=/dev/dri --group-add video --group-add daemon" >> "${GITHUB_ENV}"
- name: configure aws credentials
id: aws_creds
if: ${{ startsWith(github.event.ref, 'refs/tags/ciflow/') }}
uses: aws-actions/configure-aws-credentials@v4
with:
role-to-assume: arn:aws:iam::308535385114:role/gha_workflow_s3_and_ecr_read_only
aws-region: us-east-1
role-duration-seconds: 18000
- name: Calculate docker image
id: calculate-docker-image
uses: pytorch/test-infra/.github/actions/calculate-docker-image@main
with:
docker-registry: ${{ startsWith(github.event.ref, 'refs/tags/ciflow/') && '308535385114.dkr.ecr.us-east-1.amazonaws.com' || 'docker.io' }}
docker-image-name: libtorch-cxx11-builder
custom-tag-prefix: rocm6.3
docker-build-dir: .ci/docker
working-directory: pytorch
- name: Pull Docker image
uses: pytorch/test-infra/.github/actions/pull-docker-image@main
with:
docker-image: ${{ steps.calculate-docker-image.outputs.docker-image }}
- name: Test Pytorch binary
uses: ./pytorch/.github/actions/test-pytorch-binary
env:
DOCKER_IMAGE: ${{ steps.calculate-docker-image.outputs.docker-image }}
- name: Teardown ROCm
uses: ./.github/actions/teardown-rocm
libtorch-rocm6_3-shared-with-deps-release-upload: # Uploading
if: ${{ github.repository_owner == 'pytorch' }}
permissions:
id-token: write
contents: read
needs: libtorch-rocm6_3-shared-with-deps-release-test
with:
PYTORCH_ROOT: /pytorch
PACKAGE_TYPE: libtorch
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: rocm6.3
GPU_ARCH_VERSION: "6.3"
GPU_ARCH_TYPE: rocm
DOCKER_IMAGE: libtorch-cxx11-builder
DOCKER_IMAGE_TAG_PREFIX: rocm6.3
LIBTORCH_CONFIG: release
LIBTORCH_VARIANT: shared-with-deps
build_name: libtorch-rocm6_3-shared-with-deps-release
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
uses: ./.github/workflows/_binary-upload.yml
libtorch-rocm6_4-shared-with-deps-release-build:
if: ${{ github.repository_owner == 'pytorch' }}
uses: ./.github/workflows/_binary-build-linux.yml
@@ -447,6 +333,7 @@ jobs:
LIBTORCH_CONFIG: release
LIBTORCH_VARIANT: shared-with-deps
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
timeout-minutes: 300
build_name: libtorch-rocm6_4-shared-with-deps-release
build_environment: linux-binary-libtorch
secrets:
@@ -543,3 +430,118 @@ jobs:
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
uses: ./.github/workflows/_binary-upload.yml
libtorch-rocm7_0-shared-with-deps-release-build:
if: ${{ github.repository_owner == 'pytorch' }}
uses: ./.github/workflows/_binary-build-linux.yml
needs: get-label-type
with:
PYTORCH_ROOT: /pytorch
PACKAGE_TYPE: libtorch
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: rocm7.0
GPU_ARCH_VERSION: "7.0"
GPU_ARCH_TYPE: rocm
DOCKER_IMAGE: libtorch-cxx11-builder
DOCKER_IMAGE_TAG_PREFIX: rocm7.0
LIBTORCH_CONFIG: release
LIBTORCH_VARIANT: shared-with-deps
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
timeout-minutes: 300
build_name: libtorch-rocm7_0-shared-with-deps-release
build_environment: linux-binary-libtorch
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
libtorch-rocm7_0-shared-with-deps-release-test: # Testing
if: ${{ github.repository_owner == 'pytorch' }}
needs:
- libtorch-rocm7_0-shared-with-deps-release-build
- get-label-type
runs-on: linux.rocm.gpu.mi250
timeout-minutes: 240
env:
PYTORCH_ROOT: /pytorch
PACKAGE_TYPE: libtorch
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: rocm7.0
GPU_ARCH_VERSION: "7.0"
GPU_ARCH_TYPE: rocm
SKIP_ALL_TESTS: 1
DOCKER_IMAGE: libtorch-cxx11-builder
DOCKER_IMAGE_TAG_PREFIX: rocm7.0
LIBTORCH_CONFIG: release
LIBTORCH_VARIANT: shared-with-deps
steps:
- name: Setup ROCm
uses: ./.github/actions/setup-rocm
- uses: actions/download-artifact@v4.1.7
name: Download Build Artifacts
with:
name: libtorch-rocm7_0-shared-with-deps-release
path: "${{ runner.temp }}/artifacts/"
- name: Checkout PyTorch
uses: actions/checkout@v4
with:
ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }}
submodules: recursive
path: pytorch
show-progress: false
- name: Clean PyTorch checkout
run: |
# Remove any artifacts from the previous checkouts
git clean -fxd
working-directory: pytorch
- name: ROCm set GPU_FLAG
run: |
echo "GPU_FLAG=--device=/dev/mem --device=/dev/kfd --device=/dev/dri --group-add video --group-add daemon" >> "${GITHUB_ENV}"
- name: configure aws credentials
id: aws_creds
if: ${{ startsWith(github.event.ref, 'refs/tags/ciflow/') }}
uses: aws-actions/configure-aws-credentials@v4
with:
role-to-assume: arn:aws:iam::308535385114:role/gha_workflow_s3_and_ecr_read_only
aws-region: us-east-1
role-duration-seconds: 18000
- name: Calculate docker image
id: calculate-docker-image
uses: pytorch/test-infra/.github/actions/calculate-docker-image@main
with:
docker-registry: ${{ startsWith(github.event.ref, 'refs/tags/ciflow/') && '308535385114.dkr.ecr.us-east-1.amazonaws.com' || 'docker.io' }}
docker-image-name: libtorch-cxx11-builder
custom-tag-prefix: rocm7.0
docker-build-dir: .ci/docker
working-directory: pytorch
- name: Pull Docker image
uses: pytorch/test-infra/.github/actions/pull-docker-image@main
with:
docker-image: ${{ steps.calculate-docker-image.outputs.docker-image }}
- name: Test Pytorch binary
uses: ./pytorch/.github/actions/test-pytorch-binary
env:
DOCKER_IMAGE: ${{ steps.calculate-docker-image.outputs.docker-image }}
- name: Teardown ROCm
uses: ./.github/actions/teardown-rocm
libtorch-rocm7_0-shared-with-deps-release-upload: # Uploading
if: ${{ github.repository_owner == 'pytorch' }}
permissions:
id-token: write
contents: read
needs: libtorch-rocm7_0-shared-with-deps-release-test
with:
PYTORCH_ROOT: /pytorch
PACKAGE_TYPE: libtorch
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: rocm7.0
GPU_ARCH_VERSION: "7.0"
GPU_ARCH_TYPE: rocm
DOCKER_IMAGE: libtorch-cxx11-builder
DOCKER_IMAGE_TAG_PREFIX: rocm7.0
LIBTORCH_CONFIG: release
LIBTORCH_VARIANT: shared-with-deps
build_name: libtorch-rocm7_0-shared-with-deps-release
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
uses: ./.github/workflows/_binary-upload.yml


@@ -1,87 +0,0 @@
# @generated DO NOT EDIT MANUALLY
# Template is at: .github/templates/linux_binary_build_workflow.yml.j2
# Generation script: .github/scripts/generate_ci_workflows.py
name: linux-binary-libtorch-release
on:
push:
branches:
- main
tags:
- 'ciflow/trunk/*'
workflow_dispatch:
permissions:
id-token: write
env:
# Needed for conda builds
ALPINE_IMAGE: "308535385114.dkr.ecr.us-east-1.amazonaws.com/tool/alpine"
AWS_DEFAULT_REGION: us-east-1
BINARY_ENV_FILE: /tmp/env
BUILD_ENVIRONMENT: linux-binary-libtorch-release
GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
PR_NUMBER: ${{ github.event.pull_request.number }}
PYTORCH_FINAL_PACKAGE_DIR: /artifacts
PYTORCH_ROOT: /pytorch
SHA1: ${{ github.event.pull_request.head.sha || github.sha }}
SKIP_ALL_TESTS: 0
concurrency:
group: linux-binary-libtorch-release-${{ github.event.pull_request.number || github.ref_name }}-${{ github.ref_type == 'branch' && github.sha }}-${{ github.event_name == 'workflow_dispatch' }}
cancel-in-progress: true
jobs:
get-label-type:
if: github.repository_owner == 'pytorch'
name: get-label-type
uses: pytorch/pytorch/.github/workflows/_runner-determinator.yml@main
with:
triggering_actor: ${{ github.triggering_actor }}
issue_owner: ${{ github.event.pull_request.user.login || github.event.issue.user.login }}
curr_branch: ${{ github.head_ref || github.ref_name }}
curr_ref_type: ${{ github.ref_type }}
libtorch-cpu-shared-with-deps-release-build:
if: ${{ github.repository_owner == 'pytorch' }}
uses: ./.github/workflows/_binary-build-linux.yml
needs: get-label-type
with:
PYTORCH_ROOT: /pytorch
PACKAGE_TYPE: libtorch
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cpu
GPU_ARCH_TYPE: cpu
DOCKER_IMAGE: libtorch-cxx11-builder
DOCKER_IMAGE_TAG_PREFIX: cpu
LIBTORCH_CONFIG: release
LIBTORCH_VARIANT: shared-with-deps
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
build_name: libtorch-cpu-shared-with-deps-release
build_environment: linux-binary-libtorch-release
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
libtorch-cpu-shared-with-deps-release-test: # Testing
if: ${{ github.repository_owner == 'pytorch' }}
needs:
- libtorch-cpu-shared-with-deps-release-build
- get-label-type
uses: ./.github/workflows/_binary-test-linux.yml
with:
PYTORCH_ROOT: /pytorch
PACKAGE_TYPE: libtorch
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cpu
GPU_ARCH_TYPE: cpu
DOCKER_IMAGE: libtorch-cxx11-builder
DOCKER_IMAGE_TAG_PREFIX: cpu
LIBTORCH_CONFIG: release
LIBTORCH_VARIANT: shared-with-deps
build_name: libtorch-cpu-shared-with-deps-release
build_environment: linux-binary-libtorch-release
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
runs_on: linux.4xlarge
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}


@@ -1,88 +0,0 @@
# @generated DO NOT EDIT MANUALLY
# Template is at: .github/templates/linux_binary_build_workflow.yml.j2
# Generation script: .github/scripts/generate_ci_workflows.py
name: linux-binary-manywheel
on:
push:
branches:
- main
tags:
- 'ciflow/trunk/*'
workflow_dispatch:
permissions:
id-token: write
env:
# Needed for conda builds
ALPINE_IMAGE: "308535385114.dkr.ecr.us-east-1.amazonaws.com/tool/alpine"
AWS_DEFAULT_REGION: us-east-1
BINARY_ENV_FILE: /tmp/env
BUILD_ENVIRONMENT: linux-binary-manywheel
GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
PR_NUMBER: ${{ github.event.pull_request.number }}
PYTORCH_FINAL_PACKAGE_DIR: /artifacts
PYTORCH_ROOT: /pytorch
SHA1: ${{ github.event.pull_request.head.sha || github.sha }}
SKIP_ALL_TESTS: 0
concurrency:
group: linux-binary-manywheel-${{ github.event.pull_request.number || github.ref_name }}-${{ github.ref_type == 'branch' && github.sha }}-${{ github.event_name == 'workflow_dispatch' }}
cancel-in-progress: true
jobs:
get-label-type:
if: github.repository_owner == 'pytorch'
name: get-label-type
uses: pytorch/pytorch/.github/workflows/_runner-determinator.yml@main
with:
triggering_actor: ${{ github.triggering_actor }}
issue_owner: ${{ github.event.pull_request.user.login || github.event.issue.user.login }}
curr_branch: ${{ github.head_ref || github.ref_name }}
curr_ref_type: ${{ github.ref_type }}
manywheel-py3_12-cuda12_8-build:
if: ${{ github.repository_owner == 'pytorch' }}
uses: ./.github/workflows/_binary-build-linux.yml
needs: get-label-type
with:
PYTORCH_ROOT: /pytorch
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu128
GPU_ARCH_VERSION: "12.8"
GPU_ARCH_TYPE: cuda
DOCKER_IMAGE: manylinux2_28-builder
DOCKER_IMAGE_TAG_PREFIX: cuda12.8
DESIRED_PYTHON: "3.12"
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
build_name: manywheel-py3_12-cuda12_8
build_environment: linux-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.8.93; platform_system == 'Linux' | nvidia-cuda-runtime-cu12==12.8.90; platform_system == 'Linux' | nvidia-cuda-cupti-cu12==12.8.90; platform_system == 'Linux' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' | nvidia-cublas-cu12==12.8.4.1; platform_system == 'Linux' | nvidia-cufft-cu12==11.3.3.83; platform_system == 'Linux' | nvidia-curand-cu12==10.3.9.90; platform_system == 'Linux' | nvidia-cusolver-cu12==11.7.3.90; platform_system == 'Linux' | nvidia-cusparse-cu12==12.5.8.93; platform_system == 'Linux' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' | nvidia-nccl-cu12==2.27.5; platform_system == 'Linux' | nvidia-nvshmem-cu12==3.3.24; platform_system == 'Linux' | nvidia-nvtx-cu12==12.8.90; platform_system == 'Linux' | nvidia-nvjitlink-cu12==12.8.93; platform_system == 'Linux' | nvidia-cufile-cu12==1.13.1.3; platform_system == 'Linux'
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_12-cuda12_8-test: # Testing
if: ${{ github.repository_owner == 'pytorch' }}
needs:
- manywheel-py3_12-cuda12_8-build
- get-label-type
uses: ./.github/workflows/_binary-test-linux.yml
with:
PYTORCH_ROOT: /pytorch
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu128
GPU_ARCH_VERSION: "12.8"
GPU_ARCH_TYPE: cuda
DOCKER_IMAGE: manylinux2_28-builder
DOCKER_IMAGE_TAG_PREFIX: cuda12.8
DESIRED_PYTHON: "3.12"
build_name: manywheel-py3_12-cuda12_8
build_environment: linux-binary-manywheel
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
runs_on: linux.g4dn.4xlarge.nvidia.gpu # 12.8+ builds need sm_70+ runner
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}

File diff suppressed because it is too large


@@ -1,135 +0,0 @@
# @generated DO NOT EDIT MANUALLY
# Template is at: .github/templates/linux_binary_build_workflow.yml.j2
# Generation script: .github/scripts/generate_ci_workflows.py
name: linux-binary-manywheel-rocm
on:
push:
branches:
- main
tags:
- 'ciflow/binaries/*'
- 'ciflow/binaries_wheel/*'
- 'ciflow/rocm/*'
workflow_dispatch:
permissions:
id-token: write
env:
# Needed for conda builds
ALPINE_IMAGE: "308535385114.dkr.ecr.us-east-1.amazonaws.com/tool/alpine"
AWS_DEFAULT_REGION: us-east-1
BINARY_ENV_FILE: /tmp/env
BUILD_ENVIRONMENT: linux-binary-manywheel-rocm
GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
PR_NUMBER: ${{ github.event.pull_request.number }}
PYTORCH_FINAL_PACKAGE_DIR: /artifacts
PYTORCH_ROOT: /pytorch
SHA1: ${{ github.event.pull_request.head.sha || github.sha }}
SKIP_ALL_TESTS: 0
concurrency:
group: linux-binary-manywheel-rocm-${{ github.event.pull_request.number || github.ref_name }}-${{ github.ref_type == 'branch' && github.sha }}-${{ github.event_name == 'workflow_dispatch' }}
cancel-in-progress: true
jobs:
get-label-type:
if: github.repository_owner == 'pytorch'
name: get-label-type
uses: pytorch/pytorch/.github/workflows/_runner-determinator.yml@main
with:
triggering_actor: ${{ github.triggering_actor }}
issue_owner: ${{ github.event.pull_request.user.login || github.event.issue.user.login }}
curr_branch: ${{ github.head_ref || github.ref_name }}
curr_ref_type: ${{ github.ref_type }}
manywheel-py3_10-rocm6_4-build:
if: ${{ github.repository_owner == 'pytorch' }}
uses: ./.github/workflows/_binary-build-linux.yml
needs: get-label-type
with:
PYTORCH_ROOT: /pytorch
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: rocm6.4
GPU_ARCH_VERSION: "6.4"
GPU_ARCH_TYPE: rocm
DOCKER_IMAGE: manylinux2_28-builder
DOCKER_IMAGE_TAG_PREFIX: rocm6.4
DESIRED_PYTHON: "3.10"
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
build_name: manywheel-py3_10-rocm6_4
build_environment: linux-binary-manywheel-rocm
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_10-rocm6_4-test: # Testing
if: ${{ github.repository_owner == 'pytorch' }}
needs:
- manywheel-py3_10-rocm6_4-build
- get-label-type
runs-on: linux.rocm.gpu.mi250
timeout-minutes: 240
env:
PYTORCH_ROOT: /pytorch
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: rocm6.4
GPU_ARCH_VERSION: "6.4"
GPU_ARCH_TYPE: rocm
SKIP_ALL_TESTS: 1
DOCKER_IMAGE: manylinux2_28-builder
DOCKER_IMAGE_TAG_PREFIX: rocm6.4
DESIRED_PYTHON: "3.10"
steps:
- name: Setup ROCm
uses: ./.github/actions/setup-rocm
- uses: actions/download-artifact@v4.1.7
name: Download Build Artifacts
with:
name: manywheel-py3_10-rocm6_4
path: "${{ runner.temp }}/artifacts/"
- name: Checkout PyTorch
uses: actions/checkout@v4
with:
ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }}
submodules: recursive
path: pytorch
show-progress: false
- name: Clean PyTorch checkout
run: |
# Remove any artifacts from the previous checkouts
git clean -fxd
working-directory: pytorch
- name: ROCm set GPU_FLAG
run: |
echo "GPU_FLAG=--device=/dev/mem --device=/dev/kfd --device=/dev/dri --group-add video --group-add daemon" >> "${GITHUB_ENV}"
- name: configure aws credentials
id: aws_creds
if: ${{ startsWith(github.event.ref, 'refs/tags/ciflow/') }}
uses: aws-actions/configure-aws-credentials@v4
with:
role-to-assume: arn:aws:iam::308535385114:role/gha_workflow_s3_and_ecr_read_only
aws-region: us-east-1
role-duration-seconds: 18000
- name: Calculate docker image
id: calculate-docker-image
uses: pytorch/test-infra/.github/actions/calculate-docker-image@main
with:
docker-registry: ${{ startsWith(github.event.ref, 'refs/tags/ciflow/') && '308535385114.dkr.ecr.us-east-1.amazonaws.com' || 'docker.io' }}
docker-image-name: manylinux2_28-builder
custom-tag-prefix: rocm6.4
docker-build-dir: .ci/docker
working-directory: pytorch
- name: Pull Docker image
uses: pytorch/test-infra/.github/actions/pull-docker-image@main
with:
docker-image: ${{ steps.calculate-docker-image.outputs.docker-image }}
- name: Test Pytorch binary
uses: ./pytorch/.github/actions/test-pytorch-binary
env:
DOCKER_IMAGE: ${{ steps.calculate-docker-image.outputs.docker-image }}
- name: Teardown ROCm
uses: ./.github/actions/teardown-rocm


@@ -1,261 +0,0 @@
# @generated DO NOT EDIT MANUALLY
# Template is at: .github/templates/windows_binary_build_workflow.yml.j2
# Generation script: .github/scripts/generate_ci_workflows.py
name: windows-binary-libtorch-debug
on:
push:
branches:
- main
workflow_dispatch:
env:
# Needed for conda builds
ALPINE_IMAGE: "308535385114.dkr.ecr.us-east-1.amazonaws.com/tool/alpine"
AWS_DEFAULT_REGION: us-east-1
BUILD_ENVIRONMENT: windows-binary-libtorch-debug
GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
PR_NUMBER: ${{ github.event.pull_request.number }}
SHA1: ${{ github.event.pull_request.head.sha || github.sha }}
SKIP_ALL_TESTS: 1
OS: windows
concurrency:
group: windows-binary-libtorch-debug-${{ github.event.pull_request.number || github.ref_name }}-${{ github.ref_type == 'branch' && github.sha }}-${{ github.event_name == 'workflow_dispatch' }}
cancel-in-progress: true
jobs:
get-label-type:
if: github.repository_owner == 'pytorch'
name: get-label-type
uses: pytorch/pytorch/.github/workflows/_runner-determinator.yml@main
with:
triggering_actor: ${{ github.triggering_actor }}
issue_owner: ${{ github.event.pull_request.user.login || github.event.issue.user.login }}
curr_branch: ${{ github.head_ref || github.ref_name }}
curr_ref_type: ${{ github.ref_type }}
libtorch-cpu-shared-with-deps-debug-build:
if: ${{ github.repository_owner == 'pytorch' }}
needs: get-label-type
runs-on: "${{ needs.get-label-type.outputs.label-type }}windows.4xlarge.nonephemeral"
timeout-minutes: 360
env:
PYTORCH_ROOT: ${{ github.workspace }}/pytorch
PACKAGE_TYPE: libtorch
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cpu
GPU_ARCH_TYPE: cpu
SKIP_ALL_TESTS: 1
LIBTORCH_CONFIG: debug
LIBTORCH_VARIANT: shared-with-deps
# This is a dummy value for libtorch to work correctly with our batch scripts
# without this value pip does not get installed for some reason
DESIRED_PYTHON: "3.10"
steps:
# NOTE: These environment variables are put here so that they can be applied on every job equally
# They are also here because setting them at a workflow level doesn't give us access to the
# runner.temp variable, which we need.
- name: Populate binary env
shell: bash
run: |
echo "BINARY_ENV_FILE=${RUNNER_TEMP}/env" >> "${GITHUB_ENV}"
echo "PYTORCH_FINAL_PACKAGE_DIR=${RUNNER_TEMP}/artifacts" >> "${GITHUB_ENV}"
echo "WIN_PACKAGE_WORK_DIR=${RUNNER_TEMP}"
- name: Display EC2 information
shell: bash
run: |
set -euo pipefail
function get_ec2_metadata() {
# Pulled from instance metadata endpoint for EC2
# see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html
category=$1
curl -H "X-aws-ec2-metadata-token: $(curl -s -X PUT "http://169.254.169.254/latest/api/token" -H "X-aws-ec2-metadata-token-ttl-seconds: 30")" -fsSL "http://169.254.169.254/latest/meta-data/${category}"
}
echo "ami-id: $(get_ec2_metadata ami-id)"
echo "instance-id: $(get_ec2_metadata instance-id)"
echo "instance-type: $(get_ec2_metadata instance-type)"
echo "system info $(uname -a)"
- name: "[FB EMPLOYEES] Enable SSH (Click me for login details)"
uses: pytorch/test-infra/.github/actions/setup-ssh@main
continue-on-error: true
with:
github-secret: ${{ secrets.GITHUB_TOKEN }}
- name: Enable git long paths and symlinks on Windows and disable fsmonitor daemon
shell: bash
run: |
git config --global core.longpaths true
git config --global core.symlinks true
# https://git-scm.com/docs/git-fsmonitor--daemon. The daemon could lock
# the directory on Windows and prevent GHA from checking out as reported
# in https://github.com/actions/checkout/issues/1018
git config --global core.fsmonitor false
# Needed for binary builds, see: https://github.com/pytorch/pytorch/issues/73339#issuecomment-1058981560
- name: Enable long paths on Windows
shell: powershell
run: |
Set-ItemProperty -Path "HKLM:\\SYSTEM\CurrentControlSet\Control\FileSystem" -Name "LongPathsEnabled" -Value 1
# Since it's just a defensive command, the workflow should continue even the command fails. This step can be
# removed once Windows Defender is removed from the AMI
- name: Disables Windows Defender scheduled and real-time scanning for files in directories used by PyTorch
continue-on-error: true
shell: powershell
run: |
Add-MpPreference -ExclusionPath $(Get-Location).tostring(),$Env:TEMP -ErrorAction Ignore
# Let's both exclude the path and disable Windows Defender completely just to be sure
# that it doesn't interfere
Set-MpPreference -DisableRealtimeMonitoring $True -ErrorAction Ignore
- name: Checkout PyTorch
uses: actions/checkout@v4
with:
ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }}
submodules: recursive
path: pytorch
show-progress: false
- name: Clean PyTorch checkout
run: |
# Remove any artifacts from the previous checkouts
git clean -fxd
working-directory: pytorch
- name: Populate binary env
shell: bash
run: |
"${PYTORCH_ROOT}/.circleci/scripts/binary_populate_env.sh"
- name: Build PyTorch binary
shell: bash
run: |
"${PYTORCH_ROOT}/.circleci/scripts/binary_windows_build.sh"
- uses: actions/upload-artifact@v4.4.0
if: always()
with:
name: libtorch-cpu-shared-with-deps-debug
retention-days: 14
if-no-files-found: error
path: "${{ env.PYTORCH_FINAL_PACKAGE_DIR }}"
- name: Wait until all sessions have drained
shell: powershell
working-directory: pytorch
if: always()
timeout-minutes: 120
run: |
.github\scripts\wait_for_ssh_to_drain.ps1
- name: Kill active ssh sessions if still around (Useful if workflow was cancelled)
shell: powershell
working-directory: pytorch
if: always()
run: |
.github\scripts\kill_active_ssh_sessions.ps1
libtorch-cpu-shared-with-deps-debug-test: # Testing
if: ${{ github.repository_owner == 'pytorch' }}
needs:
- libtorch-cpu-shared-with-deps-debug-build
- get-label-type
runs-on: "${{ needs.get-label-type.outputs.label-type }}windows.4xlarge.nonephemeral"
timeout-minutes: 360
env:
PYTORCH_ROOT: ${{ github.workspace }}/pytorch
PACKAGE_TYPE: libtorch
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cpu
GPU_ARCH_TYPE: cpu
SKIP_ALL_TESTS: 1
LIBTORCH_CONFIG: debug
LIBTORCH_VARIANT: shared-with-deps
# This is a dummy value for libtorch to work correctly with our batch scripts
# without this value pip does not get installed for some reason
DESIRED_PYTHON: "3.10"
steps:
- name: Display EC2 information
shell: bash
run: |
set -euo pipefail
function get_ec2_metadata() {
# Pulled from instance metadata endpoint for EC2
# see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html
category=$1
curl -H "X-aws-ec2-metadata-token: $(curl -s -X PUT "http://169.254.169.254/latest/api/token" -H "X-aws-ec2-metadata-token-ttl-seconds: 30")" -fsSL "http://169.254.169.254/latest/meta-data/${category}"
}
echo "ami-id: $(get_ec2_metadata ami-id)"
echo "instance-id: $(get_ec2_metadata instance-id)"
echo "instance-type: $(get_ec2_metadata instance-type)"
echo "system info $(uname -a)"
- name: "[FB EMPLOYEES] Enable SSH (Click me for login details)"
uses: pytorch/test-infra/.github/actions/setup-ssh@main
continue-on-error: true
with:
github-secret: ${{ secrets.GITHUB_TOKEN }}
- name: Enable git long paths and symlinks on Windows and disable fsmonitor daemon
shell: bash
run: |
git config --global core.longpaths true
git config --global core.symlinks true
# https://git-scm.com/docs/git-fsmonitor--daemon. The daemon could lock
# the directory on Windows and prevent GHA from checking out as reported
# in https://github.com/actions/checkout/issues/1018
git config --global core.fsmonitor false
# Needed for binary builds, see: https://github.com/pytorch/pytorch/issues/73339#issuecomment-1058981560
- name: Enable long paths on Windows
shell: powershell
run: |
Set-ItemProperty -Path "HKLM:\\SYSTEM\CurrentControlSet\Control\FileSystem" -Name "LongPathsEnabled" -Value 1
# Since it's just a defensive command, the workflow should continue even the command fails. This step can be
# removed once Windows Defender is removed from the AMI
- name: Disables Windows Defender scheduled and real-time scanning for files in directories used by PyTorch
continue-on-error: true
shell: powershell
run: |
Add-MpPreference -ExclusionPath $(Get-Location).tostring(),$Env:TEMP -ErrorAction Ignore
# Let's both exclude the path and disable Windows Defender completely just to be sure
# that it doesn't interfere
Set-MpPreference -DisableRealtimeMonitoring $True -ErrorAction Ignore
- name: Checkout PyTorch
uses: actions/checkout@v4
with:
ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }}
submodules: recursive
path: pytorch
show-progress: false
- name: Clean PyTorch checkout
run: |
# Remove any artifacts from the previous checkouts
git clean -fxd
working-directory: pytorch
# NOTE: These environment variables are put here so that they can be applied on every job equally
# They are also here because setting them at a workflow level doesn't give us access to the
# runner.temp variable, which we need.
- name: Populate binary env
shell: bash
run: |
echo "BINARY_ENV_FILE=${RUNNER_TEMP}/env" >> "${GITHUB_ENV}"
echo "PYTORCH_FINAL_PACKAGE_DIR=${RUNNER_TEMP}/artifacts" >> "${GITHUB_ENV}"
echo "WIN_PACKAGE_WORK_DIR=${RUNNER_TEMP}"
- uses: actions/download-artifact@v4.1.7
name: Download Build Artifacts
with:
name: libtorch-cpu-shared-with-deps-debug
path: "${{ env.PYTORCH_FINAL_PACKAGE_DIR }}"
- name: Populate binary env
shell: bash
run: |
"${PYTORCH_ROOT}/.circleci/scripts/binary_populate_env.sh"
- name: Test PyTorch binary
shell: bash
run: |
"${PYTORCH_ROOT}/.circleci/scripts/binary_windows_test.sh"
- name: Wait until all sessions have drained
shell: powershell
working-directory: pytorch
if: always()
timeout-minutes: 120
run: |
.github\scripts\wait_for_ssh_to_drain.ps1
- name: Kill active ssh sessions if still around (Useful if workflow was cancelled)
shell: powershell
working-directory: pytorch
if: always()
run: |
.github\scripts\kill_active_ssh_sessions.ps1


@@ -1,261 +0,0 @@
# @generated DO NOT EDIT MANUALLY
# Template is at: .github/templates/windows_binary_build_workflow.yml.j2
# Generation script: .github/scripts/generate_ci_workflows.py
name: windows-binary-libtorch-release
on:
push:
branches:
- main
workflow_dispatch:
env:
# Needed for conda builds
ALPINE_IMAGE: "308535385114.dkr.ecr.us-east-1.amazonaws.com/tool/alpine"
AWS_DEFAULT_REGION: us-east-1
BUILD_ENVIRONMENT: windows-binary-libtorch-release
GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
PR_NUMBER: ${{ github.event.pull_request.number }}
SHA1: ${{ github.event.pull_request.head.sha || github.sha }}
SKIP_ALL_TESTS: 1
OS: windows
concurrency:
group: windows-binary-libtorch-release-${{ github.event.pull_request.number || github.ref_name }}-${{ github.ref_type == 'branch' && github.sha }}-${{ github.event_name == 'workflow_dispatch' }}
cancel-in-progress: true
jobs:
get-label-type:
if: github.repository_owner == 'pytorch'
name: get-label-type
uses: pytorch/pytorch/.github/workflows/_runner-determinator.yml@main
with:
triggering_actor: ${{ github.triggering_actor }}
issue_owner: ${{ github.event.pull_request.user.login || github.event.issue.user.login }}
curr_branch: ${{ github.head_ref || github.ref_name }}
curr_ref_type: ${{ github.ref_type }}
libtorch-cpu-shared-with-deps-release-build:
if: ${{ github.repository_owner == 'pytorch' }}
needs: get-label-type
runs-on: "${{ needs.get-label-type.outputs.label-type }}windows.4xlarge.nonephemeral"
timeout-minutes: 360
env:
PYTORCH_ROOT: ${{ github.workspace }}/pytorch
PACKAGE_TYPE: libtorch
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cpu
GPU_ARCH_TYPE: cpu
SKIP_ALL_TESTS: 1
LIBTORCH_CONFIG: release
LIBTORCH_VARIANT: shared-with-deps
# This is a dummy value for libtorch to work correctly with our batch scripts
# without this value pip does not get installed for some reason
DESIRED_PYTHON: "3.10"
steps:
# NOTE: These environment variables are put here so that they can be applied on every job equally
# They are also here because setting them at a workflow level doesn't give us access to the
# runner.temp variable, which we need.
- name: Populate binary env
shell: bash
run: |
echo "BINARY_ENV_FILE=${RUNNER_TEMP}/env" >> "${GITHUB_ENV}"
echo "PYTORCH_FINAL_PACKAGE_DIR=${RUNNER_TEMP}/artifacts" >> "${GITHUB_ENV}"
echo "WIN_PACKAGE_WORK_DIR=${RUNNER_TEMP}"
- name: Display EC2 information
shell: bash
run: |
set -euo pipefail
function get_ec2_metadata() {
# Pulled from instance metadata endpoint for EC2
# see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html
category=$1
curl -H "X-aws-ec2-metadata-token: $(curl -s -X PUT "http://169.254.169.254/latest/api/token" -H "X-aws-ec2-metadata-token-ttl-seconds: 30")" -fsSL "http://169.254.169.254/latest/meta-data/${category}"
}
echo "ami-id: $(get_ec2_metadata ami-id)"
echo "instance-id: $(get_ec2_metadata instance-id)"
echo "instance-type: $(get_ec2_metadata instance-type)"
echo "system info $(uname -a)"
- name: "[FB EMPLOYEES] Enable SSH (Click me for login details)"
uses: pytorch/test-infra/.github/actions/setup-ssh@main
continue-on-error: true
with:
github-secret: ${{ secrets.GITHUB_TOKEN }}
- name: Enable git long paths and symlinks on Windows and disable fsmonitor daemon
shell: bash
run: |
git config --global core.longpaths true
git config --global core.symlinks true
# https://git-scm.com/docs/git-fsmonitor--daemon. The daemon could lock
# the directory on Windows and prevent GHA from checking out as reported
# in https://github.com/actions/checkout/issues/1018
git config --global core.fsmonitor false
# Needed for binary builds, see: https://github.com/pytorch/pytorch/issues/73339#issuecomment-1058981560
- name: Enable long paths on Windows
shell: powershell
run: |
Set-ItemProperty -Path "HKLM:\\SYSTEM\CurrentControlSet\Control\FileSystem" -Name "LongPathsEnabled" -Value 1
# Since it's just a defensive command, the workflow should continue even the command fails. This step can be
# removed once Windows Defender is removed from the AMI
- name: Disables Windows Defender scheduled and real-time scanning for files in directories used by PyTorch
continue-on-error: true
shell: powershell
run: |
Add-MpPreference -ExclusionPath $(Get-Location).tostring(),$Env:TEMP -ErrorAction Ignore
# Let's both exclude the path and disable Windows Defender completely just to be sure
# that it doesn't interfere
Set-MpPreference -DisableRealtimeMonitoring $True -ErrorAction Ignore
- name: Checkout PyTorch
uses: actions/checkout@v4
with:
ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }}
submodules: recursive
path: pytorch
show-progress: false
- name: Clean PyTorch checkout
run: |
# Remove any artifacts from the previous checkouts
git clean -fxd
working-directory: pytorch
- name: Populate binary env
shell: bash
run: |
"${PYTORCH_ROOT}/.circleci/scripts/binary_populate_env.sh"
- name: Build PyTorch binary
shell: bash
run: |
"${PYTORCH_ROOT}/.circleci/scripts/binary_windows_build.sh"
- uses: actions/upload-artifact@v4.4.0
if: always()
with:
name: libtorch-cpu-shared-with-deps-release
retention-days: 14
if-no-files-found: error
path: "${{ env.PYTORCH_FINAL_PACKAGE_DIR }}"
- name: Wait until all sessions have drained
shell: powershell
working-directory: pytorch
if: always()
timeout-minutes: 120
run: |
.github\scripts\wait_for_ssh_to_drain.ps1
- name: Kill active ssh sessions if still around (Useful if workflow was cancelled)
shell: powershell
working-directory: pytorch
if: always()
run: |
.github\scripts\kill_active_ssh_sessions.ps1
libtorch-cpu-shared-with-deps-release-test: # Testing
if: ${{ github.repository_owner == 'pytorch' }}
needs:
- libtorch-cpu-shared-with-deps-release-build
- get-label-type
runs-on: "${{ needs.get-label-type.outputs.label-type }}windows.4xlarge.nonephemeral"
timeout-minutes: 360
env:
PYTORCH_ROOT: ${{ github.workspace }}/pytorch
PACKAGE_TYPE: libtorch
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cpu
GPU_ARCH_TYPE: cpu
SKIP_ALL_TESTS: 1
LIBTORCH_CONFIG: release
LIBTORCH_VARIANT: shared-with-deps
# This is a dummy value for libtorch to work correctly with our batch scripts
# without this value pip does not get installed for some reason
DESIRED_PYTHON: "3.10"
steps:
- name: Display EC2 information
shell: bash
run: |
set -euo pipefail
function get_ec2_metadata() {
# Pulled from instance metadata endpoint for EC2
# see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html
category=$1
curl -H "X-aws-ec2-metadata-token: $(curl -s -X PUT "http://169.254.169.254/latest/api/token" -H "X-aws-ec2-metadata-token-ttl-seconds: 30")" -fsSL "http://169.254.169.254/latest/meta-data/${category}"
}
echo "ami-id: $(get_ec2_metadata ami-id)"
echo "instance-id: $(get_ec2_metadata instance-id)"
echo "instance-type: $(get_ec2_metadata instance-type)"
echo "system info $(uname -a)"
- name: "[FB EMPLOYEES] Enable SSH (Click me for login details)"
uses: pytorch/test-infra/.github/actions/setup-ssh@main
continue-on-error: true
with:
github-secret: ${{ secrets.GITHUB_TOKEN }}
- name: Enable git long paths and symlinks on Windows and disable fsmonitor daemon
shell: bash
run: |
git config --global core.longpaths true
git config --global core.symlinks true
# https://git-scm.com/docs/git-fsmonitor--daemon. The daemon could lock
# the directory on Windows and prevent GHA from checking out as reported
# in https://github.com/actions/checkout/issues/1018
git config --global core.fsmonitor false
# Needed for binary builds, see: https://github.com/pytorch/pytorch/issues/73339#issuecomment-1058981560
- name: Enable long paths on Windows
shell: powershell
run: |
Set-ItemProperty -Path "HKLM:\\SYSTEM\CurrentControlSet\Control\FileSystem" -Name "LongPathsEnabled" -Value 1
# Since it's just a defensive command, the workflow should continue even the command fails. This step can be
# removed once Windows Defender is removed from the AMI
- name: Disables Windows Defender scheduled and real-time scanning for files in directories used by PyTorch
continue-on-error: true
shell: powershell
run: |
Add-MpPreference -ExclusionPath $(Get-Location).tostring(),$Env:TEMP -ErrorAction Ignore
# Let's both exclude the path and disable Windows Defender completely just to be sure
# that it doesn't interfere
Set-MpPreference -DisableRealtimeMonitoring $True -ErrorAction Ignore
- name: Checkout PyTorch
uses: actions/checkout@v4
with:
ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }}
submodules: recursive
path: pytorch
show-progress: false
- name: Clean PyTorch checkout
run: |
# Remove any artifacts from the previous checkouts
git clean -fxd
working-directory: pytorch
# NOTE: These environment variables are put here so that they can be applied on every job equally
# They are also here because setting them at a workflow level doesn't give us access to the
# runner.temp variable, which we need.
- name: Populate binary env
shell: bash
run: |
echo "BINARY_ENV_FILE=${RUNNER_TEMP}/env" >> "${GITHUB_ENV}"
echo "PYTORCH_FINAL_PACKAGE_DIR=${RUNNER_TEMP}/artifacts" >> "${GITHUB_ENV}"
echo "WIN_PACKAGE_WORK_DIR=${RUNNER_TEMP}"
- uses: actions/download-artifact@v4.1.7
name: Download Build Artifacts
with:
name: libtorch-cpu-shared-with-deps-release
path: "${{ env.PYTORCH_FINAL_PACKAGE_DIR }}"
- name: Populate binary env
shell: bash
run: |
"${PYTORCH_ROOT}/.circleci/scripts/binary_populate_env.sh"
- name: Test PyTorch binary
shell: bash
run: |
"${PYTORCH_ROOT}/.circleci/scripts/binary_windows_test.sh"
- name: Wait until all sessions have drained
shell: powershell
working-directory: pytorch
if: always()
timeout-minutes: 120
run: |
.github\scripts\wait_for_ssh_to_drain.ps1
- name: Kill active ssh sessions if still around (Useful if workflow was cancelled)
shell: powershell
working-directory: pytorch
if: always()
run: |
.github\scripts\kill_active_ssh_sessions.ps1

View File

@ -31,6 +31,8 @@ jobs:
if: github.repository_owner == 'pytorch'
name: Get changed files
uses: ./.github/workflows/_get-changed-files.yml
with:
all_files: ${{ contains(github.event.pull_request.labels.*.name, 'lint-all-files') || contains(github.event.pull_request.labels.*.name, 'Reverted') }}
lintrunner-clang:
uses: pytorch/test-infra/.github/workflows/linux_job_v2.yml@main
@ -53,7 +55,7 @@ jobs:
with:
timeout: 120
runner: "${{ needs.get-label-type.outputs.label-type }}linux.2xlarge"
docker-image: ci-image:pytorch-linux-jammy-cuda12.8-cudnn9-py3.9-linter
docker-image: ci-image:pytorch-linux-jammy-cuda12.8-cudnn9-py3.10-linter
# NB: A shallow checkout won't work here because calculate-docker-image requires a full checkout
# to run git rev-parse HEAD~:.ci/docker when a new image is needed
fetch-depth: 0
@ -264,10 +266,10 @@ jobs:
with:
submodules: false
fetch-depth: 1
- name: Setup Python 3.9
- name: Setup Python 3.10
uses: actions/setup-python@a26af69be951a213d495a4c3e4e4022e16d87065 # v5.6.0
with:
python-version: '3.9'
python-version: '3.10'
architecture: x64
cache: pip
- name: Install dependencies

View File

@ -0,0 +1,46 @@
name: operator_microbenchmark
on:
push:
tags:
- ciflow/op-benchmark/*
workflow_dispatch:
schedule:
# Run at 06:00 UTC everyday
- cron: 0 6 * * *
concurrency:
group: ${{ github.workflow }}-${{ github.event.pull_request.number || github.ref_name }}-${{ github.ref_type == 'branch' && github.sha }}-${{ github.event_name == 'workflow_dispatch' }}-${{ github.event_name == 'schedule' }}
cancel-in-progress: true
permissions:
id-token: write
contents: read
jobs:
opmicrobenchmark-build:
if: github.repository_owner == 'pytorch'
name: opmicrobenchmark-build
uses: ./.github/workflows/_linux-build.yml
with:
runner: linux.12xlarge.memory
build-environment: linux-jammy-cuda12.8-py3.10-gcc9-sm80
docker-image-name: ci-image:pytorch-linux-jammy-cuda12.8-cudnn9-py3-gcc11
cuda-arch-list: '8.0 9.0'
test-matrix: |
{ include: [
{ config: "operator_microbenchmark_test", shard: 1, num_shards: 1, runner: "linux.aws.h100" },
{ config: "operator_microbenchmark_test", shard: 1, num_shards: 1, runner: "linux.aws.a100" },
]}
secrets: inherit
opmicrobenchmark-test:
name: opmicrobenchmark-test
uses: ./.github/workflows/_linux-test.yml
needs: opmicrobenchmark-build
with:
timeout-minutes: 500
build-environment: linux-jammy-cuda12.8-py3.10-gcc9-sm80
docker-image: ${{ needs.opmicrobenchmark-build.outputs.docker-image }}
test-matrix: ${{ needs.opmicrobenchmark-build.outputs.test-matrix }}
secrets: inherit

View File

@ -59,13 +59,14 @@ jobs:
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
build-environment: linux-jammy-cuda12.4-py3.10-gcc11
docker-image-name: ci-image:pytorch-linux-jammy-cuda12.4-cudnn9-py3-gcc11
cuda-arch-list: 7.5
test-matrix: |
{ include: [
{ config: "legacy_nvidia_driver", shard: 1, num_shards: 5, runner: "${{ needs.get-label-type.outputs.label-type }}linux.4xlarge.nvidia.gpu" },
{ config: "legacy_nvidia_driver", shard: 2, num_shards: 5, runner: "${{ needs.get-label-type.outputs.label-type }}linux.4xlarge.nvidia.gpu" },
{ config: "legacy_nvidia_driver", shard: 3, num_shards: 5, runner: "${{ needs.get-label-type.outputs.label-type }}linux.4xlarge.nvidia.gpu" },
{ config: "legacy_nvidia_driver", shard: 4, num_shards: 5, runner: "${{ needs.get-label-type.outputs.label-type }}linux.4xlarge.nvidia.gpu" },
{ config: "legacy_nvidia_driver", shard: 5, num_shards: 5, runner: "${{ needs.get-label-type.outputs.label-type }}linux.4xlarge.nvidia.gpu" },
{ config: "legacy_nvidia_driver", shard: 1, num_shards: 5, runner: "${{ needs.get-label-type.outputs.label-type }}linux.g4dn.4xlarge.nvidia.gpu" },
{ config: "legacy_nvidia_driver", shard: 2, num_shards: 5, runner: "${{ needs.get-label-type.outputs.label-type }}linux.g4dn.4xlarge.nvidia.gpu" },
{ config: "legacy_nvidia_driver", shard: 3, num_shards: 5, runner: "${{ needs.get-label-type.outputs.label-type }}linux.g4dn.4xlarge.nvidia.gpu" },
{ config: "legacy_nvidia_driver", shard: 4, num_shards: 5, runner: "${{ needs.get-label-type.outputs.label-type }}linux.g4dn.4xlarge.nvidia.gpu" },
{ config: "legacy_nvidia_driver", shard: 5, num_shards: 5, runner: "${{ needs.get-label-type.outputs.label-type }}linux.g4dn.4xlarge.nvidia.gpu" },
]}
secrets: inherit
@ -112,13 +113,13 @@ jobs:
test-matrix: ${{ needs.linux-jammy-cuda12_8-py3_10-gcc11-build.outputs.test-matrix }}
secrets: inherit
linux-jammy-cuda12_8-py3_9-gcc9-build:
name: linux-jammy-cuda12.8-py3.9-gcc9
linux-jammy-cuda12_8-py3_10-gcc9-build:
name: linux-jammy-cuda12.8-py3.10-gcc9
uses: ./.github/workflows/_linux-build.yml
needs: get-label-type
with:
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
build-environment: linux-jammy-cuda12.8-py3.9-gcc9
build-environment: linux-jammy-cuda12.8-py3.10-gcc9
docker-image-name: ci-image:pytorch-linux-jammy-cuda12.8-cudnn9-py3-gcc9
cuda-arch-list: 8.6
test-matrix: |
@ -128,14 +129,14 @@ jobs:
]}
secrets: inherit
linux-jammy-cuda12_8-py3_9-gcc9-test:
name: linux-jammy-cuda12.8-py3.9-gcc9
linux-jammy-cuda12_8-py3_10-gcc9-test:
name: linux-jammy-cuda12.8-py3.10-gcc9
uses: ./.github/workflows/_linux-test.yml
needs: linux-jammy-cuda12_8-py3_9-gcc9-build
needs: linux-jammy-cuda12_8-py3_10-gcc9-build
with:
build-environment: linux-jammy-cuda12.8-py3.9-gcc9
docker-image: ${{ needs.linux-jammy-cuda12_8-py3_9-gcc9-build.outputs.docker-image }}
test-matrix: ${{ needs.linux-jammy-cuda12_8-py3_9-gcc9-build.outputs.test-matrix }}
build-environment: linux-jammy-cuda12.8-py3.10-gcc9
docker-image: ${{ needs.linux-jammy-cuda12_8-py3_10-gcc9-build.outputs.docker-image }}
test-matrix: ${{ needs.linux-jammy-cuda12_8-py3_10-gcc9-build.outputs.test-matrix }}
secrets: inherit
linux-jammy-cuda12_8-py3_10-gcc9-debug-build:

View File

@ -343,14 +343,14 @@ jobs:
test-matrix: ${{ needs.linux-jammy-cuda12_8-py3_10-gcc9-inductor-build.outputs.test-matrix }}
secrets: inherit
linux-jammy-xpu-n-py3_9-build:
name: linux-jammy-xpu-n-py3.9
linux-jammy-xpu-n-py3_10-build:
name: linux-jammy-xpu-n-py3.10
uses: ./.github/workflows/_linux-build.yml
needs: get-label-type
with:
sync-tag: linux-xpu-n-build
runner_prefix: ${{ needs.get-label-type.outputs.label-type }}
build-environment: linux-jammy-xpu-n-py3.9
build-environment: linux-jammy-xpu-n-py3.10
docker-image-name: ci-image:pytorch-linux-jammy-xpu-n-py3
test-matrix: |
{ include: [

View File

@ -0,0 +1,54 @@
name: quantization-periodic
on:
push:
tags:
- ciflow/quantization-periodic/*
workflow_dispatch:
schedule:
# run weekly
- cron: "45 0 * * 0"
concurrency:
group: ${{ github.workflow }}-${{ github.event.pull_request.number || github.ref_name }}-${{ github.ref_type == 'branch' && github.sha }}-${{ github.event_name == 'workflow_dispatch' }}
cancel-in-progress: true
permissions:
id-token: write
contents: read
jobs:
get-default-label-prefix:
name: get-default-label-prefix
uses: pytorch/pytorch/.github/workflows/_runner-determinator.yml@main
if: ${{ (github.event_name != 'schedule' || github.repository == 'pytorch/pytorch') && github.repository_owner == 'pytorch' }}
with:
triggering_actor: ${{ github.triggering_actor }}
issue_owner: ${{ github.event.pull_request.user.login || github.event.issue.user.login }}
curr_branch: ${{ github.head_ref || github.ref_name }}
curr_ref_type: ${{ github.ref_type }}
opt_out_experiments: lf
periodic-quantization-build:
name: periodic-quantization-build
uses: ./.github/workflows/_linux-build.yml
needs: get-default-label-prefix
with:
runner_prefix: "${{ needs.get-default-label-prefix.outputs.label-type }}"
build-environment: linux-jammy-cuda12.8-cudnn9-py3-gcc11
docker-image-name: ci-image:pytorch-linux-jammy-cuda12.8-cudnn9-py3-gcc11
cuda-arch-list: '8.9'
test-matrix: |
{ include: [
{ config: "quantization", shard: 1, num_shards: 1, runner: "${{ needs.get-default-label-prefix.outputs.label-type }}linux.g6.4xlarge.experimental.nvidia.gpu" },
]}
secrets: inherit
periodic-test-quantization:
name: periodic-test-quantization
uses: ./.github/workflows/_linux-test.yml
needs: periodic-quantization-build
with:
build-environment: linux-jammy-cuda12.8-cudnn9-py3-gcc11
docker-image: ${{ needs.periodic-quantization-build.outputs.docker-image }}
test-matrix: ${{ needs.periodic-quantization-build.outputs.test-matrix }}
secrets: inherit

View File

@ -38,7 +38,7 @@ jobs:
with:
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
build-environment: linux-noble-rocm-py3.12-mi355
docker-image-name: ci-image:pytorch-linux-noble-rocm-alpha-py3
docker-image-name: ci-image:pytorch-linux-noble-rocm-n-py3
sync-tag: rocm-build
test-matrix: |
{ include: [

.github/workflows/test-b200.yml vendored Normal file
View File

@ -0,0 +1,76 @@
# B200 Smoke Tests CI Workflow
#
# This workflow runs smoke tests on B200 hardware
#
# Flow:
# 1. Builds PyTorch with CUDA 12.8+ and sm100 architecture for B200
# 2. Runs smoke tests on linux.dgx.b200 runner
# 3. Tests executed are defined in .ci/pytorch/test.sh -> test_python_smoke() function
#
# Triggered by:
# - Pull requests modifying this workflow file
# - Manual dispatch
# - Schedule (every 6 hours)
# - Adding ciflow/b200 label to a PR (creates ciflow/b200/* tag)
name: B200 Smoke Tests
on:
pull_request:
paths:
- .github/workflows/test-b200.yml
workflow_dispatch:
schedule:
- cron: 0 4,10,16,22 * * * # every 6 hours
push:
tags:
- ciflow/b200/*
concurrency:
group: ${{ github.workflow }}-${{ github.event.pull_request.number || github.sha }}-${{ github.event_name == 'workflow_dispatch' }}-${{ github.event_name == 'schedule' }}
cancel-in-progress: true
permissions:
id-token: write
contents: read
jobs:
get-label-type:
if: github.repository_owner == 'pytorch'
name: get-label-type
uses: pytorch/pytorch/.github/workflows/_runner-determinator.yml@main
with:
triggering_actor: ${{ github.triggering_actor }}
issue_owner: ${{ github.event.pull_request.user.login || github.event.issue.user.login }}
curr_branch: ${{ github.head_ref || github.ref_name }}
curr_ref_type: ${{ github.ref_type }}
linux-jammy-cuda12_8-py3_10-gcc11-sm100-build:
name: linux-jammy-cuda12.8-py3.10-gcc11-sm100
uses: ./.github/workflows/_linux-build.yml
needs: get-label-type
with:
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
runner: linux.12xlarge.memory
build-environment: linux-jammy-cuda12.8-py3.10-gcc11-sm100
docker-image-name: ci-image:pytorch-linux-jammy-cuda12.8-cudnn9-py3-gcc11
cuda-arch-list: '10.0'
test-matrix: |
{ include: [
{ config: "smoke_b200", shard: 1, num_shards: 1, runner: "linux.dgx.b200" },
]}
# config: "smoke_b200" maps to test_python_smoke_b200() in .ci/pytorch/test.sh
secrets: inherit
linux-jammy-cuda12_8-py3_10-gcc11-sm100-test:
name: linux-jammy-cuda12.8-py3.10-gcc11-sm100
uses: ./.github/workflows/_linux-test.yml
needs:
- linux-jammy-cuda12_8-py3_10-gcc11-sm100-build
with:
build-environment: linux-jammy-cuda12.8-py3.10-gcc11-sm100
docker-image: ${{ needs.linux-jammy-cuda12_8-py3_10-gcc11-sm100-build.outputs.docker-image }}
test-matrix: ${{ needs.linux-jammy-cuda12_8-py3_10-gcc11-sm100-build.outputs.test-matrix }}
aws-role-to-assume: arn:aws:iam::308535385114:role/gha_workflow_s3_and_ecr_read_only
secrets: inherit

View File

@ -53,27 +53,3 @@ jobs:
issue_owner: ${{ github.event.pull_request.user.login || github.event.issue.user.login }}
curr_branch: ${{ github.head_ref || github.ref_name }}
curr_ref_type: ${{ github.ref_type }}
linux-jammy-py3_9-clang9-xla-build:
name: linux-jammy-py3_9-clang9-xla
uses: ./.github/workflows/_linux-build.yml
needs: get-label-type
with:
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
build-environment: linux-jammy-py3.9-clang9-xla
docker-image-name: 308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/xla_base:v1.3-lite
test-matrix: |
{ include: [
{ config: "xla", shard: 1, num_shards: 1, runner: "${{ needs.get-label-type.outputs.label-type }}linux.12xlarge" },
]}
secrets: inherit
linux-jammy-py3_9-clang9-xla-test:
name: linux-jammy-py3_9-clang9-xla
uses: ./.github/workflows/_linux-test.yml
needs: linux-jammy-py3_9-clang9-xla-build
with:
build-environment: linux-jammy-py3.9-clang9-xla
docker-image: ${{ needs.linux-jammy-py3_9-clang9-xla-build.outputs.docker-image }}
test-matrix: ${{ needs.linux-jammy-py3_9-clang9-xla-build.outputs.test-matrix }}
secrets: inherit

.gitignore vendored
View File

@ -82,6 +82,7 @@ torch/return_types.pyi
torch/nn/functional.pyi
torch/utils/data/datapipes/datapipe.pyi
torch/csrc/autograd/generated/*
torch/csrc/functionalization/generated/*
torch/csrc/lazy/generated/*.[!m]*
torch_compile_debug/
# Listed manually because some files in this directory are not generated

View File

@ -49,7 +49,7 @@ init_command = [
'mccabe==0.7.0',
'pycodestyle==2.14.0',
'pyflakes==3.4.0',
'torchfix==0.4.0 ; python_version >= "3.9" and python_version < "3.13"',
'torchfix==0.4.0 ; python_version >= "3.10" and python_version < "3.13"',
]
@ -123,6 +123,7 @@ is_formatter = true
code = 'MYPY'
include_patterns = [
'setup.py',
'functorch/dim/**/*.py',
'torch/**/*.py',
'torch/**/*.pyi',
'caffe2/**/*.py',
@ -152,7 +153,7 @@ init_command = [
'python3',
'tools/linter/adapters/pip_init.py',
'--dry-run={{DRYRUN}}',
'numpy==1.26.4 ; python_version >= "3.9" and python_version <= "3.11"',
'numpy==1.26.4 ; python_version >= "3.10" and python_version <= "3.11"',
'numpy==2.1.0 ; python_version >= "3.12"',
'expecttest==0.3.0',
'mypy==1.16.0',
@ -195,6 +196,7 @@ exclude_patterns = [
'tools/test/gen_operators_yaml_test.py',
'tools/test/gen_oplist_test.py',
'tools/test/test_selective_build.py',
'tools/experimental/dynamic_shapes/torchfuzz/**',
]
command = [
'python3',
@ -1451,7 +1453,7 @@ init_command = [
'--dry-run={{DRYRUN}}',
'usort==1.0.8.post1',
'isort==6.0.1',
'ruff==0.12.9', # sync with RUFF
'ruff==0.13.1', # sync with RUFF
]
is_formatter = true
@ -1585,7 +1587,7 @@ init_command = [
'python3',
'tools/linter/adapters/pip_init.py',
'--dry-run={{DRYRUN}}',
'ruff==0.12.9', # sync with PYFMT
'ruff==0.13.1', # sync with PYFMT
]
is_formatter = true

View File

@ -22,6 +22,7 @@ COMMON_COPTS = [
"-DHAVE_SHM_UNLINK=1",
"-D_FILE_OFFSET_BITS=64",
"-DUSE_FBGEMM",
"-DUSE_DISTRIBUTED",
"-DAT_PER_OPERATOR_HEADERS",
"-DATEN_THREADING=NATIVE",
"-DNO_CUDNN_DESTROY_HANDLE",
@ -90,6 +91,8 @@ generated_cpu_cpp = [
"aten/src/ATen/NativeMetaFunctions.h",
"aten/src/ATen/RegistrationDeclarations.h",
"aten/src/ATen/VmapGeneratedPlumbing.h",
"aten/src/ATen/ViewMetaClasses.h",
"aten/src/ATen/ViewMetaClasses.cpp",
"aten/src/ATen/core/aten_interned_strings.h",
"aten/src/ATen/core/enum_tag.h",
"aten/src/ATen/core/TensorBody.h",
@ -810,7 +813,7 @@ cc_library(
name = "torch_python",
srcs = libtorch_python_core_sources
+ if_cuda(libtorch_python_cuda_sources)
+ libtorch_python_distributed_sources
+ if_cuda(libtorch_python_distributed_sources)
+ GENERATED_AUTOGRAD_PYTHON,
hdrs = glob([
"torch/csrc/generic/*.cpp",
@ -832,36 +835,6 @@ pybind_extension(
],
)
cc_library(
name = "functorch",
hdrs = glob([
"functorch/csrc/dim/*.h",
]),
srcs = glob([
"functorch/csrc/dim/*.cpp",
]),
deps = [
":aten_nvrtc",
":torch_python",
"@pybind11",
],
)
pybind_extension(
name = "functorch/_C",
copts=[
"-DTORCH_EXTENSION_NAME=_C"
],
srcs = [
"functorch/csrc/init_dim_only.cpp",
],
deps = [
":functorch",
":torch_python",
":aten_nvrtc",
],
)
cc_binary(
name = "torch/bin/torch_shm_manager",
srcs = [
@ -902,7 +875,6 @@ py_library(
],
data = [
":torch/_C.so",
":functorch/_C.so",
":torch/bin/torch_shm_manager",
],
)
@ -1105,6 +1077,7 @@ test_suite(
"aten/src/ATen/templates/LazyNonNativeIr.h",
"aten/src/ATen/templates/RegisterDispatchKey.cpp",
"aten/src/ATen/templates/RegisterDispatchDefinitions.ini",
"aten/src/ATen/templates/ViewMetaClassesPythonBinding.cpp",
"aten/src/ATen/native/native_functions.yaml",
"aten/src/ATen/native/tags.yaml",
"aten/src/ATen/native/ts_native_functions.yaml",

View File

@ -1,5 +1,4 @@
cmake_minimum_required(VERSION 3.27 FATAL_ERROR)
# cmake_policy(SET CMP0022 NEW) cmake_policy(SET CMP0023 NEW)
# Use compiler ID "AppleClang" instead of "Clang" for XCode. Not setting this
# sometimes makes XCode C compiler gets detected as "Clang", even when the C++
@ -181,9 +180,8 @@ elseif(CMAKE_SYSTEM_PROCESSOR MATCHES "^(ppc64le)")
set(CPU_POWER ON)
endif()
# For non-supported platforms, turn USE_DISTRIBUTED off by default.
# NB: USE_DISTRIBUTED simply disables the backend; distributed code
# still gets built
# For non-supported platforms, turn USE_DISTRIBUTED off by default. It is not
# tested and likely won't work without additional changes.
if(NOT LINUX AND NOT WIN32)
set(USE_DISTRIBUTED
OFF
@ -263,11 +261,11 @@ option(USE_PYTORCH_METAL "Use Metal for PyTorch iOS build" OFF)
option(USE_PYTORCH_METAL_EXPORT "Export Metal models on MacOSX desktop" OFF)
option(USE_NATIVE_ARCH "Use -march=native" OFF)
cmake_dependent_option(USE_MPS "Use MPS for macOS build" ON "MPS_FOUND" OFF)
option(USE_DISTRIBUTED "Enable default distributed backends" ON)
option(USE_DISTRIBUTED "Use distributed" ON)
cmake_dependent_option(USE_NCCL "Use NCCL" ON
"USE_DISTRIBUTED;USE_CUDA OR USE_ROCM;UNIX;NOT APPLE" OFF)
cmake_dependent_option(USE_XCCL "Use XCCL" ON
"USE_DISTRIBUTED;USE_XPU;UNIX;NOT APPLE" OFF)
"USE_XPU;UNIX;NOT APPLE" OFF)
cmake_dependent_option(USE_RCCL "Use RCCL" ON USE_NCCL OFF)
cmake_dependent_option(USE_RCCL "Use RCCL" ON "USE_NCCL;NOT WIN32" OFF)
cmake_dependent_option(USE_STATIC_NCCL "Use static NCCL" OFF "USE_NCCL" OFF)
@ -439,11 +437,12 @@ if(WIN32)
PATH_SUFFIXES lib
NO_DEFAULT_PATH)
if(NOT libuv_tmp_LIBRARY)
set(USE_DISTRIBUTED OFF)
set(USE_GLOO OFF)
message(
WARNING
"Libuv is not installed in current conda env. Set USE_GLOO to OFF. "
"Please run command 'conda install -c conda-forge libuv=1.39' to install libuv."
"Libuv is not installed in current conda env. Set USE_DISTRIBUTED to OFF. "
"Please run command 'conda install -c conda-forge libuv=1.51' to install libuv."
)
else()
set(ENV{libuv_ROOT} ${libuv_tmp_LIBRARY}/../../)
@ -889,23 +888,28 @@ cmake_dependent_option(
"(USE_CUDA AND NOT MSVC) OR USE_ROCM"
OFF)
IF(USE_ROCM AND "gfx942" IN_LIST PYTORCH_ROCM_ARCH)
message(WARNING "Setting USE_FBGEMM_GENAI for gfx942 to ON by default, doing ROCM build")
set(USE_FBGEMM_GENAI_DEFAULT ON)
elseif(USE_CUDA AND "$ENV{TORCH_CUDA_ARCH_LIST}" MATCHES "10.0" AND CMAKE_CUDA_COMPILER_VERSION VERSION_GREATER_EQUAL 12.8 AND NOT WIN32)
message(STATUS "Setting USE_FBGEMM_GENAI to ON by default , doing CUDA build for SM100a")
set(USE_FBGEMM_GENAI_DEFAULT ON)
else()
set(USE_FBGEMM_GENAI_DEFAULT OFF)
endif()
cmake_dependent_option(
USE_FBGEMM_GENAI
"Whether to build FBGEMM GenAI quantized GEMM kernels.\
Will be disabled if not supported by the platform"
ON
"USE_ROCM"
${USE_FBGEMM_GENAI_DEFAULT}
"(USE_CUDA AND NOT MSVC) OR USE_ROCM"
OFF)
IF(USE_FBGEMM_GENAI AND USE_ROCM AND NOT "gfx942" IN_LIST PYTORCH_ROCM_ARCH)
message(WARNING "Unsupported ROCM arch for FBGEMM GenAI, will set USE_FBGEMM_GENAI to OFF")
set(USE_FBGEMM_GENAI off)
endif()
# Set USE_FBGEMM_GENAI to ON for CUDA build on SM100.
if(USE_CUDA AND "$ENV{TORCH_CUDA_ARCH_LIST}" MATCHES "10.0" AND CMAKE_CUDA_COMPILER_VERSION VERSION_GREATER_EQUAL 12.8 AND NOT WIN32)
message(STATUS "Setting USE_FBGEMM_GENAI to ON, doing CUDA build for SM100a")
set(USE_FBGEMM_GENAI ON)
endif()
# CAVEAT: Again, Flash Attention2 will error while building for sm52 while Mem
@ -1391,10 +1395,6 @@ endif()
include(cmake/Summary.cmake)
caffe2_print_configuration_summary()
if(BUILD_FUNCTORCH)
add_subdirectory(functorch)
endif()
# Parse custom debug info
if(DEFINED USE_CUSTOM_DEBINFO)
string(REPLACE ";" " " SOURCE_FILES "${USE_CUSTOM_DEBINFO}")
@ -1486,4 +1486,4 @@ else()
To do so please export USE_PRIORITIZED_TEXT_FOR_LD=1
]])
endif()
endif()
endif()

View File

@ -1,20 +1,61 @@
# Reference: https://setuptools.pypa.io/en/latest/userguide/miscellaneous.html
# Include source files in SDist
include CMakeLists.txt
include *.bzl *.bazel .bazel* BUILD *.BUILD BUILD.* WORKSPACE
include BUCK BUCK.*
include requirements*.txt
include version.txt
include [Mm]akefile *.[Mm]akefile [Mm]akefile.*
include [Dd]ockerfile *.[Dd]ockerfile [Dd]ockerfile.* .dockerignore
# Include individual top-level files
include CITATION.cff
include CODEOWNERS
include Dockerfile
include LICENSE
include MANIFEST.in
include Makefile
include NOTICE
include .bc-linter.yml
include .clang-format .clang-tidy
include .cmakelintrc
include .coveragerc
include .dockerignore
include .editorconfig
include .flake8
include .gdbinit
include .lintrunner.toml
include .lldbinit
include codex_setup.sh
include docker.Makefile
include pyrefly.toml
include ubsan.supp
# Include bazel and BUCK related files
include BUILD.bazel BUCK.oss
include WORKSPACE
include *.bzl
include .bazelignore .bazelrc .bazelversion
# Include general configuration files
include *.ini
# Include important top-level information
include *.md
# Include technical text files at the moment, comprises
# version.txt, CMakeLists.txt, requirements.txt
include *.txt
# Include ctags configuration
include .ctags.d/*.ctags
# Include subfolders completely
graft .devcontainer
graft .vscode
graft android
graft aten
graft benchmarks
graft binaries
graft c10
graft caffe2
graft cmake
graft docs
graft functorch
graft ios
graft mypy_plugins
graft scripts
graft test
graft third_party
graft tools
graft torch
@ -22,29 +63,37 @@ graft torchgen
# FIXME: torch-xla build during codegen will fail if include this file in wheel
exclude torchgen/BUILD.bazel
# Misc files and directories in SDist
include *.md
include CITATION.cff
include LICENSE NOTICE
include mypy*.ini
graft benchmarks
graft docs
graft mypy_plugins
graft scripts
# The following exclusions omit parts from third-party dependencies that
# contain invalid symlinks[1] and that are not needed for pytorch, such as
# bindings for unused languages
prune third_party/flatbuffers/java
prune third_party/flatbuffers/kotlin
prune third_party/ittapi/rust
prune third_party/nccl/pkg/debian
prune third_party/opentelemetry-cpp/third_party/prometheus-cpp/cmake/project-import-*
# The following document is also an invalid symlink[1] and superfluous
exclude third_party/flatbuffers/docs/source/CONTRIBUTING.md
# Omit autogenerated code
prune torchgen/packaged
# Omit caches, compiled, and scm related content
prune */__pycache__
prune **/.github
prune **/.gitlab
global-exclude *.o *.obj *.so *.dylib *.a *.pxd *.dll *.lib
global-exclude *.py[cod] *.swp *~
global-exclude .git .git-blame-ignore-revs .gitattributes .gitignore .gitmodules
global-exclude .gitlab-ci.yml
# Misc files needed for custom setuptools command
include .gitignore
include .gitmodules
# Include test suites in SDist
graft test
include pytest.ini
include .coveragerc
# [1] Invalid symlinks for the purposes of Python source distributions are,
# according to the source distribution format[2] links pointing outside the
# destination directory or links with a `..` component, which is those of
# concern here.
# Prune generated/compiled files
prune torchgen/packaged
prune */__pycache__
global-exclude *.o *.obj *.so *.a *.dylib *.pxd *.dll *.lib *.py[cod]
prune */.git
global-exclude .git *~ *.swp
# [2] https://packaging.python.org/en/latest/specifications/source-distribution-format/#source-distribution-archive-features

View File

@ -161,7 +161,7 @@ They require JetPack 4.2 and above, and [@dusty-nv](https://github.com/dusty-nv)
#### Prerequisites
If you are installing from source, you will need:
- Python 3.9 or later
- Python 3.10 or later
- A compiler that fully supports C++17, such as clang or gcc (gcc 9.4.0 or newer is required, on Linux)
- Visual Studio or Visual Studio Build Tool (Windows only)
@ -275,7 +275,7 @@ conda install pkg-config libuv
pip install mkl-static mkl-include
# Add these packages if torch.distributed is needed.
# Distributed package support on Windows is a prototype feature and is subject to changes.
conda install -c conda-forge libuv
conda install -c conda-forge libuv=1.51
```
#### Install PyTorch

View File

@ -317,10 +317,20 @@ IF(USE_FBGEMM_GENAI)
-greedy-reverse-local-assignment=1
-fhip-new-launch-api)
# Only compile for gfx942 for now.
# This is rather hacky, I could not figure out a clean solution :(
set(HIP_CLANG_FLAGS_ORIGINAL ${HIP_CLANG_FLAGS})
string(REGEX REPLACE "--offload-arch=[^ ]*" "" FILTERED_HIP_CLANG_FLAGS "${HIP_CLANG_FLAGS}")
if("gfx942" IN_LIST PYTORCH_ROCM_ARCH)
list(APPEND FILTERED_HIP_CLANG_FLAGS --offload-arch=gfx942;)
endif()
set(HIP_CLANG_FLAGS ${FILTERED_HIP_CLANG_FLAGS})
hip_add_library(
fbgemm_genai STATIC
${fbgemm_genai_native_rocm_hip}
HIPCC_OPTIONS ${HIP_HCC_FLAGS} ${FBGEMM_GENAI_EXTRA_HIPCC_FLAGS})
set(HIP_CLANG_FLAGS ${HIP_CLANG_FLAGS_ORIGINAL})
set_target_properties(fbgemm_genai PROPERTIES POSITION_INDEPENDENT_CODE ON)
target_compile_definitions(fbgemm_genai PRIVATE FBGEMM_GENAI_NO_EXTENDED_SHAPES)

View File

@ -401,30 +401,13 @@ T* toDLPackImpl(const Tensor& src) {
// The following code detects whether the src follows
// a continuous pattern. If the src follows such pattern (common-case)
// then we do not need to normalize the strides.
bool need_normalize_strides = false;
int64_t expected_stride = 1;
for (int i = src.dim() - 1; i >= 0; i--) {
// detect if we do not meet continuous pattern
// and the size is 1, so there is opportunity to normalize
if (src.stride(i) != expected_stride && src.size(i) == 1) {
need_normalize_strides = true;
break;
}
expected_stride *= src.size(i);
}
bool need_normalize_strides = src.dim() == 1 && src.size(0) == 1 && src.stride(0) != 1;
// less common case, try normalizing the strides
if (need_normalize_strides) {
// create a new tensor with possibly normalized strides
// gh-83069
auto shape = src.sizes();
auto strides = src.strides().vec();
for (int i = 0; i < src.dim(); i++) {
if (shape[i] < 2) {
strides[i] = 1;
}
}
view = src.as_strided(shape, strides, src.storage_offset());
view = src.as_strided(shape, {1}, src.storage_offset());
}
ATenDLMTensor<T>* atDLMTensor(new ATenDLMTensor<T>);

View File

@ -468,7 +468,7 @@ inline Tensor _sum_to(
// if we assume no reduction due to unbacked we ensure that at runtime.
TORCH_MAYBE_SYM_CHECK(
sym_eq(shape[i - leading_dims], sizes[i]),
"non-reduction path was assumed due to unabcked symbols expected those two sizes to be the same:",
"non-reduction path was assumed due to unbacked symbols expected those two sizes to be the same:",
shape[i - leading_dims],
", ",
sizes[i])

View File

@ -9,11 +9,6 @@
namespace at::functionalization {
ViewMeta ViewMeta::to_out_idx(int64_t out_idx) {
if (out_idx == this->out_index) return *this;
return ViewMeta(forward_fn, reverse_fn, has_symbolic_inputs, is_multi_output, is_as_strided, out_idx);
}
// Note [Functionalization: Alias Removal Part 2]
// See Note [Functionalization: Alias Removal] for more details.
// This function applies a single update from one of the views to the StorageImpl.
@ -42,12 +37,12 @@ ViewMeta ViewMeta::to_out_idx(int64_t out_idx) {
static const Tensor apply_update(const FunctionalStorageImpl::Update& update, const Tensor& base) {
at::Tensor t = update.new_val;
TORCH_INTERNAL_ASSERT(!at::functionalization::impl::isFunctionalTensor(t));
if (update.view_metas.empty()) return t;
if (update.view_metas.empty()) { return t; }
std::vector<at::Tensor> tmp_values({base});
tmp_values.reserve(update.view_metas.size());
for (size_t i = 0; i < update.view_metas.size() - 1; ++i) {
at::Tensor next_view = update.view_metas[i].forward_fn(tmp_values.back(), update.view_metas[i].out_index);
at::Tensor next_view = update.view_metas[i]->forward(tmp_values.back());
// NB: We only actually need tmp_values for ops like select/slice/diagonal/squeeze/as_strided
// All of these ops require additional information to recover the sizes of the original tensor.
// If need to, we could probably apply this optimization and only bother computing tmp_values
@ -55,9 +50,8 @@ static const Tensor apply_update(const FunctionalStorageImpl::Update& update, co
tmp_values.push_back(std::move(next_view));
}
for(int64_t i = static_cast<int64_t>(update.view_metas.size()) - 1; i >= 0; --i) {
int64_t out_idx = update.view_metas[i].out_index;
// Each view inverse is implemented in ViewInverses.cpp.
t = update.view_metas[i].reverse_fn(tmp_values[i], t, out_idx);
t = update.view_metas[i]->reverse(tmp_values[i], t);
}
TORCH_INTERNAL_ASSERT(!at::functionalization::impl::isFunctionalTensor(t));
return t;
@ -111,13 +105,13 @@ FunctionalStorageImpl::FunctionalStorageImpl(const Tensor& base)
TORCH_INTERNAL_ASSERT(!at::functionalization::impl::isFunctionalTensor(base_));
}
void FunctionalStorageImpl::add_update(const Tensor& updated_val, const std::vector<ViewMeta>& metas) {
void FunctionalStorageImpl::add_update(const Tensor& updated_val, const std::vector<std::shared_ptr<ViewMeta>>& metas) {
TORCH_CHECK(!frozen_, "cannot mutate tensors with frozen storage");
if (metas.size() > 1) {
for (size_t i = 1; i < metas.size(); ++i) {
// Skipping this check for XLA. Would be good to add it back, but it is failing XLA CI
TORCH_CHECK(updated_val.device().type() == c10::DeviceType::XLA || !metas[i].is_as_strided,
TORCH_CHECK(updated_val.device().type() == c10::DeviceType::XLA || !metas[i]->is_as_strided,
"During torch.compile, encountered a mutation on a view chain of length ", metas.size(), ", where view ", i,
" was an as_strided() call. as_strided() is non-compositional, and therefore is not possible to functionalize properly today,"
"so this behavior is banned in compile. As a workaround, you can either remove the mutation from the model code, or you "

View File

@ -8,44 +8,89 @@ namespace at::functionalization {
// See Note [Functionalization Pass In Core]
enum class InverseReturnMode {
/// Specifies that functional inverses should always return a view.
AlwaysView,
/// Specifies that functional inverses should always return a non-view / copy.
NeverView,
/// Specifies that functional inverses should return a view unless a (copying)
/// scatter
/// inverse exists, in which case that will be used instead.
/// This avoids as_strided() calls that can be difficult for subclasses to
/// handle.
ViewOrScatterInverse,
};
#define FUNCTIONALIZATION_VIEWMETA_NAME(TYPE) \
static const char* name() { \
return #TYPE; \
}
#define FUNCTIONALIZATION_VIEWMETA_SERIALIZABLE_TUPLE(...) \
using SerializableTuple = std::tuple<__VA_ARGS__>
// ViewMeta is a class used by the functionalization pass to navigate between
// a base tensor and a view tensor.
// For example, if I call `b = a.view1(...)`
// the functionalization pass will generate and store a ViewMeta on b that looks
// like:
// the functionalization pass will generate and store a ViewMeta specialization
// for `view1` operation on b that looks like:
//
// ViewMeta(
// [<captures>](const Tensor& base, int64_t mutated_view_idx) {
// return base.view1(...);
// },
// [<captures>](const at::Tensor& base, const at::Tensor& mutated_view,
// int64_t mutated_view_idx) -> at::Tensor {
// return at::functionalization::impl::view1_inverse(base, mutated_view,
// ...);
// struct TORCH_API view1_ViewMeta : public ViewMeta {
// FUNCTIONALIZATION_VIEWMETA_NAME(view1_ViewMeta);
// FUNCTIONALIZATION_VIEWMETA_SERIALIZABLE_TUPLE(
// bool /* reapply_views */,
// const std::vector<int64_t>&);
//
// view1_ViewMeta(const SerializableTuple& tpl)
// : view1_ViewMeta(std::get<0>(tpl), std::get<1>(tpl)) {}
//
// view1_ViewMeta(bool reapply_views, const std::vector<int64_t>& size)
// : ViewMeta(/*has_symbolic_inputs=*/false),
// reapply_views(reapply_views),
// size(size) {}
//
// Tensor forward(const Tensor& base) override {
// return base.view1(...);
// }
//
// The forward_fn lambda describes how to replay view1 on a tensor.
// Tensor reverse(const Tensor& base, const Tensor& mutated_view) override {
// return at::functionalization::impl::view1_inverse(base, mutated_view,
// ...);
// }
//
// The reverse_fn lambda describes how, given a tensor that is already a view,
// SerializableTuple to_serializable_tuple() {
// return std::make_tuple(reapply_views, size);
// }
//
// bool reapply_views;
// std::vector<int64_t> size;
// };
//
// The forward function describes how to replay view1 on a tensor.
//
// The reverse function describes how, given a tensor that is already a view,
// how to get the corresponding base tensor. See Note [Functionalization Pass:
// View Inverses] for details.
//
// `SerializedTuple` is a typedef that defines an `std::tuple<...>` type
// representing the `ViewMeta` instance state. Methods that take in/return such
// a type are used for supporting pickle serialization.
struct ViewMeta {
ViewMeta(
std::function<Tensor(const Tensor&, int64_t)> forward,
std::function<Tensor(const Tensor&, const Tensor&, int64_t)> reverse,
bool has_symbolic_inputs,
bool is_multi_output = false,
bool is_as_strided = false,
int64_t out_idx = 0)
: forward_fn(std::move(forward)),
reverse_fn(std::move(reverse)),
out_index(out_idx),
: out_index(out_idx),
is_multi_output(is_multi_output),
is_as_strided(is_as_strided),
has_symbolic_inputs(has_symbolic_inputs) {}
std::function<Tensor(const Tensor&, int64_t)> forward_fn;
std::function<Tensor(const Tensor&, const Tensor&, int64_t)> reverse_fn;
virtual ~ViewMeta() = default;
virtual Tensor forward(const Tensor& base) = 0;
virtual Tensor reverse(const Tensor& base, const Tensor& mutated_view) = 0;
// See Note [out_idx in ViewMeta]
int64_t out_index;
@ -57,10 +102,17 @@ struct ViewMeta {
// Tells us if this view operation has any symbolic inputs
bool has_symbolic_inputs;
// Returns a copy of the current ViewMeta, if out_idx matches the current
// out_index. Otherwise, returns a new ViewMeta with the same forward/reverse
// Returns a new ViewMeta with the same forward/reverse
// functions, but a new out index.
ViewMeta to_out_idx(int64_t out_idx);
//
// This method should be implemented by those `ViewMeta` that have more than
// one output.
virtual std::shared_ptr<ViewMeta> to_out_index(int64_t out_index) {
TORCH_CHECK_NOT_IMPLEMENTED(
false,
"ViewMeta::to_out_index not implemented. ",
"Likely because there's only one output.");
}
};
// FunctionalStorageImpl is a subclass of StorageImpl used by the
@ -93,14 +145,14 @@ struct TORCH_API FunctionalStorageImpl : public c10::StorageImpl {
// NOLINTNEXTLINE(cppcoreguidelines-avoid-const-or-ref-data-members)
const at::Tensor new_val;
// NOLINTNEXTLINE(cppcoreguidelines-avoid-const-or-ref-data-members)
const std::vector<ViewMeta> view_metas;
const std::vector<std::shared_ptr<ViewMeta>> view_metas;
};
explicit FunctionalStorageImpl(const Tensor& value);
void add_update(
const Tensor& updated_val,
const std::vector<ViewMeta>& view_metas);
const std::vector<std::shared_ptr<ViewMeta>>& view_metas);
bool apply_updates();
const Tensor& base() {
return base_;
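To make the new subclassing pattern concrete, here is a hypothetical single-output `ViewMeta` specialization written against the interface documented above; the op choice (`narrow`) and the class name are illustrative, not part of the generated code:
```
#include <ATen/FunctionalStorageImpl.h>
#include <ATen/ATen.h>

// Hypothetical ViewMeta for narrow(dim, start, length); a minimal sketch only.
struct narrow_ViewMeta : public at::functionalization::ViewMeta {
  FUNCTIONALIZATION_VIEWMETA_NAME(narrow_ViewMeta)

  narrow_ViewMeta(int64_t dim, int64_t start, int64_t length)
      : ViewMeta(/*has_symbolic_inputs=*/false),
        dim(dim), start(start), length(length) {}

  // Replay the view on a (possibly updated) base.
  at::Tensor forward(const at::Tensor& base) override {
    return base.narrow(dim, start, length);
  }

  // Scatter a mutated view back into its base.
  at::Tensor reverse(const at::Tensor& base, const at::Tensor& mutated_view) override {
    return base.slice_scatter(mutated_view, dim, start, start + length, /*step=*/1);
  }

  int64_t dim, start, length;
};
```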

View File

@ -129,17 +129,19 @@ void FunctionalTensorWrapper::freeze_storage() const {
// - view_value: The output tensor that we need to wrap.
// - base: The "base" of the view that `view_value` was generated from.
// See Note [Functionalization: Alias Removal Part 2] for more details on the mutation replay logic.
FunctionalTensorWrapper::FunctionalTensorWrapper(const Tensor& view_value, const FunctionalTensorWrapper* base, const functionalization::ViewMeta& meta)
: c10::TensorImpl(
c10::DispatchKeySet(DispatchKey::Functionalize),
view_value.dtype(),
base->storage().data_ptr().device()
),
value_(view_value),
is_multi_output_view_(base->is_multi_output_view_ || meta.is_multi_output),
was_storage_changed_(base->was_storage_changed_),
is_symbolic_(base->is_symbolic_)
{
FunctionalTensorWrapper::FunctionalTensorWrapper(
const Tensor& view_value,
const FunctionalTensorWrapper* base,
const std::shared_ptr<functionalization::ViewMeta>& meta)
: c10::TensorImpl(
c10::DispatchKeySet(DispatchKey::Functionalize),
view_value.dtype(),
base->storage().data_ptr().device()),
value_(view_value),
is_multi_output_view_(
base->is_multi_output_view_ || meta->is_multi_output),
was_storage_changed_(base->was_storage_changed_),
is_symbolic_(base->is_symbolic_) {
TORCH_INTERNAL_ASSERT(!at::functionalization::impl::isFunctionalTensor(value_));
TORCH_INTERNAL_ASSERT(!value_.key_set().has(c10::DispatchKey::Functionalize));
set_constructor_metadata();
@ -148,11 +150,10 @@ FunctionalTensorWrapper::FunctionalTensorWrapper(const Tensor& view_value, const
view_metas_ = base->view_metas_; // copy
}
view_metas_.push_back(meta);
maybe_mark_symbolic(meta);
maybe_mark_symbolic(meta.get());
storage_ = base->storage_; // alias this tensor's storage with the base tensor's
}
functionalization::FunctionalStorageImpl* FunctionalTensorWrapper::functional_storage_impl() const {
return static_cast<functionalization::FunctionalStorageImpl*>(storage_.unsafeGetStorageImpl());
}
@ -176,18 +177,18 @@ bool FunctionalTensorWrapper::is_up_to_date() const {
}
// See Note [Functionalization Pass - Inplace View Ops]
void FunctionalTensorWrapper::mutate_view_meta(const at::functionalization::ViewMeta& meta) {
void FunctionalTensorWrapper::mutate_view_meta(const std::shared_ptr<at::functionalization::ViewMeta>& meta) {
view_metas_.push_back(meta);
// Manually track the fact that this tensor received a metadata mutation!
has_metadata_mutation_ = true;
// Mark this tensor as being symbolic if there are any symbolic inputs used by the view operation.
maybe_mark_symbolic(meta);
maybe_mark_symbolic(meta.get());
// Note [Functionalization Pass - Inplace View Ops]
// So, these ops are special - they're mutation AND view ops. They get special codegen.
// An example is transpose_, e.g. `a.transpose_()`
// Calling transpose_() should ensure that a gets an alias, and append the new ViewMeta to a's current list of ViewMetas.
at::AutoDispatchSkipFunctionalize guard;
value_ = meta.forward_fn(value_, meta.out_index);
value_ = meta->forward(value_);
TORCH_INTERNAL_ASSERT(!value_.key_set().has(c10::DispatchKey::Functionalize));
}
@ -368,15 +369,8 @@ void FunctionalTensorWrapper::sync_() {
regenerate_from_base();
}
Tensor FunctionalTensorWrapper::apply_view_metas(const Tensor& base) {
auto t = base;
// Reapply views to get the viewed tensor from the base in alias_
for (auto& view_meta: view_metas_) {
t = view_meta.forward_fn(t, view_meta.out_index);
}
return t;
const std::vector<std::shared_ptr<functionalization::ViewMeta>>& FunctionalTensorWrapper::view_metas() const {
return view_metas_;
}
void FunctionalTensorWrapper::regenerate_from_base() {
@ -385,7 +379,7 @@ void FunctionalTensorWrapper::regenerate_from_base() {
auto t = storage_impl->base();
TORCH_INTERNAL_ASSERT(!at::functionalization::impl::isFunctionalTensor(t));
t = apply_view_metas(t);
t = at::functionalization::impl::apply_view_meta_sequence(t, view_metas_);
TORCH_INTERNAL_ASSERT(!at::functionalization::impl::isFunctionalTensor(t));
replace_(t, /*from_lazy_regenerate=*/true);
@ -727,11 +721,11 @@ bool isFunctionalTensor(const std::optional<Tensor>& t) {
}
bool isFunctionalTensor(const c10::List<::std::optional<Tensor>>& t_list) {
if (t_list.empty()) return false;
if (t_list.empty()) { return false; }
auto functional_count = 0;
for (const auto i : c10::irange(t_list.size())) {
auto const & e= t_list[i];
if (!e.has_value() || !e->defined()) continue;
if (!e.has_value() || !e->defined()) { continue; }
if (isFunctionalTensor(e)) {
++functional_count;
}
@ -741,10 +735,10 @@ bool isFunctionalTensor(const c10::List<::std::optional<Tensor>>& t_list) {
template <typename T>
static bool isFunctionalTensorIListRef(c10::IListRef<T> list) {
if (list.size() == 0) return false;
if (list.size() == 0) { return false; }
auto functional_count = 0;
for (const auto& tensor : list) {
if (!tensor.defined()) continue;
if (!tensor.defined()) { continue; }
if (isFunctionalTensor(tensor)) {
++functional_count;
}
@ -762,20 +756,28 @@ void freeze_functional_tensor(const Tensor& tensor) {
functional_base_impl->freeze_storage();
}
Tensor create_functional_tensor_with_view_meta(const at::Tensor& view_to_wrap, const at::Tensor& base, functionalization::ViewMeta meta, int64_t out_idx) {
Tensor create_functional_tensor_with_view_meta(
const at::Tensor& view_to_wrap,
const at::Tensor& base,
const std::shared_ptr<functionalization::ViewMeta>& meta,
int64_t out_idx) {
TORCH_INTERNAL_ASSERT(!at::functionalization::impl::isFunctionalTensor(view_to_wrap));
TORCH_INTERNAL_ASSERT(at::functionalization::impl::isFunctionalTensor(base));
auto functional_base_impl = at::functionalization::impl::unsafeGetFunctionalWrapper(base);
auto meta_ = meta;
if (out_idx != 0) {
// Note [out_idx in ViewMeta]
// When a view op outputs multiple tensors, each output needs its own separate ViewMeta.
// Each ViewMeta also tracks the index of the particular output tensor, which is needed in the reverse function.
meta = meta.to_out_idx(out_idx);
meta_ = meta->to_out_index(out_idx);
}
return at::detail::make_tensor<FunctionalTensorWrapper>(view_to_wrap, functional_base_impl, meta);
return at::detail::make_tensor<FunctionalTensorWrapper>(view_to_wrap, functional_base_impl, meta_);
}
std::vector<Tensor> create_functional_tensor_with_view_meta(ITensorListRef view_to_wrap, const at::Tensor& base, const functionalization::ViewMeta& meta) {
std::vector<Tensor> create_functional_tensor_with_view_meta(
ITensorListRef view_to_wrap,
const at::Tensor& base,
const std::shared_ptr<functionalization::ViewMeta>& meta) {
std::vector<Tensor> outputs(view_to_wrap.size());
int64_t i = 0;
for (const auto& tensor : view_to_wrap) {
@ -785,12 +787,22 @@ std::vector<Tensor> create_functional_tensor_with_view_meta(ITensorListRef view_
return outputs;
}
void mutate_view_meta(const at::Tensor& self, const functionalization::ViewMeta& meta) {
void mutate_view_meta(const at::Tensor& self, const std::shared_ptr<functionalization::ViewMeta>& meta) {
TORCH_INTERNAL_ASSERT(at::functionalization::impl::isFunctionalTensor(self));
auto self_impl = at::functionalization::impl::unsafeGetFunctionalWrapper(self);
self_impl->mutate_view_meta(meta);
}
Tensor apply_view_meta_sequence(
const Tensor& base,
const std::vector<std::shared_ptr<functionalization::ViewMeta>>& sequence) {
Tensor r = base;
for (auto& vm : sequence) {
r = vm->forward(r);
}
return r;
}
// Note [Propagating strides in the functionalization pass]
// In order to properly compute stride information, the functionalization pass
// calls each {view} reference implementations with meta tensors.
@ -884,7 +896,7 @@ void functionalize_op_helper(const c10::OperatorHandle& op, torch::jit::Stack* s
const auto& ivalue = returns[idx];
if (ivalue.isTensor()) {
const auto& t = ivalue.toTensor();
if (!t.defined()) continue;
if (!t.defined()) { continue; }
at::functionalization::impl::sync(t);
auto t_new = c10::IValue(at::functionalization::impl::from_functional_tensor(t));
(*stack)[returns_begin + idx] = t_new;

View File

@ -56,7 +56,7 @@ struct TORCH_API FunctionalTensorWrapper : public c10::TensorImpl {
explicit FunctionalTensorWrapper(
const Tensor& view_value,
const FunctionalTensorWrapper* base,
const functionalization::ViewMeta& meta);
const std::shared_ptr<functionalization::ViewMeta>& meta);
// Get the underlying, actual tensor, that doesn't know anything about
// functionalization.
@ -99,17 +99,17 @@ struct TORCH_API FunctionalTensorWrapper : public c10::TensorImpl {
->are_all_mutations_under_no_grad_or_inference_mode();
}
void maybe_mark_symbolic(const functionalization::ViewMeta& meta) {
is_symbolic_ = is_symbolic_ | meta.has_symbolic_inputs;
void maybe_mark_symbolic(functionalization::ViewMeta* meta) {
is_symbolic_ = is_symbolic_ | meta->has_symbolic_inputs;
}
bool is_symbolic() const {
return is_symbolic_;
}
// Runs the forward_fn of every ViewMeta collected in the current instance
// to some other base.
Tensor apply_view_metas(const Tensor& base);
// Retrieves the ViewMeta sequence of this tensor.
const std::vector<std::shared_ptr<functionalization::ViewMeta>>& view_metas()
const;
// Sync's the underlying tensor with its alias, if it's out of date. This
// involves two steps: 1) Apply any pending updates/mutations to the alias 2)
@ -146,7 +146,8 @@ struct TORCH_API FunctionalTensorWrapper : public c10::TensorImpl {
// from the base tensor. This method is used by inplace-view ops like
// transpose_. It appends a ViewMeta to the existing stack, and refreshes the
// tensor by replaying the views off of the alias.
void mutate_view_meta(const at::functionalization::ViewMeta& meta);
void mutate_view_meta(
const std::shared_ptr<at::functionalization::ViewMeta>& meta);
// Custom implementation of self.set_(src)
void set__impl(const FunctionalTensorWrapper* other);
@ -285,7 +286,7 @@ struct TORCH_API FunctionalTensorWrapper : public c10::TensorImpl {
bool is_symbolic_ = false;
size_t generation_ = 0;
std::vector<at::functionalization::ViewMeta> view_metas_;
std::vector<std::shared_ptr<at::functionalization::ViewMeta>> view_metas_;
protected:
static void copy_tensor_metadata(
@ -377,16 +378,20 @@ TORCH_API void propagate_xla_data_direct(
Tensor create_functional_tensor_with_view_meta(
const Tensor& view_to_wrap,
const Tensor& base,
functionalization::ViewMeta meta,
const std::shared_ptr<functionalization::ViewMeta>& meta,
int64_t out_idx = 0);
std::vector<Tensor> create_functional_tensor_with_view_meta(
ITensorListRef view_to_wrap,
const Tensor& base,
const functionalization::ViewMeta& meta);
const std::shared_ptr<functionalization::ViewMeta>& meta);
void mutate_view_meta(
const Tensor& self,
const functionalization::ViewMeta& meta);
const std::shared_ptr<functionalization::ViewMeta>& meta);
TORCH_API Tensor apply_view_meta_sequence(
const Tensor& base,
const std::vector<std::shared_ptr<functionalization::ViewMeta>>& sequence);
void set_sizes_strides_offset(const Tensor& out, const Tensor& meta_out);
void set_sizes_strides_offset(

View File

@ -1,3 +1,5 @@
#include <ATen/FunctionalizeFallbackKernel.h>
#include <ATen/core/dispatch/Dispatcher.h>
#include <ATen/core/LegacyTypeDispatch.h>
#include <ATen/EmptyTensor.h>
@ -7,7 +9,6 @@
#include <torch/library.h>
#include <c10/util/irange.h>
#include <c10/util/strides.h>
#include <ATen/EmptyTensor.h>
#ifndef AT_PER_OPERATOR_HEADERS
#include <ATen/ATen.h>
@ -28,6 +29,31 @@
#include <utility>
#endif
namespace at::functionalization {
Tensor resize__ViewMeta::forward(const Tensor& base) {
if (reapply_views) {
return base.as_strided(size, c10::contiguous_strides(size));
} else {
return at::as_strided_copy(base, size, c10::contiguous_strides(size));
}
}
Tensor resize__ViewMeta::reverse(const Tensor& base, const Tensor& mutated_view) {
return base.as_strided_scatter(
mutated_view, size, c10::contiguous_strides(size));
}
Tensor _unsafe_view_ViewMeta::forward(const Tensor& base) {
return at::_unsafe_view_symint(base, size);
}
Tensor _unsafe_view_ViewMeta::reverse(const Tensor& base, const Tensor& mutated_view) {
return at::_unsafe_view_symint(mutated_view, base.sym_sizes());
}
} // namespace at::functionalization
namespace {
void functionalizeFallback(const c10::OperatorHandle& op, c10::DispatchKeySet dispatchKeySet [[maybe_unused]], torch::jit::Stack* stack) {
const auto& schema = op.schema();
@ -106,7 +132,9 @@ namespace {
const auto& ivalue = returns[idx];
if (ivalue.isTensor() && should_wrap_outputs) {
const auto& t = ivalue.toTensor();
if (!t.defined()) continue;
if (!t.defined()) {
continue;
}
auto t_new = c10::IValue(at::functionalization::impl::to_functional_tensor(t));
(*stack)[returns_begin + idx] = t_new;
} else if (ivalue.isTensorList() && should_wrap_outputs) {
@ -169,19 +197,8 @@ static const at::Tensor & resize__functionalization(c10::DispatchKeySet dispatch
// The output of resizing is equivalent to taking a slice of a larger tensor.
// We have to emulate this "slicing" with an as_strided call.
auto reapply_views = at::functionalization::impl::getFunctionalizationReapplyViewsTLS();
at::functionalization::ViewMeta view_meta = at::functionalization::ViewMeta(
[reapply_views = reapply_views, size = size.vec()](const at::Tensor & base, int64_t mutated_view_idx [[maybe_unused]]) -> at::Tensor {
if (reapply_views) {
return base.as_strided(size, c10::contiguous_strides(size));
} else {
return at::as_strided_copy(base, size, c10::contiguous_strides(size));
}
},
[size = size.vec()](const at::Tensor & base, const at::Tensor & mutated_view, int64_t mutated_view_idx [[maybe_unused]]) -> at::Tensor {
return base.as_strided_scatter(mutated_view, size, c10::contiguous_strides(size));
},
/*has_symbolic_inputs=*/false
);
auto view_meta = std::make_shared<at::functionalization::resize__ViewMeta>(
reapply_views, size.vec());
at::functionalization::impl::mutate_view_meta(self, view_meta);
return self;
}
@ -300,17 +317,11 @@ static at::Tensor _unsafe_view_functionalize(const at::Tensor & self, at::SymInt
tmp_output = at::_unsafe_view_symint(self_, size);
}
bool has_symbolic_inputs = std::any_of(size.begin(), size.end(), [=](auto& s) { return s.is_symbolic(); });
at::functionalization::ViewMeta view_meta = at::functionalization::ViewMeta(
[size = size.vec()](const at::Tensor & base, int64_t mutated_view_idx [[maybe_unused]]) -> at::Tensor {
return at::_unsafe_view_symint(base, size);
},
[size = size.vec()](const at::Tensor & base, const at::Tensor & mutated_view, int64_t mutated_view_idx [[maybe_unused]]) -> at::Tensor {
return at::_unsafe_view_symint(mutated_view, base.sym_sizes());
},
/*has_symbolic_inputs=*/has_symbolic_inputs
);
bool has_symbolic_inputs = std::any_of(
size.begin(), size.end(), [=](auto& s) { return s.is_symbolic(); });
auto view_meta =
std::make_shared<at::functionalization::_unsafe_view_ViewMeta>(
has_symbolic_inputs, size.vec());
auto out = at::functionalization::impl::create_functional_tensor_with_view_meta(tmp_output, self, std::move(view_meta));
// See Note [Propagating strides in the functionalization pass]

View File

@ -0,0 +1,58 @@
#pragma once
#include <ATen/FunctionalStorageImpl.h>
namespace at::functionalization {
// `ViewMeta` implementation for `resize_` operation.
struct TORCH_API resize__ViewMeta : public ViewMeta {
FUNCTIONALIZATION_VIEWMETA_NAME(resize__ViewMeta)
FUNCTIONALIZATION_VIEWMETA_SERIALIZABLE_TUPLE(
bool /* reapply_views */,
const std::vector<int64_t>&);
resize__ViewMeta(const SerializableTuple& tpl)
: resize__ViewMeta(std::get<0>(tpl), std::get<1>(tpl)) {}
resize__ViewMeta(bool reapply_views, const std::vector<int64_t>& size)
: ViewMeta(/*has_symbolic_inputs=*/false),
reapply_views(reapply_views),
size(size) {}
Tensor forward(const Tensor& base) override;
Tensor reverse(const Tensor& base, const Tensor& mutated_view) override;
SerializableTuple to_serializable_tuple() {
return std::make_tuple(reapply_views, size);
}
bool reapply_views;
std::vector<int64_t> size;
};
// `ViewMeta` implementation for `_unsafe_view` operation.
struct TORCH_API _unsafe_view_ViewMeta : public ViewMeta {
FUNCTIONALIZATION_VIEWMETA_NAME(_unsafe_view_ViewMeta)
FUNCTIONALIZATION_VIEWMETA_SERIALIZABLE_TUPLE(
bool /* has_symbolic_inputs */,
const std::vector<c10::SymInt>&);
_unsafe_view_ViewMeta(const SerializableTuple& tpl)
: _unsafe_view_ViewMeta(std::get<0>(tpl), std::get<1>(tpl)) {}
_unsafe_view_ViewMeta(
bool has_symbolic_inputs,
const std::vector<c10::SymInt>& size)
: ViewMeta(has_symbolic_inputs), size(size) {}
Tensor forward(const Tensor& base) override;
Tensor reverse(const Tensor& base, const Tensor& mutated_view) override;
SerializableTuple to_serializable_tuple() {
return std::make_tuple(has_symbolic_inputs, size);
}
std::vector<c10::SymInt> size;
};
} // namespace at::functionalization
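As a hedged usage sketch (not part of the diff; it assumes the new header above is included along with <ATen/ATen.h>), a subclass like resize__ViewMeta simply packages the forward/reverse pair that used to be built as inline lambdas:

#include <memory>
#include <vector>
// Also include the header above (its file name is not shown in this listing) and <ATen/ATen.h>.

void resize_viewmeta_sketch(const at::Tensor& base) {
  auto meta = std::make_shared<at::functionalization::resize__ViewMeta>(
      /*reapply_views=*/true, std::vector<int64_t>{2, 2});

  at::Tensor view = meta->forward(base);            // base.as_strided({2, 2}, contiguous strides)
  at::Tensor new_base = meta->reverse(base, view);  // as_strided_scatter of the view into base

  // to_serializable_tuple() and the SerializableTuple constructor let the same
  // metadata be rebuilt elsewhere, e.g. when a graph is serialized.
}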


@ -45,7 +45,39 @@ inline void infer_size_impl(
}
}
auto set_infer_dim = [&]() {
if (infer_dim) {
// numel is the product of the known sizes; it has to be divisible by
// newsize, and newsize should be positive unless newsize == numel (we
// throw a different error message in that case).
if constexpr (std::is_same_v<NumelType, c10::SymInt>) {
auto v = newsize.maybe_as_int();
if (v and *v == 0) {
// Avoid div by 0 when sym_eq(numel % newsize, 0) is constructed, which
// may happen when newsize is not a symbol. If it is a symbol, the
// division won't happen anyway during compile.
TORCH_MAYBE_SYM_CHECK(
numel == newsize,
"shape '",
shape,
"' is invalid for input of size ",
numel);
} else {
auto cond = sym_gt(newsize, 0)
.sym_and(sym_eq(numel % newsize, 0))
.sym_or(sym_eq(numel, newsize));
TORCH_MAYBE_SYM_CHECK(
cond, "shape '", shape, "' is invalid for input of size ", numel);
}
} else {
TORCH_CHECK(
(newsize > 0 && (numel % newsize == 0)) || numel == newsize,
"shape '",
shape,
"' is invalid for input of size ",
numel);
}
// We have a degree of freedom here to select the dimension size; follow
// NumPy semantics and just bail. However, a nice error message is needed
// because users often use `view` as a way to flatten & unflatten
@ -54,18 +86,14 @@ inline void infer_size_impl(
// works yet
// empty_tensor.view(-1, 0)
// doesn't.
TORCH_CHECK(
TORCH_MAYBE_SYM_CHECK(
newsize != 0,
"cannot reshape tensor of 0 elements into shape ",
shape,
" because the unspecified dimension size -1 can be any "
"value and is ambiguous");
res[*infer_dim] = numel / newsize;
return;
};
if (infer_dim && newsize > 0 && numel % newsize == 0) {
set_infer_dim();
res[*infer_dim] = numel / newsize;
return;
}
@ -75,9 +103,6 @@ inline void infer_size_impl(
shape,
"' is invalid for input of size ",
numel);
if (infer_dim) {
set_infer_dim();
}
}
inline std::vector<int64_t> infer_size(IntArrayRef shape, int64_t numel) {
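For the concrete (non-symbolic) branch, the check above is what at::infer_size enforces when it fills in a -1 dimension; a small sketch (assuming a libtorch build):

#include <ATen/InferSize.h>
#include <cassert>
#include <vector>

int main() {
  // 12 elements viewed as (-1, 4): the -1 is inferred as 12 / 4 = 3.
  std::vector<int64_t> out = at::infer_size({-1, 4}, /*numel=*/12);
  assert((out == std::vector<int64_t>{3, 4}));

  // (-1, 5) on 12 elements throws: "shape '[-1, 5]' is invalid for input of size 12".
  // at::infer_size({-1, 5}, /*numel=*/12);
  return 0;
}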


@ -103,7 +103,9 @@ std::string get_cpu_capability() {
#elif defined(HAVE_ZVECTOR_CPU_DEFINITION)
case native::CPUCapability::ZVECTOR:
return "Z VECTOR";
#elif defined(HAVE_SVE256_CPU_DEFINITION) && defined(HAVE_ARM_BF16_CPU_DEFINITION)
#elif defined(HAVE_SVE_CPU_DEFINITION) && defined(HAVE_ARM_BF16_CPU_DEFINITION)
case native::CPUCapability::SVE128:
return "SVE128";
case native::CPUCapability::SVE256:
return "SVE256";
#else


@ -1234,7 +1234,7 @@ struct TORCH_API TupleType : public NamedType {
std::shared_ptr<FunctionSchema> schema_;
};
// the common supertype of all Enums, only used in operator registraion.
// the common supertype of all Enums, only used in operator registration.
// EnumType <: AnyEnumType for all Enums
struct AnyEnumType;
using AnyEnumTypePtr = SingletonTypePtr<AnyEnumType>;


@ -102,8 +102,31 @@ struct VecReduceAllSIMD<float, Op> {
#endif // defined(__GNUC__) && (__GNUC__ > 5) && !defined(_MSC_VER) &&
// !defined(C10_MOBILE)
#if defined(__aarch64__) && !defined(C10_MOBILE) && !defined(__CUDACC__) && \
!defined(CPU_CAPABILITY_SVE)
#if defined(__aarch64__) && !defined(C10_MOBILE) && !defined(__CUDACC__)
#if defined(CPU_CAPABILITY_SVE256)
template <typename Op>
struct VecReduceAllSIMD<float, Op> {
static inline float apply(
const Op& vec_fun,
const Vectorized<float>& acc_vec) {
using Vec = Vectorized<float>;
Vec v = acc_vec;
// 128-bit shuffle
svuint32_t ind = svdupq_n_u32(4, 5, 6, 7);
Vec v1 = svtbl_f32(v, ind);
v = vec_fun(v, v1);
// 64-bit shuffle
ind = svdupq_n_u32(2, 3, 0, 1);
v1 = svtbl_f32(v, ind);
v = vec_fun(v, v1);
// 32-bit shuffle
ind = svdupq_n_u32(1, 0, 2, 3);
v1 = svtbl_f32(v, ind);
v = vec_fun(v, v1);
return svlasta(svpfalse(), v);
}
};
#else
template <typename Op>
struct VecReduceAllSIMD<float, Op> {
static inline float apply(
@ -140,35 +163,8 @@ struct VecReduceAllSIMD<float, std::plus<Vectorized<float>>> {
return vaddvq_f32(acc_vec);
}
};
#endif // defined(CPU_CAPABILITY_SVE256)
#endif // defined(__aarch64__) && !defined(C10_MOBILE) && !defined(__CUDACC__)
// && !defined(CPU_CAPABILITY_SVE)
#if defined(__aarch64__) && !defined(C10_MOBILE) && !defined(__CUDACC__) && \
defined(CPU_CAPABILITY_SVE256)
template <typename Op>
struct VecReduceAllSIMD<float, Op> {
static inline float apply(
const Op& vec_fun,
const Vectorized<float>& acc_vec) {
using Vec = Vectorized<float>;
Vec v = acc_vec;
// 128-bit shuffle
svuint32_t ind = svdupq_n_u32(4, 5, 6, 7);
Vec v1 = svtbl_f32(v, ind);
v = vec_fun(v, v1);
// 64-bit shuffle
ind = svdupq_n_u32(2, 3, 0, 1);
v1 = svtbl_f32(v, ind);
v = vec_fun(v, v1);
// 32-bit shuffle
ind = svdupq_n_u32(1, 0, 2, 3);
v1 = svtbl_f32(v, ind);
v = vec_fun(v, v1);
return svlasta(svpfalse(), v);
}
};
#endif // defined(__aarch64__) && !defined(C10_MOBILE) && !defined(__CUDACC__)
// && defined(CPU_CAPABILITY_SVE256)
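The SVE256 specialization above reduces the eight float lanes with three shuffle-and-combine rounds (128-bit, 64-bit, then 32-bit) before svlasta extracts the surviving lane; a scalar sketch of that tree reduction (hypothetical helper, not part of the diff):

#include <array>

// Scalar emulation of the three-round tree reduction used for 8 float lanes.
template <typename Op>
float tree_reduce8(std::array<float, 8> lanes, const Op& op) {
  for (int half = 4; half >= 1; half /= 2) {     // rounds combining 4, 2, then 1 lane pairs
    for (int i = 0; i < half; ++i) {
      lanes[i] = op(lanes[i], lanes[i + half]);  // combine each lane with its shuffled partner
    }
  }
  return lanes[0];  // corresponds to svlasta(svpfalse(), v) picking the surviving lane
}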
template <typename scalar_t, typename Op>
inline scalar_t vec_reduce_all(


@ -1,9 +1,21 @@
#pragma once
#include <ATen/cpu/vec/intrinsics.h>
#include <c10/macros/Macros.h>
#include <cstdint>
#include <ATen/cpu/vec/vec_base.h>
#if defined(__aarch64__) && \
(defined(AT_BUILD_ARM_VEC256_WITH_SLEEF) || \
defined(AT_BUILD_ARM_VECSVE_WITH_SLEEF))
#define SLEEF_STATIC_LIBS
#include <sleef.h>
#define USE_SLEEF(sleef_code, non_sleef_code) sleef_code
#else
#define USE_SLEEF(sleef_code, non_sleef_code) non_sleef_code
#endif
#if defined(CPU_CAPABILITY_SVE)
// Define the data type of VLS(vector-length specific).
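The USE_SLEEF switch above lets a single expression pick the Sleef implementation when the build enables it and a plain fallback otherwise; a hedged usage sketch (the Sleef entry point and the NEON fallback shown are assumptions, not part of this diff):

#include <arm_neon.h>
#include <cmath>

// Scalar fallback for builds without Sleef.
inline float32x4_t sin_scalar(float32x4_t x) {
  float buf[4];
  vst1q_f32(buf, x);
  for (float& v : buf) {
    v = std::sin(v);
  }
  return vld1q_f32(buf);
}

inline float32x4_t sin_f32x4(float32x4_t x) {
  // Expands to the first argument when Sleef is linked in, to the second otherwise.
  return USE_SLEEF(Sleef_sinf4_u10(x), sin_scalar(x));
}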
