Compare commits

..

402 Commits

Author SHA1 Message Date
d0cd09de94 Update (base update)
[ghstack-poisoned]
2025-10-09 15:49:41 +08:00
f231be25c6 Mark unused parameters in C++ code (#164912)
This PR adds unused parameter name comments in C++ declarations to improve code readability.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164912
Approved by: https://github.com/Skylion007
2025-10-09 06:23:25 +00:00
a753ffa9af Revert "Use runner with more memory for ASAN builds (#165000)"
This reverts commit f5fd18f7e24378bd9eb91404f697f1c81a8187d5.

Reverted https://github.com/pytorch/pytorch/pull/165000 on behalf of https://github.com/izaitsevfb due to not sure how, but this broke lint ([comment](https://github.com/pytorch/pytorch/pull/165000#issuecomment-3384286412))
2025-10-09 06:22:28 +00:00
a9a9a3438a Fix Avoid DDE in item numel check (#164934)
address https://github.com/pytorch/pytorch/issues/164725 and https://github.com/pytorch/pytorch/issues/164704

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164934
Approved by: https://github.com/ezyang, https://github.com/aorenste, https://github.com/Skylion007
2025-10-09 06:06:25 +00:00
263db92563 Add knobs in FR dump by watchdog (stacktrace and only active collectives) and trigger FR even on any exceptions (#164591)
This PR includes a couple of changes to extend FlightRecorder dump by PyTorch watchdog

- New knobs to control FR dump as suggested in the public documentation even for watchdog
(TORCH_INCLUDE_STACK_TRACE, TORCH_INCLUDE_ONLY_ACTIVE)
- Trigger the flight recorder dump on exceptions which could be triggered by any CUDA / host side error
  (TORCH_NCCL_EXTRA_DUMP_ON_EXEC)
-> Can be used as a snapshot of the workload progress for post-mortem analysis

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164591
Approved by: https://github.com/fduwjj
2025-10-09 05:33:35 +00:00
ed6156e3ea non-fb impls + unit tests (#164722)
Test Plan:
```
buck test fbcode//mode/opt caffe2/test/inductor:caching
```

Differential Revision: D83714692

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164722
Approved by: https://github.com/NikhilAPatel, https://github.com/adamomainz
2025-10-09 05:10:57 +00:00
d40a9bfb8d Do not decompose in functionalization/proxy tensor if autograd wouldn't have decomposed (#164939)
This fixes AOTAutograd rms_norm not being bitwise equivalent to
eager, because it avoids a decomposition.  You can force the
decomposition by having the decomposition in the dispatch table,
but if eager mode wouldn't have decomposed (because it went to the fused
one), we now default to preserving the fused call by default.

This largely reverts https://github.com/pytorch/pytorch/pull/103275/ for view ops. This means that in inference mode we could hit the wrong C++ kernel; if this occurs we should just SymInt'ify the C++ kernel.

Another neat side effect of this change is that Inductor's generated kernels for rms_norm now have rms_norm in their name.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164939
Approved by: https://github.com/bdhirsh
ghstack dependencies: #164573
2025-10-09 04:49:44 +00:00
e532f62e0d Introduce joint_custom_pass callback (#164981)
```
        def joint_custom_pass(joint_gm: torch.fx.GraphModule, joint_inputs):
           # apply your pass for joint graph here

            return joint_gm

        class M(torch.nn.Module):
            def forward(self, x):
                return x.sin()

        x = torch.randn(10, requires_grad=False)
        compiled_fn = torch.compile(M(), backend="aot_eager")

        with torch._functorch.config.patch("joint_custom_pass", joint_custom_pass):
            out = compiled_fn(x)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164981
Approved by: https://github.com/ezyang, https://github.com/anijain2305
2025-10-09 04:40:54 +00:00
1f73b96668 [PGO] log missing sources in allowlist (#164881)
Summary:
- logs missing dynamic sources
- emits MLHub insight only on size mismatch recompiles

Test Plan: test_pgo

Differential Revision: D84098898

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164881
Approved by: https://github.com/bobrenjc93
2025-10-09 04:39:09 +00:00
7b691546d2 [vllm hash update] update the pinned vllm hash (#164628)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned vllm hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164628
Approved by: https://github.com/pytorchbot
2025-10-09 04:35:36 +00:00
f05e23e1bc Add less warps config to inner reductions (#162447)
Add less warps to ensure proper vectorization + memory coalescing for inner reductions, prefer more work per thread

<img width="1717" height="731" alt="Screenshot 2025-09-17 at 10 03 25 AM" src="https://github.com/user-attachments/assets/7b1f4a30-62f2-4bee-bb9c-122501bde63e" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162447
Approved by: https://github.com/v0i0, https://github.com/eellison, https://github.com/shunting314
2025-10-09 04:22:16 +00:00
d386325ca9 Fix truediv numerics between eager and compile (#164144)
Addresses numeric differences between eager and compile in https://github.com/pytorch/pytorch/issues/141753

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164144
Approved by: https://github.com/eellison, https://github.com/jansel, https://github.com/ngimel
ghstack dependencies: #164997
2025-10-09 04:22:03 +00:00
7457d139c5 Add pyrefly suppressions to torch/distributed (7/n) (#165002)
Adds suppressions to pyrefly will typecheck clean: https://github.com/pytorch/pytorch/issues/163283

One more PR after this one.

Test plan:
dmypy restart && python3 scripts/lintrunner.py -a
pyrefly check

step 1: delete lines in the pyrefly.toml file from the project-excludes field
step 2: run pyrefly check
step 3: add suppressions, clean up unused suppressions
before: https://gist.github.com/maggiemoss/4b3bf2037014e116bc00706a16aef199

after:
INFO 0 errors (6,884 ignored)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165002
Approved by: https://github.com/oulgen
2025-10-09 04:08:25 +00:00
ab94a0d544 [CUDA][cuBLAS] addmm -- some refactoring for easier navigation between the Lt and non-Lt paths (#163955)
As per title. Additionally, some Lt selection conditions are revisited, and some redundancy removed (especially in the ROCm vs non-ROCm paths).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163955
Approved by: https://github.com/ngimel, https://github.com/eqy
2025-10-09 04:07:45 +00:00
0e9b3a772a [export] Turn on install_free_tensors flag (#164691)
The final step in removing the discrepancy between
torch.compile(fullgraph=True) and torch.export(strict=True).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164691
Approved by: https://github.com/avikchaudhuri
ghstack dependencies: #164721
2025-10-09 03:25:15 +00:00
af7ca55ced [export][dynamo] Fallback to slowpath for MultiHeadAttention for strict export (#164721)
In https://github.com/pytorch/pytorch/pull/106824, export decided to slow-path for MultiHeadAttention module (look into the PR description as to why). But that PR eventually caused a divergence between Dynamo and export.

Today, strict-export does not inline into builtin modules (like MultiHeadAttention), and therefore make_fx sees the original nn.Module and takes the slow path. But compile inlines into the nn module, and at this time the condition `_is_make_fx_tracing` is False. As a result, Dynamo takes a fast path, resulting in a different op being called.

This divergence is undesirable. There are 2 ways to fix it

1) Make export take the fast path - As explained in the https://github.com/pytorch/pytorch/pull/106824 , this might be difficult. So, we go to (2)
2) Make compile as well take the slow path - This is easy to implement. The con here is that Pytorch eager and compile will use different operators, which can cause numerics issues etc.

Since (2) is easy to do, we will follow this path. We are tracking the issue in  https://github.com/pytorch/pytorch/issues/164062

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164721
Approved by: https://github.com/avikchaudhuri, https://github.com/tugsbayasgalan
2025-10-09 03:25:15 +00:00
a029675f6f More ruff SIM fixes (#164695)
This PR applies ruff `SIM` rules to more files. Most changes are about simplifying `dict.get` because `None` is already the default value.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164695
Approved by: https://github.com/ezyang
2025-10-09 03:24:50 +00:00
54ae61c573 Change test_emulate_precision_casts_mean_ratio_chain from gelu to relu (#164997)
gelu can be instable on local builds due to libdevice differences, as we lower to libdevice.erf. That combined with the semantics in the test can lead to catastrophic cancellation. We switch this test from gelu to relu to fix this instability.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164997
Approved by: https://github.com/eellison, https://github.com/jansel
2025-10-09 03:14:05 +00:00
2fe37b5fde [RecSys][Combo Kernel] skip combo kernel generation if parition group is empty (#164918)
Summary: Noticed sometimes the combo kernel partition will contain empty group. Skip kernel generation in this case to unblock head model launching. The change in this diff is safe, but it's better to root cause why empty group is being created.

Test Plan:
Lowering passed after applying the diff

Differential Revision: D84134471

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164918
Approved by: https://github.com/mlazos
2025-10-09 02:55:23 +00:00
96d91da792 [dynamo] allow placement subclass to be traceble (#164985)
This pr is to unblock SimpleFSDP+`gradient_divide_factor` [here](https://github.com/pytorch/torchtitan/pull/1793). We will need to create a subclass for DTensor `Partial` placement. When tracing `SimpleFSDPPartial`, I hit the assertion error that `SimpleFSDPPartial` is not in `ok_types`. I'm updating the code to check placement dtype via `isinstance` instead of `type(val)`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164985
Approved by: https://github.com/ezyang, https://github.com/eellison
2025-10-09 01:44:21 +00:00
f5fd18f7e2 Use runner with more memory for ASAN builds (#165000)
An attempt to [address OOM here](aed5ed1076/1).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165000
Approved by: https://github.com/seemethere, https://github.com/malfet, https://github.com/huydhn
2025-10-09 01:09:28 +00:00
8ca986ee60 [fr] Enable reset the FR recording for fault tolerance (#164988)
We also want to have a python side API for users to reset FR recording for FR entries. We don't need to reset the PGNCCL's member counter since we are creating new PGNCCL anyway. FR is a global ring buffer, so we need to reset it.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164988
Approved by: https://github.com/tushar00jain
ghstack dependencies: #164752
2025-10-09 01:03:01 +00:00
81dbeb06f4 CUDA aarch64 12.6 and 12.8 builds fix triton constraints (#165013)
Since we have introduced CUDA aarch64 builds for all cuda versions we need to remove this constraint.
This was missed by https://github.com/pytorch/pytorch/pull/162364

Proper constraint on triton should be:
```
Requires-Dist: triton==3.5.0; platform_system == "Linux"
```

not:
```
Requires-Dist: triton==3.5.0; platform_system == "Linux" and platform_machine == "x86_64"
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165013
Approved by: https://github.com/Camyll, https://github.com/nWEIdia, https://github.com/tinglvv
2025-10-09 00:49:28 +00:00
7a1ead755f [DeviceMesh] Add a warning for slicing flattened dim from root mesh and types for _get_slice_mesh_layout (#164993)
As title, we want to add a deprecate warning for slicing flattened dim from root mesh. Also cosmetic changes for adding types for `_get_slice_mesh_layout`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164993
Approved by: https://github.com/fegin
ghstack dependencies: #164750, #164954
2025-10-09 00:47:08 +00:00
90b4e130d6 [Benchmark] cleanup torchbench models (#164816)
Prune models from TorchInductor dashboard to reduce ci cost. This PR prunes torchbench models according to the [doc](https://docs.google.com/document/d/1nLPNNAU-_M9Clx9FMrJ1ycdPxe-xRA54olPnsFzdpoU/edit?tab=t.0), which removes timm and huggingface models from torchbench.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164816
Approved by: https://github.com/anijain2305, https://github.com/seemethere, https://github.com/huydhn, https://github.com/malfet
2025-10-09 00:31:25 +00:00
4308b8a28f [dynamo] Support torch.fx.traceback.annotate (#164678)
Builds on top of https://github.com/pytorch/pytorch/pull/163673 and https://github.com/pytorch/pytorch/pull/164174. This will be used in the followup PRs to apply regional inductor compilation.

The existing implementation let Dynamo trace into the `torch.fx.traceback.annotate`, but thats not what we want. We want Dynamo to essentially run the torch.fx.traceback.annotate function in eager, so that every Fx node created in Dynamo Fx graph has the custom meta node.

What does not work?
* We still have to set the context manager `torch.fx.traceback.preserve_node_meta()` in the user code because CI was unhappy. This can be fixed but with some perseverance.
* This does not work with graph breaks yet. But we can solve that problem, if needed, in a separate PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164678
Approved by: https://github.com/SherlockNoMad, https://github.com/jansel, https://github.com/xmfan
2025-10-08 22:41:00 +00:00
94b1ec8c7c [BE] Use torch check the way its intended (#164987)
Replace
`if (!foo) TORCH_CHECK(false, "bar");` with `TORCH_CHECK(foo, "bar");`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164987
Approved by: https://github.com/albanD, https://github.com/Skylion007
2025-10-08 22:28:08 +00:00
054268c9eb Consider collective inputs to be deallocated only when wait is completed (#164945)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164945
Approved by: https://github.com/IvanKobzarev
ghstack dependencies: #164738, #164783, #164944
2025-10-08 22:19:25 +00:00
af40828bbb Limit coll bucketing within node idxs (#164944)
Respect max_coll_distance from overlap scheduler in bucketing, also, add an optimization in path searching.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164944
Approved by: https://github.com/IvanKobzarev
ghstack dependencies: #164738, #164783
2025-10-08 22:18:53 +00:00
5a1fbf45ad [ez] remove unnecessary wrapper (#164720)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164720
Approved by: https://github.com/ydwu4
2025-10-08 22:12:29 +00:00
aed5ed1076 Refactor memory estimator to use node storages, add test (#164783)
- Update the Memory Estimator to use node storages for analysis, which simplifies book keeping, as opposed to manually looking at operator schema. This will also allow me to reuse this component elsewhere.

- Factor out into separate class, so that this same logic can be used  in scheduling (node allocations / aliasing / uses)

- Adds Tests for correctness - right now only on fwd/bwd by itself, not with both.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164783
Approved by: https://github.com/ruisizhang123
ghstack dependencies: #164738
2025-10-08 22:07:43 +00:00
af4c29fea8 [dynamo, nested graph breaks] fix nested step graph break related issues (#162737)
Turns out codegen'ing a nested step graph break is significantly more complicated than first thought. The optimized function should actually do:
- call graph/load values/do side effects etc.
- call into the leaf's resume function, but skipped (this essentially step graph break function for just the leaf function)
- call into all the other resume functions, traced.

This PR also adds `torch._dynamo.step_unsupported()`, which can be used for internal testing purposes to better test step graph break handling.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162737
Approved by: https://github.com/Lucaskabela
ghstack dependencies: #160601
2025-10-08 22:02:52 +00:00
486b4d2414 [dynamo, nested graph breaks] move cell codegen before side effects codegen (#160601)
This is needed because if we codegen cells for nested frames AFTER side effects, then reconstruction could get messed up. From below:

>The added test case demonstrates the reconstruction failure if we kept cell codegen at the original place (only happens with nested graph breaks since we reconstruct nested frame cells from VariableTracker rather than directly using LOAD_CLOSURE).

>At a high level, what happened before this change was that side_effects was pruning the cells (I don't recall exactly why this happens), and because cells were codegen'd after the side effects were applied, we were unable to properly reconstruct the cell. The error I was seeing was a list/tuple IndexError.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160601
Approved by: https://github.com/mlazos
2025-10-08 22:02:52 +00:00
8f83b3e71c add device generalization support for distributed checkpoint tests (#159242)
## MOTIVATION
To generalize Distributed checkpoint test cases for non-CUDA devices

## CHANGES
18 test files with minimal device abstraction changes updated in
test/distributed/checkpoint/

- Use device_type from DTensorTestBase wherever appropriate
- Replaced hard coded device names with torch.accelerator.current_accelerator()
- extend multi gpu decrator for other devices

test/distributed/checkpoint/test_state_dict_stager.py has large diff, that's because i changed the name cuda_obj  to gpu_obj. Functional change is minimum.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159242
Approved by: https://github.com/guangyey, https://github.com/d4l3k
2025-10-08 21:56:31 +00:00
f0c9f3bddb [PP] [BE] Remove runtime tests (#164962)
BE cleaning up dead code since we migrated the Multi-stage schedules to use schedule execution runtime

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164962
Approved by: https://github.com/Skylion007
ghstack dependencies: #162016
2025-10-08 21:42:33 +00:00
1d182dd81c [MPS] sparse norm (#164961)
Norms for sparse mps tensors

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164961
Approved by: https://github.com/malfet
2025-10-08 21:41:42 +00:00
0b15f7ae05 [fr] Enable dynamic path write for FR dump when it comes to torchft (#164752)
When it comes to FR dump, in the case of fault tolerance, users want to set the dump path to a different one when there is restart, so we just enable this case for users.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164752
Approved by: https://github.com/tushar00jain
2025-10-08 21:36:32 +00:00
f1229b6db9 [BE] Remove manual IP address resolution (#164969)
As https://github.com/pytorch/pytorch/issues/100400 has been closed a while back
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164969
Approved by: https://github.com/seemethere
ghstack dependencies: #164968
2025-10-08 21:22:34 +00:00
b1ac252f55 [Replicate][Test] tests that pp model grads are the same as single-device model grads (#164890)
**Summary:** Created a test so that we can verify that a model that has been pipelined + replicated has the same gradients as a reference model. To do this, I mapped the layers and their parameters in each partial model to the original full model and then compared the gradients.
**Test Case**
1. pytest test/distributed/_composable/test_composability/test_pp_composability.py -k test_replicate_pp_grads

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164890
Approved by: https://github.com/H-Huang
2025-10-08 21:07:05 +00:00
5ba11df4f8 [DeviceMesh] Make all members of DeviceMesh private and add public access API (#164954)
This is mostly mechanical change which make device mesh members all private and use a public property API instead. This is not a BC breaking change since the new API still guarantee BC.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164954
Approved by: https://github.com/fegin
ghstack dependencies: #164750
2025-10-08 21:04:07 +00:00
15800888b6 [CI] Print GPU info during setup linux (#164968)
I.e. run `nvidia-smi` if present

Helps detecting what driver version this runner is on, which would have helped debugging some of the issues recently
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164968
Approved by: https://github.com/ngimel
2025-10-08 20:58:33 +00:00
e7ed1a00eb Run inductor-perf-test-nightly-h100 once per day (#164967)
To reduce inductor costs, though I'm not sure how much this one matters specifically since h100s are reserved

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164967
Approved by: https://github.com/BoyuanFeng
2025-10-08 20:58:19 +00:00
2982406721 [inductor] ban benchmarking by default in deterministic mode (#164532)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164532
Approved by: https://github.com/eellison
ghstack dependencies: #164801
2025-10-08 20:55:15 +00:00
005c3d449e Support custom callback functions in schedule (#162016)
This is going to be used in https://github.com/pytorch/torchtitan/issues/1682

Add a `register_custom_function` to the `_PipelineScheduleRuntime` which allows users to implement any custom function to replace the runtime operation dynamically.

The signature of the callback should look like:

```python
class _CustomFunctionProtocol(Protocol):
    def __call__(self, action: _Action, ctx: _PipelineContext) -> None: ...
```

`_PipelineContext` contains a reference to the schedule which is executing the operations.

### Testing

Added a test which adds custom methods for `FORWARD` and `OVERLAP_F_B` which are just the same implementations as those used in the default schedule runtime. Check that the schedule can still run, numerics are correct, and the callbacks are executed the correct number of times.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162016
Approved by: https://github.com/fegin
2025-10-08 20:43:26 +00:00
b2b3947565 [DeviceMesh] Remove private _set_mesh_dim_group_options API (#164750)
We allow passing in PG option via https://github.com/pytorch/pytorch/pull/159371 and we did a clean up of Meta internal usage of `_set_mesh_dim_group_options`, since this a private API, we don't have any bc guarantee, we want to directly remove so that people use the new behavior from now on.

Also since we now allow passing pg in both DeviceMesh constructor and flatten API, so that we also want to get rid of the global pg option override variable.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164750
Approved by: https://github.com/lw, https://github.com/fegin
2025-10-08 20:38:17 +00:00
81994b08a0 [inductor] don't tune xblock for reduction (#164801)
It turns out that tuning XBLOCK for a reduction can also change numerics ( https://github.com/pytorch/pytorch/pull/164525#pullrequestreview-3306235454 ).

The PR skip tuning XBLOCK for a reduction. If we have multiple configs left with different XBLOCKs, the heuristic will pick the configs with second-largest XBLOCK.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164801
Approved by: https://github.com/jansel, https://github.com/mlazos, https://github.com/v0i0
2025-10-08 20:31:39 +00:00
71aefd5595 [reland] Allow setting grad_dtype on leaf tensors (#164751)
ghstack-source-id: e44b3941530be83a630ec93f1478eec741ffca2e
Pull-Request-resolved: https://github.com/pytorch/pytorch/pull/162815

Fixes #ISSUE_NUMBER

Relanding due to internal weirdness. Separate PR to codev w/o ghstack.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164751
Approved by: https://github.com/albanD
2025-10-08 20:23:13 +00:00
001e1d2637 Add memory estimator (#164738)
Original work by @ShatianWang, with lints applied. I am going to a few changes and add tests in subsequent prs but I want to preserve original commit first.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164738
Approved by: https://github.com/IvanKobzarev
2025-10-08 20:04:33 +00:00
e0cb1848d0 Use TMA loads always for Triton grouped MM kernel (#164256)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164256
Approved by: https://github.com/ngimel
2025-10-08 19:40:06 +00:00
a4110fedcf Use insert_or_assign instead of erase+emplace (#164868)
insert_or_assign does effectively the same thing as
erase+emplace but more efficiently since the search
does not need to be repeated

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164868
Approved by: https://github.com/eqy
2025-10-08 19:13:49 +00:00
37c6087334 Add split-K control to cuBLAS reduced-precision settings (#164766)
## Summary
- add a CuBLASReductionOption enum so the CUDA context can track reduced-precision and split-K options
- extend the Python bindings, backend helpers, and docs to accept an optional allow_splitk argument for fp16/bf16 matmul controls
- update cuBLAS/cuBLASLt call sites plus dynamo guards and tests to respect the new combinations

## Testing
- python test/test_cuda.py TestCuda.test_cublas_allow_fp16_reduced_precision_reduction_get_set -v *(fails: ModuleNotFoundError: No module named 'psutil')*

------
https://chatgpt.com/codex/tasks/task_e_68e404623178832f8a3e1d34e1e175da

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164766
Approved by: https://github.com/malfet, https://github.com/albanD
2025-10-08 18:48:45 +00:00
0b85236477 Fix refine_ranges corner case (#164075) (#164846)
Summary:
address https://github.com/pytorch/pytorch/issues/161360

u0>0 should update the range of u0 to start from [1, ..] this fix it. it was not doing that.

Test Plan: contbuild & OSS CI, see 27234792ad

D84038721

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164846
Approved by: https://github.com/izaitsevfb, https://github.com/ezyang
2025-10-08 18:42:37 +00:00
4c0fec3e4d [Max Autotune][B200] Skip carveout tests (#164435)
Summary: Skip sm `carveout` tests on B200, as carveout is currently unsupported.

Test Plan:
```
buck2 test 'fbcode//mode/opt' fbcode//caffe2/test/inductor:max_autotune -c fbcode.nvcc_arch=b200a -c fbcode.enable_gpu_sections=true -c fbcode.platform010_cuda_version=12.8 -c fbcode.re_gpu_tests=False -- test_honor_sm_carveout_with_triton_tma
```

Differential Revision: D83395610

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164435
Approved by: https://github.com/eellison
2025-10-08 18:39:43 +00:00
cyy
fdc622b513 [CMake] Remove LLVM link code (#134940)
This handling is not needed no recent LLVM APIs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134940
Approved by: https://github.com/ezyang, https://github.com/malfet
2025-10-08 18:39:16 +00:00
91b9484264 [ez] fix small doc error (#164915)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164915
Approved by: https://github.com/svekars
2025-10-08 18:27:44 +00:00
5c827a4133 [SymmMem] Multi-root tile reduction (#164757)
Stack from [ghstack](https://github.com/ezyang/ghstack/tree/0.12.0) (oldest at bottom):

Perform multiple tile reductions concurrently, with each tile reduced to a separate root.

- The number of concurrent reductions can be smaller than world size, i.e. roots can be a subset of all ranks. But all ranks are still required to call into this API.

- Currently supports NVLink SHARP scope only.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164757
Approved by: https://github.com/weifengpy, https://github.com/fegin
ghstack dependencies: #162243
2025-10-08 17:28:00 +00:00
83458197d1 [Benchmark] remove old timm models from benchmark (#164805)
Prune models from TorchInductor dashboard to reduce ci cost. This PR prunes for timm models according to the [doc](https://docs.google.com/document/d/1nLPNNAU-_M9Clx9FMrJ1ycdPxe-xRA54olPnsFzdpoU/edit?tab=t.0), which reduces from 60 to 14 models.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164805
Approved by: https://github.com/anijain2305, https://github.com/seemethere, https://github.com/huydhn, https://github.com/malfet
2025-10-08 17:14:58 +00:00
0b01ff4de0 [ROCm] Improve non stride-one backwards indexing for small index sets (#164409)
This patch fixes a performance problem which occurs when a small set of indices is used and there are practically no duplicates.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164409
Approved by: https://github.com/jerrymannil, https://github.com/jeffdaily
2025-10-08 17:04:52 +00:00
01f3a43462 [MPS] Update OS version in error message (#164946)
Followup after https://github.com/pytorch/pytorch/pull/159912
Fixes https://github.com/pytorch/pytorch/issues/164943

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164946
Approved by: https://github.com/Camyll
2025-10-08 16:43:50 +00:00
f332017294 C++ API handle optimizer defaults (#161825)
Fixes #141884

This fixes the issue for all optimizers and parameter options.
A member function `overwrite_from` is added to the optimizer base class. Each optimizer then implements this function for comparing their accepted parameters to defaults. A SFINAE approach to handle the different optimizer parameters generically (in optimizer.h only) was evaluated, but I think this is easier to review and maintain.

This mirrors the Python API up to one edge case. An example of the edge case is provided below.

Python can distinguish between 1) Key not present in dict = "not specified"  and 2) Key present in dict = "explicitly set". The C++ implementation cannot.
The issue hinges on whether or not to track if a particular parameter was set by the user explicitly or not (discrepancy in the case when the constructor default is explicitly passed in).

To track this seems like it will take more intervention than would be worth it (modify TORCH_ARG to keep track, use std::optional for the parameter types, use bitset tracking) and was not pursued in the current PR. I'm happy to alter the design if appropriate.

### Example of edge case hinging on CONSTRUCTOR DEFAULTS vs OPTIMIZER DEFAULTS

1. CONSTRUCTOR DEFAULTS:
   These are the values you get when calling AdamOptions()
   AdamOptions().lr() = 0.001
   AdamOptions().weight_decay() = 0
   AdamOptions().eps() = 1e-08

2. OPTIMIZER DEFAULTS:
   These are the values the user chose when creating the optimizer
   User's optimizer defaults:
   optimizer.lr() = 0.005
   optimizer.weight_decay() = 0.1
   optimizer.eps() = 1e-07

3. THE PROBLEM SCENARIO:
   User wants to add a parameter group with explicit weight_decay=0.0
   User sets: weight_decay(0)

4. THE CONFUSION:
   Constructor default weight_decay: 0
   User's explicit weight_decay:     0
   Are they equal? YES

   Since they're equal, our overwrite_from() logic thinks:
   "User didn't set weight_decay explicitly, use optimizer default"

5. CURRENT BEHAVIOR:
   Final weight_decay: 0.1
   User expected:      0
   Match?  NO

=== KEY INSIGHT ===
Constructor defaults are built into the C++ class definition.
Optimizer defaults are chosen by the user at runtime. We want to respect the user intention.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161825
Approved by: https://github.com/janeyx99
2025-10-08 16:40:45 +00:00
0a3e4e894c [PP]: Optimize memory by early releasing stage inputs' gradients (#164329)
Seems that we can release input activations' gradients early in `stage_backward()` in PP, which helps to reduce the peak memory.

I tested this using `1F1B` and `Interleaved1F1B` PP strategy (for simplicity, I use 4 decoder layers of llama3, set PP size to 2 and set num_microbatches to 128)  based on torchtitan
run command using torchtitan:
```bash
CUDA_VISIBLE_DEVICES=4,5 LOG_RANK=0,1 NGPU=2 CONFIG_FILE=./torchtitan/models/llama3/train_configs/llama3_8b.toml ./run_train.sh --metrics.log_freq 1  --training.seq_len 8192 --training.steps 10 --parallelism.data_parallel_shard_degree 1 --activation_checkpoint.mode full --model.tokenizer_path /workspace/torchtitan-v0.1.0/torchtitan/torchtitan/datasets/tokenizer/original/tokenizer.model --tr
aining.dataset wikipedia  --parallelism.pipeline_parallel_degree 2  --training.local_batch_size 128 --parallelism.pipeline_parallel_microbatch_size 1 --training.dataset_path /workspace/wikipedia_subset --training.seed 42 --parallelism.pipeline_parallel_schedule 1F1B
```
## 1F1B torchtitan train results
### before fix
<img width="1526" height="606" alt="b8e281cce1dac15e827c216e7d83f402" src="https://github.com/user-attachments/assets/545c0a80-6276-40c0-893f-fd2df0a53b8d" />

### after fix
<img width="1526" height="594" alt="70d5ceba311a8398d041189bf8897cfc" src="https://github.com/user-attachments/assets/0d606e08-238a-4115-a1c0-b40df101d867" />

after fix, the memory usage on rank1, i.e., non first stages saving 6.9GB compare to before fix. the memory usage on rank0 remains unchanged (rank0 represents stage0)

## Interleaved1F1B torchtitan train results
### before fix
<img width="1514" height="601" alt="a28b7f9704b9234870619c43194e8a72" src="https://github.com/user-attachments/assets/2c28565f-ffff-4747-a8f5-722b5c65dc7e" />

### after fix
<img width="1526" height="621" alt="2d8d6d956b72885186f8c7059146c41a" src="https://github.com/user-attachments/assets/8c4a4ff2-336b-4e0b-8ac4-014ae22c2ed1" />

after fix, the memory usage on rank1 saving 14.57GB (rank1 holds layer1 and layer3) and rank0 saving 7.5GB (rank0 holds layer0 and layer2)

## Memory snapshot results
also, I have dumped the memory snapshot to observe the memory under the 1F1B PP strategy.

### before fix
<img width="1906" height="918" alt="6fd4e4ba82b8bacf9ca6edee4f3d5581" src="https://github.com/user-attachments/assets/d1b9245c-b09f-43c5-87ce-87ba48533a70" />

we can see the memory is increasing as pp step_microbatches running. (the lifetime of input activation's gradient, i.e., the output of `FusedRMSNormBackward`  lasts too long)

### after fix
<img width="1903" height="918" alt="2e415f25af6750d06e5e647683b212b9" src="https://github.com/user-attachments/assets/b657c8f6-5a56-46bd-8743-f3b8375c81b0" />

after fix, we got more steady memory usage during training. (the input activation's gradient will be released or return allocator soon)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164329
Approved by: https://github.com/H-Huang
2025-10-08 16:12:00 +00:00
73adac05d1 Triton 3.5.x pin update to 7416ffc (#164587)
Updates triton pin to latest: https://github.com/triton-lang/triton/commits/release/3.5.x/

This updates contains 1 cherry-pick to fix flex_attention_fwd regression on B200:
- https://github.com/triton-lang/triton/pull/8366

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164587
Approved by: https://github.com/atalman
2025-10-08 16:07:18 +00:00
eqy
0d39ecb2ce [cuDNN][RNN] cuDNN RNN supports BFloat16 inputs since 9.13 (#164411)
seems to work

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164411
Approved by: https://github.com/Skylion007
2025-10-08 15:26:50 +00:00
90c0825e2d [GHF] Allow reverts from pytorch-auto-revert app (#164911)
This is a bit weird, but author_login is not a unique field, but author_url is.

Explicitly allow https://github.com/apps/pytorch-auto-revert to issue revert commands

Update mocks by running
```
sed -i -e s/8e262b0495bd934d39dda198d4c09144311c5ddd6cca6a227194bd48dbfe7201/47860a8f57a214a426d1150c29893cbc2aa49507f12b731483b1a1254bca3428/ gql_mocks.json
```

Test plan: Run
```python
from trymerge import GitHubPR
pr=GitHubPR("pytorch", "pytorch", 164660)
print(pr.get_last_comment().author_url, pr.get_comment_by_id(3375785595).author_url)
```
that should produce
```
https://github.com/pytorch-auto-revert https://github.com/apps/pytorch-auto-revert
```
Plus added a regression test that checks two particular comments for revert validity

`pytorch-auto-revert` user is my alter ego :)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164911
Approved by: https://github.com/jeanschmidt
2025-10-08 15:15:45 +00:00
fd4bde430a Revert "list_stored_sd_metadata API. (#160610)"
This reverts commit da903b6a8be422529d47649e89c0d50bb95c37ca.

Reverted https://github.com/pytorch/pytorch/pull/160610 on behalf of https://github.com/jeffdaily due to broke ROCm CI, but flaky also on CUDA CI https://hud.pytorch.org/failure?name=periodic%20%2F%20linux-jammy-rocm-py3.10%20%2F%20test%20(distributed%2C%202%2C%203%2C%20linux.rocm.gpu.mi250.4%2C%20module%3Arocm%2C%20oncall%3Adistributed)&jobName=undefined&failureCaptures=distributed%2Fcheckpoint%2Ftest_list_stored_state_dict.py%3A%3ATestListStateDict%3A%3Atest_list_stored_sd_metadata ([comment](https://github.com/pytorch/pytorch/pull/160610#issuecomment-3382023022))
2025-10-08 15:10:38 +00:00
b5e93ffdcf Revert "Limit path search within range (#164581)"
This reverts commit 415e641572473479fc9d9eaea12762e1a223a9e0.

Reverted https://github.com/pytorch/pytorch/pull/164581 on behalf of https://github.com/eellison due to merge sets makes this trickier ([comment](https://github.com/pytorch/pytorch/pull/164581#issuecomment-3381955240))
2025-10-08 14:56:21 +00:00
f8d0d65ddc Revert "Add memory estimator (#164738)"
This reverts commit ab01a0d7d352e7fd07989b8d6bf035bf82aea74e.

Reverted https://github.com/pytorch/pytorch/pull/164738 on behalf of https://github.com/eellison due to merge sets makes this trickier ([comment](https://github.com/pytorch/pytorch/pull/164581#issuecomment-3381955240))
2025-10-08 14:56:21 +00:00
f46ddb1e65 [ROCm][CI] add gfx1150 gfx1151 to docker images for binary builds (#164854)
Fixes #164346.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164854
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-10-08 14:34:22 +00:00
20082d7136 Revert "fix flex attention eager bwd: more rounding (#164317)"
This reverts commit 41808b2ba9a61ab2f4c7af394c1668d09a4a0331.

Reverted https://github.com/pytorch/pytorch/pull/164317 on behalf of https://github.com/jeffdaily due to inductor/test_flex_attention.py::TestFlexAttentionCUDA::test_builtin_score_mods_seqlen_lt_custom_sparse_block_size_score_mod4_cuda_float16 [GH job link](https://github.com/pytorch/pytorch/actions/runs/18330774537/job/52207370954) [HUD commit link](41808b2ba9) ([comment](https://github.com/pytorch/pytorch/pull/164317#issuecomment-3381812090))
2025-10-08 14:29:10 +00:00
7158aa22e8 remove more (#164753)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164753
Approved by: https://github.com/aorenste, https://github.com/mlazos
ghstack dependencies: #164664, #164665, #164667, #164668
2025-10-08 14:23:38 +00:00
2035f6b2e6 use check_size instead of check_is_size in ops.py (#164668)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164668
Approved by: https://github.com/angelayi
ghstack dependencies: #164664, #164665, #164667
2025-10-08 14:23:38 +00:00
2b58adc3bd [inductor][templates] Distinguish between kernel input nodes and codegen input nodes (#163752)
If there is a single autotuner choice, the wrong type of input node is used to instantiate `TritonTemplateBuffer` through `TritonTemplateCaller.output_node`. This PR distinguishes the input nodes used in `AlgorithmSelectorCache.__call__` between the actual inputs passed to the kernel at runtime, vs the possibly viewed inputs that influence scheduling behaviour (e.g. `MemoryDeps`) and codegen. See the added unit test for more detail.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163752
Approved by: https://github.com/eellison
2025-10-08 14:12:14 +00:00
322091d8d8 [opaque_obj] Add make_fx tracing support (#163278)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163278
Approved by: https://github.com/zou3519
ghstack dependencies: #163279, #163277
2025-10-08 09:09:16 +00:00
2bb4e6876c [opaque obj] Error for torch.library.custom_op infer_schema (#163277)
Unsure how we can get infer_schema to infer the scriptObject type from just the type annotation, so for now will just error clearly and ask users to specify a schema.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163277
Approved by: https://github.com/zou3519
ghstack dependencies: #163279
2025-10-08 09:09:16 +00:00
56ef7743fc [opaque_obj] Add __eq__ and __deepcopy__ (#163279)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163279
Approved by: https://github.com/zou3519
2025-10-08 09:09:16 +00:00
64108bdbed [BC-Breaking] Remove long-deprecated casting functions from native_functions.yaml (#164641)
This PR removes `torch._cast_XXX` from generated OPs. They were deprecated in PyTorch 1

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164641
Approved by: https://github.com/albanD, https://github.com/justinchuby
2025-10-08 08:27:58 +00:00
c855f8632e Pyrefly suppressions 7/n (#164913)
Adds suppressions to pyrefly will typecheck clean: https://github.com/pytorch/pytorch/issues/163283

Almost there!

Test plan:
dmypy restart && python3 scripts/lintrunner.py -a
pyrefly check

step 1: delete lines in the pyrefly.toml file from the project-excludes field
step 2: run pyrefly check
step 3: add suppressions, clean up unused suppressions
before: https://gist.github.com/maggiemoss/4b3bf2037014e116bc00706a16aef199

after:
 INFO 0 errors (6,884 ignored)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164913
Approved by: https://github.com/oulgen
2025-10-08 07:27:17 +00:00
12d2ef557f Update round size with 1 division behavior (#162203)
have round size return nearest power of 2 greater than or equal to size with 1 division

Fixes #161139

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162203
Approved by: https://github.com/ezyang
2025-10-08 06:41:46 +00:00
65aa62d50d Use codegen for the boxed interpreters (#164573)
Authored with claude code.  The arg parsing is kind of horrible, open
to more suggestions.

Signed-off-by: Edward Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164573
Approved by: https://github.com/albanD, https://github.com/jansel
2025-10-08 06:27:44 +00:00
6a09f9306c Fix #164742, all header-impl'd userfacing functions should be inline (#164871)
It is as @mxmpl pointed out; we are missing an inline.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164871
Approved by: https://github.com/mikaylagawarecki
2025-10-08 05:57:19 +00:00
19bf67be32 multimem reduce (#164517)
Modified `multimem_one_shot_all_reduce_out` function to accept a `root` argument, making it a `multimem_reduce` op.

The original `multimem_one_shot_all_reduce` op becomes a caller of the `multimem_reduce`, with each rank providing its own rank id as root.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164517
Approved by: https://github.com/ngimel
2025-10-08 05:25:16 +00:00
1927783aa3 Revert "Reland vision pinned commit hash update (#164492)"
This reverts commit 6861a270624b44954826688f8dad668eb0154452.

Reverted https://github.com/pytorch/pytorch/pull/164492 on behalf of https://github.com/izaitsevfb due to see autorevert msg above, inductor breakage is legit ([comment](https://github.com/pytorch/pytorch/pull/164492#issuecomment-3379537888))
2025-10-08 04:38:26 +00:00
184817c7a8 locks + unit tests (#164636)
Test Plan:
```
buck test fbcode//mode/opt caffe2/test/inductor:caching
```

Reviewed By: aorenste

D83714690

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164636
Approved by: https://github.com/aorenste
2025-10-08 04:34:22 +00:00
da903b6a8b list_stored_sd_metadata API. (#160610)
Summary:
1\ Certain checkpoint load use cases are not aware of the properties of the data/tensors they want to load.
2\ These usecases include data loader checkpoints, reading data for post processing (when the original model definition is not available).
3\ There, we have to use saved checkpoint  (metadata) as our source of truth.
4\ This RFC proposal exposes the checkpoint metadata using a public API.

In this proposal we expose the stored state-dict metadata  (minus associated storage/chunk metadata).

Chunk/storage details should not be exposed to the users and is a impl detail of the storage writer/reader.

Test Plan:
UT.

Rollback Plan:

Differential Revision: D80231457

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160610
Approved by: https://github.com/saumishr
2025-10-08 04:33:51 +00:00
f76fdcaaf8 [Benchmark] cleanup huggingface models (#164815)
Prune models from TorchInductor dashboard to reduce ci cost. This PR prunes for hugging face models according to the [doc](https://docs.google.com/document/d/1nLPNNAU-_M9Clx9FMrJ1ycdPxe-xRA54olPnsFzdpoU/edit?tab=t.0), which reduces from 46 to 27 models.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164815
Approved by: https://github.com/anijain2305, https://github.com/seemethere, https://github.com/huydhn, https://github.com/malfet
2025-10-08 03:21:04 +00:00
608792153f [inductor][codecache] Print bytes in codecache debug output (#164898)
Summary: We have an internal request to help understand why the hash of `post_grad_custom_post_pass` is changing between attempts. We don't get useful info from the debug output, because we just print "<bytes>". Instead, attempt to print at least _some_ of the value in case it contains readable characters.

Test Plan:
Registered a dummy post_grad_custom_pass and printed codecache debug output
`TORCH_LOGS=+torch._inductor.codecache python ~/foo.py`

Yields something like:
```
V1007 16:41:19.024000 3546009 /data/users/slarsen/pytorch-3.10_4/torch/_inductor/codecache.py:989] [0/0] [law2ujt2wzjb5tyiu6jh64r2lxpvl62yvxcsmdouhg3qyelhhdv] post_grad_custom_post_pass: HelloWorld!����������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������...
```

Differential Revision: [D84108770](https://our.internmc.facebook.com/intern/diff/D84108770)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164898
Approved by: https://github.com/oulgen
2025-10-08 02:45:20 +00:00
086dec3235 Pyrefly suppressions 6/n (#164877)
Adds suppressions to pyrefly will typecheck clean: https://github.com/pytorch/pytorch/issues/163283

Almost there!

Test plan:
dmypy restart && python3 scripts/lintrunner.py -a
pyrefly check

step 1: delete lines in the pyrefly.toml file from the project-excludes field
step 2: run pyrefly check
step 3: add suppressions, clean up unused suppressions
before: https://gist.github.com/maggiemoss/4b3bf2037014e116bc00706a16aef199

after:

INFO 0 errors (5,064 ignored)

Only four directories left to enable

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164877
Approved by: https://github.com/oulgen
2025-10-08 02:30:57 +00:00
ad7b2bebc6 Use tuples to have a deterministic ordering. (#164851)
When debugging I noticed some non-deterministic behavior and tracked it down to this literal set. Changed to be a tuple for determinism. Changed two other small literal sets also because using a set for a small lookup like that is slow.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164851
Approved by: https://github.com/bobrenjc93, https://github.com/bdhirsh
2025-10-08 02:12:03 +00:00
d444384003 [SymmMem] Tiled reduce (#162243)
Added op: `tile_reduce(Tensor input, Tensor(a!) out, int root, str group_name)`

For now supports only:
- NVSHMEM backed symmetric tensor;
- 2D tensor and tile;
- torch.float.

Testing on right-bottom quandrant:
```
rank 0:
tensor([[0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 1., 1., 1., 1.],
        [0., 0., 0., 0., 1., 1., 1., 1.],
        [0., 0., 0., 0., 1., 1., 1., 1.],
        [0., 0., 0., 0., 1., 1., 1., 1.]], device='cuda:0')
PASSED
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162243
Approved by: https://github.com/ngimel
2025-10-08 02:03:04 +00:00
3040a5d294 Revert "[dynamo] Support torch.fx.traceback.annotate (#164678)"
This reverts commit 801e282f39e9ef4424dfd3ecfd2b550a44595229.

Reverted https://github.com/pytorch/pytorch/pull/164678 on behalf of https://github.com/izaitsevfb due to breaks executorch internally, see [D84068062](https://www.internalfb.com/diff/D84068062?entry_point=16) ([comment](https://github.com/pytorch/pytorch/pull/164678#issuecomment-3379281844))
2025-10-08 01:49:34 +00:00
97463d4cf3 Revert "Fix double dispatch to Python for detach (#163671)"
This reverts commit c32118dc3e50505fd285e6e448a90883fce11535.

Reverted https://github.com/pytorch/pytorch/pull/163671 on behalf of https://github.com/izaitsevfb due to breaks export tests ([comment](https://github.com/pytorch/pytorch/pull/163671#issuecomment-3379281422))
2025-10-08 01:46:45 +00:00
c813617c53 [PP] Migrate other schedules to use PipelineScheduleRuntime (#164777)
Second fix for https://github.com/pytorch/pytorch/issues/164756

This has been a TODO to make the all schedules execute using the same runtime. Now after this change, schedules will use the same logic for `_PipelineScheduleRuntime` where it adds `UNSHARD` and `RESHARD` operations to the schedules which fixes the issue mentioned above.

<img width="920" height="406" alt="image" src="https://github.com/user-attachments/assets/a4d5bcd0-7dac-43cd-96f9-8ca33cfd8b91" />

A test is failing after the conversion:
- Fixed a gradient scaling issue for dWeight

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164777
Approved by: https://github.com/fegin
ghstack dependencies: #164775
2025-10-08 01:45:57 +00:00
e659661ffa [PP] Fix FSDP unshard/reshard (#164775)
First fix for https://github.com/pytorch/pytorch/issues/164756

In the pipeline IR we call `UNSHARD` and `RESHARD`,  but there is a bug because when we call `module.unshard()` these do not recursively call the FSDP modules, hence leading to sometime call allgather before the module forward.

Since we want the pipeline IR to explicitly handle this, we can call `group.unshard` instead which ensures that all the modules are unsharded.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164775
Approved by: https://github.com/weifengpy
2025-10-08 01:45:57 +00:00
41808b2ba9 fix flex attention eager bwd: more rounding (#164317)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164317
Approved by: https://github.com/drisspg
ghstack dependencies: #163986
2025-10-08 01:17:45 +00:00
c0510dc447 [ContextParallel] add _LoadBalancer classes, and load-balance interface to Context Parallel APIs (#161062)
**Summary**
This PR provides an interface for users to specify how to load-balance the attention
input. The load-balance is essentially a rearrangement of the input tensor(s) over the
seq_dim before sharding and can be specified via an index tensor `rearrange` such
that Q[rearrange] is the balanced Q users want (i.e. `rearrange[i] == j` where `i` is the new
index of `Q[j]` in the balanced Q). An example is the `_generate_round_robin_indices()` added
in https://github.com/pytorch/pytorch/pull/155442.

**New `_LoadBalancer` classes**
New `_LoadBalancer` class (defined in `torch/distributed/tensor/experimental/_load_balancer.py`)
provides one interface for defining load-balance behavior: `_generate_indices(self, restore: bool = False)`.

When `restore == False`, this method should output an index Tensor (namely `rearrange_idx`) such
that QKV will be transformed into Q' K' V' in a way that `Q'[i] == Q[rearrange_idx[i]]` (same applies
to K and V).

When `restore == True`, this method outputs an index Tensor (namely `restore_idx` such that
`Q'[restore_idx] == Q` (same applies to K and V).

**Impact**
2 public CP APIs and 1 private CP API is modified. This PR should be backward-compatible by:
- For uses w/ SDPA, existing users must be using the `context_parallel()` API which does not
take in the extra `load_balancer` argument and solely determines from the global var
`_cp_options.enable_load_balance`.
- For new users including who want to try `flex_attention()`, we require to use the new API
`_context_parallel_buffers` to explicitly shard the QKV input instead of using `context_parallel()`
because we no longer rely on TorchDispatchMode nor TorchFunctionMode for op replacement. And
we also require users to explicitly pass in a `load_balancer` argument if load-balancing is demanded.

**Load-Balance Behavior**
`context_parallel_unshard()`, and `create_cp_block_mask()` APIs now take an extra optional argument
`load_balancer`. This argument is optional because of backward compatibility but we require new users
to explicitly pass in a `load_balancer` if load-balancing is demanded:
- if `load_balancer == None` and `_cp_options.enable_load_balance == False`, CP performs
no load-balancing on input Tensors.
- if `load_balancer == None` and `_cp_options.enable_load_balance ==True`, CP performs
head-tail load-balancing (e.g. split a Tensor into 2*N chunks and first N are called head and
the rest are called tail. Place the first head chunk the last tail chunk on rank 0, and the second
head along with the second last tail chunk on rank 1, and so on).

`_context_parallel_buffers()` also takes the extra optional argument `load_balancer`, but the behavior
is slightly different from the other 2 APIs -- it doesn't branch on `_cp_options.enable_load_balance` :
- if `load_balancer == None`, no load-balancing will be performed
- otherwise, apply load-balancing using `load_balancer._generate_indices()` before sharding.

**Changes**
This PR moves the index Tensor generation logic into a set of LoadBalancer classes and
make LoadBalancer the common interface for Context Parallel APIs that leverages
load-balancing:
* _context_parallel_buffers
* context_parallel_unshard
* create_cp_block_mask

The `_LoadBalancer` classes added are:
- `_LoadBalancer`: the abstract base class that provides “_generate_indices” interface index Tensor generation.
- `_HeadTailLoadBalancer`: Implements head-tail balancing logic.
- `_PerDocumentHeadTailLoadBalancer`: Supports per-document head-tail balancing for batched sequences.

**Test**
`pytest test/distributed/tensor/test_attention.py`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161062
Approved by: https://github.com/fegin
2025-10-08 01:09:14 +00:00
9ec10dc26a utils + unit tests (#164551)
Test Plan:
```
buck test fbcode//mode/opt caffe2/test/inductor:caching
```

Reviewed By: aorenste

Differential Revision: D83714691

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164551
Approved by: https://github.com/aorenste
2025-10-08 01:05:45 +00:00
43fc859625 Don't return values in void functions (#164809)
This PR fixes returning values in void C++ functions.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164809
Approved by: https://github.com/janeyx99
2025-10-08 01:04:14 +00:00
f713abab16 Revert "Enable all flake8-logging-format rules (#164655)"
This reverts commit e98c4e835b1db22092fc93b49d2cddd7b3537d1f.

Reverted https://github.com/pytorch/pytorch/pull/164655 on behalf of https://github.com/malfet due to Looks like it broke lint in trunk, see bd3b98a8a5/1 ([comment](https://github.com/pytorch/pytorch/pull/164655#issuecomment-3379209309))
2025-10-08 00:55:17 +00:00
bd3b98a8a5 [dynamic shapes] make backed_size_oblivious behavior consistent b/w symbolic_shapes/inductor (#164796)
Summary: call guard_or_ directly to enable backed_size_obl in inductor calls to guard_or

Test Plan: CI and unit test added.

Differential Revision: D84009392

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164796
Approved by: https://github.com/laithsakka
2025-10-08 00:19:06 +00:00
e98c4e835b Enable all flake8-logging-format rules (#164655)
These rules are enabled by removing existing suppressions.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164655
Approved by: https://github.com/janeyx99
2025-10-08 00:16:13 +00:00
7b15534434 [export] Fix weight sharing when there is no complete tensor (#164857)
Summary: As titled.

Test Plan: CI

Differential Revision: D84079625

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164857
Approved by: https://github.com/yushangdi
2025-10-07 23:40:13 +00:00
c32118dc3e Fix double dispatch to Python for detach (#163671)
This fixes #71725.

Differential Revision: [D83857880](https://our.internmc.facebook.com/intern/diff/D83857880)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163671
Approved by: https://github.com/ezyang, https://github.com/albanD
2025-10-07 23:34:37 +00:00
e3ae80fc03 [PP] Let PP split BlockMask into micro-BlockMask (#164111)
BlockMask has batch dimension information. So PP has to split it as well just like all other tensors. All the tensors in BlockMask have the batch dimension, so we can just split it without too many issues. However, `mask_mod` requires the batch index as the input, which the value is going to be changed after the split. So we have to wrap it inside a closure to modify the batch index.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164111
Approved by: https://github.com/H-Huang
2025-10-07 23:25:34 +00:00
483f4e0db9 CUDA 13.0 builds fix on Amazon Linux 2023 (#164870)
During 2.9 rc testing I am seeing an issue on Amazon Linux 2023 with CUDA 13.0 builds

This is related to:
 https://github.com/pytorch/pytorch/issues/152756

Workflow: https://github.com/pytorch/test-infra/actions/runs/18324074610/job/52184079262

Error:
```
WARNING: There was an error checking the latest version of pip.
+ python3.11 .ci/pytorch/smoke_test/smoke_test.py --package torchonly
Traceback (most recent call last):
  File "/usr/local/lib64/python3.11/site-packages/torch/__init__.py", line 333, in _load_global_deps
    ctypes.CDLL(global_deps_lib_path, mode=ctypes.RTLD_GLOBAL)
  File "/usr/lib64/python3.11/ctypes/__init__.py", line 376, in __init__
    self._handle = _dlopen(self._name, mode)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^
OSError: libcudart.so.13: cannot open shared object file: No such file or directory

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/pytorch/pytorch/.ci/pytorch/smoke_test/smoke_test.py", line 12, in <module>
    import torch
  File "/usr/local/lib64/python3.11/site-packages/torch/__init__.py", line 425, in <module>
    _load_global_deps()
  File "/usr/local/lib64/python3.11/site-packages/torch/__init__.py", line 383, in _load_global_deps
    _preload_cuda_deps(lib_folder, lib_name)
  File "/usr/local/lib64/python3.11/site-packages/torch/__init__.py", line 317, in _preload_cuda_deps
    raise ValueError(f"{lib_name} not found in the system path {sys.path}")
Traceback (most recent call last):
ValueError: libnvToolsExt.so.*[0-9] not found in the system path ['/pytorch/pytorch/.ci/pytorch/smoke_test', '/usr/lib64/python311.zip', '/usr/lib64/python3.11', '/usr/lib64/python3.11/lib-dynload', '/usr/local/lib64/python3.11/site-packages', '/usr/local/lib/python3.11/site-packages', '/usr/lib64/python3.11/site-packages', '/usr/lib/python3.11/site-packages']
  File "/home/ec2-user/actions-runner/_work/test-infra/test-infra/test-infra/.github/scripts/run_with_env_secrets.py", line 102, in <module>
    main()
  File "/home/ec2-user/actions-runner/_work/test-infra/test-infra/test-infra/.github/scripts/run_with_env_secrets.py", line 98, in main
    run_cmd_or_die(f"docker exec -t {container_name} /exec")
  File "/home/ec2-user/actions-runner/_work/test-infra/test-infra/test-infra/.github/scripts/run_with_env_secrets.py", line 39, in run_cmd_or_die
    raise RuntimeError(f"Command {cmd} failed with exit code {exit_code}")
RuntimeError: Command docker exec -t 7d9c5bd403cac9a9ee824d63a1d6f6057ecce89a7daa94a81617dbf8eff0ff2e /exec failed with exit code 1
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164870
Approved by: https://github.com/Camyll

Co-authored-by: Eli Uriegas <1700823+seemethere@users.noreply.github.com>
2025-10-07 22:52:53 +00:00
d1a62c8036 [BE][Ez]: Enable RUF007 Prefer itertools.pairwise over zip slicing (#164856)
Now that our min version is 3.10 we can support this rule. This is more concise, readable, and efficient than the previous zip slicing.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164856
Approved by: https://github.com/williamwen42
2025-10-07 22:51:17 +00:00
6861a27062 Reland vision pinned commit hash update (#164492)
Redo https://github.com/pytorch/pytorch/pull/154694

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164492
Approved by: https://github.com/yangw-dev
2025-10-07 22:45:05 +00:00
955f21dc2c [ROCm][CI] Add support for gfx1100 in rocm workflow + test skips (#148355)
This PR adds infrastructure support for gfx1100 in the rocm workflow. Nodes have been allocated for this effort.
@dnikolaev-amd contributed all the test skips.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148355
Approved by: https://github.com/jeffdaily

Co-authored-by: Dmitry Nikolaev <dmitry.nikolaev@amd.com>
Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-10-07 22:36:25 +00:00
9f5e1beaf3 [multi-kernel] base tensor sizes for shape cache key (#164499)
to match shape key in 3ca09d65f1/torch/_inductor/select_algorithm.py (L3571)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164499
Approved by: https://github.com/ColinPeppler
2025-10-07 21:27:40 +00:00
2e027e8742 [inductor] Improve bound on the number of dims to match for the block (#163755)
- Removes redundant broadcast code when `len(kernel.range_tree_nodes)` is much larger than `len(range_tree.nodes)`. For example:
```python
# before, the broadcast is to [1, 1, XBLOCK, R0_BLOCK]
tmp0 = tl.reshape(tl.broadcast_to(tl.load(block_ptr0, boundary_check=[2], padding_option='zero', eviction_policy='evict_last')[:, None, :, :], [(511 + XBLOCK) // 512, ((1) * ((1) <= ((511 + XBLOCK) // 512)) + ((511 + XBLOCK) // 512) * (((511 + XBLOCK) // 512) < (1))), ((512) * ((512) <= (XBLOCK)) + (XBLOCK) * ((XBLOCK) < (512))), R0_BLOCK]), [XBLOCK, R0_BLOCK])
# after
tmp0 = tl.reshape(tl.load(block_ptr0, boundary_check=[2], padding_option='zero', eviction_policy='evict_last'), [XBLOCK, R0_BLOCK])
```
- Fix: also save range_tree_nodes per subgraph

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163755
Approved by: https://github.com/eellison, https://github.com/blaine-rister
2025-10-07 21:02:37 +00:00
1e42fde45e Revert "[CUDA] Add experimental green context support for SM carveout (#159104)"
This reverts commit 746fe78ecd52f3e9cfddda41f0ac82dada7bdd0b.

Reverted https://github.com/pytorch/pytorch/pull/159104 on behalf of https://github.com/malfet due to Breaks Windows CD build ([comment](https://github.com/pytorch/pytorch/pull/159104#issuecomment-3378675515))
2025-10-07 20:51:22 +00:00
f505caa71b Revert "multimem reduce (#164517)"
This reverts commit d1cbb74fb16406488a174832e1b58b7c242f418d.

Reverted https://github.com/pytorch/pytorch/pull/164517 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/164517#issuecomment-3378529654))
2025-10-07 20:12:38 +00:00
65f10becdf Support OVERLAP_F_B in schedule (#161072)
Previously, we converted the overlap_f_b into separate forward and backward operations in the plan. This is a small change that includes it in the plan and handles it in the runtime

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161072
Approved by: https://github.com/fegin, https://github.com/wconstab
2025-10-07 19:55:10 +00:00
df640df68a Revert "Reapply "C++-accessible Placements via pybind11 (#163030)" (#164519)"
This reverts commit 8c0bc879b97bc580aaa0777b2d266bdd068cb528.

Reverted https://github.com/pytorch/pytorch/pull/164519 on behalf of https://github.com/malfet due to Still breaks internal workflows ([comment](https://github.com/pytorch/pytorch/pull/164519#issuecomment-3378469432))
2025-10-07 19:46:17 +00:00
4c3c0ef2f1 [precompile] Load source cache for AOT compile as well. (#164773)
Adding source_get_cache also to AOT compile case. Since the guard manager loader code can be shared between AOT and caching, we added a new function load_guard_manager to avoid code duplication between two workflows, for loading guards.

Test Plan: test_guard_serialization.py

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164773
Approved by: https://github.com/yiming0416, https://github.com/dolpm
2025-10-07 18:47:09 +00:00
bc33b10202 fix copy_ for scalar in inductor (#164167)
Fixes #158437

### Summary

- TorchInductor was not properly handling scalar copy operations `(tensor.copy_(scalar_value))`
- Ensured scalar sources are converted to appropriate tensor representations with correct dtype and device

### Impact

- Enables compilation of models using ` tensor.copy_(scalar) `patterns
- module: inductor

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164167
Approved by: https://github.com/shunting314
2025-10-07 18:31:37 +00:00
2855a045b3 Use sym_eq and sym_and on symbolic shapes in common_meta_baddbmm_bmm (#164781)
Differential Revision: D84005053

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164781
Approved by: https://github.com/Skylion007
2025-10-07 18:25:00 +00:00
9ecd092bd9 Add python bindings for NCCL CTA policies (#164309)
NCCLConfig can now be constructed with non-default [cta policies][1]

```python
import torch
from torch.distributed import ProcessGroupNCCL as nccl

config = nccl.NCCLConfig()
config.cta_policy = nccl.NCCL_CTA_POLICY_ZERO  # NCCL version >= 2.28
```

[1]: https://docs.nvidia.com/deeplearning/nccl/archives/nccl_2283/user-guide/docs/api/flags.html#nccl-communicator-cta-policy-flags

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164309
Approved by: https://github.com/eqy
2025-10-07 18:16:20 +00:00
078d475d3b move partition and compiler fns from stage 1 to stage 2 (#164765)
Differential Revision: D83995689

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164765
Approved by: https://github.com/zhxchen17
2025-10-07 18:02:03 +00:00
f37a6523ef Move version.h to torch/headeronly (#164381)
Differential Revision: [D83685392](https://our.internmc.facebook.com/intern/diff/D83685392)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164381
Approved by: https://github.com/janeyx99
2025-10-07 17:47:30 +00:00
b13cd141b3 Add pyrefly suppressions (#164748)
Adds suppressions to pyrefly will typecheck clean: https://github.com/pytorch/pytorch/issues/163283

Test plan:
dmypy restart && python3 scripts/lintrunner.py -a
pyrefly check

step 1: delete lines in the pyrefly.toml file from the `project-excludes` field
step 2: run pyrefly check
step 3: add suppressions, clean up unused suppressions
before: https://gist.github.com/maggiemoss/4b3bf2037014e116bc00706a16aef199

after:

0 errors (4,263 ignored)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164748
Approved by: https://github.com/oulgen
2025-10-07 17:31:18 +00:00
5e47b4dd60 Remove device_id param from DeviceCachingAllocator::malloc (#164798)
The `malloc` call in DeviceCachingAllocator accepts a DeviceIndex param which
can be confusion because the allocator can only allocate memory for the device
that it corresponds to. This associated device is fixed at construction time
and the runtime param can be misleading.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164798
Approved by: https://github.com/ngimel, https://github.com/cyyever, https://github.com/eqy
2025-10-07 16:42:04 +00:00
ee5389d520 Enable batch samples in sparse tests (#164677)
The test cases are enabled because the issue was fixed.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164677
Approved by: https://github.com/albanD
2025-10-07 15:58:37 +00:00
ab01a0d7d3 Add memory estimator (#164738)
Original work by @ShatianWang, with lints applied. I am going to a few changes and add tests in subsequent prs but I want to preserve original commit first.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164738
Approved by: https://github.com/IvanKobzarev
ghstack dependencies: #164568, #164569, #164581
2025-10-07 15:32:27 +00:00
801e282f39 [dynamo] Support torch.fx.traceback.annotate (#164678)
Builds on top of https://github.com/pytorch/pytorch/pull/163673 and https://github.com/pytorch/pytorch/pull/164174. This will be used in the followup PRs to apply regional inductor compilation.

The existing implementation let Dynamo trace into the `torch.fx.traceback.annotate`, but thats not what we want. We want Dynamo to essentially run the torch.fx.traceback.annotate function in eager, so that every Fx node created in Dynamo Fx graph has the custom meta node.

This does not work with graph breaks yet. But we can solve that problem, if needed, in a separate PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164678
Approved by: https://github.com/SherlockNoMad, https://github.com/jansel, https://github.com/xmfan
2025-10-07 14:54:26 +00:00
87c9fbda22 Follow up to PR 163980 for s390x (#164464)
Now with same updates propagated to s390x it works on s390x runners too.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164464
Approved by: https://github.com/atalman
2025-10-07 12:02:29 +00:00
3cc8af2d67 torch.topk: refactor global histogram/cumsum into a dedicated kernel to eliminate redundant memory access (#164459)
# TLDR
This PR removes the regression in torch.topk introduced from torch 2.7.0 and delivers much better performance for large inputs.

The table below reports execution times on H20 for various input sizes with float32 data, extracting the top-100 values. Results indicate that this PR restores and improves performance, especially on large inputs.
| Input Shape    | torch2.6.0 (ms) | torch2.8.0 (ms) | 2.8.0+this PR (ms) |
| -------------- | --------------- | --------------- | ------------------ |
| (1, 1B)        | 36.6            | 1564.1          | 25.6               |
| (1, 100M)      | 3.56            | 17.4            | 2.54               |
| (1, 1000,000)  | 0.135           | 0.145           | 0.098              |
| (512, 128000)  | 1.33            | 1.33            | 1.32               |
| (8192, 128000) | 19.6            | 19.6            | 19.4               |

# Background
After upgrading PyTorch from 2.6.0 to 2.7.0, we observed a significant GPU performance regression in `torch.topk` on NVIDIA GPUs. For instance, extracting the top-1000 largest values from one billion floats on an NVIDIA H20 increased from **36 ms** to **1.6 s**.

Profiling with Nsight Compute indicates that the slowdown is caused by redundant memory accesses introduced in [PR #145536](https://github.com/pytorch/pytorch/pull/145536).

# Analysis

`torch.topk` relies on **RadixSelect** to find the target values. Each radix pass requires computing a histogram of the input values. For large inputs, histogram computation is split into two stages:

1. **Local histogram**: Each CUDA block processes a subset of the input and writes its local histogram to global memory.
2. **Global reduction**: A single CUDA block reads all local histograms from global memory and reduces them into the final global histogram.

Before [PR #145536](https://github.com/pytorch/pytorch/pull/145536), both stages ran inside a single kernel (`radixFindKthValues`), using a semaphore to ensure that all local histograms were completed before reduction.

In PR #145536, the global histogram computation was merged with subsequent top-k calculations into a single kernel (`computeBlockwiseKthCounts`) to avoid the semaphore. While this simplifies synchronization, it introduces **redundant memory reads**:

- `computeBlockwiseKthCounts` launches `numInputSlices * blocks_per_slice` blocks.
- For each row (slice), `blocks_per_slice` CUDA blocks redundantly reload the same local histograms from global memory.

# This PR

To address this inefficiency, we introduce the following optimizations:

1. **Dedicated kernel**: Refactor global histogram and cumsum computation into a separate GPU kernel, `computeDigitCumSum`.
2. **Loop unrolling**: Apply loop unrolling in `computeDigitCumSum` to speed up local histogram reads.

# Performance
We benchmarked torch.topk on NVIDIA H20 with float32 inputs, extracting the top-100 values across different input sizes. The results in the table below demonstrate that this PR effectively eliminates the performance regression introduced in 2.7.0 and delivers substantial improvements on large inputs.

| Input Shape    | torch2.6.0 (ms) | torch2.8.0 (ms) | 2.8.0+this PR (ms) |
| -------------- | --------------- | --------------- | ------------------ |
| (1, 1B)        | 36.6            | 1564.1          | 25.6               |
| (1, 100M)      | 3.56            | 17.4            | 2.54               |
| (1, 1000,000)  | 0.135           | 0.145           | 0.098              |
| (512, 128000)  | 1.33            | 1.33            | 1.32               |
| (8192, 128000) | 19.6            | 19.6            | 19.4               |

Besides, I have verified the correctness of this PR with different inputs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164459
Approved by: https://github.com/ngimel, https://github.com/Skylion007
2025-10-07 11:04:03 +00:00
1fb072ac2a exceptions + unit tests (#164550)
Test Plan:
```
buck test fbcode//mode/opt caffe2/test/inductor:caching
```

Reviewed By: aorenste

Differential Revision: D83714688

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164550
Approved by: https://github.com/aorenste
2025-10-07 10:04:58 +00:00
cac5e13e13 [dynamo] Inline nn module calls using __call__ methods (#164817)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164817
Approved by: https://github.com/SherlockNoMad, https://github.com/mlazos
2025-10-07 08:57:20 +00:00
68350660ee Increase timeout for nightly macOS performance tests to 300 minutes (#164793)
the Test step time recently went slightly up.

hopefully this fixes https://github.com/pytorch/alerting-infra/issues/263
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164793
Approved by: https://github.com/seemethere
2025-10-07 08:44:07 +00:00
ef7e2ca77e remove check_is_size from test_misc.py (#164667)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164667
Approved by: https://github.com/angelayi
ghstack dependencies: #164664, #164665
2025-10-07 07:33:50 +00:00
cdaaf3e4a3 remove size-like based size-oblivious special max simplifications (#164665)
As we removed guard_size_oblivious this simplification is no longer relevant, this is part of the process of
deprecation for guard_size_oblivious and its dependencies.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164665
Approved by: https://github.com/aorenste
ghstack dependencies: #164664
2025-10-07 07:33:50 +00:00
0ea59c3c55 do not suggest torch._check_is_size() (#164664)
size like concept for data dependency is not relevant anymore as we removed all guard_size_oblivious calls.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164664
Approved by: https://github.com/angelayi, https://github.com/mlazos
2025-10-07 07:33:50 +00:00
8f705d019a context + unit tests (#164549)
Summary:
the context module provides configurable context selection + isolation key hashing;

context selection is broken into runtime and compile context. runtime context is decided at call time (inductor configs, precision configs, etc.) and compile context is decided at compile time (hardware type, software hashes).

callees will be given access to SelectedRuntimeContext and SelectedCompileContext, which they can use to determine and select what context is necessary with regards to the function which is being cached.

these selected contexts are wrapped in an IsolationSchema, which denotes what context should be taken into consideration when producing an isolation key. The isolation key is essentially a salt of the function signature key, which says that some function signature key result is valid under a given context (isolation schema)

Test Plan:
```
buck test fbcode//mode/opt caffe2/test/inductor:caching
```

Reviewed By: aorenste

 D83714689

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164549
Approved by: https://github.com/aorenste
2025-10-07 06:02:10 +00:00
4bcc05777e [torchfuzz] synthesize inputs for data dependent ops (#164716)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164716
Approved by: https://github.com/pianpwk
ghstack dependencies: #164432, #164434, #164514, #164646, #164647, #164649, #164687, #164688, #164693, #164694, #164715
2025-10-07 05:40:32 +00:00
2a6cdba6e5 [torchfuzz] various edge case fixes (#164715)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164715
Approved by: https://github.com/pianpwk
ghstack dependencies: #164432, #164434, #164514, #164646, #164647, #164649, #164687, #164688, #164693, #164694
2025-10-07 05:30:46 +00:00
53f6cc7529 [torchfuzz] make ops_fuzzer deterministic (#164694)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164694
Approved by: https://github.com/pianpwk
ghstack dependencies: #164432, #164434, #164514, #164646, #164647, #164649, #164687, #164688, #164693
2025-10-07 05:30:46 +00:00
ac901bf79a [torchfuzz] consolidate on a base implementation of args_codegen (#164693)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164693
Approved by: https://github.com/pianpwk
ghstack dependencies: #164432, #164434, #164514, #164646, #164647, #164649, #164687, #164688
2025-10-07 05:20:28 +00:00
c965d6dbb2 [torchfuzz] move into experimental dir (#164688)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164688
Approved by: https://github.com/pianpwk
ghstack dependencies: #164432, #164434, #164514, #164646, #164647, #164649, #164687
2025-10-07 05:09:08 +00:00
ac08556f67 [torchfuzz] support more unbacked functions (#164687)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164687
Approved by: https://github.com/pianpwk
ghstack dependencies: #164432, #164434, #164514, #164646, #164647, #164649
2025-10-07 05:00:03 +00:00
5fe7f29b9e [torchfuzz] add support for operator weights (#164649)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164649
Approved by: https://github.com/pianpwk
ghstack dependencies: #164432, #164434, #164514, #164646, #164647
2025-10-07 05:00:03 +00:00
ded099ecbf [torchfuzz] don't use the first gpu in multi process fuzzer (#164647)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164647
Approved by: https://github.com/pianpwk
ghstack dependencies: #164432, #164434, #164514, #164646
2025-10-07 04:59:56 +00:00
63fcc3e6c4 [torchfuzz] update README.md (#164646)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164646
Approved by: https://github.com/pianpwk
ghstack dependencies: #164432, #164434, #164514
2025-10-07 04:59:50 +00:00
fd3e15c14f Fix typo in class definition of bytecodedispatchtable (#164762)
ghstack-source-id: 84f0d7bb7e3780ca75473782abfae530010be56e
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164761

Fixes the type in naming of bytecodedispatchtable

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164762
Approved by: https://github.com/StrongerXi, https://github.com/williamwen42
2025-10-07 04:36:09 +00:00
ff5faa744a Remove unused THPXXX macros (#164660)
These macros are not used in OSS.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164660
Approved by: https://github.com/albanD
2025-10-07 04:04:21 +00:00
4725871a81 Return fake mode from export graph capture API (#164730)
This PR is to temporarily unblock various experiments to re-use dynamo create fake mode. Note that this is still not what we want as the end state. The end state should look sth like:
```
out = fulllgraph_capture(mod, inputs)
fake_mode = out.backend_inputs.fake_mode
gm  = out.module()
```
This doesn't work today because export requires we need to wrap the original module to setup a flat module to trace for easier handling of pytree. As a result, we would need to carry export specific flag in fullgraph_capture which seems not ideal.
Regardless, the end state is that we need to give downstream user a graph module and a fake mode in some form, so I think _dynamo_graph_capture_for_export returning a fake mode within graph module itself via gm.meta

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164730
Approved by: https://github.com/avikchaudhuri
2025-10-07 03:42:46 +00:00
bcd96cc6ff [annotate] Copy fwd to bwd metadata for subgraphs as well (#164795)
The test is in the next PR. My older PR on dynamo annotate - https://github.com/pytorch/pytorch/pull/164678 is getting reverted due to unknown reasons, so difficult to add a test right now in this PR.  When I reland, I can add a test for this as well.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164795
Approved by: https://github.com/yushangdi
ghstack dependencies: #164772
2025-10-07 02:42:47 +00:00
50e077beaa Fix outdated info in requirements-ci.txt (#164441)
Fixes installation instructions and descriptions for `numba` and `scikit-image`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164441
Approved by: https://github.com/albanD
2025-10-07 02:10:41 +00:00
56d66ac0d7 Make custom op alias check consistent (#164576)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164576
Approved by: https://github.com/soulitzer
2025-10-07 02:05:09 +00:00
49f7d8d19d [ROCm] Fix test_cuda_synchronize failure on ROCm (#164735)
This PR skips the hipify step of torch/csrc/jit/ir/ir.h to avoid a build-time error for the JIT cuda namespace.  This fixes two skipped tests in test/jit/test_cuda.py.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164735
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-10-07 01:14:24 +00:00
afee8062d5 Revert "Fix mesh.get_local_rank when it is > 1d (#164473)"
This reverts commit 83d71dfb2fd993a6242372b8123549acaa85ffdb.

Reverted https://github.com/pytorch/pytorch/pull/164473 on behalf of https://github.com/izaitsevfb due to appears to be causing vision_maskrcnn regression ([comment](https://github.com/pytorch/pytorch/pull/164473#issuecomment-3374738997))
2025-10-07 00:37:41 +00:00
e89d12bf5d Numpy zerotensor handling (#164487)
Fixes #89034

Updated tensor_to_numpy() function in tensor_numpy.cpp to handle ZeroTensors by throwing an error if force=False and returning an array full of zeros if force=True.

@ngimel, I just saw that you mentioned PyTorch is not too concerned with this issue but I had already worked on it so I figured I would push it anyways and see what you thought. Feel free to close the PR if you think it is not worth merging.

@albanD

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164487
Approved by: https://github.com/izaitsevfb
2025-10-07 00:34:14 +00:00
d4752bc7f6 [caffe2] tweak Unpickler::readInstruction handling TUPLE (#164764)
Summary: Creating the vector was a bit awkward. Use the natural iterator-pair constructor with move-iterators.

Test Plan: CI.

Reviewed By: dolpm

Differential Revision: D83995108

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164764
Approved by: https://github.com/drisspg
2025-10-07 00:18:10 +00:00
44a5d41993 [ROCm] add gfx1150 gfx1151 to supported gemm lists (#164744)
This is one of a few PRs needed to address https://github.com/pytorch/pytorch/pull/164744 fully.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164744
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-10-07 00:02:23 +00:00
361c5d362c [fx][traceback] Actually disable preservation of node metadata when enable=False (#164772)
This will come in handy when we run graph passes that add new nodes, and
create_proxy can add seq_nr meta.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164772
Approved by: https://github.com/SherlockNoMad
2025-10-06 23:39:12 +00:00
1fc71d1b57 Revert "Numpy zerotensor handling (#164487)"
This reverts commit f7ad6dbad67161333a1473d1e0b478b7475a0ec1.

Reverted https://github.com/pytorch/pytorch/pull/164487 on behalf of https://github.com/malfet due to Did it break torchbench?, see 8c728e129d/1 ([comment](https://github.com/pytorch/pytorch/pull/164487#issuecomment-3374635051))
2025-10-06 23:32:12 +00:00
8f54e27e5d [ROCm][CI] rebuild magma binary for gfx1150 gfx1151 (#164782)
After #164763 added gfx1150 gfx1151 to list of targets, this PR will trigger rebuild of magma binary for ROCm 7 with the new targets.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164782
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-10-06 23:29:21 +00:00
8c0bc879b9 Reapply "C++-accessible Placements via pybind11 (#163030)" (#164519)
This makes Placement data representation available in C++ via pybind11. Reapply with fix for internal errors.

D83788896

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164519
Approved by: https://github.com/Skylion007, https://github.com/ezyang
2025-10-06 23:19:14 +00:00
746fe78ecd [CUDA] Add experimental green context support for SM carveout (#159104)
Low-level PyTorch APIs should be usable/stable enough at this point but we might move the underlying driver API usage a bit from here...

Built on top of @drisspg 's branch

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159104
Approved by: https://github.com/ngimel

Co-authored-by: drisspg <drisspguessous@gmail.com>
2025-10-06 23:11:23 +00:00
b63bbe1661 Remove old ROCm version check in tests (#164245)
This PR removes ROCm<6 version checks.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164245
Approved by: https://github.com/jeffdaily
2025-10-06 22:42:01 +00:00
3912ba3e94 Revert "Fix refine_ranges corner case (#164075)"
This reverts commit 27234792add2ee9bedd84ca02dbf34f8f244bc5c.

Reverted https://github.com/pytorch/pytorch/pull/164075 on behalf of https://github.com/izaitsevfb due to fails executorch builds, see [D83938444](https://www.internalfb.com/diff/D83938444) ([comment](https://github.com/pytorch/pytorch/pull/164075#issuecomment-3374430964))
2025-10-06 22:09:39 +00:00
cfc5cc17dc Revert "[dynamo] Support torch.fx.traceback.annotate (#164678)"
This reverts commit 2883b5ab773daf5861d43ff0b65be49a441ab3f9.

Reverted https://github.com/pytorch/pytorch/pull/164678 on behalf of https://github.com/izaitsevfb due to fails inductor:max_autotune tests internally, see D83948169 ([comment](https://github.com/pytorch/pytorch/pull/164678#issuecomment-3374407009))
2025-10-06 22:03:42 +00:00
fdc8ccc5bc Make Adam, AdamW work with nonzero-dim Tensor betas (#149939)
Fixes #147921

## Changes

- Convert tensor `betas` using `_to_scalar`
- Change annotation of `betas` param
- Change param type in docs

## Test Result

```bash
pytest -s test/test_optim.py -k test_tensor_lr -vv
```

![image](https://github.com/user-attachments/assets/312ee045-1e8b-4789-aa6e-ba63e6df7e81)

![image](https://github.com/user-attachments/assets/7e6ec274-645b-46b9-b1a6-2b340a685203)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149939
Approved by: https://github.com/janeyx99

Co-authored-by: Jane (Yuan) Xu <31798555+janeyx99@users.noreply.github.com>
2025-10-06 22:03:25 +00:00
48b54b45d6 Replace pynvml with nvidia-ml-py in win-test.sh (#164681)
pynvml was deprecated.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164681
Approved by: https://github.com/Aidyn-A, https://github.com/eqy
2025-10-06 21:57:26 +00:00
6861fa43e5 [CUDA] Cleanup persistent cuBLASLt workspaces before compile-regions test (#163299)
Fixes some tests that seemed to start flaking out as reported in #163202, due to cuBLASLt workspaces becoming persistent following that change.

It's relatively obvious why the workspaces/allocations corresponding to them should be cleaned up for `test_memory_snapshot_script` but less obvious for `test_memory_plots_free_segment_stack`?  Why does not cleaning up workspace prevent `empty_cache` from showing up?

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163299
Approved by: https://github.com/albanD
2025-10-06 21:13:03 +00:00
c1f40d33c8 Fix docker build issue after 164575 (#164774)
Looks like https://github.com/pytorch/pytorch/pull/164575 introduced an issue.
The command is wrong:
```
conda install -c "whl/nightly" -y python=3.11 conda=25.7.0
```
Should be just using default conda channel:
```
conda install  -y python=3.11 conda=25.7.0
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164774
Approved by: https://github.com/Camyll
2025-10-06 20:28:20 +00:00
7e7ac2039d [ROCm][CI] add gfx1150 gfx1151 to almalinux image (#164763)
First PR necessary to address missing gfx1151 reported in https://github.com/pytorch/pytorch/issues/164346.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164763
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-10-06 20:19:43 +00:00
23ab6a45e5 [precompile][ez] Add instrumentation for guard loading/building. (#164602)
Summary: as title.

Test Plan: CI

Differential Revision: D83868533

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164602
Approved by: https://github.com/dolpm
2025-10-06 20:16:09 +00:00
b558c986e8 Add regression test for get_root_mesh with multiple independent meshes (#164731)
Fixes #163330

I tried to reproduce the bug with my 4-GPU setup (the original issue used 8 GPUs). I created several different test scenarios, trying to trigger the bug by:
- creating two different device meshes
- slicing them in various ways
- checking if get_root_mesh() would get confused

but the bug didn't show up! Everything worked correctly in `2.10`. I found that there was a massive refactoring of the `DeviceMesh` code (PR #163213) that landed on October 2nd. That PR completely rewrote how `DeviceMesh` tracks relationships between parent meshes and submeshes using. It seems like this refactoring fixed the bug! But I added a regression test to make sure it doesn't come back. The test (`test_get_root_mesh_multiple_independent_meshes`) does exactly what the bug report described:
  - creates two independent meshes
  - slices them both
  - verifies that each submesh correctly points back to its real parent
  - makes sure submeshes from mesh1 don't incorrectly claim mesh2 as their parent

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164731
Approved by: https://github.com/fduwjj
2025-10-06 18:52:25 +00:00
415e641572 Limit path search within range (#164581)
When we are looking if two nodes are dependent, limit path search within the bounds of their node idxs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164581
Approved by: https://github.com/ezyang
ghstack dependencies: #164568, #164569
2025-10-06 18:29:27 +00:00
11f5f65686 Use PyObject_GetOptionalAttrString in PyObject_FastGetAttrString when available (#164624)
Python 3.13 added PyObject_GetOptionalAttrString. I'm not 100% certain that it is strictly better than the old approach in all cases, but based on documentation/comments it seems to be meant for this type of use, and it's faster when I profile torchtitan training (which gets to the "check for the `__torch_function__` attr on some object" part of maybe_has_torch_function frequently enough to notice, but wastes a bunch of time generating exceptions that we then suppressed here).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164624
Approved by: https://github.com/Skylion007
2025-10-06 18:26:09 +00:00
af32d16a71 Add pure view support in autograd Function (#164736)
This is the same as https://github.com/pytorch/pytorch/pull/164467
But it needs to be co-deved due to internal insanity.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164736
Approved by: https://github.com/soulitzer
2025-10-06 18:21:05 +00:00
ba480d6bf7 torch.compile: Increase subprocess parent death check interval to lower cpu (#164594)
Summary:
This check is a good idea (we could potentially do it with prctl). However
we're seeing elevated rates of cpu usage in idle worker threads. This causes issues on production jobs, causing a large amount of spikeness in qps.

Test Plan:
Tested on a prod job with caches force disabled via
TORCH_COMPILE_FORCE_DISABLE_CACHES=1

Baseline
<img width="454" height="403" alt="image" src="https://github.com/user-attachments/assets/b88583a1-5b99-48cb-b03d-cd9b69546579" />

With this diff -
<img width="426" height="403" alt="image" src="https://github.com/user-attachments/assets/431217f1-0ed0-4f6e-9d81-6428bf34e0e3" />

Differential Revision: D83803302

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164594
Approved by: https://github.com/masnesral
2025-10-06 18:15:21 +00:00
4a6abba0d9 [ROCm][CI] test_convolution.py uses miopen immediate mode (#164598)
This should help stabilize some flaky test behavior where miopen would pick different solutions for different parts of the same test and the test expects bitwise identical results.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164598
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-10-06 17:48:50 +00:00
96181d6f76 [BE][cutlass backend] BE changes post cutlass_cppgen name change (#164589)
Differential Revision: D83809105

Handle reviews from https://github.com/pytorch/pytorch/pull/164159

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164589
Approved by: https://github.com/Skylion007
2025-10-06 17:22:08 +00:00
2164b66121 [export] Better state_dict and constant dedup in torch.export.save (#164196)
Summary:

Previously, weight deduplication was done by simply grouping tensors with their untyped storage and saving the first tensor in the group.

A more rigorous approach would be to find a complete tensor that covers the storage and store that tensor. This is particularly important for GPU weights because when saving to raw bytes, we move the weight to CPU first, and if the weight being saved is not a complete one, it will lose the storage information during the copy to CPU.

In this diff, we reuse code in `_package_weights.py` for better weights and constants deduplication in `torch.export.save`.

Test Plan: buck2 run mode/dev-nosan caffe2/test:test_export -- -r test_weight_sharing_gpu

Differential Revision: D83523690

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164196
Approved by: https://github.com/angelayi
2025-10-06 17:03:15 +00:00
bde18c445d [Max Autotune][B200] Relax absolute tolerance for MM+MM test (#164022)
Summary: Relax absolute tolerance from 1e-2 to 1e-1 for `test_non_contiguous_input_mm_plus_mm` in `test_max_autotune.py`.

Test Plan: `test_max_autotune.py`

Differential Revision: D83391942

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164022
Approved by: https://github.com/eellison
2025-10-06 16:29:07 +00:00
f3e43ff2d7 [Max Autotune][B200] Fix decompose_k test failure (#164021)
Summary:
Fix decompose_k test failure (`test_max_autotune_decompose_k `) in `test_max_autotune.py` on B200s by setting `torch._inductor.config` patches for variables `comprehensive_padding` and `shape_padding`. Initial failure was `AssertionError: False is not true : Could not find a split in {3, 9, 2187, 81, 243, 729, 27} in # AOT ID: ['6_forward']`.

Refactor decompose_k test to follow patch semantics when setting all environment variables within a test.

Test Plan:
`test_max_autotune.py`:
```
buck2 test 'fbcode//mode/opt' fbcode//caffe2/test/inductor:max_autotune -c fbcode.nvcc_arch=b200a -c fbcode.enable_gpu_sections=true -c fbcode.platform010_cuda_version=12.8 -c fbcode.re_gpu_tests=False -- test_max_autotune_decompose_k
```

Differential Revision: D83390563

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164021
Approved by: https://github.com/njriasan, https://github.com/mlazos, https://github.com/eellison
2025-10-06 16:28:23 +00:00
39d0c06ed0 [torchfuzz] check in some more xfail repros (#164619)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164619
Approved by: https://github.com/ezyang
2025-10-06 16:20:44 +00:00
4ab847bbc7 Pyrefly suppressions 4/n (#164615)
Adds suppressions to pyrefly will typecheck clean: https://github.com/pytorch/pytorch/issues/163283

Test plan:
dmypy restart && python3 scripts/lintrunner.py -a
pyrefly check

step 1: uncomment lines in the pyrefly.toml file
step 2: run pyrefly check
step 3: add suppressions, clean up unused suppressions
before: https://gist.github.com/maggiemoss/356645cf8cfe33123d9a27f23b30f7b1

after:

0 errors (2,753 ignored)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164615
Approved by: https://github.com/oulgen
2025-10-06 16:14:36 +00:00
4bd1505f84 [precompile][ez] Inline type definition for dynamo cache entry. (#164580)
Summary: as title. DynamoCaptureOutput in package.py is not actively used in other files. Inline it to reduce confusion.

Test Plan: CI

Differential Revision: D83846957

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164580
Approved by: https://github.com/dolpm
2025-10-06 16:00:59 +00:00
1f9614cef8 [ROCm][CI] Change rocm periodic workflow label to linux.rocm.gpu.mi250.4 (#164616)
Testing done on this PR: https://github.com/pytorch/pytorch/pull/156491

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164616
Approved by: https://github.com/jeffdaily, https://github.com/huydhn
2025-10-06 15:51:07 +00:00
35f66b83f8 respect aten planned overlap in inductor (#164569)
Now that we have a hop to add implicit deps - use those deps for comm/compute overlap.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164569
Approved by: https://github.com/ezyang, https://github.com/IvanKobzarev
ghstack dependencies: #164568
2025-10-06 15:47:55 +00:00
4a39820e5e Add hop for additional control dependencies (#164568)
Adds [control_deps](https://en.wikipedia.org/wiki/Control_dependency) higher-order operator to enforce explicit scheduling dependencies in FX graphs. This prevents unwanted operation reordering/fusion by giving nodes additional dependencies, which we also respect in inductor by adding weakdeps on the additional dependencies.

This can be generally useful (such as for ordering collectives) but in this case I am using it so that fusions do not interfere with aten planned comm-compute overlap.

There's definitely some similarity with the `with_effects` hop. Talked with @angelayi  - when @zou3519  is back we will figure out how we want to consolidate.

The implementation needs to be a subgraph (as opposed to `with_effects`) because inductor relies on `V.graph.current_node`. Changing the signature of the node with `with_effects`  breaks this, and additionally, also breaks striding constraints on the wrapped node - see this [TODO](aed66248a0/torch/fx/experimental/proxy_tensor.py (L1246-L1249)). By maintaining the node with its original calling structure in subgraph this all works.

Example transformation:

Before:
```
%add : [num_users=1] = call_function[target=torch.ops.aten.add.Tensor](args = (%arg0_1, 1), kwargs = {})
%mm : [num_users=1] = call_function[target=torch.ops.aten.mm.default](args = (%arg1_1, %arg1_1), kwargs = {})
%mul : [num_users=1] = call_function[target=torch.ops.aten.mul.Tensor](args = (%add, 2), kwargs = {})
```
After:
```
add: "f32[256, 256]" = torch.ops.aten.add.Tensor(arg0_1, 1)
mm: "f32[256, 256]" = torch.ops.higher_order.control_deps((add,), subgraph_mm, arg1_1, arg1_1)
mul: "f32[256, 256]" = torch.ops.higher_order.control_deps((mm,), subgraph_mul, add)
```

The mm operation now explicitly depends on add completing first, and mul depends on mm, with original operations preserved in subgraphs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164568
Approved by: https://github.com/ezyang, https://github.com/IvanKobzarev
2025-10-06 15:47:55 +00:00
600267ea56 Add num_store to inductor_meta and use it to scale persistent reduction x block (#162446)
Scale up XBLOCK for contiguous persistent reductions based on rnumel and number of loads + stores

<img width="928" height="656" alt="Screenshot 2025-09-18 at 5 02 57 PM" src="https://github.com/user-attachments/assets/ec3c561f-2a3f-4459-9e14-653715898da3" />

Differential Revision: [](https://our.internmc.facebook.com/intern/diff/)

Differential Revision: [](https://our.internmc.facebook.com/intern/diff/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162446
Approved by: https://github.com/v0i0, https://github.com/eellison, https://github.com/shunting314
ghstack dependencies: #162296
2025-10-06 14:29:07 +00:00
f11ac803d7 Update slow tests (#164726)
This PR is auto-generated weekly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/weekly.yml).
Update the list of slow tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164726
Approved by: https://github.com/pytorchbot
2025-10-06 12:57:29 +00:00
ea42517e45 [xla hash update] update the pinned xla hash (#164727)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned xla hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164727
Approved by: https://github.com/pytorchbot
2025-10-06 11:54:10 +00:00
91c211fb8c AC should work with pre-dispatch IR (#164505)
Previously we had to rely on turning off export verifier because the AC body was torch IR instead of aten IR. This PR makes it so that we create an IR that is export compatible.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164505
Approved by: https://github.com/ydwu4, https://github.com/xmfan
2025-10-06 11:05:22 +00:00
660e369a68 [FSDP2] check storage equal and consider data_ptr() == 0 (#164595)
resolve https://github.com/pytorch/pytorch/issues/164554

unit test
* `pytest -s test/distributed/_composable/fsdp/test_fully_shard_state_dict.py -k test_cached_state_dict`
* `pytest -s test/distributed/_composable/fsdp/test_fully_shard_init.py -k test_meta_device_1d_init`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164595
Approved by: https://github.com/fegin
2025-10-06 08:44:38 +00:00
2883b5ab77 [dynamo] Support torch.fx.traceback.annotate (#164678)
Builds on top of https://github.com/pytorch/pytorch/pull/163673 and https://github.com/pytorch/pytorch/pull/164174. This will be used in the followup PRs to apply regional inductor compilation.

The existing implementation let Dynamo trace into the `torch.fx.traceback.annotate`, but thats not what we want. We want Dynamo to essentially run the torch.fx.traceback.annotate function in eager, so that every Fx node created in Dynamo Fx graph has the custom meta node.

This does not work with graph breaks yet. But we can solve that problem, if needed, in a separate PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164678
Approved by: https://github.com/SherlockNoMad, https://github.com/jansel, https://github.com/xmfan
2025-10-06 02:59:24 +00:00
9fff8155c3 [2/N] Fix clang-tidy readability checks (#164652)
This PR applies clang-tidy readability checks to jit sources and all headers in the code base.
`readability-redundant-inline-specifier` is suppressed because it incurs too many changes. `readability-redundant-inline-specifier` is used to detect redundant inline specifiers on function and variable declarations. There are many in-class method definitions that are marked inline.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164652
Approved by: https://github.com/Skylion007
2025-10-06 01:06:01 +00:00
331191ce4b Revert "[BE] Make PyObjectSlot use a global PyInterpreter (#162659)"
This reverts commit 29cbcbac4215e0d9070a1b7a07ddaec9a36bbd08.

Reverted https://github.com/pytorch/pytorch/pull/162659 on behalf of https://github.com/izaitsevfb due to reverted internally, see [D83214133](https://www.internalfb.com/diff/D83214133) ([comment](https://github.com/pytorch/pytorch/pull/162659#issuecomment-3369348172))
2025-10-05 21:39:57 +00:00
2c5ed6e7c0 Revert "[2/N] Fix clang-tidy readability checks (#164652)"
This reverts commit 3c5ca685d6f5b6f3971c0cd20a054aa355610419.

Reverted https://github.com/pytorch/pytorch/pull/164652 on behalf of https://github.com/izaitsevfb due to need to revert due to a conflict with revert of https://github.com/pytorch/pytorch/pull/162659 ([comment](https://github.com/pytorch/pytorch/pull/164652#issuecomment-3369346707))
2025-10-05 21:36:57 +00:00
5d7360bb03 Revert "Enable all SIM rules except disabled ones (#164645)"
This reverts commit 321e6026925f6b6e8a36e3a8b7c0295cd7541911.

Reverted https://github.com/pytorch/pytorch/pull/164645 on behalf of https://github.com/izaitsevfb due to causes lint failures ([comment](https://github.com/pytorch/pytorch/pull/164645#issuecomment-3369274351))
2025-10-05 19:32:21 +00:00
321e602692 Enable all SIM rules except disabled ones (#164645)
`SIM` rules are useful for simplifying boolean expressions and enhances code readability.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164645
Approved by: https://github.com/ezyang
2025-10-05 07:38:25 +00:00
3c5ca685d6 [2/N] Fix clang-tidy readability checks (#164652)
This PR applies clang-tidy readability checks to jit sources and all headers in the code base.
`readability-redundant-inline-specifier` is suppressed because it incurs too many changes. `readability-redundant-inline-specifier` is used to detect redundant inline specifiers on function and variable declarations. There are many in-class method definitions that are marked inline.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164652
Approved by: https://github.com/Skylion007
2025-10-05 07:05:11 +00:00
5178d0a480 [Compile] Fix Compile Warning for Capture Id (#163898)
```bash
DEBUG /data/vllm-community-homes/vllm-user-6/pytorch/aten/src/ATen/cuda/CUDAGraph.h(59): warning #68-D: integer conversion resulted in a change of sign
DEBUG     CaptureId_t capture_id_ = -1;
DEBUG                               ^
DEBUG
DEBUG Remark: The warnings can be suppressed with "-diag-suppress <warning-number>"
DEBUG
DEBUG /data/vllm-community-homes/vllm-user-6/pytorch/aten/src/ATen/cuda/CUDAGraph.h(59): warning #68-D: integer conversion resulted in a change of sign
DEBUG     CaptureId_t capture_id_ = -1;
DEBUG                               ^
DEBUG
DEBUG Remark: The warnings can be suppressed with "-diag-suppress <warning-number>"
DEBUG
DEBUG /data/vllm-community-homes/vllm-user-6/pytorch/aten/src/ATen/cuda/CUDAGraph.h(59): warning #68-D: integer conversion resulted in a change of sign
DEBUG     CaptureId_t capture_id_ = -1;
DEBUG                               ^
```

Cuda won't use 0 as a capture id, so it is safe to initialize with 0, which also matches the initialization in `pytorch/aten/src/ATen/native/cudnn/RNN.cpp:2362`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163898
Approved by: https://github.com/houseroad
2025-10-05 06:51:33 +00:00
cf0a00d4f3 Enable ruff FURB161 rule (#164654)
This PR enables FURB161 in ruff.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164654
Approved by: https://github.com/Skylion007
2025-10-04 23:26:28 +00:00
5ed4270440 remove more no longer needed torch._check_is_size calls 1 (#164630)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164630
Approved by: https://github.com/Skylion007
ghstack dependencies: #164627
2025-10-04 22:06:04 +00:00
8c728e129d remove no longer needed torch._check_is_size calls from test_dynamic_shapes (#164627)
No longer needed in those tests to prevent DDE

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164627
Approved by: https://github.com/ezyang
2025-10-04 22:06:04 +00:00
9fc2c6446d remove guard_size_oblivious from is_contiguous python eager eval path. (#164622)
Summary: this should not be needed anymore we shall have explicit is_contiguous_or_false calls where appropriate already !

Test Plan: run existing tests.

Differential Revision: D83884977

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164622
Approved by: https://github.com/bobrenjc93
2025-10-04 21:02:39 +00:00
409aece3f9 [dynamo, 3.14] prevent StackRef compilation in 3.14 Windows (#164400)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164400
Approved by: https://github.com/Camyll, https://github.com/atalman
2025-10-04 18:38:08 +00:00
b116c51330 torch.cond on DTensor triggers an internal assert, add xfail for this. (#164389)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164389
Approved by: https://github.com/albanD
2025-10-04 18:12:06 +00:00
2e1742dd63 Revert "Add device argument to torch.random.get_rng_state (#163034)"
This reverts commit 9580539e2f73d68e89544c713ff460bea3038701.

Reverted https://github.com/pytorch/pytorch/pull/163034 on behalf of https://github.com/cyyever due to It cased partially initialised torch module ([comment](https://github.com/pytorch/pytorch/pull/163034#issuecomment-3368349209))
2025-10-04 15:25:45 +00:00
f7ad6dbad6 Numpy zerotensor handling (#164487)
Fixes #89034

Updated tensor_to_numpy() function in tensor_numpy.cpp to handle ZeroTensors by throwing an error if force=False and returning an array full of zeros if force=True.

@ngimel, I just saw that you mentioned PyTorch is not too concerned with this issue but I had already worked on it so I figured I would push it anyways and see what you thought. Feel free to close the PR if you think it is not worth merging.

@albanD

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164487
Approved by: https://github.com/ngimel, https://github.com/albanD
2025-10-04 12:03:48 +00:00
f46bb04dcc Revert "Add pure view support in autograd Function (#164467)"
This reverts commit 10335ffb2cce26c99958d055f415a16c1d14bc35.

Reverted https://github.com/pytorch/pytorch/pull/164467 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/164467#issuecomment-3368152304))
2025-10-04 11:42:46 +00:00
6f6a919366 Revert "Make custom op alias check consistent (#164576)"
This reverts commit e438db254602cf39ba536aed0590b4144c019ee8.

Reverted https://github.com/pytorch/pytorch/pull/164576 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/164467#issuecomment-3368152304))
2025-10-04 11:42:45 +00:00
83d71dfb2f Fix mesh.get_local_rank when it is > 1d (#164473)
Previously, we would not take the arguments passed by get_local_rank into account. This means that we wouldn't be able to trace this call if we had a device_mesh > 1d

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164473
Approved by: https://github.com/xmfan, https://github.com/Skylion007
2025-10-04 11:27:55 +00:00
5103ecc5d8 [1/N] Fix clang-tidy readability checks (#164561)
Check all `.cpp` files except `jit` files for readability thoroughly.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164561
Approved by: https://github.com/Skylion007
2025-10-04 09:40:38 +00:00
9580539e2f Add device argument to torch.random.get_rng_state (#163034)
Fixes #162812

Adds support for either passing a device directly into get_rng_state, or passing in a string or int (which is then wrapped into a device inside, as in torch.cuda.get_rng_state).

I wasn't exactly sure where tests for this should go, please let me know. I used this script for testing:
```python
import torch

# note: when running with CUDA GPU, first three tests will give the same result,
# as will the last two

# test with no device specified
print(torch.get_rng_state())

# test with CPU
cpu_device = torch.device("cpu")
print(torch.get_rng_state(cpu_device))

# test with direct name
print(torch.get_rng_state("cpu"))

# test with CUDA
cuda_device = torch.device("cuda:0")
print(torch.get_rng_state(cuda_device))

# test with integer
print(torch.get_rng_state(0))
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163034
Approved by: https://github.com/ezyang, https://github.com/cyyever
2025-10-04 06:48:39 +00:00
a11a66ef32 Remove CUDA 11 branches for sparse code (#164531)
This PR removes outdated CUDA version checks from sparse code in aten/src/ATen/cuda.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164531
Approved by: https://github.com/eqy
2025-10-04 06:07:49 +00:00
6b768e1890 Support propagating custom meta field to backward graph nodes (#164174)
# Propagate custom meta data to backward

Support propagating the user annotation tags to backward graph, by extending the `copy_fwd_metadata_to_bw_nodes` utils (recommended by @xmfan , thanks!).

Example annotation API (added in https://github.com/pytorch/pytorch/pull/163673):

```
class M(torch.nn.Module):
    def forward(self, x):
        with fx_traceback.annotate({"pp_stage": 0}):
            with fx_traceback.annotate({"fdsp_bucket": 0}):
                x = x + 1
            x = x - 2
            with fx_traceback.annotate({"cuda_stream": 2, "fsdp_bucket": 1}):
                x = x * 2
        x = x / 3
        return x
```

Assumptions (some inherited from https://github.com/pytorch/pytorch/pull/126573):

- I am trusting the seq_nr mapping introduced to aot_autograd nodes in https://github.com/pytorch/pytorch/pull/103129
- I am also trusting that the forward is single threaded, since seq_nr is thread local.  If this isn't always true, we'll need to also plumb thread_id through the same machinery which is populating seq_nr.
- **(This is changed in this PR!) I assume all backward graph nodes has "is_backward" for 'partitioner_tag', and all other nodes are forward graph nodes**.  If we don't run export before `aot_export_join_with_descriptors`, then none of the nodes has "nn_module_stack" in node meta. If we do run export first, then we don't need this change.
- I copy "custom" node meta from forward to backward graph nodes.

Question:
- Is it a good idea to copy all "custom" node meta? Or should we create a dedicated key in custom node meta to be copied? @SherlockNoMad
- Do we expect people to run export before using `aot_export_join_with_descriptors`?
- Can we assume the following for graph produced by `aot_export_join_with_descriptors`? "all backward graph nodes has "is_backward" for 'partitioner_tag', and all other nodes are forward graph nodes". Maybe this is a question for @ezyang

```
python test/functorch/test_aot_joint_with_descriptors.py -k test_preserve_
python test/export/test_export.py -k preserve_anno
python test/distributed/tensor/test_dtensor_export.py
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164174
Approved by: https://github.com/xmfan, https://github.com/SherlockNoMad
2025-10-04 05:03:32 +00:00
35c4130fd1 [2/N] Fix ruff warnings (#164460)
Apply ruff `SIM` rules.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164460
Approved by: https://github.com/ezyang
2025-10-04 03:40:32 +00:00
34042a9145 Change intra-graph offset dtype to uint64_t (#164515)
Even though `offset_intragraph_` only tracks RNG consumption within a single graph replay, we have observed that the 32bit storage for these offsets is easy to overshoot, especially for cases with big CUDA graph captures including kernels that are generating a large amount of random numbers.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164515
Approved by: https://github.com/eee4017, https://github.com/eqy
2025-10-04 03:39:09 +00:00
Ken
9d1ab4f4bb [CI] Limit Numba CUDA-13 patch to CUDA environments only (#164607)
The patch introduced in https://github.com/pytorch/pytorch/pull/163111 caused issues in ROCm environments. This change guards the patching logic to CUDA environments only, thus ameliorating test failures in ROCm environments.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164607
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-10-04 02:39:07 +00:00
3e0826c9d7 Update disabling fast-path for strict-export inside MultiheadAttention (#164544)
For some reason, executorch needs the slow path. But the original flag doesn't work for new export because we inline torch modules even before getting into make_fx. We still have to keep the old flag because lot of code assumes this exist.... grr

Differential Revision: [D83810733](https://our.internmc.facebook.com/intern/diff/D83810733)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164544
Approved by: https://github.com/anijain2305, https://github.com/mikaylagawarecki
2025-10-04 02:20:55 +00:00
86c789849e [fr] Re-order mismatch check in fr analysis script (#164606)
In reality we found the current mismatch order does not match the actual error distribution, so we reorder it a bit as following:
1. We do collective type check first
2. Then size check (excluding all2all)
3. dtype check
4. state check

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164606
Approved by: https://github.com/VieEeEw
2025-10-04 01:16:15 +00:00
f3afbcf340 [ONNX] Bump tested onnxruntime to 1.23.0 and onnxscript to 0.5.2 (#164440)
Performs tests on the latest ONNX environment.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164440
Approved by: https://github.com/justinchuby, https://github.com/albanD
2025-10-04 01:10:47 +00:00
40b25578e4 [Inductor] deterministic mode (#163589)
Add a deterministic mode to skip the on device benchmarking that we know should affect numeric. This include
- pad-mm
- dynamic rblock scaling
- template autotuning
- coordinate descent tuning for reduction
- reduction config autotuning in CachingAutotuner.  For reduction both RBLOCK, num_warps should affect numeric. XBLOCK does not. We can still autotune XBLOCK for reductions.
- benchmarking for computation communication reordering pass

The mode definitely has perf hit.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163589
Approved by: https://github.com/v0i0
2025-10-04 01:05:08 +00:00
412c6d28ec [ROCm][CI] additional dynamo benchmarks for inductor-periodic (#164279)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164279
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-10-04 00:55:17 +00:00
7d570129e0 Fix custom autograd Function memory leak when saving mutated view (#164407)
Fixes https://github.com/pytorch/pytorch/issues/160317
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164407
Approved by: https://github.com/albanD
2025-10-04 00:47:12 +00:00
97ca21106d move fw|bw compiler args in aot joint with descriptors (#164584)
Summary: Minor refactor where we push some args in the aot joint with descriptors workflow that are not used in export stage to the compile stage where they are actually used.

Test Plan: existing tests should pass

Differential Revision: D83850316

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164584
Approved by: https://github.com/tugsbayasgalan
2025-10-04 00:24:46 +00:00
27234792ad Fix refine_ranges corner case (#164075)
address https://github.com/pytorch/pytorch/issues/161360

u0>0 should update the range of u0 to start from [1, ..] this fix it. it was not doing that.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164075
Approved by: https://github.com/ColinPeppler
2025-10-03 23:30:46 +00:00
b6b7a44dec Fix common typos and misspellings (#164413)
Summary:
This commit fixes numerous typos and misspellings found throughout the codebase. The fixes improve code readability and documentation consistency across C++, Python, CUDA, and documentation files.

## Typos Fixed

| Before | After | Occurrences |
|--------|-------|-------------|
| occured | occurred | 14 |
| accross | across | 9 |
| lenght/lenghts | length/lengths | 8 |
| unneccessary | unnecessary | 5 |
| Peform | Perform | 4 |
| furture | future | 3 |
| paritioned | partitioned | 2 |
| desireable | desirable | 2 |
| registerations | registrations | 2 |
| seperated | separated | 2 |
| intialized | initialized | 2 |
| capatibility | compatibility | 2 |
| peformed | performed | 2 |
| Exmple | Example | 2 |
| comma_seperated | comma_separated | 2 |
| cumsuming | consuming | 2 |
| neccessary | necessary | 1 |
| ParamterMetadataTable | ParameterMetadataTable | 1 |
| matached | matched | 1 |
| conaitner | container | 1 |
| reivew | review | 1 |
| prioriry | priority | 1 |
| Alocated | Allocated | 1 |
| opportunixtically | opportunistically | 1 |
| peformance | performance | 1 |
| equavalent | equivalent | 1 |
| asssumed | assumed | 1 |
| valdiation | validation | 1 |
| apprear | appear | 1 |
| consectuve | consecutive | 1 |
| dependending | depending | 1 |
| copnversion | conversion | 1 |
| weigted | weighted | 1 |
| repreesenting | representing | 1 |
| finialize | finalize | 1 |
| unintialized | uninitialized | 1 |
| conbined | combined | 1 |
| tesnor | tensor | 1 |
| desugared | discarded | 1 |
| behaviour | behavior | 1 |
| paramerizaitons | parametrizations | 1 |
| compute_output_lenghths_kernel | compute_output_lengths_kernel | 1 |

Test Plan: N/A - mostly comments - waiting on CI

Differential Revision: D83695665

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164413
Approved by: https://github.com/eqy, https://github.com/larryliu0820
2025-10-03 23:19:41 +00:00
3ddf2018d0 Revert "Support setting grad_dtype on leaf tensors (#162815)"
This reverts commit dca73982c53e9f99f96246b5d9ed9bab83c7423f.

Reverted https://github.com/pytorch/pytorch/pull/162815 on behalf of https://github.com/yangw-dev due to break internal test D83850533, see more details below ([comment](https://github.com/pytorch/pytorch/pull/162815#issuecomment-3367498501))
2025-10-03 23:14:28 +00:00
fac6f20ae3 [CI] Add another win shard (#164605)
Since its timing out 0b4f2b46d9/1

the first shard is disproportionately long because of cpp tests, I'm trying to figure that out but for now we can do this or increase the timeout
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164605
Approved by: https://github.com/seemethere, https://github.com/malfet
2025-10-03 22:51:09 +00:00
1894082000 UT/Examples for resharding checkpoint save/loads for distributed tensors with uneven shards. (#160533)
1\  DTensor abstraction on its own, does not support arbitrary length shards in its distributed tensors representation. It supports a single uneven shard, bit it has to be the last shard in the sharding dimension.

2\ However, DCP supports an API called checkpointable. This API allows you to define your custom shardable tensor structure. I have given a UT example ( look for CheckpointableDistTensor). Therefore, one option is to use CheckpointableDistTensor to save/load uneven shards.

3\ While exploring this path, I also noticed that torch.rec module also encountered a similar problem while working with DTensor. They workaround it by implementing Checkpointable API in DTensor and introducing an auxillary structure called LocalShardsWrapper. This is the second option we can use to unblock data loader resharding work.

In summary;
Use LocalShardWrapper + DTensor as the first option to unblock.
Second preference is to use new implementation of Checkpointable API. ( similar to CheckpointbaleDistTensor I have introduced in this example).

Differential Revision: D80182564

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160533
Approved by: https://github.com/saumishr
2025-10-03 22:15:02 +00:00
5a66ff4915 [dynamo, 3.14] fix _detect_and_normalize_assert_statement for 3.14 (#164005)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164005
Approved by: https://github.com/anijain2305, https://github.com/atalman
2025-10-03 22:07:54 +00:00
abadea70f3 [inductor] thread hint_override in more kernel args (#164494)
ensure hint_override is threaded in benchmarking args

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164494
Approved by: https://github.com/bobrenjc93
2025-10-03 22:07:12 +00:00
f414aa8e0d Add pyrefly suppressions (3/n) (#164588)
Adds suppressions to pyrefly will typecheck clean: https://github.com/pytorch/pytorch/issues/163283

Test plan:
dmypy restart && python3 scripts/lintrunner.py -a
pyrefly check

step 1: uncomment lines in the pyrefly.toml file
step 2: run pyrefly check
step 3: add suppressions, clean up unused suppressions
before: https://gist.github.com/maggiemoss/bb31574ac8a59893c9cf52189e67bb2d

after:

 0 errors (1,970 ignored)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164588
Approved by: https://github.com/oulgen
2025-10-03 22:03:03 +00:00
e438db2546 Make custom op alias check consistent (#164576)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164576
Approved by: https://github.com/soulitzer
ghstack dependencies: #164467
2025-10-03 21:42:11 +00:00
10335ffb2c Add pure view support in autograd Function (#164467)
Fix https://github.com/pytorch/pytorch/issues/73604

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164467
Approved by: https://github.com/ezyang, https://github.com/soulitzer
2025-10-03 21:42:11 +00:00
f006aee601 Speed up FP precision lookup (#164044)
This commit simplifies the precision lookup and setting logic
by reducing the number of branches and using a custom hash
function. Fixes #161822. The issue described in #163709 still
persists. This is meant as a short term fix.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164044
Approved by: https://github.com/ngimel, https://github.com/eqy
2025-10-03 21:35:20 +00:00
8d53d788fe lint: add .pyi to changed files on .pyi.in changes (#164603)
We were observing issues where the lint on trunk vs. PRs would be different
due to missing .pyi files. This change adds the .pyi files to the changed files
list when .pyi.in files are changed.

Signed-off-by: Eli Uriegas <eliuriegas@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164603
Approved by: https://github.com/atalman, https://github.com/malfet, https://github.com/Skylion007
2025-10-03 21:30:54 +00:00
0b4f2b46d9 Revert "[inductor] require shape in TritonCSEVariable (#162275)"
This reverts commit f465ea6752c91498de63eb57439a74f4836e568a.

Reverted https://github.com/pytorch/pytorch/pull/162275 on behalf of https://github.com/yangw-dev due to break interal test, see more details in next comment ([comment](https://github.com/pytorch/pytorch/pull/162275#issuecomment-3367213941))
2025-10-03 21:07:00 +00:00
960c4b9937 [inductor] Enable triton kernels with unbacked inputs (#164509)
Summary:
We need to pass in fallback value to avoid converting symbols to int

original failure log in onefeed Slimper MB - P1973406565
`raise TypeError("Cannot convert symbols to int")`

Test Plan:
if not passing in fallback value -
https://www.internalfb.com/intern/everpaste/?handle=GGeAoh_M11kEGOECAFELOaq8ooRCbswMAAAz
`raise TypeError("Cannot convert symbols to int")`

```
buck2 test 'fbcode//mode/opt' fbcode//caffe2/test/inductor:unbacked_symints -- test_triton_kernel_with_unbacked_symint_fallback --print-passing-details --env TORCHDYNAMO_EXTENDED_DEBUG_CPP=1 --env TORCHDYNAMO_EXTENDED_DEBUG_GUARD_ADDED="Eq(u0, 0)"
```
Buck UI: https://www.internalfb.com/buck2/4d27cd49-770b-40de-8c65-9ee04c5dd687
Test UI: https://www.internalfb.com/intern/testinfra/testrun/9570149324695031
Network: Up: 0B  Down: 16MiB  (reSessionID-8e8b07a2-e31c-402d-bf6a-ebb92253e654)
Executing actions. Remaining     0/6                                                              5.0s exec time total
Command: test.     Finished 2 cache (100% hit)                                                    5.0s exec time cached (100%)
Time elapsed: 33.8s
Tests finished: Pass 2. Fail 0. Fatal 0. Skip 0. Build failure 0

Differential Revision: D83684260

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164509
Approved by: https://github.com/ColinPeppler
2025-10-03 21:05:18 +00:00
1f8ee5da11 [TorchGen] Remove unused variables and function imports (#164538)
This PR removes unused code in torchgen.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164538
Approved by: https://github.com/Skylion007, https://github.com/albanD
2025-10-03 20:49:36 +00:00
da49a57d34 [ROCm] Enabled JIT UTs on ROCm (#164582)
This PR is to enable the following tests rocm.

test/test_jit.py::TestBackends::test_save_load
test/test_jit.py::TestBackends::test_execution
test/test_jit.py::TestBackends::test_errors
test/test_jit.py::TestCUDA::test_current_stream

Verified that the tests pass on AMD gfx90a and gfx942 arch.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164582
Approved by: https://github.com/jeffdaily
2025-10-03 20:16:41 +00:00
8ec8c14ace Revert "[CUDA] Add experimental green context support for SM carveout (#159104)"
This reverts commit 3c59351c6ea2fc29d346903e28e95c5f4d0ccdbb.

Reverted https://github.com/pytorch/pytorch/pull/159104 on behalf of https://github.com/clee2000 due to failed lint, pyfmt not caught pyi file, I think they need special handling since theyre not in the changed files list? ([comment](https://github.com/pytorch/pytorch/pull/159104#issuecomment-3367077208))
2025-10-03 20:15:56 +00:00
2d50678dcc Fix -Wno-duplicate-decl-specifier is valid for C/ObjC but not for C++ (#164552)
Fixes #99715
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164552
Approved by: https://github.com/Skylion007
2025-10-03 20:12:49 +00:00
3ca09d65f1 [ROCm] Enable several distributed UTs (#164390)
Increase the tolerance for the following UTs as there was a slight mismatch seen on MI200.
    - test_data_parallel.py:test_strided_grad_layout
    - test_c10d_nccl.py:test_grad_layout_1devicemodule_1replicaperprocess

Skip for MI200:
    - test_fully_shard_training.py:test_2d_mlp_with_nd_mesh
    - test_2d_composability.py:test_train_parity_2d_mlp
    - test_fully_shard_overlap.py:test_fully_shard_training_overlap

Fixes #159489
Fixes #159488
Fixes #152700
Fixes #125555
Fixes #134139

Working as is on both MI200 and MI300:
Fixes #125991
Fixes #125918

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164390
Approved by: https://github.com/jeffdaily
2025-10-03 19:52:51 +00:00
1bb68271b7 Stop building nativert in OSS (#164463)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164463
Approved by: https://github.com/albanD, https://github.com/Skylion007
2025-10-03 19:41:15 +00:00
9eb89a4ad5 Add missing TypeIs to torch/_inductor/ir.py (#164489)
This should be a TypeIs here

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164489
Approved by: https://github.com/mlazos
2025-10-03 19:34:20 +00:00
15d726005d Enable several unit tests on ROCm (#163087)
Code change enables:
test_nn::TestNNDeviceTypeCUDA::test_transformerencoderlayer_cuda_float16
test_nn::TestNNDeviceTypeCUDA::test_transformerencoderlayer_cuda_float32
test_nn::TestNNDeviceTypeCUDA::test_transformerencoderlayer_cuda_float64
test_nn::TestNNDeviceTypeCUDA::test_transformerencoderlayer_gelu_cuda_float16
test_linalg::TestLinalgCUDA::test_eigh_svd_illcondition_matrix_input_should_not_crash_cuda_float32
test_linalg::TestLinalgCUDA::test_eigh_svd_illcondition_matrix_input_should_not_crash_cuda_float64
test_ops::TestCommonCUDA::test_complex_half_reference_testing_as_strided_scatter_cuda_complex32

Fixes https://github.com/pytorch/pytorch/issues/134687
Fixes https://github.com/pytorch/pytorch/issues/78205

Closing github issues:
inductor/test_gpu_cpp_wrapper unit tests:
Fixes https://github.com/pytorch/pytorch/issues/157084

test_nn unit tests:
Fixes https://github.com/pytorch/pytorch/issues/157167
Fixes https://github.com/pytorch/pytorch/issues/157119
Fixes https://github.com/pytorch/pytorch/issues/157118
Fixes https://github.com/pytorch/pytorch/issues/157115
Fixes https://github.com/pytorch/pytorch/issues/157081
Fixes https://github.com/pytorch/pytorch/issues/155216
Fixes https://github.com/pytorch/pytorch/issues/157259
Fixes https://github.com/pytorch/pytorch/issues/157166
Fixes https://github.com/pytorch/pytorch/issues/157165
Fixes https://github.com/pytorch/pytorch/issues/157164
Fixes https://github.com/pytorch/pytorch/issues/157117
Fixes https://github.com/pytorch/pytorch/issues/157116
Fixes https://github.com/pytorch/pytorch/issues/157114
Fixes https://github.com/pytorch/pytorch/issues/157113
Fixes https://github.com/pytorch/pytorch/issues/157082
Fixes https://github.com/pytorch/pytorch/issues/157080
Fixes https://github.com/pytorch/pytorch/issues/157079
Fixes https://github.com/pytorch/pytorch/issues/157078

test_linalg unit tests:
Fixes https://github.com/pytorch/pytorch/issues/157427
Fixes https://github.com/pytorch/pytorch/issues/157414
Fixes https://github.com/pytorch/pytorch/issues/157369
Fixes https://github.com/pytorch/pytorch/issues/157349
Fixes https://github.com/pytorch/pytorch/issues/157348
Fixes https://github.com/pytorch/pytorch/issues/157337
Fixes https://github.com/pytorch/pytorch/issues/157336
Fixes https://github.com/pytorch/pytorch/issues/157297
Fixes https://github.com/pytorch/pytorch/issues/157281
Fixes https://github.com/pytorch/pytorch/issues/157260
Fixes https://github.com/pytorch/pytorch/issues/157171
Fixes https://github.com/pytorch/pytorch/issues/157169
Fixes https://github.com/pytorch/pytorch/issues/157168
Fixes https://github.com/pytorch/pytorch/issues/157125
Fixes https://github.com/pytorch/pytorch/issues/157124
Fixes https://github.com/pytorch/pytorch/issues/157123
Fixes https://github.com/pytorch/pytorch/issues/157089
Fixes https://github.com/pytorch/pytorch/issues/157088
Fixes https://github.com/pytorch/pytorch/issues/157087
Fixes https://github.com/pytorch/pytorch/issues/157068
Fixes https://github.com/pytorch/pytorch/issues/157067
Fixes https://github.com/pytorch/pytorch/issues/157066
Fixes https://github.com/pytorch/pytorch/issues/157047
Fixes https://github.com/pytorch/pytorch/issues/157046
Fixes https://github.com/pytorch/pytorch/issues/157045
Fixes https://github.com/pytorch/pytorch/issues/157044
Fixes https://github.com/pytorch/pytorch/issues/156997
Fixes https://github.com/pytorch/pytorch/issues/156996
Fixes https://github.com/pytorch/pytorch/issues/156995
Fixes https://github.com/pytorch/pytorch/issues/156994
Fixes https://github.com/pytorch/pytorch/issues/156993
Fixes https://github.com/pytorch/pytorch/issues/156991
Fixes https://github.com/pytorch/pytorch/issues/156990
Fixes https://github.com/pytorch/pytorch/issues/156989
Fixes https://github.com/pytorch/pytorch/issues/105118
Fixes https://github.com/pytorch/pytorch/issues/157415
Fixes https://github.com/pytorch/pytorch/issues/157282
Fixes https://github.com/pytorch/pytorch/issues/157261
Fixes https://github.com/pytorch/pytorch/issues/157170
Fixes https://github.com/pytorch/pytorch/issues/157126

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163087
Approved by: https://github.com/jeffdaily, https://github.com/pruthvistony
2025-10-03 19:30:59 +00:00
16f9bef642 [precompile] Fix guard serialization loading bugs. (#164490)
Summary: Added a set of fixes triggered by fm training job. Overall the theme here is that we should get rid of saved objects as much as possible when they are not used in guard reconstruction. Sometimes for objects that cannot be saved (like local functions) we still try our best to save their closures.

Test Plan:
test_guard_serialization.py
test_lazy_awatiable.py

Differential Revision: D83766926

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164490
Approved by: https://github.com/jamesjwu
2025-10-03 19:20:07 +00:00
3c59351c6e [CUDA] Add experimental green context support for SM carveout (#159104)
Low-level PyTorch APIs should be usable/stable enough at this point but we might move the underlying driver API usage a bit from here...

Built on top of @drisspg 's branch

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159104
Approved by: https://github.com/ngimel

Co-authored-by: drisspg <drisspguessous@gmail.com>
2025-10-03 18:59:12 +00:00
7eb1eb4313 ci: Removing ROCm tests from trunk. (#164585)
Had a conversation with the AMD team today and I think we are all in
agreement that the current state of queueing for AMD is beyond where
we'd like to be for there to be blocking CI for ROCm.

Moving the representative testing jobs for this into the ciflow/rocm
workflow.

I'd love for these to be back in trunk if we can get to a state where
our queueing metrics are below an hour for ROCm infrastructure.

Dashboards:
* ROCm Queueing (>60mins) ([link](https://hud.pytorch.org/queue_time_analysis?dateRange=30&startDate=2025-09-03T16%3A06%3A45.025Z&endDate=2025-10-03T16%3A06%3A45.025Z&granularity=week&chartType=bar&repos=pytorch%2Fpytorch&category=machine_type&machineTypes=linux.rocm.gpu.2&machineTypes=linux.rocm.gpu.4&machineTypes=linux.rocm.gpu.mi250&machineTypes=linux.rocm.gpu.gfx942.1&machineTypes=linux.rocm.gpu.mi250.4&machineTypes=linux.rocm.gpu.gfx942.4&machineTypes=linux.rocm.gpu.mi355.2&machineTypes=linux.rocm.gpu.gfx942.4.test&machineTypes=linux.rocm.gpu.mi250.1&machineTypes=linux.rocm.gpu.gfx942.1.test&machineTypes=linux.rocm.gpu.gfx90a.1&machineTypes=linux.rocm.gpu.gfx90a.4&items=linux.rocm.gpu.2&items=linux.rocm.gpu.4&items=linux.rocm.gpu.mi250&items=linux.rocm.gpu.gfx942.1&items=linux.rocm.gpu.mi250.4&items=linux.rocm.gpu.gfx942.4&items=linux.rocm.gpu.mi355.2&items=linux.rocm.gpu.gfx942.4.test&items=linux.rocm.gpu.mi250.1&items=linux.rocm.gpu.gfx942.1.test&items=linux.rocm.gpu.gfx90a.1&items=linux.rocm.gpu.gfx90a.4))
* NVIDIA queueing (<5mins) ([link](https://hud.pytorch.org/queue_time_analysis?dateRange=30&startDate=2025-09-03T16%3A05%3A08.000Z&endDate=2025-10-03T16%3A05%3A08.000Z&granularity=week&chartType=bar&repos=pytorch%2Fpytorch&category=machine_type&machineTypes=lf.linux.g4dn.4xlarge.nvidia.gpu&machineTypes=linux.g4dn.12xlarge.nvidia.gpu&machineTypes=linux.g4dn.metal.nvidia.gpu&machineTypes=linux.g5.4xlarge.nvidia.gpu&machineTypes=lf.linux.g4dn.12xlarge.nvidia.gpu&machineTypes=lf.linux.g5.12xlarge.nvidia.gpu&machineTypes=lf.linux.g5.4xlarge.nvidia.gpu&machineTypes=lf.linux.g6.4xlarge.experimental.nvidia.gpu&machineTypes=linux.g6.4xlarge.experimental.nvidia.gpu&machineTypes=linux.4xlarge.nvidia.gpu&machineTypes=linux.g5.12xlarge.nvidia.gpu&machineTypes=linux.g4dn.4xlarge.nvidia.gpu&machineTypes=lf.linux.4xlarge.nvidia.gpu&machineTypes=linux.g6.12xlarge.nvidia.gpu&items=lf.linux.g4dn.4xlarge.nvidia.gpu&items=linux.g4dn.12xlarge.nvidia.gpu&items=linux.g4dn.metal.nvidia.gpu&items=linux.g5.4xlarge.nvidia.gpu&items=lf.linux.g4dn.12xlarge.nvidia.gpu&items=lf.linux.g5.12xlarge.nvidia.gpu&items=lf.linux.g5.4xlarge.nvidia.gpu&items=lf.linux.g6.4xlarge.experimental.nvidia.gpu&items=linux.g6.4xlarge.experimental.nvidia.gpu&items=linux.4xlarge.nvidia.gpu&items=linux.g5.12xlarge.nvidia.gpu&items=linux.g4dn.4xlarge.nvidia.gpu&items=lf.linux.4xlarge.nvidia.gpu&items=linux.g6.12xlarge.nvidia.gpu))

Signed-off-by: Eli Uriegas <eliuriegas@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164585
Approved by: https://github.com/malfet, https://github.com/yangw-dev, https://github.com/atalman, https://github.com/jeffdaily
2025-10-03 18:19:24 +00:00
f39789cdab [PyTorch Pinned Allocator] Add support of reserved pinned memory segment to avoid slow paths (#164501)
Summary:
This diff adds the feature of allocating a large pinned memory segment upfront based on the provided config. This large segment is then used to serve all the small pinned memory requests to avoid expensive device level APIs (slow paths).

Example:

PYTORCH_CUDA_ALLOC_CONF=pinned_reserve_segment_size_mb:2048

This reserves a 2GB pinned memory segment for the process and then all incoming small requests are just served from this segment and no cudaHostAlloc/cudaHostRegister apis are being called.

Differential Revision: D83779074

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164501
Approved by: https://github.com/yangw-dev
2025-10-03 18:11:27 +00:00
3d9d41c801 Remove old workaround in launch_logcumsumexp_cuda_kernel (#164567)
Remove workaround for CUDA 11.4 .

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164567
Approved by: https://github.com/Aidyn-A, https://github.com/Skylion007
2025-10-03 18:07:02 +00:00
5b0b4cda4a [dtensor] avoid shape recompilations on DTensorSpec (#163820)
skips DTensorSpec.sizes/strides in metadata guard checks

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163820
Approved by: https://github.com/azahed98
2025-10-03 17:18:18 +00:00
2a11ce2c78 Support calling torch.compile inside non-strict export (#164171)
So this fixes at least two issues:
1) When we are invoking inductor backend, we apply pre-grad passes which try to find correct fake mode to use. In the nested case, we will run into clash when there is closure variable in the inductor region because non-strict would have fakified this variable before hand and inner torch.compile would have created a new fresh fake mode. This is not a problem in regular torch.compile because inner torch.compile gets ignored. I don't know if we are supposed to inherit fake mode from parent context in this case. But we can avoid this problem if we just default to eager backend which is fine in this case because the point of export is to capture aten operators. Going to inductor would mean we will lose inner torch.compile ops.
2) There is custom torch function modes in export that track number of torch fns executed and inner compile itself doesn't work because of guard failure as this mode state gets changed. I noticed torch.cond fixes this problem by carefully stashing the torch function mode and defer it in the backend. So the correct thing to do here is just re-use torch.cond implementation unconditionally.

So the things i did for fixing above were:
1) Always default to eager backend when compile is invoked inside export. I needed to make how torch.cond sets up the fresh tracing env into an util that can be shared.
2) The previous eager backend for torch.cond was wrong because the context managers didn't actually persist until the backend is invoked.
3) torch.cond used only disable TorchFunctionMetadata tf mode and stash it for later, but in fact, we should do both TorchFunctionMetadata and PreDispatchTorchFunctionMode.

With above fixes, we are able to export flex attention in export.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164171
Approved by: https://github.com/ydwu4
2025-10-03 16:31:07 +00:00
3288fbf374 Change default device to current acclerator (#164399)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164399
Approved by: https://github.com/albanD
2025-10-03 16:15:09 +00:00
fa5306b4f5 Support partial _DynamoCacheEntries when not all backends available (#163521)
Differential Revision: [D82735769](https://our.internmc.facebook.com/intern/diff/D82735769/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163521
Approved by: https://github.com/zhxchen17
2025-10-03 16:14:32 +00:00
5656d45c8f forward fix #164481 (#164578)
PR #164481 added unit test test_scaled_mm_preserves_strides in test/inductor/test_fp8.py. It was missing the adjustment for ROCm's F8 types on MI300.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164578
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-10-03 15:44:34 +00:00
e40fe634b1 Pin conda version for Docker builds (#164575)
Mitigates https://github.com/pytorch/pytorch/issues/164574
Remove unused CUDA_CHANNEL var - this was used before when we had  pytorch install via conda.

Please note: CUDA 13.0 failures are expected since the CI tries to build against prod and CUDA 13.0 is not available in prod yet.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164575
Approved by: https://github.com/malfet, https://github.com/Camyll
2025-10-03 15:01:35 +00:00
3db2164341 [torchfuzz] add norm operators (#164514)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164514
Approved by: https://github.com/pianpwk
ghstack dependencies: #164432, #164434
2025-10-03 14:44:19 +00:00
5bb8f04d3e [torchfuzz] add nn functional ops (#164434)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164434
Approved by: https://github.com/pianpwk
ghstack dependencies: #164432
2025-10-03 14:44:19 +00:00
5743d731c1 Use torch.testing.test_close instead of torch.testing.test_allclose (#164539)
Because torch.testing.test_allclose is deprecated.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164539
Approved by: https://github.com/mlazos
2025-10-03 14:39:10 +00:00
aed66248a0 [vllm hash update] update the pinned vllm hash (#164319)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned vllm hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164319
Approved by: https://github.com/pytorchbot

Co-authored-by: Huy Do <huydhn@gmail.com>
2025-10-03 12:30:33 +00:00
6c3c9414eb config for dcache + unit tests (#164512)
Test Plan:
```
buck test fbcode//mode/opt caffe2/test/inductor:caching
```

Reviewed By: aorenste

Differential Revision: D83714687

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164512
Approved by: https://github.com/jananisriram
2025-10-03 10:52:59 +00:00
eccf561326 Move call to output generated code in inductor (#161615)
This PR moves the call to copy the generated code from `/tmp/...` so that it is still called if attempting to compile the generated code fails. In both cases now, the generated code will be copied across to `torch_compile_debug/run_.../torchinductor/output_code.py` which makes debugging bad generated code easier.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161615
Approved by: https://github.com/eellison
2025-10-03 10:23:22 +00:00
ddf8de28c2 Add Rocm to Operator Microbenchmark CI (#164173)
This pull request adds support for running operator microbenchmarks on ROCm (AMD GPU) environments in the CI workflow. The main changes involve introducing new build and test jobs for ROCm in the `.github/workflows/operator_microbenchmark.yml` file.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164173
Approved by: https://github.com/huydhn
2025-10-03 07:35:32 +00:00
7617b113ad [torchfuzz] Support EagerVsFullGraphDynamicCompileWithNumericsCheck (#164432)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164432
Approved by: https://github.com/pianpwk
2025-10-03 06:42:20 +00:00
2a760dc51e [DeviceMesh] Simplifying internal bookkeeping with CuTe layout (#163213)
We want to refactor the internal bookkeeping of DeviceMesh so that:
Simply the bookkeeping logics and make it generic enough so that it is easy to support new transformations like flatten noncontiguous dim, reshape and unflatten. (We leveraged the CuTe layout). This new layout also let us handle non-contiguous slicing, flatten, transpose possible.

Concretely, in this PR, we do the following:
1. Use the `_MeshLayout` to handle all index operations rather use a map to record mesh dims.
2. Removed `flatten_name_to_root_dims`, because now we can directly get layout from a flattened device mesh.
3. Replaced `_get_slice_mesh_dims` with `_get_slice_mesh_layout`.
4. Use the newly added function `check_overlap` to check layout overlap.
5. Use a new function `to_remapping_tensor` to use layout ranks as indices when the mesh tensor is not representable as CuTe. The reason is that layout acts as a backend of mesh tensor bookkeeping (indexing indices), it needs to be used as indices for remap back to the mesh tensor for new DeviceMesh generation and backend init. For example, in the case of 2K to 4K, the underlying layout is (2K, 1) but the actual value of the mesh tensor is [2K, 2K+1, ....,]. While flattening, slicing, we need to remap the layout back to the new mesh tensor so it maps the actual device allocation. For example, in the 2K to 4K case, if the shape is (1K, 1K) with dim_names ("dp", "tp"). Then when slicing "tp", the mesh tensor should be (2K, 2K+1, ..., 3K-1) or (3K, 3K+1, ... 4K-1). not the global ranks generated from the layout. (1K, 1).

Verified that loss curve is very close for DeepSeekV3 on torchtitan, note that exact same match is challenging because even if we run the baseline twice, the loss curve does not exactly match.

<img width="1113" height="490" alt="image" src="https://github.com/user-attachments/assets/7877b5a4-337e-4ad8-b878-2378f4f0f38d" />

The PR looks big indeed but we don't change any existing behavior of DeviceMesh, so it is a pure refactor.

With this refactoring we also enabled the slicing and flatten of non-contiguous dims of a device mesh which is hard to implement without cute layout.

This is a continue of https://github.com/pytorch/pytorch/pull/161106 (original one got messed with EasyCLA)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163213
Approved by: https://github.com/lw, https://github.com/fegin
2025-10-03 05:51:28 +00:00
6c209bfc5c [cutlass-4][take 2] upgrade to cutlass 4.2.1 (#164159)
Test Plan: Sandcastle

Differential Revision: D83492704

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164159
Approved by: https://github.com/Skylion007, https://github.com/mlazos
2025-10-03 03:47:59 +00:00
1051c1de5c Add pyrefly suppressions 2/n (#164513)
Adds suppressions to pyrefly will typecheck clean: https://github.com/pytorch/pytorch/issues/163283

Test plan:
dmypy restart && python3 scripts/lintrunner.py -a
pyrefly check

---
step 1: uncomment lines in the `pyrefly.toml` file
before: https://gist.github.com/maggiemoss/911b4d0bc88bf8cf3ab91f67184e9d46

after:
```
 INFO Checking project configured at `/Users/maggiemoss/python_projects/pytorch/pyrefly.toml`
 INFO 0 errors (1,152 ignored)
 ```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164513
Approved by: https://github.com/oulgen
2025-10-03 02:46:13 +00:00
d1cbb74fb1 multimem reduce (#164517)
Modified `multimem_one_shot_all_reduce_out` function to accept a `root` argument, making it a `multimem_reduce` op.

The original `multimem_one_shot_all_reduce` op becomes a caller of the `multimem_reduce`, with each rank providing its own rank id as root.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164517
Approved by: https://github.com/ngimel
2025-10-03 02:41:10 +00:00
91c4db76cb fix flex attention eager: dont round down scores to low-precision (closes #163588) (#163986)
Fixes: https://github.com/pytorch/pytorch/issues/163588

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163986
Approved by: https://github.com/drisspg, https://github.com/mlazos
2025-10-03 01:09:59 +00:00
4691fe6070 remove unnecessary registration (#164481)
scaled_mm already had `needs_exact_strides` in its op registration. also added a test showing these strides are being respected.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164481
Approved by: https://github.com/drisspg, https://github.com/mlazos
2025-10-03 01:03:12 +00:00
ef50c6e3e3 [MPS] Add backward pass for embedding_bag (#163931)
Fixes #162270
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163931
Approved by: https://github.com/malfet
2025-10-03 00:48:38 +00:00
86474ce996 Update mask dtype (#164472)
Differential Revision: [D83781684](https://our.internmc.facebook.com/intern/diff/D83781684)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164472
Approved by: https://github.com/bdhirsh
2025-10-03 00:19:36 +00:00
18e18488e8 [6/N] Apply ruff UP035 rule (#164438)
Continued code migration to enable ruff UP035. Most changes are about moving `Callable` from typing to from collections.abc.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164438
Approved by: https://github.com/ezyang
2025-10-03 00:15:32 +00:00
f7082e92b3 [cuBLAS] update cuBLAS determinism docs, remove workspace requirement checks (#161749)
Since CUDA 11.x (need to update the docs for this, current PR is saying 12.2 which is incorrect) we've been allocating cuBLAS workspaces explicitly per handle/stream combination https://github.com/pytorch/pytorch/pull/85447

According to the cuBLAS documentation, this appears to be sufficient for determinism without any explicit workspace requirements to e.g., `:4096:8` or `:16:8` as was previously expressed in PyTorch docs https://docs.nvidia.com/cuda/cublas/#results-reproducibility

Planning to add an explicit determinism test as well...

Pull Request resolved: https://github.com/pytorch/pytorch/pull/161749
Approved by: https://github.com/ngimel
2025-10-03 00:09:47 +00:00
95a053284c Fix vllm build issue (#164361)
Fixes #ISSUE_NUMBER
unstable https://github.com/pytorch/pytorch/issues/164362
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164361
Approved by: https://github.com/huydhn

Co-authored-by: Huy Do <huydhn@gmail.com>
2025-10-02 23:34:21 +00:00
c7e30ae4dd MX: Remove redundant PLATFORM_SUPPORTS_MX_GEMM constant (#164320)
Deleted duplicate definition of PLATFORM_SUPPORTS_MX_GEMM, was introduced in https://github.com/pytorch/pytorch/pull/162209
Also, adjusted BLOCK_SIZE and fp4_scaling_dtype in test_matmul_cuda.py to enable test_blockwise_nvfp4_compile on ROCm.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164320
Approved by: https://github.com/jeffdaily
2025-10-02 23:30:56 +00:00
dca73982c5 Support setting grad_dtype on leaf tensors (#162815)
`grad_dtype` is a new attribute on Tensor to control gradient dtype:
- Access/setting is leaf-only.
- grad_dtype is respected when (1) when assigning to .grad, and (2) in the engine after the previous node produces incoming gradients for AccumulateGrad. (See table below for details)
- Not setting grad_dtype preserves the current behavior. Accessing it returns `t.dtype`
- `grad_dtype` cannot be set when there is already a `.grad` present and the dtypes conflict.

| `grad_dtype` setting | Setting `.grad` manually | Incoming gradient from autograd engine |
|-----------------------|--------------------------|-----------------------------------------|
| **Default (tensor’s dtype)** | `.grad` must match tensor’s dtype | Engine casts incoming grad to tensor’s dtype |
| **Set to specific dtype** | `.grad` must match that dtype | Engine casts incoming grad to the specified dtype |
| **Set to `None`** | `.grad` may be any dtype | Engine does not cast; accepts incoming grad dtype as-is |

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162815
Approved by: https://github.com/albanD
2025-10-02 23:09:07 +00:00
43848b71d9 Improved support for autotuning in wrapper_fxir (#164132)
Summary:
- correct dtype propagation
- allow more more options to be passed to compiler

Test Plan: in follow up change

Differential Revision: D83367909

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164132
Approved by: https://github.com/jansel
2025-10-02 22:54:22 +00:00
15c8bdcc5e Fix FloorDiv should not generate non integer rationals (due to sympy bug) (#164398)
FloorDiv eval have this optimization
```
  # Expands (x + y) // b into x // b + y // b.
  # This only works if floor is an identity, i.e. x / b is an integer.
 ```

 Before this PR this optimization would generate a result in an expression like this. Duo to a bug in sympy.
 ```
Mul(Rational(1, 22), Add(Mul(Integer(24), Symbol('s37', integer=True, positive=True)), Integer(672)), FloorDiv(Mul(Symbol('s14', integer=True, positive=True), Symbol('s46', integer=True, positive=True)), Integer(2016)))
 ```

 This is because in sympy an expression can have .is_integer =True yet have 1/22 in it!
 This PR ensure we do not generate that by simply opting out if this optimization if we end
 up with quotient that have such rational.

  Fix
  https://github.com/pytorch/pytorch/issues/164385,
  https://github.com/pytorch/pytorch/issues/154996
  https://github.com/pytorch/pytorch/issues/153375
  https://github.com/pytorch/pytorch/issues/164063
and internal user issue.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164398
Approved by: https://github.com/jansel, https://github.com/isuruf
2025-10-02 22:51:03 +00:00
22e219d996 Revert "[DeviceMesh] Simplifying internal bookkeeping with CuTe layout (#163213)"
This reverts commit b0985144b59db8fb20964829b5e0a9d2f9a3f0d6.

Reverted https://github.com/pytorch/pytorch/pull/163213 on behalf of https://github.com/yangw-dev due to caused internal test failure ([comment](https://github.com/pytorch/pytorch/pull/163213#issuecomment-3363414435))
2025-10-02 22:22:26 +00:00
bdc0a421d7 Stop parsing command line arguments every time common_utils is imported. (#156703)
Last PR in the series to re-submit https://github.com/pytorch/pytorch/pull/134592 as smaller PRs:

https://github.com/pytorch/pytorch/pull/154612
https://github.com/pytorch/pytorch/pull/154628
https://github.com/pytorch/pytorch/pull/154715
https://github.com/pytorch/pytorch/pull/154716
https://github.com/pytorch/pytorch/pull/154725
https://github.com/pytorch/pytorch/pull/154728

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156703
Approved by: https://github.com/clee2000
2025-10-02 22:22:04 +00:00
ece5e0f01b Fake process group Direct construction error (#163665)
Fixes #162129. Added validation in _rank_not_in_group() to check if ```FakeProcessGroup``` is properly initialized before use, raising a clear error message if ```torch.distributed.init_process_group(backend='fake')``` hasn't been called first.
This prevents silent failures and ensures proper dispatch system integration for all distributed operations.

Added test case test_fake_process_group_direct_usage_error() that validates the error is raised for ```all_reduce``` and ```all_to_all_single``` operations.

Please let me know if additional distributed operators should be tested or if any other updates are needed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163665
Approved by: https://github.com/ezyang
2025-10-02 22:19:26 +00:00
a34797e031 Revert "Add provenance to inductor IR nodes created after graph.run (#164255)"
This reverts commit b9e73e639e36f3aa628752161711e68878231b30.

Reverted https://github.com/pytorch/pytorch/pull/164255 on behalf of https://github.com/jeffdaily due to broke rocm; inductor/test_provenance_tracing.py::TestProvenanceTracingStackTraces::test_deferred_triton_kernels [GH job link](https://github.com/pytorch/pytorch/actions/runs/18200790301/job/51821738132) [HUD commit link](b9e73e639e) ([comment](https://github.com/pytorch/pytorch/pull/164255#issuecomment-3363360088))
2025-10-02 22:01:41 +00:00
f465ea6752 [inductor] require shape in TritonCSEVariable (#162275)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162275
Approved by: https://github.com/mlazos
ghstack dependencies: #164158
2025-10-02 21:52:09 +00:00
a8edccfbf4 [inductor] fix TestTemplateRender in select_algorithm (#164158)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164158
Approved by: https://github.com/mlazos
2025-10-02 21:52:09 +00:00
6389658ec6 Fix type hints in PrepareModuleInput and PrepareModuleInputOutput (#164482)
Fixes #161646

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164482
Approved by: https://github.com/Skylion007
2025-10-02 21:40:43 +00:00
cc71ab86a6 [DTensor] raise error if the local_tensor argument passed to DTensor.from_local is a DTensor (#164496)
**Summary**
Raise error when the `local_tensor` argument passed to `DTensor.from_local` is
a DTensor, this prevents users from accidentally calling `from_local` over a DTensor
object.

The error message is organized in this way:
```
the local_tensor argument only accepts torch.Tensor but got <class 'torch.distributed.tensor.DTensor'> value.
```

**Test**
`pytest test/distributed/tensor/test_dtensor.py -k test_from_local`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164496
Approved by: https://github.com/ezyang
2025-10-02 21:25:01 +00:00
2a7c486750 Revert "Speed up FP precision lookup (#164044)"
This reverts commit 723ba213932bb1eca90109e003250ebb0da45eb1.

Reverted https://github.com/pytorch/pytorch/pull/164044 on behalf of https://github.com/yangw-dev due to broke internal build In file included from xplat/caffe2/aten/src/ATen/DeviceAccelerator.cpp:1: xplat/caffe2/aten/src/ATen/Context.h:502:38: error: shift count >= width of type [-Werror,-Wshift-count-overflow] 502 | return std::hash<size_t>{}((k1 << 32) | k2); ([comment](https://github.com/pytorch/pytorch/pull/164044#issuecomment-3363016702))
2025-10-02 21:00:44 +00:00
5f18f240de Add initial suppressions for pyrefly (#164177)
Adds suppressions to pyrefly will typecheck clean: https://github.com/pytorch/pytorch/issues/163283

Test plan:
`python3 scripts/lintrunner.py`
`pyrefly check`

---

Pyrefly check before: https://gist.github.com/maggiemoss/3a0aa0b6cdda0e449cd5743d5fce2c60
After:

```
 INFO Checking project configured at `/Users/maggiemoss/python_projects/pytorch/pyrefly.toml`
 INFO 0 errors (1,063 ignored)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164177
Approved by: https://github.com/Lucaskabela
2025-10-02 20:57:41 +00:00
6b7970192f [ROCm][CI] fix test_cudnn_convolution_relu_cuda (#164466)
Fixes #162816.
Test was comparing output of conv vs fused conv but inputs were different memory formats. Also fix test_cudnn_convolution_add_relu.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164466
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-10-02 20:36:54 +00:00
115af42e9d Fix readibility checks in TIDY and apply them (#164475)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164475
Approved by: https://github.com/albanD, https://github.com/Skylion007

Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com>
2025-10-02 20:34:49 +00:00
5f775bdfb7 Fix THP_PyObject_VirtualFree return type (#163763)
# Motivation
`void THP_PyObject_VirtualFree` should have no return value; otherwise, it would raise a build warning
```bash
C:\Users\guangyey\pytorch\torch\csrc\dynamo\cpython_defs.c(264): warning C4098: 'THP_PyObject_VirtualFree': 'void' function returning a value
```
# Additional Context
Refer to
c4f21d7c7c/Include/cpython/objimpl.h (L59-L68)
PyObjectArenaAllocator::free is defined with `void` return type.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163763
Approved by: https://github.com/albanD, https://github.com/williamwen42
2025-10-02 20:21:53 +00:00
8c54101933 add tensor subclass printing support in fx/graph.py (#164403)
it was previously quite misleading since it looks like the inputs to the
dynamo graph are plain tensors when in reality they are tensor subclasses

before
```
class GraphModule(torch.nn.Module):
    def forward(self, L_input_batch_inputs_: "i64[2, 512][512, 1]cuda:0", L_self_parameters_weight_: "f32[202048, 256][256, 1]cuda:0"):
```

after
```
    class GraphModule(torch.nn.Module):
        def forward(self, L_input_batch_inputs_: "DTensor(i64[2, 512][512, 1]cuda:0)", L_self_parameters_weight_: "DTensor(f32[202048, 256][256, 1]cuda:0)"):
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164403
Approved by: https://github.com/ezyang
2025-10-02 20:06:12 +00:00
c45d56dd00 typo corrected in ivalue.cpp's comment (#164485)
Fixes #164483

typo corrected in ivalue.cpp's comment.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164485
Approved by: https://github.com/Skylion007
2025-10-02 20:01:17 +00:00
33b17bc619 Remove old CUDA version checks (#164199)
Remove some version check code for CUDA <12.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164199
Approved by: https://github.com/ezyang
2025-10-02 19:55:47 +00:00
22b1710252 Use posix_fallocate() to reserve disk space for shared memory (#161910)
Shared memory is allocated by creating a file in /dev/shm (by default) that can run out of space. Pytorch reserves the file size by calling ftruncate() that creates a sparse file, so it succeeds even if sufficient disk space is not available.

This could lead to a situation when a shared memory region is successfully created but a subsequent access to a shared memory page results in SIGBUS due to the disk being full.

Using posix_fallocate() instead of ftruncate() eliminates this problem because the former syscall always allocates space and it returns an error if the disk is full.

Related to https://github.com/pytorch/pytorch/issues/5040
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161910
Approved by: https://github.com/mikaylagawarecki
2025-10-02 19:12:57 +00:00
4661200125 [RELAND v2] Close some sources of fake tensors (#164372)
Changelog:

1. When we run into an operation we didn't proxy, we end up emitting fake constants. We error under a config and we disable the config for some internal users. The reason we want to error is this signals a coverage problem we need to address but at the same time, we don't wnat to be disruptive to already working flows.

2. Previous attribute mutation detection logic in non-strict didn't account for nested module structure. This fixes silent incorrectness issue of exporting esm and qwen in non-strict and some torchbench models like levit_128 and demucs.

3. Previous logic didn't work on the cases where we mutate a container attribute as the previous approach used to pytree over old and new attributes resulting in length mismatch. We gracefully handle this now.

Differential Revision: [D83673054](https://our.internmc.facebook.com/intern/diff/D83673054)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164372
Approved by: https://github.com/avikchaudhuri
2025-10-02 18:58:52 +00:00
6a31f42da4 Fix NestedTensor max/min operations for integer dtypes. (#162273)
Fixes: https://github.com/pytorch/pytorch/issues/162049

### Summary

max_dim and min_dim functions incorrectly used torch.finfo()
for all dtypes, causing TypeError for integer tensors.

### Changes

- Use torch.iinfo() for integer dtypes instead of torch.finfo().
- Add CPU test: `test_jagged_max_min_dtypes` covering `int8, int16, int32, int64, uint8, float16, bfloat16, float32 and float64`

### Testing

Before Fix:

`python -m pytest test/test_nestedtensor.py -k "test_jagged_max_min_dtypes" -v`

Output:

```
FAILED [0.0006s] test/test_nestedtensor.py::TestNestedTensorDeviceTypeCPU::test_jagged_max_min_dtypes_cpu_bfloat16 - TypeError: torch.finfo() requires a floating point input type. Use torch.iinfo to handle 'torch.finfo'
FAILED [0.0006s] test/test_nestedtensor.py::TestNestedTensorDeviceTypeCPU::test_jagged_max_min_dtypes_cpu_float16 - TypeError: torch.finfo() requires a floating point input type. Use torch.iinfo to handle 'torch.finfo'
FAILED [0.0006s] test/test_nestedtensor.py::TestNestedTensorDeviceTypeCPU::test_jagged_max_min_dtypes_cpu_float32 - TypeError: torch.finfo() requires a floating point input type. Use torch.iinfo to handle 'torch.finfo'
FAILED [0.0006s] test/test_nestedtensor.py::TestNestedTensorDeviceTypeCPU::test_jagged_max_min_dtypes_cpu_float64 - TypeError: torch.finfo() requires a floating point input type. Use torch.iinfo to handle 'torch.finfo'
FAILED [0.0006s] test/test_nestedtensor.py::TestNestedTensorDeviceTypeCPU::test_jagged_max_min_dtypes_cpu_int16 - TypeError: torch.finfo() requires a floating point input type. Use torch.iinfo to handle 'torch.finfo'
FAILED [0.0005s] test/test_nestedtensor.py::TestNestedTensorDeviceTypeCPU::test_jagged_max_min_dtypes_cpu_int32 - TypeError: torch.finfo() requires a floating point input type. Use torch.iinfo to handle 'torch.finfo'
FAILED [0.0005s] test/test_nestedtensor.py::TestNestedTensorDeviceTypeCPU::test_jagged_max_min_dtypes_cpu_int64 - TypeError: torch.finfo() requires a floating point input type. Use torch.iinfo to handle 'torch.finfo'
FAILED [0.0004s] test/test_nestedtensor.py::TestNestedTensorDeviceTypeCPU::test_jagged_max_min_dtypes_cpu_int8 - TypeError: torch.finfo() requires a floating point input type. Use torch.iinfo to handle 'torch.finfo'
FAILED [0.0004s] test/test_nestedtensor.py::TestNestedTensorDeviceTypeCPU::test_jagged_max_min_dtypes_cpu_uint8 - TypeError: torch.finfo() requires a floating point input type. Use torch.iinfo to handle 'torch.finfo'
```

After Fix:

`python -m pytest test/test_nestedtensor.py -k "test_jagged_max_min_dtypes" -v`

Output:

```
Running 9 items in this shard

test/test_nestedtensor.py::TestNestedTensorDeviceTypeCPU::test_jagged_max_min_dtypes_cpu_bfloat16 PASSED [0.0086s]                                                                                                                   [ 11%]
test/test_nestedtensor.py::TestNestedTensorDeviceTypeCPU::test_jagged_max_min_dtypes_cpu_float16 PASSED [0.0011s]                                                                                                                    [ 22%]
test/test_nestedtensor.py::TestNestedTensorDeviceTypeCPU::test_jagged_max_min_dtypes_cpu_float32 PASSED [0.0011s]                                                                                                                    [ 33%]
test/test_nestedtensor.py::TestNestedTensorDeviceTypeCPU::test_jagged_max_min_dtypes_cpu_float64 PASSED [0.0011s]                                                                                                                    [ 44%]
test/test_nestedtensor.py::TestNestedTensorDeviceTypeCPU::test_jagged_max_min_dtypes_cpu_int16 PASSED [0.0009s]                                                                                                                      [ 55%]
test/test_nestedtensor.py::TestNestedTensorDeviceTypeCPU::test_jagged_max_min_dtypes_cpu_int32 PASSED [0.0010s]                                                                                                                      [ 66%]
test/test_nestedtensor.py::TestNestedTensorDeviceTypeCPU::test_jagged_max_min_dtypes_cpu_int64 PASSED [0.0010s]                                                                                                                      [ 77%]
test/test_nestedtensor.py::TestNestedTensorDeviceTypeCPU::test_jagged_max_min_dtypes_cpu_int8 PASSED [0.0010s]                                                                                                                       [ 88%]
test/test_nestedtensor.py::TestNestedTensorDeviceTypeCPU::test_jagged_max_min_dtypes_cpu_uint8 PASSED [0.0011s]                                                                                                                       [100%]
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162273
Approved by: https://github.com/Skylion007, https://github.com/jbschlosser
2025-10-02 18:46:27 +00:00
c6a6c80a73 Add Aidyn-A to CUDA codeowners (#164436)
Adding myself to "CUDA and CUDA math libraries" section.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164436
Approved by: https://github.com/mikaylagawarecki, https://github.com/eqy
2025-10-02 18:34:10 +00:00
bf717ce346 [AOTI win] Add ABI stable method for updating constant buffer (#163819)
Add `struct AOTInductorConstantMapEntry` to represent the constant map in AOTI Model. We cannot use `std::unordered_map` for cross-compilation, because it is not ABI stable.

it will be tested when we test `update_user_managed_constant_buffer` for windows cross-compilation

Example usage:

```
        // Load constants. Create random constants here.
        auto* fc1_w = new slim::SlimTensor(slim::empty({16, 10}, c10::kFloat, c10::Device(c10::kCUDA, 0)));
        fc1_w->fill_(1.0);

.....

        // Build pairs
        std::vector<AOTInductorConstantPair> constants{
            {"fc1_weight", fc1_w},
            {"fc1_bias",   fc1_b},
            {"fc2_weight", fc2_w},
            {"fc2_bias",   fc2_b},
        };

        // Call runtime (pass raw pointer + size)
        update_user_managed_constant_buffer_abi(
            container_handle,
            constants.data(),
            constants.size(),
            /*use_inactive=*/false,
            /*validate_full_update=*/true);
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163819
Approved by: https://github.com/desertfire
2025-10-02 18:31:00 +00:00
f6f7676756 Revert "C++-accessible Placements via pybind11 (#163030)"
This reverts commit 3e03deab6f3c268c85c8efd9546e28cdda0fa4cc.

Reverted https://github.com/pytorch/pytorch/pull/163030 on behalf of https://github.com/swolchok due to doesn't pass pyre ([comment](https://github.com/pytorch/pytorch/pull/163030#issuecomment-3362450379))
2025-10-02 18:25:24 +00:00
e6d4b26776 Update torch.rst (#164408)
Corrected grammatical mistake

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164408
Approved by: https://github.com/mikaylagawarecki
2025-10-02 18:12:47 +00:00
6bb021c125 Revert "Use TMA loads always for Triton grouped MM kernel (#164256)"
This reverts commit b1033789fea2bc82901eafed498a5252985b80e9.

Reverted https://github.com/pytorch/pytorch/pull/164256 on behalf of https://github.com/yangw-dev due to  failed internal test: (pytorch.tritonbench.test.test_gpu.main.TestTritonbenchGpu) Error Details: torch._inductor.exc.InductorError: LoweringException: NoValidChoicesError: No choices to select. Provided reason: All choices failed to compile for backend. please consider adding ATEN into max_autotune_gemm_backends config (defined in torch/_inductor/config.py) to allow at least one choice.  ([comment](https://github.com/pytorch/pytorch/pull/164256#issuecomment-3362359624))
2025-10-02 17:55:37 +00:00
b9e73e639e Add provenance to inductor IR nodes created after graph.run (#164255)
Summary:
as title

- Some IR nodes are created during `finalize_multi_template_buffers()` in Scheduler. This PR adds provenance (`origin_node` and `origins`) for those nodes.

- Extract `assign_origin_node` function

Differential Revision: D82871244

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164255
Approved by: https://github.com/mlazos
2025-10-02 17:32:46 +00:00
0319556a35 Revert "[vision hash update] update the pinned vision hash (#154694)"
This reverts commit bcafea5c92ca2ee1b0dc8f6d8b62ecabb6f40228.

Reverted https://github.com/pytorch/pytorch/pull/154694 on behalf of https://github.com/yangw-dev due to break the unittest for inductor with improved, update benchmarks/dynamo/ci_expected_accuracy/inductor_torchbench_inference.csv, see failure example https://github.com/pytorch/pytorch/actions/runs/18185852421/job/51776537817 ([comment](https://github.com/pytorch/pytorch/pull/154694#issuecomment-3362285901))
2025-10-02 17:32:04 +00:00
f4cf75688f Add CUDA release architecture matrix (#164471)
We should surface the CUDA architecture matrix to make things more transparent. I believe this can later become its own page where we will publish supported matrix for each release.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164471
Approved by: https://github.com/Camyll
2025-10-02 16:59:48 +00:00
39189592fd Revert "Stop parsing command line arguments every time common_utils is imported. (#156703)"
This reverts commit ac7b4e7fe4d233dcd7f6343d42b4fa3d64bce548.

Reverted https://github.com/pytorch/pytorch/pull/156703 on behalf of https://github.com/clee2000 due to failing internally D80206253, see above comment for details ([comment](https://github.com/pytorch/pytorch/pull/156703#issuecomment-3362156908))
2025-10-02 16:54:22 +00:00
235b995ce1 Make sure Windows CUDA 12.8 build follow same arches as Linux builds (#164470)
I believe ``set TORCH_CUDA_ARCH_LIST=7.0;7.5;8.0;8.6;9.0;10.0;12.0`` is the one thats actually used. Hence remove 6.1  to align the support with Linux support.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164470
Approved by: https://github.com/tinglvv, https://github.com/nWEIdia, https://github.com/Camyll
2025-10-02 16:34:42 +00:00
ac7b4e7fe4 Stop parsing command line arguments every time common_utils is imported. (#156703)
Last PR in the series to re-submit https://github.com/pytorch/pytorch/pull/134592 as smaller PRs:

https://github.com/pytorch/pytorch/pull/154612
https://github.com/pytorch/pytorch/pull/154628
https://github.com/pytorch/pytorch/pull/154715
https://github.com/pytorch/pytorch/pull/154716
https://github.com/pytorch/pytorch/pull/154725
https://github.com/pytorch/pytorch/pull/154728

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156703
Approved by: https://github.com/clee2000
2025-10-02 15:48:47 +00:00
c6329524d8 Revert "Add magic TORCH_MAKE_PYBIND_ENUM_FASTER macro (#163527)"
This reverts commit 50c0550f5a5b1e35885d892081a7d5115d8b4489.

Reverted https://github.com/pytorch/pytorch/pull/163527 on behalf of https://github.com/swolchok due to breaking import torch in debug builds, see #164297 ([comment](https://github.com/pytorch/pytorch/pull/163527#issuecomment-3361919142))
2025-10-02 15:42:42 +00:00
b0985144b5 [DeviceMesh] Simplifying internal bookkeeping with CuTe layout (#163213)
We want to refactor the internal bookkeeping of DeviceMesh so that:
Simply the bookkeeping logics and make it generic enough so that it is easy to support new transformations like flatten noncontiguous dim, reshape and unflatten. (We leveraged the CuTe layout). This new layout also let us handle non-contiguous slicing, flatten, transpose possible.

Concretely, in this PR, we do the following:
1. Use the `_MeshLayout` to handle all index operations rather use a map to record mesh dims.
2. Removed `flatten_name_to_root_dims`, because now we can directly get layout from a flattened device mesh.
3. Replaced `_get_slice_mesh_dims` with `_get_slice_mesh_layout`.
4. Use the newly added function `check_overlap` to check layout overlap.
5. Use a new function `to_remapping_tensor` to use layout ranks as indices when the mesh tensor is not representable as CuTe. The reason is that layout acts as a backend of mesh tensor bookkeeping (indexing indices), it needs to be used as indices for remap back to the mesh tensor for new DeviceMesh generation and backend init. For example, in the case of 2K to 4K, the underlying layout is (2K, 1) but the actual value of the mesh tensor is [2K, 2K+1, ....,]. While flattening, slicing, we need to remap the layout back to the new mesh tensor so it maps the actual device allocation. For example, in the 2K to 4K case, if the shape is (1K, 1K) with dim_names ("dp", "tp"). Then when slicing "tp", the mesh tensor should be (2K, 2K+1, ..., 3K-1) or (3K, 3K+1, ... 4K-1). not the global ranks generated from the layout. (1K, 1).

Verified that loss curve is very close for DeepSeekV3 on torchtitan, note that exact same match is challenging because even if we run the baseline twice, the loss curve does not exactly match.

<img width="1113" height="490" alt="image" src="https://github.com/user-attachments/assets/7877b5a4-337e-4ad8-b878-2378f4f0f38d" />

The PR looks big indeed but we don't change any existing behavior of DeviceMesh, so it is a pure refactor.

With this refactoring we also enabled the slicing and flatten of non-contiguous dims of a device mesh which is hard to implement without cute layout.

This is a continue of https://github.com/pytorch/pytorch/pull/161106 (original one got messed with EasyCLA)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163213
Approved by: https://github.com/lw, https://github.com/fegin
2025-10-02 15:42:03 +00:00
7cfecd76b2 Revert "Improve repeat op to a single copy (#163842)"
This reverts commit 590224f83c8d575b52c6bc40a984132fa593256e.

Reverted https://github.com/pytorch/pytorch/pull/163842 on behalf of https://github.com/yangw-dev due to internal test failed: RuntimeError: false INTERNAL ASSERT FAILED at aten/src/ATen/quantized/Quantizer.cpp:441, . cannot call qscheme on UnknownQuantizer please reach out folks who have internal access for furthur debugging. ([comment](https://github.com/pytorch/pytorch/pull/163842#issuecomment-3361746041))
2025-10-02 15:22:19 +00:00
bac0f289a3 Add methods to access data and unpack_hook on SavedVariable (#164358)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164358
Approved by: https://github.com/albanD
2025-10-02 13:05:16 +00:00
39c340ec9e Add failing bitwise equivalence UT for aot_eager on rms_norm (#164280)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164280
Approved by: https://github.com/albanD
2025-10-02 09:05:28 +00:00
cfd46d13e6 Fix SAC + Flex issue (#164421)
# Summary

This happends when flex_attention is not tagged with the ` CheckpointPolicy.MUST_SAVE` policy. This causes the lse to be unrealized. I think in general this probably not the best policy but we shoudn't error

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164421
Approved by: https://github.com/Skylion007
2025-10-02 09:02:17 +00:00
0e5773b7fa [dynamo][export] Do not graph break on torch.autograd._profiler_enabled for export (#164418)
Actually we would like to not graph break even in the case of Dynamo. But there is a weird-unsolved bug with Kineto + Dynamo when there are distributed jobs that lead to NCCL timeouts. This bug is a rare edege case, but we have not been able to root cause it yet.

But for export, we do not anticipate JIT tracing in distributed job training and therefore this PR is safe for export.

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164418
Approved by: https://github.com/StrongerXi, https://github.com/williamwen42
2025-10-02 09:00:00 +00:00
2c2e1268b7 [inductor] Handle patterns where input/output nodes are the same (#163994)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163994
Approved by: https://github.com/jansel, https://github.com/mlazos
2025-10-02 08:37:55 +00:00
00f0365b95 [torchfuzz] add test suite of fuzzer repros that we xfail (#164430)
i'll add the rest of the repros once in a follow up PR once we agree on a good test harness
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164430
Approved by: https://github.com/ezyang
2025-10-02 08:05:11 +00:00
6bb586eafd [PyTorch / Sigrid GPU] Fixes in pinned stats collection and add new ODS pinned memory stats (#164412)
We do some fixes in pinned memory allocation stats collection and better differentiate between active vs allocated bytes.
Reviewed By: bbus, sayitmemory

Differential Revision: D83162346

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164412
Approved by: https://github.com/mradmila
2025-10-02 08:04:05 +00:00
9697a7ce9e Better path handling for nightly setup tool (#164215)
Resolves https://github.com/pytorch/pytorch/issues/164010#issuecomment-3349283789, cc @filipviz

Previously, the `checkout` subcommand would reuse the `venv`, while the `pull` subcommand would remove and recreate a fresh new `venv` (without prompting before deleting).

This PR:

- Keep and reuse the existing `venv` by default (both `pull` and `checkout`).
- Add a new `--fresh` option to delete and recreate a fresh new `venv`.
- Prompt the user for confirmation (add a new `--yes` option) before deleting the existing prefix path.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164215
Approved by: https://github.com/ezyang, https://github.com/malfet
ghstack dependencies: #162324, #164214
2025-10-02 07:59:17 +00:00
27eb36debb DebugMode add ignore_compile_internals (#164205)
Fixes #164143

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164205
Approved by: https://github.com/albanD
2025-10-02 07:39:54 +00:00
a43c4c3972 [5/N] Apply ruff UP035 rule (#164423)
Continued code migration to enable ruff `UP035`. Most changes are about moving `Callable` from `typing` to `from collections.abc`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164423
Approved by: https://github.com/ezyang
2025-10-02 07:31:11 +00:00
bcafea5c92 [vision hash update] update the pinned vision hash (#154694)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned vision hash.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154694
Approved by: https://github.com/pytorchbot

Co-authored-by: Huy Do <huydhn@gmail.com>
2025-10-02 07:02:40 +00:00
3924f784ba unbacked reshape_copy (#164336)
address https://github.com/pytorch/pytorch/issues/162110
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164336
Approved by: https://github.com/ColinPeppler
2025-10-02 06:50:48 +00:00
93e833de0f [inductor] separate preamble from main work in compile_fx (#164169)
A couple minor things to clean up the structure of `compile_fx` before we hit pre grad passes:
1. After patching config and recursively calling `compile_fx`, we don't need the patches any more. We make the subsequent logic call a `_maybe_wrap_and_compile_fx_main` (both when cpp wrapper exists and doesn't).
2. There's some recursive wrapping that happens on inputs and outputs before hitting pre grad passes, which are now also separated out before calling a `_compile_fx_main`, where actual work finally happens.

These also happen to fix a couple of TODOs in the old code.

Differential Revision: D83500704

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164169
Approved by: https://github.com/zhxchen17
2025-10-02 05:44:31 +00:00
14791ea947 [inductor] teach bisector to look at pre_grad passes (#164250)
Bisector was not aware of pre-grad passes. Now that pre-grad passes use their own graph transformer observer subsystem, it is possible to disable these passes in the bisector.

Differential Revision: D83573614

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164250
Approved by: https://github.com/eellison, https://github.com/mlazos
2025-10-02 05:42:18 +00:00
702f6e703b [MTIA] Enable deserialization for FP8 checkpoint loading (#163559)
Summary: It looks like loading FP8 checkpoints goes through that path which wasn't enabled for MTIA beforehand, whereas loading BF16 checkpoints didn't.

Differential Revision: D82997140

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163559
Approved by: https://github.com/mikaylagawarecki
2025-10-02 04:18:46 +00:00
39b31a6bfd [torchfuzz] keep track of operator stats (#164334)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164334
Approved by: https://github.com/pianpwk
ghstack dependencies: #164034, #164209, #164211, #164210, #164397, #164284
2025-10-02 03:48:07 +00:00
0fbe3f19c7 [torchfuzz] add matmuls (#164284)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164284
Approved by: https://github.com/pianpwk
ghstack dependencies: #164034, #164209, #164211, #164210, #164397
2025-10-02 03:33:10 +00:00
144378615a [torchfuzz] make fuzzer deterministic (#164397)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164397
Approved by: https://github.com/pianpwk
ghstack dependencies: #164034, #164209, #164211, #164210
2025-10-02 03:10:30 +00:00
5dbae1eae2 Fix unbacked replacement where LHS is purely backed expr and RHS is unbacked expr (#164013)
## Scenario
- If there's a `torch._check(backed_expr == unbacked_symbol)`
- then we should replace unbacked_symbol for backed_expr
- currently, we don't do that when generating inputs for autotune_at_compile_time

## Error traceback
```
$ python test/inductor/test_aot_inductor.py -k test_size_with_unbacked_add_expr_transitive
  ...
  File "/data/users/colinpeppler/pytorch/torch/_inductor/compile_fx.py", line 1696, in fx_codegen_and_compile
    return scheme.codegen_and_compile(gm, example_inputs, inputs_to_check, graph_kwargs)
  File "/data/users/colinpeppler/pytorch/torch/_inductor/compile_fx.py", line 1187, in codegen_and_compile
    dynamo_utils.preserve_rng_state(),
  File "/home/colinpeppler/.conda/envs/pytorch/lib/python3.12/contextlib.py", line 158, in __exit__
    self.gen.throw(value)
  File "/data/users/colinpeppler/pytorch/torch/_dynamo/utils.py", line 2236, in preserve_rng_state
    torch.cuda.set_rng_state(cuda_rng_state)  # type: ignore[possibly-undefined]
  File "/data/users/colinpeppler/pytorch/torch/cuda/random.py", line 79, in set_rng_state
    _lazy_call(cb)
  File "/data/users/colinpeppler/pytorch/torch/cuda/__init__.py", line 341, in _lazy_call
    callable()
  File "/data/users/colinpeppler/pytorch/torch/cuda/random.py", line 77, in cb
    default_generator.set_state(new_state)
torch.AcceleratorError: CUDA error: an illegal memory access was encountered
```

## Bad autotuning input generation
```
# assume unbacked_symint_fallback = 16
# we generate too small of an input (16)
buf11 = generate_example_value((16, 256), (256, 1), 'cuda:0', torch.float32, 0, (16, 256))
triton_poi_fused_ones_1.run(buf11, 4096, stream=stream0)

stream0 = get_raw_stream(0)
buf12 = generate_example_value((16, 256), (256, 1), 'cuda:0', torch.float32, 0, (16, 256))
buf13 = generate_example_value((16, 256), (256, 1), 'cuda:0', torch.float32, 0, (16, 256))
add_kernel_1.run(buf11, buf12, buf13, 4096, 16, 1, 1, stream=stream0)
del buf11, buf12

stream0 = get_raw_stream(0)
buf15 = generate_example_value((10500, 256), (256, 1), 'cuda:0', torch.float32, 0, (10500, 256))
triton_poi_fused_add_mul_2.run(buf2, buf13, buf15, 2688000, stream=stream0)
```

## Good autotuning input generation
```
# notice we generate with the proper size now (10500)
buf11 = generate_example_value((10500, 256), (256, 1), 'cuda:0', torch.float32, 0, (10500, 256))
triton_poi_fused_ones_1.run(buf11, 2688000, stream=stream0)

stream0 = get_raw_stream(0)
buf12 = generate_example_value((10500, 256), (256, 1), 'cuda:0', torch.float32, 0, (10500, 256))
buf13 = generate_example_value((10500, 256), (256, 1), 'cuda:0', torch.float32, 0, (10500, 256))
add_kernel_1.run(buf11, buf12, buf13, 2688000, 10500, 1, 1, stream=stream0)
del buf11, buf12

stream0 = get_raw_stream(0)
buf15 = generate_example_value((10500, 256), (256, 1), 'cuda:0', torch.float32, 0, (10500, 256))
triton_poi_fused_add_mul_2.run(buf2, buf13, buf15, 2688000, stream=stream0)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164013
Approved by: https://github.com/cp2923, https://github.com/laithsakka
2025-10-02 02:40:54 +00:00
3e03deab6f C++-accessible Placements via pybind11 (#163030)
This makes Placement data representation available in C++ via pybind11.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163030
Approved by: https://github.com/ezyang
2025-10-02 02:38:23 +00:00
349e9e922d [cutass backend] remove cutlass presets (#164380)
Differential Revision: [D83674898](https://our.internmc.facebook.com/intern/diff/D83674898/)

Changes made by claude code (need to remove test too)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164380
Approved by: https://github.com/Skylion007, https://github.com/mlazos
2025-10-02 01:26:00 +00:00
8b29c59844 [CI][CUDA] Fix distributed tests for b200 (#164345)
This PR fixes the tests that were encountered in #159323.
Namely it fixes #162746 and #162745.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164345
Approved by: https://github.com/eqy
2025-10-02 01:13:49 +00:00
53860ef4e1 Better error handling in torch/csrc/jit/codegen/* (#163948)
Refactor error handling by using TORCH_CHECK for improved clarity in constants and scope management in torch/csrc/jit/codegen/*

Fixes some parts of ISSUE #148114

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163948
Approved by: https://github.com/cyyever, https://github.com/FFFrog, https://github.com/albanD
2025-10-02 01:10:09 +00:00
723ba21393 Speed up FP precision lookup (#164044)
This commit simplifies the precision lookup and setting logic
by reducing the number of branches and using a custom hash
function. Fixes #161822. The issue described in #163709 still
persists. This is meant as a short term fix.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164044
Approved by: https://github.com/ngimel, https://github.com/eqy
2025-10-02 00:59:19 +00:00
a10207e61b Revert "[DCP] Decrease checkpoint background process Gloo pg init timeout (#162760)"
This reverts commit 0925c644edafbb6a8ff42fef5f3bd48b6042fad3.

Reverted https://github.com/pytorch/pytorch/pull/162760 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/162760#issuecomment-3358630631))
2025-10-02 00:44:44 +00:00
ffda8e5ddf [inductor] log kernel autotuning result to a csv (#164191)
Example output: https://gist.github.com/shunting314/2d646c6b6cd9a79fff7a35ffee82baed
```
for each model:
  for each triton kernel:
     for each triton config:
        the csv contains a line for the latency and pointer to find the kernel module in the file system
```

Would use this to try to come up with heuristics to pick a single config.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164191
Approved by: https://github.com/jansel, https://github.com/mlazos
2025-10-02 00:25:34 +00:00
1a5d023a5b Add B200 to Operator Microbenchmark CI (#164288)
Add B200 to operator microbenchmarks nightly run
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164288
Approved by: https://github.com/huydhn
2025-10-01 23:56:34 +00:00
566ea4e86a Work Around exposing statically linked libstdc++ CXX11 ABI strong symbols (#163980)
Work Around for: https://github.com/pytorch/pytorch/issues/133437

Test plan:
1. Build whl in CI
2. Download
3. Run ``nm -D libtorch_cpu.so | grep "recursive_directory_iterator"``

Test with check_binary_symbols.py:

Success:
```
num_cxx11_symbols: 2326
num_pre_cxx11_symbols: 0
lib: /home/ec2-user/github/variant-repack/.venv/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so
num_statically_linked_symbols (T): 0
```

Fail when using "W" instead of "T" as type calling ``cxx11_statically_linked_symbols = grep_symbols(
        lib, STATICALLY_LINKED_CXX11_ABI, symbol_type="W"
    )`` :
```
num_cxx11_symbols: 2326
num_pre_cxx11_symbols: 0
lib: /home/ec2-user/github/variant-repack/.venv/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so
num_statically_linked_symbols (T): 20
Traceback (most recent call last):
  File "/home/ec2-user/github/variant-repack/test/pytorch/.ci/pytorch/smoke_test/check_binary_symbolsc.py", line 130, in <module>
    main()
  File "/home/ec2-user/github/variant-repack/test/pytorch/.ci/pytorch/smoke_test/check_binary_symbolsc.py", line 126, in main
    check_lib_statically_linked_libstdc_cxx_abi_symbols(libtorch_cpu_path)
  File "/home/ec2-user/github/variant-repack/test/pytorch/.ci/pytorch/smoke_test/check_binary_symbolsc.py", line 95, in check_lib_statically_linked_libstdc_cxx_abi_symbols
    raise RuntimeError(
RuntimeError: Found statically linked libstdc++ symbols (recursive_directory_iterator), but there shouldn't be any, see: ['std::filesystem::__cxx11::recursive_directory_iterator::recursion_pending() const', 'std::filesystem::__cxx11::recursive_directory_iterator::depth() const', 'std::filesystem::__cxx11::recursive_directory_iterator::options() const', 'std::filesystem::__cxx11::recursive_directory_iterator::operator*() const', 'std::__shared_ptr<std::filesystem::__cxx11::recursive_directory_iterator::_Dir_stack, (__gnu_cxx::_Lock_policy)2>::operator bool() const', 'std::filesystem::__cxx11::recursive_directory_iterator::disable_recursion_pending()', 'std::filesystem::__cxx11::recursive_directory_iterator::pop(std::error_code&)', 'std::filesystem::__cxx11::recursive_directory_iterator::pop()', 'std::filesystem::__cxx11::recursive_directory_iterator::increment(std::error_code&)', 'std::filesystem::__cxx11::recursive_directory_iterator::recursive_directory_iterator(std::filesystem::__cxx11::path const&, std::filesystem::directory_options, std::error_code*)', 'std::filesystem::__cxx11::recursive_directory_iterator::recursive_directory_iterator(std::filesystem::__cxx11::path const&, std::filesystem::directory_options, std::error_code*)', 'std::filesystem::__cxx11::recursive_directory_iterator::~recursive_directory_iterator()', 'std::filesystem::__cxx11::recursive_directory_iterator::~recursive_directory_iterator()', 'std::filesystem::__cxx11::recursive_directory_iterator::operator=(std::filesystem::__cxx11::recursive_directory_iterator&&)', 'std::filesystem::__cxx11::recursive_directory_iterator::operator=(std::filesystem::__cxx11::recursive_directory_iterator const&)', 'std::filesystem::__cxx11::recursive_directory_iterator::operator++()', 'std::__shared_ptr<std::filesystem::__cxx11::recursive_directory_iterator::_Dir_stack, (__gnu_cxx::_Lock_policy)2>::__shared_ptr(std::__shared_ptr<std::filesystem::__cxx11::recursive_directory_iterator::_Dir_stack, (__gnu_cxx::_Lock_policy)2>&&)', 'std::__shared_ptr<std::filesystem::__cxx11::recursive_directory_iterator::_Dir_stack, (__gnu_cxx::_Lock_policy)2>::__shared_ptr()', 'std::__shared_ptr<std::filesystem::__cxx11::recursive_directory_iterator::_Dir_stack, (__gnu_cxx::_Lock_policy)2>::__shared_ptr(std::__shared_ptr<std::filesystem::__cxx11::recursive_directory_iterator::_Dir_stack, (__gnu_cxx::_Lock_policy)2>&&)', 'std::__shared_ptr<std::filesystem::__cxx11::recursive_directory_iterator::_Dir_stack, (__gnu_cxx::_Lock_policy)2>::__shared_ptr()']
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163980
Approved by: https://github.com/isuruf, https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2025-10-01 23:17:30 +00:00
9065364995 Add xfailing test case for inplace mutation of local DTensor (#164355)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164355
Approved by: https://github.com/albanD
2025-10-01 23:16:26 +00:00
6eb8d9671b Enable torch.nn.functional.batch_norm in test_export_opinfo (#164261)
Summary:
There are actually 2 `nn.functional.batch_norm` in op_db. See https://github.com/pytorch/pytorch/blob/main/torch/testing/_internal/common_methods_invocations.py#L16797-L16831

So previously the test failed at `assert len(ops)==1`

Test Plan: python test/export/test_export_opinfo.py TestExportOnFakeCudaCUDA

Differential Revision: D83581427

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164261
Approved by: https://github.com/SherlockNoMad
2025-10-01 21:56:08 +00:00
b5c4f46bb9 Add functions to setup PrivateUse1 as a python backend device. (#157859)
Fixes #156052 and #156444.

This PR setup the privateuseone key in Python to be used as a python backend for pytorch.
Meaning that, after calling `setup_privateuseone_for_python_backend('npy')`, one can use a subclass to with that device to hold arbitrary python data as "device data" and use `torch.library` to register ops that takes that Tensor.

Changes done in this PR:

1. Register an vanilla Device Guard: I extended NoOpDeviceGuard to have allow device index of 0 and to not raise errors when event related functions are accessed. If I don't do those, when calling backward I would get errors. (CPU backend uses NoOpDeviceGuard just fine, although there seems to be special treatment of CPU in the autograd engine.
2. Tensor subclass allows not having `__torch_dispatch__` if the device is not CUDA or CPU. The comment of the check suggests it was to avoid segfault when calling into ops that expects a storage. Here we have a different device so will not call into those ops.
3. python function that invokes the other incantations to setup the privateusekey backend.

This took inspiration of https://github.com/bdhirsh/pytorch_open_registration_example and https://github.com/tinygrad/tinygrad/blob/master/extra/torch_backend/wrapped_tensor.cpp; great thanks to @bdhirsh and @geohot.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157859
Approved by: https://github.com/albanD
2025-10-01 21:32:59 +00:00
773c6762b8 [CD][CUDA13][NCCL] Fix nccl version typo for cu13 (#164383)
https://pypi.org/project/nvidia-nccl-cu13/#history does not have 2.27.5 but 2.27.7+.
Companion PR: https://github.com/pytorch/pytorch/pull/164352

Fixes a potential binary breakage due to non-existence of referenced NCCL cu13 version.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164383
Approved by: https://github.com/tinglvv, https://github.com/Skylion007, https://github.com/atalman
2025-10-01 21:32:25 +00:00
7320f44cdc Skip windows unittest in fbcode (#164363)
Summary: as title

Test Plan:
```
buck run fbcode//caffe2/test/inductor:aot_inductor_windows
```

Differential Revision: D83664801

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164363
Approved by: https://github.com/angelayi
2025-10-01 20:18:19 +00:00
e5c0e6b5e3 [testing] Better short job name during upload additional stats (#164287)
I think we usually we leave the ` / test` in for clarity
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164287
Approved by: https://github.com/atalman, https://github.com/malfet
2025-10-01 19:56:20 +00:00
7304b9e7d2 [ROCm] fix carveout feature (#164303)
Fixes #164271.

Carveout had been applied with an opposite bitmask. Besides being incorrect, this lead to flaky unit test behavior due to carveout being too high.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164303
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-10-01 19:25:41 +00:00
315ffdc1e4 [4/N] Apply ruff UP035 rule to python code (#164206)
Follows #164104

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164206
Approved by: https://github.com/albanD
2025-10-01 19:05:53 +00:00
8c590cab9d [inductor] add a runtime assert for triton shapes (#164242)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164242
Approved by: https://github.com/eellison, https://github.com/mlazos
ghstack dependencies: #164241
2025-10-01 18:55:33 +00:00
9357c31b53 [inductor] Fix constant shape for float constants (#164241)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164241
Approved by: https://github.com/mlazos
2025-10-01 18:55:33 +00:00
f63d16c6a9 Make viable/strict updatable again (#164374)
To allow viable/strict to move forward, after https://github.com/pytorch/pytorch/pull/164260 was landed

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164374
Approved by: https://github.com/seemethere
2025-10-01 18:09:07 +00:00
8dfc8efffd [export] Preserve nn_module_stack for aliased nn modules (#164311)
Preparing for install_free_tensors flag.

Thanks to @tugsbayasgalan in coming up with the change.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164311
Approved by: https://github.com/tugsbayasgalan
2025-10-01 18:04:33 +00:00
3ffaab3bc8 [Replicate][Pipeline Parallelism] integration of new replicate function with pipeline parallelism (#164031)
**Summary:** In order to test numerics for replicate + pp, stage.py needs to be able to call replicate's backward manually as pipeline parallelism doesn't have this feature.

**Test Case**
1.  pytest test/distributed/_composable/test_composability/test_pp_composability.py -k test_replicate_pp

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164031
Approved by: https://github.com/weifengpy, https://github.com/H-Huang
ghstack dependencies: #163897
2025-10-01 18:01:16 +00:00
ebd0707578 [SymmMem] Add get_nbi the nonblocking version (#163540)
```Py
@triton.jit
def foo(dest, src):
    nvshmem.get_nbi(dest, src, 100, 0)
    # Some independent computation which overlaps with the get operation
    ...
    # Wait for completion of the get operation
    nvshmem.quiet()
```

Allows us to overlap comm and compute in the same kernel, instead of two kernels + signals.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163540
Approved by: https://github.com/ngimel, https://github.com/fegin
2025-10-01 17:50:24 +00:00
76ddbc2bbb Add option to FakeProcessGroup to raise error if comms are invoked. (#162841)
The current behavior is to do "nothing", which means you will corrupt
data.  If you're doing something similar to LocalTensor, where you're
overriding the behavior of collectives to do something numerically,
this can be unwelcome behavior.  If you can error when this happens
it can help prevent silent numerical incorrectness.

Authored with claude code.

Signed-off-by: Edward Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162841
Approved by: https://github.com/dcci
2025-10-01 17:48:19 +00:00
69c5c08a01 Revert "[dynamo, 3.14] fix _detect_and_normalize_assert_statement for 3.14 (#164005)"
This reverts commit 5ed4672477c71492a2f41ac0395dd0630446d6a5.

Reverted https://github.com/pytorch/pytorch/pull/164005 on behalf of https://github.com/williamwen42 due to broke some tests e.g. https://github.com/meta-pytorch/autoparallel/actions/runs/18167350261/job/51719783636?pr=179 ([comment](https://github.com/pytorch/pytorch/pull/164005#issuecomment-3357433475))
2025-10-01 17:47:22 +00:00
3dab36bdb4 [FSDP][Replicate] created ReplicateModule and changed replicate to use it instead of FSDPModule (#163897)
**Summary:** In order to minimize the code copied from FSDP to make replicate work, I made all replicated modules FSDPModule. While this was sufficient originally, there are changes to codebase like below that require us to differentiate between a FSDPModule and a ReplicateModule so that we can access replicate_state or fsdp_state: https://www.internalfb.com/code/fbsource/[a9a8e5102052]/fbcode/caffe2/torch/distributed/pipelining/stage.py?lines=629-666.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163897
Approved by: https://github.com/weifengpy
2025-10-01 17:30:10 +00:00
1288c6d8bb Enable keep-going for trunk tags (#164307)
Tags like `trunk/{sha}` are used to re-run signals by [autorevert project](https://github.com/pytorch/test-infra/blob/main/aws/lambda/pytorch-auto-revert/README.md).

We need to have `keep-going` enabled for those reruns, so that they surface all test failures, not just the first one.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164307
Approved by: https://github.com/clee2000
2025-10-01 17:21:43 +00:00
80ed522910 [export] support unbacked stack (#163867)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163867
Approved by: https://github.com/laithsakka
2025-10-01 16:48:46 +00:00
f7ab8a2710 [1/N] Fix ruff warnings (#164333)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164333
Approved by: https://github.com/albanD
2025-10-01 16:48:32 +00:00
e419dc6d08 [PP] Customize pipeline's submod name (#164037)
Changing PP submodules' name from `submod_i` to `submod_pp_i` to distinguish from the submodule created by HOP.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164037
Approved by: https://github.com/H-Huang
ghstack dependencies: #164045, #164035
2025-10-01 16:29:19 +00:00
5f868ca110 [fx] Allow customization of submod name in split graph (#164035)
Fixes #164030: HOP and pipelining both name things submod_i
by adding an optional argument `partition_affix` to `split_module` API.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164035
Approved by: https://github.com/ezyang
ghstack dependencies: #164045
2025-10-01 16:26:14 +00:00
20edc5b26a Revert "Add num_store to inductor_meta and use it to scale persistent reduction x block (#162446)"
This reverts commit 22c5e8c17c7551c9dd2855589ae774c1e147343a.

Reverted https://github.com/pytorch/pytorch/pull/162446 on behalf of https://github.com/PaulZhang12 due to perf regression in https://github.com/pytorch/pytorch/issues/164301#issuecomment-3354028620 ([comment](https://github.com/pytorch/pytorch/pull/162446#issuecomment-3357164274))
2025-10-01 16:23:03 +00:00
59a86cb137 Revert "[fx] Allow customization of submod name in split graph (#164035)"
This reverts commit 615da7b95ef22ec0fa07f296dcb103d7d5aeda34.

Reverted https://github.com/pytorch/pytorch/pull/164035 on behalf of https://github.com/yangw-dev due to internal build failed Buck build failed for this target, and is likely caused by your changes. ([comment](https://github.com/pytorch/pytorch/pull/164035#issuecomment-3357113348))
2025-10-01 16:09:50 +00:00
36a37b81cd Revert "[PP] Customize pipeline's submod name (#164037)"
This reverts commit 704cd771f6a63abf9498934aeb7f3079ab9e2232.

Reverted https://github.com/pytorch/pytorch/pull/164037 on behalf of https://github.com/yangw-dev due to internal build failed Buck build failed for this target, and is likely caused by your changes. ([comment](https://github.com/pytorch/pytorch/pull/164035#issuecomment-3357113348))
2025-10-01 16:09:50 +00:00
2610746375 Revert nccl upgrade back to 2.27.5 (#164352)
Revert https://github.com/pytorch/pytorch/pull/162351 as it breaks H100
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164352
Approved by: https://github.com/atalman, https://github.com/malfet
2025-10-01 15:27:40 +00:00
b1033789fe Use TMA loads always for Triton grouped MM kernel (#164256)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164256
Approved by: https://github.com/ngimel
ghstack dependencies: #163895
2025-10-01 15:24:51 +00:00
07d896fa48 Revert "CUDACachingHostAllocatorImpl skip event query during capture (#164001)"
This reverts commit 4cf29004749714670fee9e7e3776778faf5ced25.

Reverted https://github.com/pytorch/pytorch/pull/164001 on behalf of https://github.com/yangw-dev due to failed internal error with multiple errors found: Not equal to tolerance rtol=0.1, atol=0.1.. ([comment](https://github.com/pytorch/pytorch/pull/164001#issuecomment-3356894787))
2025-10-01 15:11:21 +00:00
31681bcacc [PyTorch] Pull ARM's box-cox (#164152)
Summary:
ARM has provided with an SVE128 box-cox implementation.

It uses the same underlying algorithm as the previous version, but it has better log and exp implementations.
These supplied mathematical functions have switches to adjust the precision/speed trade-off.

We've noted a slight precision improvement, while also about a 5% peroformance increase

Before:

ZeroLambda1                                                61.66ns    16.22M
NonZeroLambda1                                            125.73ns     7.95M
NonZeroLambdaManyColumns                                    1.84ms    542.11
NonZeroLambdaEigenColumnar                                262.31us     3.81K
NonZeroLambdaEigenRowMajor                                275.17us     3.63K
NonZeroLambdaWithPyTorchColumnar                           97.43us    10.26K
NonZeroLambdaWithPyTorchRowMajor                           90.82us    11.01K
NonZeroLambdaWithPyTorchRowMajorFullBatch                  96.96us    10.31K
NonZeroLambdaBatch                                        151.84us     6.59K

After:

ZeroLambda1                                                57.85ns    17.29M
NonZeroLambda1                                            118.85ns     8.41M
NonZeroLambdaManyColumns                                    1.82ms    548.16
NonZeroLambdaEigenColumnar                                261.67us     3.82K
NonZeroLambdaEigenRowMajor                                274.53us     3.64K
NonZeroLambdaWithPyTorchColumnar                           89.12us    11.22K
NonZeroLambdaWithPyTorchRowMajor                           83.49us    11.98K
NonZeroLambdaWithPyTorchRowMajorFullBatch                  88.79us    11.26K
NonZeroLambdaBatch                                        144.74us     6.91K

Test Plan:
Correctness:

buck2 test @//mode/opt //koski/functions_contrib/df4ai/tests:batch_box_cox_test

Performance:

buck2 run @//mode/opt //koski/functions_contrib/df4ai/benchmark:boxcox_benchmark

Differential Revision:
D83485704

Privacy Context Container: L1196524

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164152
Approved by: https://github.com/ezyang
2025-10-01 15:00:03 +00:00
e901866dd7 Add a RECORD_FUNCTION for Python fallback so it shows in profile (#160573)
Signed-off-by: Edward Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160573
Approved by: https://github.com/bdhirsh, https://github.com/albanD
2025-10-01 14:10:44 +00:00
70d1043bdf Fix non-TMA loads in grouped MM Triton kernel (#163895)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163895
Approved by: https://github.com/lezcano
2025-10-01 12:21:13 +00:00
69fa26d9b4 Triton 3.5.x pin update (#164268)
Updates triton pin to latest: https://github.com/triton-lang/triton/commits/release/3.5.x/

This updates contains 2 cherry-pick to remove Python 3.9 from list of supported python versions:
https://github.com/triton-lang/triton/pull/8288
https://github.com/triton-lang/triton/pull/8287
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164268
Approved by: https://github.com/aakhundov
2025-10-01 11:41:50 +00:00
d9c80ef97d Build and Install Arm Compute Library in manylinux docker image (#159737)
----

This PR will be part of a series of PR's that aims to remove `.ci/aarch64_linux` folder entirely, such that Aarch64 manylinux build happens as part of `.ci/manywheel/build.sh`, the same as other platforms.

In this PR:

- We prebuild + install Arm Compute Library in the manylinux docker image ( at /acl ), instead of a build time for every pytorch build.  Also updated jammy install path to be /acl too.
- We can therefore remove build_ArmComputeLibrary functions from the ci build scripts.
- There is also some refactoring of install_openblas.sh and install_acl.sh to align them together ( similar formatting, similar variable names, same place for version number update )
- We had 2 places to define openblas version, this has been reduced to 1 now ( install_openblas.sh ).
- ACL_VERSION and OPENBLAS_VERSION are now able to be overriden at build.sh level for developers, but there is only 1 version of each hardcoded for ci.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159737
Approved by: https://github.com/seemethere, https://github.com/aditew01
2025-10-01 11:33:51 +00:00
ac1bc51608 [dynamo] do not pop from framelocals dict in Python 3.10 (#164316)
Followup to https://github.com/pytorch/pytorch/pull/164038

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164316
Approved by: https://github.com/anijain2305
2025-10-01 10:20:46 +00:00
ed90040d33 Releases multicast object before releasing mapped buffers in CUDASymmetricMemory (#163750)
Fixes: https://github.com/pytorch/pytorch/issues/162429. In B200, cuMulticastUnbind can error if the mapped buffers are free'd before the multicast object is free'd. The only documentation I could find is here: e11d7f77c1/src/transport/nvls.cc (L113).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163750
Approved by: https://github.com/ngimel, https://github.com/Skylion007, https://github.com/kwen2501, https://github.com/nWEIdia, https://github.com/cyyever
ghstack dependencies: #163575
2025-10-01 09:07:48 +00:00
4dab208d97 Adds Issue#153109 as a test for CUDAPluggableAllocator (#163575)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163575
Approved by: https://github.com/ngimel
2025-10-01 09:07:48 +00:00
9fd53a2bdc Register MTIA kernel for all_all_out (#164293)
Reviewed By: srsuryadev

Differential Revision: D83517879

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164293
Approved by: https://github.com/Skylion007, https://github.com/malfet
2025-10-01 09:05:08 +00:00
17ab99463a [Easy] Add notes for setting up dev venv with specific Python version (#164214)
Resolves https://github.com/pytorch/pytorch/issues/164010#issuecomment-3340751377

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164214
Approved by: https://github.com/ezyang
ghstack dependencies: #162324
2025-10-01 08:25:13 +00:00
eca6ac2293 [BE][Easy] update CUDA and ROCm sources in nightly tool (#162324)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162324
Approved by: https://github.com/ezyang
2025-10-01 08:25:13 +00:00
12d4cb0122 Suppress FutureWarnings in torch.distributed.algorithms.ddp_comm_hooks (#163939)
Fixes #163938

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163939
Approved by: https://github.com/cyyever, https://github.com/kwen2501
2025-10-01 07:51:12 +00:00
590224f83c Improve repeat op to a single copy (#163842)
In #163455 , the `reshape` was not a pure view op.

The `permute` before it created an non-contiguous tensor, which would trigger a data copy during the reshape.

This PR improved the implementation by remove the `urtensor` intermediate tensor completely.
By simply expanding the `xtensor` would achieve the `repeat` effect.

Before this PR, there were two data copies (in `urtensor.copy_` and `urtensor.reshape`).
Now, there is only one data copy in the `.copy_()`.
Reshape would not copy data because it is on a contiguous tensor.

One more note is that we do want at one copy because we want to duplicate the elements for the repeats.
User can inplace modify single elements without afffecting others.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163842
Approved by: https://github.com/Skylion007

Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com>
2025-10-01 06:27:53 +00:00
cc8b14d09a [2/N] Simplify "in" operation for containers of a single item (#164323)
These issues are detected by ruff [FURB171](https://docs.astral.sh/ruff/rules/single-item-membership-test/#single-item-membership-test-furb171).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164323
Approved by: https://github.com/justinchuby, https://github.com/Skylion007
2025-10-01 05:39:11 +00:00
96c3b9e275 [dynamo] Use strings instead of modules for fqn info tracking (#164272)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164272
Approved by: https://github.com/Skylion007, https://github.com/williamwen42, https://github.com/mlazos
2025-10-01 04:22:57 +00:00
9ddfc59b9b [BE] Delete stale non-ephemeral runners workarounds (#164285)
As all Win runners are ephemeral, no need to cleanup leftover processes
or uninstall PyTorch at the end of the test
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164285
Approved by: https://github.com/Skylion007
2025-10-01 03:47:36 +00:00
6d4dfa0878 [CI] Push viable/strict/${time} tags (#164183)
Every time viable strict is updated
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164183
Approved by: https://github.com/seemethere
2025-10-01 03:41:10 +00:00
11ccb95ccb [PyTorch Pinned Allocator] Pinned memory stats and perf fixes around allocating blocks (#163777)
Summary: This diff adds bucket stats for pinned memory and also a perf fix to not check for sizes when background thread is enabled

Differential Revision: D83162186

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163777
Approved by: https://github.com/bbus
2025-10-01 03:28:58 +00:00
bd0907dc4c [BE][CI] Unify requirments (#163396)
Both Linux, Windows and MacOS CI workflows should use `.ci/docker/requirements-ci.txt`
TODOS:
 - Investigate why `choco install cmake` is needed to successfully detect MKL
 - Move `psutil` installation from specific scripts into requirements-ci.txt
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163396
Approved by: https://github.com/Skylion007
2025-10-01 03:28:48 +00:00
8bb71c07c4 Skip symmetric memory tests calling _scaled_mm on CCC < 8.9 (#164251)
This avoids them failing on e.g. A100 GPUs with
> RuntimeError: torch._scaled_mm is only supported on CUDA devices with compute capability >= 9.0 or 8.9, or ROCm MI300+

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164251
Approved by: https://github.com/Skylion007, https://github.com/kwen2501
2025-10-01 03:26:21 +00:00
fa90090735 Use dataclass features in two classes (#164221)
This PR completes two TODO items by using features of `dataclass`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164221
Approved by: https://github.com/Skylion007, https://github.com/mlazos

Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com>
2025-10-01 03:20:39 +00:00
591997490a [BE][Easy]: Add prims common TypeGuard (#164263)
Slightly improves typing by adding a TypeGuard.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164263
Approved by: https://github.com/albanD
2025-10-01 03:13:10 +00:00
531f3bf5e1 Adding check for square matrix for input tensor in matrix_exp backwar… (#163357)
…d op.

Fixes #146796

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163357
Approved by: https://github.com/lezcano
2025-10-01 03:12:30 +00:00
2a5ce2feb4 Add algorithm in header (#164295)
Fixes #163307. Added ```#include <algorithm>``` to vulkan QueryPool for the std::for_each call

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164295
Approved by: https://github.com/Skylion007
2025-10-01 03:09:50 +00:00
3787a5a60e [export] Explicitly passing requires_grad to nn.Parameter() in deserialization (#164290)
Summary: `nn.Parameter()` by default has `requires_grad=True` and would cause issues when there are non-float parameters.

Test Plan: buck2 run mode/dev-nosan caffe2/test:test_export -- -r test_non_float_weight

Differential Revision: D83598796

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164290
Approved by: https://github.com/angelayi
2025-10-01 02:55:20 +00:00
c66d18d24d [dynamo][sac] Support functools partial context_fn for sac (#164308)
Fixes https://github.com/pytorch/pytorch/issues/164300

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164308
Approved by: https://github.com/Lucaskabela, https://github.com/soulitzer
2025-10-01 02:47:55 +00:00
e0f118585f skip non memory deps in memory estimator (#164294)
Differential Revision: [D83601030](https://our.internmc.facebook.com/intern/diff/D83601030)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164294
Approved by: https://github.com/mlazos
2025-10-01 02:44:58 +00:00
10a005e87f [torchfuzz] add layout operators (#164210)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164210
Approved by: https://github.com/pianpwk
ghstack dependencies: #164034, #164209, #164211
2025-10-01 02:33:19 +00:00
1f3995cdc8 [torchfuzz] raise if Operator abstract method is not implemented (#164211)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164211
Approved by: https://github.com/pianpwk
ghstack dependencies: #164034, #164209
2025-10-01 02:33:19 +00:00
abfcce58a4 [torchfuzz] remove erroneous can_produce check (#164209)
can_produce is an abstract method that always return false
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164209
Approved by: https://github.com/pianpwk
ghstack dependencies: #164034
2025-10-01 02:33:19 +00:00
5b1c39f5a1 Add smoke tests to verify that stable ABI FA3 wheel runs w/ newer torch (#163782)
Passing CI: https://github.com/pytorch/pytorch/actions/runs/18141589975/job/51635340255?pr=163782

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163782
Approved by: https://github.com/huydhn, https://github.com/mikaylagawarecki
2025-10-01 02:30:38 +00:00
8df3f2fa98 Revert new-test part of #163829 (#164259)
Summary:

New test sizes for `test_scaled_mm_vs_emulated_block_wise` all fail with

```
RuntimeError: Invalid scaling configuration
```

Disable these new tests for now (the remaining test is a parametrized
version of the original test case)

Test Plan:

`pytest test/test_scaled_matmul_cuda.py`

Reviewers:

Subscribers:

Tasks:

Tags:
Signed-off-by: Simon Layton <simonlayton@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164259
Approved by: https://github.com/jananisriram
ghstack dependencies: #164266
2025-10-01 02:23:21 +00:00
7a9119948e Split scaled-mm tests into separate file (#164266)
Summary:

* Split scaled-mm-specific tests into `test/test_scaled_matmul.py`

Test Plan:

```
pytest test/test_matmul_cuda.py
pytest test/test_scaled_matmul_cuda.py
```

Reviewers:

Subscribers:

Tasks:

Tags:
Signed-off-by: Simon Layton <simonlayton@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164266
Approved by: https://github.com/Skylion007, https://github.com/albanD
2025-10-01 02:23:21 +00:00
28c1d2f81b [aoti] AOTI mingw cross compilation (#163188)
To run this, you need to install `mingw64-gcc-c++` and download windows cuda library toolkit.

See design doc and demo instructions in https://docs.google.com/document/d/1iDaChqA5nNKkBFTzsdkmoomvQlXHbnlb1Z4yEp7xaJA/edit?tab=t.0

If cross_platform_target is windows, we do the following:

- do not link to `sleef`. This can be improved in the future if we need it. Currently I avoid it because that requires extra setup on the linux side
- Use `mingw64-gcc-c++` to compile
- Use `WINDOWS_CUDA_HOME` instead of `CUDA_HOME` when linking to cuda

```
 python test/inductor/test_aot_inductor_windows.py -k so
 ```

 Other changes:
 - de-couples compile_standalone config and dynamic link flag
 - create a new aot_inductor_mode config module, which is used to control configs in aot_inductor.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163188
Approved by: https://github.com/desertfire
2025-10-01 02:22:06 +00:00
c4bbc6433e [PyTorch CCA] Add an API to get expandable segment sizes (#163771)
Summary: This diffs add an API to query expandable segment size for each stream so that we can use this info to warmup the segment in advance, so we dont incur any performance penalty during steady state inference for new CUDA memory allocations.

Differential Revision: D76447308

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163771
Approved by: https://github.com/bbus
2025-10-01 02:16:58 +00:00
f583c3f80a Update (base update)
[ghstack-poisoned]
2025-09-29 21:45:28 +08:00
1562 changed files with 29400 additions and 20017 deletions

View File

@ -13,49 +13,6 @@ def list_dir(path: str) -> list[str]:
return check_output(["ls", "-1", path]).decode().split("\n")
def build_ArmComputeLibrary() -> None:
"""
Using ArmComputeLibrary for aarch64 PyTorch
"""
print("Building Arm Compute Library")
acl_build_flags = [
"debug=0",
"neon=1",
"opencl=0",
"os=linux",
"openmp=1",
"cppthreads=0",
"arch=armv8a",
"multi_isa=1",
"fixed_format_kernels=1",
"build=native",
]
acl_install_dir = "/acl"
acl_checkout_dir = os.getenv("ACL_SOURCE_DIR", "ComputeLibrary")
if os.path.isdir(acl_install_dir):
shutil.rmtree(acl_install_dir)
if not os.path.isdir(acl_checkout_dir) or not len(os.listdir(acl_checkout_dir)):
check_call(
[
"git",
"clone",
"https://github.com/ARM-software/ComputeLibrary.git",
"-b",
"v25.02",
"--depth",
"1",
"--shallow-submodules",
]
)
check_call(
["scons", "Werror=1", f"-j{os.cpu_count()}"] + acl_build_flags,
cwd=acl_checkout_dir,
)
for d in ["arm_compute", "include", "utils", "support", "src", "build"]:
shutil.copytree(f"{acl_checkout_dir}/{d}", f"{acl_install_dir}/{d}")
def replace_tag(filename) -> None:
with open(filename) as f:
lines = f.readlines()
@ -356,19 +313,13 @@ if __name__ == "__main__":
build_vars += f"BUILD_TEST=0 PYTORCH_BUILD_VERSION={branch[1 : branch.find('-')]} PYTORCH_BUILD_NUMBER=1 "
if enable_mkldnn:
build_ArmComputeLibrary()
print("build pytorch with mkldnn+acl backend")
build_vars += (
"USE_MKLDNN=ON USE_MKLDNN_ACL=ON "
"ACL_ROOT_DIR=/acl "
"LD_LIBRARY_PATH=/pytorch/build/lib:/acl/build:$LD_LIBRARY_PATH "
"ACL_INCLUDE_DIR=/acl/build "
"ACL_LIBRARY=/acl/build "
)
build_vars += "USE_MKLDNN=ON USE_MKLDNN_ACL=ON "
build_vars += "ACL_ROOT_DIR=/acl "
if enable_cuda:
build_vars += "BLAS=NVPL "
else:
build_vars += "BLAS=OpenBLAS OpenBLAS_HOME=/OpenBLAS "
build_vars += "BLAS=OpenBLAS OpenBLAS_HOME=/opt/OpenBLAS "
else:
print("build pytorch without mkldnn backend")

View File

@ -299,40 +299,6 @@ def install_condaforge_python(host: RemoteHost, python_version="3.8") -> None:
)
def build_OpenBLAS(host: RemoteHost, git_clone_flags: str = "") -> None:
print("Building OpenBLAS")
host.run_cmd(
f"git clone https://github.com/xianyi/OpenBLAS -b v0.3.28 {git_clone_flags}"
)
make_flags = "NUM_THREADS=64 USE_OPENMP=1 NO_SHARED=1 DYNAMIC_ARCH=1 TARGET=ARMV8"
host.run_cmd(
f"pushd OpenBLAS && make {make_flags} -j8 && sudo make {make_flags} install && popd && rm -rf OpenBLAS"
)
def build_ArmComputeLibrary(host: RemoteHost, git_clone_flags: str = "") -> None:
print("Building Arm Compute Library")
acl_build_flags = " ".join(
[
"debug=0",
"neon=1",
"opencl=0",
"os=linux",
"openmp=1",
"cppthreads=0",
"arch=armv8a",
"multi_isa=1",
"fixed_format_kernels=1",
"build=native",
]
)
host.run_cmd(
f"git clone https://github.com/ARM-software/ComputeLibrary.git -b v25.02 {git_clone_flags}"
)
host.run_cmd(f"cd ComputeLibrary && scons Werror=1 -j8 {acl_build_flags}")
def embed_libgomp(host: RemoteHost, use_conda, wheel_name) -> None:
host.run_cmd("pip3 install auditwheel")
host.run_cmd(
@ -700,7 +666,6 @@ def start_build(
configure_system(
host, compiler=compiler, use_conda=use_conda, python_version=python_version
)
build_OpenBLAS(host, git_clone_flags)
if host.using_docker():
print("Move libgfortant.a into a standard location")
@ -723,6 +688,8 @@ def start_build(
f"git clone --recurse-submodules -b {branch} https://github.com/pytorch/pytorch {git_clone_flags}"
)
host.run_cmd("pytorch/.ci/docker/common/install_openblas.sh")
print("Building PyTorch wheel")
build_opts = ""
if pytorch_build_number is not None:
@ -743,16 +710,18 @@ def start_build(
if host.using_docker():
build_vars += " CMAKE_SHARED_LINKER_FLAGS=-Wl,-z,max-page-size=0x10000"
if enable_mkldnn:
build_ArmComputeLibrary(host, git_clone_flags)
host.run_cmd("pytorch/.ci/docker/common/install_acl.sh")
print("build pytorch with mkldnn+acl backend")
build_vars += " USE_MKLDNN=ON USE_MKLDNN_ACL=ON"
build_vars += " BLAS=OpenBLAS"
build_vars += " OpenBLAS_HOME=/opt/OpenBLAS"
build_vars += " ACL_ROOT_DIR=/acl"
host.run_cmd(
f"cd $HOME/pytorch && export ACL_ROOT_DIR=$HOME/ComputeLibrary && "
f"{build_vars} python3 -m build --wheel --no-isolation{build_opts}"
f"cd $HOME/pytorch && {build_vars} python3 -m build --wheel --no-isolation{build_opts}"
)
print("Repair the wheel")
pytorch_wheel_name = host.list_dir("pytorch/dist")[0]
ld_library_path = "$HOME/acl/build:$HOME/pytorch/build/lib"
ld_library_path = "/acl/build:$HOME/pytorch/build/lib"
host.run_cmd(
f"export LD_LIBRARY_PATH={ld_library_path} && auditwheel repair $HOME/pytorch/dist/{pytorch_wheel_name}"
)
@ -908,7 +877,7 @@ def terminate_instances(instance_type: str) -> None:
def parse_arguments():
from argparse import ArgumentParser
parser = ArgumentParser("Builid and test AARCH64 wheels using EC2")
parser = ArgumentParser("Build and test AARCH64 wheels using EC2")
parser.add_argument("--key-name", type=str)
parser.add_argument("--debug", action="store_true")
parser.add_argument("--build-only", action="store_true")

View File

@ -37,9 +37,9 @@ case ${DOCKER_TAG_PREFIX} in
rocm*)
BASE_TARGET=rocm
PYTORCH_ROCM_ARCH="gfx900;gfx906;gfx908;gfx90a;gfx942;gfx1030;gfx1100;gfx1101;gfx1102;gfx1200;gfx1201"
# add gfx950 conditionally starting in ROCm 7.0
# add gfx950, gfx115x conditionally starting in ROCm 7.0
if [[ "$ROCM_VERSION" == *"7.0"* ]]; then
PYTORCH_ROCM_ARCH="${PYTORCH_ROCM_ARCH};gfx950"
PYTORCH_ROCM_ARCH="${PYTORCH_ROCM_ARCH};gfx950;gfx1150;gfx1151"
fi
EXTRA_BUILD_ARGS="${EXTRA_BUILD_ARGS} --build-arg PYTORCH_ROCM_ARCH=${PYTORCH_ROCM_ARCH}"
;;

View File

@ -344,7 +344,7 @@ docker build \
--build-arg "NINJA_VERSION=${NINJA_VERSION:-}" \
--build-arg "KATEX=${KATEX:-}" \
--build-arg "ROCM_VERSION=${ROCM_VERSION:-}" \
--build-arg "PYTORCH_ROCM_ARCH=${PYTORCH_ROCM_ARCH:-gfx90a;gfx942}" \
--build-arg "PYTORCH_ROCM_ARCH=${PYTORCH_ROCM_ARCH:-gfx90a;gfx942;gfx1100}" \
--build-arg "IMAGE_NAME=${IMAGE_NAME}" \
--build-arg "UCX_COMMIT=${UCX_COMMIT}" \
--build-arg "UCC_COMMIT=${UCC_COMMIT}" \

View File

@ -1 +1 @@
e0dda9059d082537cee36be6c5e4fe3b18c880c0
deb42f2a8e48f5032b4a98ee781a15fa87a157cf

View File

@ -1 +1 @@
v2.28.3-1
v2.27.5-1

View File

@ -1 +1 @@
v2.28.3-1
v2.27.7-1

View File

@ -1 +1 @@
bbb06c0334a6772b92d24bde54956e675c8c6604
7416ffcb92cdbe98d9f97e4e6f95247e46dfc9fd

27
.ci/docker/common/install_acl.sh Normal file → Executable file
View File

@ -1,16 +1,27 @@
set -euo pipefail
#!/bin/bash
# Script used only in CD pipeline
readonly version=v25.02
readonly src_host=https://github.com/ARM-software
readonly src_repo=ComputeLibrary
set -eux
ACL_VERSION=${ACL_VERSION:-"v25.02"}
ACL_INSTALL_DIR="/acl"
# Clone ACL
[[ ! -d ${src_repo} ]] && git clone ${src_host}/${src_repo}.git
cd ${src_repo}
git checkout $version
git clone https://github.com/ARM-software/ComputeLibrary.git -b "${ACL_VERSION}" --depth 1 --shallow-submodules
ACL_CHECKOUT_DIR="ComputeLibrary"
# Build with scons
pushd $ACL_CHECKOUT_DIR
scons -j8 Werror=0 debug=0 neon=1 opencl=0 embed_kernels=0 \
os=linux arch=armv8a build=native multi_isa=1 \
fixed_format_kernels=1 openmp=1 cppthreads=0
popd
# Install ACL
sudo mkdir -p ${ACL_INSTALL_DIR}
for d in arm_compute include utils support src build
do
sudo cp -r ${ACL_CHECKOUT_DIR}/${d} ${ACL_INSTALL_DIR}/${d}
done
rm -rf $ACL_CHECKOUT_DIR

View File

@ -19,8 +19,8 @@ pip_install \
transformers==4.36.2
pip_install coloredlogs packaging
pip_install onnxruntime==1.22.1
pip_install onnxscript==0.4.0
pip_install onnxruntime==1.23.0
pip_install onnxscript==0.5.3
# Cache the transformers model to be used later by ONNX tests. We need to run the transformers
# package to download the model. By default, the model is cached at ~/.cache/huggingface/hub/

12
.ci/docker/common/install_openblas.sh Normal file → Executable file
View File

@ -3,8 +3,10 @@
set -ex
cd /
git clone https://github.com/OpenMathLib/OpenBLAS.git -b "${OPENBLAS_VERSION:-v0.3.30}" --depth 1 --shallow-submodules
OPENBLAS_VERSION=${OPENBLAS_VERSION:-"v0.3.30"}
# Clone OpenBLAS
git clone https://github.com/OpenMathLib/OpenBLAS.git -b "${OPENBLAS_VERSION}" --depth 1 --shallow-submodules
OPENBLAS_CHECKOUT_DIR="OpenBLAS"
OPENBLAS_BUILD_FLAGS="
@ -17,5 +19,7 @@ CFLAGS=-O3
BUILD_BFLOAT16=1
"
make -j8 ${OPENBLAS_BUILD_FLAGS} -C ${OPENBLAS_CHECKOUT_DIR}
make -j8 ${OPENBLAS_BUILD_FLAGS} install -C ${OPENBLAS_CHECKOUT_DIR}
make -j8 ${OPENBLAS_BUILD_FLAGS} -C $OPENBLAS_CHECKOUT_DIR
sudo make install -C $OPENBLAS_CHECKOUT_DIR
rm -rf $OPENBLAS_CHECKOUT_DIR

View File

@ -0,0 +1,9 @@
#!/bin/bash
set -xe
# Script used in Linux x86 and aarch64 CD pipeline
# Workaround for exposing statically linked libstdc++ CXX11 ABI symbols.
# see: https://github.com/pytorch/pytorch/issues/133437
LIBNONSHARED=$(gcc -print-file-name=libstdc++_nonshared.a)
nm -g $LIBNONSHARED | grep " T " | grep recursive_directory_iterator | cut -c 20- > weaken-symbols.txt
objcopy --weaken-symbols weaken-symbols.txt $LIBNONSHARED $LIBNONSHARED

View File

@ -46,9 +46,9 @@ case ${DOCKER_TAG_PREFIX} in
BASE_TARGET=rocm
GPU_IMAGE=rocm/dev-ubuntu-22.04:${GPU_ARCH_VERSION}-complete
PYTORCH_ROCM_ARCH="gfx900;gfx906;gfx908;gfx90a;gfx942;gfx1030;gfx1100;gfx1101;gfx1102;gfx1200;gfx1201"
# add gfx950 conditionally starting in ROCm 7.0
# add gfx950, gfx115x conditionally starting in ROCm 7.0
if [[ "$GPU_ARCH_VERSION" == *"7.0"* ]]; then
PYTORCH_ROCM_ARCH="${PYTORCH_ROCM_ARCH};gfx950"
PYTORCH_ROCM_ARCH="${PYTORCH_ROCM_ARCH};gfx950;gfx1150;gfx1151"
fi
DOCKER_GPU_BUILD_ARG="--build-arg PYTORCH_ROCM_ARCH=${PYTORCH_ROCM_ARCH} --build-arg ROCM_VERSION=${GPU_ARCH_VERSION}"
;;

View File

@ -130,7 +130,8 @@ ENV LD_LIBRARY_PATH=/opt/rh/gcc-toolset-${DEVTOOLSET_VERSION}/root/usr/lib64:/op
RUN for cpython_version in "cp312-cp312" "cp313-cp313" "cp313-cp313t"; do \
/opt/python/${cpython_version}/bin/python -m pip install setuptools wheel; \
done;
ADD ./common/patch_libstdc.sh patch_libstdc.sh
RUN bash ./patch_libstdc.sh && rm patch_libstdc.sh
# cmake-3.18.4 from pip; force in case cmake3 already exists
RUN yum install -y python3-pip && \

View File

@ -62,6 +62,13 @@ ARG OPENBLAS_VERSION
ADD ./common/install_openblas.sh install_openblas.sh
RUN bash ./install_openblas.sh && rm install_openblas.sh
# Install Arm Compute Library
FROM base as arm_compute
# use python3.9 to install scons
RUN python3.9 -m pip install scons==4.7.0
RUN ln -sf /opt/python/cp39-cp39/bin/scons /usr/local/bin
COPY ./common/install_acl.sh install_acl.sh
RUN bash ./install_acl.sh && rm install_acl.sh
FROM base as final
# remove unnecessary python versions
@ -70,4 +77,7 @@ RUN rm -rf /opt/python/cp26-cp26mu /opt/_internal/cpython-2.6.9-ucs4
RUN rm -rf /opt/python/cp33-cp33m /opt/_internal/cpython-3.3.6
RUN rm -rf /opt/python/cp34-cp34m /opt/_internal/cpython-3.4.6
COPY --from=openblas /opt/OpenBLAS/ /opt/OpenBLAS/
ENV LD_LIBRARY_PATH=/opt/OpenBLAS/lib:$LD_LIBRARY_PATH
COPY --from=arm_compute /acl /acl
ENV LD_LIBRARY_PATH=/opt/OpenBLAS/lib:/acl/build/:$LD_LIBRARY_PATH
ADD ./common/patch_libstdc.sh patch_libstdc.sh
RUN bash ./patch_libstdc.sh && rm patch_libstdc.sh

View File

@ -86,6 +86,15 @@ FROM base as nvpl
ADD ./common/install_nvpl.sh install_nvpl.sh
RUN bash ./install_nvpl.sh && rm install_nvpl.sh
# Install Arm Compute Library
FROM base as arm_compute
# use python3.9 to install scons
RUN python3.9 -m pip install scons==4.7.0
RUN ln -sf /opt/python/cp39-cp39/bin/scons /usr/local/bin
COPY ./common/install_acl.sh install_acl.sh
RUN bash ./install_acl.sh && rm install_acl.sh
FROM base as final
FROM final as cuda_final
ARG BASE_CUDA_VERSION
RUN rm -rf /usr/local/cuda-${BASE_CUDA_VERSION}
@ -93,5 +102,9 @@ COPY --from=cuda /usr/local/cuda-${BASE_CUDA_VERSION} /usr/local/cuda-${BAS
COPY --from=magma /usr/local/cuda-${BASE_CUDA_VERSION} /usr/local/cuda-${BASE_CUDA_VERSION}
COPY --from=nvpl /opt/nvpl/lib/ /usr/local/lib/
COPY --from=nvpl /opt/nvpl/include/ /usr/local/include/
COPY --from=arm_compute /acl /acl
RUN ln -sf /usr/local/cuda-${BASE_CUDA_VERSION} /usr/local/cuda
ENV PATH=/usr/local/cuda/bin:$PATH
ENV LD_LIBRARY_PATH=/acl/build/:$LD_LIBRARY_PATH
ADD ./common/patch_libstdc.sh patch_libstdc.sh
RUN bash ./patch_libstdc.sh && rm patch_libstdc.sh

View File

@ -115,6 +115,9 @@ RUN env GRPC_PYTHON_BUILD_SYSTEM_OPENSSL=True pip3 install grpcio
# cmake-3.28.0 from pip for onnxruntime
RUN python3 -mpip install cmake==3.28.0
ADD ./common/patch_libstdc.sh patch_libstdc.sh
RUN bash ./patch_libstdc.sh && rm patch_libstdc.sh
# build onnxruntime 1.21.0 from sources.
# it is not possible to build it from sources using pip,
# so just build it from upstream repository.

View File

@ -28,6 +28,7 @@ fi
MANY_LINUX_VERSION=${MANY_LINUX_VERSION:-}
DOCKERFILE_SUFFIX=${DOCKERFILE_SUFFIX:-}
OPENBLAS_VERSION=${OPENBLAS_VERSION:-}
ACL_VERSION=${ACL_VERSION:-}
case ${image} in
manylinux2_28-builder:cpu)
@ -41,7 +42,6 @@ case ${image} in
GPU_IMAGE=arm64v8/almalinux:8
DOCKER_GPU_BUILD_ARG=" --build-arg DEVTOOLSET_VERSION=13 --build-arg NINJA_VERSION=1.12.1"
MANY_LINUX_VERSION="2_28_aarch64"
OPENBLAS_VERSION="v0.3.30"
;;
manylinuxs390x-builder:cpu-s390x)
TARGET=final
@ -84,9 +84,9 @@ case ${image} in
DEVTOOLSET_VERSION="11"
GPU_IMAGE=rocm/dev-almalinux-8:${GPU_ARCH_VERSION}-complete
PYTORCH_ROCM_ARCH="gfx900;gfx906;gfx908;gfx90a;gfx942;gfx1030;gfx1100;gfx1101;gfx1102;gfx1200;gfx1201"
# add gfx950 conditionally starting in ROCm 7.0
# add gfx950, gfx115x conditionally starting in ROCm 7.0
if [[ "$GPU_ARCH_VERSION" == *"7.0"* ]]; then
PYTORCH_ROCM_ARCH="${PYTORCH_ROCM_ARCH};gfx950"
PYTORCH_ROCM_ARCH="${PYTORCH_ROCM_ARCH};gfx950;gfx1150;gfx1151"
fi
DOCKER_GPU_BUILD_ARG="--build-arg ROCM_VERSION=${GPU_ARCH_VERSION} --build-arg PYTORCH_ROCM_ARCH=${PYTORCH_ROCM_ARCH} --build-arg DEVTOOLSET_VERSION=${DEVTOOLSET_VERSION}"
;;
@ -119,7 +119,8 @@ tmp_tag=$(basename "$(mktemp -u)" | tr '[:upper:]' '[:lower:]')
DOCKER_BUILDKIT=1 docker build \
${DOCKER_GPU_BUILD_ARG} \
--build-arg "GPU_IMAGE=${GPU_IMAGE}" \
--build-arg "OPENBLAS_VERSION=${OPENBLAS_VERSION}" \
--build-arg "OPENBLAS_VERSION=${OPENBLAS_VERSION:-}" \
--build-arg "ACL_VERSION=${ACL_VERSION:-}" \
--target "${TARGET}" \
-t "${tmp_tag}" \
$@ \

View File

@ -52,10 +52,10 @@ flatbuffers==24.12.23
#Pinned versions: 24.12.23
#test that import:
hypothesis==5.35.1
hypothesis==6.56.4
# Pin hypothesis to avoid flakiness: https://github.com/pytorch/pytorch/issues/31136
#Description: advanced library for generating parametrized tests
#Pinned versions: 5.35.1
#Pinned versions: 6.56.4
#test that import: test_xnnpack_integration.py, test_pruning_op.py, test_nn.py
junitparser==2.1.1
@ -98,7 +98,7 @@ librosa==0.10.2 ; python_version == "3.12" and platform_machine != "s390x"
#Pinned versions:
#test that import:
mypy==1.16.0 ; platform_system != "Windows"
mypy==1.16.0 ; platform_system == "Linux"
# Pin MyPy version because new errors are likely to appear with each release
# Skip on Windows as lots of type annotations are POSIX specific
#Description: linter
@ -120,9 +120,8 @@ ninja==1.11.1.4
numba==0.55.2 ; python_version == "3.10" and platform_machine != "s390x"
numba==0.60.0 ; python_version == "3.12" and platform_machine != "s390x"
#Description: Just-In-Time Compiler for Numerical Functions
#Pinned versions: 0.54.1, 0.49.0, <=0.49.1
#Pinned versions: 0.55.2, 0.60.0
#test that import: test_numba_integration.py
#For numba issue see https://github.com/pytorch/pytorch/issues/51511
#Need release > 0.61.2 for s390x due to https://github.com/numba/numba/pull/10073
#numpy
@ -169,7 +168,7 @@ optree==0.13.0
pillow==11.0.0
#Description: Python Imaging Library fork
#Pinned versions: 10.3.0
#Pinned versions: 11.0.0
#test that import:
protobuf==5.29.5
@ -217,7 +216,7 @@ pytest-subtests==0.13.1
#Pinned versions:
#test that import:
xdoctest==1.1.0
xdoctest==1.3.0
#Description: runs doctests in pytest
#Pinned versions: 1.1.0
#test that import:
@ -242,10 +241,9 @@ pygments==2.15.0
#Pinned versions: 14.1.0
#test that import:
scikit-image==0.19.3 ; python_version < "3.10"
scikit-image==0.22.0 ; python_version >= "3.10"
scikit-image==0.22.0
#Description: image processing routines
#Pinned versions:
#Pinned versions: 0.22.0
#test that import: test_nn.py
#scikit-learn
@ -268,7 +266,7 @@ scipy==1.14.1 ; python_version >= "3.12"
#test that import:
# needed by torchgen utils
typing-extensions>=4.10.0
typing-extensions==4.12.2
#Description: type hints for python
#Pinned versions:
#test that import:
@ -341,7 +339,7 @@ onnx==1.18.0
#Pinned versions:
#test that import:
onnxscript==0.4.0
onnxscript==0.5.3
#Description: Required by mypy and test_public_bindings.py when checking torch.onnx._internal
#Pinned versions:
#test that import:
@ -361,9 +359,10 @@ pwlf==2.2.1
#test that import: test_sac_estimator.py
# To build PyTorch itself
pyyaml
pyyaml==6.0.2
pyzstd
setuptools>=70.1.0
setuptools==78.1.1
packaging==23.1
six
scons==4.5.2 ; platform_machine == "aarch64"
@ -384,7 +383,10 @@ cmake==3.31.6
tlparse==0.4.0
#Description: required for log parsing
cuda-bindings>=12.0,<13.0 ; platform_machine != "s390x"
filelock==3.18.0
#Description: required for inductor testing
cuda-bindings>=12.0,<13.0 ; platform_machine != "s390x" and platform_system != "Darwin"
#Description: required for testing CUDAGraph::raw_cuda_graph(). See https://nvidia.github.io/cuda-python/cuda-bindings/latest/support.html for how this version was chosen. Note "Any fix in the latest bindings would be backported to the prior major version" means that only the newest version of cuda-bindings will get fixes. Depending on the latest version of 12.x is okay because all 12.y versions will be supported via "CUDA minor version compatibility". Pytorch builds against 13.z versions of cuda toolkit work with 12.x versions of cuda-bindings as well because newer drivers work with old toolkits.
#test that import: test_cuda.py

View File

@ -5,7 +5,7 @@ DESIRED_ROCM ?= 7.0
DESIRED_ROCM_SHORT = $(subst .,,$(DESIRED_ROCM))
PACKAGE_NAME = magma-rocm
# inherit this from underlying docker image, do not pass this env var to docker
#PYTORCH_ROCM_ARCH ?= gfx900;gfx906;gfx908;gfx90a;gfx942;gfx950;gfx1030;gfx1100;gfx1101;gfx1102;gfx1200;gfx1201
#PYTORCH_ROCM_ARCH ?= gfx900;gfx906;gfx908;gfx90a;gfx942;gfx950;gfx1030;gfx1100;gfx1101;gfx1102;gfx1150;gfx1151;gfx1200;gfx1201
DOCKER_RUN = set -eou pipefail; ${DOCKER_CMD} run --rm -i \
-v $(shell git rev-parse --show-toplevel)/.ci:/builder \
@ -18,7 +18,6 @@ DOCKER_RUN = set -eou pipefail; ${DOCKER_CMD} run --rm -i \
.PHONY: all
all: magma-rocm70
all: magma-rocm64
all: magma-rocm63
.PHONY:
clean:
@ -34,8 +33,3 @@ magma-rocm70:
magma-rocm64: DESIRED_ROCM := 6.4
magma-rocm64:
$(DOCKER_RUN)
.PHONY: magma-rocm63
magma-rocm63: DESIRED_ROCM := 6.3
magma-rocm63:
$(DOCKER_RUN)

View File

@ -89,7 +89,7 @@ fi
if [[ "$BUILD_ENVIRONMENT" == *aarch64* ]]; then
export USE_MKLDNN=1
export USE_MKLDNN_ACL=1
export ACL_ROOT_DIR=/ComputeLibrary
export ACL_ROOT_DIR=/acl
fi
if [[ "$BUILD_ENVIRONMENT" == *riscv64* ]]; then

View File

@ -256,7 +256,7 @@ test_torchbench_smoketest() {
local device=mps
local dtypes=(undefined float16 bfloat16 notset)
local dtype=${dtypes[$1]}
local models=(hf_T5 llama BERT_pytorch dcgan hf_GPT2 yolov3 resnet152 sam sam_fast pytorch_unet stable_diffusion_text_encoder speech_transformer Super_SloMo doctr_det_predictor doctr_reco_predictor timm_resnet timm_vovnet vgg16)
local models=(llama BERT_pytorch dcgan yolov3 resnet152 sam sam_fast pytorch_unet stable_diffusion_text_encoder speech_transformer Super_SloMo doctr_det_predictor doctr_reco_predictor vgg16)
for backend in eager inductor; do
@ -319,7 +319,7 @@ test_aoti_torchbench_smoketest() {
local device=mps
local dtypes=(undefined float16 bfloat16 notset)
local dtype=${dtypes[$1]}
local models=(hf_T5 llama BERT_pytorch dcgan hf_GPT2 yolov3 resnet152 sam sam_fast pytorch_unet stable_diffusion_text_encoder speech_transformer Super_SloMo doctr_det_predictor doctr_reco_predictor timm_resnet timm_vovnet vgg16)
local models=(llama BERT_pytorch dcgan yolov3 resnet152 sam sam_fast pytorch_unet stable_diffusion_text_encoder speech_transformer Super_SloMo doctr_det_predictor doctr_reco_predictor vgg16)
echo "Launching torchbench inference performance run for AOT Inductor and dtype ${dtype}"
local dtype_arg="--${dtype}"

View File

@ -32,6 +32,9 @@ LIBTORCH_NAMESPACE_LIST = (
"torch::",
)
# Patterns for detecting statically linked libstdc++ symbols
STATICALLY_LINKED_CXX11_ABI = [re.compile(r".*recursive_directory_iterator.*")]
def _apply_libtorch_symbols(symbols):
return [
@ -53,12 +56,17 @@ def get_symbols(lib: str) -> list[tuple[str, str, str]]:
return [x.split(" ", 2) for x in lines.decode("latin1").split("\n")[:-1]]
def grep_symbols(lib: str, patterns: list[Any]) -> list[str]:
def grep_symbols(
lib: str, patterns: list[Any], symbol_type: str | None = None
) -> list[str]:
def _grep_symbols(
symbols: list[tuple[str, str, str]], patterns: list[Any]
) -> list[str]:
rc = []
for _s_addr, _s_type, s_name in symbols:
# Filter by symbol type if specified
if symbol_type and _s_type != symbol_type:
continue
for pattern in patterns:
if pattern.match(s_name):
rc.append(s_name)
@ -80,6 +88,18 @@ def grep_symbols(lib: str, patterns: list[Any]) -> list[str]:
return functools.reduce(list.__add__, (x.result() for x in tasks), [])
def check_lib_statically_linked_libstdc_cxx_abi_symbols(lib: str) -> None:
cxx11_statically_linked_symbols = grep_symbols(
lib, STATICALLY_LINKED_CXX11_ABI, symbol_type="T"
)
num_statically_linked_symbols = len(cxx11_statically_linked_symbols)
print(f"num_statically_linked_symbols (T): {num_statically_linked_symbols}")
if num_statically_linked_symbols > 0:
raise RuntimeError(
f"Found statically linked libstdc++ symbols (recursive_directory_iterator): {cxx11_statically_linked_symbols[:100]}"
)
def check_lib_symbols_for_abi_correctness(lib: str) -> None:
print(f"lib: {lib}")
cxx11_symbols = grep_symbols(lib, LIBTORCH_CXX11_PATTERNS)
@ -107,6 +127,7 @@ def main() -> None:
libtorch_cpu_path = str(install_root / "lib" / "libtorch_cpu.so")
check_lib_symbols_for_abi_correctness(libtorch_cpu_path)
check_lib_statically_linked_libstdc_cxx_abi_symbols(libtorch_cpu_path)
if __name__ == "__main__":

View File

@ -34,12 +34,14 @@ fi
# Patch numba to avoid CUDA-13 crash, see https://github.com/pytorch/pytorch/issues/162878
NUMBA_CUDA_DIR=$(python -c "import os;import numba.cuda; print(os.path.dirname(numba.cuda.__file__))" 2>/dev/null || true)
if [ -n "$NUMBA_CUDA_DIR" ]; then
NUMBA_PATCH="$(dirname "$(realpath "${BASH_SOURCE[0]}")")/numba-cuda-13.patch"
pushd "$NUMBA_CUDA_DIR"
patch -p4 <"$NUMBA_PATCH"
popd
if [[ "$BUILD_ENVIRONMENT" == *cuda* ]]; then
NUMBA_CUDA_DIR=$(python -c "import os;import numba.cuda; print(os.path.dirname(numba.cuda.__file__))" 2>/dev/null || true)
if [ -n "$NUMBA_CUDA_DIR" ]; then
NUMBA_PATCH="$(dirname "$(realpath "${BASH_SOURCE[0]}")")/numba-cuda-13.patch"
pushd "$NUMBA_CUDA_DIR"
patch -p4 <"$NUMBA_PATCH"
popd
fi
fi
echo "Environment variables:"
@ -836,7 +838,7 @@ test_dynamo_benchmark() {
elif [[ "${suite}" == "timm_models" ]]; then
export TORCHBENCH_ONLY_MODELS="inception_v3"
elif [[ "${suite}" == "torchbench" ]]; then
export TORCHBENCH_ONLY_MODELS="hf_Bert"
export TORCHBENCH_ONLY_MODELS="BERT_pytorch"
fi
fi
test_single_dynamo_benchmark "dashboard" "$suite" "$shard_id" "$@"
@ -867,13 +869,13 @@ test_inductor_torchbench_smoketest_perf() {
mkdir -p "$TEST_REPORTS_DIR"
python benchmarks/dynamo/torchbench.py --device cuda --performance --backend inductor --float16 --training \
--batch-size-file "$(realpath benchmarks/dynamo/torchbench_models_list.txt)" --only hf_Bert \
--batch-size-file "$(realpath benchmarks/dynamo/torchbench_models_list.txt)" --only BERT_pytorch \
--output "$TEST_REPORTS_DIR/inductor_training_smoketest.csv"
# The threshold value needs to be actively maintained to make this check useful
python benchmarks/dynamo/check_perf_csv.py -f "$TEST_REPORTS_DIR/inductor_training_smoketest.csv" -t 1.4
# Check memory compression ratio for a few models
for test in hf_Albert timm_vision_transformer; do
for test in BERT_pytorch yolov3; do
python benchmarks/dynamo/torchbench.py --device cuda --performance --backend inductor --amp --training \
--disable-cudagraphs --batch-size-file "$(realpath benchmarks/dynamo/torchbench_models_list.txt)" \
--only $test --output "$TEST_REPORTS_DIR/inductor_training_smoketest_$test.csv"
@ -884,7 +886,7 @@ test_inductor_torchbench_smoketest_perf() {
done
# Perform some "warm-start" runs for a few huggingface models.
for test in AlbertForQuestionAnswering AllenaiLongformerBase DistilBertForMaskedLM DistillGPT2 GoogleFnet YituTechConvBert; do
for test in AllenaiLongformerBase DistilBertForMaskedLM DistillGPT2 GoogleFnet YituTechConvBert; do
python benchmarks/dynamo/huggingface.py --accuracy --training --amp --inductor --device cuda --warm-start-latency \
--only $test --output "$TEST_REPORTS_DIR/inductor_warm_start_smoketest_$test.csv"
python benchmarks/dynamo/check_accuracy.py \

View File

@ -0,0 +1,32 @@
#!/bin/bash
set -ex -o pipefail
# Suppress ANSI color escape sequences
export TERM=vt100
# shellcheck source=./common.sh
source "$(dirname "${BASH_SOURCE[0]}")/common.sh"
# shellcheck source=./common-build.sh
source "$(dirname "${BASH_SOURCE[0]}")/common-build.sh"
echo "Environment variables"
env
echo "Testing FA3 stable wheel still works with currently built torch"
echo "Installing ABI Stable FA3 wheel"
# The wheel was built on https://github.com/Dao-AILab/flash-attention/commit/b3846b059bf6b143d1cd56879933be30a9f78c81
# on torch nightly torch==2.9.0.dev20250830+cu129
$MAYBE_SUDO pip -q install https://s3.amazonaws.com/ossci-linux/wheels/flash_attn_3-3.0.0b1-cp39-abi3-linux_x86_64.whl
pushd flash-attention/hopper
export PYTHONPATH=$PWD
pytest -v -s \
"test_flash_attn.py::test_flash_attn_output[1-1-192-False-False-False-0.0-False-False-mha-dtype0]" \
"test_flash_attn.py::test_flash_attn_varlen_output[511-1-64-True-False-False-0.0-False-False-gqa-dtype2]" \
"test_flash_attn.py::test_flash_attn_kvcache[1-128-128-False-False-True-None-0.0-False-False-True-False-True-False-gqa-dtype0]" \
"test_flash_attn.py::test_flash_attn_race_condition[97-97-192-True-dtype0]" \
"test_flash_attn.py::test_flash_attn_combine[2-3-64-dtype1]" \
"test_flash_attn.py::test_flash3_bw_compatibility"
popd

View File

@ -38,10 +38,12 @@ if errorlevel 1 goto fail
if not errorlevel 0 goto fail
:: Update CMake
:: TODO: Investigate why this helps MKL detection, even when CMake from choco is not used
call choco upgrade -y cmake --no-progress --installargs 'ADD_CMAKE_TO_PATH=System' --apply-install-arguments-to-dependencies --version=3.27.9
if errorlevel 1 goto fail
if not errorlevel 0 goto fail
:: TODO: Move to .ci/docker/requirements-ci.txt
call pip install mkl==2024.2.0 mkl-static==2024.2.0 mkl-include==2024.2.0
if errorlevel 1 goto fail
if not errorlevel 0 goto fail

View File

@ -37,27 +37,8 @@ if [[ "$BUILD_ENVIRONMENT" == *cuda* ]]; then
export PYTORCH_TESTING_DEVICE_ONLY_FOR="cuda"
fi
# TODO: Move both of them to Windows AMI
python -m pip install tensorboard==2.13.0 protobuf==5.29.4 pytest-subtests==0.13.1
# Copied from https://github.com/pytorch/test-infra/blob/be01a40157c36cd5a48391fdf44a7bc3ebd4c7e3/aws/ami/windows/scripts/Installers/Install-Pip-Dependencies.ps1#L16 with some adjustments
# pytest-rerunfailures==10.3 as 10.2 fails with INTERNALERROR> pluggy._manager.PluginValidationError: unknown hook 'pytest_configure_node'
# scipy from 1.6.3 to 1.10
# expecttest from 0.1.3 to 0.3.0
# xdoctest from 1.0.2 to 1.3.0
python -m pip install "future==0.18.2" "hypothesis==5.35.1" "expecttest==0.3.0" "librosa>=0.6.2" "scipy==1.10.1" "psutil==5.9.1" "pynvml==11.4.1" "pillow==9.2.0" "unittest-xml-reporting<=3.2.0,>=2.0.0" "pytest==7.1.3" "pytest-xdist==2.5.0" "pytest-flakefinder==1.1.0" "pytest-rerunfailures==10.3" "pytest-shard==0.1.2" "sympy==1.11.1" "xdoctest==1.3.0" "pygments==2.12.0" "opt-einsum>=3.3" "networkx==2.8.8" "mpmath==1.2.1" "pytest-cpp==2.3.0" "boto3==1.35.42"
# Install Z3 optional dependency for Windows builds.
python -m pip install z3-solver==4.15.1.0
# Install tlparse for test\dynamo\test_structured_trace.py UTs.
python -m pip install tlparse==0.4.0
# Install parameterized
python -m pip install parameterized==0.8.1
# Install pulp for testing ilps under torch\distributed\_tools
python -m pip install pulp==2.9.0
# TODO: Move this to .ci/docker/requirements-ci.txt
python -m pip install "psutil==5.9.1" nvidia-ml-py "pytest-shard==0.1.2"
run_tests() {
# Run nvidia-smi if available

View File

@ -37,10 +37,10 @@ IF "%CUDA_PATH_V128%"=="" (
)
IF "%BUILD_VISION%" == "" (
set TORCH_CUDA_ARCH_LIST=6.1;7.0;7.5;8.0;8.6;9.0;10.0;12.0
set TORCH_CUDA_ARCH_LIST=7.0;7.5;8.0;8.6;9.0;10.0;12.0
set TORCH_NVCC_FLAGS=-Xfatbin -compress-all
) ELSE (
set NVCC_FLAGS=-D__CUDA_NO_HALF_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_50,code=sm_50 -gencode=arch=compute_60,code=sm_60 -gencode=arch=compute_70,code=sm_70 -gencode=arch=compute_75,code=sm_75 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_86,code=compute_86 -gencode=arch=compute_90,code=compute_90 -gencode=arch=compute_100,code=compute_100 -gencode=arch=compute_120,code=compute_120
set NVCC_FLAGS=-D__CUDA_NO_HALF_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_70,code=sm_70 -gencode=arch=compute_75,code=sm_75 -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_86,code=compute_86 -gencode=arch=compute_90,code=compute_90 -gencode=arch=compute_100,code=compute_100 -gencode=arch=compute_120,code=compute_120
)
set "CUDA_PATH=%CUDA_PATH_V128%"

View File

@ -71,14 +71,7 @@ export PYTORCH_BUILD_NUMBER=1
# Set triton version as part of PYTORCH_EXTRA_INSTALL_REQUIREMENTS
TRITON_VERSION=$(cat $PYTORCH_ROOT/.ci/docker/triton_version.txt)
# Here PYTORCH_EXTRA_INSTALL_REQUIREMENTS is already set for the all the wheel builds hence append TRITON_CONSTRAINT
TRITON_CONSTRAINT="platform_system == 'Linux' and platform_machine == 'x86_64'"
# CUDA 12.9/13.0 builds have triton for Linux and Linux aarch64 binaries.
if [[ "$DESIRED_CUDA" == "cu129" ]] || [[ "$DESIRED_CUDA" == "cu130" ]]; then
TRITON_CONSTRAINT="platform_system == 'Linux'"
fi
TRITON_CONSTRAINT="platform_system == 'Linux'"
if [[ "$PACKAGE_TYPE" =~ .*wheel.* && -n "${PYTORCH_EXTRA_INSTALL_REQUIREMENTS:-}" && ! "$PYTORCH_BUILD_VERSION" =~ .*xpu.* ]]; then
TRITON_REQUIREMENT="triton==${TRITON_VERSION}; ${TRITON_CONSTRAINT}"

View File

@ -59,13 +59,14 @@ performance-*,
-performance-enum-size,
readability-container-size-empty,
readability-delete-null-pointer,
readability-duplicate-include
readability-duplicate-include,
readability-misplaced-array-index,
readability-redundant*
readability-redundant*,
readability-simplify-subscript-expr,
readability-string-compare,
-readability-redundant-access-specifiers,
-readability-redundant-control-flow,
-readability-redundant-inline-specifier,
'
HeaderFilterRegex: '^(aten/|c10/|torch/).*$'
WarningsAsErrors: '*'

View File

@ -28,6 +28,10 @@ runs:
echo "instance-type: $(get_ec2_metadata instance-type)"
echo "system info $(uname -a)"
- name: Print GPU info (if present)
shell: bash
run: if [ -f /usr/bin/nvidia-smi ]; then nvidia-smi; fi
- name: Check if in a container runner
shell: bash
id: check_container_runner
@ -82,37 +86,6 @@ runs:
# Prune all of the docker images
docker system prune -af
- name: Manually resolve download.pytorch.org
shell: bash
continue-on-error: true
run: |
set +e
set -x
PT_DOMAIN=download.pytorch.org
# TODO: Flaky access to download.pytorch.org https://github.com/pytorch/pytorch/issues/100400,
# cleaning this up once the issue is fixed. There are more than one resolved IP here, the last
# one is returned at random
RESOLVED_IP=$(dig -4 +short "${PT_DOMAIN}" | tail -n1)
if [ -z "${RESOLVED_IP}" ]; then
echo "Couldn't resolve ${PT_DOMAIN}, retrying with Google DNS..."
RESOLVED_IP=$(dig -4 +short "${PT_DOMAIN}" @8.8.8.8 | tail -n1)
if [ -z "${RESOLVED_IP}" ]; then
echo "Couldn't resolve ${PT_DOMAIN}, exiting..."
exit 1
fi
fi
if grep -r "${PT_DOMAIN}" /etc/hosts; then
# Clean up any old records first
sudo sed -i "/${PT_DOMAIN}/d" /etc/hosts
fi
echo "${RESOLVED_IP} ${PT_DOMAIN}" | sudo tee -a /etc/hosts
cat /etc/hosts
- name: Check that the docker daemon is running
shell: bash
continue-on-error: true

View File

@ -23,9 +23,6 @@ runs:
run: |
.github\scripts\kill_active_ssh_sessions.ps1
- name: Clean up leftover processes on non-ephemeral Windows runner
uses: pytorch/test-infra/.github/actions/cleanup-runner@main
# Cleaning up Windows workspace sometimes fails flakily with device or resource busy
# error, meaning one or more processes haven't stopped completely yet. So trying to
# retry this step several time similar to how checkout-pytorch GHA does

View File

@ -1 +1 @@
78a47f87ce259a48f0391fa9ae15add05ea7432b
e614ab780645b06540b9f0427c26cb77ef34b558

View File

@ -1 +1 @@
0fc62aa26a30ed7ca419d285f285cb5ba02c4394
2a9138a26ee257fef05310ad3fecf7c55fe80d73

View File

@ -202,7 +202,7 @@ ARG max_jobs=16
ENV MAX_JOBS=${max_jobs}
ARG nvcc_threads=4
ENV NVCC_THREADS=$nvcc_threads
ARG torch_cuda_arch_list='8.0;8.6;8.9;9.0'
ARG torch_cuda_arch_list='8.0 8.6 8.9 9.0'
ENV TORCH_CUDA_ARCH_LIST=${torch_cuda_arch_list}
ARG USE_SCCACHE
@ -297,16 +297,28 @@ RUN echo "[INFO] Listing current directory before torch install step:" && \
echo "[INFO] Showing torch_build_versions.txt content:" && \
cat torch_build_versions.txt
# Install build and runtime dependencies, this is needed for flashinfer install
COPY requirements/build.txt requirements/build.txt
COPY use_existing_torch.py use_existing_torch.py
RUN python3 use_existing_torch.py
RUN cat requirements/build.txt
# Install uv for faster pip installs if not existed
RUN --mount=type=cache,target=/root/.cache/uv \
if ! python3 -m uv --version > /dev/null 2>&1; then \
python3 -m pip install uv==0.8.4; \
fi
ENV UV_HTTP_TIMEOUT=500
ENV UV_INDEX_STRATEGY="unsafe-best-match"
# Use copy mode to avoid hardlink failures with Docker cache mounts
ENV UV_LINK_MODE=copy
RUN --mount=type=cache,target=/root/.cache/uv \
uv pip install --system -r requirements/build.txt
# Default mount file as placeholder, this just avoid the mount error
ARG TORCH_WHEELS_PATH="./requirements"
# Install torch, torchaudio and torchvision
@ -332,13 +344,11 @@ RUN --mount=type=cache,target=/root/.cache/uv \
# Install xformers wheel from previous stage
RUN --mount=type=cache,target=/root/.cache/uv \
uv pip install --system /wheels/xformers/*.whl --verbose
# Build flashinfer from source.
ARG torch_cuda_arch_list='8.0;8.9;9.0a;10.0a;12.0'
# install package for build flashinfer
# see issue: https://github.com/flashinfer-ai/flashinfer/issues/738
RUN pip install build==1.3.0
RUN pip freeze | grep -E 'setuptools|packaging|build'
ENV TORCH_CUDA_ARCH_LIST=${torch_cuda_arch_list}

View File

@ -1,9 +1,14 @@
import glob
import os
requires_files = glob.glob("requirements/*.txt")
requires_files += ["pyproject.toml"]
for file in requires_files:
if not os.path.exists(file):
print(f"!!! skipping missing {file}")
continue
print(f">>> cleaning {file}")
with open(file) as f:
lines = f.readlines()

View File

@ -1,37 +0,0 @@
boto3==1.35.42
build==1.2.2.post1
cmake==3.27.*
expecttest==0.3.0
fbscribelogger==0.1.7
filelock==3.18.0
hypothesis==6.56.4
librosa>=0.6.2
mpmath==1.3.0
networkx==2.8.7
ninja==1.10.2.4
numba==0.59.0
numpy==1.26.4
opt-einsum>=3.3
optree==0.13.0
packaging==23.1
parameterized==0.8.1
pillow==10.3.0
protobuf==5.29.5
psutil==5.9.8
pygments==2.15.0
pytest-cpp==2.3.0
pytest-flakefinder==1.1.0
pytest-rerunfailures==10.3
pytest-subtests==0.13.1
pytest-xdist==3.3.1
pytest==7.3.2
pyyaml==6.0.2
scipy==1.12.0
setuptools==78.1.1
sympy==1.13.3
tlparse==0.4.0
tensorboard==2.13.0
typing-extensions==4.12.2
unittest-xml-reporting<=3.2.0,>=2.0.0
xdoctest==1.1.0
z3-solver==4.15.1.0

Binary file not shown.

View File

@ -502,6 +502,7 @@ def perform_misc_tasks(
job_name: str,
pr_body: str,
branch: Optional[str] = None,
tag: Optional[str] = None,
) -> None:
"""
In addition to apply the filter logic, the script also does the following
@ -509,7 +510,9 @@ def perform_misc_tasks(
"""
set_output(
"keep-going",
branch == MAIN_BRANCH or check_for_setting(labels, pr_body, "keep-going"),
branch == MAIN_BRANCH
or bool(tag and re.match(r"^trunk/[a-f0-9]{40}$", tag))
or check_for_setting(labels, pr_body, "keep-going"),
)
set_output(
"ci-verbose-test-logs",
@ -634,6 +637,7 @@ def main() -> None:
job_name=args.job_name,
pr_body=pr_body if pr_body else "",
branch=args.branch,
tag=tag,
)
# Set the filtered test matrix as the output

View File

@ -53,7 +53,7 @@ PYTORCH_EXTRA_INSTALL_REQUIREMENTS = {
"nvidia-cusolver-cu12==11.7.1.2; platform_system == 'Linux' | "
"nvidia-cusparse-cu12==12.5.4.2; platform_system == 'Linux' | "
"nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' | "
"nvidia-nccl-cu12==2.28.3; platform_system == 'Linux' | "
"nvidia-nccl-cu12==2.27.5; platform_system == 'Linux' | "
"nvidia-nvshmem-cu12==3.3.24; platform_system == 'Linux' | "
"nvidia-nvtx-cu12==12.6.77; platform_system == 'Linux' | "
"nvidia-nvjitlink-cu12==12.6.85; platform_system == 'Linux' | "
@ -70,7 +70,7 @@ PYTORCH_EXTRA_INSTALL_REQUIREMENTS = {
"nvidia-cusolver-cu12==11.7.3.90; platform_system == 'Linux' | "
"nvidia-cusparse-cu12==12.5.8.93; platform_system == 'Linux' | "
"nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' | "
"nvidia-nccl-cu12==2.28.3; platform_system == 'Linux' | "
"nvidia-nccl-cu12==2.27.5; platform_system == 'Linux' | "
"nvidia-nvshmem-cu12==3.3.24; platform_system == 'Linux' | "
"nvidia-nvtx-cu12==12.8.90; platform_system == 'Linux' | "
"nvidia-nvjitlink-cu12==12.8.93; platform_system == 'Linux' | "
@ -87,7 +87,7 @@ PYTORCH_EXTRA_INSTALL_REQUIREMENTS = {
"nvidia-cusolver==12.0.3.29; platform_system == 'Linux' | "
"nvidia-cusparse==12.6.2.49; platform_system == 'Linux' | "
"nvidia-cusparselt-cu13==0.8.0; platform_system == 'Linux' | "
"nvidia-nccl-cu13==2.28.3; platform_system == 'Linux' | "
"nvidia-nccl-cu13==2.27.7; platform_system == 'Linux' | "
"nvidia-nvshmem-cu13==3.3.24; platform_system == 'Linux' | "
"nvidia-nvtx==13.0.39; platform_system == 'Linux' | "
"nvidia-nvjitlink==13.0.39; platform_system == 'Linux' | "

View File

@ -18,6 +18,7 @@ class GitHubComment:
body_text: str
created_at: str
author_login: str
author_url: Optional[str]
author_association: str
editor_login: Optional[str]
database_id: int

Binary file not shown.

View File

@ -38,6 +38,7 @@ def mock_get_comments() -> list[GitHubComment]:
body_text="mock_body_text",
created_at="",
author_login="",
author_url=None,
author_association="",
editor_login=None,
database_id=1,
@ -48,6 +49,7 @@ def mock_get_comments() -> list[GitHubComment]:
body_text=" #" + LABEL_ERR_MSG_TITLE.replace("`", ""),
created_at="",
author_login=BOT_AUTHORS[1],
author_url=None,
author_association="",
editor_login=None,
database_id=2,

View File

@ -32,6 +32,7 @@ from trymerge import (
main as trymerge_main,
MandatoryChecksMissingError,
MergeRule,
PostCommentError,
RE_GHSTACK_DESC,
read_merge_rules,
remove_job_name_suffix,
@ -588,6 +589,23 @@ class TestTryMerge(TestCase):
self.assertEqual(mock_merge_base, pr.get_merge_base())
mocked_gh_fetch_merge_base.assert_called_once()
def test_app_can_revert(self, *args: Any) -> None:
pr = GitHubPR("pytorch", "pytorch", 164660)
repo = DummyGitRepo()
app_comment_id, impostor_comment_id = 3375785595, 3377647892
# Check that app can revert
self.assertIsNotNone(validate_revert(repo, pr, comment_id=app_comment_id))
# But impostor can not
self.assertRaises(
PostCommentError,
lambda: validate_revert(repo, pr, comment_id=impostor_comment_id),
)
# Despite it's name being the name of the bot
self.assertEqual(
pr.get_comment_by_id(impostor_comment_id).author_login,
"pytorch-auto-revert",
)
@mock.patch("trymerge.gh_graphql", side_effect=mocked_gh_graphql)
@mock.patch("trymerge.gh_fetch_merge_base", return_value="")

View File

@ -234,6 +234,7 @@ query ($owner: String!, $name: String!, $number: Int!) {
createdAt
author {
login
url
}
authorAssociation
editor {
@ -1093,6 +1094,7 @@ class GitHubPR:
body_text=node["bodyText"],
created_at=node["createdAt"] if "createdAt" in node else "",
author_login=node["author"]["login"],
author_url=node["author"].get("url", None),
author_association=node["authorAssociation"],
editor_login=editor["login"] if editor else None,
database_id=node["databaseId"],
@ -2029,6 +2031,11 @@ def validate_revert(
# For some reason, one can not be a member of private repo, only CONTRIBUTOR
if pr.is_base_repo_private():
allowed_reverters.append("CONTRIBUTOR")
# Special case the pytorch-auto-revert app, whose does not have association
# But should be able to issue revert command
if comment.author_url == "https://github.com/apps/pytorch-auto-revert":
allowed_reverters.append("NONE")
if author_association not in allowed_reverters:
raise PostCommentError(
f"Will not revert as @{author_login} is not one of "

View File

@ -40,6 +40,15 @@ jobs:
# Use gh CLI to get changed files in the PR with explicit repo
CHANGED_FILES=$(gh api repos/${{ github.repository }}/pulls/$PR_NUMBER/files --paginate --jq '.[] | select(.status != "removed") | .filename' | tr '\n' ' ' | sed 's/ $//')
# See https://github.com/pytorch/pytorch/pull/134215#issuecomment-2332128790
PYI_FILES_TO_ADD=""
for file in ${CHANGED_FILES}; do
if [[ "${file}" == *".pyi.in" ]]; then
PYI_FILES_TO_ADD="${PYI_FILES_TO_ADD} ${file//.in/}"
fi
done
CHANGED_FILES="${CHANGED_FILES}${PYI_FILES_TO_ADD}"
if [ -z "$CHANGED_FILES" ]; then
echo "No changed files found, setting to '*'"
CHANGED_FILES="*"

View File

@ -0,0 +1,255 @@
# The point of this workflow is to test that a FA3 wheel that was built based off the
# stable ABI as of torch nightly 20250830 can still run on the newer torch.
#
# This workflow is very similar to the _linux-test.yml workflow, with the following
# differences:
# 1. It is simpler (there is no test matrix)
# 2. It pulls flash-attention as a secondary repository in order to access the tests.
# Note that it does not BUILD anything from flash-attention, as we have a prebuilt
# wheel. We pull flash-attention only to run a few tests.
# 3. It runs only FA3 tests. No PyTorch tests are run.
name: linux-test-stable-fa3
on:
workflow_call:
inputs:
build-environment:
required: true
type: string
description: Top-level label for what's being built/tested.
docker-image:
required: true
type: string
description: Docker image to run in.
timeout-minutes:
required: false
type: number
default: 30
description: |
Set the maximum (in minutes) how long the workflow should take to finish
s3-bucket:
description: S3 bucket to download artifact
required: false
type: string
default: "gha-artifacts"
secrets:
HUGGING_FACE_HUB_TOKEN:
required: false
description: |
HF Auth token to avoid rate limits when downloading models or datasets from hub
VLLM_TEST_HUGGING_FACE_TOKEN:
required: false
description: |
HF Auth token to test vllm
SCRIBE_GRAPHQL_ACCESS_TOKEN:
required: false
description: |
FB app token to write to scribe endpoint
env:
GIT_DEFAULT_BRANCH: ${{ github.event.repository.default_branch }}
jobs:
test:
# Don't run on forked repos
if: github.repository_owner == 'pytorch'
runs-on: linux.aws.h100
timeout-minutes: ${{ inputs.timeout-minutes || 30 }}
permissions:
id-token: write
contents: read
steps:
- name: Checkout PyTorch
uses: pytorch/pytorch/.github/actions/checkout-pytorch@main
with:
no-sudo: true
- name: Checkout flash-attention as a secondary repository
uses: actions/checkout@v4
with:
repository: Dao-AILab/flash-attention
path: flash-attention
- name: Setup Linux
uses: ./.github/actions/setup-linux
- name: Calculate docker image
id: calculate-docker-image
uses: pytorch/test-infra/.github/actions/calculate-docker-image@main
with:
docker-image-name: ${{ inputs.docker-image }}
- name: Use following to pull public copy of the image
id: print-ghcr-mirror
env:
ECR_DOCKER_IMAGE: ${{ steps.calculate-docker-image.outputs.docker-image }}
shell: bash
run: |
tag=${ECR_DOCKER_IMAGE##*:}
echo "docker pull ghcr.io/pytorch/ci-image:${tag/:/-}"
- name: Pull docker image
uses: pytorch/test-infra/.github/actions/pull-docker-image@main
with:
docker-image: ${{ steps.calculate-docker-image.outputs.docker-image }}
- name: Check if in a container runner
shell: bash
id: check_container_runner
run: echo "IN_CONTAINER_RUNNER=$(if [ -f /.inarc ] || [ -f /.incontainer ]; then echo true ; else echo false; fi)" >> "$GITHUB_OUTPUT"
- name: Setup GPU_FLAG for docker run
id: setup-gpu-flag
run: echo "GPU_FLAG=--gpus all -e NVIDIA_DRIVER_CAPABILITIES=all" >> "${GITHUB_ENV}"
- name: Setup SCCACHE_SERVER_PORT environment for docker run when on container
id: setup-sscache-port-flag
run: echo "SCCACHE_SERVER_PORT_DOCKER_FLAG=-e SCCACHE_SERVER_PORT=$((RUNNER_UID + 4226))" >> "${GITHUB_ENV}"
if: ${{ steps.check_container_runner.outputs.IN_CONTAINER_RUNNER == 'true' }}
- name: Get workflow job id
id: get-job-id
uses: ./.github/actions/get-workflow-job-id
if: always()
with:
github-token: ${{ secrets.GITHUB_TOKEN }}
- name: Download build artifacts
uses: ./.github/actions/download-build-artifacts
with:
name: ${{ inputs.build-environment }}
s3-bucket: ${{ inputs.s3-bucket }}
- name: Parse ref
id: parse-ref
run: .github/scripts/parse_ref.py
- name: Set Test step time
id: test-timeout
shell: bash
env:
JOB_TIMEOUT: ${{ inputs.timeout-minutes }}
run: |
echo "timeout=$((JOB_TIMEOUT-30))" >> "${GITHUB_OUTPUT}"
- name: Preserve github env variables for use in docker
shell: bash
run: |
env | grep '^GITHUB' >> "/tmp/github_env_${GITHUB_RUN_ID}"
env | grep '^CI' >> "/tmp/github_env_${GITHUB_RUN_ID}"
- name: Test
id: test
timeout-minutes: ${{ fromJson(steps.test-timeout.outputs.timeout) }}
env:
BUILD_ENVIRONMENT: ${{ inputs.build-environment }}
PR_NUMBER: ${{ github.event.pull_request.number }}
GITHUB_REPOSITORY: ${{ github.repository }}
GITHUB_WORKFLOW: ${{ github.workflow }}
GITHUB_JOB: ${{ github.job }}
GITHUB_RUN_ID: ${{ github.run_id }}
GITHUB_RUN_NUMBER: ${{ github.run_number }}
GITHUB_RUN_ATTEMPT: ${{ github.run_attempt }}
JOB_ID: ${{ steps.get-job-id.outputs.job-id }}
JOB_NAME: ${{ steps.get-job-id.outputs.job-name }}
BRANCH: ${{ steps.parse-ref.outputs.branch }}
SHA1: ${{ github.event.pull_request.head.sha || github.sha }}
BASE_SHA: ${{ github.event.pull_request.base.sha || github.sha }}
SHM_SIZE: '2g'
DOCKER_IMAGE: ${{ inputs.docker-image }}
VLLM_TEST_HUGGING_FACE_TOKEN: ${{ secrets.VLLM_TEST_HUGGING_FACE_TOKEN }}
HUGGING_FACE_HUB_TOKEN: ${{ secrets.HUGGING_FACE_HUB_TOKEN }}
SCRIBE_GRAPHQL_ACCESS_TOKEN: ${{ secrets.SCRIBE_GRAPHQL_ACCESS_TOKEN }}
ARTIFACTS_FILE_SUFFIX: ${{ github.job }}-${{ steps.get-job-id.outputs.job-id }}
run: |
set -x
TEST_COMMAND=.ci/pytorch/test_fa3_abi_stable.sh
# Leaving 1GB for the runner and other things
TOTAL_AVAILABLE_MEMORY_IN_GB=$(awk '/MemTotal/ { printf "%.3f \n", $2/1024/1024 - 1 }' /proc/meminfo)
# https://docs.docker.com/engine/containers/resource_constraints/#--memory-swap-details, the 3GB swap
# comes from https://github.com/pytorch/test-infra/pull/6058
TOTAL_MEMORY_WITH_SWAP=$(("${TOTAL_AVAILABLE_MEMORY_IN_GB%.*}" + 3))
SHM_OPTS="--shm-size=${SHM_SIZE}"
JENKINS_USER="--user jenkins"
DOCKER_SHELL_CMD=
# detached container should get cleaned up by teardown_ec2_linux
# TODO: Stop building test binaries as part of the build phase
# Used for GPU_FLAG, SHM_OPTS, JENKINS_USER and DOCKER_SHELL_CMD since that doesn't play nice
# shellcheck disable=SC2086,SC2090
container_name=$(docker run \
${GPU_FLAG:-} \
${SCCACHE_SERVER_PORT_DOCKER_FLAG:-} \
-e BUILD_ENVIRONMENT \
-e PR_NUMBER \
-e GITHUB_ACTIONS \
-e GITHUB_REPOSITORY \
-e GITHUB_WORKFLOW \
-e GITHUB_JOB \
-e GITHUB_RUN_ID \
-e GITHUB_RUN_NUMBER \
-e GITHUB_RUN_ATTEMPT \
-e JOB_ID \
-e JOB_NAME \
-e BASE_SHA \
-e BRANCH \
-e SHA1 \
-e MAX_JOBS="$(nproc --ignore=2)" \
-e HUGGING_FACE_HUB_TOKEN \
-e VLLM_TEST_HUGGING_FACE_TOKEN \
-e SCRIBE_GRAPHQL_ACCESS_TOKEN \
-e ARTIFACTS_FILE_SUFFIX \
--memory="${TOTAL_AVAILABLE_MEMORY_IN_GB%.*}g" \
--memory-swap="${TOTAL_MEMORY_WITH_SWAP}g" \
--env-file="/tmp/github_env_${GITHUB_RUN_ID}" \
--security-opt seccomp=unconfined \
--cap-add=SYS_PTRACE \
--ipc=host \
${SHM_OPTS} \
--tty \
--detach \
--name="${container_name}" \
${JENKINS_USER} \
-v "${GITHUB_WORKSPACE}:/var/lib/jenkins/workspace" \
-w /var/lib/jenkins/workspace \
"${DOCKER_IMAGE}" \
${DOCKER_SHELL_CMD}
)
echo "DOCKER_CONTAINER_ID=${container_name}" >> "${GITHUB_ENV}"
docker exec -t "${container_name}" sh -c "python3 -m pip install $(echo dist/*.whl)[opt-einsum] && ${TEST_COMMAND}"
- name: Collect backtraces from coredumps (if any)
if: always()
run: |
# shellcheck disable=SC2156
find . -iname "core.[1-9]*" -exec docker exec "${DOCKER_CONTAINER_ID}" sh -c "gdb python {} -ex 'bt' -ex 'q'" \;
- name: Store Core dumps on S3
uses: seemethere/upload-artifact-s3@baba72d0712b404f646cebe0730933554ebce96a # v5.1.0
if: failure()
with:
name: coredumps-fa3-stable-abi-smoke-tests
retention-days: 14
if-no-files-found: ignore
path: ./**/core.[1-9]*
- name: Upload utilization stats
if: ${{ always() && steps.test.conclusion && steps.test.conclusion != 'skipped' }}
continue-on-error: true
uses: ./.github/actions/upload-utilization-stats
with:
job_id: ${{ steps.get-job-id.outputs.job-id }}
job_name: ${{ steps.get-job-id.outputs.job-name }}
workflow_name: ${{ github.workflow }}
workflow_run_id: ${{github.run_id}}
workflow_attempt: ${{github.run_attempt}}
- name: Teardown Linux
uses: pytorch/test-infra/.github/actions/teardown-linux@main
if: always() && steps.check_container_runner.outputs.IN_CONTAINER_RUNNER == 'false'

View File

@ -85,7 +85,7 @@ jobs:
uses: pytorch/test-infra/.github/actions/setup-python@main
with:
python-version: ${{ inputs.python-version }}
pip-requirements-file: .github/requirements/pip-requirements-macOS.txt
pip-requirements-file: .ci/docker/requirements-ci.txt
- name: Install sccache (only for non-forked PRs, and pushes to trunk)
uses: nick-fields/retry@7152eba30c6575329ac0576536151aca5a72780e # v3.0.0

View File

@ -122,7 +122,7 @@ jobs:
uses: pytorch/test-infra/.github/actions/setup-python@main
with:
python-version: ${{ inputs.python-version }}
pip-requirements-file: .github/requirements/pip-requirements-macOS.txt
pip-requirements-file: .ci/docker/requirements-ci.txt
- name: Start monitoring script
id: monitor-script

View File

@ -84,9 +84,6 @@ jobs:
# in https://github.com/actions/checkout/issues/1018
git config --global core.fsmonitor false
- name: Clean up leftover processes on non-ephemeral Windows runner
uses: pytorch/test-infra/.github/actions/cleanup-runner@main
- name: Setup SSH (Click me for login details)
uses: pytorch/test-infra/.github/actions/setup-ssh@main
with:

View File

@ -77,9 +77,6 @@ jobs:
# in https://github.com/actions/checkout/issues/1018
git config --global core.fsmonitor false
- name: Clean up leftover processes on non-ephemeral Windows runner
uses: pytorch/test-infra/.github/actions/cleanup-runner@main
- name: Setup SSH (Click me for login details)
uses: pytorch/test-infra/.github/actions/setup-ssh@main
with:
@ -106,18 +103,6 @@ jobs:
with:
cuda-version: ${{ inputs.cuda-version }}
# TODO: Move to a requirements.txt file for windows
- name: Install pip dependencies
uses: nick-fields/retry@7152eba30c6575329ac0576536151aca5a72780e # v3.0.0
with:
shell: bash
timeout_minutes: 5
max_attempts: 5
retry_wait_seconds: 30
command: |
set -eu
python3 -m pip install 'xdoctest>=1.1.0'
- name: Get workflow job id
id: get-job-id
uses: ./.github/actions/get-workflow-job-id
@ -272,15 +257,6 @@ jobs:
shell: bash
run: python3 .github/scripts/parse_ref.py
- name: Uninstall PyTorch
if: always()
continue-on-error: true
shell: bash
run: |
# This step removes PyTorch installed by the test to give a clean slate
# to the next job
python3 -mpip uninstall -y torch
- name: Teardown Windows
uses: ./.github/actions/teardown-win
if: always()

View File

@ -132,7 +132,7 @@ jobs:
ALPINE_IMAGE: "arm64v8/alpine"
build_name: manywheel-py3_10-cuda-aarch64-12_6
build_environment: linux-aarch64-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.6.77; platform_system == 'Linux' | nvidia-cuda-runtime-cu12==12.6.77; platform_system == 'Linux' | nvidia-cuda-cupti-cu12==12.6.80; platform_system == 'Linux' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' | nvidia-cublas-cu12==12.6.4.1; platform_system == 'Linux' | nvidia-cufft-cu12==11.3.0.4; platform_system == 'Linux' | nvidia-curand-cu12==10.3.7.77; platform_system == 'Linux' | nvidia-cusolver-cu12==11.7.1.2; platform_system == 'Linux' | nvidia-cusparse-cu12==12.5.4.2; platform_system == 'Linux' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' | nvidia-nccl-cu12==2.28.3; platform_system == 'Linux' | nvidia-nvshmem-cu12==3.3.24; platform_system == 'Linux' | nvidia-nvtx-cu12==12.6.77; platform_system == 'Linux' | nvidia-nvjitlink-cu12==12.6.85; platform_system == 'Linux' | nvidia-cufile-cu12==1.11.1.6; platform_system == 'Linux'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.6.77; platform_system == 'Linux' | nvidia-cuda-runtime-cu12==12.6.77; platform_system == 'Linux' | nvidia-cuda-cupti-cu12==12.6.80; platform_system == 'Linux' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' | nvidia-cublas-cu12==12.6.4.1; platform_system == 'Linux' | nvidia-cufft-cu12==11.3.0.4; platform_system == 'Linux' | nvidia-curand-cu12==10.3.7.77; platform_system == 'Linux' | nvidia-cusolver-cu12==11.7.1.2; platform_system == 'Linux' | nvidia-cusparse-cu12==12.5.4.2; platform_system == 'Linux' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' | nvidia-nccl-cu12==2.27.5; platform_system == 'Linux' | nvidia-nvshmem-cu12==3.3.24; platform_system == 'Linux' | nvidia-nvtx-cu12==12.6.77; platform_system == 'Linux' | nvidia-nvjitlink-cu12==12.6.85; platform_system == 'Linux' | nvidia-cufile-cu12==1.11.1.6; platform_system == 'Linux'
timeout-minutes: 420
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
@ -178,7 +178,7 @@ jobs:
ALPINE_IMAGE: "arm64v8/alpine"
build_name: manywheel-py3_10-cuda-aarch64-12_8
build_environment: linux-aarch64-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.8.93; platform_system == 'Linux' | nvidia-cuda-runtime-cu12==12.8.90; platform_system == 'Linux' | nvidia-cuda-cupti-cu12==12.8.90; platform_system == 'Linux' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' | nvidia-cublas-cu12==12.8.4.1; platform_system == 'Linux' | nvidia-cufft-cu12==11.3.3.83; platform_system == 'Linux' | nvidia-curand-cu12==10.3.9.90; platform_system == 'Linux' | nvidia-cusolver-cu12==11.7.3.90; platform_system == 'Linux' | nvidia-cusparse-cu12==12.5.8.93; platform_system == 'Linux' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' | nvidia-nccl-cu12==2.28.3; platform_system == 'Linux' | nvidia-nvshmem-cu12==3.3.24; platform_system == 'Linux' | nvidia-nvtx-cu12==12.8.90; platform_system == 'Linux' | nvidia-nvjitlink-cu12==12.8.93; platform_system == 'Linux' | nvidia-cufile-cu12==1.13.1.3; platform_system == 'Linux'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.8.93; platform_system == 'Linux' | nvidia-cuda-runtime-cu12==12.8.90; platform_system == 'Linux' | nvidia-cuda-cupti-cu12==12.8.90; platform_system == 'Linux' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' | nvidia-cublas-cu12==12.8.4.1; platform_system == 'Linux' | nvidia-cufft-cu12==11.3.3.83; platform_system == 'Linux' | nvidia-curand-cu12==10.3.9.90; platform_system == 'Linux' | nvidia-cusolver-cu12==11.7.3.90; platform_system == 'Linux' | nvidia-cusparse-cu12==12.5.8.93; platform_system == 'Linux' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' | nvidia-nccl-cu12==2.27.5; platform_system == 'Linux' | nvidia-nvshmem-cu12==3.3.24; platform_system == 'Linux' | nvidia-nvtx-cu12==12.8.90; platform_system == 'Linux' | nvidia-nvjitlink-cu12==12.8.93; platform_system == 'Linux' | nvidia-cufile-cu12==1.13.1.3; platform_system == 'Linux'
timeout-minutes: 420
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
@ -224,7 +224,7 @@ jobs:
ALPINE_IMAGE: "arm64v8/alpine"
build_name: manywheel-py3_10-cuda-aarch64-13_0
build_environment: linux-aarch64-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc==13.0.48; platform_system == 'Linux' | nvidia-cuda-runtime==13.0.48; platform_system == 'Linux' | nvidia-cuda-cupti==13.0.48; platform_system == 'Linux' | nvidia-cudnn-cu13==9.13.0.50; platform_system == 'Linux' | nvidia-cublas==13.0.0.19; platform_system == 'Linux' | nvidia-cufft==12.0.0.15; platform_system == 'Linux' | nvidia-curand==10.4.0.35; platform_system == 'Linux' | nvidia-cusolver==12.0.3.29; platform_system == 'Linux' | nvidia-cusparse==12.6.2.49; platform_system == 'Linux' | nvidia-cusparselt-cu13==0.8.0; platform_system == 'Linux' | nvidia-nccl-cu13==2.28.3; platform_system == 'Linux' | nvidia-nvshmem-cu13==3.3.24; platform_system == 'Linux' | nvidia-nvtx==13.0.39; platform_system == 'Linux' | nvidia-nvjitlink==13.0.39; platform_system == 'Linux' | nvidia-cufile==1.15.0.42; platform_system == 'Linux'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc==13.0.48; platform_system == 'Linux' | nvidia-cuda-runtime==13.0.48; platform_system == 'Linux' | nvidia-cuda-cupti==13.0.48; platform_system == 'Linux' | nvidia-cudnn-cu13==9.13.0.50; platform_system == 'Linux' | nvidia-cublas==13.0.0.19; platform_system == 'Linux' | nvidia-cufft==12.0.0.15; platform_system == 'Linux' | nvidia-curand==10.4.0.35; platform_system == 'Linux' | nvidia-cusolver==12.0.3.29; platform_system == 'Linux' | nvidia-cusparse==12.6.2.49; platform_system == 'Linux' | nvidia-cusparselt-cu13==0.8.0; platform_system == 'Linux' | nvidia-nccl-cu13==2.27.7; platform_system == 'Linux' | nvidia-nvshmem-cu13==3.3.24; platform_system == 'Linux' | nvidia-nvtx==13.0.39; platform_system == 'Linux' | nvidia-nvjitlink==13.0.39; platform_system == 'Linux' | nvidia-cufile==1.15.0.42; platform_system == 'Linux'
timeout-minutes: 420
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
@ -335,7 +335,7 @@ jobs:
ALPINE_IMAGE: "arm64v8/alpine"
build_name: manywheel-py3_11-cuda-aarch64-12_6
build_environment: linux-aarch64-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.6.77; platform_system == 'Linux' | nvidia-cuda-runtime-cu12==12.6.77; platform_system == 'Linux' | nvidia-cuda-cupti-cu12==12.6.80; platform_system == 'Linux' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' | nvidia-cublas-cu12==12.6.4.1; platform_system == 'Linux' | nvidia-cufft-cu12==11.3.0.4; platform_system == 'Linux' | nvidia-curand-cu12==10.3.7.77; platform_system == 'Linux' | nvidia-cusolver-cu12==11.7.1.2; platform_system == 'Linux' | nvidia-cusparse-cu12==12.5.4.2; platform_system == 'Linux' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' | nvidia-nccl-cu12==2.28.3; platform_system == 'Linux' | nvidia-nvshmem-cu12==3.3.24; platform_system == 'Linux' | nvidia-nvtx-cu12==12.6.77; platform_system == 'Linux' | nvidia-nvjitlink-cu12==12.6.85; platform_system == 'Linux' | nvidia-cufile-cu12==1.11.1.6; platform_system == 'Linux'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.6.77; platform_system == 'Linux' | nvidia-cuda-runtime-cu12==12.6.77; platform_system == 'Linux' | nvidia-cuda-cupti-cu12==12.6.80; platform_system == 'Linux' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' | nvidia-cublas-cu12==12.6.4.1; platform_system == 'Linux' | nvidia-cufft-cu12==11.3.0.4; platform_system == 'Linux' | nvidia-curand-cu12==10.3.7.77; platform_system == 'Linux' | nvidia-cusolver-cu12==11.7.1.2; platform_system == 'Linux' | nvidia-cusparse-cu12==12.5.4.2; platform_system == 'Linux' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' | nvidia-nccl-cu12==2.27.5; platform_system == 'Linux' | nvidia-nvshmem-cu12==3.3.24; platform_system == 'Linux' | nvidia-nvtx-cu12==12.6.77; platform_system == 'Linux' | nvidia-nvjitlink-cu12==12.6.85; platform_system == 'Linux' | nvidia-cufile-cu12==1.11.1.6; platform_system == 'Linux'
timeout-minutes: 420
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
@ -381,7 +381,7 @@ jobs:
ALPINE_IMAGE: "arm64v8/alpine"
build_name: manywheel-py3_11-cuda-aarch64-12_8
build_environment: linux-aarch64-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.8.93; platform_system == 'Linux' | nvidia-cuda-runtime-cu12==12.8.90; platform_system == 'Linux' | nvidia-cuda-cupti-cu12==12.8.90; platform_system == 'Linux' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' | nvidia-cublas-cu12==12.8.4.1; platform_system == 'Linux' | nvidia-cufft-cu12==11.3.3.83; platform_system == 'Linux' | nvidia-curand-cu12==10.3.9.90; platform_system == 'Linux' | nvidia-cusolver-cu12==11.7.3.90; platform_system == 'Linux' | nvidia-cusparse-cu12==12.5.8.93; platform_system == 'Linux' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' | nvidia-nccl-cu12==2.28.3; platform_system == 'Linux' | nvidia-nvshmem-cu12==3.3.24; platform_system == 'Linux' | nvidia-nvtx-cu12==12.8.90; platform_system == 'Linux' | nvidia-nvjitlink-cu12==12.8.93; platform_system == 'Linux' | nvidia-cufile-cu12==1.13.1.3; platform_system == 'Linux'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.8.93; platform_system == 'Linux' | nvidia-cuda-runtime-cu12==12.8.90; platform_system == 'Linux' | nvidia-cuda-cupti-cu12==12.8.90; platform_system == 'Linux' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' | nvidia-cublas-cu12==12.8.4.1; platform_system == 'Linux' | nvidia-cufft-cu12==11.3.3.83; platform_system == 'Linux' | nvidia-curand-cu12==10.3.9.90; platform_system == 'Linux' | nvidia-cusolver-cu12==11.7.3.90; platform_system == 'Linux' | nvidia-cusparse-cu12==12.5.8.93; platform_system == 'Linux' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' | nvidia-nccl-cu12==2.27.5; platform_system == 'Linux' | nvidia-nvshmem-cu12==3.3.24; platform_system == 'Linux' | nvidia-nvtx-cu12==12.8.90; platform_system == 'Linux' | nvidia-nvjitlink-cu12==12.8.93; platform_system == 'Linux' | nvidia-cufile-cu12==1.13.1.3; platform_system == 'Linux'
timeout-minutes: 420
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
@ -427,7 +427,7 @@ jobs:
ALPINE_IMAGE: "arm64v8/alpine"
build_name: manywheel-py3_11-cuda-aarch64-13_0
build_environment: linux-aarch64-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc==13.0.48; platform_system == 'Linux' | nvidia-cuda-runtime==13.0.48; platform_system == 'Linux' | nvidia-cuda-cupti==13.0.48; platform_system == 'Linux' | nvidia-cudnn-cu13==9.13.0.50; platform_system == 'Linux' | nvidia-cublas==13.0.0.19; platform_system == 'Linux' | nvidia-cufft==12.0.0.15; platform_system == 'Linux' | nvidia-curand==10.4.0.35; platform_system == 'Linux' | nvidia-cusolver==12.0.3.29; platform_system == 'Linux' | nvidia-cusparse==12.6.2.49; platform_system == 'Linux' | nvidia-cusparselt-cu13==0.8.0; platform_system == 'Linux' | nvidia-nccl-cu13==2.28.3; platform_system == 'Linux' | nvidia-nvshmem-cu13==3.3.24; platform_system == 'Linux' | nvidia-nvtx==13.0.39; platform_system == 'Linux' | nvidia-nvjitlink==13.0.39; platform_system == 'Linux' | nvidia-cufile==1.15.0.42; platform_system == 'Linux'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc==13.0.48; platform_system == 'Linux' | nvidia-cuda-runtime==13.0.48; platform_system == 'Linux' | nvidia-cuda-cupti==13.0.48; platform_system == 'Linux' | nvidia-cudnn-cu13==9.13.0.50; platform_system == 'Linux' | nvidia-cublas==13.0.0.19; platform_system == 'Linux' | nvidia-cufft==12.0.0.15; platform_system == 'Linux' | nvidia-curand==10.4.0.35; platform_system == 'Linux' | nvidia-cusolver==12.0.3.29; platform_system == 'Linux' | nvidia-cusparse==12.6.2.49; platform_system == 'Linux' | nvidia-cusparselt-cu13==0.8.0; platform_system == 'Linux' | nvidia-nccl-cu13==2.27.7; platform_system == 'Linux' | nvidia-nvshmem-cu13==3.3.24; platform_system == 'Linux' | nvidia-nvtx==13.0.39; platform_system == 'Linux' | nvidia-nvjitlink==13.0.39; platform_system == 'Linux' | nvidia-cufile==1.15.0.42; platform_system == 'Linux'
timeout-minutes: 420
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
@ -538,7 +538,7 @@ jobs:
ALPINE_IMAGE: "arm64v8/alpine"
build_name: manywheel-py3_12-cuda-aarch64-12_6
build_environment: linux-aarch64-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.6.77; platform_system == 'Linux' | nvidia-cuda-runtime-cu12==12.6.77; platform_system == 'Linux' | nvidia-cuda-cupti-cu12==12.6.80; platform_system == 'Linux' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' | nvidia-cublas-cu12==12.6.4.1; platform_system == 'Linux' | nvidia-cufft-cu12==11.3.0.4; platform_system == 'Linux' | nvidia-curand-cu12==10.3.7.77; platform_system == 'Linux' | nvidia-cusolver-cu12==11.7.1.2; platform_system == 'Linux' | nvidia-cusparse-cu12==12.5.4.2; platform_system == 'Linux' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' | nvidia-nccl-cu12==2.28.3; platform_system == 'Linux' | nvidia-nvshmem-cu12==3.3.24; platform_system == 'Linux' | nvidia-nvtx-cu12==12.6.77; platform_system == 'Linux' | nvidia-nvjitlink-cu12==12.6.85; platform_system == 'Linux' | nvidia-cufile-cu12==1.11.1.6; platform_system == 'Linux'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.6.77; platform_system == 'Linux' | nvidia-cuda-runtime-cu12==12.6.77; platform_system == 'Linux' | nvidia-cuda-cupti-cu12==12.6.80; platform_system == 'Linux' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' | nvidia-cublas-cu12==12.6.4.1; platform_system == 'Linux' | nvidia-cufft-cu12==11.3.0.4; platform_system == 'Linux' | nvidia-curand-cu12==10.3.7.77; platform_system == 'Linux' | nvidia-cusolver-cu12==11.7.1.2; platform_system == 'Linux' | nvidia-cusparse-cu12==12.5.4.2; platform_system == 'Linux' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' | nvidia-nccl-cu12==2.27.5; platform_system == 'Linux' | nvidia-nvshmem-cu12==3.3.24; platform_system == 'Linux' | nvidia-nvtx-cu12==12.6.77; platform_system == 'Linux' | nvidia-nvjitlink-cu12==12.6.85; platform_system == 'Linux' | nvidia-cufile-cu12==1.11.1.6; platform_system == 'Linux'
timeout-minutes: 420
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
@ -584,7 +584,7 @@ jobs:
ALPINE_IMAGE: "arm64v8/alpine"
build_name: manywheel-py3_12-cuda-aarch64-12_8
build_environment: linux-aarch64-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.8.93; platform_system == 'Linux' | nvidia-cuda-runtime-cu12==12.8.90; platform_system == 'Linux' | nvidia-cuda-cupti-cu12==12.8.90; platform_system == 'Linux' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' | nvidia-cublas-cu12==12.8.4.1; platform_system == 'Linux' | nvidia-cufft-cu12==11.3.3.83; platform_system == 'Linux' | nvidia-curand-cu12==10.3.9.90; platform_system == 'Linux' | nvidia-cusolver-cu12==11.7.3.90; platform_system == 'Linux' | nvidia-cusparse-cu12==12.5.8.93; platform_system == 'Linux' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' | nvidia-nccl-cu12==2.28.3; platform_system == 'Linux' | nvidia-nvshmem-cu12==3.3.24; platform_system == 'Linux' | nvidia-nvtx-cu12==12.8.90; platform_system == 'Linux' | nvidia-nvjitlink-cu12==12.8.93; platform_system == 'Linux' | nvidia-cufile-cu12==1.13.1.3; platform_system == 'Linux'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.8.93; platform_system == 'Linux' | nvidia-cuda-runtime-cu12==12.8.90; platform_system == 'Linux' | nvidia-cuda-cupti-cu12==12.8.90; platform_system == 'Linux' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' | nvidia-cublas-cu12==12.8.4.1; platform_system == 'Linux' | nvidia-cufft-cu12==11.3.3.83; platform_system == 'Linux' | nvidia-curand-cu12==10.3.9.90; platform_system == 'Linux' | nvidia-cusolver-cu12==11.7.3.90; platform_system == 'Linux' | nvidia-cusparse-cu12==12.5.8.93; platform_system == 'Linux' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' | nvidia-nccl-cu12==2.27.5; platform_system == 'Linux' | nvidia-nvshmem-cu12==3.3.24; platform_system == 'Linux' | nvidia-nvtx-cu12==12.8.90; platform_system == 'Linux' | nvidia-nvjitlink-cu12==12.8.93; platform_system == 'Linux' | nvidia-cufile-cu12==1.13.1.3; platform_system == 'Linux'
timeout-minutes: 420
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
@ -630,7 +630,7 @@ jobs:
ALPINE_IMAGE: "arm64v8/alpine"
build_name: manywheel-py3_12-cuda-aarch64-13_0
build_environment: linux-aarch64-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc==13.0.48; platform_system == 'Linux' | nvidia-cuda-runtime==13.0.48; platform_system == 'Linux' | nvidia-cuda-cupti==13.0.48; platform_system == 'Linux' | nvidia-cudnn-cu13==9.13.0.50; platform_system == 'Linux' | nvidia-cublas==13.0.0.19; platform_system == 'Linux' | nvidia-cufft==12.0.0.15; platform_system == 'Linux' | nvidia-curand==10.4.0.35; platform_system == 'Linux' | nvidia-cusolver==12.0.3.29; platform_system == 'Linux' | nvidia-cusparse==12.6.2.49; platform_system == 'Linux' | nvidia-cusparselt-cu13==0.8.0; platform_system == 'Linux' | nvidia-nccl-cu13==2.28.3; platform_system == 'Linux' | nvidia-nvshmem-cu13==3.3.24; platform_system == 'Linux' | nvidia-nvtx==13.0.39; platform_system == 'Linux' | nvidia-nvjitlink==13.0.39; platform_system == 'Linux' | nvidia-cufile==1.15.0.42; platform_system == 'Linux'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc==13.0.48; platform_system == 'Linux' | nvidia-cuda-runtime==13.0.48; platform_system == 'Linux' | nvidia-cuda-cupti==13.0.48; platform_system == 'Linux' | nvidia-cudnn-cu13==9.13.0.50; platform_system == 'Linux' | nvidia-cublas==13.0.0.19; platform_system == 'Linux' | nvidia-cufft==12.0.0.15; platform_system == 'Linux' | nvidia-curand==10.4.0.35; platform_system == 'Linux' | nvidia-cusolver==12.0.3.29; platform_system == 'Linux' | nvidia-cusparse==12.6.2.49; platform_system == 'Linux' | nvidia-cusparselt-cu13==0.8.0; platform_system == 'Linux' | nvidia-nccl-cu13==2.27.7; platform_system == 'Linux' | nvidia-nvshmem-cu13==3.3.24; platform_system == 'Linux' | nvidia-nvtx==13.0.39; platform_system == 'Linux' | nvidia-nvjitlink==13.0.39; platform_system == 'Linux' | nvidia-cufile==1.15.0.42; platform_system == 'Linux'
timeout-minutes: 420
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
@ -741,7 +741,7 @@ jobs:
ALPINE_IMAGE: "arm64v8/alpine"
build_name: manywheel-py3_13-cuda-aarch64-12_6
build_environment: linux-aarch64-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.6.77; platform_system == 'Linux' | nvidia-cuda-runtime-cu12==12.6.77; platform_system == 'Linux' | nvidia-cuda-cupti-cu12==12.6.80; platform_system == 'Linux' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' | nvidia-cublas-cu12==12.6.4.1; platform_system == 'Linux' | nvidia-cufft-cu12==11.3.0.4; platform_system == 'Linux' | nvidia-curand-cu12==10.3.7.77; platform_system == 'Linux' | nvidia-cusolver-cu12==11.7.1.2; platform_system == 'Linux' | nvidia-cusparse-cu12==12.5.4.2; platform_system == 'Linux' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' | nvidia-nccl-cu12==2.28.3; platform_system == 'Linux' | nvidia-nvshmem-cu12==3.3.24; platform_system == 'Linux' | nvidia-nvtx-cu12==12.6.77; platform_system == 'Linux' | nvidia-nvjitlink-cu12==12.6.85; platform_system == 'Linux' | nvidia-cufile-cu12==1.11.1.6; platform_system == 'Linux'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.6.77; platform_system == 'Linux' | nvidia-cuda-runtime-cu12==12.6.77; platform_system == 'Linux' | nvidia-cuda-cupti-cu12==12.6.80; platform_system == 'Linux' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' | nvidia-cublas-cu12==12.6.4.1; platform_system == 'Linux' | nvidia-cufft-cu12==11.3.0.4; platform_system == 'Linux' | nvidia-curand-cu12==10.3.7.77; platform_system == 'Linux' | nvidia-cusolver-cu12==11.7.1.2; platform_system == 'Linux' | nvidia-cusparse-cu12==12.5.4.2; platform_system == 'Linux' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' | nvidia-nccl-cu12==2.27.5; platform_system == 'Linux' | nvidia-nvshmem-cu12==3.3.24; platform_system == 'Linux' | nvidia-nvtx-cu12==12.6.77; platform_system == 'Linux' | nvidia-nvjitlink-cu12==12.6.85; platform_system == 'Linux' | nvidia-cufile-cu12==1.11.1.6; platform_system == 'Linux'
timeout-minutes: 420
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
@ -787,7 +787,7 @@ jobs:
ALPINE_IMAGE: "arm64v8/alpine"
build_name: manywheel-py3_13-cuda-aarch64-12_8
build_environment: linux-aarch64-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.8.93; platform_system == 'Linux' | nvidia-cuda-runtime-cu12==12.8.90; platform_system == 'Linux' | nvidia-cuda-cupti-cu12==12.8.90; platform_system == 'Linux' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' | nvidia-cublas-cu12==12.8.4.1; platform_system == 'Linux' | nvidia-cufft-cu12==11.3.3.83; platform_system == 'Linux' | nvidia-curand-cu12==10.3.9.90; platform_system == 'Linux' | nvidia-cusolver-cu12==11.7.3.90; platform_system == 'Linux' | nvidia-cusparse-cu12==12.5.8.93; platform_system == 'Linux' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' | nvidia-nccl-cu12==2.28.3; platform_system == 'Linux' | nvidia-nvshmem-cu12==3.3.24; platform_system == 'Linux' | nvidia-nvtx-cu12==12.8.90; platform_system == 'Linux' | nvidia-nvjitlink-cu12==12.8.93; platform_system == 'Linux' | nvidia-cufile-cu12==1.13.1.3; platform_system == 'Linux'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.8.93; platform_system == 'Linux' | nvidia-cuda-runtime-cu12==12.8.90; platform_system == 'Linux' | nvidia-cuda-cupti-cu12==12.8.90; platform_system == 'Linux' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' | nvidia-cublas-cu12==12.8.4.1; platform_system == 'Linux' | nvidia-cufft-cu12==11.3.3.83; platform_system == 'Linux' | nvidia-curand-cu12==10.3.9.90; platform_system == 'Linux' | nvidia-cusolver-cu12==11.7.3.90; platform_system == 'Linux' | nvidia-cusparse-cu12==12.5.8.93; platform_system == 'Linux' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' | nvidia-nccl-cu12==2.27.5; platform_system == 'Linux' | nvidia-nvshmem-cu12==3.3.24; platform_system == 'Linux' | nvidia-nvtx-cu12==12.8.90; platform_system == 'Linux' | nvidia-nvjitlink-cu12==12.8.93; platform_system == 'Linux' | nvidia-cufile-cu12==1.13.1.3; platform_system == 'Linux'
timeout-minutes: 420
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
@ -833,7 +833,7 @@ jobs:
ALPINE_IMAGE: "arm64v8/alpine"
build_name: manywheel-py3_13-cuda-aarch64-13_0
build_environment: linux-aarch64-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc==13.0.48; platform_system == 'Linux' | nvidia-cuda-runtime==13.0.48; platform_system == 'Linux' | nvidia-cuda-cupti==13.0.48; platform_system == 'Linux' | nvidia-cudnn-cu13==9.13.0.50; platform_system == 'Linux' | nvidia-cublas==13.0.0.19; platform_system == 'Linux' | nvidia-cufft==12.0.0.15; platform_system == 'Linux' | nvidia-curand==10.4.0.35; platform_system == 'Linux' | nvidia-cusolver==12.0.3.29; platform_system == 'Linux' | nvidia-cusparse==12.6.2.49; platform_system == 'Linux' | nvidia-cusparselt-cu13==0.8.0; platform_system == 'Linux' | nvidia-nccl-cu13==2.28.3; platform_system == 'Linux' | nvidia-nvshmem-cu13==3.3.24; platform_system == 'Linux' | nvidia-nvtx==13.0.39; platform_system == 'Linux' | nvidia-nvjitlink==13.0.39; platform_system == 'Linux' | nvidia-cufile==1.15.0.42; platform_system == 'Linux'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc==13.0.48; platform_system == 'Linux' | nvidia-cuda-runtime==13.0.48; platform_system == 'Linux' | nvidia-cuda-cupti==13.0.48; platform_system == 'Linux' | nvidia-cudnn-cu13==9.13.0.50; platform_system == 'Linux' | nvidia-cublas==13.0.0.19; platform_system == 'Linux' | nvidia-cufft==12.0.0.15; platform_system == 'Linux' | nvidia-curand==10.4.0.35; platform_system == 'Linux' | nvidia-cusolver==12.0.3.29; platform_system == 'Linux' | nvidia-cusparse==12.6.2.49; platform_system == 'Linux' | nvidia-cusparselt-cu13==0.8.0; platform_system == 'Linux' | nvidia-nccl-cu13==2.27.7; platform_system == 'Linux' | nvidia-nvshmem-cu13==3.3.24; platform_system == 'Linux' | nvidia-nvtx==13.0.39; platform_system == 'Linux' | nvidia-nvjitlink==13.0.39; platform_system == 'Linux' | nvidia-cufile==1.15.0.42; platform_system == 'Linux'
timeout-minutes: 420
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
@ -944,7 +944,7 @@ jobs:
ALPINE_IMAGE: "arm64v8/alpine"
build_name: manywheel-py3_13t-cuda-aarch64-12_6
build_environment: linux-aarch64-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.6.77; platform_system == 'Linux' | nvidia-cuda-runtime-cu12==12.6.77; platform_system == 'Linux' | nvidia-cuda-cupti-cu12==12.6.80; platform_system == 'Linux' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' | nvidia-cublas-cu12==12.6.4.1; platform_system == 'Linux' | nvidia-cufft-cu12==11.3.0.4; platform_system == 'Linux' | nvidia-curand-cu12==10.3.7.77; platform_system == 'Linux' | nvidia-cusolver-cu12==11.7.1.2; platform_system == 'Linux' | nvidia-cusparse-cu12==12.5.4.2; platform_system == 'Linux' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' | nvidia-nccl-cu12==2.28.3; platform_system == 'Linux' | nvidia-nvshmem-cu12==3.3.24; platform_system == 'Linux' | nvidia-nvtx-cu12==12.6.77; platform_system == 'Linux' | nvidia-nvjitlink-cu12==12.6.85; platform_system == 'Linux' | nvidia-cufile-cu12==1.11.1.6; platform_system == 'Linux'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.6.77; platform_system == 'Linux' | nvidia-cuda-runtime-cu12==12.6.77; platform_system == 'Linux' | nvidia-cuda-cupti-cu12==12.6.80; platform_system == 'Linux' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' | nvidia-cublas-cu12==12.6.4.1; platform_system == 'Linux' | nvidia-cufft-cu12==11.3.0.4; platform_system == 'Linux' | nvidia-curand-cu12==10.3.7.77; platform_system == 'Linux' | nvidia-cusolver-cu12==11.7.1.2; platform_system == 'Linux' | nvidia-cusparse-cu12==12.5.4.2; platform_system == 'Linux' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' | nvidia-nccl-cu12==2.27.5; platform_system == 'Linux' | nvidia-nvshmem-cu12==3.3.24; platform_system == 'Linux' | nvidia-nvtx-cu12==12.6.77; platform_system == 'Linux' | nvidia-nvjitlink-cu12==12.6.85; platform_system == 'Linux' | nvidia-cufile-cu12==1.11.1.6; platform_system == 'Linux'
timeout-minutes: 420
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
@ -990,7 +990,7 @@ jobs:
ALPINE_IMAGE: "arm64v8/alpine"
build_name: manywheel-py3_13t-cuda-aarch64-12_8
build_environment: linux-aarch64-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.8.93; platform_system == 'Linux' | nvidia-cuda-runtime-cu12==12.8.90; platform_system == 'Linux' | nvidia-cuda-cupti-cu12==12.8.90; platform_system == 'Linux' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' | nvidia-cublas-cu12==12.8.4.1; platform_system == 'Linux' | nvidia-cufft-cu12==11.3.3.83; platform_system == 'Linux' | nvidia-curand-cu12==10.3.9.90; platform_system == 'Linux' | nvidia-cusolver-cu12==11.7.3.90; platform_system == 'Linux' | nvidia-cusparse-cu12==12.5.8.93; platform_system == 'Linux' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' | nvidia-nccl-cu12==2.28.3; platform_system == 'Linux' | nvidia-nvshmem-cu12==3.3.24; platform_system == 'Linux' | nvidia-nvtx-cu12==12.8.90; platform_system == 'Linux' | nvidia-nvjitlink-cu12==12.8.93; platform_system == 'Linux' | nvidia-cufile-cu12==1.13.1.3; platform_system == 'Linux'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.8.93; platform_system == 'Linux' | nvidia-cuda-runtime-cu12==12.8.90; platform_system == 'Linux' | nvidia-cuda-cupti-cu12==12.8.90; platform_system == 'Linux' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' | nvidia-cublas-cu12==12.8.4.1; platform_system == 'Linux' | nvidia-cufft-cu12==11.3.3.83; platform_system == 'Linux' | nvidia-curand-cu12==10.3.9.90; platform_system == 'Linux' | nvidia-cusolver-cu12==11.7.3.90; platform_system == 'Linux' | nvidia-cusparse-cu12==12.5.8.93; platform_system == 'Linux' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' | nvidia-nccl-cu12==2.27.5; platform_system == 'Linux' | nvidia-nvshmem-cu12==3.3.24; platform_system == 'Linux' | nvidia-nvtx-cu12==12.8.90; platform_system == 'Linux' | nvidia-nvjitlink-cu12==12.8.93; platform_system == 'Linux' | nvidia-cufile-cu12==1.13.1.3; platform_system == 'Linux'
timeout-minutes: 420
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
@ -1036,7 +1036,7 @@ jobs:
ALPINE_IMAGE: "arm64v8/alpine"
build_name: manywheel-py3_13t-cuda-aarch64-13_0
build_environment: linux-aarch64-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc==13.0.48; platform_system == 'Linux' | nvidia-cuda-runtime==13.0.48; platform_system == 'Linux' | nvidia-cuda-cupti==13.0.48; platform_system == 'Linux' | nvidia-cudnn-cu13==9.13.0.50; platform_system == 'Linux' | nvidia-cublas==13.0.0.19; platform_system == 'Linux' | nvidia-cufft==12.0.0.15; platform_system == 'Linux' | nvidia-curand==10.4.0.35; platform_system == 'Linux' | nvidia-cusolver==12.0.3.29; platform_system == 'Linux' | nvidia-cusparse==12.6.2.49; platform_system == 'Linux' | nvidia-cusparselt-cu13==0.8.0; platform_system == 'Linux' | nvidia-nccl-cu13==2.28.3; platform_system == 'Linux' | nvidia-nvshmem-cu13==3.3.24; platform_system == 'Linux' | nvidia-nvtx==13.0.39; platform_system == 'Linux' | nvidia-nvjitlink==13.0.39; platform_system == 'Linux' | nvidia-cufile==1.15.0.42; platform_system == 'Linux'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc==13.0.48; platform_system == 'Linux' | nvidia-cuda-runtime==13.0.48; platform_system == 'Linux' | nvidia-cuda-cupti==13.0.48; platform_system == 'Linux' | nvidia-cudnn-cu13==9.13.0.50; platform_system == 'Linux' | nvidia-cublas==13.0.0.19; platform_system == 'Linux' | nvidia-cufft==12.0.0.15; platform_system == 'Linux' | nvidia-curand==10.4.0.35; platform_system == 'Linux' | nvidia-cusolver==12.0.3.29; platform_system == 'Linux' | nvidia-cusparse==12.6.2.49; platform_system == 'Linux' | nvidia-cusparselt-cu13==0.8.0; platform_system == 'Linux' | nvidia-nccl-cu13==2.27.7; platform_system == 'Linux' | nvidia-nvshmem-cu13==3.3.24; platform_system == 'Linux' | nvidia-nvtx==13.0.39; platform_system == 'Linux' | nvidia-nvjitlink==13.0.39; platform_system == 'Linux' | nvidia-cufile==1.15.0.42; platform_system == 'Linux'
timeout-minutes: 420
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
@ -1147,7 +1147,7 @@ jobs:
ALPINE_IMAGE: "arm64v8/alpine"
build_name: manywheel-py3_14-cuda-aarch64-12_6
build_environment: linux-aarch64-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.6.77; platform_system == 'Linux' | nvidia-cuda-runtime-cu12==12.6.77; platform_system == 'Linux' | nvidia-cuda-cupti-cu12==12.6.80; platform_system == 'Linux' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' | nvidia-cublas-cu12==12.6.4.1; platform_system == 'Linux' | nvidia-cufft-cu12==11.3.0.4; platform_system == 'Linux' | nvidia-curand-cu12==10.3.7.77; platform_system == 'Linux' | nvidia-cusolver-cu12==11.7.1.2; platform_system == 'Linux' | nvidia-cusparse-cu12==12.5.4.2; platform_system == 'Linux' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' | nvidia-nccl-cu12==2.28.3; platform_system == 'Linux' | nvidia-nvshmem-cu12==3.3.24; platform_system == 'Linux' | nvidia-nvtx-cu12==12.6.77; platform_system == 'Linux' | nvidia-nvjitlink-cu12==12.6.85; platform_system == 'Linux' | nvidia-cufile-cu12==1.11.1.6; platform_system == 'Linux'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.6.77; platform_system == 'Linux' | nvidia-cuda-runtime-cu12==12.6.77; platform_system == 'Linux' | nvidia-cuda-cupti-cu12==12.6.80; platform_system == 'Linux' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' | nvidia-cublas-cu12==12.6.4.1; platform_system == 'Linux' | nvidia-cufft-cu12==11.3.0.4; platform_system == 'Linux' | nvidia-curand-cu12==10.3.7.77; platform_system == 'Linux' | nvidia-cusolver-cu12==11.7.1.2; platform_system == 'Linux' | nvidia-cusparse-cu12==12.5.4.2; platform_system == 'Linux' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' | nvidia-nccl-cu12==2.27.5; platform_system == 'Linux' | nvidia-nvshmem-cu12==3.3.24; platform_system == 'Linux' | nvidia-nvtx-cu12==12.6.77; platform_system == 'Linux' | nvidia-nvjitlink-cu12==12.6.85; platform_system == 'Linux' | nvidia-cufile-cu12==1.11.1.6; platform_system == 'Linux'
timeout-minutes: 420
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
@ -1193,7 +1193,7 @@ jobs:
ALPINE_IMAGE: "arm64v8/alpine"
build_name: manywheel-py3_14-cuda-aarch64-12_8
build_environment: linux-aarch64-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.8.93; platform_system == 'Linux' | nvidia-cuda-runtime-cu12==12.8.90; platform_system == 'Linux' | nvidia-cuda-cupti-cu12==12.8.90; platform_system == 'Linux' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' | nvidia-cublas-cu12==12.8.4.1; platform_system == 'Linux' | nvidia-cufft-cu12==11.3.3.83; platform_system == 'Linux' | nvidia-curand-cu12==10.3.9.90; platform_system == 'Linux' | nvidia-cusolver-cu12==11.7.3.90; platform_system == 'Linux' | nvidia-cusparse-cu12==12.5.8.93; platform_system == 'Linux' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' | nvidia-nccl-cu12==2.28.3; platform_system == 'Linux' | nvidia-nvshmem-cu12==3.3.24; platform_system == 'Linux' | nvidia-nvtx-cu12==12.8.90; platform_system == 'Linux' | nvidia-nvjitlink-cu12==12.8.93; platform_system == 'Linux' | nvidia-cufile-cu12==1.13.1.3; platform_system == 'Linux'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.8.93; platform_system == 'Linux' | nvidia-cuda-runtime-cu12==12.8.90; platform_system == 'Linux' | nvidia-cuda-cupti-cu12==12.8.90; platform_system == 'Linux' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' | nvidia-cublas-cu12==12.8.4.1; platform_system == 'Linux' | nvidia-cufft-cu12==11.3.3.83; platform_system == 'Linux' | nvidia-curand-cu12==10.3.9.90; platform_system == 'Linux' | nvidia-cusolver-cu12==11.7.3.90; platform_system == 'Linux' | nvidia-cusparse-cu12==12.5.8.93; platform_system == 'Linux' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' | nvidia-nccl-cu12==2.27.5; platform_system == 'Linux' | nvidia-nvshmem-cu12==3.3.24; platform_system == 'Linux' | nvidia-nvtx-cu12==12.8.90; platform_system == 'Linux' | nvidia-nvjitlink-cu12==12.8.93; platform_system == 'Linux' | nvidia-cufile-cu12==1.13.1.3; platform_system == 'Linux'
timeout-minutes: 420
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
@ -1239,7 +1239,7 @@ jobs:
ALPINE_IMAGE: "arm64v8/alpine"
build_name: manywheel-py3_14-cuda-aarch64-13_0
build_environment: linux-aarch64-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc==13.0.48; platform_system == 'Linux' | nvidia-cuda-runtime==13.0.48; platform_system == 'Linux' | nvidia-cuda-cupti==13.0.48; platform_system == 'Linux' | nvidia-cudnn-cu13==9.13.0.50; platform_system == 'Linux' | nvidia-cublas==13.0.0.19; platform_system == 'Linux' | nvidia-cufft==12.0.0.15; platform_system == 'Linux' | nvidia-curand==10.4.0.35; platform_system == 'Linux' | nvidia-cusolver==12.0.3.29; platform_system == 'Linux' | nvidia-cusparse==12.6.2.49; platform_system == 'Linux' | nvidia-cusparselt-cu13==0.8.0; platform_system == 'Linux' | nvidia-nccl-cu13==2.28.3; platform_system == 'Linux' | nvidia-nvshmem-cu13==3.3.24; platform_system == 'Linux' | nvidia-nvtx==13.0.39; platform_system == 'Linux' | nvidia-nvjitlink==13.0.39; platform_system == 'Linux' | nvidia-cufile==1.15.0.42; platform_system == 'Linux'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc==13.0.48; platform_system == 'Linux' | nvidia-cuda-runtime==13.0.48; platform_system == 'Linux' | nvidia-cuda-cupti==13.0.48; platform_system == 'Linux' | nvidia-cudnn-cu13==9.13.0.50; platform_system == 'Linux' | nvidia-cublas==13.0.0.19; platform_system == 'Linux' | nvidia-cufft==12.0.0.15; platform_system == 'Linux' | nvidia-curand==10.4.0.35; platform_system == 'Linux' | nvidia-cusolver==12.0.3.29; platform_system == 'Linux' | nvidia-cusparse==12.6.2.49; platform_system == 'Linux' | nvidia-cusparselt-cu13==0.8.0; platform_system == 'Linux' | nvidia-nccl-cu13==2.27.7; platform_system == 'Linux' | nvidia-nvshmem-cu13==3.3.24; platform_system == 'Linux' | nvidia-nvtx==13.0.39; platform_system == 'Linux' | nvidia-nvjitlink==13.0.39; platform_system == 'Linux' | nvidia-cufile==1.15.0.42; platform_system == 'Linux'
timeout-minutes: 420
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
@ -1350,7 +1350,7 @@ jobs:
ALPINE_IMAGE: "arm64v8/alpine"
build_name: manywheel-py3_14t-cuda-aarch64-12_6
build_environment: linux-aarch64-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.6.77; platform_system == 'Linux' | nvidia-cuda-runtime-cu12==12.6.77; platform_system == 'Linux' | nvidia-cuda-cupti-cu12==12.6.80; platform_system == 'Linux' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' | nvidia-cublas-cu12==12.6.4.1; platform_system == 'Linux' | nvidia-cufft-cu12==11.3.0.4; platform_system == 'Linux' | nvidia-curand-cu12==10.3.7.77; platform_system == 'Linux' | nvidia-cusolver-cu12==11.7.1.2; platform_system == 'Linux' | nvidia-cusparse-cu12==12.5.4.2; platform_system == 'Linux' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' | nvidia-nccl-cu12==2.28.3; platform_system == 'Linux' | nvidia-nvshmem-cu12==3.3.24; platform_system == 'Linux' | nvidia-nvtx-cu12==12.6.77; platform_system == 'Linux' | nvidia-nvjitlink-cu12==12.6.85; platform_system == 'Linux' | nvidia-cufile-cu12==1.11.1.6; platform_system == 'Linux'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.6.77; platform_system == 'Linux' | nvidia-cuda-runtime-cu12==12.6.77; platform_system == 'Linux' | nvidia-cuda-cupti-cu12==12.6.80; platform_system == 'Linux' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' | nvidia-cublas-cu12==12.6.4.1; platform_system == 'Linux' | nvidia-cufft-cu12==11.3.0.4; platform_system == 'Linux' | nvidia-curand-cu12==10.3.7.77; platform_system == 'Linux' | nvidia-cusolver-cu12==11.7.1.2; platform_system == 'Linux' | nvidia-cusparse-cu12==12.5.4.2; platform_system == 'Linux' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' | nvidia-nccl-cu12==2.27.5; platform_system == 'Linux' | nvidia-nvshmem-cu12==3.3.24; platform_system == 'Linux' | nvidia-nvtx-cu12==12.6.77; platform_system == 'Linux' | nvidia-nvjitlink-cu12==12.6.85; platform_system == 'Linux' | nvidia-cufile-cu12==1.11.1.6; platform_system == 'Linux'
timeout-minutes: 420
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
@ -1396,7 +1396,7 @@ jobs:
ALPINE_IMAGE: "arm64v8/alpine"
build_name: manywheel-py3_14t-cuda-aarch64-12_8
build_environment: linux-aarch64-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.8.93; platform_system == 'Linux' | nvidia-cuda-runtime-cu12==12.8.90; platform_system == 'Linux' | nvidia-cuda-cupti-cu12==12.8.90; platform_system == 'Linux' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' | nvidia-cublas-cu12==12.8.4.1; platform_system == 'Linux' | nvidia-cufft-cu12==11.3.3.83; platform_system == 'Linux' | nvidia-curand-cu12==10.3.9.90; platform_system == 'Linux' | nvidia-cusolver-cu12==11.7.3.90; platform_system == 'Linux' | nvidia-cusparse-cu12==12.5.8.93; platform_system == 'Linux' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' | nvidia-nccl-cu12==2.28.3; platform_system == 'Linux' | nvidia-nvshmem-cu12==3.3.24; platform_system == 'Linux' | nvidia-nvtx-cu12==12.8.90; platform_system == 'Linux' | nvidia-nvjitlink-cu12==12.8.93; platform_system == 'Linux' | nvidia-cufile-cu12==1.13.1.3; platform_system == 'Linux'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.8.93; platform_system == 'Linux' | nvidia-cuda-runtime-cu12==12.8.90; platform_system == 'Linux' | nvidia-cuda-cupti-cu12==12.8.90; platform_system == 'Linux' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' | nvidia-cublas-cu12==12.8.4.1; platform_system == 'Linux' | nvidia-cufft-cu12==11.3.3.83; platform_system == 'Linux' | nvidia-curand-cu12==10.3.9.90; platform_system == 'Linux' | nvidia-cusolver-cu12==11.7.3.90; platform_system == 'Linux' | nvidia-cusparse-cu12==12.5.8.93; platform_system == 'Linux' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' | nvidia-nccl-cu12==2.27.5; platform_system == 'Linux' | nvidia-nvshmem-cu12==3.3.24; platform_system == 'Linux' | nvidia-nvtx-cu12==12.8.90; platform_system == 'Linux' | nvidia-nvjitlink-cu12==12.8.93; platform_system == 'Linux' | nvidia-cufile-cu12==1.13.1.3; platform_system == 'Linux'
timeout-minutes: 420
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
@ -1442,7 +1442,7 @@ jobs:
ALPINE_IMAGE: "arm64v8/alpine"
build_name: manywheel-py3_14t-cuda-aarch64-13_0
build_environment: linux-aarch64-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc==13.0.48; platform_system == 'Linux' | nvidia-cuda-runtime==13.0.48; platform_system == 'Linux' | nvidia-cuda-cupti==13.0.48; platform_system == 'Linux' | nvidia-cudnn-cu13==9.13.0.50; platform_system == 'Linux' | nvidia-cublas==13.0.0.19; platform_system == 'Linux' | nvidia-cufft==12.0.0.15; platform_system == 'Linux' | nvidia-curand==10.4.0.35; platform_system == 'Linux' | nvidia-cusolver==12.0.3.29; platform_system == 'Linux' | nvidia-cusparse==12.6.2.49; platform_system == 'Linux' | nvidia-cusparselt-cu13==0.8.0; platform_system == 'Linux' | nvidia-nccl-cu13==2.28.3; platform_system == 'Linux' | nvidia-nvshmem-cu13==3.3.24; platform_system == 'Linux' | nvidia-nvtx==13.0.39; platform_system == 'Linux' | nvidia-nvjitlink==13.0.39; platform_system == 'Linux' | nvidia-cufile==1.15.0.42; platform_system == 'Linux'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc==13.0.48; platform_system == 'Linux' | nvidia-cuda-runtime==13.0.48; platform_system == 'Linux' | nvidia-cuda-cupti==13.0.48; platform_system == 'Linux' | nvidia-cudnn-cu13==9.13.0.50; platform_system == 'Linux' | nvidia-cublas==13.0.0.19; platform_system == 'Linux' | nvidia-cufft==12.0.0.15; platform_system == 'Linux' | nvidia-curand==10.4.0.35; platform_system == 'Linux' | nvidia-cusolver==12.0.3.29; platform_system == 'Linux' | nvidia-cusparse==12.6.2.49; platform_system == 'Linux' | nvidia-cusparselt-cu13==0.8.0; platform_system == 'Linux' | nvidia-nccl-cu13==2.27.7; platform_system == 'Linux' | nvidia-nvshmem-cu13==3.3.24; platform_system == 'Linux' | nvidia-nvtx==13.0.39; platform_system == 'Linux' | nvidia-nvjitlink==13.0.39; platform_system == 'Linux' | nvidia-cufile==1.15.0.42; platform_system == 'Linux'
timeout-minutes: 420
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}

View File

@ -127,7 +127,7 @@ jobs:
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
build_name: manywheel-py3_10-cuda12_6
build_environment: linux-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.6.77; platform_system == 'Linux' | nvidia-cuda-runtime-cu12==12.6.77; platform_system == 'Linux' | nvidia-cuda-cupti-cu12==12.6.80; platform_system == 'Linux' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' | nvidia-cublas-cu12==12.6.4.1; platform_system == 'Linux' | nvidia-cufft-cu12==11.3.0.4; platform_system == 'Linux' | nvidia-curand-cu12==10.3.7.77; platform_system == 'Linux' | nvidia-cusolver-cu12==11.7.1.2; platform_system == 'Linux' | nvidia-cusparse-cu12==12.5.4.2; platform_system == 'Linux' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' | nvidia-nccl-cu12==2.28.3; platform_system == 'Linux' | nvidia-nvshmem-cu12==3.3.24; platform_system == 'Linux' | nvidia-nvtx-cu12==12.6.77; platform_system == 'Linux' | nvidia-nvjitlink-cu12==12.6.85; platform_system == 'Linux' | nvidia-cufile-cu12==1.11.1.6; platform_system == 'Linux'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.6.77; platform_system == 'Linux' | nvidia-cuda-runtime-cu12==12.6.77; platform_system == 'Linux' | nvidia-cuda-cupti-cu12==12.6.80; platform_system == 'Linux' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' | nvidia-cublas-cu12==12.6.4.1; platform_system == 'Linux' | nvidia-cufft-cu12==11.3.0.4; platform_system == 'Linux' | nvidia-curand-cu12==10.3.7.77; platform_system == 'Linux' | nvidia-cusolver-cu12==11.7.1.2; platform_system == 'Linux' | nvidia-cusparse-cu12==12.5.4.2; platform_system == 'Linux' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' | nvidia-nccl-cu12==2.27.5; platform_system == 'Linux' | nvidia-nvshmem-cu12==3.3.24; platform_system == 'Linux' | nvidia-nvtx-cu12==12.6.77; platform_system == 'Linux' | nvidia-nvjitlink-cu12==12.6.85; platform_system == 'Linux' | nvidia-cufile-cu12==1.11.1.6; platform_system == 'Linux'
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_10-cuda12_6-test: # Testing
@ -193,7 +193,7 @@ jobs:
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
build_name: manywheel-py3_10-cuda12_8
build_environment: linux-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.8.93; platform_system == 'Linux' | nvidia-cuda-runtime-cu12==12.8.90; platform_system == 'Linux' | nvidia-cuda-cupti-cu12==12.8.90; platform_system == 'Linux' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' | nvidia-cublas-cu12==12.8.4.1; platform_system == 'Linux' | nvidia-cufft-cu12==11.3.3.83; platform_system == 'Linux' | nvidia-curand-cu12==10.3.9.90; platform_system == 'Linux' | nvidia-cusolver-cu12==11.7.3.90; platform_system == 'Linux' | nvidia-cusparse-cu12==12.5.8.93; platform_system == 'Linux' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' | nvidia-nccl-cu12==2.28.3; platform_system == 'Linux' | nvidia-nvshmem-cu12==3.3.24; platform_system == 'Linux' | nvidia-nvtx-cu12==12.8.90; platform_system == 'Linux' | nvidia-nvjitlink-cu12==12.8.93; platform_system == 'Linux' | nvidia-cufile-cu12==1.13.1.3; platform_system == 'Linux'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.8.93; platform_system == 'Linux' | nvidia-cuda-runtime-cu12==12.8.90; platform_system == 'Linux' | nvidia-cuda-cupti-cu12==12.8.90; platform_system == 'Linux' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' | nvidia-cublas-cu12==12.8.4.1; platform_system == 'Linux' | nvidia-cufft-cu12==11.3.3.83; platform_system == 'Linux' | nvidia-curand-cu12==10.3.9.90; platform_system == 'Linux' | nvidia-cusolver-cu12==11.7.3.90; platform_system == 'Linux' | nvidia-cusparse-cu12==12.5.8.93; platform_system == 'Linux' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' | nvidia-nccl-cu12==2.27.5; platform_system == 'Linux' | nvidia-nvshmem-cu12==3.3.24; platform_system == 'Linux' | nvidia-nvtx-cu12==12.8.90; platform_system == 'Linux' | nvidia-nvjitlink-cu12==12.8.93; platform_system == 'Linux' | nvidia-cufile-cu12==1.13.1.3; platform_system == 'Linux'
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_10-cuda12_8-test: # Testing
@ -259,7 +259,7 @@ jobs:
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
build_name: manywheel-py3_10-cuda13_0
build_environment: linux-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc==13.0.48; platform_system == 'Linux' | nvidia-cuda-runtime==13.0.48; platform_system == 'Linux' | nvidia-cuda-cupti==13.0.48; platform_system == 'Linux' | nvidia-cudnn-cu13==9.13.0.50; platform_system == 'Linux' | nvidia-cublas==13.0.0.19; platform_system == 'Linux' | nvidia-cufft==12.0.0.15; platform_system == 'Linux' | nvidia-curand==10.4.0.35; platform_system == 'Linux' | nvidia-cusolver==12.0.3.29; platform_system == 'Linux' | nvidia-cusparse==12.6.2.49; platform_system == 'Linux' | nvidia-cusparselt-cu13==0.8.0; platform_system == 'Linux' | nvidia-nccl-cu13==2.28.3; platform_system == 'Linux' | nvidia-nvshmem-cu13==3.3.24; platform_system == 'Linux' | nvidia-nvtx==13.0.39; platform_system == 'Linux' | nvidia-nvjitlink==13.0.39; platform_system == 'Linux' | nvidia-cufile==1.15.0.42; platform_system == 'Linux'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc==13.0.48; platform_system == 'Linux' | nvidia-cuda-runtime==13.0.48; platform_system == 'Linux' | nvidia-cuda-cupti==13.0.48; platform_system == 'Linux' | nvidia-cudnn-cu13==9.13.0.50; platform_system == 'Linux' | nvidia-cublas==13.0.0.19; platform_system == 'Linux' | nvidia-cufft==12.0.0.15; platform_system == 'Linux' | nvidia-curand==10.4.0.35; platform_system == 'Linux' | nvidia-cusolver==12.0.3.29; platform_system == 'Linux' | nvidia-cusparse==12.6.2.49; platform_system == 'Linux' | nvidia-cusparselt-cu13==0.8.0; platform_system == 'Linux' | nvidia-nccl-cu13==2.27.7; platform_system == 'Linux' | nvidia-nvshmem-cu13==3.3.24; platform_system == 'Linux' | nvidia-nvtx==13.0.39; platform_system == 'Linux' | nvidia-nvjitlink==13.0.39; platform_system == 'Linux' | nvidia-cufile==1.15.0.42; platform_system == 'Linux'
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_10-cuda13_0-test: # Testing
@ -721,7 +721,7 @@ jobs:
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
build_name: manywheel-py3_11-cuda12_6
build_environment: linux-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.6.77; platform_system == 'Linux' | nvidia-cuda-runtime-cu12==12.6.77; platform_system == 'Linux' | nvidia-cuda-cupti-cu12==12.6.80; platform_system == 'Linux' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' | nvidia-cublas-cu12==12.6.4.1; platform_system == 'Linux' | nvidia-cufft-cu12==11.3.0.4; platform_system == 'Linux' | nvidia-curand-cu12==10.3.7.77; platform_system == 'Linux' | nvidia-cusolver-cu12==11.7.1.2; platform_system == 'Linux' | nvidia-cusparse-cu12==12.5.4.2; platform_system == 'Linux' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' | nvidia-nccl-cu12==2.28.3; platform_system == 'Linux' | nvidia-nvshmem-cu12==3.3.24; platform_system == 'Linux' | nvidia-nvtx-cu12==12.6.77; platform_system == 'Linux' | nvidia-nvjitlink-cu12==12.6.85; platform_system == 'Linux' | nvidia-cufile-cu12==1.11.1.6; platform_system == 'Linux'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.6.77; platform_system == 'Linux' | nvidia-cuda-runtime-cu12==12.6.77; platform_system == 'Linux' | nvidia-cuda-cupti-cu12==12.6.80; platform_system == 'Linux' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' | nvidia-cublas-cu12==12.6.4.1; platform_system == 'Linux' | nvidia-cufft-cu12==11.3.0.4; platform_system == 'Linux' | nvidia-curand-cu12==10.3.7.77; platform_system == 'Linux' | nvidia-cusolver-cu12==11.7.1.2; platform_system == 'Linux' | nvidia-cusparse-cu12==12.5.4.2; platform_system == 'Linux' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' | nvidia-nccl-cu12==2.27.5; platform_system == 'Linux' | nvidia-nvshmem-cu12==3.3.24; platform_system == 'Linux' | nvidia-nvtx-cu12==12.6.77; platform_system == 'Linux' | nvidia-nvjitlink-cu12==12.6.85; platform_system == 'Linux' | nvidia-cufile-cu12==1.11.1.6; platform_system == 'Linux'
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_11-cuda12_6-test: # Testing
@ -787,7 +787,7 @@ jobs:
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
build_name: manywheel-py3_11-cuda12_8
build_environment: linux-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.8.93; platform_system == 'Linux' | nvidia-cuda-runtime-cu12==12.8.90; platform_system == 'Linux' | nvidia-cuda-cupti-cu12==12.8.90; platform_system == 'Linux' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' | nvidia-cublas-cu12==12.8.4.1; platform_system == 'Linux' | nvidia-cufft-cu12==11.3.3.83; platform_system == 'Linux' | nvidia-curand-cu12==10.3.9.90; platform_system == 'Linux' | nvidia-cusolver-cu12==11.7.3.90; platform_system == 'Linux' | nvidia-cusparse-cu12==12.5.8.93; platform_system == 'Linux' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' | nvidia-nccl-cu12==2.28.3; platform_system == 'Linux' | nvidia-nvshmem-cu12==3.3.24; platform_system == 'Linux' | nvidia-nvtx-cu12==12.8.90; platform_system == 'Linux' | nvidia-nvjitlink-cu12==12.8.93; platform_system == 'Linux' | nvidia-cufile-cu12==1.13.1.3; platform_system == 'Linux'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.8.93; platform_system == 'Linux' | nvidia-cuda-runtime-cu12==12.8.90; platform_system == 'Linux' | nvidia-cuda-cupti-cu12==12.8.90; platform_system == 'Linux' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' | nvidia-cublas-cu12==12.8.4.1; platform_system == 'Linux' | nvidia-cufft-cu12==11.3.3.83; platform_system == 'Linux' | nvidia-curand-cu12==10.3.9.90; platform_system == 'Linux' | nvidia-cusolver-cu12==11.7.3.90; platform_system == 'Linux' | nvidia-cusparse-cu12==12.5.8.93; platform_system == 'Linux' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' | nvidia-nccl-cu12==2.27.5; platform_system == 'Linux' | nvidia-nvshmem-cu12==3.3.24; platform_system == 'Linux' | nvidia-nvtx-cu12==12.8.90; platform_system == 'Linux' | nvidia-nvjitlink-cu12==12.8.93; platform_system == 'Linux' | nvidia-cufile-cu12==1.13.1.3; platform_system == 'Linux'
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_11-cuda12_8-test: # Testing
@ -853,7 +853,7 @@ jobs:
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
build_name: manywheel-py3_11-cuda13_0
build_environment: linux-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc==13.0.48; platform_system == 'Linux' | nvidia-cuda-runtime==13.0.48; platform_system == 'Linux' | nvidia-cuda-cupti==13.0.48; platform_system == 'Linux' | nvidia-cudnn-cu13==9.13.0.50; platform_system == 'Linux' | nvidia-cublas==13.0.0.19; platform_system == 'Linux' | nvidia-cufft==12.0.0.15; platform_system == 'Linux' | nvidia-curand==10.4.0.35; platform_system == 'Linux' | nvidia-cusolver==12.0.3.29; platform_system == 'Linux' | nvidia-cusparse==12.6.2.49; platform_system == 'Linux' | nvidia-cusparselt-cu13==0.8.0; platform_system == 'Linux' | nvidia-nccl-cu13==2.28.3; platform_system == 'Linux' | nvidia-nvshmem-cu13==3.3.24; platform_system == 'Linux' | nvidia-nvtx==13.0.39; platform_system == 'Linux' | nvidia-nvjitlink==13.0.39; platform_system == 'Linux' | nvidia-cufile==1.15.0.42; platform_system == 'Linux'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc==13.0.48; platform_system == 'Linux' | nvidia-cuda-runtime==13.0.48; platform_system == 'Linux' | nvidia-cuda-cupti==13.0.48; platform_system == 'Linux' | nvidia-cudnn-cu13==9.13.0.50; platform_system == 'Linux' | nvidia-cublas==13.0.0.19; platform_system == 'Linux' | nvidia-cufft==12.0.0.15; platform_system == 'Linux' | nvidia-curand==10.4.0.35; platform_system == 'Linux' | nvidia-cusolver==12.0.3.29; platform_system == 'Linux' | nvidia-cusparse==12.6.2.49; platform_system == 'Linux' | nvidia-cusparselt-cu13==0.8.0; platform_system == 'Linux' | nvidia-nccl-cu13==2.27.7; platform_system == 'Linux' | nvidia-nvshmem-cu13==3.3.24; platform_system == 'Linux' | nvidia-nvtx==13.0.39; platform_system == 'Linux' | nvidia-nvjitlink==13.0.39; platform_system == 'Linux' | nvidia-cufile==1.15.0.42; platform_system == 'Linux'
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_11-cuda13_0-test: # Testing
@ -1315,7 +1315,7 @@ jobs:
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
build_name: manywheel-py3_12-cuda12_6
build_environment: linux-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.6.77; platform_system == 'Linux' | nvidia-cuda-runtime-cu12==12.6.77; platform_system == 'Linux' | nvidia-cuda-cupti-cu12==12.6.80; platform_system == 'Linux' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' | nvidia-cublas-cu12==12.6.4.1; platform_system == 'Linux' | nvidia-cufft-cu12==11.3.0.4; platform_system == 'Linux' | nvidia-curand-cu12==10.3.7.77; platform_system == 'Linux' | nvidia-cusolver-cu12==11.7.1.2; platform_system == 'Linux' | nvidia-cusparse-cu12==12.5.4.2; platform_system == 'Linux' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' | nvidia-nccl-cu12==2.28.3; platform_system == 'Linux' | nvidia-nvshmem-cu12==3.3.24; platform_system == 'Linux' | nvidia-nvtx-cu12==12.6.77; platform_system == 'Linux' | nvidia-nvjitlink-cu12==12.6.85; platform_system == 'Linux' | nvidia-cufile-cu12==1.11.1.6; platform_system == 'Linux'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.6.77; platform_system == 'Linux' | nvidia-cuda-runtime-cu12==12.6.77; platform_system == 'Linux' | nvidia-cuda-cupti-cu12==12.6.80; platform_system == 'Linux' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' | nvidia-cublas-cu12==12.6.4.1; platform_system == 'Linux' | nvidia-cufft-cu12==11.3.0.4; platform_system == 'Linux' | nvidia-curand-cu12==10.3.7.77; platform_system == 'Linux' | nvidia-cusolver-cu12==11.7.1.2; platform_system == 'Linux' | nvidia-cusparse-cu12==12.5.4.2; platform_system == 'Linux' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' | nvidia-nccl-cu12==2.27.5; platform_system == 'Linux' | nvidia-nvshmem-cu12==3.3.24; platform_system == 'Linux' | nvidia-nvtx-cu12==12.6.77; platform_system == 'Linux' | nvidia-nvjitlink-cu12==12.6.85; platform_system == 'Linux' | nvidia-cufile-cu12==1.11.1.6; platform_system == 'Linux'
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_12-cuda12_6-test: # Testing
@ -1381,7 +1381,7 @@ jobs:
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
build_name: manywheel-py3_12-cuda12_8
build_environment: linux-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.8.93; platform_system == 'Linux' | nvidia-cuda-runtime-cu12==12.8.90; platform_system == 'Linux' | nvidia-cuda-cupti-cu12==12.8.90; platform_system == 'Linux' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' | nvidia-cublas-cu12==12.8.4.1; platform_system == 'Linux' | nvidia-cufft-cu12==11.3.3.83; platform_system == 'Linux' | nvidia-curand-cu12==10.3.9.90; platform_system == 'Linux' | nvidia-cusolver-cu12==11.7.3.90; platform_system == 'Linux' | nvidia-cusparse-cu12==12.5.8.93; platform_system == 'Linux' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' | nvidia-nccl-cu12==2.28.3; platform_system == 'Linux' | nvidia-nvshmem-cu12==3.3.24; platform_system == 'Linux' | nvidia-nvtx-cu12==12.8.90; platform_system == 'Linux' | nvidia-nvjitlink-cu12==12.8.93; platform_system == 'Linux' | nvidia-cufile-cu12==1.13.1.3; platform_system == 'Linux'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.8.93; platform_system == 'Linux' | nvidia-cuda-runtime-cu12==12.8.90; platform_system == 'Linux' | nvidia-cuda-cupti-cu12==12.8.90; platform_system == 'Linux' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' | nvidia-cublas-cu12==12.8.4.1; platform_system == 'Linux' | nvidia-cufft-cu12==11.3.3.83; platform_system == 'Linux' | nvidia-curand-cu12==10.3.9.90; platform_system == 'Linux' | nvidia-cusolver-cu12==11.7.3.90; platform_system == 'Linux' | nvidia-cusparse-cu12==12.5.8.93; platform_system == 'Linux' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' | nvidia-nccl-cu12==2.27.5; platform_system == 'Linux' | nvidia-nvshmem-cu12==3.3.24; platform_system == 'Linux' | nvidia-nvtx-cu12==12.8.90; platform_system == 'Linux' | nvidia-nvjitlink-cu12==12.8.93; platform_system == 'Linux' | nvidia-cufile-cu12==1.13.1.3; platform_system == 'Linux'
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_12-cuda12_8-test: # Testing
@ -1447,7 +1447,7 @@ jobs:
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
build_name: manywheel-py3_12-cuda13_0
build_environment: linux-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc==13.0.48; platform_system == 'Linux' | nvidia-cuda-runtime==13.0.48; platform_system == 'Linux' | nvidia-cuda-cupti==13.0.48; platform_system == 'Linux' | nvidia-cudnn-cu13==9.13.0.50; platform_system == 'Linux' | nvidia-cublas==13.0.0.19; platform_system == 'Linux' | nvidia-cufft==12.0.0.15; platform_system == 'Linux' | nvidia-curand==10.4.0.35; platform_system == 'Linux' | nvidia-cusolver==12.0.3.29; platform_system == 'Linux' | nvidia-cusparse==12.6.2.49; platform_system == 'Linux' | nvidia-cusparselt-cu13==0.8.0; platform_system == 'Linux' | nvidia-nccl-cu13==2.28.3; platform_system == 'Linux' | nvidia-nvshmem-cu13==3.3.24; platform_system == 'Linux' | nvidia-nvtx==13.0.39; platform_system == 'Linux' | nvidia-nvjitlink==13.0.39; platform_system == 'Linux' | nvidia-cufile==1.15.0.42; platform_system == 'Linux'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc==13.0.48; platform_system == 'Linux' | nvidia-cuda-runtime==13.0.48; platform_system == 'Linux' | nvidia-cuda-cupti==13.0.48; platform_system == 'Linux' | nvidia-cudnn-cu13==9.13.0.50; platform_system == 'Linux' | nvidia-cublas==13.0.0.19; platform_system == 'Linux' | nvidia-cufft==12.0.0.15; platform_system == 'Linux' | nvidia-curand==10.4.0.35; platform_system == 'Linux' | nvidia-cusolver==12.0.3.29; platform_system == 'Linux' | nvidia-cusparse==12.6.2.49; platform_system == 'Linux' | nvidia-cusparselt-cu13==0.8.0; platform_system == 'Linux' | nvidia-nccl-cu13==2.27.7; platform_system == 'Linux' | nvidia-nvshmem-cu13==3.3.24; platform_system == 'Linux' | nvidia-nvtx==13.0.39; platform_system == 'Linux' | nvidia-nvjitlink==13.0.39; platform_system == 'Linux' | nvidia-cufile==1.15.0.42; platform_system == 'Linux'
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_12-cuda13_0-test: # Testing
@ -1909,7 +1909,7 @@ jobs:
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
build_name: manywheel-py3_13-cuda12_6
build_environment: linux-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.6.77; platform_system == 'Linux' | nvidia-cuda-runtime-cu12==12.6.77; platform_system == 'Linux' | nvidia-cuda-cupti-cu12==12.6.80; platform_system == 'Linux' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' | nvidia-cublas-cu12==12.6.4.1; platform_system == 'Linux' | nvidia-cufft-cu12==11.3.0.4; platform_system == 'Linux' | nvidia-curand-cu12==10.3.7.77; platform_system == 'Linux' | nvidia-cusolver-cu12==11.7.1.2; platform_system == 'Linux' | nvidia-cusparse-cu12==12.5.4.2; platform_system == 'Linux' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' | nvidia-nccl-cu12==2.28.3; platform_system == 'Linux' | nvidia-nvshmem-cu12==3.3.24; platform_system == 'Linux' | nvidia-nvtx-cu12==12.6.77; platform_system == 'Linux' | nvidia-nvjitlink-cu12==12.6.85; platform_system == 'Linux' | nvidia-cufile-cu12==1.11.1.6; platform_system == 'Linux'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.6.77; platform_system == 'Linux' | nvidia-cuda-runtime-cu12==12.6.77; platform_system == 'Linux' | nvidia-cuda-cupti-cu12==12.6.80; platform_system == 'Linux' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' | nvidia-cublas-cu12==12.6.4.1; platform_system == 'Linux' | nvidia-cufft-cu12==11.3.0.4; platform_system == 'Linux' | nvidia-curand-cu12==10.3.7.77; platform_system == 'Linux' | nvidia-cusolver-cu12==11.7.1.2; platform_system == 'Linux' | nvidia-cusparse-cu12==12.5.4.2; platform_system == 'Linux' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' | nvidia-nccl-cu12==2.27.5; platform_system == 'Linux' | nvidia-nvshmem-cu12==3.3.24; platform_system == 'Linux' | nvidia-nvtx-cu12==12.6.77; platform_system == 'Linux' | nvidia-nvjitlink-cu12==12.6.85; platform_system == 'Linux' | nvidia-cufile-cu12==1.11.1.6; platform_system == 'Linux'
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_13-cuda12_6-test: # Testing
@ -1975,7 +1975,7 @@ jobs:
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
build_name: manywheel-py3_13-cuda12_8
build_environment: linux-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.8.93; platform_system == 'Linux' | nvidia-cuda-runtime-cu12==12.8.90; platform_system == 'Linux' | nvidia-cuda-cupti-cu12==12.8.90; platform_system == 'Linux' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' | nvidia-cublas-cu12==12.8.4.1; platform_system == 'Linux' | nvidia-cufft-cu12==11.3.3.83; platform_system == 'Linux' | nvidia-curand-cu12==10.3.9.90; platform_system == 'Linux' | nvidia-cusolver-cu12==11.7.3.90; platform_system == 'Linux' | nvidia-cusparse-cu12==12.5.8.93; platform_system == 'Linux' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' | nvidia-nccl-cu12==2.28.3; platform_system == 'Linux' | nvidia-nvshmem-cu12==3.3.24; platform_system == 'Linux' | nvidia-nvtx-cu12==12.8.90; platform_system == 'Linux' | nvidia-nvjitlink-cu12==12.8.93; platform_system == 'Linux' | nvidia-cufile-cu12==1.13.1.3; platform_system == 'Linux'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.8.93; platform_system == 'Linux' | nvidia-cuda-runtime-cu12==12.8.90; platform_system == 'Linux' | nvidia-cuda-cupti-cu12==12.8.90; platform_system == 'Linux' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' | nvidia-cublas-cu12==12.8.4.1; platform_system == 'Linux' | nvidia-cufft-cu12==11.3.3.83; platform_system == 'Linux' | nvidia-curand-cu12==10.3.9.90; platform_system == 'Linux' | nvidia-cusolver-cu12==11.7.3.90; platform_system == 'Linux' | nvidia-cusparse-cu12==12.5.8.93; platform_system == 'Linux' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' | nvidia-nccl-cu12==2.27.5; platform_system == 'Linux' | nvidia-nvshmem-cu12==3.3.24; platform_system == 'Linux' | nvidia-nvtx-cu12==12.8.90; platform_system == 'Linux' | nvidia-nvjitlink-cu12==12.8.93; platform_system == 'Linux' | nvidia-cufile-cu12==1.13.1.3; platform_system == 'Linux'
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_13-cuda12_8-test: # Testing
@ -2041,7 +2041,7 @@ jobs:
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
build_name: manywheel-py3_13-cuda13_0
build_environment: linux-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc==13.0.48; platform_system == 'Linux' | nvidia-cuda-runtime==13.0.48; platform_system == 'Linux' | nvidia-cuda-cupti==13.0.48; platform_system == 'Linux' | nvidia-cudnn-cu13==9.13.0.50; platform_system == 'Linux' | nvidia-cublas==13.0.0.19; platform_system == 'Linux' | nvidia-cufft==12.0.0.15; platform_system == 'Linux' | nvidia-curand==10.4.0.35; platform_system == 'Linux' | nvidia-cusolver==12.0.3.29; platform_system == 'Linux' | nvidia-cusparse==12.6.2.49; platform_system == 'Linux' | nvidia-cusparselt-cu13==0.8.0; platform_system == 'Linux' | nvidia-nccl-cu13==2.28.3; platform_system == 'Linux' | nvidia-nvshmem-cu13==3.3.24; platform_system == 'Linux' | nvidia-nvtx==13.0.39; platform_system == 'Linux' | nvidia-nvjitlink==13.0.39; platform_system == 'Linux' | nvidia-cufile==1.15.0.42; platform_system == 'Linux'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc==13.0.48; platform_system == 'Linux' | nvidia-cuda-runtime==13.0.48; platform_system == 'Linux' | nvidia-cuda-cupti==13.0.48; platform_system == 'Linux' | nvidia-cudnn-cu13==9.13.0.50; platform_system == 'Linux' | nvidia-cublas==13.0.0.19; platform_system == 'Linux' | nvidia-cufft==12.0.0.15; platform_system == 'Linux' | nvidia-curand==10.4.0.35; platform_system == 'Linux' | nvidia-cusolver==12.0.3.29; platform_system == 'Linux' | nvidia-cusparse==12.6.2.49; platform_system == 'Linux' | nvidia-cusparselt-cu13==0.8.0; platform_system == 'Linux' | nvidia-nccl-cu13==2.27.7; platform_system == 'Linux' | nvidia-nvshmem-cu13==3.3.24; platform_system == 'Linux' | nvidia-nvtx==13.0.39; platform_system == 'Linux' | nvidia-nvjitlink==13.0.39; platform_system == 'Linux' | nvidia-cufile==1.15.0.42; platform_system == 'Linux'
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_13-cuda13_0-test: # Testing
@ -2503,7 +2503,7 @@ jobs:
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
build_name: manywheel-py3_13t-cuda12_6
build_environment: linux-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.6.77; platform_system == 'Linux' | nvidia-cuda-runtime-cu12==12.6.77; platform_system == 'Linux' | nvidia-cuda-cupti-cu12==12.6.80; platform_system == 'Linux' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' | nvidia-cublas-cu12==12.6.4.1; platform_system == 'Linux' | nvidia-cufft-cu12==11.3.0.4; platform_system == 'Linux' | nvidia-curand-cu12==10.3.7.77; platform_system == 'Linux' | nvidia-cusolver-cu12==11.7.1.2; platform_system == 'Linux' | nvidia-cusparse-cu12==12.5.4.2; platform_system == 'Linux' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' | nvidia-nccl-cu12==2.28.3; platform_system == 'Linux' | nvidia-nvshmem-cu12==3.3.24; platform_system == 'Linux' | nvidia-nvtx-cu12==12.6.77; platform_system == 'Linux' | nvidia-nvjitlink-cu12==12.6.85; platform_system == 'Linux' | nvidia-cufile-cu12==1.11.1.6; platform_system == 'Linux'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.6.77; platform_system == 'Linux' | nvidia-cuda-runtime-cu12==12.6.77; platform_system == 'Linux' | nvidia-cuda-cupti-cu12==12.6.80; platform_system == 'Linux' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' | nvidia-cublas-cu12==12.6.4.1; platform_system == 'Linux' | nvidia-cufft-cu12==11.3.0.4; platform_system == 'Linux' | nvidia-curand-cu12==10.3.7.77; platform_system == 'Linux' | nvidia-cusolver-cu12==11.7.1.2; platform_system == 'Linux' | nvidia-cusparse-cu12==12.5.4.2; platform_system == 'Linux' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' | nvidia-nccl-cu12==2.27.5; platform_system == 'Linux' | nvidia-nvshmem-cu12==3.3.24; platform_system == 'Linux' | nvidia-nvtx-cu12==12.6.77; platform_system == 'Linux' | nvidia-nvjitlink-cu12==12.6.85; platform_system == 'Linux' | nvidia-cufile-cu12==1.11.1.6; platform_system == 'Linux'
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_13t-cuda12_6-test: # Testing
@ -2569,7 +2569,7 @@ jobs:
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
build_name: manywheel-py3_13t-cuda12_8
build_environment: linux-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.8.93; platform_system == 'Linux' | nvidia-cuda-runtime-cu12==12.8.90; platform_system == 'Linux' | nvidia-cuda-cupti-cu12==12.8.90; platform_system == 'Linux' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' | nvidia-cublas-cu12==12.8.4.1; platform_system == 'Linux' | nvidia-cufft-cu12==11.3.3.83; platform_system == 'Linux' | nvidia-curand-cu12==10.3.9.90; platform_system == 'Linux' | nvidia-cusolver-cu12==11.7.3.90; platform_system == 'Linux' | nvidia-cusparse-cu12==12.5.8.93; platform_system == 'Linux' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' | nvidia-nccl-cu12==2.28.3; platform_system == 'Linux' | nvidia-nvshmem-cu12==3.3.24; platform_system == 'Linux' | nvidia-nvtx-cu12==12.8.90; platform_system == 'Linux' | nvidia-nvjitlink-cu12==12.8.93; platform_system == 'Linux' | nvidia-cufile-cu12==1.13.1.3; platform_system == 'Linux'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.8.93; platform_system == 'Linux' | nvidia-cuda-runtime-cu12==12.8.90; platform_system == 'Linux' | nvidia-cuda-cupti-cu12==12.8.90; platform_system == 'Linux' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' | nvidia-cublas-cu12==12.8.4.1; platform_system == 'Linux' | nvidia-cufft-cu12==11.3.3.83; platform_system == 'Linux' | nvidia-curand-cu12==10.3.9.90; platform_system == 'Linux' | nvidia-cusolver-cu12==11.7.3.90; platform_system == 'Linux' | nvidia-cusparse-cu12==12.5.8.93; platform_system == 'Linux' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' | nvidia-nccl-cu12==2.27.5; platform_system == 'Linux' | nvidia-nvshmem-cu12==3.3.24; platform_system == 'Linux' | nvidia-nvtx-cu12==12.8.90; platform_system == 'Linux' | nvidia-nvjitlink-cu12==12.8.93; platform_system == 'Linux' | nvidia-cufile-cu12==1.13.1.3; platform_system == 'Linux'
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_13t-cuda12_8-test: # Testing
@ -2635,7 +2635,7 @@ jobs:
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
build_name: manywheel-py3_13t-cuda13_0
build_environment: linux-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc==13.0.48; platform_system == 'Linux' | nvidia-cuda-runtime==13.0.48; platform_system == 'Linux' | nvidia-cuda-cupti==13.0.48; platform_system == 'Linux' | nvidia-cudnn-cu13==9.13.0.50; platform_system == 'Linux' | nvidia-cublas==13.0.0.19; platform_system == 'Linux' | nvidia-cufft==12.0.0.15; platform_system == 'Linux' | nvidia-curand==10.4.0.35; platform_system == 'Linux' | nvidia-cusolver==12.0.3.29; platform_system == 'Linux' | nvidia-cusparse==12.6.2.49; platform_system == 'Linux' | nvidia-cusparselt-cu13==0.8.0; platform_system == 'Linux' | nvidia-nccl-cu13==2.28.3; platform_system == 'Linux' | nvidia-nvshmem-cu13==3.3.24; platform_system == 'Linux' | nvidia-nvtx==13.0.39; platform_system == 'Linux' | nvidia-nvjitlink==13.0.39; platform_system == 'Linux' | nvidia-cufile==1.15.0.42; platform_system == 'Linux'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc==13.0.48; platform_system == 'Linux' | nvidia-cuda-runtime==13.0.48; platform_system == 'Linux' | nvidia-cuda-cupti==13.0.48; platform_system == 'Linux' | nvidia-cudnn-cu13==9.13.0.50; platform_system == 'Linux' | nvidia-cublas==13.0.0.19; platform_system == 'Linux' | nvidia-cufft==12.0.0.15; platform_system == 'Linux' | nvidia-curand==10.4.0.35; platform_system == 'Linux' | nvidia-cusolver==12.0.3.29; platform_system == 'Linux' | nvidia-cusparse==12.6.2.49; platform_system == 'Linux' | nvidia-cusparselt-cu13==0.8.0; platform_system == 'Linux' | nvidia-nccl-cu13==2.27.7; platform_system == 'Linux' | nvidia-nvshmem-cu13==3.3.24; platform_system == 'Linux' | nvidia-nvtx==13.0.39; platform_system == 'Linux' | nvidia-nvjitlink==13.0.39; platform_system == 'Linux' | nvidia-cufile==1.15.0.42; platform_system == 'Linux'
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_13t-cuda13_0-test: # Testing
@ -3097,7 +3097,7 @@ jobs:
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
build_name: manywheel-py3_14-cuda12_6
build_environment: linux-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.6.77; platform_system == 'Linux' | nvidia-cuda-runtime-cu12==12.6.77; platform_system == 'Linux' | nvidia-cuda-cupti-cu12==12.6.80; platform_system == 'Linux' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' | nvidia-cublas-cu12==12.6.4.1; platform_system == 'Linux' | nvidia-cufft-cu12==11.3.0.4; platform_system == 'Linux' | nvidia-curand-cu12==10.3.7.77; platform_system == 'Linux' | nvidia-cusolver-cu12==11.7.1.2; platform_system == 'Linux' | nvidia-cusparse-cu12==12.5.4.2; platform_system == 'Linux' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' | nvidia-nccl-cu12==2.28.3; platform_system == 'Linux' | nvidia-nvshmem-cu12==3.3.24; platform_system == 'Linux' | nvidia-nvtx-cu12==12.6.77; platform_system == 'Linux' | nvidia-nvjitlink-cu12==12.6.85; platform_system == 'Linux' | nvidia-cufile-cu12==1.11.1.6; platform_system == 'Linux'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.6.77; platform_system == 'Linux' | nvidia-cuda-runtime-cu12==12.6.77; platform_system == 'Linux' | nvidia-cuda-cupti-cu12==12.6.80; platform_system == 'Linux' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' | nvidia-cublas-cu12==12.6.4.1; platform_system == 'Linux' | nvidia-cufft-cu12==11.3.0.4; platform_system == 'Linux' | nvidia-curand-cu12==10.3.7.77; platform_system == 'Linux' | nvidia-cusolver-cu12==11.7.1.2; platform_system == 'Linux' | nvidia-cusparse-cu12==12.5.4.2; platform_system == 'Linux' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' | nvidia-nccl-cu12==2.27.5; platform_system == 'Linux' | nvidia-nvshmem-cu12==3.3.24; platform_system == 'Linux' | nvidia-nvtx-cu12==12.6.77; platform_system == 'Linux' | nvidia-nvjitlink-cu12==12.6.85; platform_system == 'Linux' | nvidia-cufile-cu12==1.11.1.6; platform_system == 'Linux'
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_14-cuda12_6-test: # Testing
@ -3163,7 +3163,7 @@ jobs:
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
build_name: manywheel-py3_14-cuda12_8
build_environment: linux-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.8.93; platform_system == 'Linux' | nvidia-cuda-runtime-cu12==12.8.90; platform_system == 'Linux' | nvidia-cuda-cupti-cu12==12.8.90; platform_system == 'Linux' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' | nvidia-cublas-cu12==12.8.4.1; platform_system == 'Linux' | nvidia-cufft-cu12==11.3.3.83; platform_system == 'Linux' | nvidia-curand-cu12==10.3.9.90; platform_system == 'Linux' | nvidia-cusolver-cu12==11.7.3.90; platform_system == 'Linux' | nvidia-cusparse-cu12==12.5.8.93; platform_system == 'Linux' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' | nvidia-nccl-cu12==2.28.3; platform_system == 'Linux' | nvidia-nvshmem-cu12==3.3.24; platform_system == 'Linux' | nvidia-nvtx-cu12==12.8.90; platform_system == 'Linux' | nvidia-nvjitlink-cu12==12.8.93; platform_system == 'Linux' | nvidia-cufile-cu12==1.13.1.3; platform_system == 'Linux'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.8.93; platform_system == 'Linux' | nvidia-cuda-runtime-cu12==12.8.90; platform_system == 'Linux' | nvidia-cuda-cupti-cu12==12.8.90; platform_system == 'Linux' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' | nvidia-cublas-cu12==12.8.4.1; platform_system == 'Linux' | nvidia-cufft-cu12==11.3.3.83; platform_system == 'Linux' | nvidia-curand-cu12==10.3.9.90; platform_system == 'Linux' | nvidia-cusolver-cu12==11.7.3.90; platform_system == 'Linux' | nvidia-cusparse-cu12==12.5.8.93; platform_system == 'Linux' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' | nvidia-nccl-cu12==2.27.5; platform_system == 'Linux' | nvidia-nvshmem-cu12==3.3.24; platform_system == 'Linux' | nvidia-nvtx-cu12==12.8.90; platform_system == 'Linux' | nvidia-nvjitlink-cu12==12.8.93; platform_system == 'Linux' | nvidia-cufile-cu12==1.13.1.3; platform_system == 'Linux'
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_14-cuda12_8-test: # Testing
@ -3229,7 +3229,7 @@ jobs:
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
build_name: manywheel-py3_14-cuda13_0
build_environment: linux-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc==13.0.48; platform_system == 'Linux' | nvidia-cuda-runtime==13.0.48; platform_system == 'Linux' | nvidia-cuda-cupti==13.0.48; platform_system == 'Linux' | nvidia-cudnn-cu13==9.13.0.50; platform_system == 'Linux' | nvidia-cublas==13.0.0.19; platform_system == 'Linux' | nvidia-cufft==12.0.0.15; platform_system == 'Linux' | nvidia-curand==10.4.0.35; platform_system == 'Linux' | nvidia-cusolver==12.0.3.29; platform_system == 'Linux' | nvidia-cusparse==12.6.2.49; platform_system == 'Linux' | nvidia-cusparselt-cu13==0.8.0; platform_system == 'Linux' | nvidia-nccl-cu13==2.28.3; platform_system == 'Linux' | nvidia-nvshmem-cu13==3.3.24; platform_system == 'Linux' | nvidia-nvtx==13.0.39; platform_system == 'Linux' | nvidia-nvjitlink==13.0.39; platform_system == 'Linux' | nvidia-cufile==1.15.0.42; platform_system == 'Linux'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc==13.0.48; platform_system == 'Linux' | nvidia-cuda-runtime==13.0.48; platform_system == 'Linux' | nvidia-cuda-cupti==13.0.48; platform_system == 'Linux' | nvidia-cudnn-cu13==9.13.0.50; platform_system == 'Linux' | nvidia-cublas==13.0.0.19; platform_system == 'Linux' | nvidia-cufft==12.0.0.15; platform_system == 'Linux' | nvidia-curand==10.4.0.35; platform_system == 'Linux' | nvidia-cusolver==12.0.3.29; platform_system == 'Linux' | nvidia-cusparse==12.6.2.49; platform_system == 'Linux' | nvidia-cusparselt-cu13==0.8.0; platform_system == 'Linux' | nvidia-nccl-cu13==2.27.7; platform_system == 'Linux' | nvidia-nvshmem-cu13==3.3.24; platform_system == 'Linux' | nvidia-nvtx==13.0.39; platform_system == 'Linux' | nvidia-nvjitlink==13.0.39; platform_system == 'Linux' | nvidia-cufile==1.15.0.42; platform_system == 'Linux'
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_14-cuda13_0-test: # Testing
@ -3691,7 +3691,7 @@ jobs:
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
build_name: manywheel-py3_14t-cuda12_6
build_environment: linux-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.6.77; platform_system == 'Linux' | nvidia-cuda-runtime-cu12==12.6.77; platform_system == 'Linux' | nvidia-cuda-cupti-cu12==12.6.80; platform_system == 'Linux' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' | nvidia-cublas-cu12==12.6.4.1; platform_system == 'Linux' | nvidia-cufft-cu12==11.3.0.4; platform_system == 'Linux' | nvidia-curand-cu12==10.3.7.77; platform_system == 'Linux' | nvidia-cusolver-cu12==11.7.1.2; platform_system == 'Linux' | nvidia-cusparse-cu12==12.5.4.2; platform_system == 'Linux' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' | nvidia-nccl-cu12==2.28.3; platform_system == 'Linux' | nvidia-nvshmem-cu12==3.3.24; platform_system == 'Linux' | nvidia-nvtx-cu12==12.6.77; platform_system == 'Linux' | nvidia-nvjitlink-cu12==12.6.85; platform_system == 'Linux' | nvidia-cufile-cu12==1.11.1.6; platform_system == 'Linux'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.6.77; platform_system == 'Linux' | nvidia-cuda-runtime-cu12==12.6.77; platform_system == 'Linux' | nvidia-cuda-cupti-cu12==12.6.80; platform_system == 'Linux' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' | nvidia-cublas-cu12==12.6.4.1; platform_system == 'Linux' | nvidia-cufft-cu12==11.3.0.4; platform_system == 'Linux' | nvidia-curand-cu12==10.3.7.77; platform_system == 'Linux' | nvidia-cusolver-cu12==11.7.1.2; platform_system == 'Linux' | nvidia-cusparse-cu12==12.5.4.2; platform_system == 'Linux' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' | nvidia-nccl-cu12==2.27.5; platform_system == 'Linux' | nvidia-nvshmem-cu12==3.3.24; platform_system == 'Linux' | nvidia-nvtx-cu12==12.6.77; platform_system == 'Linux' | nvidia-nvjitlink-cu12==12.6.85; platform_system == 'Linux' | nvidia-cufile-cu12==1.11.1.6; platform_system == 'Linux'
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_14t-cuda12_6-test: # Testing
@ -3757,7 +3757,7 @@ jobs:
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
build_name: manywheel-py3_14t-cuda12_8
build_environment: linux-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.8.93; platform_system == 'Linux' | nvidia-cuda-runtime-cu12==12.8.90; platform_system == 'Linux' | nvidia-cuda-cupti-cu12==12.8.90; platform_system == 'Linux' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' | nvidia-cublas-cu12==12.8.4.1; platform_system == 'Linux' | nvidia-cufft-cu12==11.3.3.83; platform_system == 'Linux' | nvidia-curand-cu12==10.3.9.90; platform_system == 'Linux' | nvidia-cusolver-cu12==11.7.3.90; platform_system == 'Linux' | nvidia-cusparse-cu12==12.5.8.93; platform_system == 'Linux' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' | nvidia-nccl-cu12==2.28.3; platform_system == 'Linux' | nvidia-nvshmem-cu12==3.3.24; platform_system == 'Linux' | nvidia-nvtx-cu12==12.8.90; platform_system == 'Linux' | nvidia-nvjitlink-cu12==12.8.93; platform_system == 'Linux' | nvidia-cufile-cu12==1.13.1.3; platform_system == 'Linux'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.8.93; platform_system == 'Linux' | nvidia-cuda-runtime-cu12==12.8.90; platform_system == 'Linux' | nvidia-cuda-cupti-cu12==12.8.90; platform_system == 'Linux' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' | nvidia-cublas-cu12==12.8.4.1; platform_system == 'Linux' | nvidia-cufft-cu12==11.3.3.83; platform_system == 'Linux' | nvidia-curand-cu12==10.3.9.90; platform_system == 'Linux' | nvidia-cusolver-cu12==11.7.3.90; platform_system == 'Linux' | nvidia-cusparse-cu12==12.5.8.93; platform_system == 'Linux' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' | nvidia-nccl-cu12==2.27.5; platform_system == 'Linux' | nvidia-nvshmem-cu12==3.3.24; platform_system == 'Linux' | nvidia-nvtx-cu12==12.8.90; platform_system == 'Linux' | nvidia-nvjitlink-cu12==12.8.93; platform_system == 'Linux' | nvidia-cufile-cu12==1.13.1.3; platform_system == 'Linux'
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_14t-cuda12_8-test: # Testing
@ -3823,7 +3823,7 @@ jobs:
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
build_name: manywheel-py3_14t-cuda13_0
build_environment: linux-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc==13.0.48; platform_system == 'Linux' | nvidia-cuda-runtime==13.0.48; platform_system == 'Linux' | nvidia-cuda-cupti==13.0.48; platform_system == 'Linux' | nvidia-cudnn-cu13==9.13.0.50; platform_system == 'Linux' | nvidia-cublas==13.0.0.19; platform_system == 'Linux' | nvidia-cufft==12.0.0.15; platform_system == 'Linux' | nvidia-curand==10.4.0.35; platform_system == 'Linux' | nvidia-cusolver==12.0.3.29; platform_system == 'Linux' | nvidia-cusparse==12.6.2.49; platform_system == 'Linux' | nvidia-cusparselt-cu13==0.8.0; platform_system == 'Linux' | nvidia-nccl-cu13==2.28.3; platform_system == 'Linux' | nvidia-nvshmem-cu13==3.3.24; platform_system == 'Linux' | nvidia-nvtx==13.0.39; platform_system == 'Linux' | nvidia-nvjitlink==13.0.39; platform_system == 'Linux' | nvidia-cufile==1.15.0.42; platform_system == 'Linux'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc==13.0.48; platform_system == 'Linux' | nvidia-cuda-runtime==13.0.48; platform_system == 'Linux' | nvidia-cuda-cupti==13.0.48; platform_system == 'Linux' | nvidia-cudnn-cu13==9.13.0.50; platform_system == 'Linux' | nvidia-cublas==13.0.0.19; platform_system == 'Linux' | nvidia-cufft==12.0.0.15; platform_system == 'Linux' | nvidia-curand==10.4.0.35; platform_system == 'Linux' | nvidia-cusolver==12.0.3.29; platform_system == 'Linux' | nvidia-cusparse==12.6.2.49; platform_system == 'Linux' | nvidia-cusparselt-cu13==0.8.0; platform_system == 'Linux' | nvidia-nccl-cu13==2.27.7; platform_system == 'Linux' | nvidia-nvshmem-cu13==3.3.24; platform_system == 'Linux' | nvidia-nvtx==13.0.39; platform_system == 'Linux' | nvidia-nvjitlink==13.0.39; platform_system == 'Linux' | nvidia-cufile==1.15.0.42; platform_system == 'Linux'
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_14t-cuda13_0-test: # Testing

View File

@ -2,7 +2,7 @@ name: inductor-perf-nightly-h100
on:
schedule:
- cron: 15 0,12 * * 1-6
- cron: 15 0 * * 1-6
- cron: 0 7 * * 0
# NB: GitHub has an upper limit of 10 inputs here, so before we can sort it
# out, let try to run torchao cudagraphs_low_precision as part of cudagraphs

View File

@ -63,6 +63,7 @@ jobs:
# Same as the build job
python-version: 3.12.7
test-matrix: ${{ needs.macos-perf-py3-arm64-build.outputs.test-matrix }}
timeout-minutes: 300
disable-monitor: false
monitor-log-interval: 15
monitor-data-collect-interval: 4

View File

@ -106,6 +106,16 @@ jobs:
{ config: "dynamic_aot_eager_huggingface", shard: 1, num_shards: 1, runner: "linux.rocm.gpu.gfx942.1" },
{ config: "dynamic_aot_eager_timm", shard: 1, num_shards: 2, runner: "linux.rocm.gpu.gfx942.1" },
{ config: "dynamic_aot_eager_timm", shard: 2, num_shards: 2, runner: "linux.rocm.gpu.gfx942.1" },
{ config: "dynamic_inductor_huggingface", shard: 1, num_shards: 1, runner: "linux.rocm.gpu.gfx942.1" },
{ config: "dynamic_inductor_timm", shard: 1, num_shards: 2, runner: "linux.rocm.gpu.gfx942.1" },
{ config: "dynamic_inductor_timm", shard: 2, num_shards: 2, runner: "linux.rocm.gpu.gfx942.1" },
{ config: "dynamic_inductor_torchbench", shard: 1, num_shards: 2, runner: "linux.rocm.gpu.gfx942.1" },
{ config: "dynamic_inductor_torchbench", shard: 2, num_shards: 2, runner: "linux.rocm.gpu.gfx942.1" },
{ config: "aot_inductor_huggingface", shard: 1, num_shards: 1, runner: "linux.rocm.gpu.gfx942.1" },
{ config: "aot_inductor_timm", shard: 1, num_shards: 2, runner: "linux.rocm.gpu.gfx942.1" },
{ config: "aot_inductor_timm", shard: 2, num_shards: 2, runner: "linux.rocm.gpu.gfx942.1" },
{ config: "aot_inductor_torchbench", shard: 1, num_shards: 2, runner: "linux.rocm.gpu.gfx942.1" },
{ config: "aot_inductor_torchbench", shard: 2, num_shards: 2, runner: "linux.rocm.gpu.gfx942.1" },
]}
secrets: inherit

View File

@ -18,6 +18,7 @@ permissions:
contents: read
jobs:
# H100 A100 runners
opmicrobenchmark-build:
if: github.repository_owner == 'pytorch'
name: opmicrobenchmark-build
@ -44,3 +45,56 @@ jobs:
docker-image: ${{ needs.opmicrobenchmark-build.outputs.docker-image }}
test-matrix: ${{ needs.opmicrobenchmark-build.outputs.test-matrix }}
secrets: inherit
# B200 runner
opmicrobenchmark-build-b200:
if: github.repository_owner == 'pytorch'
name: opmicrobenchmark-build-b200
uses: ./.github/workflows/_linux-build.yml
with:
runner: linux.12xlarge.memory
build-environment: linux-jammy-cuda12.8-py3.10-gcc9-sm100
docker-image-name: ci-image:pytorch-linux-jammy-cuda12.8-cudnn9-py3-gcc11
cuda-arch-list: '10.0'
test-matrix: |
{ include: [
{ config: "operator_microbenchmark_test", shard: 1, num_shards: 1, runner: "linux.dgx.b200" },
]}
secrets: inherit
opmicrobenchmark-test-b200:
name: opmicrobenchmark-test-b200
uses: ./.github/workflows/_linux-test.yml
needs: opmicrobenchmark-build-b200
with:
timeout-minutes: 500
build-environment: linux-jammy-cuda12.8-py3.10-gcc9-sm100
docker-image: ${{ needs.opmicrobenchmark-build-b200.outputs.docker-image }}
test-matrix: ${{ needs.opmicrobenchmark-build-b200.outputs.test-matrix }}
aws-role-to-assume: arn:aws:iam::308535385114:role/gha_workflow_s3_and_ecr_read_only
secrets: inherit
# ROCM MI300 runner
opmicrobenchmark-build-rocm:
if: github.repository_owner == 'pytorch'
name: opmicrobenchmark-build-rocm
uses: ./.github/workflows/_linux-build.yml
with:
build-environment: linux-jammy-rocm-py3_10
docker-image-name: ci-image:pytorch-linux-jammy-rocm-n-py3-benchmarks
test-matrix: |
{ include: [
{ config: "operator_microbenchmark_test", shard: 1, num_shards: 1, runner: "linux.rocm.gpu.gfx942.1" },
]}
secrets: inherit
opmicrobenchmark-test-rocm:
name: opmicrobenchmark-test-rocm
uses: ./.github/workflows/_rocm-test.yml
needs: opmicrobenchmark-build-rocm
with:
timeout-minutes: 500
build-environment: linux-jammy-rocm-py3_10
docker-image: ${{ needs.opmicrobenchmark-build-rocm.outputs.docker-image }}
test-matrix: ${{ needs.opmicrobenchmark-build-rocm.outputs.test-matrix }}
secrets: inherit

View File

@ -213,9 +213,9 @@ jobs:
docker-image-name: ci-image:pytorch-linux-jammy-rocm-n-py3
test-matrix: |
{ include: [
{ config: "distributed", shard: 1, num_shards: 3, runner: "linux.rocm.gpu.4", owners: ["module:rocm", "oncall:distributed"] },
{ config: "distributed", shard: 2, num_shards: 3, runner: "linux.rocm.gpu.4", owners: ["module:rocm", "oncall:distributed"] },
{ config: "distributed", shard: 3, num_shards: 3, runner: "linux.rocm.gpu.4", owners: ["module:rocm", "oncall:distributed"] },
{ config: "distributed", shard: 1, num_shards: 3, runner: "linux.rocm.gpu.mi250.4", owners: ["module:rocm", "oncall:distributed"] },
{ config: "distributed", shard: 2, num_shards: 3, runner: "linux.rocm.gpu.mi250.4", owners: ["module:rocm", "oncall:distributed"] },
{ config: "distributed", shard: 3, num_shards: 3, runner: "linux.rocm.gpu.mi250.4", owners: ["module:rocm", "oncall:distributed"] },
]}
secrets: inherit

View File

@ -127,8 +127,6 @@ jobs:
uses: ./.github/workflows/_linux-build.yml
needs: get-label-type
with:
# More memory is needed to build with asan
runner: linux.2xlarge.memory
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
build-environment: linux-jammy-py3.10-clang18-asan
docker-image-name: ci-image:pytorch-linux-jammy-py3-clang18-asan

View File

@ -59,3 +59,29 @@ jobs:
docker-image: ${{ needs.linux-jammy-rocm-py3_10-build.outputs.docker-image }}
test-matrix: ${{ needs.linux-jammy-rocm-py3_10-build.outputs.test-matrix }}
secrets: inherit
linux-jammy-rocm-py3_10-gfx1100-test:
if: ${{ github.event_name == 'push' && github.ref == 'refs/heads/main' }}
permissions:
id-token: write
contents: read
name: linux-jammy-rocm-py3_10-gfx1100
uses: ./.github/workflows/_rocm-test.yml
needs:
- linux-jammy-rocm-py3_10-build
- target-determination
with:
build-environment: linux-jammy-rocm-py3.10
docker-image: ${{ needs.linux-jammy-rocm-py3_10-build.outputs.docker-image }}
test-matrix: |
{ include: [
{ config: "default", shard: 1, num_shards: 2, runner: "linux.rocm.gpu.gfx1100" },
{ config: "default", shard: 2, num_shards: 2, runner: "linux.rocm.gpu.gfx1100" },
]}
tests-to-include: >
test_nn test_torch test_cuda test_ops test_unary_ufuncs test_binary_ufuncs
test_autograd inductor/test_torchinductor inductor/test_kernel_benchmark
inductor/test_pad_mm inductor/test_benchmark_fusion inductor/test_aot_inductor
inductor/test_torchinductor inductor/test_decompose_mem_bound_mm
inductor/test_flex_attention inductor/test_max_autotune
secrets: inherit

View File

@ -140,8 +140,6 @@ jobs:
uses: ./.github/workflows/_linux-build.yml
needs: get-label-type
with:
# More memory is needed to build with asan
runner: linux.2xlarge.memory
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
build-environment: linux-jammy-py3.10-clang18-asan
docker-image-name: ci-image:pytorch-linux-jammy-py3-clang18-asan

View File

@ -61,3 +61,15 @@ jobs:
docker-image: ${{ needs.linux-jammy-cuda12_8-py3_10-gcc11-sm90-build.outputs.docker-image }}
test-matrix: ${{ needs.linux-jammy-cuda12_8-py3_10-gcc11-sm90-build.outputs.test-matrix }}
secrets: inherit
linux-jammy-cuda12_8-py3_10-gcc11-sm90-FA3-ABI-stable-test:
name: linux-jammy-cuda12_8-py3_10-gcc11-sm90-FA3-ABI-stable-test
uses: ./.github/workflows/_linux-test-stable-fa3.yml
needs:
- linux-jammy-cuda12_8-py3_10-gcc11-sm90-build
with:
build-environment: linux-jammy-cuda12.8-py3.10-gcc11-sm90
docker-image: ${{ needs.linux-jammy-cuda12_8-py3_10-gcc11-sm90-build.outputs.docker-image }}
timeout-minutes: 30
s3-bucket: gha-artifacts
secrets: inherit

View File

@ -160,9 +160,10 @@ jobs:
runner: "${{ needs.get-label-type.outputs.label-type }}windows.4xlarge.nonephemeral"
test-matrix: |
{ include: [
{ config: "default", shard: 1, num_shards: 3, runner: "${{ needs.get-label-type.outputs.label-type }}windows.4xlarge.nonephemeral" },
{ config: "default", shard: 2, num_shards: 3, runner: "${{ needs.get-label-type.outputs.label-type }}windows.4xlarge.nonephemeral" },
{ config: "default", shard: 3, num_shards: 3, runner: "${{ needs.get-label-type.outputs.label-type }}windows.4xlarge.nonephemeral" },
{ config: "default", shard: 1, num_shards: 4, runner: "${{ needs.get-label-type.outputs.label-type }}windows.4xlarge.nonephemeral" },
{ config: "default", shard: 2, num_shards: 4, runner: "${{ needs.get-label-type.outputs.label-type }}windows.4xlarge.nonephemeral" },
{ config: "default", shard: 3, num_shards: 4, runner: "${{ needs.get-label-type.outputs.label-type }}windows.4xlarge.nonephemeral" },
{ config: "default", shard: 4, num_shards: 4, runner: "${{ needs.get-label-type.outputs.label-type }}windows.4xlarge.nonephemeral" },
]}
secrets: inherit
@ -189,41 +190,6 @@ jobs:
runner: "${{ needs.get-label-type.outputs.label-type }}windows.4xlarge.nonephemeral"
secrets: inherit
linux-jammy-rocm-py3_10-build:
if: ${{ startsWith(github.event.ref, 'refs/tags/ciflow/trunk') }}
name: linux-jammy-rocm-py3.10
uses: ./.github/workflows/_linux-build.yml
needs: get-label-type
with:
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
build-environment: linux-jammy-rocm-py3.10
docker-image-name: ci-image:pytorch-linux-jammy-rocm-n-py3
sync-tag: rocm-build
test-matrix: |
{ include: [
{ config: "default", shard: 1, num_shards: 2, runner: "linux.rocm.gpu.gfx942.1" },
{ config: "default", shard: 2, num_shards: 2, runner: "linux.rocm.gpu.gfx942.1" },
{ config: "distributed", shard: 1, num_shards: 1, runner: "linux.rocm.gpu.gfx942.4" },
]}
secrets: inherit
linux-jammy-rocm-py3_10-test:
if: ${{ startsWith(github.event.ref, 'refs/tags/ciflow/trunk') }}
permissions:
id-token: write
contents: read
name: linux-jammy-rocm-py3.10
uses: ./.github/workflows/_rocm-test.yml
needs:
- linux-jammy-rocm-py3_10-build
- target-determination
with:
build-environment: linux-jammy-rocm-py3.10
docker-image: ${{ needs.linux-jammy-rocm-py3_10-build.outputs.docker-image }}
test-matrix: ${{ needs.linux-jammy-rocm-py3_10-build.outputs.test-matrix }}
tests-to-include: "test_nn test_torch test_cuda test_ops test_unary_ufuncs test_binary_ufuncs test_autograd inductor/test_torchinductor distributed/test_c10d_common distributed/test_c10d_nccl"
secrets: inherit
inductor-build:
name: inductor-build
uses: ./.github/workflows/_linux-build.yml

View File

@ -23,7 +23,7 @@ jobs:
with:
repository: pytorch/pytorch
stable-branch: viable/strict
requires: '[\"pull\", \"trunk\", \"lint\", \"^linux-binary-manywheel$\", \"^linux-binary-libtorch-release$\", \"linux-aarch64\"]'
requires: '[\"pull\", \"trunk\", \"lint\", \"linux-aarch64\"]'
secret-bot-token: ${{ secrets.MERGEBOT_TOKEN }}
clickhouse-url: ${{ secrets.CLICKHOUSE_URL }}
clickhouse-username: ${{ secrets.CLICKHOUSE_VIABLESTRICT_USERNAME }}
@ -48,4 +48,7 @@ jobs:
echo "{\"sha\": \"${LATEST_SHA}\", \"repository\":\"pytorch/pytorch\", \"timestamp\": ${TIME}}" > "/tmp/${LATEST_SHA}.json"
pip install awscli==1.29.40
aws s3 cp "/tmp/${LATEST_SHA}.json" "s3://ossci-raw-job-status/stable_pushes/pytorch/pytorch/${LATEST_SHA}.json"
# Push new viable/strict tag
cd pytorch/pytorch
git push origin "${LATEST_SHA}:refs/tags/viable/strict/${TIME}"
fi

View File

@ -42,7 +42,7 @@ jobs:
build-external-packages: "vllm"
build-environment: linux-jammy-cuda12.8-py3.12-gcc11
docker-image-name: ci-image:pytorch-linux-jammy-cuda12.8-cudnn9-py3.12-gcc11-vllm
cuda-arch-list: '8.0;8.9;9.0'
cuda-arch-list: '8.0 8.9 9.0'
runner: linux.24xlarge.memory
test-matrix: |
{ include: [

2
.gitignore vendored
View File

@ -88,7 +88,7 @@ torch_compile_debug/
# Listed manually because some files in this directory are not generated
torch/testing/_internal/generated/annotated_fn_args.py
torch/testing/_internal/data/*.pt
torch/csrc/api/include/torch/version.h
torch/headeronly/version.h
torch/csrc/cudnn/cuDNN.cpp
torch/csrc/generated
torch/csrc/generic/TensorMethods.cpp

View File

@ -18,6 +18,7 @@ exclude_patterns = [
'torch/_inductor/autoheuristic/artifacts/**',
'scripts/**',
'test/generated_type_hints_smoketest.py',
'test/test_torchfuzz_repros.py',
# CPython tests
'test/dynamo/cpython/**',
# Tests from the NumPy test suite
@ -27,6 +28,7 @@ exclude_patterns = [
'torch/lib/**',
'venv/**',
'**/*.pyi',
"tools/experimental/torchfuzz/**",
'tools/test/test_selective_build.py',
]
command = [
@ -196,7 +198,7 @@ exclude_patterns = [
'tools/test/gen_operators_yaml_test.py',
'tools/test/gen_oplist_test.py',
'tools/test/test_selective_build.py',
'tools/experimental/dynamic_shapes/torchfuzz/**',
'tools/experimental/torchfuzz/**',
]
command = [
'python3',
@ -1260,6 +1262,7 @@ exclude_patterns = [
'test/test_masked.py',
'test/test_maskedtensor.py',
'test/test_matmul_cuda.py',
'test/test_scaled_matmul_cuda.py',
'test/test_meta.py',
'test/test_metal.py',
'test/test_mkl_verbose.py',
@ -1570,6 +1573,7 @@ exclude_patterns = [
'torch/_inductor/fx_passes/serialized_patterns/**',
'torch/_inductor/autoheuristic/artifacts/**',
'test/dynamo/cpython/**',
'test/test_torchfuzz_repros.py',
'scripts/**',
'third_party/**',
'fb/**',

View File

@ -13,6 +13,9 @@ load(":build_variables.bzl", "jit_core_sources", "lazy_tensor_ts_sources", "libt
load(":ufunc_defs.bzl", "aten_ufunc_generated_cpu_kernel_sources", "aten_ufunc_generated_cpu_sources", "aten_ufunc_generated_cuda_sources")
load("//:tools/bazel.bzl", "rules")
# Export files for use by torch/headeronly (where version.h generation now lives)
exports_files(["version.txt"])
define_targets(rules = rules)
COMMON_COPTS = [
@ -690,7 +693,9 @@ cc_library(
"torch/csrc/*/generated/*.h",
"torch/csrc/jit/serialization/mobile_bytecode_generated.h",
] + torch_cuda_headers,
) + GENERATED_AUTOGRAD_CPP + [":version_h"],
) + GENERATED_AUTOGRAD_CPP + [
"//torch/headeronly:version_h",
],
includes = [
"third_party/kineto/libkineto/include",
"torch/csrc",

View File

@ -181,15 +181,15 @@ caffe2/utils/hip @jeffdaily @jithunnair-amd
/torch/csrc/jit/python/init.cpp @mikaylagawarecki
# CUDA and CUDA math libraries
aten/src/ATen/cuda/ @eqy @syed-ahmed
aten/src/ATen/cudnn/ @eqy @syed-ahmed
aten/src/ATen/native/cuda/ @eqy @syed-ahmed
aten/src/ATen/native/cudnn/ @eqy @syed-ahmed
c10/cuda @eqy @syed-ahmed
torch/cuda/ @eqy @syed-ahmed
torch/csrc/cuda/ @eqy @syed-ahmed
torch/backends/cuda/ @eqy @syed-ahmed
torch/backends/cudnn/ @eqy @syed-ahmed
aten/src/ATen/cuda/ @eqy @syed-ahmed @Aidyn-A
aten/src/ATen/cudnn/ @eqy @syed-ahmed @Aidyn-A
aten/src/ATen/native/cuda/ @eqy @syed-ahmed @Aidyn-A
aten/src/ATen/native/cudnn/ @eqy @syed-ahmed @Aidyn-A
c10/cuda @eqy @syed-ahmed @Aidyn-A
torch/cuda/ @eqy @syed-ahmed @Aidyn-A
torch/csrc/cuda/ @eqy @syed-ahmed @Aidyn-A
torch/backends/cuda/ @eqy @syed-ahmed @Aidyn-A
torch/backends/cudnn/ @eqy @syed-ahmed @Aidyn-A
# PyTree utilities
/torch/utils/_pytree.py @XuehaiPan

View File

@ -81,7 +81,7 @@ git remote add upstream git@github.com:pytorch/pytorch.git
make setup-env
# Or run `make setup-env-cuda` for pre-built CUDA binaries
# Or run `make setup-env-rocm` for pre-built ROCm binaries
source venv/bin/activate # or `& .\venv\Scripts\Activate.ps1` on Windows
source venv/bin/activate # or `. .\venv\Scripts\activate` on Windows
```
### Tips and Debugging
@ -182,28 +182,36 @@ You can use this script to check out a new nightly branch with the following:
```bash
./tools/nightly.py checkout -b my-nightly-branch
source venv/bin/activate # or `& .\venv\Scripts\Activate.ps1` on Windows
source venv/bin/activate # or `. .\venv\Scripts\activate` on Windows
```
To install the nightly binaries built with CUDA, you can pass in the flag `--cuda`:
```bash
./tools/nightly.py checkout -b my-nightly-branch --cuda
source venv/bin/activate # or `& .\venv\Scripts\Activate.ps1` on Windows
source venv/bin/activate # or `. .\venv\Scripts\activate` on Windows
```
To install the nightly binaries built with ROCm, you can pass in the flag `--rocm`:
```bash
./tools/nightly.py checkout -b my-nightly-branch --rocm
source venv/bin/activate # or `& .\venv\Scripts\Activate.ps1` on Windows
source venv/bin/activate # or `. .\venv\Scripts\activate` on Windows
```
You can also use this tool to pull the nightly commits into the current branch:
```bash
./tools/nightly.py pull -p my-env
source my-env/bin/activate # or `& .\venv\Scripts\Activate.ps1` on Windows
./tools/nightly.py pull
source venv/bin/activate # or `. .\venv\Scripts\activate` on Windows
```
To create the virtual environment with a specific Python interpreter, you can
pass in the `--python` argument:
```bash
./tools/nightly.py --python /path/to/python3.12
source venv/bin/activate # or `. .\venv\Scripts\activate` on Windows
```
Pulling will recreate a fresh virtual environment and reinstall the development

View File

@ -50,11 +50,10 @@ RUN git submodule update --init --recursive
FROM conda as conda-installs
ARG PYTHON_VERSION=3.11
ARG CUDA_PATH=cu121
ARG CUDA_CHANNEL=nvidia
ARG INSTALL_CHANNEL=whl/nightly
# Automatically set by buildx
RUN /opt/conda/bin/conda update -y -n base -c defaults conda
RUN /opt/conda/bin/conda install -y python=${PYTHON_VERSION}
# pinning version of conda here see: https://github.com/pytorch/pytorch/issues/164574
RUN /opt/conda/bin/conda install -y python=${PYTHON_VERSION} conda=25.7.0
ARG TARGETPLATFORM

View File

@ -3,6 +3,7 @@
<!-- toc -->
- [Release Compatibility Matrix](#release-compatibility-matrix)
- [PyTorch CUDA Support Matrix](#pytorch-cuda-support-matrix)
- [Release Cadence](#release-cadence)
- [General Overview](#general-overview)
- [Frequently Asked Questions](#frequently-asked-questions)
@ -63,6 +64,22 @@ Following is the Release Compatibility Matrix for PyTorch releases:
| 1.13 | >=3.7, <=3.10 | C++14 | CUDA 11.6, CUDNN 8.3.2.44 | CUDA 11.7, CUDNN 8.5.0.96 | ROCm 5.2 |
| 1.12 | >=3.7, <=3.10 | C++14 | CUDA 11.3, CUDNN 8.3.2.44 | CUDA 11.6, CUDNN 8.3.2.44 | ROCm 5.0 |
### PyTorch CUDA Support Matrix
For Release 2.9 PyTorch Supports following CUDA Architectures:
| CUDA | architectures supported for Linux x86 and Windows builds | notes |
| --- | --- | --- |
| 12.6.3 | Maxwell(5.0), Pascal(6.0), Volta(7.0), Turing(7.5), Ampere(8.0, 8.6), Hopper(9.0) | |
| 12.8.1 | Volta(7.0), Turing(7.5), Ampere(8.0, 8.6), Hopper(9.0), Blackwell(10.0, 12.0) | |
| 13.0.0 | Turing(7.5), Ampere(8.0, 8.6), Hopper(9.0), Blackwell(10.0, 12.0+PTX) | +PTX available on linux builds only |
| CUDA | architectures supported for Linux aarch64 builds |
| --- | --- |
| 12.6.3 | Ampere(8.0), Hopper(9.0) |
| 12.8.1 | Ampere(8.0), Hopper(9.0), Blackwell(10.0, 12.0) |
| 13.0.0 | Ampere(8.0), Hopper(9.0), Blackwell(10.0, 11.0, 12.0+PTX) |
## Release Cadence
Following is the release cadence. All future dates below are tentative. For latest updates on the release schedule, please follow [dev discuss](https://dev-discuss.pytorch.org/c/release-announcements/27). Please note: Patch Releases are optional.

View File

@ -605,6 +605,11 @@ if(UNIX)
if(HAVE_MALLOC_USABLE_SIZE)
add_definitions(-DHAVE_MALLOC_USABLE_SIZE=1)
endif(HAVE_MALLOC_USABLE_SIZE)
set(CMAKE_EXTRA_INCLUDE_FILES "fcntl.h")
CHECK_FUNCTION_EXISTS(posix_fallocate HAVE_POSIX_FALLOCATE)
if(HAVE_POSIX_FALLOCATE)
add_definitions(-DHAVE_POSIX_FALLOCATE=1)
endif(HAVE_POSIX_FALLOCATE)
endif(UNIX)
ADD_DEFINITIONS(-DUSE_EXTERNAL_MZCRC)

View File

@ -144,8 +144,7 @@ inline std::string _all_equal_numel_error(at::ArrayRef<Tensor> tensors) {
inline bool _apply_preamble(ArrayRef<Tensor> tensors) {
checkDeviceType("CPU_tensor_apply", tensors, kCPU);
checkLayout("CPU_tensor_apply", tensors, kStrided);
if (!_all_equal_numel(tensors))
TORCH_CHECK(false, _all_equal_numel_error(tensors));
TORCH_CHECK(_all_equal_numel(tensors), _all_equal_numel_error(tensors));
// An empty tensor has no elements
for (auto& t : tensors)
if (t.numel() == 0)

View File

@ -40,41 +40,6 @@ namespace {
->conv
->rnn
*/
const std::map<std::string, std::vector<std::string>> _fp32_precisions = {
{"generic", {{"ieee", "tf32", "bf16", "none"}}},
{"mkldnn", {{"ieee", "tf32", "bf16", "none"}}},
{"cuda", {{"ieee", "tf32", "none"}}}};
// Check whether the backend and op are legal
void check_fp32_prec_backend_and_op(
const std::string& backend,
const std::string& op) {
static std::vector<std::string> backends = {"generic", "mkldnn", "cuda"};
static std::vector<std::string> operators = {"conv", "matmul", "rnn", "all"};
TORCH_CHECK(
std::find(backends.begin(), backends.end(), backend) != backends.end(),
"Invalid backend: ",
backend);
TORCH_CHECK(
std::find(operators.begin(), operators.end(), op) != operators.end(),
"Invalid operator: ",
op);
if (backend == "generic") {
TORCH_CHECK(op == "all", "Invalid operation for generic backend: ", op);
}
}
// Return whether the precision is supported by backends
bool validate_fp32_prec(
const std::string& backend,
const std::string& precision) {
auto iterp = _fp32_precisions.find(backend);
TORCH_CHECK(iterp != _fp32_precisions.end());
auto precisions = iterp->second;
bool valid = std::find(precisions.begin(), precisions.end(), precision) !=
precisions.end();
return valid;
}
C10_ALWAYS_INLINE void warn_deprecated_fp32_precision_api(){
TORCH_WARN_ONCE(
@ -86,6 +51,54 @@ void check_fp32_prec_backend_and_op(
}
} // namespace
Float32Backend str2backend(const std::string& name) {
if (name == "generic")
return Float32Backend::GENERIC;
else if (name == "cuda")
return Float32Backend::CUDA;
else if (name == "mkldnn")
return Float32Backend::MKLDNN;
TORCH_CHECK(false, "Unknown backend: ", name);
}
Float32Op str2op(const std::string& name) {
if (name == "all")
return Float32Op::ALL;
else if (name == "conv")
return Float32Op::CONV;
else if (name == "rnn")
return Float32Op::RNN;
else if (name == "matmul")
return Float32Op::MATMUL;
TORCH_CHECK(false, "Unknown op: ", name);
}
Float32Precision str2precision(const std::string& name) {
if (name == "none")
return Float32Precision::NONE;
else if (name == "ieee")
return Float32Precision::IEEE;
else if (name == "tf32")
return Float32Precision::TF32;
else if (name == "bf16")
return Float32Precision::BF16;
TORCH_CHECK(false, "Unknown precision: ", name);
}
std::string precision2str(Float32Precision prec) {
switch (prec) {
case Float32Precision::NONE:
return "none";
case Float32Precision::IEEE:
return "ieee";
case Float32Precision::TF32:
return "tf32";
case Float32Precision::BF16:
return "bf16";
}
TORCH_CHECK(false, "Invalid enum Float32Precision(", static_cast<int>(prec), ")");
}
Context::Context() = default;
// TODO: This could be bad juju if someone calls globalContext() in the
@ -179,10 +192,10 @@ void Context::setUserEnabledNNPACK(bool e) {
enabled_nnpack = e;
}
bool Context::allowTF32CuDNN(const std::string& op) const {
if (op.empty()){
bool allow_tf32_rnn = float32Precision("cuda", "rnn") == "tf32";
bool allow_tf32_conv = float32Precision("cuda", "conv") == "tf32";
bool Context::allowTF32CuDNN(std::optional<Float32Op> op) const {
if (!op.has_value()) {
bool allow_tf32_rnn = float32Precision(Float32Backend::CUDA, Float32Op::RNN) == Float32Precision::TF32;
bool allow_tf32_conv = float32Precision(Float32Backend::CUDA, Float32Op::CONV) == Float32Precision::TF32;
TORCH_CHECK(
allow_tf32_rnn == allow_tf32_conv && allow_tf32_rnn == allow_tf32_cudnn,
"PyTorch is checking whether allow_tf32 is enabled for cuDNN without a specific operator name,",
@ -191,15 +204,15 @@ bool Context::allowTF32CuDNN(const std::string& op) const {
"We suggest only using the new API to set the TF32 flag(s). See also: ",
"https://pytorch.org/docs/main/notes/cuda.html#tensorfloat-32-tf32-on-ampere-and-later-devices");
} else {
return float32Precision("cuda", op) == "tf32";
return float32Precision(Float32Backend::CUDA, op.value()) == Float32Precision::TF32;
}
warn_deprecated_fp32_precision_api();
return allow_tf32_cudnn;
}
void Context::setAllowTF32CuDNN(bool b) {
setFloat32Precision("cuda", "rnn", b ? "tf32" : "none");
setFloat32Precision("cuda", "conv", b ? "tf32" : "none");
setFloat32Precision(Float32Backend::CUDA, Float32Op::RNN, b ? Float32Precision::TF32 : Float32Precision::NONE);
setFloat32Precision(Float32Backend::CUDA, Float32Op::CONV, b ? Float32Precision::TF32 : Float32Precision::NONE);
allow_tf32_cudnn = b;
warn_deprecated_fp32_precision_api();
}
@ -279,42 +292,6 @@ bool Context::userEnabledOverrideableSDP() const {
return enabled_overrideable;
}
static constexpr const auto cublas_config_var_name = "CUBLAS_WORKSPACE_CONFIG";
static constexpr const std::array<const char*, 2> cublas_deterministic_configs = {":4096:8", ":16:8"};
bool Context::checkCuBLASConfigDeterministic() {
// If using CUDA 10.2 or greater, need to make sure CuBLAS workspace config
// is set to deterministic setting
if (hasCUDART()) {
const auto workspace_config = c10::utils::get_env(cublas_config_var_name);
return (workspace_config == cublas_deterministic_configs[0] || workspace_config == cublas_deterministic_configs[1]);
}
return true;
}
void Context::alertCuBLASConfigNotDeterministic() const {
static const bool cublas_config_deterministic = checkCuBLASConfigDeterministic();
if (C10_LIKELY(!deterministicAlgorithms() || cublas_config_deterministic)) {
return;
}
auto msg = c10::str(
"Deterministic behavior was enabled with either `torch.use_deterministic_algorithms(True)` or ",
"`at::Context::setDeterministicAlgorithms(true)`, but this operation is not deterministic because ",
"it uses CuBLAS and you have CUDA >= 10.2. To enable deterministic behavior in this ",
"case, you must set an environment variable before running your PyTorch application: ",
cublas_config_var_name, "=", cublas_deterministic_configs[0], " or ",
cublas_config_var_name, "=", cublas_deterministic_configs[1], ". For more information, go to ",
"https://docs.nvidia.com/cuda/cublas/index.html#results-reproducibility"
);
if (deterministicAlgorithmsWarnOnly()) {
TORCH_WARN(msg);
} else {
TORCH_CHECK(false, msg);
}
}
bool Context::benchmarkCuDNN() const {
return benchmark_cudnn;
}
@ -341,7 +318,7 @@ void Context::setImmediateMiopen(bool b) {
bool Context::allowTF32CuBLAS() const {
bool legacy_allow_tf32 = float32_matmul_precision != at::Float32MatmulPrecision::HIGHEST;
bool allow_tf32_new = float32Precision("cuda", "matmul") == "tf32";
bool allow_tf32_new = float32Precision(Float32Backend::CUDA, Float32Op::MATMUL) == Float32Precision::TF32;
TORCH_CHECK(
legacy_allow_tf32 == allow_tf32_new,
"PyTorch is checking whether allow_tf32_new is enabled for cuBlas matmul,",
@ -354,17 +331,17 @@ bool Context::allowTF32CuBLAS() const {
void Context::setAllowTF32CuBLAS(bool b) {
float32_matmul_precision = b ? at::Float32MatmulPrecision::HIGH : at::Float32MatmulPrecision::HIGHEST;
setFloat32Precision("cuda", "matmul", b ? "tf32" : "ieee");
setFloat32Precision(Float32Backend::CUDA, Float32Op::MATMUL, b ? Float32Precision::TF32 : Float32Precision::IEEE);
}
Float32MatmulPrecision Context::float32MatmulPrecision() const {
bool invalid = float32Precision("cuda", "matmul") == "tf32" &&
bool invalid = float32Precision(Float32Backend::CUDA, Float32Op::MATMUL) == Float32Precision::TF32 &&
float32_matmul_precision == at::Float32MatmulPrecision::HIGHEST;
invalid = invalid ||
(float32Precision("mkldnn", "matmul") == "bf16" &&
(float32Precision(Float32Backend::MKLDNN, Float32Op::MATMUL) == Float32Precision::BF16 &&
float32_matmul_precision != at::Float32MatmulPrecision::MEDIUM);
invalid = invalid ||
(float32Precision("mkldnn", "matmul") == "tf32" &&
(float32Precision(Float32Backend::MKLDNN, Float32Op::MATMUL) == Float32Precision::TF32 &&
float32_matmul_precision != at::Float32MatmulPrecision::HIGH);
TORCH_CHECK(
!invalid,
@ -376,15 +353,26 @@ Float32MatmulPrecision Context::float32MatmulPrecision() const {
return float32_matmul_precision;
}
std::string Context::float32Precision(const std::string& backend, const std::string& op) const {
check_fp32_prec_backend_and_op(backend, op);
auto precision = fp32_precision.find(backend)->second.find(op)->second;
if (precision == "none")
precision = fp32_precision.find(backend)->second.find("all")->second;
if (precision == "none")
precision = fp32_precision.find("generic")->second.find("all")->second;
bool valid_prec = validate_fp32_prec(backend, precision);
return valid_prec ? precision : "none";
Float32Precision Context::float32Precision(Float32Backend backend, Float32Op op) const {
std::pair<Float32Backend, Float32Op> key{backend, op};
auto it = fp32_precision.find(key);
TORCH_CHECK(it != fp32_precision.end(), "Invalid (backend, op) pair: (", backend, ", ", op, ")");
Float32Precision precision = it->second;
if (precision == Float32Precision::NONE) {
key.second = Float32Op::ALL;
precision = fp32_precision.find(key)->second;
}
if (precision == Float32Precision::NONE) {
key.first = Float32Backend::GENERIC;
precision = fp32_precision.find(key)->second;
}
// "cuda" does not support "bf16"
if (backend == Float32Backend::CUDA && precision == Float32Precision::BF16) {
return Float32Precision::NONE;
}
return precision;
}
void Context::setFloat32MatmulPrecision(const std::string &s) {
@ -393,18 +381,18 @@ void Context::setFloat32MatmulPrecision(const std::string &s) {
// TODO: consider if CuDNN field needs to also be set for potential future CuDNN ops like multi-headed attention
if (s_ == "highest") {
float32_matmul_precision = at::Float32MatmulPrecision::HIGHEST;
setFloat32Precision("cuda", "matmul", "ieee");
setFloat32Precision("mkldnn", "matmul", "ieee");
setFloat32Precision(Float32Backend::CUDA, Float32Op::MATMUL, Float32Precision::IEEE);
setFloat32Precision(Float32Backend::MKLDNN, Float32Op::MATMUL, Float32Precision::IEEE);
return true;
} else if (s_ == "high") {
float32_matmul_precision = at::Float32MatmulPrecision::HIGH;
setFloat32Precision("cuda", "matmul", "tf32");
setFloat32Precision("mkldnn", "matmul", "tf32");
setFloat32Precision(Float32Backend::CUDA, Float32Op::MATMUL, Float32Precision::TF32);
setFloat32Precision(Float32Backend::MKLDNN, Float32Op::MATMUL, Float32Precision::TF32);
return true;
} else if (s_ == "medium") {
float32_matmul_precision = at::Float32MatmulPrecision::MEDIUM;
setFloat32Precision("cuda", "matmul", "tf32");
setFloat32Precision("mkldnn", "matmul", "bf16");
setFloat32Precision(Float32Backend::CUDA, Float32Op::MATMUL, Float32Precision::TF32);
setFloat32Precision(Float32Backend::MKLDNN, Float32Op::MATMUL, Float32Precision::BF16);
return true;
}
return false;
@ -418,25 +406,16 @@ void Context::setFloat32MatmulPrecision(const std::string &s) {
"setFloat32MatmulPrecision call has no effect.");
}
void Context::setFloat32Precision(const std::string& backend, const std::string& op, const std::string& p) {
check_fp32_prec_backend_and_op(backend, op);
if (validate_fp32_prec(backend, p)) {
fp32_precision[backend][op] = p;
} else {
std::string msg;
auto iterp = _fp32_precisions.find(backend);
TORCH_CHECK(iterp != _fp32_precisions.end());
for (const auto& p : iterp->second) {
msg += p;
msg += " ";
}
TORCH_WARN(
"you have set wrong precision for backend:",
backend,
" setFloat32Precision call has no effect.",
"Please choose precision from: ",
msg);
}
void Context::setFloat32Precision(Float32Backend backend, Float32Op op, Float32Precision p) {
auto it = fp32_precision.find(std::make_pair(backend, op));
TORCH_CHECK(
it != fp32_precision.end(),
"Invalid (backend, op) pair: (", backend, ", ", op, ")");
TORCH_CHECK(
!(backend == Float32Backend::CUDA && p == Float32Precision::BF16),
"backend 'cuda' does not support precision 'bf16'");
it->second = p;
}
at::LinalgBackend Context::linalgPreferredBackend() const {
@ -504,8 +483,8 @@ at::BlasBackend Context::blasPreferredBackend() {
#if ROCM_VERSION >= 60300
"gfx1100", "gfx1101", "gfx1200", "gfx1201", "gfx908",
#endif
#if ROCM_VERSION >= 60500
"gfx950"
#if ROCM_VERSION >= 70000
"gfx950", "gfx1150", "gfx1151"
#endif
};
for (auto index: c10::irange(detail::getCUDAHooks().deviceCount())) {
@ -608,20 +587,33 @@ void Context::setROCmFAPreferredBackend(at::ROCmFABackend b) {
rocm_fa_preferred_backend = b;
}
bool Context::allowFP16ReductionCuBLAS() const {
CuBLASReductionOption Context::allowFP16ReductionCuBLAS() const {
return allow_fp16_reduction_cublas;
}
void Context::setAllowFP16ReductionCuBLAS(bool b) {
allow_fp16_reduction_cublas = b;
CuBLASReductionOption inline get_reduction_option(bool allow_reduced_precision, bool allow_splitk) {
TORCH_CHECK(
!(allow_reduced_precision && !allow_splitk),
"allow_splitk=False is not supported when reduced precision reductions are enabled");
if (allow_reduced_precision) {
return CuBLASReductionOption::AllowReducedPrecisionWithSplitK;
} else if (allow_splitk) {
return CuBLASReductionOption::DisallowReducedPrecisionAllowSplitK;
} else {
return CuBLASReductionOption::DisallowReducedPrecisionDisallowSplitK;
}
}
bool Context::allowBF16ReductionCuBLAS() const {
void Context::setAllowFP16ReductionCuBLAS(bool allow_reduced_precision, bool allow_splitk) {
allow_fp16_reduction_cublas = get_reduction_option(allow_reduced_precision, allow_splitk);
}
CuBLASReductionOption Context::allowBF16ReductionCuBLAS() const {
return allow_bf16_reduction_cublas;
}
void Context::setAllowBF16ReductionCuBLAS(bool b) {
allow_bf16_reduction_cublas = b;
void Context::setAllowBF16ReductionCuBLAS(bool allow_reduced_precision, bool allow_splitk) {
allow_bf16_reduction_cublas = get_reduction_option(allow_reduced_precision, allow_splitk);
}
bool Context::allowFP16AccumulationCuBLAS() const {

View File

@ -25,11 +25,13 @@
#include <c10/util/CallOnce.h>
#include <c10/util/Exception.h>
#include <c10/util/env.h>
#include <c10/util/hash.h>
#include <c10/util/irange.h>
#include <cstdint>
#include <map>
#include <mutex>
#include <unordered_map>
namespace at {
@ -37,6 +39,20 @@ class Tensor;
enum class TORCH_API Float32MatmulPrecision { HIGHEST, HIGH, MEDIUM };
enum class CuBLASReductionOption : uint8_t {
AllowReducedPrecisionWithSplitK = 0,
DisallowReducedPrecisionAllowSplitK = 1,
DisallowReducedPrecisionDisallowSplitK = 2,
};
enum class TORCH_API Float32Backend { GENERIC, CUDA, MKLDNN };
enum class TORCH_API Float32Op { ALL, CONV, RNN, MATMUL };
enum class TORCH_API Float32Precision { NONE, IEEE, TF32, BF16 };
TORCH_API Float32Backend str2backend(const std::string& name);
TORCH_API Float32Op str2op(const std::string& name);
TORCH_API Float32Precision str2precision(const std::string& name);
TORCH_API std::string precision2str(Float32Precision prec);
class TORCH_API Context {
public:
Context();
@ -210,15 +226,15 @@ class TORCH_API Context {
bool userEnabledMkldnn() const;
void setUserEnabledMkldnn(bool e);
bool benchmarkCuDNN() const;
void setBenchmarkCuDNN(bool);
void setBenchmarkCuDNN(bool /*b*/);
int benchmarkLimitCuDNN() const;
void setBenchmarkLimitCuDNN(int);
void setBenchmarkLimitCuDNN(int /*b*/);
bool immediateMiopen() const;
void setImmediateMiopen(bool);
void setImmediateMiopen(bool /*b*/);
bool deterministicCuDNN() const;
void setDeterministicCuDNN(bool);
void setDeterministicCuDNN(bool /*b*/);
bool deterministicMkldnn() const;
void setDeterministicMkldnn(bool);
void setDeterministicMkldnn(bool /*b*/);
bool userEnabledNNPACK() const;
void setUserEnabledNNPACK(bool e);
@ -236,32 +252,32 @@ class TORCH_API Context {
void setSDPPriorityOrder(const std::vector<int64_t>& order);
std::array<at::SDPBackend, at::num_sdp_backends> sDPPriorityOrder();
void setSDPUseFlash(bool);
void setSDPUseFlash(bool /*e*/);
bool userEnabledFlashSDP() const;
void setSDPUseMemEfficient(bool);
void setSDPUseMemEfficient(bool /*e*/);
bool userEnabledMemEfficientSDP() const;
void setSDPUseMath(bool);
void setSDPUseMath(bool /*e*/);
bool userEnabledMathSDP() const;
void setSDPUseCuDNN(bool);
void setSDPUseCuDNN(bool /*e*/);
bool userEnabledCuDNNSDP() const;
void setAllowFP16BF16ReductionMathSDP(bool);
void setAllowFP16BF16ReductionMathSDP(bool /*e*/);
bool allowFP16BF16ReductionMathSDP() const;
void setSDPUseOverrideable(bool);
void setSDPUseOverrideable(bool /*e*/);
bool userEnabledOverrideableSDP() const;
at::LinalgBackend linalgPreferredBackend() const;
void setLinalgPreferredBackend(at::LinalgBackend);
void setLinalgPreferredBackend(at::LinalgBackend /*b*/);
at::BlasBackend blasPreferredBackend();
void setBlasPreferredBackend(at::BlasBackend);
void setBlasPreferredBackend(at::BlasBackend /*b*/);
at::ROCmFABackend getROCmFAPreferredBackend();
void setROCmFAPreferredBackend(at::ROCmFABackend);
void setROCmFAPreferredBackend(at::ROCmFABackend /*b*/);
// Note [Enabling Deterministic Operations]
// ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@ -294,9 +310,9 @@ class TORCH_API Context {
bool deterministicAlgorithms() const;
bool deterministicAlgorithmsWarnOnly() const;
void setDeterministicAlgorithms(bool, bool);
void setDeterministicAlgorithms(bool /*b*/, bool /*warn_only*/);
bool deterministicFillUninitializedMemory() const;
void setDeterministicFillUninitializedMemory(bool);
void setDeterministicFillUninitializedMemory(bool /*b*/);
// Note [Writing Nondeterministic Operations]
// ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@ -310,13 +326,7 @@ class TORCH_API Context {
//
// * Throw an error when `Context::deterministicAlgorithms()` is true. Most
// of the time, this should be accomplished by calling
// `at::globalContext().alertNotDeterminstic()`. However, if the
// nondeterministic behavior is caused by the CuBLAS workspace
// configuration in CUDA >= 10.2,
// `at::globalContext().alertCuBLASConfigNotDeterministic()` should be
// called instead (in this case, a comment explaining why the operation is
// nondeterministic is not necessary). See below for details on these
// methods.
// `at::globalContext().alertNotDeterminstic().
//
// * Have an entry in the list of nondeterministic PyTorch operations in the
// docstring of `use_deterministic_algorithms()` in torch/__init__.py
@ -340,33 +350,29 @@ class TORCH_API Context {
// Throws an error if `Context::deterministicAlgorithms()` is true
static void alertNotDeterministic(std::string_view const& caller);
// Throws an error if `Context::deterministicAlgorithms()` is true, CUDA
// >= 10.2, and CUBLAS_WORKSPACE_CONFIG is not set to either ":16:8" or
// ":4096:8". For more details:
// https://docs.nvidia.com/cuda/cublas/index.html#results-reproducibility
void alertCuBLASConfigNotDeterministic() const;
void setFloat32MatmulPrecision(const std::string& s);
void setFloat32Precision(
const std::string& backend,
const std::string& op,
const std::string& s);
bool allowTF32CuDNN(const std::string& op = std::string()) const;
void setAllowTF32CuDNN(bool);
Float32Backend backend,
Float32Op op,
Float32Precision p);
bool allowTF32CuDNN(std::optional<Float32Op> op = std::nullopt) const;
void setAllowTF32CuDNN(bool /*b*/);
bool allowTF32OneDNN() const;
void setAllowTF32OneDNN(bool);
void setAllowTF32OneDNN(bool /*b*/);
bool allowTF32CuBLAS() const;
void setAllowTF32CuBLAS(bool);
void setAllowTF32CuBLAS(bool /*b*/);
Float32MatmulPrecision float32MatmulPrecision() const;
std::string float32Precision(
const std::string& backend,
const std::string& op) const;
bool allowFP16ReductionCuBLAS() const;
void setAllowFP16ReductionCuBLAS(bool);
bool allowBF16ReductionCuBLAS() const;
void setAllowBF16ReductionCuBLAS(bool);
Float32Precision float32Precision(Float32Backend backend, Float32Op op) const;
CuBLASReductionOption allowFP16ReductionCuBLAS() const;
void setAllowFP16ReductionCuBLAS(
bool allow_reduced_precision,
bool allow_splitk = true);
CuBLASReductionOption allowBF16ReductionCuBLAS() const;
void setAllowBF16ReductionCuBLAS(
bool allow_reduced_precision,
bool allow_splitk = true);
bool allowFP16AccumulationCuBLAS() const;
void setAllowFP16AccumulationCuBLAS(bool);
void setAllowFP16AccumulationCuBLAS(bool /*b*/);
// Matmuls can use a so-called "persistent" kernel which launches one CUDA
// block for each SM on the GPU, and each block then iterates over multiple
@ -378,7 +384,7 @@ class TORCH_API Context {
// to make matmuls target only a subset of the SMs, so they can fully schedule
// even next to a comms kernel, and only be a few percent slower.
std::optional<int32_t> _SMCarveout_EXPERIMENTAL() const;
void _setSMCarveout_EXPERIMENTAL(std::optional<int32_t>);
void _setSMCarveout_EXPERIMENTAL(std::optional<int32_t> /*c*/);
at::QEngine qEngine() const;
void setQEngine(at::QEngine e);
@ -399,7 +405,7 @@ class TORCH_API Context {
void setDefaultMobileCPUAllocator();
void unsetDefaultMobileCPUAllocator();
bool allowFP16ReductionCPU() const;
void setAllowFP16ReductionCPU(bool);
void setAllowFP16ReductionCPU(bool /*b*/);
// Preserved for BC
void lazyInitCUDA() {
@ -429,7 +435,6 @@ class TORCH_API Context {
}
private:
static bool checkCuBLASConfigDeterministic();
std::array<c10::once_flag, at::COMPILE_TIME_MAX_DEVICE_TYPES> init_;
bool enabled_cudnn = true;
bool deterministic_cudnn = false;
@ -457,8 +462,10 @@ class TORCH_API Context {
: at::Float32MatmulPrecision::HIGHEST;
int benchmark_limit_cudnn = 10;
bool allow_tf32_cudnn = true;
bool allow_fp16_reduction_cublas = true;
bool allow_bf16_reduction_cublas = true;
CuBLASReductionOption allow_fp16_reduction_cublas =
CuBLASReductionOption::AllowReducedPrecisionWithSplitK;
CuBLASReductionOption allow_bf16_reduction_cublas =
CuBLASReductionOption::AllowReducedPrecisionWithSplitK;
bool allow_fp16_accumulation_cublas = false;
std::optional<int32_t> sm_carveout = std::nullopt;
bool enabled_mkldnn = true;
@ -488,21 +495,20 @@ class TORCH_API Context {
bool enable_sparse_tensor_invariant_checks = false;
bool allow_fp16_reduction_cpu = false;
std::map<std::string, std::map<std::string, std::string>> fp32_precision = {
{"generic", {{"all", "none"}}},
{"mkldnn",
{{"matmul", "none"},
{"conv", "none"},
{"rnn", "none"},
{"all", "none"}}},
{"cuda",
{{"matmul",
float32_matmul_precision == at::Float32MatmulPrecision::HIGHEST
? "none"
: "tf32"},
{"conv", "tf32"},
{"rnn", "tf32"},
{"all", "none"}}},
using Key = std::pair<Float32Backend, Float32Op>;
std::unordered_map<Key, Float32Precision, c10::hash<Key>> fp32_precision = {
{{Float32Backend::GENERIC, Float32Op::ALL}, Float32Precision::NONE},
{{Float32Backend::MKLDNN, Float32Op::ALL}, Float32Precision::NONE},
{{Float32Backend::MKLDNN, Float32Op::CONV}, Float32Precision::NONE},
{{Float32Backend::MKLDNN, Float32Op::RNN}, Float32Precision::NONE},
{{Float32Backend::MKLDNN, Float32Op::MATMUL}, Float32Precision::NONE},
{{Float32Backend::CUDA, Float32Op::ALL}, Float32Precision::NONE},
{{Float32Backend::CUDA, Float32Op::CONV}, Float32Precision::TF32},
{{Float32Backend::CUDA, Float32Op::RNN}, Float32Precision::TF32},
{{Float32Backend::CUDA, Float32Op::MATMUL},
float32_matmul_precision == at::Float32MatmulPrecision::HIGHEST
? Float32Precision::NONE
: Float32Precision::TF32},
};
Allocator* prev_allocator_ptr_{nullptr};
@ -684,5 +690,4 @@ struct TORCH_API ROCmBackwardPassGuard {
~ROCmBackwardPassGuard();
static bool is_backward_pass();
};
} // namespace at

View File

@ -62,7 +62,7 @@ constexpr const char* unknown_eventname = "eventname not specified";
#endif
} // namespace (anonymous)
MapAllocator::MapAllocator(WithFd, std::string_view filename, int fd, int flags, size_t size)
MapAllocator::MapAllocator(WithFd /*unused*/, std::string_view filename, int fd, int flags, size_t size)
: filename_(filename.empty() ? unknown_filename : filename)
, size_(0) // to be filled later
#ifdef _WIN32
@ -292,6 +292,28 @@ MapAllocator::MapAllocator(WithFd, std::string_view filename, int fd, int flags,
if (ftruncate(fd, static_cast<off_t>(size)) == -1) {
TORCH_CHECK(false, "unable to resize file <", filename_, "> to the right size: ", c10::utils::str_error(errno), " (", errno, ")");
}
#ifdef HAVE_POSIX_FALLOCATE
if (flags_ & ALLOCATOR_MAPPED_SHAREDMEM) {
for (;;) {
if (posix_fallocate(fd, 0, static_cast<off_t>(size)) == 0) {
break;
}
if (errno == EINTR) {
continue;
}
if (errno == EINVAL || errno == EOPNOTSUPP) {
// the underlying filesystem does not support the operation
break;
}
TORCH_CHECK(false, "unable to allocate shared memory(shm) for file <", filename_, ">: ", c10::utils::str_error(errno), " (", errno, ")");
}
}
#endif
if (fstat(fd, &file_stat) == -1 || file_stat.st_size < static_cast<int64_t>(size)) {
#ifndef STRIP_ERROR_MESSAGES
int last_err = errno;
@ -472,7 +494,7 @@ RefcountedMapAllocator::RefcountedMapAllocator(const char *filename, int flags,
initializeAlloc();
}
RefcountedMapAllocator::RefcountedMapAllocator(WithFd, const char *filename, int fd, int flags, size_t size)
RefcountedMapAllocator::RefcountedMapAllocator(WithFd /*unused*/, const char *filename, int fd, int flags, size_t size)
: RefcountedMapAllocatorArgCheck(flags)
, MapAllocator(WITH_FD, filename, flags, fd, size + map_alloc_alignment) {
@ -592,7 +614,7 @@ at::DataPtr MapAllocator::makeDataPtr(std::string_view filename, int flags, size
return {context->data(), context, &deleteMapAllocator, at::DeviceType::CPU};
}
at::DataPtr MapAllocator::makeDataPtr(WithFd, const char *filename, int fd, int flags, size_t size, size_t* actual_size_out) {
at::DataPtr MapAllocator::makeDataPtr(WithFd /*unused*/, const char *filename, int fd, int flags, size_t size, size_t* actual_size_out) {
auto* context = new MapAllocator(WITH_FD, filename, fd, flags, size);
if (actual_size_out) *actual_size_out = context->size();
return {context->data(), context, &deleteMapAllocator, at::DeviceType::CPU};
@ -604,7 +626,7 @@ at::DataPtr RefcountedMapAllocator::makeDataPtr(const char *filename, int flags,
return {context->data(), context, &deleteRefcountedMapAllocator, at::DeviceType::CPU};
}
at::DataPtr RefcountedMapAllocator::makeDataPtr(WithFd, const char *filename, int fd, int flags, size_t size, size_t* actual_size_out) {
at::DataPtr RefcountedMapAllocator::makeDataPtr(WithFd /*unused*/, const char *filename, int fd, int flags, size_t size, size_t* actual_size_out) {
auto* context = new RefcountedMapAllocator(WITH_FD, filename, fd, flags, size);
if (actual_size_out) *actual_size_out = context->size() - map_alloc_alignment;
return {context->data(), context, &deleteRefcountedMapAllocator, at::DeviceType::CPU};

View File

@ -25,7 +25,7 @@ class TORCH_API MapAllocator {
public:
MapAllocator(std::string_view filename, int flags, size_t size);
MapAllocator(
WithFd,
WithFd /*unused*/,
std::string_view filename,
int fd,
int flags,
@ -59,14 +59,14 @@ class TORCH_API MapAllocator {
return flags_;
}
static MapAllocator* fromDataPtr(const at::DataPtr&);
static MapAllocator* fromDataPtr(const at::DataPtr& /*dptr*/);
static at::DataPtr makeDataPtr(
std::string_view filename,
int flags,
size_t size,
size_t* actual_size_out);
static at::DataPtr makeDataPtr(
WithFd,
WithFd /*unused*/,
const char* filename,
int fd,
int flags,
@ -105,13 +105,13 @@ class TORCH_API RefcountedMapAllocator : private RefcountedMapAllocatorArgCheck,
public:
RefcountedMapAllocator(const char* filename, int flags, size_t size);
RefcountedMapAllocator(
WithFd,
WithFd /*unused*/,
const char* filename,
int fd,
int flags,
size_t size);
static RefcountedMapAllocator* fromDataPtr(const at::DataPtr&);
static RefcountedMapAllocator* fromDataPtr(const at::DataPtr& /*dptr*/);
RefcountedMapAllocator(const RefcountedMapAllocator&) = delete;
RefcountedMapAllocator(RefcountedMapAllocator&&) = delete;
RefcountedMapAllocator& operator=(const RefcountedMapAllocator&) = delete;
@ -122,7 +122,7 @@ class TORCH_API RefcountedMapAllocator : private RefcountedMapAllocatorArgCheck,
size_t size,
size_t* actual_size_out);
static at::DataPtr makeDataPtr(
WithFd,
WithFd /*unused*/,
const char* filename,
int fd,
int flags,

View File

@ -179,7 +179,7 @@ void propagate_names_except(const Tensor& result, const Tensor& src, IntArrayRef
return;
}
const auto src_names = src.names();
const auto result_dim = static_cast<int64_t>(result.dim());
const auto result_dim = result.dim();
const auto src_dim = static_cast<int64_t>(src_names.size());
const auto excluded_dim = static_cast<int64_t>(excluded_idxs.size());
TORCH_INTERNAL_ASSERT(src_dim - excluded_dim == result_dim);

View File

@ -273,7 +273,7 @@ c10::SymInt NestedTensorImpl::sym_numel_custom() const {
return NestedTensorImpl::numel_custom();
}
c10::SymBool NestedTensorImpl::sym_is_contiguous_custom(MemoryFormat) const {
c10::SymBool NestedTensorImpl::sym_is_contiguous_custom(MemoryFormat /*memory_format*/) const {
return nested_tensor_impl_is_contiguous(this);
}
IntArrayRef NestedTensorImpl::sizes_custom() const {

View File

@ -115,7 +115,8 @@ struct TORCH_API NestedTensorImpl : public c10::TensorImpl {
// with real implementations
int64_t numel_custom() const override;
c10::SymInt sym_numel_custom() const override;
c10::SymBool sym_is_contiguous_custom(MemoryFormat) const override;
c10::SymBool sym_is_contiguous_custom(
MemoryFormat /*memory_format*/) const override;
int64_t size_custom(int64_t d) const override {
return this->size(d);
}

View File

@ -14,7 +14,7 @@ inline int64_t divup(int64_t x, int64_t y) {
TORCH_API void init_num_threads();
// Sets the number of threads to be used in parallel region
TORCH_API void set_num_threads(int);
TORCH_API void set_num_threads(int /*nthreads*/);
// Returns the maximum number of threads that may be used in a parallel region
TORCH_API int get_num_threads();
@ -37,7 +37,7 @@ inline void lazy_init_num_threads() {
}
}
TORCH_API void set_thread_num(int);
TORCH_API void set_thread_num(int /*id*/);
class TORCH_API ThreadIdGuard {
public:
@ -130,7 +130,7 @@ inline scalar_t parallel_reduce(
TORCH_API std::string get_parallel_info();
// Sets number of threads used for inter-op parallelism
TORCH_API void set_num_interop_threads(int);
TORCH_API void set_num_interop_threads(int /*nthreads*/);
// Returns the number of threads used for inter-op parallelism
TORCH_API size_t get_num_interop_threads();

View File

@ -252,7 +252,7 @@ void SparseCsrTensorImpl::set_stride(int64_t dim, int64_t new_stride) {
void SparseCsrTensorImpl::set_storage_offset(int64_t storage_offset) {
TORCH_CHECK(false, "Sparse ", at::sparse_csr::layoutToString(layout_, /*upper=*/true), " tensors do not have set_storage_offset.");
}
c10::SymBool SparseCsrTensorImpl::sym_is_contiguous_custom(MemoryFormat) const {
c10::SymBool SparseCsrTensorImpl::sym_is_contiguous_custom(MemoryFormat /*memory_format*/) const {
TORCH_CHECK(false, "Sparse ", at::sparse_csr::layoutToString(layout_, /*upper=*/true), " tensors do not have is_contiguous");
}
} // namespace at

View File

@ -32,10 +32,10 @@ struct TORCH_API SparseCsrTensorImpl : public TensorImpl {
public:
explicit SparseCsrTensorImpl(
at::DispatchKeySet,
at::DispatchKeySet /*key_set*/,
at::Device device,
Layout layout,
const caffe2::TypeMeta);
const caffe2::TypeMeta /*data_type*/);
void resize_(int64_t nnz, IntArrayRef size);
void resize_and_clear_(
@ -86,7 +86,8 @@ struct TORCH_API SparseCsrTensorImpl : public TensorImpl {
protected:
IntArrayRef strides_custom() const override;
SymIntArrayRef sym_strides_custom() const override;
SymBool sym_is_contiguous_custom(MemoryFormat) const override;
SymBool sym_is_contiguous_custom(
MemoryFormat /*memory_format*/) const override;
public:
void set_size(int64_t dim, int64_t new_size) override;

View File

@ -46,7 +46,9 @@ struct TORCH_API SparseTensorImpl : public TensorImpl {
public:
// Public for now...
explicit SparseTensorImpl(at::DispatchKeySet, const caffe2::TypeMeta);
explicit SparseTensorImpl(
at::DispatchKeySet /*key_set*/,
const caffe2::TypeMeta /*data_type*/);
void release_resources() override;
@ -229,14 +231,14 @@ struct TORCH_API SparseTensorImpl : public TensorImpl {
}
void resize_(int64_t sparse_dim, int64_t dense_dim, ArrayRef<int64_t> size) {
return _resize_(sparse_dim, dense_dim, size);
_resize_(sparse_dim, dense_dim, size);
}
void resize_(
int64_t sparse_dim,
int64_t dense_dim,
ArrayRef<c10::SymInt> size) {
return _resize_(sparse_dim, dense_dim, size);
_resize_(sparse_dim, dense_dim, size);
}
// NOTE: this function will resize the sparse tensor and also set `indices`
@ -384,8 +386,8 @@ struct TORCH_API SparseTensorImpl : public TensorImpl {
private:
explicit SparseTensorImpl(
at::DispatchKeySet,
const caffe2::TypeMeta,
at::DispatchKeySet /*key_set*/,
const caffe2::TypeMeta /*data_type*/,
at::Tensor indices,
at::Tensor values);

View File

@ -59,7 +59,7 @@ static inline void set_item(const Tensor& self, ArrayRef<TensorIndex> indices, c
}
}
return set_item(self, indices, value);
set_item(self, indices, value);
}
} // namespace indexing

View File

@ -112,10 +112,10 @@ TORCH_API std::ostream& operator<<(std::ostream& stream, const Slice& slice);
// `torch.tensor([1, 2])`) | `torch::tensor({1, 2})`
struct TORCH_API TensorIndex final {
// Case 1: `at::indexing::None`
TensorIndex(std::nullopt_t) : type_(TensorIndexType::None) {}
TensorIndex(std::nullopt_t /*unused*/) : type_(TensorIndexType::None) {}
// Case 2: "..." / `at::indexing::Ellipsis`
TensorIndex(at::indexing::EllipsisIndexType)
TensorIndex(at::indexing::EllipsisIndexType /*unused*/)
: type_(TensorIndexType::Ellipsis) {}
TensorIndex(const char* str) : TensorIndex(at::indexing::Ellipsis) {
TORCH_CHECK_VALUE(
@ -214,7 +214,7 @@ inline Tensor applySlice(
"step must be greater than zero");
// See NOTE [nested tensor size for indexing]
if (self_sizes.has_value() && self_sizes.value().size() > 0) {
if (self_sizes.has_value() && !self_sizes.value().empty()) {
// Skip this optimization if we are tracing, as the trace may be polymorphic
// over the shape of the `self` tensor, and we still want to record
// the slice.

View File

@ -765,7 +765,8 @@ void TensorIteratorBase::for_each(loop2d_t loop, int64_t grain_size) {
if (numel == 0) {
return;
} else if (numel < grain_size || at::get_num_threads() == 1) {
return serial_for_each(loop, {0, numel});
serial_for_each(loop, {0, numel});
return;
} else {
at::parallel_for(0, numel, grain_size, [&](int64_t begin, int64_t end) {
serial_for_each(loop, {begin, end});

View File

@ -250,7 +250,7 @@ struct TORCH_API TensorIteratorBase : public impl::MetaBase {
using PtrVector = SmallVector<char*, 4>;
using StrideVector = SmallVector<int64_t, 6>;
void build(TensorIteratorConfig&);
void build(TensorIteratorConfig& /*config*/);
// The inner-loop function operates on the fastest moving dimension. It
// implements element-wise operations in terms of 1-d strided tensors.
@ -618,20 +618,20 @@ struct TORCH_API TensorIteratorBase : public impl::MetaBase {
#undef TORCH_DISALLOW_TEMPORARIES
protected:
// Mutable reference as it moves tensors out of TensorIteratorConfig
void populate_operands(TensorIteratorConfig&);
void populate_operands(TensorIteratorConfig& /*config*/);
void mark_outputs();
void mark_resize_outputs(const TensorIteratorConfig&);
void compute_mem_overlaps(const TensorIteratorConfig&);
void compute_shape(const TensorIteratorConfig&);
void compute_strides(const TensorIteratorConfig&);
void mark_resize_outputs(const TensorIteratorConfig& /*config*/);
void compute_mem_overlaps(const TensorIteratorConfig& /*config*/);
void compute_shape(const TensorIteratorConfig& /*config*/);
void compute_strides(const TensorIteratorConfig& /*config*/);
void reorder_dimensions();
void permute_dimensions(IntArrayRef perm);
void compute_types(const TensorIteratorConfig&);
void compute_types(const TensorIteratorConfig& /*config*/);
ScalarType compute_common_dtype();
void allocate_or_resize_outputs();
bool fast_set_up(const TensorIteratorConfig&);
FastSetupType compute_fast_setup_type(const TensorIteratorConfig&);
void compute_names(const TensorIteratorConfig&);
bool fast_set_up(const TensorIteratorConfig& /*config*/);
FastSetupType compute_fast_setup_type(const TensorIteratorConfig& /*config*/);
void compute_names(const TensorIteratorConfig& /*config*/);
void propagate_names_to_outputs();
void coalesce_dimensions();

View File

@ -273,11 +273,11 @@ void checkLayout(CheckedFrom c, at::ArrayRef<Tensor> tensors, at::Layout layout)
}
void * maybe_data_ptr(const Tensor& tensor) {
return tensor.defined() ? (void *)tensor.data_ptr() : nullptr;
return tensor.defined() ? tensor.data_ptr() : nullptr;
}
void * maybe_data_ptr(const TensorArg& tensor) {
return tensor->defined() ? (void *)tensor->data_ptr() : nullptr;
return tensor->defined() ? tensor->data_ptr() : nullptr;
}
void check_dim_size(

View File

@ -20,7 +20,7 @@
namespace at {
TORCH_API int _crash_if_asan(int);
TORCH_API int _crash_if_asan(int /*arg*/);
// Converts a TensorList (i.e. ArrayRef<Tensor> to vector of TensorImpl*)
// NB: This is ONLY used by legacy TH bindings, and ONLY used by cat.

View File

@ -148,7 +148,7 @@ Tensor cached_cast(at::ScalarType to_type, const Tensor& arg, DeviceType device_
Banned functions
*******************************/
static Tensor binary_cross_entropy_banned(const Tensor &, const Tensor &, const std::optional<Tensor>&, int64_t) {
static Tensor binary_cross_entropy_banned(const Tensor & /*unused*/, const Tensor & /*unused*/, const std::optional<Tensor>& /*unused*/, int64_t /*unused*/) {
TORCH_CHECK(false, "torch.nn.functional.binary_cross_entropy and torch.nn.BCELoss are unsafe to autocast.\n"
"Many models use a sigmoid layer right before the binary cross entropy layer.\n"
"In this case, combine the two layers using torch.nn.functional.binary_cross_entropy_with_logits\n"

View File

@ -6,6 +6,7 @@
#include <c10/core/thread_pool.h>
#include <c10/util/flat_hash_map.h>
#include <c10/util/llvmMathExtras.h>
#include <iostream>
#include <optional>
#include <deque>
@ -49,19 +50,57 @@ namespace {
constexpr size_t MAX_SIZE_INDEX = 64;
}
// A large reserved pinned memory segment that is created in advance which is used
// to allocate small pinned memory requests to avoid calling into expensive APIs.
// We never free this memory and move up the pointer as we allocate new blocks
// and when blocks are freed, they are cached in the free lists.
struct PinnedReserveSegment {
PinnedReserveSegment(void *start, size_t size) : start_(start), size_(size),
current_ptr_(start_), initialized_(true) {}
PinnedReserveSegment() : start_(nullptr), size_(0), current_ptr_(nullptr), initialized_(false) {}
bool initialized() {
return initialized_;
}
void* allocate(size_t bytes) {
std::lock_guard<std::mutex> guard(mutex_);
// Round up the requested size to 4KB boundary for all including the small ones.
size_t rounded_bytes = (bytes + 4096 - 1) & ~(4096 - 1);
if (((uint8_t*)current_ptr_ + rounded_bytes) > ((uint8_t*)start_ + size_)) {
return nullptr;
}
void* ptr = current_ptr_;
current_ptr_ = (uint8_t*)current_ptr_ + rounded_bytes;
return ptr;
}
bool owns(void* ptr) {
return ptr >= start_ && ptr < (uint8_t*)start_ + size_;
}
std::mutex mutex_;
void* start_;
size_t size_;
void* current_ptr_;
bool initialized_;
};
// Struct containing memory allocator summary statistics for host.
struct TORCH_API HostStats {
// COUNT: allocations requested by client code. Note that active
// count can be extracted by looking at current allocations
Stat allocation;
// COUNT: number of allocated segments from host memory allocation.
Stat segment;
// SUM: bytes allocated by this memory alocator. Note that active bytes
// can be extracted by looking at current bytes allocated
// COUNT: total allocations (active)
Stat active_requests;
// SUM: bytes allocated/reserved by this memory alocator. (active)
Stat active_bytes;
// COUNT: total allocations (active + free)
Stat allocations;
// SUM: bytes allocated/reserved by this memory alocator. This accounts
// for both free and in-use blocks.
Stat allocated_bytes;
// SUM: bytes reserved by this memory allocator (both free and used)
Stat reserved_bytes;
// SUM: time spent in cudaHostAlloc/cudaHostRegister in microseconds
DurationStat host_alloc_time;
@ -75,6 +114,9 @@ struct TORCH_API HostStats {
// COUNT: number of times cudaHostFree/cudaHostUnregister was called.
int64_t num_host_free = 0; // This is derived from segment or timing
// Count of cudaHostAlloc/cudaHostRegister per bucket
std::vector<int64_t> bucket_allocation = std::vector<int64_t>(MAX_SIZE_INDEX);
};
// Struct containing memory allocator summary statistics for host, as they
@ -82,17 +124,22 @@ struct TORCH_API HostStats {
// avoid locking the allocator while collecting stats.
struct alignas(64) HostStatsStaged {
std::mutex timing_mutex_;
// COUNT: allocations requested by client code resulting in a new segment/block allocation
// LOCK: access to this stat is protected by the allocator's blocks_mutex_
Stat allocation;
// SUM: bytes within active memory blocks, including blocks that are
// currently in the free list.
// COUNT: total allocations (active + free)
// LOCK: access to this stat is protected by the allocator's blocks_mutex_
Stat allocations;
// SUM: bytes allocated/reserved by this memory alocator. This accounts
// for both free and in-use blocks.
Stat allocated_bytes;
// COUNT: number of allocations per bucket
// COUNT: number of allocations per bucket (active)
// LOCK: access to this stat is protected by the per bucket free_list_[index].mutex_
std::vector<Stat> active_bucket_stats = std::vector<Stat>(MAX_SIZE_INDEX);
// SUM: bytes of allocation per bucket (active)
// LOCK: access to this stat is protected by the per bucket free_list_[index].mutex_
std::vector<Stat> active_bytes_bucket_stats = std::vector<Stat>(MAX_SIZE_INDEX);
// COUNT: number of allocations per bucket (active + free)
// LOCK: access to this stat is protected by the per bucket free_list_[index].mutex_
std::vector<Stat> allocation_bucket_stats = std::vector<Stat>(MAX_SIZE_INDEX);
// SUM: bytes of allocation per bucket
// SUM: bytes of allocation per bucket (active + free)
// LOCK: access to this stat is protected by the per bucket free_list_[index].mutex_
std::vector<Stat> allocated_bytes_bucket_stats = std::vector<Stat>(MAX_SIZE_INDEX);
// SUM: time spent in cudaHostAlloc/cudaHostRegister
@ -211,12 +258,6 @@ struct CachingHostAllocatorImpl {
// Check in the recently freed blocks with pending events to see if we
// can reuse them. Call get_free_block again after processing events
if (pinned_use_background_threads()) {
process_events_for_specific_size(roundSize);
block = get_free_block(roundSize);
if (block) {
return {block->ptr_, reinterpret_cast<void*>(block)};
}
// Launch the background thread and process events in a loop.
static bool background_thread_flag [[maybe_unused]] = [this] {
getBackgroundThreadPool()->run([&]() {
@ -278,8 +319,6 @@ struct CachingHostAllocatorImpl {
auto index = size_index(block->size_);
std::lock_guard<std::mutex> g(free_list_[index].mutex_);
free_list_[index].list_.push_back(block);
stats_.allocation_bucket_stats[index].decrease(1);
stats_.allocated_bytes_bucket_stats[index].decrease(block->size_);
} else {
// restore these events that record by used streams.
std::lock_guard<std::mutex> g(events_mutex_);
@ -339,9 +378,12 @@ struct CachingHostAllocatorImpl {
for (auto* block : blocks_to_remove) {
blocks_.erase(block);
ptr_to_block_.erase(block->ptr_);
stats_.allocation.decrease(1);
stats_.allocated_bytes.decrease(block->size_);
auto index = size_index(block->size_);
free_block(block);
stats_.allocations.decrease(1);
stats_.allocated_bytes.decrease(block->size_);
stats_.allocation_bucket_stats[index].decrease(1);
stats_.allocated_bytes_bucket_stats[index].decrease(block->size_);
delete block;
}
}
@ -388,16 +430,17 @@ struct CachingHostAllocatorImpl {
// per bucket (we pick index 0 arbitrarily). These are also all the host
// allocations, not taking into account caching and free lists.
if (i == 0) {
stats.segment = stats_.allocation;
stats.reserved_bytes = stats_.allocated_bytes;
stats.num_host_alloc = stats.segment.allocated;
stats.num_host_free = stats.segment.freed;
stats.allocations = stats_.allocations;
stats.allocated_bytes = stats_.allocated_bytes;
stats.num_host_alloc = stats.allocations.allocated;
stats.num_host_free = stats.allocations.freed;
}
// Bucket stats need to be merged with the slow-path stats. We do this in
// a best effort manner, since we can't really replay the cached events per bucket.
add_bucket_stats(stats.allocation, stats_.allocation_bucket_stats[i]);
add_bucket_stats(stats.allocated_bytes, stats_.allocated_bytes_bucket_stats[i]);
add_bucket_stats(stats.active_requests, stats_.active_bucket_stats[i]);
add_bucket_stats(stats.active_bytes, stats_.active_bytes_bucket_stats[i]);
stats.bucket_allocation[i] = stats_.allocation_bucket_stats[i].allocated;
}
// Get the timing stats
@ -421,9 +464,11 @@ struct CachingHostAllocatorImpl {
std::lock_guard<std::mutex> gb(blocks_mutex_, std::adopt_lock);
if (i == 0) {
stats_.allocation.reset_accumulated();
stats_.allocations.reset_accumulated();
stats_.allocated_bytes.reset_accumulated();
}
stats_.active_bucket_stats[i].reset_accumulated();
stats_.active_bytes_bucket_stats[i].reset_accumulated();
stats_.allocation_bucket_stats[i].reset_accumulated();
stats_.allocated_bytes_bucket_stats[i].reset_accumulated();
}
@ -446,9 +491,11 @@ struct CachingHostAllocatorImpl {
std::lock_guard<std::mutex> gb(blocks_mutex_, std::adopt_lock);
if (i == 0) {
stats_.allocation.reset_peak();
stats_.allocations.reset_peak();
stats_.allocated_bytes.reset_peak();
}
stats_.active_bucket_stats[i].reset_peak();
stats_.active_bytes_bucket_stats[i].reset_peak();
stats_.allocation_bucket_stats[i].reset_peak();
stats_.allocated_bytes_bucket_stats[i].reset_peak();
}
@ -465,7 +512,7 @@ struct CachingHostAllocatorImpl {
virtual void add_allocated_block(B* block) {
std::lock_guard<std::mutex> g(blocks_mutex_);
blocks_.insert(block);
stats_.allocation.increase(1);
stats_.allocations.increase(1);
stats_.allocated_bytes.increase(block->size_);
ptr_to_block_.insert({block->ptr_, block});
@ -478,6 +525,8 @@ struct CachingHostAllocatorImpl {
std::lock_guard<std::mutex> g(free_list_[index].mutex_);
stats_.allocation_bucket_stats[index].increase(1);
stats_.allocated_bytes_bucket_stats[index].increase(size);
stats_.active_bucket_stats[index].increase(1);
stats_.active_bytes_bucket_stats[index].increase(size);
}
}
@ -488,8 +537,8 @@ struct CachingHostAllocatorImpl {
B* block = free_list_[index].list_.back();
free_list_[index].list_.pop_back();
block->allocated_ = true;
stats_.allocation_bucket_stats[index].increase(1);
stats_.allocated_bytes_bucket_stats[index].increase(size);
stats_.active_bucket_stats[index].increase(1);
stats_.active_bytes_bucket_stats[index].increase(size);
return block;
}
return nullptr;
@ -583,8 +632,8 @@ struct CachingHostAllocatorImpl {
auto index = size_index(block->size_);
std::lock_guard<std::mutex> g(free_list_[index].mutex_);
free_list_[index].list_.push_back(block);
stats_.allocation_bucket_stats[index].decrease(1);
stats_.allocated_bytes_bucket_stats[index].decrease(size);
stats_.active_bucket_stats[index].decrease(1);
stats_.active_bytes_bucket_stats[index].decrease(size);
if (size != -1) {
return;
}

View File

@ -49,7 +49,7 @@ static void check_unique_names(DimnameList names) {
}
void check_names_valid_for(const TensorBase& tensor, DimnameList names) {
return impl::check_names_valid_for(tensor.unsafeGetTensorImpl(), names);
impl::check_names_valid_for(tensor.unsafeGetTensorImpl(), names);
}
void check_names_valid_for(size_t tensor_dim, DimnameList names) {

View File

@ -27,11 +27,11 @@ struct TORCH_API NamedTensorMeta final : public c10::NamedTensorMetaInterface {
HasNonWildcard
};
explicit NamedTensorMeta(HAS_NON_WILDCARD, DimnameList names)
explicit NamedTensorMeta(HAS_NON_WILDCARD /*unused*/, DimnameList names)
: names_(names.vec()) {
check_invariants();
}
explicit NamedTensorMeta(HAS_NON_WILDCARD, std::vector<Dimname>&& names)
explicit NamedTensorMeta(HAS_NON_WILDCARD /*unused*/, std::vector<Dimname>&& names)
: names_(std::move(names)) {
check_invariants();
}
@ -52,13 +52,13 @@ struct TORCH_API NamedTensorMeta final : public c10::NamedTensorMetaInterface {
std::any_of(names_.begin(), names_.end(), [](const Dimname& n) { return !n.isWildcard(); }));
}
void set_names(HAS_NON_WILDCARD, DimnameList new_names) {
void set_names(HAS_NON_WILDCARD /*unused*/, DimnameList new_names) {
TORCH_INTERNAL_ASSERT(new_names.size() == names_.size());
std::copy(new_names.begin(), new_names.end(), names_.begin());
check_invariants();
}
void set_names(HAS_NON_WILDCARD, std::vector<Dimname>&& new_names) {
void set_names(HAS_NON_WILDCARD /*unused*/, std::vector<Dimname>&& new_names) {
TORCH_INTERNAL_ASSERT(new_names.size() == names_.size());
names_ = std::move(new_names);
check_invariants();

View File

@ -2,6 +2,7 @@
#include <c10/core/impl/PythonDispatcherTLS.h>
#include <ATen/core/PythonFallbackKernel.h>
#include <c10/core/SafePyObject.h>
#include <ATen/record_function.h>
namespace {
@ -53,20 +54,24 @@ void pythonFallback(const c10::OperatorHandle& op, c10::DispatchKeySet dispatch_
TORCH_INTERNAL_ASSERT(tls_on_entry.has_value());
// c10::impl::ForceDispatchKeyGuard dispatcher_guard(tls_on_entry.value());
// StashTLSOnEntryGuard stash_guard;
c10::impl::ExcludeDispatchKeyGuard guard(after_Python_keyset);
c10::impl::ExcludeDispatchKeyGuard exclude_guard(after_Python_keyset);
const auto& schema = op.schema();
const auto num_arguments = schema.arguments().size();
// If Torch Dispatch Mode is active, use its PyInterpreter for dispatch
const auto mode_stack_len = c10::impl::TorchDispatchModeTLS::stack_len();
if (mode_stack_len > 0) {
RECORD_FUNCTION("PythonDispatchMode", torch::jit::last(*stack, num_arguments));
const auto& cur_torch_dispatch_mode_state = c10::impl::TorchDispatchModeTLS::get_stack_at(mode_stack_len - 1);
cur_torch_dispatch_mode_state->pyinterpreter()->dispatch(op, stack);
return;
}
RECORD_FUNCTION("PythonSubclass", torch::jit::last(*stack, num_arguments));
// Otherwise, find a PyInterpreter on a Tensor
const auto& schema = op.schema();
const auto num_arguments = schema.arguments().size();
// It is safe to dispatch on the very first Tensor with a pyobj_interpreter
// without checking the interpreters of any of the arguments, because when
// we actually run dispatch(), we will take out PyObjects in the context

View File

@ -13,7 +13,7 @@ class TORCH_API PythonOpRegistrationTrampoline final {
public:
// Returns true if you successfully registered yourself (that means
// you are in the hot seat for doing the operator registrations!)
static bool registerInterpreter(c10::impl::PyInterpreter*);
static bool registerInterpreter(c10::impl::PyInterpreter* /*interp*/);
// Returns nullptr if no interpreter has been registered yet.
static c10::impl::PyInterpreter* getInterpreter();

View File

@ -138,7 +138,7 @@ void Tensor::_backward(TensorList inputs,
const std::optional<Tensor>& gradient,
std::optional<bool> keep_graph,
bool create_graph) const {
return impl::GetVariableHooks()->_backward(*this, inputs, gradient, keep_graph, create_graph);
impl::GetVariableHooks()->_backward(*this, inputs, gradient, keep_graph, create_graph);
}
const TensorBase& TensorBase::requires_grad_(bool _requires_grad) const {
@ -173,4 +173,12 @@ unsigned TensorBase::_register_hook(std::function<TensorBase(const TensorBase&)>
return impl::GetVariableHooks()->_register_hook(*this, std::move(hook));
}
std::optional<ScalarType> TensorBase::grad_dtype() const {
return impl::GetVariableHooks()->grad_dtype(*this);
}
void TensorBase::set_grad_dtype(const std::optional<ScalarType>& grad_dtype) const {
return impl::GetVariableHooks()->set_grad_dtype(*this, grad_dtype);
}
} // namespace at

View File

@ -100,7 +100,7 @@ class TORCH_API TensorBase {
// Create a Tensor with a +0 reference count. Special care must be
// taken to avoid decrementing this reference count at destruction
// time. Intended to support MaybeOwnedTraits<Tensor>.
explicit TensorBase(unsafe_borrow_t, const TensorBase& rhs)
explicit TensorBase(unsafe_borrow_t /*unused*/, const TensorBase& rhs)
: impl_(c10::intrusive_ptr<at::TensorImpl, UndefinedTensorImpl>(rhs.impl_.get(), c10::raw::DontIncreaseRefcount{})) {}
friend MaybeOwnedTraits<TensorBase>;
@ -930,6 +930,10 @@ public:
const TensorBase& requires_grad_(bool _requires_grad=true) const;
std::optional<ScalarType> grad_dtype() const;
void set_grad_dtype(const std::optional<ScalarType>& grad_dtype) const;
// View Variables
//~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@ -950,7 +954,7 @@ protected:
c10::intrusive_ptr<TensorImpl, UndefinedTensorImpl> impl_;
private:
TensorBase __dispatch_contiguous(c10::MemoryFormat) const;
TensorBase __dispatch_contiguous(c10::MemoryFormat /*memory_format*/) const;
};
inline DeviceIndex get_device(const TensorBase& self) {

View File

@ -117,7 +117,7 @@ C10_HOST_DEVICE inline T cauchy(T val, T median, T sigma) {
template <>
C10_HOST_DEVICE inline double cauchy(double val, double median, double sigma) {
// https://en.wikipedia.org/wiki/Cauchy_distribution#Cumulative_distribution_function
return median + sigma * at::tan(c10::pi<double> * (val - static_cast<double>(0.5)));
return median + sigma * at::tan(c10::pi<double> * (val - 0.5));
}
/**

Some files were not shown because too many files have changed in this diff Show More