Compare commits

...

231 Commits

Author SHA1 Message Date
e752a29afd revert expectedFailure on TestFSDPWithEP::test_e2e 2025-10-16 10:55:03 +00:00
36b622bb72 skip test_cupy_as_tensor 2025-10-16 10:55:03 +00:00
83a04f38a4 xfail test_fsdp_ep.py::TestFSDPWithEP::test_e2e 2025-10-16 10:55:03 +00:00
6579829bee fix lint 2025-10-16 10:55:03 +00:00
2b856676f3 skip test on SM100OrLater 2025-10-16 10:55:03 +00:00
5746261c97 Resolve more merge conflicts 2025-10-16 10:55:03 +00:00
b3c94fd0fc Resolve merge conflicts 2025-10-16 10:55:03 +00:00
6fd366b2c7 Further increase timeout.
test_gather_object* require exact world size (4) to succeed
Likely, all NVSHMEM Triton tests would fail. Skipping all for now.
2025-10-16 10:55:03 +00:00
fe25f6ab59 rebase 2025-10-16 10:55:03 +00:00
ca89e5732f Change runner to linux.12xlarge.memory
and fix lint
2025-10-16 10:55:03 +00:00
f12cb265d4 Fix Lint 2025-10-16 10:55:03 +00:00
7dc6bf5377 Skip distributed unit tests with their issue numbers. 2025-10-16 10:55:03 +00:00
e5ba464808 Undo changes to test_3_level_hierarchical_model_averager 2025-10-16 10:55:03 +00:00
7d95185044 Check if using linux.dgx.b200.8 would avoid the fsdp issues. 2025-10-16 10:55:03 +00:00
77fb3c1cac Revert "Use linux.dgx.b200.8 since linux.dgx.b200.4 is not available."
This reverts commit 1ba7b2e20b804f337aa483e0228228273ade7396.
2025-10-16 10:55:03 +00:00
11a3d1d87b Use linux.dgx.b200.8 since linux.dgx.b200.4 is not available. 2025-10-16 10:55:03 +00:00
8c6d9feb26 Add b200 distributed job
Add aws-role-to-assume: arn:aws:iam::308535385114:role/gha_workflow_s3_and_ecr_read_only
to test job on B200 runner

Fix runner usage - this distributed job should use linux.dgx.b200.8

@require_world_size(4) does not translate to world_size==4.
For example, 8GPU B200 runner would make the unit test run with
world_size = 8.
Tested with:
TEMP_DIR=/tmp BACKEND=nccl WORLD_SIZE=4 pytest -v test/distributed/test_distributed_spawn.py -k test_new_subgroups_world_size_not_divisible_by_group_size

Add require_exact_world_size for distributed unit tests that implicitly
require world_size of 4 to pass.

Fix test_3_level_hierarchical_model_averager

Mimic H100 distributed, run distributed less often because it takes
quite long to finish all the tests (easily 4hours+ for each of the 3
shards).
2025-10-16 10:55:03 +00:00
003dd13073 [dynamo, guards] Better error messages when generated guard fails on the same frame (#165242)
Not sure what exactly we want to have in the message, but that's easy to adjust. I tried to find a reliable test to reproduce this message (happens only when a guard fails right after it's created), but I ended up mocking a `guard_manager.check` function to return `False` to trigger this behavior. I think that's fine, because any other case that we pick (like datetime.now()), we want to patch one day anyway, so every time we make the next patch, will need to chase for another repro test

@williamwen42

Fixes #164990

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165242
Approved by: https://github.com/williamwen42
2025-10-16 01:05:31 +00:00
c2bd41ac9f Build vLLM nightly wheels for CUDA 13.0 (#163239)
Now that https://github.com/vllm-project/vllm/pull/24599 has been merged
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163239
Approved by: https://github.com/malfet, https://github.com/atalman
2025-10-16 01:03:26 +00:00
ca8bd5dbed Move toString(ScalarType) and ScalarType ostream operator to headeronly (#164405)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164405
Approved by: https://github.com/Skylion007, https://github.com/janeyx99
ghstack dependencies: #164350, #164354
2025-10-16 00:55:43 +00:00
26f3803433 Remove workaround to old CUDA bug (#164354)
As in the title.

A check for https://github.com/pytorch/pytorch/issues/164348 to see if the workaround can be removed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164354
Approved by: https://github.com/janeyx99, https://github.com/ngimel, https://github.com/malfet, https://github.com/jeffdaily
ghstack dependencies: #164350
2025-10-16 00:55:43 +00:00
48064acf37 Move AT_FORALL_... macros and ScalarTypeToCPPTypeT to headeronly (#164350)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164350
Approved by: https://github.com/janeyx99
2025-10-16 00:55:42 +00:00
e5a9c247bc [Fix XPU CI] [Inductor UT] Fix test cases broken by community. (#165406)
Fixes #163159, Fixes #164098, Fixes #164097, Fixes #164099, Fixes #165025

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165406
Approved by: https://github.com/EikanWang, https://github.com/jansel
2025-10-16 00:53:32 +00:00
36371b8ec7 [ATen] Fix CUDA reduction warp shuffle order (#164790)
Typical warp shuffle reduction has the following pattern:
<img width="1138" height="501" alt="image" src="https://github.com/user-attachments/assets/3bd176dc-0ad2-4df6-90c7-06e467337166" />

which is exhibited in Triton generated by torch.compile:
<img width="663" height="403" alt="image" src="https://github.com/user-attachments/assets/7f9f36cd-b9eb-44c1-879e-b469668a2ea8" />

Switch the warp shuffle order to make bitwise equivalence between the 2 easier.
PTX difference between old and new, we see a few extra instructions: https://www.diffchecker.com/h6ly3INC/

Comparing the performance on different reduction operations, we see minimal differences. New represents the changes in this PR, old represents the past warp shuffle order:
```
Tensor Shape              Operation            New all dims (ms)       New dim=0 (ms)      New dim=1 (ms)     Old all dims (ms)    Old dim=0 (ms)      Old dim=1 (ms)
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
(1024, 1024)              mean                 0.015817             0.016259             0.013642             0.015990             0.016258             0.013631
(1024, 1024)              sum                  0.015917             0.015906             0.013359             0.015707             0.016266             0.013226
(1024, 1024)              min                  0.016021             0.024625             0.015631             0.015761             0.024485             0.015317
(1024, 1024)              max                  0.016349             0.024971             0.015972             0.015771             0.025001             0.015314
(1024, 1024)              argmin               0.018070             0.024448             0.015578             0.018135             0.025370             0.015322
(1024, 1024)              argmax               0.018427             0.024859             0.015932             0.018164             0.024452             0.015639
(1024, 1024)              var                  0.020078             0.026413             0.020295             0.020199             0.026381             0.020214
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
(2048, 2048)              mean                 0.023826             0.023726             0.022273             0.023236             0.023776             0.022248
(2048, 2048)              sum                  0.023840             0.023355             0.021974             0.023294             0.023354             0.021884
(2048, 2048)              min                  0.024519             0.041263             0.024620             0.023292             0.041491             0.024358
(2048, 2048)              max                  0.024509             0.041670             0.024277             0.023334             0.041231             0.024395
(2048, 2048)              argmin               0.026125             0.041282             0.024567             0.026772             0.041773             0.024296
(2048, 2048)              argmax               0.026117             0.041487             0.024572             0.026412             0.041477             0.024273
(2048, 2048)              var                  0.026603             0.048581             0.031308             0.027587             0.048603             0.030860
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
(4096, 4096)              mean                 0.053927             0.057070             0.054073             0.053028             0.057544             0.053935
(4096, 4096)              sum                  0.053604             0.057410             0.054451             0.053076             0.057033             0.054266
(4096, 4096)              min                  0.054293             0.109122             0.058363             0.053821             0.108689             0.058382
(4096, 4096)              max                  0.054258             0.108035             0.058703             0.053492             0.110552             0.058376
(4096, 4096)              argmin               0.056805             0.111167             0.058301             0.056836             0.112325             0.058292
(4096, 4096)              argmax               0.056488             0.110958             0.058636             0.056844             0.111000             0.057928
(4096, 4096)              var                  0.058936             0.141755             0.068693             0.059735             0.141284             0.068500
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
(8192, 8192)              mean                 0.145552             0.148082             0.138647             0.145364             0.147818             0.138207
(8192, 8192)              sum                  0.145985             0.147900             0.138714             0.145755             0.148031             0.138616
(8192, 8192)              min                  0.146566             0.205359             0.192739             0.145611             0.205237             0.182335
(8192, 8192)              max                  0.146526             0.204844             0.193050             0.146073             0.205457             0.182697
(8192, 8192)              argmin               0.150190             0.206605             0.192543             0.150654             0.206847             0.182007
(8192, 8192)              argmax               0.150481             0.206368             0.192535             0.150845             0.206430             0.182022
(8192, 8192)              var                  0.150884             0.184546             0.203900             0.151594             0.184172             0.197983
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
(1, 1024, 128)            mean                 0.014293             0.008119             0.014533             0.013861             0.008022             0.014449
(1, 1024, 128)            sum                  0.014039             0.007877             0.014111             0.014219             0.008227             0.014045
(1, 1024, 128)            min                  0.014159             0.011354             0.023493             0.014271             0.010862             0.023644
(1, 1024, 128)            max                  0.014154             0.011027             0.023368             0.014259             0.011234             0.023692
(1, 1024, 128)            argmin               0.016403             0.005677             0.023328             0.016273             0.005683             0.024073
(1, 1024, 128)            argmax               0.016734             0.005675             0.023437             0.016580             0.005318             0.023331
(1, 1024, 128)            var                  0.018338             0.009549             0.025538             0.018528             0.009391             0.024777
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
(5, 1024, 128)            mean                 0.014873             0.010131             0.015546             0.015123             0.010131             0.015481
(5, 1024, 128)            sum                  0.015334             0.009673             0.015824             0.014736             0.009671             0.015438
(5, 1024, 128)            min                  0.015047             0.013252             0.024573             0.014803             0.013163             0.024551
(5, 1024, 128)            max                  0.015050             0.013339             0.024197             0.014810             0.013525             0.024230
(5, 1024, 128)            argmin               0.017341             0.012737             0.024306             0.017471             0.012379             0.024991
(5, 1024, 128)            argmax               0.017345             0.012411             0.024421             0.017422             0.012471             0.024237
(5, 1024, 128)            var                  0.019973             0.011453             0.026188             0.020050             0.011438             0.026282
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
(10, 1024, 128)           mean                 0.016976             0.011575             0.016831             0.016722             0.011927             0.017173
(10, 1024, 128)           sum                  0.017039             0.011841             0.017159             0.016385             0.011860             0.016753
(10, 1024, 128)           min                  0.017036             0.015331             0.026770             0.016944             0.015205             0.027166
(10, 1024, 128)           max                  0.017369             0.015348             0.027077             0.016531             0.015716             0.026819
(10, 1024, 128)           argmin               0.019203             0.014447             0.026813             0.018994             0.014497             0.027313
(10, 1024, 128)           argmax               0.019563             0.014795             0.027140             0.019460             0.014912             0.026733
(10, 1024, 128)           var                  0.020529             0.014316             0.030405             0.020719             0.013960             0.029964
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
(100, 1024, 128)          mean                 0.045046             0.039168             0.046082             0.044839             0.039217             0.045782
(100, 1024, 128)          sum                  0.045094             0.039150             0.045777             0.044496             0.039542             0.046083
(100, 1024, 128)          min                  0.045768             0.054466             0.076244             0.044915             0.053943             0.076599
(100, 1024, 128)          max                  0.045748             0.054459             0.076188             0.044931             0.053949             0.076856
(100, 1024, 128)          argmin               0.048275             0.054046             0.076647             0.048694             0.054105             0.077004
(100, 1024, 128)          argmax               0.048267             0.054395             0.077401             0.048691             0.054131             0.076751
(100, 1024, 128)          var                  0.049710             0.043254             0.083077             0.050971             0.043251             0.082378
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
(1000, 1000, 100)         mean                 0.202312             0.196723             0.197765             0.201774             0.196641             0.197459
(1000, 1000, 100)         sum                  0.202651             0.196682             0.197736             0.202175             0.196313             0.197523
(1000, 1000, 100)         min                  0.203022             0.264762             0.269200             0.202729             0.264129             0.268694
(1000, 1000, 100)         max                  0.202864             0.264396             0.269388             0.202486             0.263896             0.268720
(1000, 1000, 100)         argmin               0.226727             0.263781             0.268651             0.226597             0.264676             0.268983
(1000, 1000, 100)         argmax               0.226412             0.264469             0.269090             0.226570             0.264595             0.269178
(1000, 1000, 100)         var                  0.243223             0.204079             0.216096             0.241942             0.204079             0.215925
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
(10000, 100)              mean                 0.016193             0.020277             0.014316             0.016152             0.020324             0.013712
(10000, 100)              sum                  0.016289             0.020237             0.014034             0.016168             0.020265             0.013708
(10000, 100)              min                  0.016046             0.030872             0.019609             0.016208             0.030867             0.018627
(10000, 100)              max                  0.016369             0.030835             0.019257             0.016218             0.030861             0.018209
(10000, 100)              argmin               0.017957             0.031171             0.019517             0.018050             0.031556             0.018077
(10000, 100)              argmax               0.017961             0.031658             0.019521             0.018060             0.031564             0.018087
(10000, 100)              var                  0.020393             0.035652             0.019339             0.020144             0.035987             0.019171
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
(100000, 10)              mean                 0.015718             0.016576             0.016555             0.015999             0.016246             0.014869
(100000, 10)              sum                  0.015833             0.016247             0.016572             0.016007             0.016627             0.014872
(100000, 10)              min                  0.015888             0.020510             0.023920             0.015671             0.020821             0.021417
(100000, 10)              max                  0.015889             0.020479             0.023918             0.016077             0.020386             0.021421
(100000, 10)              argmin               0.018233             0.020863             0.023647             0.017574             0.020864             0.021103
(100000, 10)              argmax               0.017896             0.020527             0.023296             0.017569             0.020447             0.021098
(100000, 10)              var                  0.020005             0.024198             0.024372             0.020075             0.024167             0.022415
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
(1023, 1023, 1023)        mean                 1.874816             1.963506             1.903909             1.873279             1.963859             1.903230
(1023, 1023, 1023)        sum                  1.875030             1.965716             1.902458             1.873566             1.960730             1.901642
(1023, 1023, 1023)        min                  1.878563             2.473455             2.179092             1.875174             2.482086             2.183027
(1023, 1023, 1023)        max                  1.879128             2.474803             2.178895             1.874831             2.482253             2.183884
(1023, 1023, 1023)        argmin               1.921800             2.476629             2.174831             1.923987             2.472641             2.170453
(1023, 1023, 1023)        argmax               1.922605             2.476688             2.177927             1.923366             2.472808             2.172979
(1023, 1023, 1023)        var                  1.972606             3.088695             2.758797             1.978679             3.095658             2.762243
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
(1023, 1023, 255)         mean                 0.489984             0.500954             0.492957             0.489891             0.500654             0.491971
(1023, 1023, 255)         sum                  0.490228             0.500764             0.492289             0.489624             0.501089             0.492824
(1023, 1023, 255)         min                  0.491457             0.563560             0.553334             0.490355             0.564709             0.554754
(1023, 1023, 255)         max                  0.491396             0.563628             0.553345             0.490017             0.565004             0.554947
(1023, 1023, 255)         argmin               0.503666             0.561512             0.551831             0.503845             0.560972             0.551017
(1023, 1023, 255)         argmax               0.503602             0.561185             0.551407             0.504328             0.561267             0.551448
(1023, 1023, 255)         var                  0.510844             0.709452             0.701630             0.512693             0.710365             0.701965
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
(1023, 1023, 377)         mean                 0.707439             0.727646             0.712019             0.706769             0.727101             0.711632
(1023, 1023, 377)         sum                  0.707780             0.727453             0.711554             0.706807             0.726656             0.711729
(1023, 1023, 377)         min                  0.709423             0.819809             0.794379             0.707847             0.822086             0.796664
(1023, 1023, 377)         max                  0.709297             0.819780             0.794308             0.707566             0.821913             0.796690
(1023, 1023, 377)         argmin               0.725028             0.817088             0.791695             0.726039             0.816445             0.790828
(1023, 1023, 377)         argmax               0.725301             0.817011             0.791420             0.726040             0.816917             0.791143
(1023, 1023, 377)         var                  0.740859             1.034165             1.006712             0.743413             1.035506             1.007638
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164790
Approved by: https://github.com/ngimel, https://github.com/eqy
ghstack dependencies: #165494
2025-10-15 23:54:51 +00:00
7e6721fb0a [BE] Remove confusing opbenchmark-on-demand-build (#165583)
As it doesn't have a test shard, so what's the point or running the build? Was added in https://github.com/pytorch/pytorch/pull/143733 and looks like test shard never existed for it

Moreover, allow one to specify benchmark size as argument, so one
technically can do a workflow dispatch with different opbenchmark sizes
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165583
Approved by: https://github.com/huydhn
2025-10-15 23:48:28 +00:00
901bbcba12 Gate division bitwise numerics under a flag (#165566)
https://github.com/pytorch/pytorch/pull/164144 ensures that division for compile is bitwise equivalent with eager. However, in https://github.com/pytorch/pytorch/issues/164301, the kernel performance is regressed.

On B200:
With standard triton `/`:
6511 GB/s

With triton `div_rn`:
4692 GB/s

Further investigation is required for the generated PTX to see why there is such a large slowdown. For now, enable bitwise equivalent results under `TORCHINDUCTOR_EMULATE_DIVISION_ROUNDING` similar to emulate_precision_cast

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165566
Approved by: https://github.com/ngimel, https://github.com/eellison
2025-10-15 23:41:01 +00:00
febb603230 [Inductor][CuTeDSL] Move load_template up two directories (#165347) (#165576)
Summary:

Moves the function used to load CuTeDSL Jinja templates up one level out of the flex attention folder. This way it can be used for more generate Inductor templates in the future.

Test Plan: `INDUCTOR_TEST_DISABLE_FRESH_CACHE=1 TORCHINDUCTOR_CACHE_DIR=~/cutetest buck2 run mode/opt //caffe2/test/inductor:cutedsl_grouped_mm -c fbcode.nvcc_arch=b200a -c fbcode.enable_gpu_sections=true -c fbcode.platform010_cuda_version=12.8`

Reviewed By: drisspg

Differential Revision: D84527470

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165576
Approved by: https://github.com/jananisriram
2025-10-15 23:37:55 +00:00
568d2f3ae7 [Dynamo][Logging] Add sources/types to LazyVariableTracker logging (#165402)
Fixes #162860

This task add the variable source attrition to LazyVariableTracker when output trace bytecode

Test plan -- test/dynamo/test_error_messages.py ErrorMessagesTest.test_variable_tracker_source_attribution

The output is as specified in the prior mentioned Github issue.

<img width="961" height="59" alt="Screenshot 2025-10-13 at 10 19 44 PM" src="https://github.com/user-attachments/assets/fb27da3f-d00b-437b-bf2e-52e892572cd7" />

This is specifically for the log setup with ``TORCH_LOGS=trace_bytecode``

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165402
Approved by: https://github.com/Lucaskabela, https://github.com/williamwen42

Co-authored-by: William Wen <williamwen@meta.com>
2025-10-15 23:23:09 +00:00
b54e466fd0 Megacache integration (#163533)
This diff adds megacache integration for DynamoCache.

Because DynamoCache requires lazy serialization, i.e. it can only be serialized once all relevant backends have been compiled and we're ready for a save, we actually do the DynamoCache saving only on a call to `torch.compiler.save_cache_artifacts`.

Differential Revision: [D82735763](https://our.internmc.facebook.com/intern/diff/D82735763/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163533
Approved by: https://github.com/oulgen, https://github.com/zhxchen17
2025-10-15 22:49:15 +00:00
53f9ae0e50 [ROCm] new implementation of upsample_bilinear2d_backward (#164572)
Changed the implementation from an output-based approach to an input-based one to remove `atomicAdd` operations, and it appears to deliver at least a 20× speedup.

The changes are from Yu-Yun <YuYun.Chang@amd.com>.

# Summary: Refactor of the implementation of the `upsample_bilinear2d_backward` opertion on MI300X/MI325X
- The original "scatter-add" approach
  - Each thread, representing an output pixel, scattered gradient contributions to four input pixels, using costly atomic operations on MI300X/MI325X GPUs.
- The new "gather-sum" approach
  - Each thread is responsible for a single input pixel and gathers all relevant gradient contributions from a small, calculated region of the output tensor (done by the `compute_output_range` device function).
# Breakdown of the code changes
- Inversion of the parallelization strategy of the kernel function `upsample_bilinear2d_backward_out_frame`
  - Originally, the main kernel loop was parallelized over the number of elements in the output gradient tensor (`const size_t o_numel = nc * width2 * height2;`).
    - Each thread processed one output pixel.
  - The new loop is parallelized over the number of elements in the input gradient tensor (`const size_t i_numel = nc * height1 * width1;`).
    - Each thread is responsible for calculating the final gradient for a single input pixel.
  - The kernel launch changes accordingly in the function `upsample_bilinear2d_backward_out_cuda_template`.
- Added a device function for calculating the range of output pixels that could have possibly used that the input pixel (`input_pos`) during the forward pass interpolation
  - This is essentially the mathematical inverse of the forward pass.
  - This function tries to prune a thread's search space so that it only needs to inspect a small, local window of the output tensor.
- Gradient calculation approach switching from "scatter-add" to "gather-sum"
  - Scatter-add
    - For each output pixel, the thread calculated 4 gradient contributions and use `fastAtomicAdd` 4 times to add these values to 4 different (and potentially highly contended) memory locations in the input gradient tensor.
  - Gather-sum
    - A thread responsible for one input pixel calls `compute_output_range` to determine the small rectangular region of output pixels that influence the input's final gradient value.
    - The thread iterates through this region, and for each output pixel in the regionre, it re-calculates the interpolation weights to determine the exact contribution to its specific input pixel.
    - All these contributions are accumulated into a private, per-thread register variable (`accscalar_t grad_sum = 0;`).
      - W/o any gloabl memory access, this accumulation is extremely fast.
    - When the loops are done, the thread performs a single, direct write (non-atomic) of the final summed gradient to its designated location in global memory (`idata[index] = static_cast<scalar_t>(grad_sum);`).
# Why performance gets boosted
- Analysis of the root cause of performance drop
  - Ref. (internal only) - https://amd.atlassian.net/wiki/spaces/~glencao2/pages/1140493327/PyTorch__upsample_bilinear2d_backward
- First and foremost, elimination of the contention of atomic operations
  - Many parallel threads called `atomicAdd` frequently attempting to update the exact same memory location in the input gradient tensor at the same time.
    - The GPU's memory controler has to serialize these operations, effectively nullifying the benefit of parallel capability at those contention points.
  - MI300X/MI325X chiplet-based CDNA 3 architeture amplified the issue.
    - When contending threads reside on different XCDs, resolving the atomic operation requires high-latency coherence traffic across the Infinity Fabric interconnect.
  - The implementation change eliminates hardware-level serialization and cross-chiplet coherence traffic caused by many `atomicAdd`.
- Improved memory access pattern and locality
  - Write coalescing
    - The regular sum writes `idata[index] = static_cast<scalar_t>(grad_sum);` can be perfectly coalesced by GPUs.
  - Read locality
    - Even though there are many (potentially repeated) reads from the output tensor (`static_cast<accscalar_t>(odata[output_idx])`), these are highly cache-friendly, meaning the data for one thread is likely to be in the L1 or L2 cache already due to an access from a neighboring thread.
- Trade-off: computation for memory synchronization
  - The recalculation of interpolation weights fits well on high-computational-throughput modern GPUs like MI300X/MI325X.
  - Removal of atomic operations avoids expensive memory synchronization.

---

Optimizations of `grid_sampler_2d_backward` will be addressed in a separate PR.
Doc for reference: (internal only) https://amd.atlassian.net/wiki/spaces/~glencao2/pages/1162750701/PyTorch__grid_sampler_2d_backward

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164572
Approved by: https://github.com/jeffdaily
2025-10-15 22:35:43 +00:00
b42fe389b9 ROCm unit tests enablement (#165366)
Enables:
test_cuda.py::TestCuda::test_streaming_backwards_multiple_streams
test_cuda.py::TestCuda::test_graph_make_graphed_callables_with_amp_cache_disabled_allow_unused_input
test_cuda.py::TestCuda::test_graph_make_graphed_callables_without_amp_allow_unused_input
test_matmul_cuda.py::TestMatmulCudaCUDA::test_cublas_baddbmm_large_input_1_10000_10000_10000_cuda_bfloat16
test_matmul_cuda.py::TestMatmulCudaCUDA::test_cublas_baddbmm_large_input_1_10000_10000_10000_cuda_float16
test_matmul_cuda.py::TestMatmulCudaCUDA::test_cublas_baddbmm_large_input_1_10000_10000_10000_cuda_float32
test_matmul_cuda.py::TestMatmulCudaCUDA::test_cublas_baddbmm_large_input_1_10000_1000_10000_cuda_bfloat16
test_matmul_cuda.py::TestMatmulCudaCUDA::test_cublas_baddbmm_large_input_1_10000_1000_10000_cuda_float16
test_matmul_cuda.py::TestMatmulCudaCUDA::test_cublas_baddbmm_large_input_1_10000_1000_10000_cuda_float32
test_matmul_cuda.py::TestMatmulCudaCUDA::test_cublas_baddbmm_large_input_2_1000_1000_1000_cuda_bfloat16
test_matmul_cuda.py::TestMatmulCudaCUDA::test_cublas_baddbmm_large_input_2_1000_1000_1000_cuda_float16
test_matmul_cuda.py::TestMatmulCudaCUDA::test_cublas_baddbmm_large_input_2_1000_1000_1000_cuda_float32
test_matmul_cuda.py::TestMatmulCudaCUDA::test_cublas_baddbmm_large_input_2_100_100_100_cuda_bfloat16
test_matmul_cuda.py::TestMatmulCudaCUDA::test_cublas_baddbmm_large_input_2_100_100_100_cuda_float16
test_matmul_cuda.py::TestMatmulCudaCUDA::test_cublas_baddbmm_large_input_2_100_100_100_cuda_float32

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165366
Approved by: https://github.com/jeffdaily
2025-10-15 22:35:03 +00:00
66ea76ec44 [ROCm][tunableop] Improvements to tunableop Numerical Check (#163079)
Modified the flag PYTORCH_TUNABLEOP_NUMERICAL_CHECK, so that it accepts the numerical tolerances in the format atol_rtol as compared to the previous 0 and 1. Retains previous functionality with default values as well.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163079
Approved by: https://github.com/naromero77amd, https://github.com/jeffdaily
2025-10-15 22:26:47 +00:00
e787d532b6 tmp fix for compile internal logger issue (#165568)
Summary: Catch runtime exception when garse and scrub uninteresting configs from inductor config

Test Plan: tested locally

Differential Revision: D84727788

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165568
Approved by: https://github.com/luccafong, https://github.com/oulgen
2025-10-15 22:03:16 +00:00
b3f6d49b69 Overlap scheduler improvements (#165318)
Bucketing a number of smallish improvements:

- Account for bucketing in overlap calculation: if an in-flight collective exists with the same bucket key, reduce new collectives estimated time by its latency time
-  Update compute domination so we are ordering based on compute idx, as opposed to compute depth, so we never reorder compute. this makes it a bit easier to reason about memory, and pre-fetching, although we can exploring reordering in the future.
- When we wait on a collective, force all collectives on the same process group as it that were enqueued prior to the collective to wait as well.

Better Memory Handling:
- Pre-fetch limiting - when scheduling collectives for overlap, only pre-fetch up to a certain distance, then schedule off-path collectives (which are typically memory reducing).
- When we are above peak memory, schedule waits.

TODO:
- for each compute node, we know its original memory in the graph. we could limit pre-fetching that goes across peak memory
- By scheduling off-path collectives for overlap, we reduce memory, but if there weren't enough compute for overlap, we need to proactively schedule them. not an issue yet on examples.
- config some hard coded constants, clean up enablement (can do in subsequent pr)

On small llama 2d backward :
578 of 618 potentially hideable collectives hidden
original mem 14.4GB, rescheduled mem, 15.9GB

on forward:
254/256 potentially hideable collectives hidden
original mem 5.8 gb, reshceduled mem 5.8GB

WIP: adding tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165318
Approved by: https://github.com/ezyang, https://github.com/IvanKobzarev
ghstack dependencies: #164738, #164783, #164944, #164945, #165059
2025-10-15 21:58:47 +00:00
bc1f2108d7 [PP] Update backward_counter and fsdp util to schedule class (#165513)
Fixed one issue with FSDP last reshard not being called.

Rest is mostly refactoring, changing some variables to be class variables so they can be used in https://github.com/pytorch/torchtitan/pull/1721

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165513
Approved by: https://github.com/fegin
2025-10-15 21:58:16 +00:00
f071f17911 [Graph Partition] fix partition x memory plan issue (#165514)
For `test_graph_partition_with_memory_plan_reuse`, before this PR, when using graph partition, it would error ([P1992728479](https://www.internalfb.com/phabricator/paste/view/P1992728479)):

```
def partition_0(args):
    ...
    del buf0
    return (buf3, buf4, buf5, buf2, primals_4, )

...

  File "/tmp/torchinductor_boyuan/ww/cwwc7ukfqscg2vy6ankby2fizdb377tvgyx3fwdgddrxe3g47jg6.py", line 132, in partition_0
    return (buf3, buf4, buf5, buf2, primals_4, )
                              ^^^^
NameError: name 'buf2' is not defined. Did you mean: 'buf0'?
```

When not using graph partition, it would work and give the following code ([P1992997521](https://www.internalfb.com/phabricator/paste/view/P1992997521)):

```
def call(self, args):
    ...
    buf2 = buf0; del buf0  # reuse
    ...
```

Note that the issue is buf0 is not reused for buf2 when using graph partition.

Why? Because the codegen runs `run_wrapper_ir_passes` and `memory_plan_reuse`, which pops tailing `MemoryPlanningLine` unless it is in graph output by checking `V.graph.get_output_names()`. However, for graph partition, we should check the output of the current partition instead of the graph before partition.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165514
Approved by: https://github.com/ProExpertProg, https://github.com/eellison
2025-10-15 21:52:16 +00:00
fa1539594b consolidate fw and inference compile paths (#165457)
By design, fw compile and inference compile stages should share a bunch of code; just consolidating the duplication here.

Differential Revision: D84628978

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165457
Approved by: https://github.com/zhxchen17, https://github.com/tugsbayasgalan
2025-10-15 21:33:50 +00:00
dfc8a1c5dd Fix _StridedShard incorrect split (#165533)
https://github.com/pytorch/pytorch/pull/164820 introduced a bug that `_StridedShard` will call parent class `Shard`'s `split_tensor` method, thus results in incorrect data locality. (I think @ezyang spotted this issue, but we have no test to capture this)

Meanwhile, I notice another bug that when we normalize a `_StridedShard`'s placement, it will also trigger parent class `Shard`'s `split_tensor` method because it will create a Shard class [here](0c14f55de6/torch/distributed/tensor/_api.py (L783)). I think we never test `distribute_tensor` for `_StridedShard` before. So I added a test here to compare against ordered shard.

Using classmethod because the _split_tensor logic is different between `Shard` and `_StridedShard`. Basically I want to shard on local tensors without initializing the Shard object:
```
local_tensor = _StridedShard._make_shard_tensor(dim, tensor, mesh, mesh_dim, split_factor=split_factor)
local_tensor = Shard._make_shard_tensor(dim, tensor, mesh, mesh_dim)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165533
Approved by: https://github.com/XilunWu
2025-10-15 20:52:41 +00:00
7f9b745494 [ROCm][tunableop] Modified Online Tuning Mode to add Instant Logging (#163965)
- Added instant logging in online tuning mode, so that each tuned GEMM is instantly written
- Allows us to have saved tuning configs, in cases of crashes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163965
Approved by: https://github.com/naromero77amd, https://github.com/jeffdaily
2025-10-15 20:02:31 +00:00
83f9baf413 [Bugfix][Precompile][vLLM] Support for pickling einops for aot_autograd serialization in vLLM (#165359)
Fixes issue with compiling `Qwen2_5_vl` in https://github.com/vllm-project/vllm/pull/23207 (issue happens with `aot_autograd_cache`)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165359
Approved by: https://github.com/jamesjwu
2025-10-15 20:00:24 +00:00
ffc7552e01 See if we can handle uploading all test data (#165484)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165484
Approved by: https://github.com/izaitsevfb
2025-10-15 19:57:41 +00:00
78f5a1ec60 varlen api (#164502)
**Summary**

Today, the only way to have variable sequence length support in PyTorch attention is through nested tensors [here](https://docs.pytorch.org/tutorials/intermediate/scaled_dot_product_attention_tutorial.html#nestedtensor-and-dense-tensor-support). We also want to add an explicit lower-level API that provides variable sequence length support without padding/masking in SDPA.

This PR builds out `varlen_attn`, the public API that users can call for the forward method, and `_varlen_attn`, the private API that calls into the Flash Attention/cuDNN backend.

**Benchmarking**

To benchmark, we compare runtime and TFLOPs against the current SDPA approach with padding.

Settings:

- 1 H100 machine
- `batch_size=8`, `max_seq_len=2048`, `embed_dim=1024`, `num_heads=16`
- dtype `torch.bfloat16`
- `is_causal=False`
- for variable length, we set sequences to be random multiples of 64 up to `max_seq_len`
- 100 runs

|        | Variable Length API | SDPA     |
|--------|--------------------|----------|
| Runtime | 0.21750560760498047 ms       | 0.43171775817871094 ms  |
| TFLOPs | 231.812         | 320.840  |

The sparsity is 0.453 which we can see matches the speedup we get from Varlen (approx 50%). TFLOPs remains around the same, with SDPA slightly larger due to potential higher overhead and total flops scaling with sequence length.

**Testing**

Run `python test/test_varlen_attention.py` for unit tests where we verify basic functionality and confirm numerical match between varlen outputs vs SDPA.

**Next steps**

Next steps from this PR (higher in the stack) include registering the private API `_varlen_attn` as a custom op, implementing backward support, and enabling cuDNN with correct numerics.

(This stack builds on top of #162326)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164502
Approved by: https://github.com/v0i0, https://github.com/drisspg
2025-10-15 19:45:55 +00:00
2b71b62045 Add Memory Estimation Tracker (#165059)
Add Memory Tracker utility, which will track live memory given alternate ordering of nodes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165059
Approved by: https://github.com/ezyang, https://github.com/IvanKobzarev
ghstack dependencies: #164738, #164783, #164944, #164945
2025-10-15 19:44:29 +00:00
8c4b528403 Revert "[Inductor][CuTeDSL] Move load_template up two directories (#165347)"
This reverts commit 815d6415996d5b32b569fd2a8206f1e57c75bfe3.

Reverted https://github.com/pytorch/pytorch/pull/165347 on behalf of https://github.com/pytorch-auto-revert due to Reverted automatically by pytorch's autorevert, to avoid this behaviour add the tag autorevert: disable ([comment](https://github.com/pytorch/pytorch/pull/165347#issuecomment-3407958496))
2025-10-15 19:30:46 +00:00
066f818eea Refactor and unify v1/v2 _scaled_mm codes (#165436)
Summary:

* Refactor out some core routines (scaled_gemm, auto-tuned scaled_gemm)
* Unify v1/v2 dispatch calls where possible
* Simplify call pattern w.r.t. CUDA/ROCM for easier readability.

Test Plan:

```
pytest -svv test/test_scaled_matmul_cuda.py
```

Reviewers:

Subscribers:

Tasks:

Tags:
Signed-off-by: Simon Layton <simonlayton@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165436
Approved by: https://github.com/drisspg
2025-10-15 19:07:05 +00:00
14af1dc3da [DeviceMesh] Fix layout calculation when flattening non-contiguous dims (#165542)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165542
Approved by: https://github.com/ezyang, https://github.com/fduwjj
2025-10-15 18:55:45 +00:00
2395d7d7da Relax equality check (#165460)
When an object is inherited from multiple types, the previous check would fail. So we should relax it to respect eager semantic

Differential Revision: [D84635322](https://our.internmc.facebook.com/intern/diff/D84635322)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165460
Approved by: https://github.com/avikchaudhuri
2025-10-15 18:32:01 +00:00
0aa7ebaf03 Fix periodic debug tests failing due to FakeProcessGroup things (#165479)
These happen when building with CMAKE_BUILD_TYPE=RelWithAssert

This should fix two types of failures that started with https://github.com/pytorch/pytorch/pull/163665

Disclaimer that I used a lot of AI since I don't how pybind works or what refcounts and pointers are, so idk if this is a good solution, or even a solution at all (fwiw the tests pass now)

The first one type is

Truncated:
```
    default_pg, _ = _new_process_group_helper(
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2096, in _new_process_group_helper
    backend_class = creator_fn(dist_backend_opts, backend_options)
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/distributed/fake_pg.py", line 25, in _create_fake_pg
    return FakeProcessGroup._create_internal(
RuntimeError: new_refcount != 1 INTERNAL ASSERT FAILED at "/var/lib/jenkins/workspace/c10/util/intrusive_ptr.h":319, please report a bug to PyTorch. intrusive_ptr: Cannot increase refcount after it reached zero.
Exception raised from retain_ at /var/lib/jenkins/workspace/c10/util/intrusive_ptr.h:319 (most recent call first):
C++ CapturedTraceback:
#4 std::_Function_handler<std::shared_ptr<c10::LazyValue<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > const> (), c10::SetStackTraceFetcher(std::function<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > ()>)::{lambda()#1}>::_M_invoke(std::_Any_data const&) from Logging.cpp:0
#5 c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) from ??:0
#6 c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) from ??:0
#7 c10::detail::torchInternalAssertFail(char const*, char const*, unsigned int, char const*, char const*) from ??:0
#8 void pybind11::class_<c10d::FakeProcessGroup, (anonymous namespace)::IntrusivePtrNoGilDestructor<c10d::FakeProcessGroup> >::init_instance<(anonymous namespace)::IntrusivePtrNoGilDestructor<c10d::FakeProcessGroup>, 0>(pybind11::detail::instance*, void const*) from init.cpp:0
#9 pybind11::detail::type_caster_generic::cast(void const*, pybind11::return_value_policy, pybind11::handle, pybind11::detail::type_info const*, void* (*)(void const*), void* (*)(void const*), void const*) from :0
#10 pybind11::cpp_function::initialize<torch::distributed::c10d::(anonymous namespace)::c10d_init(_object*, _object*)::{lambda(int, int, c10::intrusive_ptr<c10d::FakeProcessGroup::Options, c10::detail::intrusive_target_default_null_type<c10d::FakeProcessGroup::Options> >)#127}, c10::intrusive_ptr<c10d::FakeProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::FakeProcessGroup> >, int, int, c10::intrusive_ptr<c10d::FakeProcessGroup::Options, c10::detail::intrusive_target_default_null_type<c10d::FakeProcessGroup::Options> >, pybind11::name, pybind11::scope, pybind11::sibling, pybind11::arg, pybind11::arg, pybind11::arg_v>(torch::distributed::c10d::(anonymous namespace)::c10d_init(_object*, _object*)::{lambda(int, int, c10::intrusive_ptr<c10d::FakeProcessGroup::Options, c10::detail::intrusive_target_default_null_type<c10d::FakeProcessGroup::Options> >)#127}&&, c10::intrusive_ptr<c10d::FakeProcessGroup, c10::detail::intrusive_target_default_null_type<c10d::FakeProcessGroup> > (*)(int, int, c10::intrusive_ptr<c10d::FakeProcessGroup::Options, c10::detail::intrusive_target_default_null_type<c10d::FakeProcessGroup::Options> >), pybind11::name const&, pybind11::scope const&, pybind11::sibling const&, pybind11::arg const&, pybind11::arg const&, pybind11::arg_v const&)::{lambda(pybind11::detail::function_call&)#3}::_FUN(pybind11::detail::function_call&) from init.cpp:0
```
and I fix it here by getting rid of `DontIncreaseRefcount` and using make_intrusive to do the ref count handling instead.  However, I also had to move the constructor to be public, which I think is not good, based on the reasoning of the original PR

The other one type is
```
Traceback (most recent call last):
  File "/var/lib/jenkins/workspace/test/test_testing.py", line 2415, in test_no_warning_on_import
    self.assertEqual(out, "")
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 4233, in assertEqual
    raise error_metas.pop()[0].to_error(  # type: ignore[index]
AssertionError: String comparison failed: "/opt/conda/envs/py_3.10/lib/python3.10/s[352 chars]):\n" != ''
- /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/__init__.py:29: FutureWarning: pybind11-bound class 'torch._C._distributed_c10d.FakeProcessGroup' is using an old-style placement-new '__init__' which has been deprecated. See the upgrade guide in pybind11's docs. This message is only visible when compiled in debug mode.
-   if is_available() and not torch._C._c10d_init():

To execute this test, run the following from the base repo dir:
    python test/test_testing.py TestImports.test_no_warning_on_import
```
which I fix by getting rid of the `__init__` which I think is ok since it'll just error if you try to make one?

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165479
Approved by: https://github.com/ezyang
2025-10-15 18:16:08 +00:00
7a97832585 [ROCm] Add more timm models, forward fix #165381 (#165569)
PR #165381 added timm models to cuda and cpu expected accuracy files. ROCm expected accuracy files were not updated.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165569
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-10-15 18:11:21 +00:00
84d141e910 Revert "[inductor] Expand use of generic benchmark function (#164938)"
This reverts commit 5c583e2573f29243742e00b9fa36b266c5c78bb3.

Reverted https://github.com/pytorch/pytorch/pull/164938 on behalf of https://github.com/clee2000 due to I think this broke test/inductor/test_cuda_repro.py::CudaReproTests::test_epilogue_fusion_with_view? [GH job link](https://github.com/pytorch/pytorch/actions/runs/18529735968/job/52813191763) [HUD commit link](f58f301313) on both rocm and the slow grad check for linux. It did run successfully on cuda workflow on trunk, I wonder if this a gpu capability thing? no clue though ([comment](https://github.com/pytorch/pytorch/pull/164938#issuecomment-3407600224))
2025-10-15 17:48:38 +00:00
7c6c5d04fe Add scaled_grouped_mm_v2 and python API (#165154)
Summary:

* Add `torch._scaled_grouped_mm_v2` with more functionality and
  extensibility for future formats
* Add `torch.nn.functional.scaled_grouped_mm` as public entrypoint
* Test both original and v2 functionality

Test Plan:

```
pytest -svv -k grouped test/test_scaled_matmul_cuda.py
```

Reviewers:

Subscribers:

Tasks:

Tags:
Signed-off-by: Simon Layton <simonlayton@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165154
Approved by: https://github.com/drisspg, https://github.com/danielvegamyhre
2025-10-15 17:47:23 +00:00
b509fb9b5d Revert "add and fix OpInfo tests for the default partitioner (#165372)"
This reverts commit bcfea48ab7fd489218289693b98c1a6a6582d079.

Reverted https://github.com/pytorch/pytorch/pull/165372 on behalf of https://github.com/malfet due to Looks like it broke slow jobs, see 331b7cc054/1 ([comment](https://github.com/pytorch/pytorch/pull/165372#issuecomment-3407567748))
2025-10-15 17:38:52 +00:00
331b7cc054 Fix double dispatch to Python for detach (#163671)
This fixes #71725.

Differential Revision: [D83857880](https://our.internmc.facebook.com/intern/diff/D83857880)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163671
Approved by: https://github.com/ezyang, https://github.com/albanD
2025-10-15 17:24:50 +00:00
815d641599 [Inductor][CuTeDSL] Move load_template up two directories (#165347)
Summary: Moves the function used to load CuTeDSL Jinja templates up one level out of the flex attention folder. This way it can be used for more generate Inductor templates in the future.

Test Plan: `INDUCTOR_TEST_DISABLE_FRESH_CACHE=1 TORCHINDUCTOR_CACHE_DIR=~/cutetest buck2 run mode/opt //caffe2/test/inductor:flex_flash -c fbcode.nvcc_arch=b200a -c fbcode.enable_gpu_sections=true -c fbcode.platform010_cuda_version=12.8`

Differential Revision: D84527470

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165347
Approved by: https://github.com/drisspg
2025-10-15 16:34:58 +00:00
ffe3cb226a In pipeline parallelism: Use same dtype for receive and send tensor when initializing p2p communication. (#165539)
When initializing the p2p communication for pipeline parallelism, currently different default dtypes are used for the send and receive tensor here:
5c583e2573/torch/distributed/pipelining/stage.py (L935-L936)

This caused hard to trace issues when training on multiple nodes. Multiple stages on one node seem to work for some reason which probably caused the unit tests not to catch this.

Fixes #165143

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165539
Approved by: https://github.com/H-Huang
2025-10-15 15:05:55 +00:00
7ae123d72c [DeviceMesh] Make _flatten_mapping an object attribute instead of a class attribute (#165521)
The `_flatten_mapping` field was defined as a class attribute with a mutable default value {}:
```
_flatten_mapping: dict[str, "DeviceMesh"] = {}
```
This caused all DeviceMesh instances to share the same dictionary object. When multiple test instances tried to create flattened meshes with the same name (like "dp"), they would conflict because they were all using the same shared dictionary, resulting in the error: "Flatten mesh with mesh_dim_name dp has been created before, Please specify another valid mesh_dim_name."

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165521
Approved by: https://github.com/fegin, https://github.com/lw
2025-10-15 14:47:09 +00:00
7719cb75bf [ATen][CMake] Fix duplicated CUTLASS path (#165424)
Fixes #165110

The `PUBLIC` scope causes CUTLASS of the FBGEMM being included in for all PyTorch targets, including special matmuls (RowwiseScaledMM, ScaledGroupMM and GroupMM). Due to version mismatch between FBGEMM/CUTLASS and PyTorch/CUTLASS it is unacceptable to use FBGEMM/CUTLASS in PyTorch targets. This PR limits the scope of FBGEMM/CUTLASS to `fbgemm_genai` target only.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165424
Approved by: https://github.com/cthi, https://github.com/eqy, https://github.com/danielvegamyhre
2025-10-15 14:14:17 +00:00
712f54d453 [ATen] Remove explicit casting of complex nansum during accumulation (#165494)
https://github.com/pytorch/pytorch/pull/164790 modifies aten to perform a different reduction order intra warp. However, this change exposed a large difference in a sum for complex32. Namely the case:

```
import torch

a = torch.tensor([[ 4.82031250+7.34765625j,
           -3.37109375-1.9501953125j],

         [ 3.7832031250-2.43359375j,
           -6.07812500+5.32812500j]], dtype=torch.complex32, device='cuda:0')

sum_out = torch.sum(a)
nansum_out = torch.nansum(a)
torch.testing.assert_close(
    sum_out,
    nansum_out,
    rtol=0,
    atol=0,
)
```

Here, the result of `sum` and `nansum` differed significantly by 1e-2. Further investigation showed that the explicit casting of b back to `arg_t` from `scalar_t` was the root cause. `arg_t` is the dtype of the accumulator, ComplexFloat, and `scalar_t` of the input dtype, ComplexHalf. When we cast in the reduction to the accumulator order, that means the input is still of ComplexHalf, which loses precision as it can store intermediate values.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165494
Approved by: https://github.com/ngimel
2025-10-15 13:49:25 +00:00
f58f301313 Fixes bug with tolist calls to GradTrackingTensors (#165184)
Fixes #161943

## The Fix
I implemented a recursive unwrapping helper function in the `tensor_to_list.cpp` file that looks for wrapped tensors and unwraps them. The recursive implementation was needed for multi-level gradTrackingTensors.

Let me know if there is any more suggestions on fixing this issue!

@guilhermeleobas @KimbingNg

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165184
Approved by: https://github.com/zou3519
2025-10-15 12:54:28 +00:00
5c583e2573 [inductor] Expand use of generic benchmark function (#164938)
Use the more generic `Benchmarker.benchmark` function to allow benchmarking other devices that support the required functionality, for example prologue and epilogue fusion can be benchmarked for triton CPU.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164938
Approved by: https://github.com/nmacchioni, https://github.com/eellison
2025-10-15 09:18:24 +00:00
0c14f55de6 [ez] fix typo (#165282)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165282
Approved by: https://github.com/ezyang, https://github.com/mlazos
2025-10-15 06:19:24 +00:00
8e510e1095 [MPS] fix empty dot op crash (#165237)
reproducer
```
import torch

# does not crash
a = torch.rand((0), device="cpu")
b = torch.rand((0), device="cpu")
a.dot(b)

# crashes due to internal assert
a = torch.rand((0), device="mps")
b = torch.rand((0), device="mps")
a.dot(b)

```

Discovered when implementing an op for SparseMPS backend
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165237
Approved by: https://github.com/malfet
2025-10-15 04:49:29 +00:00
59d30d1b75 [vision hash update] update the pinned vision hash (#165496)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned vision hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165496
Approved by: https://github.com/pytorchbot
2025-10-15 04:35:50 +00:00
3915898c22 [audio hash update] update the pinned audio hash (#165495)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned audio hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165495
Approved by: https://github.com/pytorchbot
2025-10-15 04:32:49 +00:00
3044e1a460 Revert "varlen api (#164502)"
This reverts commit 3681312ce03e425e280a110df2153db107616a15.

Reverted https://github.com/pytorch/pytorch/pull/164502 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but the doctests failure is legit ([comment](https://github.com/pytorch/pytorch/pull/164502#issuecomment-3404419420))
2025-10-15 03:56:42 +00:00
b11593c31b [8/N] Apply ruff UP035 rule (#165214)
This is follow-up of #164653 to continue applying `UP035` fixes. The purpose is to finally enable this rule.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165214
Approved by: https://github.com/ezyang
2025-10-15 03:18:57 +00:00
36871622f1 [2/N] Mark unused parameters in C++ code (#165121)
This is follow-up of #164912 to mark unused C++ parameters to improve code readability.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165121
Approved by: https://github.com/Skylion007
2025-10-15 03:04:39 +00:00
b4fd47179e feat(dynamo): IS#160752 make F.one_hot work with jacfwd + torch.compile(dynamic=True) (#160837)
Fixes #160752

# Background:
`torch.func.jacfwd` is implemented as vmap over forward-mode JVP. With torch.compile(dynamic=True), FakeTensor + SymInt shape reasoning is used while tracing through the transform. The old vmap rule for one_hot decomposed into “zeros_symint + scatter,” which interacted poorly with the transform stack and dynamic shapes, leading to failures mid-trace. Using a functional equality construction makes one_hot composable with vmap/JVP and friendly to dynamic shape tracing.

# Changes:
- functorch vmap batching rule for `aten::one_hot` now uses a purely functional formulation:
- Replace “zeros + scatter” with eq(self.unsqueeze(-1), arange(num_classes)).to(kLong) under FuncTorchBatched.
- one_hot native path remains unchanged for regular eager; vmap transform no longer relies on scatter, which was fragile under dynamic shape tracing.

The minimal repro from the issue is now fixed:
```python
import torch
import torch.nn.functional as F

MAX, BATCH = 3, 37

def func(x, idxs):
    return x.square() * F.one_hot(idxs, MAX)

def jacfunc(x, idxs):
    return torch.func.jacfwd(func, argnums=0)(x, idxs)

idxs = torch.randint(MAX, (BATCH,), dtype=torch.int64)
x = torch.rand((BATCH, MAX), dtype=torch.float64)

# eager
out_eager = jacfunc(x, idxs)

# compiled dynamic
jacfunc_c = torch.compile(jacfunc, dynamic=True)
out_comp = jacfunc_c(x, idxs)

torch.testing.assert_close(out_eager, out_comp)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/160837
Approved by: https://github.com/guilhermeleobas, https://github.com/zou3519
2025-10-15 02:48:44 +00:00
4f400ab520 Fix: nDims is mutated inside the loop in Shape.cu (#165446)
Summary:
The `nDims` variable is mutated inside the loop but never restored to its original value.
This affects subsequent iterations of the outer loop.
Each batch iteration may get incorrect `nDims` after the first batch.

Test Plan: CI

Reviewed By: ngimel

Differential Revision: D84612194

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165446
Approved by: https://github.com/ngimel
2025-10-15 02:32:15 +00:00
839f6facdb [precompile] Fix frame construction for wrapped model. (#165454)
Summary: If a function is wrapped with functools, we should not look at the wrapped function signature but rather the wrapper, since we need to construct the frame for the top level function here.

Test Plan: test_decorated_function_with_functools_wrap_aot

Differential Revision: D84626752

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165454
Approved by: https://github.com/yiming0416
2025-10-15 02:01:46 +00:00
ca65023b90 [PP] Fix edge case with FSDP when stages_per_rank > 3 (#165467)
There is an edge case with FSDP + PP when we add UNSHARD + RESHARD, we at max have 3 stages unsharded, 3f83e8915e/torch/distributed/pipelining/schedules.py (L1029-L1031)

This change is need to be able to unshard and reshard a stage multiple times.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165467
Approved by: https://github.com/wwwjn
2025-10-15 01:53:04 +00:00
132ae8e6dd Don't link with libnvToolsExt when building for 12.9 (#165465)
This is to bring back this logic from https://github.com/pytorch/pytorch/pull/161916/files#diff-bf46b4a09ca67e50622bf84fefc0d11b584ffcc24ee6cc5019cf0fc7565d81a8L170.  Building libtorch on 12.9 is failing otherwise https://github.com/pytorch/pytorch/actions/runs/18458531395/job/52610761895:

```
cp: cannot stat '/usr/local/cuda/lib64/libnvToolsExt.so.1': No such file or directory
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165465
Approved by: https://github.com/atalman, https://github.com/malfet
2025-10-15 01:45:37 +00:00
a20afb6100 Allow at::native::offset_t to be offset using operator+= (#164570)
This will be required by CCCL 3.1.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164570
Approved by: https://github.com/Skylion007, https://github.com/eqy
2025-10-15 01:40:54 +00:00
47524dcc48 [benchmark] Add more timm models (#165381)
Added following models to timm_models

- [convnextv2_nano.fcmae_ft_in22k_in1k](https://huggingface.co/timm/convnextv2_nano.fcmae_ft_in22k_in1k)
- [vit_base_patch14_dinov2.lvd142m](https://huggingface.co/timm/vit_base_patch14_dinov2.lvd142m)
- [ViT-B-16-SigLIP-i18n-256](https://huggingface.co/timm/ViT-B-16-SigLIP-i18n-256)
- [deit_tiny_patch16_224.fb_in1k](https://huggingface.co/timm/deit_tiny_patch16_224.fb_in1k)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165381
Approved by: https://github.com/BoyuanFeng
2025-10-15 01:19:10 +00:00
9ffba8a2f9 fixing stress test failure (#164353)
Summary: This diff fixes a stress test failure by adding a new binary echo4.py and modifying the existing echo1.py binary. The changes are made in both fbcode and xplat directories. The api_test.py file is updated to use the new echo4.py binary, and the BUCK file is updated to include the new binary.

Test Plan:
```
buck test -j 18 'fbcode//mode/opt' fbcode//caffe2/test/distributed/elastic/multiprocessing:api_test -- --exact 'caffe2/test/distributed/elastic/multiprocessing:api_test - test_binary_redirect_and_tee (api_test.StartProcessesListAsBinaryTest)' --run-disabled --stress-runs 20 --record-results
```

```
buck test -j 18 'fbcode//mode/opt' fbcode//caffe2/test/distributed/elastic/multiprocessing:api_test -- --exact 'caffe2/test/distributed/elastic/multiprocessing:api_test - test_binary (api_test.StartProcessesListAsBinaryTest)' --run-disabled --stress-runs 20 --record-results
```

https://www.internalfb.com/intern/testinfra/testrun/17732923648474906

https://www.internalfb.com/intern/testinfra/testrun/15481123834815653

Differential Revision: D83623694

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164353
Approved by: https://github.com/d4l3k
2025-10-15 01:18:50 +00:00
3681312ce0 varlen api (#164502)
**Summary**

Today, the only way to have variable sequence length support in PyTorch attention is through nested tensors [here](https://docs.pytorch.org/tutorials/intermediate/scaled_dot_product_attention_tutorial.html#nestedtensor-and-dense-tensor-support). We also want to add an explicit lower-level API that provides variable sequence length support without padding/masking in SDPA.

This PR builds out `varlen_attn`, the public API that users can call for the forward method, and `_varlen_attn`, the private API that calls into the Flash Attention/cuDNN backend.

**Benchmarking**

To benchmark, we compare runtime and TFLOPs against the current SDPA approach with padding.

Settings:

- 1 H100 machine
- `batch_size=8`, `max_seq_len=2048`, `embed_dim=1024`, `num_heads=16`
- dtype `torch.bfloat16`
- `is_causal=False`
- for variable length, we set sequences to be random multiples of 64 up to `max_seq_len`
- 100 runs

|        | Variable Length API | SDPA     |
|--------|--------------------|----------|
| Runtime | 0.21750560760498047 ms       | 0.43171775817871094 ms  |
| TFLOPs | 231.812         | 320.840  |

The sparsity is 0.453 which we can see matches the speedup we get from Varlen (approx 50%). TFLOPs remains around the same, with SDPA slightly larger due to potential higher overhead and total flops scaling with sequence length.

**Testing**

Run `python test/test_varlen_attention.py` for unit tests where we verify basic functionality and confirm numerical match between varlen outputs vs SDPA.

**Next steps**

Next steps from this PR (higher in the stack) include registering the private API `_varlen_attn` as a custom op, implementing backward support, and enabling cuDNN with correct numerics.

(This stack builds on top of #162326)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164502
Approved by: https://github.com/v0i0, https://github.com/drisspg
2025-10-15 00:45:06 +00:00
7778a58e7c Revert "[export] Handle kwargs better in aot_export_joint_with_descriptors (#165334)"
This reverts commit bbb902c8dd911e1587253f496c1e2fb178d4b6a1.

Reverted https://github.com/pytorch/pytorch/pull/165334 on behalf of https://github.com/jeffdaily due to trunk CI passed here but failures on HUD after merge?  test/functorch/test_aot_joint_with_descriptors.py::TestAOTJointWithDescriptors::test_module_with_kwargs [GH job link](https://github.com/pytorch/pytorch/actions/runs/18511729262/job/52755708742) [HUD commit link](bbb902c8dd) ([comment](https://github.com/pytorch/pytorch/pull/165334#issuecomment-3404071893))
2025-10-15 00:21:49 +00:00
e7091a47da [AOTI] skip Windows XPU crashed UTs. (#165393)
Skip some UTs, which crashed on Windows XPU.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165393
Approved by: https://github.com/jansel
2025-10-14 23:45:14 +00:00
bcfea48ab7 add and fix OpInfo tests for the default partitioner (#165372)
I noticed the default partitioner was breaking in some dynamic shape tests, so prior to turning off functionalization I want to tweak it to pass all of our OpInfo tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165372
Approved by: https://github.com/ezyang
ghstack dependencies: #165327
2025-10-14 23:34:34 +00:00
d2e1dbc8f2 make aotdispatcher opinfo tests keep input mutations in graph (#165327)
This stack is going to turn off functionalization and turn on the default partitioner, so I'm going to separate out a few changes before turning off functionalization in our OpInfo tests:

(1) run our tests with input mutations allowed inside the graph

(2) run our tests with the default partitioner

(3) run with functionalization off

(4) (later) make the tests properly test for bitwise equivalence

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165327
Approved by: https://github.com/ezyang
2025-10-14 23:34:33 +00:00
89298ada83 [device_mesh] Implement _unflatten on top of CuTe layout bookkeeping (#161224)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161224
Approved by: https://github.com/lw, https://github.com/fegin
ghstack dependencies: #164510
2025-10-14 23:17:11 +00:00
c467e59cb0 dynamo configs to torch.compiler (#163517)
Moving some dynamo configs to torch.compiler

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163517
Approved by: https://github.com/williamwen42, https://github.com/anijain2305

Co-authored-by: Svetlana Karslioglu <svekars@meta.com>
2025-10-14 22:44:53 +00:00
bbb902c8dd [export] Handle kwargs better in aot_export_joint_with_descriptors (#165334)
fx.Interpreter doesn't handle kwargs... not sure how this code worked previously

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165334
Approved by: https://github.com/tugsbayasgalan, https://github.com/ezyang
2025-10-14 22:22:58 +00:00
e6f766c7d7 [Dynamo] Fixes for exceptions (#153966)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153966
Approved by: https://github.com/Lucaskabela
2025-10-14 22:03:58 +00:00
13b621d87c [DTensor] add __repr__ for CommDebugMode(get_total_count()=) (#165006)
I just want to print CommDebugMode and know if there is communication. implementing `__repr__` for `print(comm_mode)`

```
comm_mode = CommDebugMode()
with comm_mode:
    out = torch.mm(inps, weight)
print(comm_mode)
# CommDebugMode(get_total_counts()=0)
```

Tags:

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165006
Approved by: https://github.com/anshul-si
ghstack dependencies: #165024
2025-10-14 21:31:23 +00:00
01738a3fea Continue local tensor mode enablement for DTensor tests (#165451)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165451
Approved by: https://github.com/ezyang, https://github.com/albanD
2025-10-14 21:20:54 +00:00
a2f34bdd7c Revert "Patch the flex_attention._get_mod_type to not use inspect.signature when computing num_positional_args (an alternative fix for flex attention graph break on create_block_mask) (#164923)"
This reverts commit 3401665110dbfbfa4625646e4a18ebf8c99fa92f.

Reverted https://github.com/pytorch/pytorch/pull/164923 on behalf of https://github.com/pytorch-auto-revert due to Reverted automatically by pytorch's autorevert, to avoid this behaviour add the tag autorevert: disable ([comment](https://github.com/pytorch/pytorch/pull/164923#issuecomment-3403654378))
2025-10-14 21:20:49 +00:00
a63ab0b8cd [Inductor] Fix out-of-bounds indices in repeat_interleave decomposition (#165368)
When `repeat_interleave` is decomposed into:
```bash
  cumsum = repeat.cumsum(0)
  pos = torch.arange(output_size, device=repeat.device)
  indices = torch.searchsorted(cumsum, pos, right=True)
```
`searchsorted` op with `right=True` returns the insertion point after matching elements. When query values `pos` are `>= cumsum[-1]`, searchsorted returns `len(cumsum)`, which is out of bounds for indexing (valid range: `[0, len(cumsum)-1]`). These invalid indices trigger CUDA device-side assert errors in downstream indexing operations.

This fix adds clamping to ensure all indices stay within the valid range [0, repeat.size(0)-1].

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165368
Approved by: https://github.com/mlazos
2025-10-14 21:16:36 +00:00
102b7885ff Add option to run AOT Precompile in benchmark (#164906)
Use the existing benchmark infra to get some signals for AOT precompile pass rate on OSS models. Here we also measure and log the loading time.

```
python ./benchmarks/dynamo/huggingface.py --accuracy --inference --aot-precompile

python ./benchmarks/dynamo/timm_models.py --accuracy --inference --aot-precompile

python ./benchmarks/dynamo/torchbench.py --accuracy --inference --aot-precompile
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164906
Approved by: https://github.com/zhxchen17
2025-10-14 20:59:55 +00:00
382d04a51e [Inductor][ATen][FP8] Add note for supported blockwise scaling strategy pairs (#165450)
Summary: Add note mentioning which scaling type pairs are supported in Inductor ATen, since this was a source of confusion and also informs which scaling strategies we choose to support for other backends, like Triton.

Test Plan: n/a

Reviewed By: lw

Differential Revision: D84522373

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165450
Approved by: https://github.com/NikhilAPatel
2025-10-14 20:43:58 +00:00
1ec0755a7e [ISSUES] Update ci:sev template to include a note about ci: disable-autorevert label (#165459)
We noticed that disabling autorevert in any and all ci:sevs is too impactful, as ci: sevs are sometimes created just to communicate an action or a impactful change. But sometimes durring a SEV we might not want to disable autorevert anyways, a example is a ci: sev impacting jobs we don't use as basis for autorevert.

So, a note is added reminding the ci:sev author to optionally add this tag to disable auto-revert

Note: using this opportunity to fix the ci: disable-autorevert issues. As it is best for the title to be simple and the displayed message in the GitHub interface to be decorated with emoji :)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165459
Approved by: https://github.com/malfet
2025-10-14 20:32:46 +00:00
058782c6ab [torch.export] Rmoving unused constants - add support for corner case (#165205)
Summary: In some cases unused constant had only one level of child node, no second level of child node. Those constants should be removed too. The added test case has the scenario where this scenario will happen.

Test Plan:
```
buck test mode/opt caffe2/test:test_export -- 'test_unused_constant'
```

https://www.internalfb.com/intern/testinfra/testrun/15481123837456594

Differential Revision: D84398413

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165205
Approved by: https://github.com/angelayi
2025-10-14 20:26:28 +00:00
2b4ef6b4d6 [opaque_obj_v2] PyObject custom op schema type (#165004)
This is a cleaner implementation of opaque objects (https://github.com/pytorch/pytorch/pull/162660). Instead now we just need to do:

Call `register_opaque_type` to register the type as being "opaque" and allowed by custom ops. You also need to pass a unique name that maps to the type.
```python
class OpaqueQueue:
    def __init__(self, queue: list[torch.Tensor], init_tensor_: torch.Tensor) -> None:
        super().__init__()
        self.queue = queue
        self.init_tensor_ = init_tensor_

    def push(self, tensor: torch.Tensor) -> None:
        self.queue.append(tensor)

    def pop(self) -> torch.Tensor:
        if len(self.queue) > 0:
            return self.queue.pop(0)
        return self.init_tensor_

    def size(self) -> int:
        return len(self.queue)

register_opaque_type(OpaqueQueue, "_TestOpaqueObject_OpaqueQueue")
```

When creating the custom op, the schema will then use the unique name:
```python
self.lib = torch.library.Library("_TestOpaqueObject", "FRAGMENT")

torch.library.define(
    "_TestOpaqueObject::queue_push",
    "(_TestOpaqueObject_OpaqueQueue a, Tensor b) -> ()",
    tags=torch.Tag.pt2_compliant_tag,
    lib=self.lib,
)

@torch.library.impl(
    "_TestOpaqueObject::queue_push", "CompositeExplicitAutograd", lib=self.lib
)
def push_impl(queue: OpaqueQueue, b: torch.Tensor) -> None:
    assert isinstance(queue, OpaqueQueue)
    queue.push(b)
```

Using the custom op:
```python
queue = OpaqueQueue([], torch.zeros(3))
torch.ops._TestOpaqueObject.queue_push(queue, torch.ones(3))
self.assertTrue(queue.size(), 1)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165004
Approved by: https://github.com/albanD
2025-10-14 20:21:04 +00:00
3f83e8915e [inductor] fix issue for example value with unbacked strides (#163660)
## Issue

During autotune, we're not applying size hints atomically for the example inputs used for benchmarking.

If there is unbacked symint showing up in inputs' strides, this might lead to CUDA IMA,

and this could be reproduced by the added unittest, with stride being `[128 * u0, 128, 1]` and unbacked fallback being 8192, after calling `benchmark_example_value`, we get back a tensor with stride as `[8192, 128, 1]` as opposed to `[128 * 8192, 128, 1]`

## Fix

Using the atomic API when trying to apply size hints to input tensor' strides.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163660
Approved by: https://github.com/ColinPeppler
2025-10-14 20:07:51 +00:00
d7e3f493d9 [ROCm][CI] add mi355 to inductor perf test nightly (#165326)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165326
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-10-14 20:03:21 +00:00
08f09d9543 Ensure rms_norm decomp generates add.Scalar for pattern match BC (#165437)
Summary: Apparently if I just do `tensor + eps` this turns into add.Tensor, which is bad because the constant Tensor ends up getting hoisted into an input, which is a bozo thing to do. Just make sure it's exactly compatible.

Test Plan:
```
buck run 'fbcode//mode/opt' fbcode//bolt/nn/executorch/backends/tests:qnn_test_ar1g1 bolt.nn.executorch.backends.tests.qnn_test_ar1g1.QnnTestAR1G1.test_RMSNorm
```

Reviewed By: tugsbayasgalan

Differential Revision: D84613184

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165437
Approved by: https://github.com/tugsbayasgalan
2025-10-14 19:56:37 +00:00
74acf92648 Forward fix inductor failure (#165363) (#165443)
Summary:

Title

Test Plan: CI

Differential Revision: D84615478

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165443
Approved by: https://github.com/angelayi
2025-10-14 19:31:58 +00:00
cbf212e9c7 [CI] Fix doctest job if build without distributed (#165449)
Guard test with `TORCH_DOCTEST_DISTRIBUTED` and set it to true in
run_test.py to be able to pass doctest for PyTorch build without
distribtued support. This is a regression introduced by https://github.com/pytorch/pytorch/pull/164806

Fixes https://github.com/pytorch/pytorch/issues/165343

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165449
Approved by: https://github.com/seemethere
2025-10-14 19:19:03 +00:00
d18e068fd6 [dict] Implement __eq__ for dict_items (#155154)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155154
Approved by: https://github.com/anijain2305
2025-10-14 18:56:51 +00:00
3401665110 Patch the flex_attention._get_mod_type to not use inspect.signature when computing num_positional_args (an alternative fix for flex attention graph break on create_block_mask) (#164923)
The initial fix for inspect.signature uses not a right approach (https://github.com/pytorch/pytorch/pull/164349#pullrequestreview-3306614010). As @williamwen42 suggests (https://github.com/pytorch/pytorch/pull/164349#issuecomment-3379222885) we can just for now get rid of `inspect.signature` call in flex_attention to resolve this high priority issue (https://github.com/pytorch/pytorch/issues/164247#issuecomment-3378673179). In this PR I did exactly this - limited the scope of fix to just computing `num_positional_args` in `flex_attention._get_mod_type` based on properties returned by `NestedUserFunctionVariable.const_getattr` (some were missing so I added them)

Fixes #164247

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164923
Approved by: https://github.com/williamwen42
2025-10-14 18:29:15 +00:00
8c60f4ae08 [Distributed] update table in docs (#165009)
Fixes #162248

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165009
Approved by: https://github.com/ezyang
2025-10-14 18:17:22 +00:00
c4565c3b94 [distributed] Replace 164 assert statements in fsdp directory (#165235)
Replace assert statements with explicit if/raise patterns across 20 files:
- _optim_utils.py (38 asserts)
- _flat_param.py (25 asserts)
- _fully_shard/_fsdp_param.py (23 asserts)
- sharded_grad_scaler.py (12 asserts)
- fully_sharded_data_parallel.py (11 asserts)
- wrap.py (10 asserts)
- _state_dict_utils.py (9 asserts)
- _fully_shard/_fsdp_param_group.py (8 asserts)
- _runtime_utils.py (6 asserts)
- _init_utils.py (6 asserts)
- 10 additional files (16 asserts)

This prevents assertions from being disabled with Python -O flag.

Fixes partially #164878

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165235
Approved by: https://github.com/albanD
2025-10-14 18:04:57 +00:00
6918f17114 [FSDP2] provide public API to share cuda streams across roots (#165024)
for pipeline parallel, we can have multiple FSDP roots (chunks)
```
model = nn.Sequential([chunk0, chunk1])
fully_shard(model.chunk0)
fully_shard(model.chunk1)
```

we can call `share_comm_ctx` to share all-gather, reduce-scatter, all-reduce cuda streams. this avoids inter-stream memory fragmentation
```
from torch.distributed.fsdp import share_comm_ctx
share_comm_ctx([model.chunk0, model.chunk1])
```

unit test: `pytest -s test/distributed/_composable/fsdp/test_fully_shard_training.py -k test_share_comm_context`

Summary:

Test Plan:

Reviewers:

Subscribers:

Tasks:

Tags:

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165024
Approved by: https://github.com/mori360
2025-10-14 17:50:46 +00:00
9b6be53326 [distributed] Replace 94 assert statements in tensor ops files (#165229)
Replace assert statements with explicit if/raise patterns in:
- _math_ops.py (43 asserts)
- _matrix_ops.py (27 asserts)
- _view_ops.py (24 asserts)

This prevents assertions from being disabled with Python -O flag.

Fixes partially #164878.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165229
Approved by: https://github.com/albanD
2025-10-14 17:28:06 +00:00
7fee6bbf34 [Fix] Completely remove stride normalization on DLPack Tensor (#164161)
A followup on PR #163282
Fixes #163274
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164161
Approved by: https://github.com/ngimel, https://github.com/eqy
2025-10-14 17:17:11 +00:00
6adaa328f4 [autobucketing] aten autobucketing fix to enable aot_eager pass (#165063)
When the autobucketing pass  is registered as aot_eager backend `fw_compiler` and `bw_compiler`, this pr ensures the tensors are all-gathers on "cpu/cuda" device instead of "meta" device.

When we do `dist.all_gather_object`, it will create new bytestorage outside no_dispatch [here](a2e2e1d8c0/torch/distributed/distributed_c10d.py (L3303)), which is on meta device. Thus, I updated the code to use `unset_fake_temporarily`, which would gather RealTensor from other ranks.

 It is needed to unblock the aot_eager+autobucketing pass in this [PR](https://github.com/pytorch/torchtitan/pull/1813).

Otherwise, I hit the error as follows:

```bash
  traceback : Traceback (most recent call last):
    File "/home/ruisizhang123/pytorch/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 358, in wrapper
      return f(*args, **kwargs)
    File "/home/ruisizhang123/torchtitan/torchtitan/train.py", line 607, in train
      self.train_step(data_iterator)
      ~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^
    File "/home/ruisizhang123/torchtitan/torchtitan/train.py", line 507, in train_step
      loss = self.forward_backward_step(input_dict, labels)
    File "/home/ruisizhang123/torchtitan/torchtitan/train.py", line 483, in forward_backward_step
      pred = model_parts[0](inputs, **extra_inputs, **extra_args)
    File "/home/ruisizhang123/pytorch/torch/_dynamo/eval_frame.py", line 418, in __call__
      return super().__call__(*args, **kwargs)
             ~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^
    File "/home/ruisizhang123/pytorch/torch/nn/modules/module.py", line 1784, in _wrapped_call_impl
      return self._call_impl(*args, **kwargs)
             ~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^
    File "/home/ruisizhang123/pytorch/torch/nn/modules/module.py", line 1795, in _call_impl
      return forward_call(*args, **kwargs)
    File "/home/ruisizhang123/pytorch/torch/_dynamo/eval_frame.py", line 901, in compile_wrapper
      raise e.remove_dynamo_frames() from None  # see TORCHDYNAMO_VERBOSE=1
      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/home/ruisizhang123/pytorch/torch/_dynamo/output_graph.py", line 2359, in _call_user_compiler
      raise BackendCompilerFailed(
          self.compiler_fn, e, inspect.currentframe()
      ).with_traceback(e.__traceback__) from None
    File "/home/ruisizhang123/pytorch/torch/_dynamo/output_graph.py", line 2334, in _call_user_compiler
      compiled_fn = compiler_fn(gm, example_inputs)
    File "/home/ruisizhang123/pytorch/torch/_dynamo/repro/after_dynamo.py", line 156, in __call__
      compiled_gm = compiler_fn(gm, example_inputs)
    File "/home/ruisizhang123/pytorch/torch/__init__.py", line 2441, in __call__
      return self.compiler_fn(model_, inputs_, **self.kwargs)
             ~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/home/ruisizhang123/pytorch/torch/_dynamo/backends/common.py", line 117, in __call__
      cg = aot_module_simplified(gm, example_inputs, **self.kwargs)
    File "/home/ruisizhang123/pytorch/torch/_functorch/aot_autograd.py", line 1100, in aot_module_simplified
      compiled_fn, _ = aot_stage2_compile(
                       ~~~~~~~~~~~~~~~~~~^
          aot_state,
          ^^^^^^^^^^
      ...<4 lines>...
          inference_compiler,
          ^^^^^^^^^^^^^^^^^^^
      )
      ^
    File "/home/ruisizhang123/pytorch/torch/_functorch/_aot_autograd/graph_compile.py", line 257, in aot_stage2_compile
      return aot_stage2_autograd(aot_state, aot_graph_capture)
    File "/home/ruisizhang123/pytorch/torch/_functorch/_aot_autograd/graph_compile.py", line 1696, in aot_stage2_autograd
      compiled_fw_func = aot_config.fw_compiler(fw_module, adjusted_flat_args)
    File "/home/ruisizhang123/torchtitan/torchtitan/experiments/simple_fsdp/backend.py", line 35, in aten_autobucketing_reordering_pass
      schedule_overlap_bucketing(gm)
      ~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^
    File "/home/ruisizhang123/pytorch/torch/_inductor/fx_passes/overlap_scheduling.py", line 755, in schedule_overlap_bucketing
      ).run()
        ~~~^^
    File "/home/ruisizhang123/pytorch/torch/_inductor/fx_passes/overlap_scheduling.py", line 358, in run
      self._align_compute_nodes_runtime_estimations_across_all_distributed_ranks()
      ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^
    File "/home/ruisizhang123/pytorch/torch/_inductor/fx_passes/overlap_scheduling.py", line 337, in _align_compute_nodes_runtime_estimations_across_all_distributed_ranks
      dist.all_gather_object(
      ~~~~~~~~~~~~~~~~~~~~~~^
          gathered_runtime_estimations, runtime_estimations, pg
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      )
      ^
    File "/home/ruisizhang123/pytorch/torch/distributed/c10d_logger.py", line 82, in wrapper
      return func(*args, **kwargs)
    File "/home/ruisizhang123/pytorch/torch/distributed/distributed_c10d.py", line 3170, in all_gather_object
      input_tensor, local_size = _object_to_tensor(obj, current_device, group)
                                 ~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/home/ruisizhang123/pytorch/torch/distributed/distributed_c10d.py", line 3079, in _object_to_tensor
      byte_tensor = torch.ByteTensor(byte_storage).to(device)
                    ~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^
  torch._dynamo.exc.BackendCompilerFailed: backend='compiler_fn' raised:
  RuntimeError: Attempted to set the storage of a tensor on device "cpu" to a storage on different device "meta".  This is no longer allowed; the devices must match.

  Set TORCHDYNAMO_VERBOSE=1 for the internal stack trace (please do this especially if you're reporting a bug to PyTorch). For even more developer context, set TORCH_LOGS="+dynamo"

```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165063
Approved by: https://github.com/eellison
2025-10-14 17:09:54 +00:00
4a7eed527f Make truediv numerics change external only for now (#165328)
Summary: For D84399286, failing ads ne deterministic tests now. These tests are especially brittle with subtle bitwise numerics changes. Will reenable for fbcode once e2e validation tests are performed

Test Plan: N/A

Differential Revision: D84514361

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165328
Approved by: https://github.com/izaitsevfb
2025-10-14 17:08:17 +00:00
d2494cbb2b Revert "[distributed] Replace assert statements with AssertionError exceptions (#165216)"
This reverts commit 74db92b21868b7e9e77cc966e5d57a8246723cbd.

Reverted https://github.com/pytorch/pytorch/pull/165216 on behalf of https://github.com/clee2000 due to I think this broke distributed/test_pg_wrapper.py::ProcessGroupNCCLWrapperTest::test_debug_level_detail_no_gloo [GH job link](https://github.com/pytorch/pytorch/actions/runs/18492765290/job/52693842750) [HUD commit link](74db92b218), note to self: bad TD ([comment](https://github.com/pytorch/pytorch/pull/165216#issuecomment-3402838765))
2025-10-14 17:05:16 +00:00
5eddbb5e47 [annotate] Annotation should be mapped across submod (#165202)
The match for backward nodes might be in a different submod, so we should check all submod for potential matches.

In flex attention, this could happen if `mask_mod` has operations (such as index) that increase the seq_nr of the forward graph nodes. Then the backward flex_attention nodes cannot find a match in its own subgraph.

```
python test/functorch/test_aot_joint_with_descriptors.py -k preserve_annotate
```

Also tested on torchtitan joint_graph_runner branch. The flex_attention backward nodes are annotated now.

```
NGPU=8   CONFIG_FILE="./torchtitan/models/llama3/train_configs/debug_model.toml"   LOG_RANK=0   TRAIN_FILE="torchtitan.train"   TORCHFT_LIGHTHOUSE="http://localhost:29510"   PYTORCH_ALLOC_CONF="expandable_segments:True"   torchrun     --nproc_per_node=8     --rdzv_backend c10d     --rdzv_endpoint="localhost:0"     --local-ranks-filter 0     --role rank     --tee 3     -m torchtitan.train     --job.config_file ./torchtitan/models/llama3/train_configs/debug_model.toml     --model.name joint_graph_runner.llama3     --compile.enable     --parallelism.data_parallel_shard_degree=2     --parallelism.tensor_parallel_degree=4     --model.flavor=debugmodel_flex_attn
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165202
Approved by: https://github.com/SherlockNoMad
2025-10-14 16:19:38 +00:00
c9b2a09530 [export] Turn on install_free_tensors flag (#164691)
The final step in removing the discrepancy between
torch.compile(fullgraph=True) and torch.export(strict=True).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164691
Approved by: https://github.com/avikchaudhuri
2025-10-14 15:33:50 +00:00
bf5aeb3148 [torch/utils][Code Clean] Clean asserts in hipify/, jit/, model_dump and tensorboard of torch/utils (#165311)
Including:
- `torch/utils/hipify/`
- `torch/utils/jit/`
- `torch/utils/model_dump/`
- `torch/utils/tensorboard/`

Fixes part of #164878

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165311
Approved by: https://github.com/albanD
2025-10-14 15:26:23 +00:00
45b8c0f75c [distributed] Replace 54 assert statements in tensor/_ops/_tensor_ops.py (#165226)
Replace assert statements with explicit if/raise patterns to prevent assertions from being disabled with Python -O flag.

Fixes partially #164878

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165226
Approved by: https://github.com/albanD
2025-10-14 15:10:03 +00:00
c733072874 Fix IValue from SymBool on big-endian system (#163647)
Skip test_compiled_autograd_attribution on s390x

It fails both on s390x and x86_64 at least under some circumstances. Disable it for now until on s390x until it works reliably.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163647
Approved by: https://github.com/malfet
2025-10-14 15:07:48 +00:00
fbe0d20a17 [2/N] More ruff SIM fixes (#165031)
This is follow-up of #164695 to apply ruff SIM rules to more files. Most changes are about simplifying dict.get because None is already the default value.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165031
Approved by: https://github.com/mlazos
2025-10-14 14:22:54 +00:00
1fa11f42b1 [Bugfix][vLLM] Explicitly do not support instead of crashing for named tuples in infer schema (#165191)
Fixes https://github.com/vllm-project/vllm/issues/25270 by being explicit in erroring; previously we had a cryptic `__origin__ undefined` error, but now should give proper error message that we don't support NamedTuples in schema

Test with
```
python test/test_custom_ops.py TestCustomOp.test_unsupported_param_types
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165191
Approved by: https://github.com/zou3519
2025-10-14 14:18:42 +00:00
6f713e25bb [CodeClean] Replace std::runtime_error with TORCH_CHECK (#164130)
As the title stated.

**Changes**:
- torch/csrc/inductor(Part 1)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164130
Approved by: https://github.com/albanD, https://github.com/Skylion007
2025-10-14 14:09:53 +00:00
09a4187b8e Update windows cuda build to use 12.8 (#165345)
As title

Motivation: The rest of the pytorch and inductor build is using 12.8 and we're deprecating cuda 12.6 builds soon per https://github.com/pytorch/pytorch/issues/165111

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165345
Approved by: https://github.com/atalman, https://github.com/malfet
2025-10-14 13:58:20 +00:00
306c55ba27 [atomically_apply_size_hint] Make unbacked replacements reconciles to a single expr (#164324)
## Problem
Okay there's limitations with today's `atomically_apply_size_hint` though it works for most observed failures we've seen so far. However, it's easy to come up with an edge case.

Suppose you encounter this setup.
```
a: [s0 + u0]
b: [s1 + u1]
c: [u2 + u3]
d: [u100]
```

Today, we use a few heuristics to specify the LHS and RHS for replacements.

10d2734d9b/torch/_inductor/sizevars.py (L730-L759)

It's possible to end up with these replacement rules. Notice how there's no replacement for `s1 + u1` and `u2 + u3` :( That's because today picking the LHS and RHS matters a lot, and `s1 + u1` & `u2 + u3` happened to end up on the RHS.
```
s0 + u0 => s1 + u1
s0 + u0 => u2 + u3         # overrides previous replacement; each expr only gets one replacement
s0 + u0 => u100            # overrides previous replacement; ditto
```

I believe what we really want is this: everybody gets a replacement! And they all should (eventually) settle at the same canonical expr (i.e. `u100`) when running the replacement several times.
```
s1 + u1 ==> s0 + u0
u2 + u3 ==> s0 + u0
s0 + u0 ==> u100
```

We can just short-cut this by using the canonical expr as the replacement.
```
s1 + u1 ==> u100
u2 + u3 ==> u100
s0 + u0 ==> u100
```

## Implementation

I offer one way to deal with this:
1. assure every expression has one canonical replacement (i.e. `u100`)
2. if two expressions are equal (inferred from `deferred_runtime_asserts`), then they must have the same canonical replacement

 We can implement the above with union find.
* Whenever you see `Eq(lhs, rhs)` then do `union(lhs, rhs)`.
* Whenever you want to find the canonical replacement for a given expr then do `find(expr)`.
* When picking the canonical replacement we can use a few heuristics like (1) prefer a fully backed expr, (2) replacing with sub-expressions, and whatever we'd like.

Differential Revision: [D84549260](https://our.internmc.facebook.com/intern/diff/D84549260)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164324
Approved by: https://github.com/laithsakka
2025-10-14 13:57:33 +00:00
56d6229ff9 [MPS] fix comment for normcdf (#165233)
Just a small comment fix for normcdf
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165233
Approved by: https://github.com/malfet
2025-10-14 13:56:31 +00:00
74db92b218 [distributed] Replace assert statements with AssertionError exceptions (#165216)
Replaces 71 assert statements across 11 files in `torch.distributed` with explicit if-checks raising AssertionError to prevent assertions from being disabled with Python -O flag.

Fixes #164878

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165216
Approved by: https://github.com/albanD
2025-10-14 09:58:59 +00:00
c48843e4c6 [CP][BE] Docstrings, comments polish and remove unused variables (#165039)
No logic change, just polish the docstrings, comments and remove unused variables

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165039
Approved by: https://github.com/XilunWu
ghstack dependencies: #162542, #164500, #163185
2025-10-14 09:35:32 +00:00
9e89b1c4c7 Update torch-xpu-ops commit pin (#165321)
Update the torch-xpu-ops commit to [intel/torch-xpu-ops@ce9db1](ce9db15136), includes:

- Fix test_barrier hang by using static global rank in ProcessGroupXCCL
- Update install_xpu_headers only when content should change to speedup recompilation
- Add global rank information to communication logging
- Remove duplicate normalization from FFT methods
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165321
Approved by: https://github.com/EikanWang
2025-10-14 09:07:24 +00:00
c5972ebdfb Revert "Update windows cuda build to use 12.8 (#165345)"
This reverts commit ca96c675001fa87b9d9c648972415ab8b1591f11.

Reverted https://github.com/pytorch/pytorch/pull/165345 on behalf of https://github.com/pytorch-auto-revert due to Reverted automatically by pytorch's autorevert, to avoid this behaviour add the tag autorevert: disable ([comment](https://github.com/pytorch/pytorch/pull/165345#issuecomment-3400344079))
2025-10-14 06:46:33 +00:00
18b3658df9 [inductor][ez] properly print Pointwise (#165369)
Previously when we print a ComputedBuffer for reduction, we get something like:
```
ComputedBuffer(name='buf0', layout=FixedLayout('cuda:0', torch.float32, size=[1, 768], stride=[768, 1]), data=Reduction(
  'cuda',
  torch.float32,
  def inner_fn(index, rindex):
      _, i1 = index
      r0_0 = rindex
      tmp0 = ops.load(tangents_1, i1 + 768 * r0_0)
      tmp1 = ops.to_dtype(tmp0, torch.float32, src_dtype=torch.bfloat16)
      tmp2 = ops.load(primals_1, i1 + 768 * r0_0)
      tmp3 = ops.to_dtype(tmp2, torch.float32, src_dtype=torch.bfloat16)
      tmp4 = ops.load(rsqrt, r0_0)
      tmp5 = tmp3 * tmp4
      tmp6 = tmp1 * tmp5
      return tmp6
  ,
```
But if we print a ComputedBuffer for a pointwise, we get something like
```
ComputedBuffer(name='buf2', layout=FixedLayout('cuda:0', torch.bfloat16, size=[32768, 768], stride=[768, 1]), data=Pointwise(device=device(type='cuda', index=0), dtype=torch.bfloat16, inner_fn=<function make_pointwise.<locals>.inner.<locals>.inner_fn at 0x7f12922c5bc0>, ranges=[32768, 768]))

```

Note that the inner function str is not printed.

With the change, we get the inner_fn string printed in this case:
```

ComputedBuffer(name='buf2', layout=FixedLayout('cuda:0', torch.bfloat16, size=[32768, 768], stride=[768, 1]), data=Pointwise(       14:42:46 [25/1988]
  'cuda',
  torch.bfloat16,
  def inner_fn(index):
      i0, i1 = index
      tmp0 = ops.load(tangents_1, i1 + 768 * i0)
      tmp1 = ops.to_dtype(tmp0, torch.float32, src_dtype=torch.bfloat16)
      tmp2 = ops.load(primals_2, i1)
      tmp3 = tmp1 * tmp2
      tmp4 = ops.load(rsqrt, i0)
      tmp5 = tmp3 * tmp4
      tmp6 = ops.load(buf1, i0)
      tmp7 = ops.constant(-0.5, torch.float32)
      tmp8 = tmp6 * tmp7
      tmp9 = ops.load(rsqrt, i0)
      tmp10 = tmp9 * tmp9
      tmp11 = tmp10 * tmp9
      tmp12 = tmp8 * tmp11
      tmp13 = ops.constant(0.0013020833333333333, torch.float32)
      tmp14 = tmp12 * tmp13
      tmp15 = ops.load(primals_1, i1 + 768 * i0)
      tmp16 = ops.to_dtype(tmp15, torch.float32, src_dtype=torch.bfloat16)
      tmp17 = tmp14 * tmp16
      tmp18 = tmp5 + tmp17
      tmp19 = ops.load(buf1, i0)
      tmp20 = ops.constant(-0.5, torch.float32)
      tmp21 = tmp19 * tmp20
      tmp22 = ops.load(rsqrt, i0)
      tmp23 = tmp22 * tmp22
      tmp24 = tmp23 * tmp22
      tmp25 = tmp21 * tmp24
      tmp26 = ops.constant(0.0013020833333333333, torch.float32)
      tmp27 = tmp25 * tmp26
      tmp28 = ops.load(primals_1, i1 + 768 * i0)
      tmp29 = ops.to_dtype(tmp28, torch.float32, src_dtype=torch.bfloat16)
      tmp30 = tmp27 * tmp29
      tmp31 = tmp18 + tmp30
      tmp32 = ops.to_dtype(tmp31, torch.bfloat16, src_dtype=torch.float32)
      return tmp32
  ,
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165369
Approved by: https://github.com/eellison
2025-10-14 06:08:12 +00:00
5fbf93b774 Introduce automatic wrapper to run DTensor tests under local tensor mode (#165383)
The wrapper enable to share test body implementation while eliminating need test class by hand. As an example, this change converts the whole DTensorTest to use local tensor mode.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165383
Approved by: https://github.com/ezyang
2025-10-14 06:08:03 +00:00
a856a17799 bf16 support for per_channel bwd (#165325)
Follow up to #165098 - adding bf16 support for the backward pass. To avoid BC breaking changes/losing precision, we upcast the parameters to fp32 after the op gets called, and downcast the gradients to bf16 before returning.

For testing, we upcast to fp32 before calling the reference function. We increase the tolerance to 1e-2 for bf16 inputs because of a difference in casting calculations between python's `x.to(torch.bfloat16)` and cpp's `x.to(at::kBFloat16)` (after comparing intermediate tensors, we found that the numerics diverge after the final casting). We don't explicitly cast in the CPP op but rather let autograd/optimizer handle it.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165325
Approved by: https://github.com/andrewor14
2025-10-14 05:47:32 +00:00
bc6e08954d [user-cuda-streams] Add fork/join custom ops (#162900)
Creates the fork/join stream ops. These ops are passthrough ops which mutate all of their args (without actually performing any computation on them) so that during functionalization, implicit dependencies are added on all of their args. This allows us to prevent reordering during our pre/post grad graph passes.

Make custom ops inplace

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162900
Approved by: https://github.com/anijain2305
ghstack dependencies: #163027, #162899, #163028
2025-10-14 05:43:19 +00:00
45a96b2081 [user-streams] Handle aliasing properly (#163028)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163028
Approved by: https://github.com/williamwen42, https://github.com/anijain2305
ghstack dependencies: #163027, #162899
2025-10-14 05:43:19 +00:00
04e36611bb [user-cuda-streams] Pass streams/events to the graph via lookup table (#162899)
Stores streams in a global object look table that maps a dynamo selected index to objects. This index is generated during tracing, and at runtime, a helper function is called from the bytecode to populate this map.

This differs from the previous implementation that simply mapped IDs to the associated objects. This required specialization on the IDs of the specific objects, while this new approach does not.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162899
Approved by: https://github.com/anijain2305
ghstack dependencies: #163027
2025-10-14 05:43:19 +00:00
f15c25d5c3 [user-streams] Move stream code to streams module (#163027)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163027
Approved by: https://github.com/StrongerXi, https://github.com/anijain2305
2025-10-14 05:43:19 +00:00
e93981c243 [PyTorch][aarch64] Cast to signed char to fix aarch64 build (#165021)
Summary:
Initial fix: D39198776
Reverted by clang-tidy bot: D83948172

Test Plan:
Can now build on aarch64
{P1983767795}

Reviewed By: bigning

Differential Revision: D84203406

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165021
Approved by: https://github.com/cyyever, https://github.com/Skylion007
2025-10-14 05:37:34 +00:00
496adf9f9c Replace insert with std::rotate_copy for RingBuffer (#165348)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165348
Approved by: https://github.com/eqy, https://github.com/Skylion007
2025-10-14 05:11:28 +00:00
33bfec27ff Revert "use sym_numel, to allow fake tensors to work (#163831)"
This reverts commit e71c75680f2d6ce5f61ad4b2125f4934087762eb.

Reverted https://github.com/pytorch/pytorch/pull/163831 on behalf of https://github.com/isuruf due to test failure on mps introduced ([comment](https://github.com/pytorch/pytorch/pull/163831#issuecomment-3400131730))
2025-10-14 05:10:56 +00:00
f44935cc14 [torch/utils][Code Clean] Clean asserts in torch/utils/_sympy (#165279)
Including: `torch/utils/_sympy/`

Fixes part of #164878

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165279
Approved by: https://github.com/albanD
2025-10-14 04:52:23 +00:00
39116409a1 [torch/utils][Code Clean] Clean asserts in benchmark/ and data/ in torch/utils/ (#165299)
Including:
- `torch/utils/benchmarks/`
- `torch/utils/data/`

Fixes part of #164878

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165299
Approved by: https://github.com/albanD
2025-10-14 04:50:39 +00:00
515d1326c1 Add CLAUDE_CONTEXT directory to gitignore (#165358)
Claude often adds a bunch of MD files or other stuff that is specific to a local session, add a folder for claude to put this stuff that doesn't get checked into the repo
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165358
Approved by: https://github.com/oulgen
2025-10-14 04:47:21 +00:00
ac529df244 Native matmul (#157743)
### Implementation of #151705

This PR introduces the initial implementation of native `tl.dot` support in Inductor, with the goal of generating Triton matmul kernels directly—without relying on predefined templates.

To avoid complexity and ease the review process, I plan to split this work into two phases as outlined in #151705:

1. **Basic support** (this PR)
2. **Lazy broadcasting** for optimal performance (future PR)

### Summary of This PR

This PR implements the basic functionality. It does **not** include lazy broadcasting, so the generated kernels may involve explicit `tl.reshape` and `tl.trans` operations before calling `tl.dot`, which introduces some overhead.

### Notable Changes

1. Adds a new config flag: `config.triton.enable_native_matmul`
2. Introduces a new `ops.dot` IR node in Inductor and lowers `aten.mm` and `aten.bmm` to it when native matmul is enabled
3. Enforces tililng suitable for matmul when the native matmul flag is enabled
4. Implements code generation for `ops.dot`
5. Adds Triton autotuning heuristics: for now, I’ve copied the configuration from the existing matmul templates. However, this may not be optimal—it currently takes a long time to tune, and I think there must be a better way to tackle this.

@eellison @jansel @PaulZhang12 @shunting314

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157743
Approved by: https://github.com/jansel
2025-10-14 04:22:30 +00:00
fa3916f466 Revert "[export] Turn on install_free_tensors flag (#164691)"
This reverts commit 220a34118f40fab4f3f517556d6e1434139a1590.

Reverted https://github.com/pytorch/pytorch/pull/164691 on behalf of https://github.com/seemethere due to Breaks some internal things, both me and author agreed that revert was the best course of action ([comment](https://github.com/pytorch/pytorch/pull/164691#issuecomment-3400013759))
2025-10-14 03:58:12 +00:00
267348fe7f Revert "Fix double dispatch to Python for detach (#163671)"
This reverts commit a3e3efe474bef63940ded803e78bb2a382681f1e.

Reverted https://github.com/pytorch/pytorch/pull/163671 on behalf of https://github.com/seemethere due to We should've reverted this when we decided to revert https://github.com/pytorch/pytorch/pull/164691 since they were actually stacked ([comment](https://github.com/pytorch/pytorch/pull/163671#issuecomment-3400009953))
2025-10-14 03:55:36 +00:00
1803d40c99 Reapply "[export] Turn on install_free_tensors flag (#164691)" (#165353)
This reverts commit 9166f6120f63e2d5d76e6ccdbfccb8d6e41cbb43.

Reverted https://github.com/pytorch/pytorch/pull/165353 on behalf of https://github.com/seemethere due to This is causing merge conflicts since a dependent PR wasn't reverted ([comment](https://github.com/pytorch/pytorch/pull/165353#issuecomment-3400006587))
2025-10-14 03:52:50 +00:00
29c5368e0f MTIA _cdist_forward registration (#165333)
Summary: Added registration for _cdist_forward on MTIA

Differential Revision: D84357997

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165333
Approved by: https://github.com/albanD
2025-10-14 03:51:31 +00:00
e71c75680f use sym_numel, to allow fake tensors to work (#163831)
Fixes #[163759](https://github.com/pytorch/pytorch/issues/163759)

Replace `numel` with `sym_numel`. Tested with example in issue and it works now .

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163831
Approved by: https://github.com/bobrenjc93
2025-10-14 03:33:28 +00:00
ca96c67500 Update windows cuda build to use 12.8 (#165345)
As title

Motivation: The rest of the pytorch and inductor build is using 12.8 and we're deprecating cuda 12.6 builds soon per https://github.com/pytorch/pytorch/issues/165111

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165345
Approved by: https://github.com/atalman
2025-10-14 02:33:44 +00:00
770e6b910c [DTensor] Extend conv ops to 3D (#165241)
Current implementation hardcodes 4D input and output tensor shapes

Change that by computing `output_conv_shape` for any number of input dims
Replace `[.., .., .., slice]` with `[..., slice]`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165241
Approved by: https://github.com/ezyang
2025-10-14 02:30:46 +00:00
37d57ac9cb Use sym_eq in _check_rms_norm_inputs_symint (#165112)
Summary:
### Problem
ArrayRef's `equals()`does elementwise quality using `==` operator. This can cause a DDE for unbacked symints since `==`  operator calls `guard_bool`.
```
// SymInt.h
bool operator==(const SymInt& o) const {
  return sym_eq(o).guard_bool(__FILE__, __LINE__);
}
```

### Solution
Adds `sym_equals()` to do elementwise equality for `SymIntArrayRef`. Use this instead of `equals()` for `SymIntArrayRef`.

Reviewed By: guangy10, pianpwk, muchulee8

Differential Revision: D84168401

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165112
Approved by: https://github.com/Skylion007
2025-10-14 00:06:24 +00:00
9166f6120f Revert "[export] Turn on install_free_tensors flag (#164691)" (#165353)
This reverts commit 220a34118f40fab4f3f517556d6e1434139a1590.

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165353
Approved by: https://github.com/seemethere
2025-10-13 23:40:11 +00:00
fb0291d14b [pt2][caching] fix runtime error in context on cpu-only machine when compile for gpu (#165220)
re https://github.com/pytorch/pytorch/pull/165186

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165220
Approved by: https://github.com/clee2000
2025-10-13 22:47:41 +00:00
f3683453ae [compile] Regional inductor compilation with fx.annotate (#164776)
This PR introduces a way to compile a region of FX graph using `fx.traceback.annotate`.

### UX

1) In the user code, mark the region that you want to be compiled with inductor using `with fx_traceback.annotate({"compile_with_inductor": 0})`. As of now, we just rely on the string `compile_with_inductor` and ignore the integer. As the needs arise, we can update the logic.

Example

```
        def fn(x, y):
            sin = torch.sin(x)

            with fx_traceback.annotate({"compile_with_inductor": 0}):
                mul = sin * y
                add = mul + 1

            return torch.sin(add)
```

2) You have to instruct the compiler to use the annotations with `compile_fx_annotated_nodes_with_inductor` transformation. This is somewhat controversial, and a user might expect that just setting annotation is enough. But for now to control the blast radius, we need to explicitly do this. One such example is

```

# Set the fw and bw compiler of aot_autograd to `compile_fx_annotated_nodes_with_inductor`
def aot_eager_regional_inductor():
    return aot_autograd(
        fw_compiler=compile_fx_annotated_nodes_with_inductor,
        bw_compiler=compile_fx_annotated_nodes_with_inductor,
    )

```

3) Fixable in short-term - You have to wrap the user code in `torch.fx.traceback.preserve_node_meta` to ensure that annotations are propagated to the compiler. This is fixable, just need to make CI happy.

### Implementation

1) Relies on `CapabilityBasedPartitioner` to "scoop" out regions based on annotations, and then create subgraphs in the main graph.
2) Call `torch._inductor.standalone_compile` on these subgraphs, and jam the returned callable into the FX graph at the place of call_module

Resulting graph looks something like this - search for `torch__inductor_standalone_compile_inner`

Forward graph
```
class GraphModule(torch.nn.Module):
    def forward(self, primals_1: "f32[10]", primals_2: "f32[10]"):
         # File: /data/users/anijain/pytorch2/test/dynamo/test_regional_inductor.py:64 in fn, code: sin = torch.sin(x)
        sin: "f32[10]" = torch.ops.aten.sin.default(primals_1)

        # No stacktrace found for following nodes
        inner = torch__inductor_standalone_compile_inner(sin, primals_2)

         # File: /data/users/anijain/pytorch2/test/dynamo/test_regional_inductor.py:68 in fn, code: add = mul + 1
        getitem: "f32[10]" = inner[0];  inner = None

         # File: /data/users/anijain/pytorch2/test/dynamo/test_regional_inductor.py:70 in fn, code: return torch.sin(add)
        sin_1: "f32[10]" = torch.ops.aten.sin.default(getitem)
        return (sin_1, primals_1, primals_2, sin, getitem)
```

Backward graph
```
class GraphModule(torch.nn.Module):
    def forward(self, primals_1: "f32[10]", primals_2: "f32[10]", sin: "f32[10]", add: "f32[10]", tangents_1: "f32[10]"):
         # File: /data/users/anijain/pytorch2/test/dynamo/test_regional_inductor.py:64 in fn, code: sin = torch.sin(x)
        cos_1: "f32[10]" = torch.ops.aten.cos.default(primals_1);  primals_1 = None

         # File: /data/users/anijain/pytorch2/test/dynamo/test_regional_inductor.py:70 in fn, code: return torch.sin(add)
        cos: "f32[10]" = torch.ops.aten.cos.default(add);  add = None
        mul_1: "f32[10]" = torch.ops.aten.mul.Tensor(tangents_1, cos);  tangents_1 = cos = None

        # No stacktrace found for following nodes
        inner = torch__inductor_standalone_compile_inner(mul_1, sin, primals_2);  mul_1 = sin = primals_2 = None

         # File: /data/users/anijain/pytorch2/test/dynamo/test_regional_inductor.py:67 in fn, code: mul = sin * y
        getitem: "f32[10]" = inner[0]
        getitem_1: "f32[10]" = inner[1];  inner = None

         # File: /data/users/anijain/pytorch2/test/dynamo/test_regional_inductor.py:64 in fn, code: sin = torch.sin(x)
        mul_4: "f32[10]" = torch.ops.aten.mul.Tensor(getitem_1, cos_1);  getitem_1 = cos_1 = None
        return (mul_4, getitem)
```

### Some issue raised in the HOP meeting
1) CSE will not differentiate different meta custom nodes and do wrong thing.
2) SAC - The recomputed forward will be smaller than the forward. Will we compile a smaller region than?
3) What happens if you have a op in the middle which does not disturb the topology, is it still 1 subgraph?
4) What happens with the nesting of `fx_traceback.annotate`? Are there any ordering requirements?
5) What are we going to use the annotations for?
   a) compile flex
   b) streams
   c) nn.Module info to organize MoE components for pipelining
   d) PP stages
   e) Rename graph nodes for more debugging
   f) No nested regional compile

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164776
Approved by: https://github.com/SherlockNoMad
ghstack dependencies: #165188
2025-10-13 22:22:20 +00:00
1191e51c44 [dynamo][annotate] Remove the need of external ctx mgr of preserve_node_meta (#165188)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165188
Approved by: https://github.com/yushangdi
2025-10-13 22:22:20 +00:00
3edd94485f [5/N][DTensor device order] Implement graph based redistribution algorithm (#164902)
(Extract out the algorithm from https://github.com/pytorch/pytorch/pull/160266.)

Build a graph to search for the path from source placement to destination placement (with device order). Currently solution introduces too many all-gathers and missing the opportunity for all-to-all when redistribute, especially when we consider the device order.

### How to build the graph:
When operator of Shard, think of collective op as operation on a stack of device axis:
- I, J are tensor dimensions;
- X, Y, Z, Y are ordered mesh dimensions.
<img width="357" height="253" alt="image" src="https://github.com/user-attachments/assets/23bb3cc3-0506-4071-9053-3c525cf0e526" />

Detailed collective op transition is implemented in `DTensorRedistributePlanner.get_next_state`.

### How to find the min cost path:
Assign weight to different type of collective ops and use Dijkstra to find the min cost path from the graph we build.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164902
Approved by: https://github.com/ezyang
2025-10-13 22:03:57 +00:00
a701c937bf [dynamo][executorch] Return already added nn.Module during registration (#165338)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165338
Approved by: https://github.com/tugsbayasgalan
2025-10-13 21:24:07 +00:00
ecb53078fa Turn some const strings into constexpr in C++ code (#165203)
This PR turns more const strings into constexpr.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165203
Approved by: https://github.com/Skylion007
2025-10-13 20:25:20 +00:00
fa95882093 [BE] document distributed apis (#165194)
This PR documents some `torch.distributed.distributed_c10d` APIs. Below are some screenshots of the rendered docs.

<img width="909" height="527" alt="Screenshot 2025-10-10 at 10 18 40 PM" src="https://github.com/user-attachments/assets/555ae886-bead-47f3-8c67-9bc91c14bd11" />
<img width="885" height="548" alt="Screenshot 2025-10-10 at 10 18 47 PM" src="https://github.com/user-attachments/assets/1d6f7af1-db28-40f9-927e-5c47668a1a88" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165194
Approved by: https://github.com/janeyx99
2025-10-13 20:13:59 +00:00
a71ca4dcb9 Revert "[opaque_obj_v2] PyObject custom op schema type (#165004)"
This reverts commit 3faee200674c0c2bca3f395a063264cfd8a9a5b7.

Reverted https://github.com/pytorch/pytorch/pull/165004 on behalf of https://github.com/seemethere due to This fails internal tests, see D84399300 ([comment](https://github.com/pytorch/pytorch/pull/165004#issuecomment-3398906856))
2025-10-13 20:08:38 +00:00
c44d638b15 [Easy][Test][Dynamo] Avoid direct string comparison in MiscTestsDevice::get_device_module (#165314)
Fixes a small issue on string comparison, as the test fails with:
```
AssertionError: String comparison failed: 'cuda' != 'cuda:0'
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165314
Approved by: https://github.com/soulitzer
2025-10-13 19:58:59 +00:00
7c015334a3 Remove FIXME comment about reset_max_memory_reserved (#165249)
The function doesn't actually exist https://github.com/pytorch/pytorch/blob/main/torch/cuda/__init__.py#L1816

Fixes https://github.com/pytorch/pytorch/issues/27785

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165249
Approved by: https://github.com/svekars
2025-10-13 19:44:40 +00:00
cad2d473bf Force inlining into torch_function_mode_enabled (#164617)
This function is relatively hot; inlining here reduces time reported by `python -m timeit --setup 'import torch; t = torch.tensor([1])' 't._cdata'` from about 125 nsec/loop to about 110 nsec/loop. (To be fair, variance is high, but I did confirm with perf that time in this path seems to have roughly halved during torchtitan training.)

Note that locally I am getting bit by a GCC bug that I documented in a comment. Would be interested to hear if this does anything for clang.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164617
Approved by: https://github.com/ezyang
2025-10-13 19:25:51 +00:00
cb328c0b20 [ONNX] TorchTensor supports tofile() (#165195)
Fixes #165120

ref: 43ebf47bb5/src/onnx_ir/tensor_adapters.py (L171-L200)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165195
Approved by: https://github.com/justinchuby
2025-10-13 19:12:06 +00:00
64699b8042 [trymerge] Do not check for rules when reverting (#165342)
Why do we need to check for merge rules when reverting?
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165342
Approved by: https://github.com/malfet
2025-10-13 19:07:00 +00:00
dcce473352 [BE] Fix unused parameter warning (#165272)
Fixes
```
[23/1155] Compiling /Users/malfet/git/pytorch/pytorch/aten/src/ATen/native/mps/kernels/EmbeddingBag.metal to EmbeddingBag_31.air
/Users/malfet/git/pytorch/pytorch/aten/src/ATen/native/mps/kernels/EmbeddingBag.metal:252:62: warning: unused parameter 'bag_size' [-Wunused-parameter]
  inline opmath_t<T> operator()(opmath_t<T> val, opmath_t<T> bag_size) {
                                                             ^
1 warning generated.
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165272
Approved by: https://github.com/Skylion007
2025-10-13 18:52:51 +00:00
c41e52118d Fix loop pipelining for 2d/2d case of Triton grouped MM (#165265)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165265
Approved by: https://github.com/ngimel
2025-10-13 18:45:39 +00:00
955cd7060b Revert "Update round size with 1 division behavior (#162203)"
This reverts commit 12d2ef557f6e127100267c31a31572d8ab5cc788.

Reverted https://github.com/pytorch/pytorch/pull/162203 on behalf of https://github.com/izaitsevfb due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/162203#issuecomment-3398622898))
2025-10-13 18:32:37 +00:00
0ce945790e [NJT] Fix schema validation error in jagged functions (#165307)
Fixes #161812
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165307
Approved by: https://github.com/soulitzer
2025-10-13 17:59:18 +00:00
70ec464c16 [BE] document some quantization public apis (#165160)
This PR documents some apis in `torch.ao.quantization.utils`

<img width="885" height="296" alt="Screenshot 2025-10-10 at 4 38 10 PM" src="https://github.com/user-attachments/assets/4323a6f5-ac3a-4f2e-ba00-35f3b208bef4" />
<img width="876" height="319" alt="Screenshot 2025-10-10 at 4 38 14 PM" src="https://github.com/user-attachments/assets/164822c3-9740-46f9-953d-bb20c77bcf69" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165160
Approved by: https://github.com/janeyx99
2025-10-13 17:24:42 +00:00
2c600bb665 [torchfuzz] fix some errors when walkthroughing README.md (#165225)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165225
Approved by: https://github.com/soulitzer
2025-10-13 17:17:50 +00:00
e93343cfab [CP] Introduce flex_cp_forward custom op for FlexAttention CP (#163185)
The custom op will fetch the required K and V. Currently, the forward pass is just an all-gather, and the backward pass is a reduce-scatter.  While the logic is the same as all_gather_tensor_autograd, the custom op avoids the Autograd warning that wait_tensor() is registered to autograd.

For the next step, we should explore how to interpolate the required communication based on the information from BlockMask.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163185
Approved by: https://github.com/XilunWu
ghstack dependencies: #162542, #164500
2025-10-13 17:16:32 +00:00
c86a7c5f5e Disable failing test_int8_woq_mm_concat_cuda on slow grad check (#165331)
Same as https://github.com/pytorch/pytorch/pull/165147, I missed some

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165331
Approved by: https://github.com/bbeckca
2025-10-13 17:08:00 +00:00
4e420415e8 Avoids calling builtin iter if object is a generator (#162521)
The `iter(gen)` call will return the given `gen` object. So, we just avoid this call and shaves off a few ms of tracing time

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162521
Approved by: https://github.com/mlazos
2025-10-13 17:07:54 +00:00
83cbba8759 [MPS] Support large tensors in torch.cat (#164416)
Fixes #164415
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164416
Approved by: https://github.com/malfet
2025-10-13 16:56:56 +00:00
684df93975 [CI] Default keep-going true for tags of form ciflow/something/commitsha (#165180)
Tags of the form `ciflow/something/commitsha` are usually created by running the workflow from HUD

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165180
Approved by: https://github.com/huydhn
2025-10-13 16:12:37 +00:00
a3e3efe474 Fix double dispatch to Python for detach (#163671)
This fixes #71725.

Differential Revision: [D83857880](https://our.internmc.facebook.com/intern/diff/D83857880)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163671
Approved by: https://github.com/ezyang, https://github.com/albanD
2025-10-13 16:10:17 +00:00
6bda3bb286 [PP] Fix split_args_kwargs_into_chunks issues (#165306)
1. https://github.com/pytorch/pytorch/pull/164111/ adds the support of splitting BlockMask. But BlockMask actually has B=1 case that the BlockMask will be broadcast. This PR adds the support of B=1 case.

2. The original split_args_kwargs_into_chunks doesn't initialize the default specs correctly. Since we now use tree_flatten and tree_unflatten to do split, we should also use tree_map to initialize the default spec. This will actually support the case when the values are not torch.Tensor, which were only supported if users explicitly provide the shard spec.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165306
Approved by: https://github.com/H-Huang
2025-10-13 15:52:39 +00:00
8580112682 Revert "[dynamo][DebugMode] mask python keys in dispatch_key_set guard checks (#164992)"
This reverts commit 306b344a1847749f0baf085dcd92560f4e99cd1b.

Reverted https://github.com/pytorch/pytorch/pull/164992 on behalf of https://github.com/jeffdaily due to broke ROCm CI test/inductor/test_inductor_scheduler.py::TestSchedulerCUDA::test_flop_counter_op_options0_cuda_float32 [GH job link](https://github.com/pytorch/pytorch/actions/runs/18417066364/job/52485636942) [HUD commit link](306b344a18) ([comment](https://github.com/pytorch/pytorch/pull/164992#issuecomment-3397927142))
2025-10-13 15:14:34 +00:00
4874cce52f [xla hash update] update the pinned xla hash (#165302)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned xla hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165302
Approved by: https://github.com/pytorchbot
2025-10-13 12:36:29 +00:00
c509a78645 Update slow tests (#165301)
This PR is auto-generated weekly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/weekly.yml).
Update the list of slow tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165301
Approved by: https://github.com/pytorchbot
2025-10-13 11:47:32 +00:00
8461b63f2c [CP] Replace context_parallel context manager with functional APIs (#164500)
`context_parallel()` being a context manager has annoyed users. Now that we plan to redesign CP's UX to explicitly ask users to:

1. Wrap the attention op into an `nn.Module`
2. Lift any buffers that are not sequence agnostic to input

We can replace `context_parallel()` with two functional APIs: `_context_parallel_shard` and `_enable_context_parallel_dispatcher`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164500
Approved by: https://github.com/XilunWu
ghstack dependencies: #162542
2025-10-13 06:30:18 +00:00
957b0e9793 [vision hash update] update the pinned vision hash (#165017)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned vision hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165017
Approved by: https://github.com/pytorchbot
2025-10-13 04:35:52 +00:00
b04def139e [audio hash update] update the pinned audio hash (#165113)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned audio hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165113
Approved by: https://github.com/pytorchbot
2025-10-13 04:35:36 +00:00
59ad8f1ac6 [XPU] Enhance XPUGeneratorImpl functionality to support XPUGraph (#163332)
As this [XPUGraph RFC](https://github.com/pytorch/pytorch/issues/162143) descripted. This PR enhances `XPUGeneratorImpl` to support XPUGraph.
In this PR, we add `XPUGerneratorState` and `PhiloxXpuState`. Which makes XPUGraph update philox state during graph capture and replay correctly

XPUGraph PR submission plan:

- [ ] 1, Enhance XPUGenerator functionality. Add XPUGeneratorState and philoxState
- [ ] 2, implemenet XPUGraph capture_begin/capture_end/instantiate functionality
- [ ] 3, implemenet XPUGraph replay/debug_dump/reset functionality
- [ ] 4, python APIs: is_current_stream_capturing/graph_pool_handle/graph
- [ ] 5, python APIs: make_graphed_callables

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163332
Approved by: https://github.com/gujinghui, https://github.com/EikanWang, https://github.com/albanD
2025-10-13 02:10:41 +00:00
8de85896e0 Enable ruff rule E721 (#165162)
`E721` checks for object type comparisons using == and other comparison operators. This is useful because it is recommended to use `is` for type comparisons.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165162
Approved by: https://github.com/Skylion007
2025-10-13 01:48:55 +00:00
a33f85e791 Add tlparse artifact for autotune_at_compile_time (#164984)
This is useful for inspecting autotuning code when `autotune_at_compile_time=True`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164984
Approved by: https://github.com/yushangdi, https://github.com/desertfire
2025-10-12 23:38:11 +00:00
5e58420dff LocalTensor (#164537)
A LocalTensor is a tensor subclass which simulates a tensor that is
distributed across SPMD ranks.  A LocalTensor might be size N, but in fact
there are world_size shards/replicas of it stored internally.  When you do a
plain PyTorch operation on it, we apply the operation to each shard; when you
do a collective, we do the mathematically equivalent operation on the local
shards.  A LocalTensor is associated with a list of ranks which specify
which ranks it holds local tensors for.

NB, this is NOT a DataParallel like abstraction where you can run operations
on multiple different GPUs. It is intended purely for *debugging* purposes,
the overhead is almost certainly too high to keep eight GPUs (even the C++
autograd needs multithreading to keep up!)  (It might potentially be possible
to trace through this with torch.compile and then compile it with CUDA graphs
but this is currently a non-goal.)

In order to handle MPMD, we provide a helper decorator that allows you to
run a function with no side effects for each LocalTensor shard and combine
results back into LocalTensor or LocalIntNode.

Note: This PR convert all DTensor ops and some DTensor tests to illustrate
intended usage and ensure conrrectness. In subsequent PR more tests will be
converted. DUring test conversion we aim to share as much as possible of
test logic between multi-process / multi-threaded and local tensor tests.
We would like to developers to be able to run both flavors of the tests.

Note: This work is based on the original proposal
by @ezyang (WIP PR https://github.com/pytorch/pytorch/pull/162753).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164537
Approved by: https://github.com/ezyang
2025-10-12 20:06:41 +00:00
a2601630cd [vllm hash update] update the pinned vllm hash (#164628)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned vllm hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164628
Approved by: https://github.com/pytorchbot

Co-authored-by: Huy Do <huydhn@gmail.com>
2025-10-12 18:26:07 +00:00
2beead7523 [PP] move FSDP reduce scatters to end of step (#165106)
Move FSDP reduce scatters to the end of the PP step. The reduce scatter compute stream sync blocks the other stages from executing their backwards leading to bubbles. There should be a way to execute these RS earlier, but doing this for now as a quick fix.

<img width="1056" height="463" alt="image" src="https://github.com/user-attachments/assets/b945dd55-8ab1-4acc-b862-c6e2e476b834" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165106
Approved by: https://github.com/weifengpy
ghstack dependencies: #164976
2025-10-12 13:28:02 +00:00
3a110c9bb2 Add a new API torch.xpu.is_tf32_supported for Intel GPU (#163141)
# Motivation
Aligned with other backends, this PR introduces a new API `torch.xpu.is_tf32_supported`, which should be used before `torch.backends.mkldnn.allow_tf32=True` or provide hardware capability information to the Triton

# Additional Context
On Intel Xe architecture and newer, TF32 operations can be accelerated through DPAS (Dot Product Accumulate Systolic) instructions. Therefore, TF32 support can be determined by checking whether the device supports subgroup matrix multiply-accumulate operations.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163141
Approved by: https://github.com/EikanWang
2025-10-12 12:11:57 +00:00
5dbca58bd0 [dynamo] fix potential 3.12+ THP_PyOpcode_Caches init error seen internally (#165200)
Another attempt at merging https://github.com/pytorch/pytorch/pull/164597 due to CLA signing failure.

Differential Revision: [D84397377](https://our.internmc.facebook.com/intern/diff/D84397377)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165200
Approved by: https://github.com/anijain2305, https://github.com/mlazos
2025-10-12 05:29:04 +00:00
5ad7611b52 Reland vision pinned commit hash update (#164492)
Redo https://github.com/pytorch/pytorch/pull/154694

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164492
Approved by: https://github.com/yangw-dev
2025-10-12 04:53:27 +00:00
992857e286 Fix pre-dispatch AC HOP calling convention (#165145)
For AC HOP, dynamo traces it without kwargs. (kwargs are only inputs to the HOP, not to the body)
55f01a48af/torch/_dynamo/variables/higher_order_ops.py (L2594-L2609)

When we add non-strict support, we should match this calling convention too.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165145
Approved by: https://github.com/tugsbayasgalan
ghstack dependencies: #164296, #164321, #164419, #164420, #164340, #163602, #164431, #164433, #164437
2025-10-12 02:28:21 +00:00
058814794b [Code Clean] Replace std::runtime_error with TORCH_CHECK (#163437)
Replace the runtime_error of the vallina C++ exceptions with TORCH_CEHCK
Including:
- torch/csrc/export
- torch/csrc/cuda

Fixes #148114
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163437
Approved by: https://github.com/Skylion007, https://github.com/cyyever
2025-10-12 01:23:02 +00:00
bb0635d7dd [inductor][eazy] change how torch.use_deterministic_algorithms affect inductor (#164905)
Previously when torch.are_deterministic_algorithms_enabled() is True Inductor will
- skip autotuning pointwise kernels
- pick a fixed (and quite arbitrary) config for reduction

This PR change the behavior to
- for pointwise kernels, we still do autotuning
- for reduction kernels, we use the recent added heuristic to pick a config

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164905
Approved by: https://github.com/jansel, https://github.com/v0i0, https://github.com/mlazos
ghstack dependencies: #164904
2025-10-12 00:03:43 +00:00
5171f14064 [inductor] verify determinism with inductor benchmark script (#164904)
Verify the deterministic mode with torch.compile benchmark scripts.

Here is what my testing script does (pasted in the end):
- run a model in default mode, save it's result
- run the model again in default mode, but distort the benchmarking results. Compare it with the saved result.
- Do the above again in deterministic mode.

I tried to test a few modes
- BertForMaskedLM and GoogleFnet: I can repro the numeric change by distorting the benchnmark result in the default mode. The non-determinism is gone in the deterministic mode
- DistillGPT2: I can not repro the numeric change by distorting the benchmarking result in the default mode. It does not surprise me much. Reduction order change does not always cause numeric change.

```
model=GoogleFnet

export TORCHINDUCTOR_WRITE_ARE_DETERMINISTIC_ALGORITHMS_ENABLED=0
export TORCHINDUCTOR_FORCE_DISABLE_CACHES=1  # disable autotune cache
export TORCHINDUCTOR_FX_GRAPH_REMOTE_CACHE=0
export TORCHINDUCTOR_FX_GRAPH_CACHE=0
export TORCHINDUCTOR_CACHE_DIR=/tmp/torchinductor_shunting/
export TORCHINDUCTOR_BENCHMARK_KERNEL=1
export TORCHINDUCTOR_UNIQUE_KERNEL_NAMES=1
export INDUCTOR_TEST_DISABLE_FRESH_CACHE=1

# Non deterministic mode
# --float32 rather than --amp to make it easier to repro non-deterministic
echo "Save results for non-deterministic mode"
python benchmarks/dynamo/huggingface.py --backend inductor --float32 --accuracy --only $model --training --disable-cudagraphs --save-model-outputs-to=/tmp/saved-non-deterministic.pkl

echo "Compare results with distorted benchmarking in non-deterministic mode"
TORCHINDUCTOR_DISTORT_BENCHMARKING_RESULT=inverse python benchmarks/dynamo/huggingface.py --backend inductor --float32 --accuracy --only $model --training --disable-cudagraphs --compare-model-outputs-with=/tmp/saved-non-deterministic.pkl

echo "Save results for deterministic mode"
TORCHINDUCTOR_DETERMINISTIC=1 python benchmarks/dynamo/huggingface.py --backend inductor --float32 --accuracy --only $model --training --disable-cudagraphs --save-model-outputs-to=/tmp/saved-deterministic.pkl

echo "Compare results with distorted benchmarking in deterministic mode"
TORCHINDUCTOR_DETERMINISTIC=1 TORCHINDUCTOR_DISTORT_BENCHMARKING_RESULT=inverse python benchmarks/dynamo/huggingface.py --backend inductor --float32 --accuracy --only $model --training --disable-cudagraphs --compare-model-outputs-with=/tmp/saved-deterministic.pkl
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164904
Approved by: https://github.com/jansel, https://github.com/v0i0
2025-10-12 00:03:42 +00:00
df26c51478 error message for instantiating CUDA Stream if CUDA not available (#159868)
Fixes #159744
Summary:
```
import torch

# Generate input data
input_tensor = torch.randn(3, 3)
stream = torch.cuda.Stream()

# Call the API
input_tensor.record_stream(stream)
```

⚠️ will now show an error message
`torch.cuda.Stream requires CUDA support`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/159868
Approved by: https://github.com/malfet, https://github.com/isuruf
2025-10-11 23:21:35 +00:00
8d49cd5b26 Revert "[compile] Regional inductor compilation with fx.annotate (#164776)"
This reverts commit 1e4c7dffa31b3284a4cd4daa4424602827bd9d0f.

Reverted https://github.com/pytorch/pytorch/pull/164776 on behalf of https://github.com/malfet due to Looks like this one broke everything, not the top of the stack ([comment](https://github.com/pytorch/pytorch/pull/164776#issuecomment-3393725466))
2025-10-11 23:14:23 +00:00
a19123b37e Revert "[dynamo][annotate] Remove the need of external ctx mgr of preserve_node_meta (#165188)"
This reverts commit f0325d07876b5a52d29a44ee02dcf7a7c21b258a.

Reverted https://github.com/pytorch/pytorch/pull/165188 on behalf of https://github.com/malfet due to Looks like it broke bunch of tests, see 2d4654d208/1 ([comment](https://github.com/pytorch/pytorch/pull/165188#issuecomment-3393674273))
2025-10-11 21:38:45 +00:00
2d4654d208 do not overguard when comparing lists (#165091)
if we are comparing two lists l1, l2 of different lengths for equality.
we should early exist if len(l1) != len(l2)
and avoid guarding/comparing inner elements.

This avoids recompilations as in the unit test.
address https://github.com/pytorch/pytorch/issues/137515

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165091
Approved by: https://github.com/aorenste, https://github.com/mlazos
ghstack dependencies: #164884, #164885, #164886, #164887, #164888, #164889
2025-10-11 20:37:51 +00:00
f0325d0787 [dynamo][annotate] Remove the need of external ctx mgr of preserve_node_meta (#165188)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165188
Approved by: https://github.com/yushangdi
ghstack dependencies: #164776
2025-10-11 15:49:42 +00:00
1e4c7dffa3 [compile] Regional inductor compilation with fx.annotate (#164776)
This PR introduces a way to compile a region of FX graph using `fx.traceback.annotate`.

### UX

1) In the user code, mark the region that you want to be compiled with inductor using `with fx_traceback.annotate({"compile_with_inductor": 0})`. As of now, we just rely on the string `compile_with_inductor` and ignore the integer. As the needs arise, we can update the logic.

Example

```
        def fn(x, y):
            sin = torch.sin(x)

            with fx_traceback.annotate({"compile_with_inductor": 0}):
                mul = sin * y
                add = mul + 1

            return torch.sin(add)
```

2) You have to instruct the compiler to use the annotations with `compile_fx_annotated_nodes_with_inductor` transformation. This is somewhat controversial, and a user might expect that just setting annotation is enough. But for now to control the blast radius, we need to explicitly do this. One such example is

```

# Set the fw and bw compiler of aot_autograd to `compile_fx_annotated_nodes_with_inductor`
def aot_eager_regional_inductor():
    return aot_autograd(
        fw_compiler=compile_fx_annotated_nodes_with_inductor,
        bw_compiler=compile_fx_annotated_nodes_with_inductor,
    )

```

3) Fixable in short-term - You have to wrap the user code in `torch.fx.traceback.preserve_node_meta` to ensure that annotations are propagated to the compiler. This is fixable, just need to make CI happy.

### Implementation

1) Relies on `CapabilityBasedPartitioner` to "scoop" out regions based on annotations, and then create subgraphs in the main graph.
2) Call `torch._inductor.standalone_compile` on these subgraphs, and jam the returned callable into the FX graph at the place of call_module

Resulting graph looks something like this - search for `torch__inductor_standalone_compile_inner`

Forward graph
```
class GraphModule(torch.nn.Module):
    def forward(self, primals_1: "f32[10]", primals_2: "f32[10]"):
         # File: /data/users/anijain/pytorch2/test/dynamo/test_regional_inductor.py:64 in fn, code: sin = torch.sin(x)
        sin: "f32[10]" = torch.ops.aten.sin.default(primals_1)

        # No stacktrace found for following nodes
        inner = torch__inductor_standalone_compile_inner(sin, primals_2)

         # File: /data/users/anijain/pytorch2/test/dynamo/test_regional_inductor.py:68 in fn, code: add = mul + 1
        getitem: "f32[10]" = inner[0];  inner = None

         # File: /data/users/anijain/pytorch2/test/dynamo/test_regional_inductor.py:70 in fn, code: return torch.sin(add)
        sin_1: "f32[10]" = torch.ops.aten.sin.default(getitem)
        return (sin_1, primals_1, primals_2, sin, getitem)
```

Backward graph
```
class GraphModule(torch.nn.Module):
    def forward(self, primals_1: "f32[10]", primals_2: "f32[10]", sin: "f32[10]", add: "f32[10]", tangents_1: "f32[10]"):
         # File: /data/users/anijain/pytorch2/test/dynamo/test_regional_inductor.py:64 in fn, code: sin = torch.sin(x)
        cos_1: "f32[10]" = torch.ops.aten.cos.default(primals_1);  primals_1 = None

         # File: /data/users/anijain/pytorch2/test/dynamo/test_regional_inductor.py:70 in fn, code: return torch.sin(add)
        cos: "f32[10]" = torch.ops.aten.cos.default(add);  add = None
        mul_1: "f32[10]" = torch.ops.aten.mul.Tensor(tangents_1, cos);  tangents_1 = cos = None

        # No stacktrace found for following nodes
        inner = torch__inductor_standalone_compile_inner(mul_1, sin, primals_2);  mul_1 = sin = primals_2 = None

         # File: /data/users/anijain/pytorch2/test/dynamo/test_regional_inductor.py:67 in fn, code: mul = sin * y
        getitem: "f32[10]" = inner[0]
        getitem_1: "f32[10]" = inner[1];  inner = None

         # File: /data/users/anijain/pytorch2/test/dynamo/test_regional_inductor.py:64 in fn, code: sin = torch.sin(x)
        mul_4: "f32[10]" = torch.ops.aten.mul.Tensor(getitem_1, cos_1);  getitem_1 = cos_1 = None
        return (mul_4, getitem)
```

### Some issue raised in the HOP meeting
1) CSE will not differentiate different meta custom nodes and do wrong thing.
2) SAC - The recomputed forward will be smaller than the forward. Will we compile a smaller region than?
3) What happens if you have a op in the middle which does not disturb the topology, is it still 1 subgraph?
4) What happens with the nesting of `fx_traceback.annotate`? Are there any ordering requirements?
5) What are we going to use the annotations for?
   a) compile flex
   b) streams
   c) nn.Module info to organize MoE components for pipelining
   d) PP stages
   e) Rename graph nodes for more debugging
   f) No nested regional compile

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164776
Approved by: https://github.com/SherlockNoMad
2025-10-11 15:49:42 +00:00
79a33e2db2 Switch docs build from c5 to c7i (#165082)
Switch docs build from c5 to c7i which should increase build
performance by roughly 15-20% while reducing costs by 10-15%.

Signed-off-by: Thanh Ha <thanh.ha@linuxfoundation.org>
2025-10-11 10:59:18 -04:00
816fb7f48d Revert "Enable ruff rule E721 (#165162)"
This reverts commit 9e7c19f72b6d0690915c307409c0c0a76b5a3bf0.

Reverted https://github.com/pytorch/pytorch/pull/165162 on behalf of https://github.com/pytorch-auto-revert due to Reverted automatically by pytorch's autorevert, to avoid this behaviour add the tag autorevert: disable ([comment](https://github.com/pytorch/pytorch/pull/165162#issuecomment-3393328271))
2025-10-11 13:25:40 +00:00
512dd79ff0 [4/N] [DTensor device order] Support debugmode to show dtensor distribution transform path (#164821)
Enable the DebugMode to print out how `placements` and `shard_order` will update when we execute `transform_infos` to transform from source placement to target placement.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164821
Approved by: https://github.com/SherlockNoMad, https://github.com/pianpwk
ghstack dependencies: #164806, #164820
2025-10-11 09:44:54 +00:00
2001b18541 [3/N] [DTensor device order] Make some placement type class method static (#164820)
Some methods in `Placement` class can be exposed as static.

Those method should be useful w/o initializing the object. E.g., when we `distribute_tensor` from normal tensor, we may want:
```
local_tensor = Shard.shard_tensor(tensor_dim, local_tensor, device_mesh, mesh_dim,)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164820
Approved by: https://github.com/XilunWu, https://github.com/fduwjj, https://github.com/wanchaol
ghstack dependencies: #164806
2025-10-11 09:42:13 +00:00
9dac4e2540 [2/N] [DTensor device order] Add shard_order attribute in DTensorSpec (#164806)
Add `shard_order` field in DTensorSpec.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164806
Approved by: https://github.com/XilunWu, https://github.com/wanchaol
2025-10-11 09:39:08 +00:00
4400c5d31e Continue to build nightly CUDA 12.9 for internal (#163029)
Revert part of https://github.com/pytorch/pytorch/pull/161916 to continue building CUDA 12.9 nightly

Pull Request resolved: https://github.com/pytorch/pytorch/pull/163029
Approved by: https://github.com/malfet
2025-10-11 08:26:47 +00:00
9e7c19f72b Enable ruff rule E721 (#165162)
`E721` checks for object type comparisons using == and other comparison operators. This is useful because it is recommended to use `is` for type comparisons.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165162
Approved by: https://github.com/Skylion007
2025-10-11 06:43:53 +00:00
220a34118f [export] Turn on install_free_tensors flag (#164691)
The final step in removing the discrepancy between
torch.compile(fullgraph=True) and torch.export(strict=True).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164691
Approved by: https://github.com/avikchaudhuri
2025-10-11 04:26:09 +00:00
de8d81275a Do not decompose in functionalization/proxy tensor if autograd wouldn't have decomposed (#164939)
This fixes AOTAutograd rms_norm not being bitwise equivalent to
eager, because it avoids a decomposition.  You can force the
decomposition by having the decomposition in the dispatch table,
but if eager mode wouldn't have decomposed (because it went to the fused
one), we now default to preserving the fused call by default.

This largely reverts https://github.com/pytorch/pytorch/pull/103275/ for view ops. This means that in inference mode we could hit the wrong C++ kernel; if this occurs we should just SymInt'ify the C++ kernel.

Another neat side effect of this change is that Inductor's generated kernels for rms_norm now have rms_norm in their name.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164939
Approved by: https://github.com/bdhirsh
2025-10-11 01:03:55 +00:00
d73416642f [test] Skip testing of source_fn_stack in light of export changes (#165176)
This is in regards to https://github.com/pytorch/pytorch/pull/164691
where we are inlining into nn modules, and therefore it is causing this
test to fail. The test here looks for node.name which is quite different
with inlining.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165176
Approved by: https://github.com/andrewor14
ghstack dependencies: #165172
2025-10-11 00:16:59 +00:00
ef50c9b557 Remove unnecessary "static" for definitions in anonymous namespace (#165035)
This PR removes unnecessary "static" for C++ functions and variables in anonymous namespace as detected by clang-tidy. This enhances code readability. The related rules are planed to be enabled in follow-up PRs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165035
Approved by: https://github.com/Skylion007
2025-10-11 00:04:23 +00:00
2d9f3f57f1 [dynamo][executorch] Handle lowered module from executorch delegate specially (#165172)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165172
Approved by: https://github.com/tugsbayasgalan
2025-10-10 23:24:17 +00:00
c8c5187e85 Fix truediv numerics between eager and compile (#164144)
Addresses numeric differences between eager and compile in https://github.com/pytorch/pytorch/issues/141753

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164144
Approved by: https://github.com/bobrenjc93
2025-10-10 22:18:11 +00:00
ee0a8a5a50 [CP]Introduce ContextParallal plan for parallelize_module() (#162542)
**Motivation**

Since FlexAttention and SDPA are both functions, not modules, we have tried numerous mechanisms to dispatch FlexAttention and SDPA to customized call paths so that we can inject the CP logic. Unfortunately, all of these approaches have their own composability issues with different techniques.

**Candidate Approaches**

1. Ask users to write a module to wrap FlexAttention/SDPA and use `parallelize_module` to install a forward hook.

   - Pros: This is similar to how we do TP.
   - Cons: 1) It is cumbersome for users as they need to create a new module. 2) We need two places to parallelize the CP, as a context_parallel context manager is still required for splitting the inputs.

2. Provide a function wrapper.

   - Pros: Users just need to replace their FlexAttention/SDPA calls with the wrapper.
   - Cons: It is not the same API, though we can maintain the API signatures to be the same as the core API.

**Summary**

~~This PR implements approach 2 and refactor the code in such a way that most code can be used by option approach 1, which will be introduced in another PR.~~

We changed this PR to implement option 1 as people like option 1 due to the consistency with the existing parallelisms. But this PR can also serve the foundation to implement option 2, which was the early version of this PR.

This PR also changes `create_cp_block_mask` logic since we now only focus on ModuleWrapper approach which doesn't require to hack the seq_len field in a BlockMask.

This PR also removes TorchFunctionMode dispatcher mode as it doesn't work well with SAC.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162542
Approved by: https://github.com/XilunWu
2025-10-10 22:03:43 +00:00
50c338c2da [DeviceMesh] Move global state into class method (#164510)
This is PR trying to move bookkeeping state maps from MeshEnv to DeviceMesh class members. The reason is that in general global variables are thread local and cause potential issue.

We will also need to do DTensor CPU overhead benchmark for this change.

3-5% CPU overhead in DTensor has been observed:

before:
<img width="1147" height="535" alt="image" src="https://github.com/user-attachments/assets/9e4ac018-ec0a-46a4-8f2c-64b4dbec465c" />

After:
<img width="1114" height="576" alt="image" src="https://github.com/user-attachments/assets/eaf83660-652b-4c6b-8591-f6049ccdd14c" />

running the benchmark mentioned here: https://github.com/pytorch/pytorch/issues/159169

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164510
Approved by: https://github.com/lw, https://github.com/fegin
2025-10-10 21:37:17 +00:00
3faee20067 [opaque_obj_v2] PyObject custom op schema type (#165004)
This is a cleaner implementation of opaque objects (https://github.com/pytorch/pytorch/pull/162660). Instead now we just need to do:

Call `register_opaque_type` to register the type as being "opaque" and allowed by custom ops. You also need to pass a unique name that maps to the type.
```python
class OpaqueQueue:
    def __init__(self, queue: list[torch.Tensor], init_tensor_: torch.Tensor) -> None:
        super().__init__()
        self.queue = queue
        self.init_tensor_ = init_tensor_

    def push(self, tensor: torch.Tensor) -> None:
        self.queue.append(tensor)

    def pop(self) -> torch.Tensor:
        if len(self.queue) > 0:
            return self.queue.pop(0)
        return self.init_tensor_

    def size(self) -> int:
        return len(self.queue)

register_opaque_type(OpaqueQueue, "_TestOpaqueObject_OpaqueQueue")
```

When creating the custom op, the schema will then use the unique name:
```python
self.lib = torch.library.Library("_TestOpaqueObject", "FRAGMENT")

torch.library.define(
    "_TestOpaqueObject::queue_push",
    "(_TestOpaqueObject_OpaqueQueue a, Tensor b) -> ()",
    tags=torch.Tag.pt2_compliant_tag,
    lib=self.lib,
)

@torch.library.impl(
    "_TestOpaqueObject::queue_push", "CompositeExplicitAutograd", lib=self.lib
)
def push_impl(queue: OpaqueQueue, b: torch.Tensor) -> None:
    assert isinstance(queue, OpaqueQueue)
    queue.push(b)
```

Using the custom op:
```python
queue = OpaqueQueue([], torch.zeros(3))
torch.ops._TestOpaqueObject.queue_push(queue, torch.ones(3))
self.assertTrue(queue.size(), 1)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165004
Approved by: https://github.com/albanD
2025-10-10 21:31:56 +00:00
cafca357fb Fix h100 daily inductor running dispatch (#165185)
casued by merged pr: e7ed1a00eb

the if condition should also updated

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165185
Approved by: https://github.com/malfet, https://github.com/huydhn
2025-10-10 21:28:58 +00:00
1e35b3c4e0 Augment DebugMode to support attributes reporting (#165109)
DebugMode reports tensor type, it shapes and placements while active. This change augments reporting to tensor attributes from configured set. This feature is intended to be used to ease understanding debug string when dealing with larger outputs. For example, before running forward pass of a model we can annotate each of parameters and buffers with their fully qualified names, so that we can see which ops are being executed against specific tensors.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165109
Approved by: https://github.com/ezyang, https://github.com/pianpwk
2025-10-10 21:27:05 +00:00
f363114852 [Bugfix][Inductor][Dynamo] Fix stride incorrectness issues for stride 0 tensor (#164897)
Fixes #164814 - we update to include cases where we know symbolic expression is statically one.  There are two errors here; first in graph capture, where a tensor with size 0 yet symbolic stride would attempt to keep the symbolic stride, resulting in a mismatch.  The second is in inductor code gen, where we only checked in squeeze if size == 1, missing the case where a symbolic stride equals 1.

Also fixes #164924 (@bobrenjc93  for fuzzer finding an issue affecting users : )

### Test plan:
```
python test/dynamo/test_aot_autograd.py AotAutogradFallbackTests
```

Results in:
```
..
----------------------------------------------------------------------
Ran 49 tests in 45.622s

OK (expected failures=1)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164897
Approved by: https://github.com/laithsakka
2025-10-10 21:26:57 +00:00
0ec0120b19 Move aws OIDC credentials steps into setup-rocm.yml (#164769)
The AWS ECR login step needs `id-token: write` permissions. We move the steps to get OIDC-based credentials from `_rocm-test.yml` to `setup-rocm.yml`. This lays the groundwork to enable access to AWS ECR in workflows in other repos such as torchtitan that use [linux_job_v2.yml](https://github.com/pytorch/test-infra/blob/main/.github/workflows/linux_job_v2.yml), which also uses [setup-rocm.yml](335f4f80a0/.github/workflows/linux_job_v2.yml (L168)).

Any caller workflows that eventually execute `setup-rocm` action will thus need to provide the `id-token: write` permission.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164769
Approved by: https://github.com/huydhn
2025-10-10 21:24:29 +00:00
8360f34c36 [ROCm] hotfix test scaled matmul cuda (#165104)
Refactoring of scaled mm APIs and related tests caused previously passing tests on ROCm to start failing.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165104
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-10-10 21:06:58 +00:00
370b1c12d2 [CI] Put the no gpu tests on machines that don't have gpus (#165183)
I think this is just a copy paste error?

NS: Introduced by https://github.com/pytorch/pytorch/pull/161013

Not sure where it got copied from though, the other set of no gpu tests for the other cuda version already have cpu runners
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165183
Approved by: https://github.com/malfet
2025-10-10 20:59:09 +00:00
6fd1ca28e1 [lint] Run full lint on ciflow/trunk (#165169)
Add some naming stuff to differentiate between full + partial

If we find that partial always == full, then we can get rid of it

https://github.com/pytorch/pytorch/issues/165168
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165169
Approved by: https://github.com/Skylion007, https://github.com/malfet
2025-10-10 20:38:51 +00:00
0055f07997 Disable failing test_int8_woq_mm_cuda on slow grad check (#165147)
Fixes #ISSUE_NUMBER
Failing due to memory leak, ex
https://github.com/pytorch/pytorch/actions/runs/18401518298/job/52434584458

```
2025-10-10T11:07:42.9485277Z _ TestSelectAlgorithmCudaCUDA.test_int8_woq_mm_cuda_batch_size_32_mid_dim_8_in_features_144_out_features_65_cuda_bfloat16 _
2025-10-10T11:07:42.9485389Z Traceback (most recent call last):
2025-10-10T11:07:42.9485869Z   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 3278, in wrapper
2025-10-10T11:07:42.9485966Z     method(*args, **kwargs)
2025-10-10T11:07:42.9486365Z   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 3278, in wrapper
2025-10-10T11:07:42.9486454Z     method(*args, **kwargs)
2025-10-10T11:07:42.9486849Z   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 3277, in wrapper
2025-10-10T11:07:42.9486933Z     with policy():
2025-10-10T11:07:42.9487380Z   File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 2654, in __exit__
2025-10-10T11:07:42.9487473Z     raise RuntimeError(msg)
2025-10-10T11:07:42.9488533Z RuntimeError: CUDA driver API confirmed a leak in __main__.TestSelectAlgorithmCudaCUDA.test_int8_woq_mm_cuda_batch_size_32_mid_dim_8_in_features_144_out_features_65_cuda_bfloat16! Caching allocator allocated memory was 19456 and is now reported as 29184 on device 0. CUDA driver allocated memory was 356712448 and is now 358809600.
2025-10-10T11:07:42.9488543Z
2025-10-10T11:07:42.9488722Z To execute this test, run the following from the base repo dir:
2025-10-10T11:07:42.9489520Z     PYTORCH_TEST_CUDA_MEM_LEAK_CHECK=1 PYTORCH_TEST_WITH_SLOW_GRADCHECK=1 python test/inductor/test_cuda_select_algorithm.py TestSelectAlgorithmCudaCUDA.test_int8_woq_mm_cuda_batch_size_32_mid_dim_8_in_features_144_out_features_65_cuda_bfloat16
2025-10-10T11:07:42.9489525Z
2025-10-10T11:07:42.9489748Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
```

Got added in #161680

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165147
Approved by: https://github.com/bbeckca
2025-10-10 20:26:31 +00:00
4f8a986b8f Make LOCK_TIMEOUT in codecache configurable (#165030)
- Introduce file_lock_timeout in config (defaults to current value of 600)
- Use the above config instead of hardcoded 600 config.

This is useful when running stress tests.

Differential Revision:
D84109142

Privacy Context Container: L1297311

Pull Request resolved: https://github.com/pytorch/pytorch/pull/165030
Approved by: https://github.com/hl475
2025-10-10 20:22:11 +00:00
5c3fe9fb30 Revert "Do not decompose in functionalization/proxy tensor if autograd wouldn't have decomposed (#164939)"
This reverts commit a6fa4f9c283971c0fb6f60a89674a1f35370ac79.

Reverted https://github.com/pytorch/pytorch/pull/164939 on behalf of https://github.com/izaitsevfb due to introduces numeric issues internally, see [D84326613](https://www.internalfb.com/diff/D84326613) ([comment](https://github.com/pytorch/pytorch/pull/164939#issuecomment-3392203314))
2025-10-10 20:21:12 +00:00
306b344a18 [dynamo][DebugMode] mask python keys in dispatch_key_set guard checks (#164992)
I found that running any compiled function under DebugMode more than once will trigger recompilations, e.g. with the really simple modified test case in `test_compile`:
```
[0/1] [__recompiles] Recompiling function f in /data/users/pianpwk/ptclone/pytorch/test/distributed/tensor/debug/test_debug_mode.py:268
[0/1] [__recompiles]     triggered by the following guard failure(s):
[0/1] [__recompiles]     - 0/0:
[0/2] [__recompiles] Recompiling function f in /data/users/pianpwk/ptclone/pytorch/test/distributed/tensor/debug/test_debug_mode.py:268
[0/2] [__recompiles]     triggered by the following guard failure(s):
[0/2] [__recompiles]     - 0/1:
[0/2] [__recompiles]     - 0/0:
```

Digging deeper, the guard failures were due to TENSOR_MATCH guards failing on dispatch key set checks (seemingly on the Python dispatch key):
5a1fbf45ad/torch/csrc/dynamo/guards.cpp (L199-L203)

This seems to due to the `ignore_compile_internals=True` flag on custom dispatch modes being on, which causes these modes to "hide" themselves during compilation, making dynamo guard on the Python dispatch key being off.

The (maybe imperfect) solution is to mask out the Python keys for guard comparisons. This might be fine because custom dispatch modes won't appear here during compilation - `ignore_compile_internals=True` hides them, and `ignore_compile_internals=False` disables compile entirely?

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164992
Approved by: https://github.com/williamwen42
2025-10-10 20:00:28 +00:00
94e634942a Fix int32 overflow in embedding_dense_backward (#165095)
If `max_partial_segment` is large we can overflow `gid` and cause a bunch of IMA.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165095
Approved by: https://github.com/ngimel, https://github.com/eqy
2025-10-10 19:47:38 +00:00
a4925c0ce0 [testing] Print something for log classifier to better differentiate reruns vs real failures (#165163)
The normal pytest/unittest failure patterns also match flaky tests (specifically I think tests that fail -> succeed on rerun in a new subprocess)

So print something specifically for log classifier that it can match against
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165163
Approved by: https://github.com/izaitsevfb
2025-10-10 19:28:13 +00:00
d16627f4d0 Revert "[dynamo][executorch] Do not trace into exeuctorch LoweredBackendModule (#165126)"
This reverts commit 41936f4cf6ff93b70d81f6a23811d43a0647f1e1.

Reverted https://github.com/pytorch/pytorch/pull/165126 on behalf of https://github.com/anijain2305 due to https://github.com/pytorch/pytorch/pull/165172 is the right way ([comment](https://github.com/pytorch/pytorch/pull/165126#issuecomment-3391975498))
2025-10-10 19:21:41 +00:00
8f78999d77 [Inductor][ATen] Fix stride rounding on Blockwise128x128 to accommodate for small shapes (#164953)
Summary: Fix rounding issue on `Blockwise128x128` to accommodate for small shapes. The original implementation rounded all strides to 4, which caused failures for `test_fp8.py` tests as well as `test_scaled_matmul_cuda.py::test_scaled_mm_vs_emulated_block_wise` tests ([GitHub PR](https://github.com/pytorch/pytorch/pull/164259)).

Test Plan:
`test_fp8.py`
`test_scaled_matmul_cuda.py`

Differential Revision: D84103213

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164953
Approved by: https://github.com/slayton58, https://github.com/eqy
2025-10-10 19:12:58 +00:00
7cddda1234 Update asan in slow to linux.2xlarge.memory
Followup after f2ae7084eb
2025-10-10 12:02:29 -07:00
98b53961b9 [torchfuzz] add more context to xfail test file (#165149)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165149
Approved by: https://github.com/PaulZhang12
ghstack dependencies: #165116
2025-10-10 18:51:51 +00:00
a3eb275d3c Add torch compile check for ZeroBubble (#162511)
Fix https://github.com/pytorch/pytorch/issues/161904

Pull Request resolved: https://github.com/pytorch/pytorch/pull/162511
Approved by: https://github.com/fegin
2025-10-10 18:49:45 +00:00
772 changed files with 21633 additions and 6421 deletions

View File

@ -8,6 +8,8 @@ if [[ "$GPU_ARCH_VERSION" == *"12.6"* ]]; then
export TORCH_CUDA_ARCH_LIST="8.0;9.0"
elif [[ "$GPU_ARCH_VERSION" == *"12.8"* ]]; then
export TORCH_CUDA_ARCH_LIST="8.0;9.0;10.0;12.0"
elif [[ "$GPU_ARCH_VERSION" == *"12.9"* ]]; then
export TORCH_CUDA_ARCH_LIST="8.0;9.0;10.0;12.0"
elif [[ "$GPU_ARCH_VERSION" == *"13.0"* ]]; then
export TORCH_CUDA_ARCH_LIST="8.0;9.0;10.0;11.0;12.0+PTX"
fi

View File

@ -143,7 +143,7 @@ def sample_vllm_test_library():
"pytest -v -s compile/test_decorator.py",
],
},
"vllm_languagde_model_test_extended_generation_28_failure_test": {
"vllm_language_model_test_extended_generation_28_failure_test": {
"title": "Language Models Test (Extended Generation) 2.8 release failure",
"id": "vllm_languagde_model_test_extended_generation_28_failure_test",
"package_install": [

View File

@ -63,7 +63,7 @@ class VllmBuildParameters:
# DOCKERFILE_PATH: path to Dockerfile used when use_local_dockerfile is True"
use_local_dockerfile: bool = env_bool_field("USE_LOCAL_DOCKERFILE", True)
dockerfile_path: Path = env_path_field(
"DOCKERFILE_PATH", ".github/ci_configs/vllm/Dockerfile.tmp_vllm"
"DOCKERFILE_PATH", ".github/ci_configs/vllm/Dockerfile"
)
# the cleaning script to remove torch dependencies from pip

View File

@ -187,19 +187,22 @@ if [[ $CUDA_VERSION == 12* || $CUDA_VERSION == 13* ]]; then
export USE_CUFILE=0
else
DEPS_LIST+=(
"/usr/local/cuda/lib64/libnvToolsExt.so.1"
"/usr/local/cuda/lib64/libcublas.so.12"
"/usr/local/cuda/lib64/libcublasLt.so.12"
"/usr/local/cuda/lib64/libcudart.so.12"
"/usr/local/cuda/lib64/libnvrtc.so.12"
"/usr/local/cuda/extras/CUPTI/lib64/libcupti.so.12")
DEPS_SONAME+=(
"libnvToolsExt.so.1"
"libcublas.so.12"
"libcublasLt.so.12"
"libcudart.so.12"
"libnvrtc.so.12"
"libcupti.so.12")
if [[ $CUDA_VERSION != 12.9* ]]; then
DEPS_LIST+=("/usr/local/cuda/lib64/libnvToolsExt.so.1")
DEPS_SONAME+=("libnvToolsExt.so.1")
fi
fi
else
echo "Using nvidia libs from pypi."

View File

@ -8,6 +8,7 @@ assignees: ''
---
> NOTE: Remember to label this issue with "`ci: sev`"
> If you want autorevert to be disabled, keep the ci: disable-autorevert label
<!-- Add the `merge blocking` label to this PR to prevent PRs from being merged while this issue is open -->

View File

@ -1,7 +1,7 @@
---
name: DISABLE AUTOREVERT
name: "D❌\U0001F519 ISABLE AUTOREVERT"
about: Disables autorevert when open
title: "❌​\U0001F519 [DISABLE AUTOREVERT]"
title: "[DISABLE AUTOREVERT]"
labels: 'ci: disable-autorevert'
assignees: ''

View File

@ -65,7 +65,7 @@ runs:
cd .ci/lumen_cli
python3 -m pip install -e .
)
MAX_JOBS="$(nproc --ignore=6)"
MAX_JOBS="$(nproc --ignore=10)"
export MAX_JOBS
# Split the comma-separated list and build each target

View File

@ -111,3 +111,16 @@ runs:
# This video group ID maps to subgid 1 inside the docker image due to the /etc/subgid entries.
# The group name corresponding to group ID 1 can change depending on the OS, so both are necessary.
echo "GPU_FLAG=--device=/dev/mem --device=/dev/kfd $DEVICE_FLAG --group-add video --group-add $render_gid --group-add daemon --group-add bin --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --network=host" >> "${GITHUB_ENV}"
- name: configure aws credentials
id: aws_creds
uses: aws-actions/configure-aws-credentials@ececac1a45f3b08a01d2dd070d28d111c5fe6722 # v4.1.0
with:
role-to-assume: arn:aws:iam::308535385114:role/gha_workflow_s3_and_ecr_read_only
aws-region: us-east-1
role-duration-seconds: 18000
- name: Login to Amazon ECR
id: login-ecr
continue-on-error: true
uses: aws-actions/amazon-ecr-login@062b18b96a7aff071d4dc91bc00c4c1a7945b076 # v2.0.1

View File

@ -1 +1 @@
87ff22e49ed0e92576c4935ccb8c143daac4a3cd
1b013f5b5a87a1882eb143c26d79d091150d6a37

View File

@ -1 +1 @@
966da7e46f65d6d49df3e31214470a4fe5cc8e66
faffd5cf673615583da6517275e361cb3dbc77e6

View File

@ -1 +1 @@
0ad9951c416d33c5da4f7a504fb162cbe62386f5
e5192819208c4d68194844b7dfafbc00020d0dea

View File

@ -1 +1 @@
2a9138a26ee257fef05310ad3fecf7c55fe80d73
0fa6e3129e61143224663e1ec67980d12b7ec4eb

View File

@ -1,59 +1,71 @@
# TODO(elainwy): remove this file after the torch nightly dockerfile is in sync in vllm repo
# The vLLM Dockerfile is used to construct vLLM image against torch nightly and torch main that can be directly used for testing
ARG CUDA_VERSION=12.8.1
ARG PYTHON_VERSION=3.12
# BUILD_BASE_IMAGE: used to setup python build xformers, and vllm wheels, It can be replaced with a different base image from local machine,
# by default, it uses the torch-nightly-base stage from this docker image
ARG BUILD_BASE_IMAGE=torch-nightly-base
# FINAL_BASE_IMAGE: used to set up vllm-instaled environment and build flashinfer,
# by default, it uses devel-ubuntu22.04 official image.
ARG FINAL_BASE_IMAGE=nvidia/cuda:${CUDA_VERSION}-devel-ubuntu22.04
# The logic is copied from https://github.com/vllm-project/vllm/blob/main/docker/Dockerfile
ARG GET_PIP_URL="https://bootstrap.pypa.io/get-pip.py"
#################### TORCH NIGHTLY BASE IMAGE ####################
# A base image for building vLLM with devel ubuntu 22.04, this is mainly used to build vllm in vllm builtkite ci
FROM nvidia/cuda:${CUDA_VERSION}-devel-ubuntu22.04 as torch-nightly-base
ARG CUDA_VERSION
ARG PYTHON_VERSION
ARG GET_PIP_URL
# Install Python and other dependencies
# Install system dependencies and uv, then create Python virtual environment
RUN apt-get update -y \
&& apt-get install -y ccache software-properties-common git curl wget sudo vim \
&& add-apt-repository -y ppa:deadsnakes/ppa \
&& apt-get update -y \
&& apt-get install -y python${PYTHON_VERSION} python${PYTHON_VERSION}-dev python${PYTHON_VERSION}-venv \
&& update-alternatives --install /usr/bin/python3 python3 /usr/bin/python${PYTHON_VERSION} 1 \
&& update-alternatives --set python3 /usr/bin/python${PYTHON_VERSION} \
&& ln -sf /usr/bin/python${PYTHON_VERSION}-config /usr/bin/python3-config \
&& curl -sS ${GET_PIP_URL} | python${PYTHON_VERSION} \
&& apt-get install -y ccache software-properties-common git curl sudo vim python3-pip \
&& curl -LsSf https://astral.sh/uv/install.sh | sh \
&& $HOME/.local/bin/uv venv /opt/venv --python ${PYTHON_VERSION} \
&& rm -f /usr/bin/python3 /usr/bin/python3-config /usr/bin/pip \
&& ln -s /opt/venv/bin/python3 /usr/bin/python3 \
&& ln -s /opt/venv/bin/python3-config /usr/bin/python3-config \
&& ln -s /opt/venv/bin/pip /usr/bin/pip \
&& python3 --version && python3 -m pip --version
# Upgrade to GCC 10 to avoid https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92519
# as it was causing spam when compiling the CUTLASS kernels
# Ensure gcc >= 10 to avoid CUTLASS issues (bug 92519)
RUN current_gcc_version=$(gcc -dumpversion | cut -f1 -d.) && \
if command -v apt-get >/dev/null; then \
if [ "$current_gcc_version" -lt 10 ]; then \
echo "GCC version is $current_gcc_version, installing gcc-10..."; \
apt-get update \
&& apt-get install -y gcc-10 g++-10 \
&& update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-10 100 \
&& update-alternatives --install /usr/bin/g++ g++ /usr/bin/g++-10 100; \
else \
echo "GCC version is $current_gcc_version, no need to install gcc-10."; \
fi \
fi \
&& gcc --version && g++ --version
RUN apt-get install -y gcc-10 g++-10
RUN update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-10 110 --slave /usr/bin/g++ g++ /usr/bin/g++-10
RUN <<EOF
gcc --version
EOF
# install uv for faster pip installs
# Install uv for faster pip installs
RUN --mount=type=cache,target=/root/.cache/uv \
python3 -m pip install uv==0.8.4
ENV UV_HTTP_TIMEOUT=500
ENV UV_INDEX_STRATEGY="unsafe-best-match"
# Use copy mode to avoid hardlink failures with Docker cache mounts
ENV UV_LINK_MODE=copy
#################### TORCH NIGHTLY BASE IMAGE ####################
#################### BASE BUILD IMAGE ####################
FROM ${BUILD_BASE_IMAGE} AS base
USER root
ARG CUDA_VERSION
ARG PYTHON_VERSION
# Only work with PyTorch manylinux builder
ENV PATH="/opt/python/cp312-cp312/bin:${PATH}"
# Install some system dependencies and double check python version
RUN if command -v apt-get >/dev/null; then \
apt-get update -y \
&& apt-get install -y ccache software-properties-common git wget sudo vim; \
else \
dnf install -y git wget sudo; \
fi \
&& python3 --version && python3 -m pip --version
# Install uv for faster pip installs if not existed
RUN --mount=type=cache,target=/root/.cache/uv \
python3 -m pip install uv==0.8.4
@ -62,51 +74,17 @@ ENV UV_INDEX_STRATEGY="unsafe-best-match"
# Use copy mode to avoid hardlink failures with Docker cache mounts
ENV UV_LINK_MODE=copy
#################### TORCH NIGHTLY BASE IMAGE ####################
#################### BASE BUILD IMAGE ####################
# A base image for building vLLM with torch nightly or torch wheels
# prepare basic build environment
FROM ${BUILD_BASE_IMAGE} AS base
USER root
ARG CUDA_VERSION
ARG PYTHON_VERSION
# TODO (huydhn): Only work with PyTorch manylinux builder
ENV PATH="/opt/python/cp312-cp312/bin:${PATH}"
# Install some system dependencies and double check python version
RUN if command -v apt-get >/dev/null; then \
apt-get update -y \
&& apt-get install -y ccache software-properties-common git curl wget sudo vim; \
else \
dnf install -y git curl wget sudo; \
fi \
&& python3 --version && python3 -m pip --version
# Install uv for faster pip installs if not existed
RUN --mount=type=cache,target=/root/.cache/uv \
if ! python3 -m uv --version >/dev/null 2>&1; then \
python3 -m pip install uv==0.8.4; \
fi
ENV UV_HTTP_TIMEOUT=500
ENV UV_INDEX_STRATEGY="unsafe-best-match"
# Use copy mode to avoid hardlink failures with Docker cache mounts
ENV UV_LINK_MODE=copy
WORKDIR /workspace
# install build and runtime dependencies
# Install build and runtime dependencies
COPY requirements/common.txt requirements/common.txt
COPY use_existing_torch.py use_existing_torch.py
COPY pyproject.toml pyproject.toml
# install build and runtime dependencies without stable torch version
# Install build and runtime dependencies without stable torch version
RUN python3 use_existing_torch.py
# default mount file as placeholder, this just avoid the mount error
# Default mount file as placeholder, this just avoid the mount error
# change to a different vllm folder if this does not exist anymore
ARG TORCH_WHEELS_PATH="./requirements"
ARG PINNED_TORCH_VERSION
@ -138,56 +116,36 @@ RUN --mount=type=cache,target=/root/.cache/uv \
RUN --mount=type=cache,target=/root/.cache/uv \
uv pip install --system -r requirements/common.txt
# Must put before installing xformers, so it can install the correct version of xfomrers.
ARG xformers_cuda_arch_list='7.5;8.0+PTX;9.0a'
ENV TORCH_CUDA_ARCH_LIST=${xformers_cuda_arch_list}
ARG max_jobs=16
ENV MAX_JOBS=${max_jobs}
RUN echo ${TORCH_CUDA_ARCH_LIST}
RUN echo ${MAX_JOBS}
RUN pip freeze | grep -E 'ninja'
RUN --mount=type=cache,target=/root/.cache/uv bash - <<'BASH'
export TORCH_CUDA_ARCH_LIST='7.5 8.0+PTX 9.0a'
git clone https://github.com/facebookresearch/xformers.git
# Build xformers with cuda and torch nightly/wheel
# following official xformers guidance: https://github.com/facebookresearch/xformers#build
# sha for https://github.com/facebookresearch/xformers/tree/v0.0.32.post2
ARG XFORMERS_COMMIT=5d4b92a5e5a9c6c6d4878283f47d82e17995b468
ENV CCACHE_DIR=/root/.cache/ccache
pushd xformers
git checkout v0.0.32.post2
git submodule update --init --recursive
python3 setup.py bdist_wheel --dist-dir=../xformers-dist --verbose
popd
RUN --mount=type=cache,target=/root/.cache/ccache \
--mount=type=cache,target=/root/.cache/uv \
echo 'git clone xformers...' \
&& git clone https://github.com/facebookresearch/xformers.git --recursive \
&& cd xformers \
&& git checkout ${XFORMERS_COMMIT} \
&& git submodule update --init --recursive \
&& echo 'finish git clone xformers...' \
&& rm -rf build \
&& python3 setup.py bdist_wheel --dist-dir=../xformers-dist --verbose \
&& cd .. \
&& rm -rf xformers
rm -rf xformers
BASH
RUN --mount=type=cache,target=/root/.cache/uv \
uv pip install --system xformers-dist/*.whl --verbose
uv pip install --system xformers-dist/*.whl
# Build can take a long time, and the torch nightly version fetched from url can be different in next docker stage.
# track the nightly torch version used in the build, when we set up runtime environment we can make sure the version is the same
RUN uv pip freeze | grep -i '^torch\|^torchvision\|^torchaudio' > torch_build_versions.txt
RUN cat torch_build_versions.txt
RUN pip freeze | grep -E 'torch|xformers|torchvision|torchaudio'
#################### BASE BUILD IMAGE ####################
#################### WHEEL BUILD IMAGE ####################
# Image used to build vllm wheel
FROM base AS build
ARG TARGETPLATFORM
COPY . .
RUN python3 use_existing_torch.py
RUN --mount=type=cache,target=/root/.cache/uv \
@ -197,20 +155,17 @@ ARG GIT_REPO_CHECK=0
RUN --mount=type=bind,source=.git,target=.git \
if [ "$GIT_REPO_CHECK" != "0" ]; then bash tools/check_repo.sh ; fi
# Max jobs used by Ninja to build extensions
ARG max_jobs=16
ENV MAX_JOBS=${max_jobs}
ARG nvcc_threads=4
ARG nvcc_threads=8
ENV NVCC_THREADS=$nvcc_threads
ARG torch_cuda_arch_list='8.0 8.6 8.9 9.0'
ENV TORCH_CUDA_ARCH_LIST=${torch_cuda_arch_list}
ARG USE_SCCACHE
ARG SCCACHE_BUCKET_NAME=vllm-build-sccache
ARG SCCACHE_REGION_NAME=us-west-2
ARG SCCACHE_S3_NO_CREDENTIALS=0
# if USE_SCCACHE is set, use sccache to speed up compilation
# Use sccache to speed up compilation
RUN --mount=type=cache,target=/root/.cache/uv \
--mount=type=bind,source=.git,target=.git \
if [ "$USE_SCCACHE" = "1" ]; then \
@ -235,6 +190,9 @@ RUN --mount=type=cache,target=/root/.cache/uv \
&& sccache --show-stats; \
fi
ARG torch_cuda_arch_list='8.0 8.6 8.9 9.0'
ENV TORCH_CUDA_ARCH_LIST=${torch_cuda_arch_list}
ARG vllm_target_device="cuda"
ENV VLLM_TARGET_DEVICE=${vllm_target_device}
ENV CCACHE_DIR=/root/.cache/ccache
@ -248,17 +206,10 @@ RUN --mount=type=cache,target=/root/.cache/ccache \
export VLLM_DOCKER_BUILD_CONTEXT=1 && \
python3 setup.py bdist_wheel --dist-dir=vllm-dist --py-limited-api=cp38; \
fi
RUN echo "[INFO] Listing current directory:" && \
ls -al && \
echo "[INFO] Showing torch_build_versions.txt content:" && \
cat torch_build_versions.txt
#################### WHEEL BUILD IMAGE ####################
################### VLLM INSTALLED IMAGE ####################
# Setup clean environment for vLLM for test and api server using ubuntu22.04 with AOT flashinfer
FROM ${FINAL_BASE_IMAGE} AS vllm-base
USER root
@ -266,7 +217,7 @@ ARG CUDA_VERSION
ARG PYTHON_VERSION
ARG GET_PIP_URL
# TODO (huydhn): Only work with PyTorch manylinux builder
# Only work with PyTorch manylinux builder
ENV PATH="/opt/python/cp312-cp312/bin:${PATH}"
# prepare for environment starts
@ -275,20 +226,19 @@ WORKDIR /workspace
# Install Python and other dependencies
RUN if command -v apt-get >/dev/null; then \
apt-get update -y \
&& apt-get install -y ccache software-properties-common git curl wget sudo vim \
&& add-apt-repository -y ppa:deadsnakes/ppa \
&& apt-get update -y \
&& apt-get install -y python${PYTHON_VERSION} python${PYTHON_VERSION}-dev python${PYTHON_VERSION}-venv \
&& update-alternatives --install /usr/bin/python3 python3 /usr/bin/python${PYTHON_VERSION} 1 \
&& update-alternatives --set python3 /usr/bin/python${PYTHON_VERSION} \
&& ln -sf /usr/bin/python${PYTHON_VERSION}-config /usr/bin/python3-config \
&& curl -sS ${GET_PIP_URL} | python${PYTHON_VERSION}; \
&& apt-get install -y ccache software-properties-common git sudo vim python3-pip; \
else \
dnf install -y git curl wget sudo; \
dnf install -y git wget sudo; \
fi \
&& curl -LsSf https://astral.sh/uv/install.sh | sh \
&& $HOME/.local/bin/uv venv /opt/venv --python ${PYTHON_VERSION} \
&& rm -f /usr/bin/python3 /usr/bin/python3-config /usr/bin/pip \
&& ln -s /opt/venv/bin/python3 /usr/bin/python3 \
&& ln -s /opt/venv/bin/python3-config /usr/bin/python3-config \
&& ln -s /opt/venv/bin/pip /usr/bin/pip \
&& python3 --version && python3 -m pip --version
# Get the torch versions, and whls used in previous stagtes for consistency
# Get the torch versions, and whls used in previous stage
COPY --from=base /workspace/torch_build_versions.txt ./torch_build_versions.txt
COPY --from=base /workspace/xformers-dist /wheels/xformers
COPY --from=build /workspace/vllm-dist /wheels/vllm
@ -297,33 +247,29 @@ RUN echo "[INFO] Listing current directory before torch install step:" && \
echo "[INFO] Showing torch_build_versions.txt content:" && \
cat torch_build_versions.txt
# Install build and runtime dependencies, this is needed for flashinfer install
COPY requirements/build.txt requirements/build.txt
COPY use_existing_torch.py use_existing_torch.py
RUN python3 use_existing_torch.py
RUN cat requirements/build.txt
# Install uv for faster pip installs if not existed
RUN --mount=type=cache,target=/root/.cache/uv \
if ! python3 -m uv --version > /dev/null 2>&1; then \
python3 -m pip install uv==0.8.4; \
fi
python3 -m pip install uv==0.8.4
ENV UV_HTTP_TIMEOUT=500
ENV UV_INDEX_STRATEGY="unsafe-best-match"
# Use copy mode to avoid hardlink failures with Docker cache mounts
ENV UV_LINK_MODE=copy
# Install build and runtime dependencies, this is needed for flashinfer install
COPY requirements/build.txt requirements/build.txt
COPY use_existing_torch.py use_existing_torch.py
RUN python3 use_existing_torch.py
RUN cat requirements/build.txt
RUN --mount=type=cache,target=/root/.cache/uv \
uv pip install --system -r requirements/build.txt
# Default mount file as placeholder, this just avoid the mount error
ARG TORCH_WHEELS_PATH="./requirements"
# Install torch, torchaudio and torchvision
# if TORCH_WHEELS_PATH is default "./requirements", it will pull the nightly versions using pip using torch_build_versions.txt
# otherwise, it will use the whls from TORCH_WHEELS_PATH from the host machine
# Install torch, torchaudio and torchvision. If TORCH_WHEELS_PATH is default
# to ./requirements, it will pull the nightly versions using pip. Otherwise,
# it will use the local wheels from TORCH_WHEELS_PATH
RUN --mount=type=bind,source=${TORCH_WHEELS_PATH},target=/dist \
--mount=type=cache,target=/root/.cache/uv \
if [ -n "$TORCH_WHEELS_PATH" ] && [ "$TORCH_WHEELS_PATH" != "./requirements" ] && [ -d "/dist" ] && ls /dist/torch*.whl >/dev/null 2>&1; then \
@ -344,18 +290,14 @@ RUN --mount=type=cache,target=/root/.cache/uv \
# Install xformers wheel from previous stage
RUN --mount=type=cache,target=/root/.cache/uv \
uv pip install --system /wheels/xformers/*.whl --verbose
# Build flashinfer from source.
# Build FlashInfer from source
ARG torch_cuda_arch_list='8.0;8.9;9.0a;10.0a;12.0'
# install package for build flashinfer
# see issue: https://github.com/flashinfer-ai/flashinfer/issues/738
RUN pip freeze | grep -E 'setuptools|packaging|build'
ENV TORCH_CUDA_ARCH_LIST=${torch_cuda_arch_list}
# Build flashinfer for torch nightly from source around 10 mins
ARG FLASHINFER_GIT_REPO="https://github.com/flashinfer-ai/flashinfer.git"
# Keep this in sync with https://github.com/vllm-project/vllm/blob/main/requirements/cuda.txt
ARG FLASHINFER_GIT_REF="v0.2.14.post1"
RUN --mount=type=cache,target=/root/.cache/uv \
git clone --depth 1 --recursive --shallow-submodules \
--branch ${FLASHINFER_GIT_REF} \
@ -367,7 +309,7 @@ RUN --mount=type=cache,target=/root/.cache/uv \
&& cd .. \
&& rm -rf flashinfer
# install flashinfer python
# Install FlashInfer
RUN --mount=type=cache,target=/root/.cache/uv \
uv pip install --system wheels/flashinfer/*.whl --verbose
@ -377,49 +319,6 @@ RUN uv pip freeze | grep -i '^torch\|^torchvision\|^torchaudio\|^xformers\|^vllm
################### VLLM INSTALLED IMAGE ####################
#################### UNITTEST IMAGE #############################
FROM vllm-base as test
ENV UV_HTTP_TIMEOUT=500
ENV UV_INDEX_STRATEGY="unsafe-best-match"
# Use copy mode to avoid hardlink failures with Docker cache mounts
ENV UV_LINK_MODE=copy
COPY tests/ tests/
COPY examples examples
COPY benchmarks benchmarks
COPY ./vllm/collect_env.py .
COPY requirements/common.txt requirements/common.txt
COPY use_existing_torch.py use_existing_torch.py
COPY pyproject.toml pyproject.toml
# Install build and runtime dependencies without stable torch version
COPY requirements/nightly_torch_test.txt requirements/nightly_torch_test.txt
RUN python3 use_existing_torch.py
# install packages
RUN --mount=type=cache,target=/root/.cache/uv \
uv pip install --system -r requirements/common.txt
# enable fast downloads from hf (for testing)
RUN --mount=type=cache,target=/root/.cache/uv \
uv pip install --system hf_transfer
ENV HF_HUB_ENABLE_HF_TRANSFER 1
# install development dependencies (for testing)
RUN --mount=type=cache,target=/root/.cache/uv \
uv pip install --system -e tests/vllm_test_utils
RUN --mount=type=cache,target=/root/.cache/uv \
uv pip install --system -r requirements/nightly_torch_test.txt
# Logging to confirm the torch versions
RUN pip freeze | grep -E 'torch|xformers|vllm|flashinfer'
# Logging to confirm all the packages are installed
RUN pip freeze
#################### UNITTEST IMAGE #############################
#################### EXPORT STAGE ####################
FROM scratch as export-wheels

View File

@ -3,6 +3,7 @@ ciflow_tracking_issue: 64124
ciflow_push_tags:
- ciflow/b200
- ciflow/b200-symm-mem
- ciflow/b200-distributed
- ciflow/binaries
- ciflow/binaries_libtorch
- ciflow/binaries_wheel
@ -15,7 +16,8 @@ ciflow_push_tags:
- ciflow/inductor-micro-benchmark
- ciflow/inductor-micro-benchmark-cpu-x86
- ciflow/inductor-perf-compare
- ciflow/inductor-perf-test-nightly-rocm
- ciflow/inductor-perf-test-nightly-rocm-mi300
- ciflow/inductor-perf-test-nightly-rocm-mi355
- ciflow/inductor-perf-test-nightly-x86-zen
- ciflow/inductor-periodic
- ciflow/inductor-rocm

View File

@ -512,6 +512,8 @@ def perform_misc_tasks(
"keep-going",
branch == MAIN_BRANCH
or bool(tag and re.match(r"^trunk/[a-f0-9]{40}$", tag))
# Pattern for tags created via manual run on HUD
or bool(tag and re.match(r"^ciflow/[^/]+/[a-f0-9]{40}$", tag))
or check_for_setting(labels, pr_body, "keep-going"),
)
set_output(

View File

@ -16,16 +16,18 @@ from typing import Optional
# NOTE: Please also update the CUDA sources in `PIP_SOURCES` in tools/nightly.py when changing this
CUDA_ARCHES = ["12.6", "12.8", "13.0"]
CUDA_ARCHES = ["12.6", "12.8", "12.9", "13.0"]
CUDA_STABLE = "12.8"
CUDA_ARCHES_FULL_VERSION = {
"12.6": "12.6.3",
"12.8": "12.8.1",
"12.9": "12.9.1",
"13.0": "13.0.0",
}
CUDA_ARCHES_CUDNN_VERSION = {
"12.6": "9",
"12.8": "9",
"12.9": "9",
"13.0": "9",
}
@ -38,7 +40,7 @@ CPU_AARCH64_ARCH = ["cpu-aarch64"]
CPU_S390X_ARCH = ["cpu-s390x"]
CUDA_AARCH64_ARCHES = ["12.6-aarch64", "12.8-aarch64", "13.0-aarch64"]
CUDA_AARCH64_ARCHES = ["12.6-aarch64", "12.8-aarch64", "12.9-aarch64", "13.0-aarch64"]
PYTORCH_EXTRA_INSTALL_REQUIREMENTS = {
@ -76,6 +78,23 @@ PYTORCH_EXTRA_INSTALL_REQUIREMENTS = {
"nvidia-nvjitlink-cu12==12.8.93; platform_system == 'Linux' | "
"nvidia-cufile-cu12==1.13.1.3; platform_system == 'Linux'"
),
"12.9": (
"nvidia-cuda-nvrtc-cu12==12.9.86; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-cuda-runtime-cu12==12.9.79; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-cuda-cupti-cu12==12.9.79; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-cublas-cu12==12.9.1.4; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-cufft-cu12==11.4.1.4; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-curand-cu12==10.3.10.19; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-cusolver-cu12==11.7.5.82; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-cusparse-cu12==12.5.10.65; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-nccl-cu12==2.27.5; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-nvshmem-cu12==3.3.20; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-nvtx-cu12==12.9.79; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-nvjitlink-cu12==12.9.86; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-cufile-cu12==1.14.1.1; platform_system == 'Linux' and platform_machine == 'x86_64'"
),
"13.0": (
"nvidia-cuda-nvrtc==13.0.48; platform_system == 'Linux' | "
"nvidia-cuda-runtime==13.0.48; platform_system == 'Linux' | "
@ -322,7 +341,7 @@ def generate_wheels_matrix(
# cuda linux wheels require PYTORCH_EXTRA_INSTALL_REQUIREMENTS to install
if (
arch_version in ["13.0", "12.8", "12.6"]
arch_version in ["13.0", "12.9", "12.8", "12.6"]
and os == "linux"
or arch_version in CUDA_AARCH64_ARCHES
):
@ -386,5 +405,6 @@ def generate_wheels_matrix(
validate_nccl_dep_consistency("13.0")
validate_nccl_dep_consistency("12.9")
validate_nccl_dep_consistency("12.8")
validate_nccl_dep_consistency("12.6")

View File

@ -2042,10 +2042,6 @@ def validate_revert(
f"[{', '.join(allowed_reverters)}], but instead is {author_association}."
)
# Raises exception if matching rule is not found, but ignores all status checks
find_matching_merge_rule(
pr, repo, skip_mandatory_checks=True, skip_internal_checks=True
)
commit_sha = get_pr_commit_sha(repo, pr)
return (author_login, commit_sha)

View File

@ -177,6 +177,9 @@ jobs:
runs-on: linux.rocm.gpu.mi250
timeout-minutes: !{{ common.timeout_minutes }}
!{{ upload.binary_env(config) }}
permissions:
id-token: write
contents: read
steps:
- name: Setup ROCm
uses: ./.github/actions/setup-rocm

View File

@ -72,7 +72,7 @@ jobs:
# Let's try to figure out how this can be improved
timeout-minutes: 360
- docs_type: python
runner: ${{ inputs.runner_prefix }}linux.2xlarge
runner: ${{ inputs.runner_prefix }}linux.c7i.2xlarge
# It takes less than 30m to finish python docs unless there are issues
timeout-minutes: 30
# Set a fixed name for this job instead of using the current matrix-generated name, i.e. build-docs (cpp, linux.12xlarge, 180)

View File

@ -102,19 +102,6 @@ jobs:
exit 1
fi
- name: configure aws credentials
id: aws_creds
uses: aws-actions/configure-aws-credentials@ececac1a45f3b08a01d2dd070d28d111c5fe6722 # v4.1.0
with:
role-to-assume: arn:aws:iam::308535385114:role/gha_workflow_s3_and_ecr_read_only
aws-region: us-east-1
role-duration-seconds: 18000
- name: Login to Amazon ECR
id: login-ecr
continue-on-error: true
uses: aws-actions/amazon-ecr-login@062b18b96a7aff071d4dc91bc00c4c1a7945b076 # v2.0.1
- name: Calculate docker image
id: calculate-docker-image
uses: pytorch/test-infra/.github/actions/calculate-docker-image@main

62
.github/workflows/b200-distributed.yml vendored Normal file
View File

@ -0,0 +1,62 @@
name: CI for distributed tests on B200
on:
pull_request:
paths:
- .github/workflows/b200-distributed.yml
workflow_dispatch:
push:
tags:
- ciflow/b200-distributed/*
schedule:
- cron: 46 8 * * * # about 1:46am PDT
concurrency:
group: ${{ github.workflow }}-${{ github.event.pull_request.number || github.sha }}-${{ github.event_name == 'workflow_dispatch' }}-${{ github.event_name == 'schedule' }}
cancel-in-progress: true
permissions:
id-token: write
contents: read
jobs:
get-label-type:
if: github.repository_owner == 'pytorch'
name: get-label-type
uses: pytorch/pytorch/.github/workflows/_runner-determinator.yml@main
with:
triggering_actor: ${{ github.triggering_actor }}
issue_owner: ${{ github.event.pull_request.user.login || github.event.issue.user.login }}
curr_branch: ${{ github.head_ref || github.ref_name }}
curr_ref_type: ${{ github.ref_type }}
linux-jammy-cuda12_8-py3_10-gcc11-build-distributed-b200:
name: linux-jammy-cuda12.8-py3.10-gcc11-build-distributed-b200
uses: ./.github/workflows/_linux-build.yml
needs: get-label-type
with:
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
runner: linux.12xlarge.memory
build-environment: linux-jammy-cuda12.8-py3.10-gcc11-distributed-b200
docker-image-name: ci-image:pytorch-linux-jammy-cuda12.8-cudnn9-py3-gcc11
cuda-arch-list: '10.0'
test-matrix: |
{ include: [
{ config: "distributed", shard: 1, num_shards: 2, runner: "linux.dgx.b200.8" },
{ config: "distributed", shard: 2, num_shards: 2, runner: "linux.dgx.b200.8" },
]}
secrets: inherit
linux-jammy-cuda12_8-py3_10-gcc11-test-distributed-b200:
name: linux-jammy-cuda12.8-py3.10-gcc11-test-b200
uses: ./.github/workflows/_linux-test.yml
needs:
- linux-jammy-cuda12_8-py3_10-gcc11-build-distributed-b200
with:
timeout-minutes: 1200
build-environment: linux-jammy-cuda12.8-py3.10-gcc11-distributed-b200
docker-image: ${{ needs.linux-jammy-cuda12_8-py3_10-gcc11-build-distributed-b200.outputs.docker-image }}
test-matrix: ${{ needs.linux-jammy-cuda12_8-py3_10-gcc11-build-distributed-b200.outputs.test-matrix }}
aws-role-to-assume: arn:aws:iam::308535385114:role/gha_workflow_s3_and_ecr_read_only
secrets: inherit

View File

@ -46,10 +46,12 @@ jobs:
fail-fast: false
matrix:
include: [
{ name: "manylinux2_28-builder", tag: "cuda13.0", runner: "linux.9xlarge.ephemeral" },
{ name: "manylinux2_28-builder", tag: "cuda13.0", runner: "linux.9xlarge.ephemeral" },
{ name: "manylinux2_28-builder", tag: "cuda12.8", runner: "linux.9xlarge.ephemeral" },
{ name: "manylinux2_28-builder", tag: "cuda12.9", runner: "linux.9xlarge.ephemeral" },
{ name: "manylinux2_28-builder", tag: "cuda12.6", runner: "linux.9xlarge.ephemeral" },
{ name: "manylinuxaarch64-builder", tag: "cuda13.0", runner: "linux.arm64.2xlarge.ephemeral" },
{ name: "manylinuxaarch64-builder", tag: "cuda12.9", runner: "linux.arm64.2xlarge.ephemeral" },
{ name: "manylinuxaarch64-builder", tag: "cuda12.8", runner: "linux.arm64.2xlarge.ephemeral" },
{ name: "manylinuxaarch64-builder", tag: "cuda12.6", runner: "linux.arm64.2xlarge.ephemeral" },
{ name: "manylinux2_28-builder", tag: "rocm6.4", runner: "linux.9xlarge.ephemeral" },

View File

@ -27,9 +27,8 @@ jobs:
fail-fast: false
matrix:
python-version: [ '3.12' ]
# TODO (huydhn): Add cu130 after https://github.com/vllm-project/vllm/issues/24464 is resolved
platform: [ 'manylinux_2_28_x86_64', 'manylinux_2_28_aarch64' ]
device: [ 'cu128', 'cu129' ]
device: [ 'cu128', 'cu129', 'cu130' ]
include:
- platform: manylinux_2_28_x86_64
device: cu128
@ -39,6 +38,10 @@ jobs:
device: cu129
manylinux-image: 'pytorch/manylinux2_28-builder:cuda12.9'
runner: linux.12xlarge.memory
- platform: manylinux_2_28_x86_64
device: cu130
manylinux-image: 'pytorch/manylinux2_28-builder:cuda13.0'
runner: linux.12xlarge.memory
- platform: manylinux_2_28_aarch64
device: cu128
manylinux-image: 'pytorch/manylinuxaarch64-builder:cuda12.8'
@ -47,6 +50,11 @@ jobs:
device: cu129
manylinux-image: 'pytorch/manylinuxaarch64-builder:cuda12.9'
runner: linux.arm64.r7g.12xlarge.memory
exclude:
# TODO (huydhn): Add cu130 aarch64 once PyTorch is on 2.9+ and
# xformers is update to support 13.0
- platform: manylinux_2_28_aarch64
device: cu130
name: "Build ${{ matrix.device }} vLLM wheel on ${{ matrix.platform }}"
runs-on: ${{ matrix.runner }}
timeout-minutes: 480
@ -169,7 +177,12 @@ jobs:
fail-fast: false
matrix:
platform: [ 'manylinux_2_28_x86_64', 'manylinux_2_28_aarch64' ]
device: [ 'cu128', 'cu129' ]
device: [ 'cu128', 'cu129', 'cu130' ]
exclude:
# TODO (huydhn): Add cu130 aarch64 once PyTorch is on 2.9+ and
# xformers is update to support 13.0
- platform: manylinux_2_28_aarch64
device: cu130
env:
PLATFORM: ${{ matrix.platform }}
BUILD_DEVICE: ${{ matrix.device }}

View File

@ -204,6 +204,52 @@ jobs:
github-token: ${{ secrets.GITHUB_TOKEN }}
uses: ./.github/workflows/_binary-upload.yml
manywheel-py3_10-cuda-aarch64-12_9-build:
if: ${{ github.repository_owner == 'pytorch' }}
uses: ./.github/workflows/_binary-build-linux.yml
needs: get-label-type
with:
PYTORCH_ROOT: /pytorch
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu129
GPU_ARCH_VERSION: "12.9-aarch64"
GPU_ARCH_TYPE: cuda-aarch64
DOCKER_IMAGE: manylinuxaarch64-builder
DOCKER_IMAGE_TAG_PREFIX: cuda12.9
DESIRED_PYTHON: "3.10"
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
runs_on: linux.arm64.r7g.12xlarge.memory
ALPINE_IMAGE: "arm64v8/alpine"
build_name: manywheel-py3_10-cuda-aarch64-12_9
build_environment: linux-aarch64-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.9.86; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.9.79; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.9.79; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.9.1.4; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.4.1.4; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.10.19; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.5.82; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.10.65; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.27.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvshmem-cu12==3.3.20; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.9.79; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.9.86; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.14.1.1; platform_system == 'Linux' and platform_machine == 'x86_64'
timeout-minutes: 420
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_10-cuda-aarch64-12_9-upload: # Uploading
if: ${{ github.repository_owner == 'pytorch' }}
permissions:
id-token: write
contents: read
needs: manywheel-py3_10-cuda-aarch64-12_9-build
with:
PYTORCH_ROOT: /pytorch
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu129
GPU_ARCH_VERSION: "12.9-aarch64"
GPU_ARCH_TYPE: cuda-aarch64
DOCKER_IMAGE: manylinuxaarch64-builder
DOCKER_IMAGE_TAG_PREFIX: cuda12.9
DESIRED_PYTHON: "3.10"
build_name: manywheel-py3_10-cuda-aarch64-12_9
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
uses: ./.github/workflows/_binary-upload.yml
manywheel-py3_10-cuda-aarch64-13_0-build:
if: ${{ github.repository_owner == 'pytorch' }}
uses: ./.github/workflows/_binary-build-linux.yml
@ -407,6 +453,52 @@ jobs:
github-token: ${{ secrets.GITHUB_TOKEN }}
uses: ./.github/workflows/_binary-upload.yml
manywheel-py3_11-cuda-aarch64-12_9-build:
if: ${{ github.repository_owner == 'pytorch' }}
uses: ./.github/workflows/_binary-build-linux.yml
needs: get-label-type
with:
PYTORCH_ROOT: /pytorch
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu129
GPU_ARCH_VERSION: "12.9-aarch64"
GPU_ARCH_TYPE: cuda-aarch64
DOCKER_IMAGE: manylinuxaarch64-builder
DOCKER_IMAGE_TAG_PREFIX: cuda12.9
DESIRED_PYTHON: "3.11"
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
runs_on: linux.arm64.r7g.12xlarge.memory
ALPINE_IMAGE: "arm64v8/alpine"
build_name: manywheel-py3_11-cuda-aarch64-12_9
build_environment: linux-aarch64-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.9.86; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.9.79; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.9.79; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.9.1.4; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.4.1.4; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.10.19; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.5.82; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.10.65; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.27.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvshmem-cu12==3.3.20; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.9.79; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.9.86; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.14.1.1; platform_system == 'Linux' and platform_machine == 'x86_64'
timeout-minutes: 420
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_11-cuda-aarch64-12_9-upload: # Uploading
if: ${{ github.repository_owner == 'pytorch' }}
permissions:
id-token: write
contents: read
needs: manywheel-py3_11-cuda-aarch64-12_9-build
with:
PYTORCH_ROOT: /pytorch
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu129
GPU_ARCH_VERSION: "12.9-aarch64"
GPU_ARCH_TYPE: cuda-aarch64
DOCKER_IMAGE: manylinuxaarch64-builder
DOCKER_IMAGE_TAG_PREFIX: cuda12.9
DESIRED_PYTHON: "3.11"
build_name: manywheel-py3_11-cuda-aarch64-12_9
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
uses: ./.github/workflows/_binary-upload.yml
manywheel-py3_11-cuda-aarch64-13_0-build:
if: ${{ github.repository_owner == 'pytorch' }}
uses: ./.github/workflows/_binary-build-linux.yml
@ -610,6 +702,52 @@ jobs:
github-token: ${{ secrets.GITHUB_TOKEN }}
uses: ./.github/workflows/_binary-upload.yml
manywheel-py3_12-cuda-aarch64-12_9-build:
if: ${{ github.repository_owner == 'pytorch' }}
uses: ./.github/workflows/_binary-build-linux.yml
needs: get-label-type
with:
PYTORCH_ROOT: /pytorch
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu129
GPU_ARCH_VERSION: "12.9-aarch64"
GPU_ARCH_TYPE: cuda-aarch64
DOCKER_IMAGE: manylinuxaarch64-builder
DOCKER_IMAGE_TAG_PREFIX: cuda12.9
DESIRED_PYTHON: "3.12"
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
runs_on: linux.arm64.r7g.12xlarge.memory
ALPINE_IMAGE: "arm64v8/alpine"
build_name: manywheel-py3_12-cuda-aarch64-12_9
build_environment: linux-aarch64-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.9.86; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.9.79; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.9.79; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.9.1.4; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.4.1.4; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.10.19; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.5.82; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.10.65; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.27.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvshmem-cu12==3.3.20; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.9.79; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.9.86; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.14.1.1; platform_system == 'Linux' and platform_machine == 'x86_64'
timeout-minutes: 420
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_12-cuda-aarch64-12_9-upload: # Uploading
if: ${{ github.repository_owner == 'pytorch' }}
permissions:
id-token: write
contents: read
needs: manywheel-py3_12-cuda-aarch64-12_9-build
with:
PYTORCH_ROOT: /pytorch
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu129
GPU_ARCH_VERSION: "12.9-aarch64"
GPU_ARCH_TYPE: cuda-aarch64
DOCKER_IMAGE: manylinuxaarch64-builder
DOCKER_IMAGE_TAG_PREFIX: cuda12.9
DESIRED_PYTHON: "3.12"
build_name: manywheel-py3_12-cuda-aarch64-12_9
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
uses: ./.github/workflows/_binary-upload.yml
manywheel-py3_12-cuda-aarch64-13_0-build:
if: ${{ github.repository_owner == 'pytorch' }}
uses: ./.github/workflows/_binary-build-linux.yml
@ -813,6 +951,52 @@ jobs:
github-token: ${{ secrets.GITHUB_TOKEN }}
uses: ./.github/workflows/_binary-upload.yml
manywheel-py3_13-cuda-aarch64-12_9-build:
if: ${{ github.repository_owner == 'pytorch' }}
uses: ./.github/workflows/_binary-build-linux.yml
needs: get-label-type
with:
PYTORCH_ROOT: /pytorch
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu129
GPU_ARCH_VERSION: "12.9-aarch64"
GPU_ARCH_TYPE: cuda-aarch64
DOCKER_IMAGE: manylinuxaarch64-builder
DOCKER_IMAGE_TAG_PREFIX: cuda12.9
DESIRED_PYTHON: "3.13"
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
runs_on: linux.arm64.r7g.12xlarge.memory
ALPINE_IMAGE: "arm64v8/alpine"
build_name: manywheel-py3_13-cuda-aarch64-12_9
build_environment: linux-aarch64-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.9.86; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.9.79; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.9.79; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.9.1.4; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.4.1.4; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.10.19; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.5.82; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.10.65; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.27.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvshmem-cu12==3.3.20; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.9.79; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.9.86; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.14.1.1; platform_system == 'Linux' and platform_machine == 'x86_64'
timeout-minutes: 420
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_13-cuda-aarch64-12_9-upload: # Uploading
if: ${{ github.repository_owner == 'pytorch' }}
permissions:
id-token: write
contents: read
needs: manywheel-py3_13-cuda-aarch64-12_9-build
with:
PYTORCH_ROOT: /pytorch
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu129
GPU_ARCH_VERSION: "12.9-aarch64"
GPU_ARCH_TYPE: cuda-aarch64
DOCKER_IMAGE: manylinuxaarch64-builder
DOCKER_IMAGE_TAG_PREFIX: cuda12.9
DESIRED_PYTHON: "3.13"
build_name: manywheel-py3_13-cuda-aarch64-12_9
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
uses: ./.github/workflows/_binary-upload.yml
manywheel-py3_13-cuda-aarch64-13_0-build:
if: ${{ github.repository_owner == 'pytorch' }}
uses: ./.github/workflows/_binary-build-linux.yml
@ -1016,6 +1200,52 @@ jobs:
github-token: ${{ secrets.GITHUB_TOKEN }}
uses: ./.github/workflows/_binary-upload.yml
manywheel-py3_13t-cuda-aarch64-12_9-build:
if: ${{ github.repository_owner == 'pytorch' }}
uses: ./.github/workflows/_binary-build-linux.yml
needs: get-label-type
with:
PYTORCH_ROOT: /pytorch
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu129
GPU_ARCH_VERSION: "12.9-aarch64"
GPU_ARCH_TYPE: cuda-aarch64
DOCKER_IMAGE: manylinuxaarch64-builder
DOCKER_IMAGE_TAG_PREFIX: cuda12.9
DESIRED_PYTHON: "3.13t"
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
runs_on: linux.arm64.r7g.12xlarge.memory
ALPINE_IMAGE: "arm64v8/alpine"
build_name: manywheel-py3_13t-cuda-aarch64-12_9
build_environment: linux-aarch64-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.9.86; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.9.79; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.9.79; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.9.1.4; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.4.1.4; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.10.19; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.5.82; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.10.65; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.27.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvshmem-cu12==3.3.20; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.9.79; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.9.86; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.14.1.1; platform_system == 'Linux' and platform_machine == 'x86_64'
timeout-minutes: 420
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_13t-cuda-aarch64-12_9-upload: # Uploading
if: ${{ github.repository_owner == 'pytorch' }}
permissions:
id-token: write
contents: read
needs: manywheel-py3_13t-cuda-aarch64-12_9-build
with:
PYTORCH_ROOT: /pytorch
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu129
GPU_ARCH_VERSION: "12.9-aarch64"
GPU_ARCH_TYPE: cuda-aarch64
DOCKER_IMAGE: manylinuxaarch64-builder
DOCKER_IMAGE_TAG_PREFIX: cuda12.9
DESIRED_PYTHON: "3.13t"
build_name: manywheel-py3_13t-cuda-aarch64-12_9
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
uses: ./.github/workflows/_binary-upload.yml
manywheel-py3_13t-cuda-aarch64-13_0-build:
if: ${{ github.repository_owner == 'pytorch' }}
uses: ./.github/workflows/_binary-build-linux.yml
@ -1219,6 +1449,52 @@ jobs:
github-token: ${{ secrets.GITHUB_TOKEN }}
uses: ./.github/workflows/_binary-upload.yml
manywheel-py3_14-cuda-aarch64-12_9-build:
if: ${{ github.repository_owner == 'pytorch' }}
uses: ./.github/workflows/_binary-build-linux.yml
needs: get-label-type
with:
PYTORCH_ROOT: /pytorch
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu129
GPU_ARCH_VERSION: "12.9-aarch64"
GPU_ARCH_TYPE: cuda-aarch64
DOCKER_IMAGE: manylinuxaarch64-builder
DOCKER_IMAGE_TAG_PREFIX: cuda12.9
DESIRED_PYTHON: "3.14"
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
runs_on: linux.arm64.r7g.12xlarge.memory
ALPINE_IMAGE: "arm64v8/alpine"
build_name: manywheel-py3_14-cuda-aarch64-12_9
build_environment: linux-aarch64-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.9.86; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.9.79; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.9.79; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.9.1.4; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.4.1.4; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.10.19; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.5.82; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.10.65; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.27.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvshmem-cu12==3.3.20; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.9.79; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.9.86; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.14.1.1; platform_system == 'Linux' and platform_machine == 'x86_64'
timeout-minutes: 420
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_14-cuda-aarch64-12_9-upload: # Uploading
if: ${{ github.repository_owner == 'pytorch' }}
permissions:
id-token: write
contents: read
needs: manywheel-py3_14-cuda-aarch64-12_9-build
with:
PYTORCH_ROOT: /pytorch
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu129
GPU_ARCH_VERSION: "12.9-aarch64"
GPU_ARCH_TYPE: cuda-aarch64
DOCKER_IMAGE: manylinuxaarch64-builder
DOCKER_IMAGE_TAG_PREFIX: cuda12.9
DESIRED_PYTHON: "3.14"
build_name: manywheel-py3_14-cuda-aarch64-12_9
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
uses: ./.github/workflows/_binary-upload.yml
manywheel-py3_14-cuda-aarch64-13_0-build:
if: ${{ github.repository_owner == 'pytorch' }}
uses: ./.github/workflows/_binary-build-linux.yml
@ -1422,6 +1698,52 @@ jobs:
github-token: ${{ secrets.GITHUB_TOKEN }}
uses: ./.github/workflows/_binary-upload.yml
manywheel-py3_14t-cuda-aarch64-12_9-build:
if: ${{ github.repository_owner == 'pytorch' }}
uses: ./.github/workflows/_binary-build-linux.yml
needs: get-label-type
with:
PYTORCH_ROOT: /pytorch
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu129
GPU_ARCH_VERSION: "12.9-aarch64"
GPU_ARCH_TYPE: cuda-aarch64
DOCKER_IMAGE: manylinuxaarch64-builder
DOCKER_IMAGE_TAG_PREFIX: cuda12.9
DESIRED_PYTHON: "3.14t"
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
runs_on: linux.arm64.r7g.12xlarge.memory
ALPINE_IMAGE: "arm64v8/alpine"
build_name: manywheel-py3_14t-cuda-aarch64-12_9
build_environment: linux-aarch64-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.9.86; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.9.79; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.9.79; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.9.1.4; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.4.1.4; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.10.19; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.5.82; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.10.65; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.27.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvshmem-cu12==3.3.20; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.9.79; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.9.86; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.14.1.1; platform_system == 'Linux' and platform_machine == 'x86_64'
timeout-minutes: 420
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_14t-cuda-aarch64-12_9-upload: # Uploading
if: ${{ github.repository_owner == 'pytorch' }}
permissions:
id-token: write
contents: read
needs: manywheel-py3_14t-cuda-aarch64-12_9-build
with:
PYTORCH_ROOT: /pytorch
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu129
GPU_ARCH_VERSION: "12.9-aarch64"
GPU_ARCH_TYPE: cuda-aarch64
DOCKER_IMAGE: manylinuxaarch64-builder
DOCKER_IMAGE_TAG_PREFIX: cuda12.9
DESIRED_PYTHON: "3.14t"
build_name: manywheel-py3_14t-cuda-aarch64-12_9
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
uses: ./.github/workflows/_binary-upload.yml
manywheel-py3_14t-cuda-aarch64-13_0-build:
if: ${{ github.repository_owner == 'pytorch' }}
uses: ./.github/workflows/_binary-build-linux.yml

View File

@ -248,6 +248,74 @@ jobs:
github-token: ${{ secrets.GITHUB_TOKEN }}
uses: ./.github/workflows/_binary-upload.yml
libtorch-cuda12_9-shared-with-deps-release-build:
if: ${{ github.repository_owner == 'pytorch' }}
uses: ./.github/workflows/_binary-build-linux.yml
needs: get-label-type
with:
PYTORCH_ROOT: /pytorch
PACKAGE_TYPE: libtorch
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu129
GPU_ARCH_VERSION: "12.9"
GPU_ARCH_TYPE: cuda
DOCKER_IMAGE: libtorch-cxx11-builder
DOCKER_IMAGE_TAG_PREFIX: cuda12.9
LIBTORCH_CONFIG: release
LIBTORCH_VARIANT: shared-with-deps
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
build_name: libtorch-cuda12_9-shared-with-deps-release
build_environment: linux-binary-libtorch
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
libtorch-cuda12_9-shared-with-deps-release-test: # Testing
if: ${{ github.repository_owner == 'pytorch' }}
needs:
- libtorch-cuda12_9-shared-with-deps-release-build
- get-label-type
uses: ./.github/workflows/_binary-test-linux.yml
with:
PYTORCH_ROOT: /pytorch
PACKAGE_TYPE: libtorch
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu129
GPU_ARCH_VERSION: "12.9"
GPU_ARCH_TYPE: cuda
DOCKER_IMAGE: libtorch-cxx11-builder
DOCKER_IMAGE_TAG_PREFIX: cuda12.9
LIBTORCH_CONFIG: release
LIBTORCH_VARIANT: shared-with-deps
build_name: libtorch-cuda12_9-shared-with-deps-release
build_environment: linux-binary-libtorch
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
runs_on: linux.g4dn.4xlarge.nvidia.gpu # 12.8+ builds need sm_70+ runner
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
libtorch-cuda12_9-shared-with-deps-release-upload: # Uploading
if: ${{ github.repository_owner == 'pytorch' }}
permissions:
id-token: write
contents: read
needs: libtorch-cuda12_9-shared-with-deps-release-test
with:
PYTORCH_ROOT: /pytorch
PACKAGE_TYPE: libtorch
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu129
GPU_ARCH_VERSION: "12.9"
GPU_ARCH_TYPE: cuda
DOCKER_IMAGE: libtorch-cxx11-builder
DOCKER_IMAGE_TAG_PREFIX: cuda12.9
LIBTORCH_CONFIG: release
LIBTORCH_VARIANT: shared-with-deps
build_name: libtorch-cuda12_9-shared-with-deps-release
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
uses: ./.github/workflows/_binary-upload.yml
libtorch-cuda13_0-shared-with-deps-release-build:
if: ${{ github.repository_owner == 'pytorch' }}
uses: ./.github/workflows/_binary-build-linux.yml
@ -358,6 +426,9 @@ jobs:
DOCKER_IMAGE_TAG_PREFIX: rocm6.4
LIBTORCH_CONFIG: release
LIBTORCH_VARIANT: shared-with-deps
permissions:
id-token: write
contents: read
steps:
- name: Setup ROCm
uses: ./.github/actions/setup-rocm
@ -473,6 +544,9 @@ jobs:
DOCKER_IMAGE_TAG_PREFIX: rocm7.0
LIBTORCH_CONFIG: release
LIBTORCH_VARIANT: shared-with-deps
permissions:
id-token: write
contents: read
steps:
- name: Setup ROCm
uses: ./.github/actions/setup-rocm

View File

@ -241,6 +241,72 @@ jobs:
github-token: ${{ secrets.GITHUB_TOKEN }}
uses: ./.github/workflows/_binary-upload.yml
manywheel-py3_10-cuda12_9-build:
if: ${{ github.repository_owner == 'pytorch' }}
uses: ./.github/workflows/_binary-build-linux.yml
needs: get-label-type
with:
PYTORCH_ROOT: /pytorch
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu129
GPU_ARCH_VERSION: "12.9"
GPU_ARCH_TYPE: cuda
DOCKER_IMAGE: manylinux2_28-builder
DOCKER_IMAGE_TAG_PREFIX: cuda12.9
DESIRED_PYTHON: "3.10"
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
build_name: manywheel-py3_10-cuda12_9
build_environment: linux-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.9.86; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.9.79; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.9.79; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.9.1.4; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.4.1.4; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.10.19; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.5.82; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.10.65; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.27.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvshmem-cu12==3.3.20; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.9.79; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.9.86; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.14.1.1; platform_system == 'Linux' and platform_machine == 'x86_64'
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_10-cuda12_9-test: # Testing
if: ${{ github.repository_owner == 'pytorch' }}
needs:
- manywheel-py3_10-cuda12_9-build
- get-label-type
uses: ./.github/workflows/_binary-test-linux.yml
with:
PYTORCH_ROOT: /pytorch
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu129
GPU_ARCH_VERSION: "12.9"
GPU_ARCH_TYPE: cuda
DOCKER_IMAGE: manylinux2_28-builder
DOCKER_IMAGE_TAG_PREFIX: cuda12.9
DESIRED_PYTHON: "3.10"
build_name: manywheel-py3_10-cuda12_9
build_environment: linux-binary-manywheel
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
runs_on: linux.g4dn.4xlarge.nvidia.gpu # 12.8+ builds need sm_70+ runner
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_10-cuda12_9-upload: # Uploading
if: ${{ github.repository_owner == 'pytorch' }}
permissions:
id-token: write
contents: read
needs: manywheel-py3_10-cuda12_9-test
with:
PYTORCH_ROOT: /pytorch
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu129
GPU_ARCH_VERSION: "12.9"
GPU_ARCH_TYPE: cuda
DOCKER_IMAGE: manylinux2_28-builder
DOCKER_IMAGE_TAG_PREFIX: cuda12.9
DESIRED_PYTHON: "3.10"
build_name: manywheel-py3_10-cuda12_9
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
uses: ./.github/workflows/_binary-upload.yml
manywheel-py3_10-cuda13_0-build:
if: ${{ github.repository_owner == 'pytorch' }}
uses: ./.github/workflows/_binary-build-linux.yml
@ -347,6 +413,9 @@ jobs:
DOCKER_IMAGE: manylinux2_28-builder
DOCKER_IMAGE_TAG_PREFIX: rocm6.4
DESIRED_PYTHON: "3.10"
permissions:
id-token: write
contents: read
steps:
- name: Setup ROCm
uses: ./.github/actions/setup-rocm
@ -459,6 +528,9 @@ jobs:
DOCKER_IMAGE: manylinux2_28-builder
DOCKER_IMAGE_TAG_PREFIX: rocm7.0
DESIRED_PYTHON: "3.10"
permissions:
id-token: write
contents: read
steps:
- name: Setup ROCm
uses: ./.github/actions/setup-rocm
@ -835,6 +907,72 @@ jobs:
github-token: ${{ secrets.GITHUB_TOKEN }}
uses: ./.github/workflows/_binary-upload.yml
manywheel-py3_11-cuda12_9-build:
if: ${{ github.repository_owner == 'pytorch' }}
uses: ./.github/workflows/_binary-build-linux.yml
needs: get-label-type
with:
PYTORCH_ROOT: /pytorch
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu129
GPU_ARCH_VERSION: "12.9"
GPU_ARCH_TYPE: cuda
DOCKER_IMAGE: manylinux2_28-builder
DOCKER_IMAGE_TAG_PREFIX: cuda12.9
DESIRED_PYTHON: "3.11"
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
build_name: manywheel-py3_11-cuda12_9
build_environment: linux-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.9.86; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.9.79; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.9.79; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.9.1.4; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.4.1.4; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.10.19; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.5.82; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.10.65; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.27.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvshmem-cu12==3.3.20; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.9.79; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.9.86; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.14.1.1; platform_system == 'Linux' and platform_machine == 'x86_64'
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_11-cuda12_9-test: # Testing
if: ${{ github.repository_owner == 'pytorch' }}
needs:
- manywheel-py3_11-cuda12_9-build
- get-label-type
uses: ./.github/workflows/_binary-test-linux.yml
with:
PYTORCH_ROOT: /pytorch
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu129
GPU_ARCH_VERSION: "12.9"
GPU_ARCH_TYPE: cuda
DOCKER_IMAGE: manylinux2_28-builder
DOCKER_IMAGE_TAG_PREFIX: cuda12.9
DESIRED_PYTHON: "3.11"
build_name: manywheel-py3_11-cuda12_9
build_environment: linux-binary-manywheel
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
runs_on: linux.g4dn.4xlarge.nvidia.gpu # 12.8+ builds need sm_70+ runner
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_11-cuda12_9-upload: # Uploading
if: ${{ github.repository_owner == 'pytorch' }}
permissions:
id-token: write
contents: read
needs: manywheel-py3_11-cuda12_9-test
with:
PYTORCH_ROOT: /pytorch
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu129
GPU_ARCH_VERSION: "12.9"
GPU_ARCH_TYPE: cuda
DOCKER_IMAGE: manylinux2_28-builder
DOCKER_IMAGE_TAG_PREFIX: cuda12.9
DESIRED_PYTHON: "3.11"
build_name: manywheel-py3_11-cuda12_9
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
uses: ./.github/workflows/_binary-upload.yml
manywheel-py3_11-cuda13_0-build:
if: ${{ github.repository_owner == 'pytorch' }}
uses: ./.github/workflows/_binary-build-linux.yml
@ -941,6 +1079,9 @@ jobs:
DOCKER_IMAGE: manylinux2_28-builder
DOCKER_IMAGE_TAG_PREFIX: rocm6.4
DESIRED_PYTHON: "3.11"
permissions:
id-token: write
contents: read
steps:
- name: Setup ROCm
uses: ./.github/actions/setup-rocm
@ -1053,6 +1194,9 @@ jobs:
DOCKER_IMAGE: manylinux2_28-builder
DOCKER_IMAGE_TAG_PREFIX: rocm7.0
DESIRED_PYTHON: "3.11"
permissions:
id-token: write
contents: read
steps:
- name: Setup ROCm
uses: ./.github/actions/setup-rocm
@ -1429,6 +1573,72 @@ jobs:
github-token: ${{ secrets.GITHUB_TOKEN }}
uses: ./.github/workflows/_binary-upload.yml
manywheel-py3_12-cuda12_9-build:
if: ${{ github.repository_owner == 'pytorch' }}
uses: ./.github/workflows/_binary-build-linux.yml
needs: get-label-type
with:
PYTORCH_ROOT: /pytorch
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu129
GPU_ARCH_VERSION: "12.9"
GPU_ARCH_TYPE: cuda
DOCKER_IMAGE: manylinux2_28-builder
DOCKER_IMAGE_TAG_PREFIX: cuda12.9
DESIRED_PYTHON: "3.12"
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
build_name: manywheel-py3_12-cuda12_9
build_environment: linux-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.9.86; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.9.79; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.9.79; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.9.1.4; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.4.1.4; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.10.19; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.5.82; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.10.65; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.27.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvshmem-cu12==3.3.20; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.9.79; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.9.86; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.14.1.1; platform_system == 'Linux' and platform_machine == 'x86_64'
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_12-cuda12_9-test: # Testing
if: ${{ github.repository_owner == 'pytorch' }}
needs:
- manywheel-py3_12-cuda12_9-build
- get-label-type
uses: ./.github/workflows/_binary-test-linux.yml
with:
PYTORCH_ROOT: /pytorch
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu129
GPU_ARCH_VERSION: "12.9"
GPU_ARCH_TYPE: cuda
DOCKER_IMAGE: manylinux2_28-builder
DOCKER_IMAGE_TAG_PREFIX: cuda12.9
DESIRED_PYTHON: "3.12"
build_name: manywheel-py3_12-cuda12_9
build_environment: linux-binary-manywheel
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
runs_on: linux.g4dn.4xlarge.nvidia.gpu # 12.8+ builds need sm_70+ runner
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_12-cuda12_9-upload: # Uploading
if: ${{ github.repository_owner == 'pytorch' }}
permissions:
id-token: write
contents: read
needs: manywheel-py3_12-cuda12_9-test
with:
PYTORCH_ROOT: /pytorch
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu129
GPU_ARCH_VERSION: "12.9"
GPU_ARCH_TYPE: cuda
DOCKER_IMAGE: manylinux2_28-builder
DOCKER_IMAGE_TAG_PREFIX: cuda12.9
DESIRED_PYTHON: "3.12"
build_name: manywheel-py3_12-cuda12_9
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
uses: ./.github/workflows/_binary-upload.yml
manywheel-py3_12-cuda13_0-build:
if: ${{ github.repository_owner == 'pytorch' }}
uses: ./.github/workflows/_binary-build-linux.yml
@ -1535,6 +1745,9 @@ jobs:
DOCKER_IMAGE: manylinux2_28-builder
DOCKER_IMAGE_TAG_PREFIX: rocm6.4
DESIRED_PYTHON: "3.12"
permissions:
id-token: write
contents: read
steps:
- name: Setup ROCm
uses: ./.github/actions/setup-rocm
@ -1647,6 +1860,9 @@ jobs:
DOCKER_IMAGE: manylinux2_28-builder
DOCKER_IMAGE_TAG_PREFIX: rocm7.0
DESIRED_PYTHON: "3.12"
permissions:
id-token: write
contents: read
steps:
- name: Setup ROCm
uses: ./.github/actions/setup-rocm
@ -2023,6 +2239,72 @@ jobs:
github-token: ${{ secrets.GITHUB_TOKEN }}
uses: ./.github/workflows/_binary-upload.yml
manywheel-py3_13-cuda12_9-build:
if: ${{ github.repository_owner == 'pytorch' }}
uses: ./.github/workflows/_binary-build-linux.yml
needs: get-label-type
with:
PYTORCH_ROOT: /pytorch
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu129
GPU_ARCH_VERSION: "12.9"
GPU_ARCH_TYPE: cuda
DOCKER_IMAGE: manylinux2_28-builder
DOCKER_IMAGE_TAG_PREFIX: cuda12.9
DESIRED_PYTHON: "3.13"
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
build_name: manywheel-py3_13-cuda12_9
build_environment: linux-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.9.86; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.9.79; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.9.79; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.9.1.4; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.4.1.4; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.10.19; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.5.82; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.10.65; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.27.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvshmem-cu12==3.3.20; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.9.79; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.9.86; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.14.1.1; platform_system == 'Linux' and platform_machine == 'x86_64'
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_13-cuda12_9-test: # Testing
if: ${{ github.repository_owner == 'pytorch' }}
needs:
- manywheel-py3_13-cuda12_9-build
- get-label-type
uses: ./.github/workflows/_binary-test-linux.yml
with:
PYTORCH_ROOT: /pytorch
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu129
GPU_ARCH_VERSION: "12.9"
GPU_ARCH_TYPE: cuda
DOCKER_IMAGE: manylinux2_28-builder
DOCKER_IMAGE_TAG_PREFIX: cuda12.9
DESIRED_PYTHON: "3.13"
build_name: manywheel-py3_13-cuda12_9
build_environment: linux-binary-manywheel
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
runs_on: linux.g4dn.4xlarge.nvidia.gpu # 12.8+ builds need sm_70+ runner
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_13-cuda12_9-upload: # Uploading
if: ${{ github.repository_owner == 'pytorch' }}
permissions:
id-token: write
contents: read
needs: manywheel-py3_13-cuda12_9-test
with:
PYTORCH_ROOT: /pytorch
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu129
GPU_ARCH_VERSION: "12.9"
GPU_ARCH_TYPE: cuda
DOCKER_IMAGE: manylinux2_28-builder
DOCKER_IMAGE_TAG_PREFIX: cuda12.9
DESIRED_PYTHON: "3.13"
build_name: manywheel-py3_13-cuda12_9
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
uses: ./.github/workflows/_binary-upload.yml
manywheel-py3_13-cuda13_0-build:
if: ${{ github.repository_owner == 'pytorch' }}
uses: ./.github/workflows/_binary-build-linux.yml
@ -2129,6 +2411,9 @@ jobs:
DOCKER_IMAGE: manylinux2_28-builder
DOCKER_IMAGE_TAG_PREFIX: rocm6.4
DESIRED_PYTHON: "3.13"
permissions:
id-token: write
contents: read
steps:
- name: Setup ROCm
uses: ./.github/actions/setup-rocm
@ -2241,6 +2526,9 @@ jobs:
DOCKER_IMAGE: manylinux2_28-builder
DOCKER_IMAGE_TAG_PREFIX: rocm7.0
DESIRED_PYTHON: "3.13"
permissions:
id-token: write
contents: read
steps:
- name: Setup ROCm
uses: ./.github/actions/setup-rocm
@ -2617,6 +2905,72 @@ jobs:
github-token: ${{ secrets.GITHUB_TOKEN }}
uses: ./.github/workflows/_binary-upload.yml
manywheel-py3_13t-cuda12_9-build:
if: ${{ github.repository_owner == 'pytorch' }}
uses: ./.github/workflows/_binary-build-linux.yml
needs: get-label-type
with:
PYTORCH_ROOT: /pytorch
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu129
GPU_ARCH_VERSION: "12.9"
GPU_ARCH_TYPE: cuda
DOCKER_IMAGE: manylinux2_28-builder
DOCKER_IMAGE_TAG_PREFIX: cuda12.9
DESIRED_PYTHON: "3.13t"
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
build_name: manywheel-py3_13t-cuda12_9
build_environment: linux-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.9.86; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.9.79; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.9.79; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.9.1.4; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.4.1.4; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.10.19; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.5.82; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.10.65; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.27.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvshmem-cu12==3.3.20; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.9.79; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.9.86; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.14.1.1; platform_system == 'Linux' and platform_machine == 'x86_64'
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_13t-cuda12_9-test: # Testing
if: ${{ github.repository_owner == 'pytorch' }}
needs:
- manywheel-py3_13t-cuda12_9-build
- get-label-type
uses: ./.github/workflows/_binary-test-linux.yml
with:
PYTORCH_ROOT: /pytorch
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu129
GPU_ARCH_VERSION: "12.9"
GPU_ARCH_TYPE: cuda
DOCKER_IMAGE: manylinux2_28-builder
DOCKER_IMAGE_TAG_PREFIX: cuda12.9
DESIRED_PYTHON: "3.13t"
build_name: manywheel-py3_13t-cuda12_9
build_environment: linux-binary-manywheel
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
runs_on: linux.g4dn.4xlarge.nvidia.gpu # 12.8+ builds need sm_70+ runner
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_13t-cuda12_9-upload: # Uploading
if: ${{ github.repository_owner == 'pytorch' }}
permissions:
id-token: write
contents: read
needs: manywheel-py3_13t-cuda12_9-test
with:
PYTORCH_ROOT: /pytorch
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu129
GPU_ARCH_VERSION: "12.9"
GPU_ARCH_TYPE: cuda
DOCKER_IMAGE: manylinux2_28-builder
DOCKER_IMAGE_TAG_PREFIX: cuda12.9
DESIRED_PYTHON: "3.13t"
build_name: manywheel-py3_13t-cuda12_9
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
uses: ./.github/workflows/_binary-upload.yml
manywheel-py3_13t-cuda13_0-build:
if: ${{ github.repository_owner == 'pytorch' }}
uses: ./.github/workflows/_binary-build-linux.yml
@ -2723,6 +3077,9 @@ jobs:
DOCKER_IMAGE: manylinux2_28-builder
DOCKER_IMAGE_TAG_PREFIX: rocm6.4
DESIRED_PYTHON: "3.13t"
permissions:
id-token: write
contents: read
steps:
- name: Setup ROCm
uses: ./.github/actions/setup-rocm
@ -2835,6 +3192,9 @@ jobs:
DOCKER_IMAGE: manylinux2_28-builder
DOCKER_IMAGE_TAG_PREFIX: rocm7.0
DESIRED_PYTHON: "3.13t"
permissions:
id-token: write
contents: read
steps:
- name: Setup ROCm
uses: ./.github/actions/setup-rocm
@ -3211,6 +3571,72 @@ jobs:
github-token: ${{ secrets.GITHUB_TOKEN }}
uses: ./.github/workflows/_binary-upload.yml
manywheel-py3_14-cuda12_9-build:
if: ${{ github.repository_owner == 'pytorch' }}
uses: ./.github/workflows/_binary-build-linux.yml
needs: get-label-type
with:
PYTORCH_ROOT: /pytorch
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu129
GPU_ARCH_VERSION: "12.9"
GPU_ARCH_TYPE: cuda
DOCKER_IMAGE: manylinux2_28-builder
DOCKER_IMAGE_TAG_PREFIX: cuda12.9
DESIRED_PYTHON: "3.14"
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
build_name: manywheel-py3_14-cuda12_9
build_environment: linux-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.9.86; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.9.79; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.9.79; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.9.1.4; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.4.1.4; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.10.19; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.5.82; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.10.65; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.27.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvshmem-cu12==3.3.20; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.9.79; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.9.86; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.14.1.1; platform_system == 'Linux' and platform_machine == 'x86_64'
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_14-cuda12_9-test: # Testing
if: ${{ github.repository_owner == 'pytorch' }}
needs:
- manywheel-py3_14-cuda12_9-build
- get-label-type
uses: ./.github/workflows/_binary-test-linux.yml
with:
PYTORCH_ROOT: /pytorch
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu129
GPU_ARCH_VERSION: "12.9"
GPU_ARCH_TYPE: cuda
DOCKER_IMAGE: manylinux2_28-builder
DOCKER_IMAGE_TAG_PREFIX: cuda12.9
DESIRED_PYTHON: "3.14"
build_name: manywheel-py3_14-cuda12_9
build_environment: linux-binary-manywheel
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
runs_on: linux.g4dn.4xlarge.nvidia.gpu # 12.8+ builds need sm_70+ runner
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_14-cuda12_9-upload: # Uploading
if: ${{ github.repository_owner == 'pytorch' }}
permissions:
id-token: write
contents: read
needs: manywheel-py3_14-cuda12_9-test
with:
PYTORCH_ROOT: /pytorch
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu129
GPU_ARCH_VERSION: "12.9"
GPU_ARCH_TYPE: cuda
DOCKER_IMAGE: manylinux2_28-builder
DOCKER_IMAGE_TAG_PREFIX: cuda12.9
DESIRED_PYTHON: "3.14"
build_name: manywheel-py3_14-cuda12_9
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
uses: ./.github/workflows/_binary-upload.yml
manywheel-py3_14-cuda13_0-build:
if: ${{ github.repository_owner == 'pytorch' }}
uses: ./.github/workflows/_binary-build-linux.yml
@ -3317,6 +3743,9 @@ jobs:
DOCKER_IMAGE: manylinux2_28-builder
DOCKER_IMAGE_TAG_PREFIX: rocm6.4
DESIRED_PYTHON: "3.14"
permissions:
id-token: write
contents: read
steps:
- name: Setup ROCm
uses: ./.github/actions/setup-rocm
@ -3429,6 +3858,9 @@ jobs:
DOCKER_IMAGE: manylinux2_28-builder
DOCKER_IMAGE_TAG_PREFIX: rocm7.0
DESIRED_PYTHON: "3.14"
permissions:
id-token: write
contents: read
steps:
- name: Setup ROCm
uses: ./.github/actions/setup-rocm
@ -3805,6 +4237,72 @@ jobs:
github-token: ${{ secrets.GITHUB_TOKEN }}
uses: ./.github/workflows/_binary-upload.yml
manywheel-py3_14t-cuda12_9-build:
if: ${{ github.repository_owner == 'pytorch' }}
uses: ./.github/workflows/_binary-build-linux.yml
needs: get-label-type
with:
PYTORCH_ROOT: /pytorch
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu129
GPU_ARCH_VERSION: "12.9"
GPU_ARCH_TYPE: cuda
DOCKER_IMAGE: manylinux2_28-builder
DOCKER_IMAGE_TAG_PREFIX: cuda12.9
DESIRED_PYTHON: "3.14t"
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
build_name: manywheel-py3_14t-cuda12_9
build_environment: linux-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.9.86; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.9.79; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.9.79; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.9.1.4; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.4.1.4; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.10.19; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.5.82; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.10.65; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.27.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvshmem-cu12==3.3.20; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.9.79; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.9.86; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.14.1.1; platform_system == 'Linux' and platform_machine == 'x86_64'
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_14t-cuda12_9-test: # Testing
if: ${{ github.repository_owner == 'pytorch' }}
needs:
- manywheel-py3_14t-cuda12_9-build
- get-label-type
uses: ./.github/workflows/_binary-test-linux.yml
with:
PYTORCH_ROOT: /pytorch
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu129
GPU_ARCH_VERSION: "12.9"
GPU_ARCH_TYPE: cuda
DOCKER_IMAGE: manylinux2_28-builder
DOCKER_IMAGE_TAG_PREFIX: cuda12.9
DESIRED_PYTHON: "3.14t"
build_name: manywheel-py3_14t-cuda12_9
build_environment: linux-binary-manywheel
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
runs_on: linux.g4dn.4xlarge.nvidia.gpu # 12.8+ builds need sm_70+ runner
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_14t-cuda12_9-upload: # Uploading
if: ${{ github.repository_owner == 'pytorch' }}
permissions:
id-token: write
contents: read
needs: manywheel-py3_14t-cuda12_9-test
with:
PYTORCH_ROOT: /pytorch
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu129
GPU_ARCH_VERSION: "12.9"
GPU_ARCH_TYPE: cuda
DOCKER_IMAGE: manylinux2_28-builder
DOCKER_IMAGE_TAG_PREFIX: cuda12.9
DESIRED_PYTHON: "3.14t"
build_name: manywheel-py3_14t-cuda12_9
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
uses: ./.github/workflows/_binary-upload.yml
manywheel-py3_14t-cuda13_0-build:
if: ${{ github.repository_owner == 'pytorch' }}
uses: ./.github/workflows/_binary-build-linux.yml
@ -3911,6 +4409,9 @@ jobs:
DOCKER_IMAGE: manylinux2_28-builder
DOCKER_IMAGE_TAG_PREFIX: rocm6.4
DESIRED_PYTHON: "3.14t"
permissions:
id-token: write
contents: read
steps:
- name: Setup ROCm
uses: ./.github/actions/setup-rocm
@ -4023,6 +4524,9 @@ jobs:
DOCKER_IMAGE: manylinux2_28-builder
DOCKER_IMAGE_TAG_PREFIX: rocm7.0
DESIRED_PYTHON: "3.14t"
permissions:
id-token: write
contents: read
steps:
- name: Setup ROCm
uses: ./.github/actions/setup-rocm

View File

@ -788,6 +788,256 @@ jobs:
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
uses: ./.github/workflows/_binary-upload.yml
libtorch-cuda12_9-shared-with-deps-debug-build:
if: ${{ github.repository_owner == 'pytorch' }}
needs: get-label-type
runs-on: "${{ needs.get-label-type.outputs.label-type }}windows.4xlarge"
timeout-minutes: 360
env:
PYTORCH_ROOT: ${{ github.workspace }}/pytorch
PACKAGE_TYPE: libtorch
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu129
GPU_ARCH_VERSION: "12.9"
GPU_ARCH_TYPE: cuda
SKIP_ALL_TESTS: 1
LIBTORCH_CONFIG: debug
LIBTORCH_VARIANT: shared-with-deps
# This is a dummy value for libtorch to work correctly with our batch scripts
# without this value pip does not get installed for some reason
DESIRED_PYTHON: "3.10"
steps:
# NOTE: These environment variables are put here so that they can be applied on every job equally
# They are also here because setting them at a workflow level doesn't give us access to the
# runner.temp variable, which we need.
- name: Populate binary env
shell: bash
run: |
echo "BINARY_ENV_FILE=${RUNNER_TEMP}/env" >> "${GITHUB_ENV}"
echo "PYTORCH_FINAL_PACKAGE_DIR=${RUNNER_TEMP}/artifacts" >> "${GITHUB_ENV}"
echo "WIN_PACKAGE_WORK_DIR=${RUNNER_TEMP}"
- name: Display EC2 information
shell: bash
run: |
set -euo pipefail
function get_ec2_metadata() {
# Pulled from instance metadata endpoint for EC2
# see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html
category=$1
curl -H "X-aws-ec2-metadata-token: $(curl -s -X PUT "http://169.254.169.254/latest/api/token" -H "X-aws-ec2-metadata-token-ttl-seconds: 30")" -fsSL "http://169.254.169.254/latest/meta-data/${category}"
}
echo "ami-id: $(get_ec2_metadata ami-id)"
echo "instance-id: $(get_ec2_metadata instance-id)"
echo "instance-type: $(get_ec2_metadata instance-type)"
echo "system info $(uname -a)"
- name: "[FB EMPLOYEES] Enable SSH (Click me for login details)"
uses: pytorch/test-infra/.github/actions/setup-ssh@main
continue-on-error: true
with:
github-secret: ${{ secrets.GITHUB_TOKEN }}
- name: Enable git long paths and symlinks on Windows and disable fsmonitor daemon
shell: bash
run: |
git config --global core.longpaths true
git config --global core.symlinks true
# https://git-scm.com/docs/git-fsmonitor--daemon. The daemon could lock
# the directory on Windows and prevent GHA from checking out as reported
# in https://github.com/actions/checkout/issues/1018
git config --global core.fsmonitor false
# Needed for binary builds, see: https://github.com/pytorch/pytorch/issues/73339#issuecomment-1058981560
- name: Enable long paths on Windows
shell: powershell
run: |
Set-ItemProperty -Path "HKLM:\\SYSTEM\CurrentControlSet\Control\FileSystem" -Name "LongPathsEnabled" -Value 1
# Since it's just a defensive command, the workflow should continue even the command fails. This step can be
# removed once Windows Defender is removed from the AMI
- name: Disables Windows Defender scheduled and real-time scanning for files in directories used by PyTorch
continue-on-error: true
shell: powershell
run: |
Add-MpPreference -ExclusionPath $(Get-Location).tostring(),$Env:TEMP -ErrorAction Ignore
# Let's both exclude the path and disable Windows Defender completely just to be sure
# that it doesn't interfere
Set-MpPreference -DisableRealtimeMonitoring $True -ErrorAction Ignore
- name: Checkout PyTorch
uses: actions/checkout@v4
with:
ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }}
submodules: recursive
path: pytorch
show-progress: false
- name: Clean PyTorch checkout
run: |
# Remove any artifacts from the previous checkouts
git clean -fxd
working-directory: pytorch
- name: Populate binary env
shell: bash
run: |
"${PYTORCH_ROOT}/.circleci/scripts/binary_populate_env.sh"
- name: Build PyTorch binary
shell: bash
run: |
"${PYTORCH_ROOT}/.circleci/scripts/binary_windows_build.sh"
- uses: actions/upload-artifact@v4.4.0
if: always()
with:
name: libtorch-cuda12_9-shared-with-deps-debug
retention-days: 14
if-no-files-found: error
path: "${{ env.PYTORCH_FINAL_PACKAGE_DIR }}"
- name: Wait until all sessions have drained
shell: powershell
working-directory: pytorch
if: always()
timeout-minutes: 120
run: |
.github\scripts\wait_for_ssh_to_drain.ps1
- name: Kill active ssh sessions if still around (Useful if workflow was cancelled)
shell: powershell
working-directory: pytorch
if: always()
run: |
.github\scripts\kill_active_ssh_sessions.ps1
libtorch-cuda12_9-shared-with-deps-debug-test: # Testing
if: ${{ github.repository_owner == 'pytorch' }}
needs:
- libtorch-cuda12_9-shared-with-deps-debug-build
- get-label-type
runs-on: "${{ needs.get-label-type.outputs.label-type }}windows.g4dn.xlarge"
timeout-minutes: 360
env:
PYTORCH_ROOT: ${{ github.workspace }}/pytorch
PACKAGE_TYPE: libtorch
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu129
GPU_ARCH_VERSION: "12.9"
GPU_ARCH_TYPE: cuda
SKIP_ALL_TESTS: 1
LIBTORCH_CONFIG: debug
LIBTORCH_VARIANT: shared-with-deps
# This is a dummy value for libtorch to work correctly with our batch scripts
# without this value pip does not get installed for some reason
DESIRED_PYTHON: "3.10"
steps:
- name: Display EC2 information
shell: bash
run: |
set -euo pipefail
function get_ec2_metadata() {
# Pulled from instance metadata endpoint for EC2
# see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html
category=$1
curl -H "X-aws-ec2-metadata-token: $(curl -s -X PUT "http://169.254.169.254/latest/api/token" -H "X-aws-ec2-metadata-token-ttl-seconds: 30")" -fsSL "http://169.254.169.254/latest/meta-data/${category}"
}
echo "ami-id: $(get_ec2_metadata ami-id)"
echo "instance-id: $(get_ec2_metadata instance-id)"
echo "instance-type: $(get_ec2_metadata instance-type)"
echo "system info $(uname -a)"
- name: "[FB EMPLOYEES] Enable SSH (Click me for login details)"
uses: pytorch/test-infra/.github/actions/setup-ssh@main
continue-on-error: true
with:
github-secret: ${{ secrets.GITHUB_TOKEN }}
- name: Enable git long paths and symlinks on Windows and disable fsmonitor daemon
shell: bash
run: |
git config --global core.longpaths true
git config --global core.symlinks true
# https://git-scm.com/docs/git-fsmonitor--daemon. The daemon could lock
# the directory on Windows and prevent GHA from checking out as reported
# in https://github.com/actions/checkout/issues/1018
git config --global core.fsmonitor false
# Needed for binary builds, see: https://github.com/pytorch/pytorch/issues/73339#issuecomment-1058981560
- name: Enable long paths on Windows
shell: powershell
run: |
Set-ItemProperty -Path "HKLM:\\SYSTEM\CurrentControlSet\Control\FileSystem" -Name "LongPathsEnabled" -Value 1
# Since it's just a defensive command, the workflow should continue even the command fails. This step can be
# removed once Windows Defender is removed from the AMI
- name: Disables Windows Defender scheduled and real-time scanning for files in directories used by PyTorch
continue-on-error: true
shell: powershell
run: |
Add-MpPreference -ExclusionPath $(Get-Location).tostring(),$Env:TEMP -ErrorAction Ignore
# Let's both exclude the path and disable Windows Defender completely just to be sure
# that it doesn't interfere
Set-MpPreference -DisableRealtimeMonitoring $True -ErrorAction Ignore
- name: Checkout PyTorch
uses: actions/checkout@v4
with:
ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }}
submodules: recursive
path: pytorch
show-progress: false
- name: Clean PyTorch checkout
run: |
# Remove any artifacts from the previous checkouts
git clean -fxd
working-directory: pytorch
# NOTE: These environment variables are put here so that they can be applied on every job equally
# They are also here because setting them at a workflow level doesn't give us access to the
# runner.temp variable, which we need.
- name: Populate binary env
shell: bash
run: |
echo "BINARY_ENV_FILE=${RUNNER_TEMP}/env" >> "${GITHUB_ENV}"
echo "PYTORCH_FINAL_PACKAGE_DIR=${RUNNER_TEMP}/artifacts" >> "${GITHUB_ENV}"
echo "WIN_PACKAGE_WORK_DIR=${RUNNER_TEMP}"
- uses: actions/download-artifact@v4.1.7
name: Download Build Artifacts
with:
name: libtorch-cuda12_9-shared-with-deps-debug
path: "${{ env.PYTORCH_FINAL_PACKAGE_DIR }}"
- name: Populate binary env
shell: bash
run: |
"${PYTORCH_ROOT}/.circleci/scripts/binary_populate_env.sh"
- name: Test PyTorch binary
shell: bash
run: |
"${PYTORCH_ROOT}/.circleci/scripts/binary_windows_test.sh"
- name: Wait until all sessions have drained
shell: powershell
working-directory: pytorch
if: always()
timeout-minutes: 120
run: |
.github\scripts\wait_for_ssh_to_drain.ps1
- name: Kill active ssh sessions if still around (Useful if workflow was cancelled)
shell: powershell
working-directory: pytorch
if: always()
run: |
.github\scripts\kill_active_ssh_sessions.ps1
libtorch-cuda12_9-shared-with-deps-debug-upload: # Uploading
if: ${{ github.repository_owner == 'pytorch' }}
permissions:
id-token: write
contents: read
needs: libtorch-cuda12_9-shared-with-deps-debug-test
with:
PYTORCH_ROOT: ${{ github.workspace }}/pytorch
PACKAGE_TYPE: libtorch
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu129
GPU_ARCH_VERSION: "12.9"
GPU_ARCH_TYPE: cuda
LIBTORCH_CONFIG: debug
LIBTORCH_VARIANT: shared-with-deps
# This is a dummy value for libtorch to work correctly with our batch scripts
# without this value pip does not get installed for some reason
DESIRED_PYTHON: "3.10"
build_name: libtorch-cuda12_9-shared-with-deps-debug
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
uses: ./.github/workflows/_binary-upload.yml
libtorch-cuda13_0-shared-with-deps-debug-build:
if: ${{ github.repository_owner == 'pytorch' }}
needs: get-label-type

View File

@ -788,6 +788,256 @@ jobs:
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
uses: ./.github/workflows/_binary-upload.yml
libtorch-cuda12_9-shared-with-deps-release-build:
if: ${{ github.repository_owner == 'pytorch' }}
needs: get-label-type
runs-on: "${{ needs.get-label-type.outputs.label-type }}windows.4xlarge"
timeout-minutes: 360
env:
PYTORCH_ROOT: ${{ github.workspace }}/pytorch
PACKAGE_TYPE: libtorch
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu129
GPU_ARCH_VERSION: "12.9"
GPU_ARCH_TYPE: cuda
SKIP_ALL_TESTS: 1
LIBTORCH_CONFIG: release
LIBTORCH_VARIANT: shared-with-deps
# This is a dummy value for libtorch to work correctly with our batch scripts
# without this value pip does not get installed for some reason
DESIRED_PYTHON: "3.10"
steps:
# NOTE: These environment variables are put here so that they can be applied on every job equally
# They are also here because setting them at a workflow level doesn't give us access to the
# runner.temp variable, which we need.
- name: Populate binary env
shell: bash
run: |
echo "BINARY_ENV_FILE=${RUNNER_TEMP}/env" >> "${GITHUB_ENV}"
echo "PYTORCH_FINAL_PACKAGE_DIR=${RUNNER_TEMP}/artifacts" >> "${GITHUB_ENV}"
echo "WIN_PACKAGE_WORK_DIR=${RUNNER_TEMP}"
- name: Display EC2 information
shell: bash
run: |
set -euo pipefail
function get_ec2_metadata() {
# Pulled from instance metadata endpoint for EC2
# see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html
category=$1
curl -H "X-aws-ec2-metadata-token: $(curl -s -X PUT "http://169.254.169.254/latest/api/token" -H "X-aws-ec2-metadata-token-ttl-seconds: 30")" -fsSL "http://169.254.169.254/latest/meta-data/${category}"
}
echo "ami-id: $(get_ec2_metadata ami-id)"
echo "instance-id: $(get_ec2_metadata instance-id)"
echo "instance-type: $(get_ec2_metadata instance-type)"
echo "system info $(uname -a)"
- name: "[FB EMPLOYEES] Enable SSH (Click me for login details)"
uses: pytorch/test-infra/.github/actions/setup-ssh@main
continue-on-error: true
with:
github-secret: ${{ secrets.GITHUB_TOKEN }}
- name: Enable git long paths and symlinks on Windows and disable fsmonitor daemon
shell: bash
run: |
git config --global core.longpaths true
git config --global core.symlinks true
# https://git-scm.com/docs/git-fsmonitor--daemon. The daemon could lock
# the directory on Windows and prevent GHA from checking out as reported
# in https://github.com/actions/checkout/issues/1018
git config --global core.fsmonitor false
# Needed for binary builds, see: https://github.com/pytorch/pytorch/issues/73339#issuecomment-1058981560
- name: Enable long paths on Windows
shell: powershell
run: |
Set-ItemProperty -Path "HKLM:\\SYSTEM\CurrentControlSet\Control\FileSystem" -Name "LongPathsEnabled" -Value 1
# Since it's just a defensive command, the workflow should continue even the command fails. This step can be
# removed once Windows Defender is removed from the AMI
- name: Disables Windows Defender scheduled and real-time scanning for files in directories used by PyTorch
continue-on-error: true
shell: powershell
run: |
Add-MpPreference -ExclusionPath $(Get-Location).tostring(),$Env:TEMP -ErrorAction Ignore
# Let's both exclude the path and disable Windows Defender completely just to be sure
# that it doesn't interfere
Set-MpPreference -DisableRealtimeMonitoring $True -ErrorAction Ignore
- name: Checkout PyTorch
uses: actions/checkout@v4
with:
ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }}
submodules: recursive
path: pytorch
show-progress: false
- name: Clean PyTorch checkout
run: |
# Remove any artifacts from the previous checkouts
git clean -fxd
working-directory: pytorch
- name: Populate binary env
shell: bash
run: |
"${PYTORCH_ROOT}/.circleci/scripts/binary_populate_env.sh"
- name: Build PyTorch binary
shell: bash
run: |
"${PYTORCH_ROOT}/.circleci/scripts/binary_windows_build.sh"
- uses: actions/upload-artifact@v4.4.0
if: always()
with:
name: libtorch-cuda12_9-shared-with-deps-release
retention-days: 14
if-no-files-found: error
path: "${{ env.PYTORCH_FINAL_PACKAGE_DIR }}"
- name: Wait until all sessions have drained
shell: powershell
working-directory: pytorch
if: always()
timeout-minutes: 120
run: |
.github\scripts\wait_for_ssh_to_drain.ps1
- name: Kill active ssh sessions if still around (Useful if workflow was cancelled)
shell: powershell
working-directory: pytorch
if: always()
run: |
.github\scripts\kill_active_ssh_sessions.ps1
libtorch-cuda12_9-shared-with-deps-release-test: # Testing
if: ${{ github.repository_owner == 'pytorch' }}
needs:
- libtorch-cuda12_9-shared-with-deps-release-build
- get-label-type
runs-on: "${{ needs.get-label-type.outputs.label-type }}windows.g4dn.xlarge"
timeout-minutes: 360
env:
PYTORCH_ROOT: ${{ github.workspace }}/pytorch
PACKAGE_TYPE: libtorch
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu129
GPU_ARCH_VERSION: "12.9"
GPU_ARCH_TYPE: cuda
SKIP_ALL_TESTS: 1
LIBTORCH_CONFIG: release
LIBTORCH_VARIANT: shared-with-deps
# This is a dummy value for libtorch to work correctly with our batch scripts
# without this value pip does not get installed for some reason
DESIRED_PYTHON: "3.10"
steps:
- name: Display EC2 information
shell: bash
run: |
set -euo pipefail
function get_ec2_metadata() {
# Pulled from instance metadata endpoint for EC2
# see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html
category=$1
curl -H "X-aws-ec2-metadata-token: $(curl -s -X PUT "http://169.254.169.254/latest/api/token" -H "X-aws-ec2-metadata-token-ttl-seconds: 30")" -fsSL "http://169.254.169.254/latest/meta-data/${category}"
}
echo "ami-id: $(get_ec2_metadata ami-id)"
echo "instance-id: $(get_ec2_metadata instance-id)"
echo "instance-type: $(get_ec2_metadata instance-type)"
echo "system info $(uname -a)"
- name: "[FB EMPLOYEES] Enable SSH (Click me for login details)"
uses: pytorch/test-infra/.github/actions/setup-ssh@main
continue-on-error: true
with:
github-secret: ${{ secrets.GITHUB_TOKEN }}
- name: Enable git long paths and symlinks on Windows and disable fsmonitor daemon
shell: bash
run: |
git config --global core.longpaths true
git config --global core.symlinks true
# https://git-scm.com/docs/git-fsmonitor--daemon. The daemon could lock
# the directory on Windows and prevent GHA from checking out as reported
# in https://github.com/actions/checkout/issues/1018
git config --global core.fsmonitor false
# Needed for binary builds, see: https://github.com/pytorch/pytorch/issues/73339#issuecomment-1058981560
- name: Enable long paths on Windows
shell: powershell
run: |
Set-ItemProperty -Path "HKLM:\\SYSTEM\CurrentControlSet\Control\FileSystem" -Name "LongPathsEnabled" -Value 1
# Since it's just a defensive command, the workflow should continue even the command fails. This step can be
# removed once Windows Defender is removed from the AMI
- name: Disables Windows Defender scheduled and real-time scanning for files in directories used by PyTorch
continue-on-error: true
shell: powershell
run: |
Add-MpPreference -ExclusionPath $(Get-Location).tostring(),$Env:TEMP -ErrorAction Ignore
# Let's both exclude the path and disable Windows Defender completely just to be sure
# that it doesn't interfere
Set-MpPreference -DisableRealtimeMonitoring $True -ErrorAction Ignore
- name: Checkout PyTorch
uses: actions/checkout@v4
with:
ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }}
submodules: recursive
path: pytorch
show-progress: false
- name: Clean PyTorch checkout
run: |
# Remove any artifacts from the previous checkouts
git clean -fxd
working-directory: pytorch
# NOTE: These environment variables are put here so that they can be applied on every job equally
# They are also here because setting them at a workflow level doesn't give us access to the
# runner.temp variable, which we need.
- name: Populate binary env
shell: bash
run: |
echo "BINARY_ENV_FILE=${RUNNER_TEMP}/env" >> "${GITHUB_ENV}"
echo "PYTORCH_FINAL_PACKAGE_DIR=${RUNNER_TEMP}/artifacts" >> "${GITHUB_ENV}"
echo "WIN_PACKAGE_WORK_DIR=${RUNNER_TEMP}"
- uses: actions/download-artifact@v4.1.7
name: Download Build Artifacts
with:
name: libtorch-cuda12_9-shared-with-deps-release
path: "${{ env.PYTORCH_FINAL_PACKAGE_DIR }}"
- name: Populate binary env
shell: bash
run: |
"${PYTORCH_ROOT}/.circleci/scripts/binary_populate_env.sh"
- name: Test PyTorch binary
shell: bash
run: |
"${PYTORCH_ROOT}/.circleci/scripts/binary_windows_test.sh"
- name: Wait until all sessions have drained
shell: powershell
working-directory: pytorch
if: always()
timeout-minutes: 120
run: |
.github\scripts\wait_for_ssh_to_drain.ps1
- name: Kill active ssh sessions if still around (Useful if workflow was cancelled)
shell: powershell
working-directory: pytorch
if: always()
run: |
.github\scripts\kill_active_ssh_sessions.ps1
libtorch-cuda12_9-shared-with-deps-release-upload: # Uploading
if: ${{ github.repository_owner == 'pytorch' }}
permissions:
id-token: write
contents: read
needs: libtorch-cuda12_9-shared-with-deps-release-test
with:
PYTORCH_ROOT: ${{ github.workspace }}/pytorch
PACKAGE_TYPE: libtorch
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu129
GPU_ARCH_VERSION: "12.9"
GPU_ARCH_TYPE: cuda
LIBTORCH_CONFIG: release
LIBTORCH_VARIANT: shared-with-deps
# This is a dummy value for libtorch to work correctly with our batch scripts
# without this value pip does not get installed for some reason
DESIRED_PYTHON: "3.10"
build_name: libtorch-cuda12_9-shared-with-deps-release
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
uses: ./.github/workflows/_binary-upload.yml
libtorch-cuda13_0-shared-with-deps-release-build:
if: ${{ github.repository_owner == 'pytorch' }}
needs: get-label-type

File diff suppressed because it is too large Load Diff

View File

@ -130,7 +130,7 @@ jobs:
name: test-periodically
uses: ./.github/workflows/_linux-test.yml
needs: build
if: github.event.schedule == '15 0,12 * * 1-6'
if: github.event.schedule == '15 0 * * 1-6'
with:
build-environment: linux-jammy-cuda12.8-py3.10-gcc9-sm90
dashboard-tag: training-true-inference-true-default-true-dynamic-true-cudagraphs-true-cppwrapper-true-aotinductor-true-freezing_cudagraphs-true-cudagraphs_low_precision-true

View File

@ -0,0 +1,132 @@
name: inductor-perf-nightly-rocm-mi300
on:
push:
tags:
- ciflow/inductor-perf-test-nightly-rocm-mi300/*
schedule:
- cron: 15 0 * * *
# NB: GitHub has an upper limit of 10 inputs here, so before we can sort it
# out, let try to run torchao cudagraphs_low_precision as part of cudagraphs
workflow_dispatch:
inputs:
training:
description: Run training (on by default)?
required: false
type: boolean
default: true
inference:
description: Run inference (on by default)?
required: false
type: boolean
default: true
default:
description: Run inductor_default?
required: false
type: boolean
default: false
dynamic:
description: Run inductor_dynamic_shapes?
required: false
type: boolean
default: false
cppwrapper:
description: Run inductor_cpp_wrapper?
required: false
type: boolean
default: false
cudagraphs:
description: Run inductor_cudagraphs?
required: false
type: boolean
default: true
freezing_cudagraphs:
description: Run inductor_cudagraphs with freezing for inference?
required: false
type: boolean
default: false
aotinductor:
description: Run aot_inductor for inference?
required: false
type: boolean
default: false
maxautotune:
description: Run inductor_max_autotune?
required: false
type: boolean
default: false
benchmark_configs:
description: The list of configs used the benchmark
required: false
type: string
default: inductor_huggingface_perf_rocm_mi300,inductor_timm_perf_rocm_mi300,inductor_torchbench_perf_rocm_mi300
concurrency:
group: ${{ github.workflow }}-${{ github.event.pull_request.number || github.ref_name }}-${{ github.ref_type == 'branch' && github.sha }}-${{ github.event_name == 'workflow_dispatch' }}-${{ github.event_name == 'schedule' }}
cancel-in-progress: true
permissions: read-all
jobs:
get-label-type:
name: get-label-type
uses: pytorch/pytorch/.github/workflows/_runner-determinator.yml@main
if: ${{ (github.event_name != 'schedule' || github.repository == 'pytorch/pytorch') && github.repository_owner == 'pytorch' }}
with:
triggering_actor: ${{ github.triggering_actor }}
issue_owner: ${{ github.event.pull_request.user.login || github.event.issue.user.login }}
curr_branch: ${{ github.head_ref || github.ref_name }}
curr_ref_type: ${{ github.ref_type }}
opt_out_experiments: lf
linux-jammy-rocm-py3_10-inductor-benchmark-build:
if: github.repository_owner == 'pytorch'
name: rocm-py3_10-inductor-benchmark-build
uses: ./.github/workflows/_linux-build.yml
with:
build-environment: linux-jammy-rocm-py3_10
docker-image-name: ci-image:pytorch-linux-jammy-rocm-n-py3-benchmarks
test-matrix: |
{ include: [
{ config: "inductor_huggingface_perf_rocm_mi300", shard: 1, num_shards: 5, runner: "linux.rocm.gpu.gfx942.1" },
{ config: "inductor_huggingface_perf_rocm_mi300", shard: 2, num_shards: 5, runner: "linux.rocm.gpu.gfx942.1" },
{ config: "inductor_huggingface_perf_rocm_mi300", shard: 3, num_shards: 5, runner: "linux.rocm.gpu.gfx942.1" },
{ config: "inductor_huggingface_perf_rocm_mi300", shard: 4, num_shards: 5, runner: "linux.rocm.gpu.gfx942.1" },
{ config: "inductor_huggingface_perf_rocm_mi300", shard: 5, num_shards: 5, runner: "linux.rocm.gpu.gfx942.1" },
{ config: "inductor_timm_perf_rocm_mi300", shard: 1, num_shards: 7, runner: "linux.rocm.gpu.gfx942.1" },
{ config: "inductor_timm_perf_rocm_mi300", shard: 2, num_shards: 7, runner: "linux.rocm.gpu.gfx942.1" },
{ config: "inductor_timm_perf_rocm_mi300", shard: 3, num_shards: 7, runner: "linux.rocm.gpu.gfx942.1" },
{ config: "inductor_timm_perf_rocm_mi300", shard: 4, num_shards: 7, runner: "linux.rocm.gpu.gfx942.1" },
{ config: "inductor_timm_perf_rocm_mi300", shard: 5, num_shards: 7, runner: "linux.rocm.gpu.gfx942.1" },
{ config: "inductor_timm_perf_rocm_mi300", shard: 6, num_shards: 7, runner: "linux.rocm.gpu.gfx942.1" },
{ config: "inductor_timm_perf_rocm_mi300", shard: 7, num_shards: 7, runner: "linux.rocm.gpu.gfx942.1" },
{ config: "inductor_torchbench_perf_rocm_mi300", shard: 1, num_shards: 9, runner: "linux.rocm.gpu.gfx942.1" },
{ config: "inductor_torchbench_perf_rocm_mi300", shard: 2, num_shards: 9, runner: "linux.rocm.gpu.gfx942.1" },
{ config: "inductor_torchbench_perf_rocm_mi300", shard: 3, num_shards: 9, runner: "linux.rocm.gpu.gfx942.1" },
{ config: "inductor_torchbench_perf_rocm_mi300", shard: 4, num_shards: 9, runner: "linux.rocm.gpu.gfx942.1" },
{ config: "inductor_torchbench_perf_rocm_mi300", shard: 5, num_shards: 9, runner: "linux.rocm.gpu.gfx942.1" },
{ config: "inductor_torchbench_perf_rocm_mi300", shard: 6, num_shards: 9, runner: "linux.rocm.gpu.gfx942.1" },
{ config: "inductor_torchbench_perf_rocm_mi300", shard: 7, num_shards: 9, runner: "linux.rocm.gpu.gfx942.1" },
{ config: "inductor_torchbench_perf_rocm_mi300", shard: 8, num_shards: 9, runner: "linux.rocm.gpu.gfx942.1" },
{ config: "inductor_torchbench_perf_rocm_mi300", shard: 9, num_shards: 9, runner: "linux.rocm.gpu.gfx942.1" },
]}
secrets: inherit
linux-jammy-rocm-py3_10-inductor-benchmark-test:
permissions:
id-token: write
contents: read
name: rocm-py3_10-inductor-benchmark-test
uses: ./.github/workflows/_rocm-test.yml
needs: linux-jammy-rocm-py3_10-inductor-benchmark-build
with:
build-environment: linux-jammy-rocm-py3_10
dashboard-tag: training-true-inference-true-default-true-dynamic-true-cudagraphs-true-cppwrapper-true-aotinductor-true-freezing_cudagraphs-true-cudagraphs_low_precision-true
docker-image: ${{ needs.linux-jammy-rocm-py3_10-inductor-benchmark-build.outputs.docker-image }}
test-matrix: ${{ needs.linux-jammy-rocm-py3_10-inductor-benchmark-build.outputs.test-matrix }}
timeout-minutes: 720
# Disable monitor in perf tests for more investigation
disable-monitor: true
monitor-log-interval: 10
monitor-data-collect-interval: 2
secrets: inherit

View File

@ -1,11 +1,11 @@
name: inductor-perf-nightly-rocm
name: inductor-perf-nightly-rocm-mi355
on:
push:
tags:
- ciflow/inductor-perf-test-nightly-rocm/*
- ciflow/inductor-perf-test-nightly-rocm-mi355/*
schedule:
- cron: 0 7 * * 0,3
- cron: 15 0 * * *
# NB: GitHub has an upper limit of 10 inputs here, so before we can sort it
# out, let try to run torchao cudagraphs_low_precision as part of cudagraphs
workflow_dispatch:
@ -59,7 +59,7 @@ on:
description: The list of configs used the benchmark
required: false
type: string
default: inductor_huggingface_perf_rocm,inductor_timm_perf_rocm,inductor_torchbench_perf_rocm
default: inductor_huggingface_perf_rocm_mi355,inductor_timm_perf_rocm_mi355,inductor_torchbench_perf_rocm_mi355
concurrency:
group: ${{ github.workflow }}-${{ github.event.pull_request.number || github.ref_name }}-${{ github.ref_type == 'branch' && github.sha }}-${{ github.event_name == 'workflow_dispatch' }}-${{ github.event_name == 'schedule' }}
@ -88,23 +88,27 @@ jobs:
docker-image-name: ci-image:pytorch-linux-jammy-rocm-n-py3-benchmarks
test-matrix: |
{ include: [
{ config: "inductor_huggingface_perf_rocm", shard: 1, num_shards: 4, runner: "linux.rocm.gpu.gfx942.1" },
{ config: "inductor_huggingface_perf_rocm", shard: 2, num_shards: 4, runner: "linux.rocm.gpu.gfx942.1" },
{ config: "inductor_huggingface_perf_rocm", shard: 3, num_shards: 4, runner: "linux.rocm.gpu.gfx942.1" },
{ config: "inductor_huggingface_perf_rocm", shard: 4, num_shards: 4, runner: "linux.rocm.gpu.gfx942.1" },
{ config: "inductor_timm_perf_rocm", shard: 1, num_shards: 5, runner: "linux.rocm.gpu.gfx942.1" },
{ config: "inductor_timm_perf_rocm", shard: 2, num_shards: 5, runner: "linux.rocm.gpu.gfx942.1" },
{ config: "inductor_timm_perf_rocm", shard: 3, num_shards: 5, runner: "linux.rocm.gpu.gfx942.1" },
{ config: "inductor_timm_perf_rocm", shard: 4, num_shards: 5, runner: "linux.rocm.gpu.gfx942.1" },
{ config: "inductor_timm_perf_rocm", shard: 5, num_shards: 5, runner: "linux.rocm.gpu.gfx942.1" },
{ config: "inductor_torchbench_perf_rocm", shard: 1, num_shards: 8, runner: "linux.rocm.gpu.gfx942.1" },
{ config: "inductor_torchbench_perf_rocm", shard: 2, num_shards: 8, runner: "linux.rocm.gpu.gfx942.1" },
{ config: "inductor_torchbench_perf_rocm", shard: 3, num_shards: 8, runner: "linux.rocm.gpu.gfx942.1" },
{ config: "inductor_torchbench_perf_rocm", shard: 4, num_shards: 8, runner: "linux.rocm.gpu.gfx942.1" },
{ config: "inductor_torchbench_perf_rocm", shard: 5, num_shards: 8, runner: "linux.rocm.gpu.gfx942.1" },
{ config: "inductor_torchbench_perf_rocm", shard: 6, num_shards: 8, runner: "linux.rocm.gpu.gfx942.1" },
{ config: "inductor_torchbench_perf_rocm", shard: 7, num_shards: 8, runner: "linux.rocm.gpu.gfx942.1" },
{ config: "inductor_torchbench_perf_rocm", shard: 8, num_shards: 8, runner: "linux.rocm.gpu.gfx942.1" },
{ config: "inductor_huggingface_perf_rocm_mi355", shard: 1, num_shards: 5, runner: "linux.rocm.gpu.mi355.2" },
{ config: "inductor_huggingface_perf_rocm_mi355", shard: 2, num_shards: 5, runner: "linux.rocm.gpu.mi355.2" },
{ config: "inductor_huggingface_perf_rocm_mi355", shard: 3, num_shards: 5, runner: "linux.rocm.gpu.mi355.2" },
{ config: "inductor_huggingface_perf_rocm_mi355", shard: 4, num_shards: 5, runner: "linux.rocm.gpu.mi355.2" },
{ config: "inductor_huggingface_perf_rocm_mi355", shard: 5, num_shards: 5, runner: "linux.rocm.gpu.mi355.2" },
{ config: "inductor_timm_perf_rocm_mi355", shard: 1, num_shards: 7, runner: "linux.rocm.gpu.mi355.2" },
{ config: "inductor_timm_perf_rocm_mi355", shard: 2, num_shards: 7, runner: "linux.rocm.gpu.mi355.2" },
{ config: "inductor_timm_perf_rocm_mi355", shard: 3, num_shards: 7, runner: "linux.rocm.gpu.mi355.2" },
{ config: "inductor_timm_perf_rocm_mi355", shard: 4, num_shards: 7, runner: "linux.rocm.gpu.mi355.2" },
{ config: "inductor_timm_perf_rocm_mi355", shard: 5, num_shards: 7, runner: "linux.rocm.gpu.mi355.2" },
{ config: "inductor_timm_perf_rocm_mi355", shard: 6, num_shards: 7, runner: "linux.rocm.gpu.mi355.2" },
{ config: "inductor_timm_perf_rocm_mi355", shard: 7, num_shards: 7, runner: "linux.rocm.gpu.mi355.2" },
{ config: "inductor_torchbench_perf_rocm_mi355", shard: 1, num_shards: 9, runner: "linux.rocm.gpu.mi355.2" },
{ config: "inductor_torchbench_perf_rocm_mi355", shard: 2, num_shards: 9, runner: "linux.rocm.gpu.mi355.2" },
{ config: "inductor_torchbench_perf_rocm_mi355", shard: 3, num_shards: 9, runner: "linux.rocm.gpu.mi355.2" },
{ config: "inductor_torchbench_perf_rocm_mi355", shard: 4, num_shards: 9, runner: "linux.rocm.gpu.mi355.2" },
{ config: "inductor_torchbench_perf_rocm_mi355", shard: 5, num_shards: 9, runner: "linux.rocm.gpu.mi355.2" },
{ config: "inductor_torchbench_perf_rocm_mi355", shard: 6, num_shards: 9, runner: "linux.rocm.gpu.mi355.2" },
{ config: "inductor_torchbench_perf_rocm_mi355", shard: 7, num_shards: 9, runner: "linux.rocm.gpu.mi355.2" },
{ config: "inductor_torchbench_perf_rocm_mi355", shard: 8, num_shards: 9, runner: "linux.rocm.gpu.mi355.2" },
{ config: "inductor_torchbench_perf_rocm_mi355", shard: 9, num_shards: 9, runner: "linux.rocm.gpu.mi355.2" },
]}
secrets: inherit

View File

@ -12,6 +12,7 @@ on:
- landchecks/*
tags:
- ciflow/pull/*
- ciflow/trunk/*
workflow_dispatch:
permissions: read-all
@ -32,10 +33,12 @@ jobs:
name: Get changed files
uses: ./.github/workflows/_get-changed-files.yml
with:
all_files: ${{ contains(github.event.pull_request.labels.*.name, 'lint-all-files') || contains(github.event.pull_request.labels.*.name, 'Reverted') }}
all_files: ${{ contains(github.event.pull_request.labels.*.name, 'lint-all-files') || contains(github.event.pull_request.labels.*.name, 'Reverted') || github.event_name == 'push' }}
lintrunner-clang:
uses: pytorch/test-infra/.github/workflows/linux_job_v2.yml@main
# Needed to prevent deduping on HUD
name: lintrunner-clang-${{ needs.get-changed-files.outputs.changed-files == '*' && 'all' || 'partial' }}
needs: [get-label-type, get-changed-files]
# Only run if there are changed files relevant to clangtidy / clangformat
if: |
@ -75,6 +78,7 @@ jobs:
# fails to find types when it should
lintrunner-mypy:
uses: pytorch/test-infra/.github/workflows/linux_job_v2.yml@main
name: lintrunner-mypy-${{ needs.get-changed-files.outputs.changed-files == '*' && 'all' || 'partial' }}
needs: [get-label-type, get-changed-files]
# Only run if there are changed files relevant to mypy
if: |
@ -99,6 +103,7 @@ jobs:
lintrunner-noclang:
uses: pytorch/test-infra/.github/workflows/linux_job_v2.yml@main
name: lintrunner-noclang-${{ needs.get-changed-files.outputs.changed-files == '*' && 'all' || 'partial' }}
needs: [get-label-type, get-changed-files]
with:
timeout: 120

View File

@ -7,9 +7,11 @@ on:
workflow_dispatch:
inputs:
test_mode:
required: false
type: string
default: 'short'
type: choice
options:
- 'short'
- 'long'
- 'all'
description: tag filter for operator benchmarks, options from long, short, all
schedule:
# Run at 07:00 UTC every Sunday
@ -37,20 +39,7 @@ jobs:
docker-image-name: ci-image:pytorch-linux-jammy-py3-gcc11-inductor-benchmarks
test-matrix: |
{ include: [
{ config: "cpu_operator_benchmark_short", shard: 1, num_shards: 1, runner: "linux.12xlarge" },
]}
secrets: inherit
opbenchmark-on-demand-build:
if: ${{ github.event_name == 'workflow_dispatch' && github.repository_owner == 'pytorch' }}
name: opbenchmark-on-demand-build
uses: ./.github/workflows/_linux-build.yml
with:
build-environment: linux-jammy-py3.10-gcc11-build
docker-image-name: ci-image:pytorch-linux-jammy-py3-gcc11-inductor-benchmarks
test-matrix: |
{ include: [
{ config: "cpu_operator_benchmark_${{ inputs.test_mode }}", shard: 1, num_shards: 1, runner: "linux.12xlarge" },
{ config: "cpu_operator_benchmark_${{ inputs.test_mode || 'short' }}", shard: 1, num_shards: 1, runner: "linux.12xlarge" },
]}
secrets: inherit

View File

@ -182,11 +182,11 @@ jobs:
docker-image-name: ci-image:pytorch-linux-jammy-cuda13.0-cudnn9-py3-gcc11
test-matrix: |
{ include: [
{ config: "nogpu_AVX512", shard: 1, num_shards: 3, runner: "${{ needs.get-label-type.outputs.label-type }}linux.g4dn.4xlarge.nvidia.gpu" },
{ config: "nogpu_AVX512", shard: 2, num_shards: 3, runner: "${{ needs.get-label-type.outputs.label-type }}linux.g4dn.4xlarge.nvidia.gpu" },
{ config: "nogpu_AVX512", shard: 3, num_shards: 3, runner: "${{ needs.get-label-type.outputs.label-type }}linux.g4dn.4xlarge.nvidia.gpu" },
{ config: "nogpu_NO_AVX2", shard: 1, num_shards: 2, runner: "${{ needs.get-label-type.outputs.label-type }}linux.g4dn.4xlarge.nvidia.gpu" },
{ config: "nogpu_NO_AVX2", shard: 2, num_shards: 2, runner: "${{ needs.get-label-type.outputs.label-type }}linux.g4dn.4xlarge.nvidia.gpu" },
{ config: "nogpu_AVX512", shard: 1, num_shards: 3, runner: "${{ needs.get-label-type.outputs.label-type }}linux.4xlarge" },
{ config: "nogpu_AVX512", shard: 2, num_shards: 3, runner: "${{ needs.get-label-type.outputs.label-type }}linux.4xlarge" },
{ config: "nogpu_AVX512", shard: 3, num_shards: 3, runner: "${{ needs.get-label-type.outputs.label-type }}linux.4xlarge" },
{ config: "nogpu_NO_AVX2", shard: 1, num_shards: 2, runner: "${{ needs.get-label-type.outputs.label-type }}linux.4xlarge" },
{ config: "nogpu_NO_AVX2", shard: 2, num_shards: 2, runner: "${{ needs.get-label-type.outputs.label-type }}linux.4xlarge" },
{ config: "jit_legacy", shard: 1, num_shards: 1, runner: "${{ needs.get-label-type.outputs.label-type }}linux.g4dn.4xlarge.nvidia.gpu" },
]}
secrets: inherit

View File

@ -140,6 +140,7 @@ jobs:
uses: ./.github/workflows/_linux-build.yml
needs: get-label-type
with:
runner: linux.2xlarge.memory
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
build-environment: linux-jammy-py3.10-clang18-asan
docker-image-name: ci-image:pytorch-linux-jammy-py3-clang18-asan

View File

@ -180,13 +180,13 @@ jobs:
disable-monitor: false
secrets: inherit
win-vs2022-cuda12_6-py3-build:
name: win-vs2022-cuda12.6-py3
win-vs2022-cuda12_8-py3-build:
name: win-vs2022-cuda12.8-py3
uses: ./.github/workflows/_win-build.yml
needs: get-label-type
with:
build-environment: win-vs2022-cuda12.6-py3
cuda-version: "12.6"
build-environment: win-vs2022-cuda12.8-py3
cuda-version: "12.8"
runner: "${{ needs.get-label-type.outputs.label-type }}windows.4xlarge.nonephemeral"
secrets: inherit

View File

@ -46,7 +46,7 @@ jobs:
runner: linux.24xlarge.memory
test-matrix: |
{ include: [
{ config: "vllm_basic_correctness_test", shard: 1, num_shards: 1, runner: "linux.g6.4xlarge.experimental.nvidia.gpu" },
{ config: "vllm_basic_correctness_test", shard: 1, num_shards: 1, runner: "linux.g6.4xlarge.experimental.nvidia.gpu" },
{ config: "vllm_basic_models_test", shard: 1, num_shards: 1, runner: "linux.g6.4xlarge.experimental.nvidia.gpu" },
{ config: "vllm_entrypoints_test", shard: 1, num_shards: 1,runner: "linux.g6.4xlarge.experimental.nvidia.gpu" },
{ config: "vllm_regression_test", shard: 1, num_shards: 1, runner: "linux.g6.4xlarge.experimental.nvidia.gpu" },
@ -54,7 +54,7 @@ jobs:
{ config: "vllm_pytorch_compilation_unit_tests", shard: 1, num_shards: 1, runner: "linux.g6.4xlarge.experimental.nvidia.gpu" },
{ config: "vllm_lora_28_failure_test", shard: 1, num_shards: 1, runner: "linux.g6.4xlarge.experimental.nvidia.gpu" },
{ config: "vllm_multi_model_test_28_failure_test", shard: 1, num_shards: 1, runner: "linux.g6.4xlarge.experimental.nvidia.gpu"},
{ config: "vllm_languagde_model_test_extended_generation_28_failure_test", shard: 1, num_shards: 1, runner: "linux.g6.4xlarge.experimental.nvidia.gpu"},
{ config: "vllm_language_model_test_extended_generation_28_failure_test", shard: 1, num_shards: 1, runner: "linux.g6.4xlarge.experimental.nvidia.gpu"},
{ config: "vllm_distributed_test_2_gpu_28_failure_test", shard: 1, num_shards: 1, runner: "linux.g6.4xlarge.experimental.nvidia.gpu" },
{ config: "vllm_lora_test", shard: 0, num_shards: 4, runner: "linux.g6.4xlarge.experimental.nvidia.gpu" },
{ config: "vllm_lora_test", shard: 1, num_shards: 4, runner: "linux.g6.4xlarge.experimental.nvidia.gpu" },

1
.gitignore vendored
View File

@ -395,3 +395,4 @@ android/pytorch_android_torchvision/.cxx
CLAUDE.local.md
/test_*.py
/debug_*.py
CLAUDE_CONTEXT/

View File

@ -256,6 +256,7 @@ endif()
IF(USE_FBGEMM_GENAI)
set(FBGEMM_THIRD_PARTY ${PROJECT_SOURCE_DIR}/third_party/fbgemm/external/)
set(FBGEMM_GENAI_SRCS ${PROJECT_SOURCE_DIR}/third_party/fbgemm/fbgemm_gpu/experimental/gen_ai/src/quantize)
if(USE_CUDA)
# To avoid increasing the build time/binary size unnecessarily, use an allow-list of kernels to build.
# If you want to integrate a kernel from FBGEMM into torch, you have to add it here.
@ -292,58 +293,64 @@ IF(USE_FBGEMM_GENAI)
"${FBGEMM_GENAI_SRCS}/cutlass_extensions/mx8mx8bf16_grouped/"
)
target_include_directories(fbgemm_genai PUBLIC
target_include_directories(fbgemm_genai PRIVATE
${FBGEMM_THIRD_PARTY}/cutlass/include
${FBGEMM_THIRD_PARTY}/cutlass/tools/util/include
${fbgemm_genai_mx8mx8bf16_grouped}
${FBGEMM_GENAI_SRCS}/common/include/ # includes fbgemm_gpu/quantize/utils.h, fbgemm_gpu/quantize/tuning_cache.hpp
${FBGEMM_GENAI_SRCS}/include/ # includes fbgemm_gpu/torch_ops.h
)
else()
if(USE_ROCM)
# Only include the kernels we want to build to avoid increasing binary size.
file(GLOB_RECURSE fbgemm_genai_native_rocm_hip
"${FBGEMM_GENAI_SRCS}/ck_extensions/fp8_rowwise_grouped/kernels/fp8_rowwise_grouped*.hip"
"${FBGEMM_GENAI_SRCS}/ck_extensions/fp8_rowwise_grouped/fp8_rowwise_grouped_gemm.hip")
set_source_files_properties(${fbgemm_genai_native_rocm_hip} PROPERTIES HIP_SOURCE_PROPERTY_FORMAT 1)
# Add additional HIPCC compiler flags for performance
set(FBGEMM_GENAI_EXTRA_HIPCC_FLAGS
-mllvm
-amdgpu-coerce-illegal-types=1
-mllvm
-enable-post-misched=0
-mllvm
-greedy-reverse-local-assignment=1
-fhip-new-launch-api)
# Add FBGEMM_GENAI include directories for torch_ops.h
list(APPEND ATen_CUDA_INCLUDE ${PROJECT_SOURCE_DIR}/third_party/fbgemm/fbgemm_gpu/experimental/gen_ai/src/quantize/include)
list(APPEND ATen_CUDA_INCLUDE ${PROJECT_SOURCE_DIR}/third_party/fbgemm/fbgemm_gpu/experimental/gen_ai/src/quantize/common/include)
elseif(USE_ROCM)
# Only include the kernels we want to build to avoid increasing binary size.
file(GLOB_RECURSE fbgemm_genai_native_rocm_hip
"${FBGEMM_GENAI_SRCS}/ck_extensions/fp8_rowwise_grouped/kernels/fp8_rowwise_grouped*.hip"
"${FBGEMM_GENAI_SRCS}/ck_extensions/fp8_rowwise_grouped/fp8_rowwise_grouped_gemm.hip")
set_source_files_properties(${fbgemm_genai_native_rocm_hip} PROPERTIES HIP_SOURCE_PROPERTY_FORMAT 1)
# Only compile for gfx942 for now.
# This is rather hacky, I could not figure out a clean solution :(
set(HIP_CLANG_FLAGS_ORIGINAL ${HIP_CLANG_FLAGS})
string(REGEX REPLACE "--offload-arch=[^ ]*" "" FILTERED_HIP_CLANG_FLAGS "${HIP_CLANG_FLAGS}")
if("gfx942" IN_LIST PYTORCH_ROCM_ARCH)
list(APPEND FILTERED_HIP_CLANG_FLAGS --offload-arch=gfx942;)
endif()
set(HIP_CLANG_FLAGS ${FILTERED_HIP_CLANG_FLAGS})
# Add additional HIPCC compiler flags for performance
set(FBGEMM_GENAI_EXTRA_HIPCC_FLAGS
-mllvm
-amdgpu-coerce-illegal-types=1
-mllvm
-enable-post-misched=0
-mllvm
-greedy-reverse-local-assignment=1
-fhip-new-launch-api)
hip_add_library(
fbgemm_genai STATIC
${fbgemm_genai_native_rocm_hip}
HIPCC_OPTIONS ${HIP_HCC_FLAGS} ${FBGEMM_GENAI_EXTRA_HIPCC_FLAGS})
set(HIP_CLANG_FLAGS ${HIP_CLANG_FLAGS_ORIGINAL})
set_target_properties(fbgemm_genai PROPERTIES POSITION_INDEPENDENT_CODE ON)
target_compile_definitions(fbgemm_genai PRIVATE FBGEMM_GENAI_NO_EXTENDED_SHAPES)
target_include_directories(fbgemm_genai PUBLIC
# FBGEMM version of Composable Kernel is used due to some customizations
${FBGEMM_THIRD_PARTY}/composable_kernel/include
${FBGEMM_THIRD_PARTY}/composable_kernel/library/include
${FBGEMM_THIRD_PARTY}/cutlass/include
${FBGEMM_THIRD_PARTY}/cutlass/tools/util/include
${FBGEMM_GENAI_SRCS}/common/include/ # includes fbgemm_gpu/quantize/utils.h, fbgemm_gpu/quantize/tuning_cache.hpp
${FBGEMM_GENAI_SRCS}/include/ # includes fbgemm_gpu/torch_ops.h
)
# Only compile for gfx942 for now.
# This is rather hacky, I could not figure out a clean solution :(
set(HIP_CLANG_FLAGS_ORIGINAL ${HIP_CLANG_FLAGS})
string(REGEX REPLACE "--offload-arch=[^ ]*" "" FILTERED_HIP_CLANG_FLAGS "${HIP_CLANG_FLAGS}")
if("gfx942" IN_LIST PYTORCH_ROCM_ARCH)
list(APPEND FILTERED_HIP_CLANG_FLAGS --offload-arch=gfx942;)
endif()
set(HIP_CLANG_FLAGS ${FILTERED_HIP_CLANG_FLAGS})
hip_add_library(
fbgemm_genai STATIC
${fbgemm_genai_native_rocm_hip}
HIPCC_OPTIONS ${HIP_HCC_FLAGS} ${FBGEMM_GENAI_EXTRA_HIPCC_FLAGS})
set(HIP_CLANG_FLAGS ${HIP_CLANG_FLAGS_ORIGINAL})
set_target_properties(fbgemm_genai PROPERTIES POSITION_INDEPENDENT_CODE ON)
target_compile_definitions(fbgemm_genai PRIVATE FBGEMM_GENAI_NO_EXTENDED_SHAPES)
target_include_directories(fbgemm_genai PRIVATE
# FBGEMM version of Composable Kernel is used due to some customizations
${FBGEMM_THIRD_PARTY}/composable_kernel/include
${FBGEMM_THIRD_PARTY}/composable_kernel/library/include
${FBGEMM_THIRD_PARTY}/cutlass/include
${FBGEMM_THIRD_PARTY}/cutlass/tools/util/include
${FBGEMM_GENAI_SRCS}/common/include/ # includes fbgemm_gpu/quantize/utils.h, fbgemm_gpu/quantize/tuning_cache.hpp
${FBGEMM_GENAI_SRCS}/include/ # includes fbgemm_gpu/torch_ops.h
)
# Add FBGEMM_GENAI include directories for torch_ops.h
list(APPEND ATen_HIP_INCLUDE ${PROJECT_SOURCE_DIR}/third_party/fbgemm/fbgemm_gpu/experimental/gen_ai/src/quantize/include)
list(APPEND ATen_HIP_INCLUDE ${PROJECT_SOURCE_DIR}/third_party/fbgemm/fbgemm_gpu/experimental/gen_ai/src/quantize/common/include)
endif()
endif()
@ -692,12 +699,6 @@ if(USE_CUDA AND NOT USE_ROCM)
list(APPEND ATen_CUDA_INCLUDE ${CMAKE_CURRENT_SOURCE_DIR}/../../../third_party/cutlass/include)
list(APPEND ATen_CUDA_INCLUDE ${CMAKE_CURRENT_SOURCE_DIR}/../../../third_party/cutlass/tools/util/include)
# Add FBGEMM_GENAI include directories for torch_ops.h
if(USE_FBGEMM_GENAI)
list(APPEND ATen_CUDA_INCLUDE ${CMAKE_CURRENT_SOURCE_DIR}/../../../third_party/fbgemm/fbgemm_gpu/experimental/gen_ai/src/quantize/include)
list(APPEND ATen_CUDA_INCLUDE ${CMAKE_CURRENT_SOURCE_DIR}/../../../third_party/fbgemm/fbgemm_gpu/experimental/gen_ai/src/quantize/common/include)
endif()
if($ENV{ATEN_STATIC_CUDA})
if(CUDA_VERSION VERSION_LESS_EQUAL 12.9)
list(APPEND ATen_CUDA_DEPENDENCY_LIBS

View File

@ -389,37 +389,16 @@ void fillVersion<DLManagedTensorVersioned>(
// constructed out of ATen tensor
template <class T>
T* toDLPackImpl(const Tensor& src) {
auto view = src;
// Detect whether there is need to normalize the strides
// Background: gh-83069
//
// However, normalizing strides can come at a high-cost
// to slow down toDLPack conversion 3x, so we
// only normalize if needed.
//
// The following code detects whether the src follows
// a continuous pattern. If the src follows such pattern (common-case)
// then we do not need to normalize the strides.
bool need_normalize_strides = src.dim() == 1 && src.size(0) == 1 && src.stride(0) != 1;
// less common case, try normalizing the strides
if (need_normalize_strides) {
// create a new tensor with possibly normalized strides
// gh-83069
auto shape = src.sizes();
view = src.as_strided(shape, {1}, src.storage_offset());
}
ATenDLMTensor<T>* atDLMTensor(new ATenDLMTensor<T>);
atDLMTensor->handle = view;
atDLMTensor->handle = src;
atDLMTensor->tensor.manager_ctx = atDLMTensor;
atDLMTensor->tensor.deleter = &deleter<T>;
atDLMTensor->tensor.dl_tensor.data = view.data_ptr();
atDLMTensor->tensor.dl_tensor.data = src.data_ptr();
atDLMTensor->tensor.dl_tensor.device = torchDeviceToDLDevice(src.device());
atDLMTensor->tensor.dl_tensor.ndim = static_cast<int32_t>(src.dim());
atDLMTensor->tensor.dl_tensor.dtype = getDLDataType(src);
atDLMTensor->tensor.dl_tensor.shape = const_cast<int64_t*>(view.sizes().data());
atDLMTensor->tensor.dl_tensor.strides = const_cast<int64_t*>(view.strides().data());
atDLMTensor->tensor.dl_tensor.shape = const_cast<int64_t*>(src.sizes().data());
atDLMTensor->tensor.dl_tensor.strides = const_cast<int64_t*>(src.strides().data());
atDLMTensor->tensor.dl_tensor.byte_offset = 0;
fillVersion(&atDLMTensor->tensor);

View File

@ -52,16 +52,16 @@ struct DLPackTraits {};
template <>
struct DLPackTraits<DLManagedTensor> {
inline static const char* capsule = "dltensor";
inline static const char* used = "used_dltensor";
inline static constexpr const char* capsule = "dltensor";
inline static constexpr const char* used = "used_dltensor";
inline static auto toDLPack = at::toDLPack;
inline static auto fromDLPack = at::fromDLPack;
};
template <>
struct DLPackTraits<DLManagedTensorVersioned> {
inline static const char* capsule = "dltensor_versioned";
inline static const char* used = "used_dltensor_versioned";
inline static constexpr const char* capsule = "dltensor_versioned";
inline static constexpr const char* used = "used_dltensor_versioned";
inline static auto toDLPack = at::toDLPackVersioned;
inline static auto fromDLPack = at::fromDLPackVersioned;
};

View File

@ -58,7 +58,7 @@ namespace at {
namespace{
// PyTorch allows operations to specify dim 0 and dim -1 on a scalar tensor.
static bool is_allowed_dim_on_scalar_tensor(int64_t dim) {
bool is_allowed_dim_on_scalar_tensor(int64_t dim) {
return dim == 0 || dim == -1;
}
@ -365,7 +365,7 @@ Tensor select_batching_rule(const Tensor& self, int64_t dim, int64_t index) {
return self_physical.getPhysicalToLogicalMap().apply(result);
}
static int64_t getGradInputPhysicalDim(int64_t dim, IntArrayRef input_sizes, int64_t num_batch_dims) {
int64_t getGradInputPhysicalDim(int64_t dim, IntArrayRef input_sizes, int64_t num_batch_dims) {
return maybe_wrap_dim(dim, static_cast<int64_t>(input_sizes.size())) + num_batch_dims;
}
@ -488,7 +488,7 @@ Tensor view_as_complex_batching_rule(const Tensor& self) {
// Checks that the smallest batch stride is greater than the largest example
// stride. This is something we can support but we choose not to because it's
// potentially error prone.
static void checkBatchDimsAtFrontInLayout(IntArrayRef physical_strides, int64_t num_batch_dims) {
void checkBatchDimsAtFrontInLayout(IntArrayRef physical_strides, int64_t num_batch_dims) {
auto smallest_batch_stride = std::min_element(
physical_strides.begin(), physical_strides.begin() + num_batch_dims);
auto largest_example_stride = std::max_element(
@ -508,7 +508,7 @@ static void checkBatchDimsAtFrontInLayout(IntArrayRef physical_strides, int64_t
// given (sizes, strides, storage_offset) returns the maximum location that
// can be indexed (or nullopt if such a location doesn't exist, e.g., tensors
// with zero-size dims).
static std::optional<int64_t> maximum_indexable_location(
std::optional<int64_t> maximum_indexable_location(
IntArrayRef sizes, IntArrayRef strides, int64_t storage_offset) {
auto result = native::storage_size_for(sizes, strides);
if (result == 0) {
@ -521,7 +521,7 @@ static std::optional<int64_t> maximum_indexable_location(
// This checks that the range of possible memory locations accessible by
// x.as_strided(sizes, strides, maybe_storage_offset)
// are within the bounds of possible memory locations accessible by x.
static void checkBasicAsStridedValidForSlice(
void checkBasicAsStridedValidForSlice(
const Tensor& physical_tensor,
int64_t num_batch_dims,
IntArrayRef sizes,

View File

@ -42,8 +42,14 @@ const PythonTorchFunctionTLS& PythonTorchFunctionTLS::get_state() {
}
bool torch_function_mode_enabled() {
return PythonTorchFunctionTLS::get_disabled_state() != TorchFunctionDisabledState::ALL_DISABLED &&
PythonTorchFunctionTLS::stack_len() > 0;
// Manually flatten because gcc is refusing to inline here. Note
// that we are still calling __tls_get_addr twice here with GCC,
// presumably because of
// https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81501 (which says
// the fix ships in GCC 16), but forcing inlining still improves
// performance.
const auto& ptfs = pythonTorchFunctionState;
return ptfs.disabled_state_ != TorchFunctionDisabledState::ALL_DISABLED && !ptfs.stack_.empty();
}
// This is needed to disambiguate the ternary torch function disabled states

View File

@ -27,6 +27,7 @@ struct TORCH_API PythonTorchFunctionTLS {
TorchFunctionDisabledState disabled_state_ =
TorchFunctionDisabledState::ENABLED;
std::vector<std::shared_ptr<c10::SafePyObject>> stack_;
friend TORCH_API bool torch_function_mode_enabled();
};
TORCH_API bool torch_function_mode_enabled();

View File

@ -13,7 +13,7 @@ namespace {
// and left at true for the rest of the execution.
// It's an optimization so that users who never use default hooks don't need to
// read the thread_local variables pack_hook_ and unpack_hook_.
static bool is_initialized(false);
bool is_initialized(false);
}
static void assertSavedTensorHooksNotDisabled() {

View File

@ -56,7 +56,7 @@ inline void get_strides(int64_t* strides, ArrayRef<OperandInfo> operands, int64_
}
}
static OptionalTensorRef make_otr(const TensorBase &tensor) {
OptionalTensorRef make_otr(const TensorBase &tensor) {
if (tensor.defined()) {
return OptionalTensorRef(tensor);
} else {

View File

@ -36,7 +36,7 @@ namespace {
using weakref_type = c10::weak_intrusive_ptr<TensorImpl, UndefinedTensorImpl>;
using val_type = std::tuple<weakref_type, Tensor>;
static ska::flat_hash_map<TensorImpl*, val_type>& get_cached_casts() {
ska::flat_hash_map<TensorImpl*, val_type>& get_cached_casts() {
static ska::flat_hash_map<TensorImpl*, val_type> cached_casts;
return cached_casts;
}

View File

@ -6,9 +6,9 @@ namespace at {
namespace {
static std::array<HostAllocator*, at::COMPILE_TIME_MAX_DEVICE_TYPES>
std::array<HostAllocator*, at::COMPILE_TIME_MAX_DEVICE_TYPES>
allocator_array{};
static std::array<uint8_t, at::COMPILE_TIME_MAX_DEVICE_TYPES>
std::array<uint8_t, at::COMPILE_TIME_MAX_DEVICE_TYPES>
allocator_priority{};
} // anonymous namespace

View File

@ -624,7 +624,14 @@ struct TORCH_API IValue final {
IValue(const c10::SymBool& i) {
if (auto mi = i.maybe_as_bool()) {
tag = Tag::Bool;
#if __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
payload.u.as_int = *mi;
#elif __BYTE_ORDER__ == __ORDER_BIG_ENDIAN__
/* due to byteorder if value assigned as_int, as_bool actually is not set correctly */
payload.u.as_bool = *mi;
#else
#error Unexpected or undefined __BYTE_ORDER__
#endif
} else {
tag = Tag::SymBool;
payload.u.as_intrusive_ptr = i.toSymNodeImpl().release();

View File

@ -108,7 +108,7 @@ static hipblasStatus_t rocBLASStatusToHIPStatus(rocblas_status error)
namespace {
static cublasOperation_t _cublasOpFromChar(char op) {
cublasOperation_t _cublasOpFromChar(char op) {
// NOLINTNEXTLINE(bugprone-switch-missing-default-case)
switch (op) {
case 'n':
@ -128,7 +128,7 @@ static cublasOperation_t _cublasOpFromChar(char op) {
"_cublasOpFromChar input should be 't', 'n' or 'c' but got `", op, "`");
}
static void _cublasAdjustLdLevel2(int64_t m, int64_t n, int64_t* lda) {
void _cublasAdjustLdLevel2(int64_t m, int64_t n, int64_t* lda) {
// Note: leading dimensions generally are checked that they are > 0
// and at least as big the result requires (even if the value won't
// be used).
@ -142,7 +142,7 @@ static void _cublasAdjustLdLevel2(int64_t m, int64_t n, int64_t* lda) {
*lda = std::max<int64_t>(m, 1);
}
static void _cublasAdjustLdLevel3(
void _cublasAdjustLdLevel3(
char transa,
char transb,
int64_t m,

View File

@ -15,19 +15,19 @@ namespace cuda::detail {
namespace {
// Total number of gpus in the system.
static int64_t num_gpus;
int64_t num_gpus;
// Ensures default_gens_cuda is initialized once.
static std::deque<c10::once_flag> cuda_gens_init_flag;
std::deque<c10::once_flag> cuda_gens_init_flag;
// Default, global CUDA generators, one per GPU.
static std::vector<Generator> default_gens_cuda;
std::vector<Generator> default_gens_cuda;
/*
* Populates the global variables related to CUDA generators
* Warning: this function must only be called once!
*/
static void initCUDAGenVector() {
void initCUDAGenVector() {
// Ensures we only call cudaGetDeviceCount only once.
static bool num_gpu_init_flag [[maybe_unused]] = []() {
num_gpus = static_cast<int32_t>(c10::cuda::device_count());

View File

@ -13,6 +13,7 @@
#include <c10/core/ScalarType.h>
#include <ATen/cuda/tunable/TunableOp.h>
#include <ATen/cuda/tunable/Tunable.h>
#include <ATen/cuda/CUDABlas.h>
#include <ATen/cuda/Exceptions.h>
#include <c10/util/StringUtil.h>
@ -150,6 +151,7 @@ inline std::string ScalarTypeToBLASType(c10::ScalarType scalar_type) {
BLASType = "unknown";
}
return BLASType;
}
// Similar to Compute Type in GemmRocblas.h
@ -244,33 +246,25 @@ inline std::string to_string_epilogue(const at::cuda::blas::GEMMAndBiasActivatio
namespace detail {
static bool NumericalCheck(ScalarType dtype, void* c, void* other_c, int64_t size) {
static bool NumericalCheck(ScalarType dtype, void* c, void* other_c, int64_t size, const NumericalCheckConfig& config) {
if (!config.enabled) {
return true; // skip when disabled
}
auto options = at::TensorOptions().dtype(dtype).device(at::kCUDA);
// comparison done as 1D tensor
at::Tensor ref = at::from_blob(c, {size}, options);
at::Tensor oth = at::from_blob(other_c, {size}, options);
at::Tensor ref_float = ref.to(at::kFloat);
at::Tensor oth_float = oth.to(at::kFloat);
std::vector<double> atols{1e-1, 1e-2, 1e-3, 1e-4, 1e-5};
std::vector<double> rtols{1e-1, 1e-2, 1e-3, 1e-4, 1e-5};
double last_succeed_atol = 1;
double last_succeed_rtol = 1;
for (auto& atol : atols) {
for (auto& rtol : rtols) {
if (at::allclose(ref_float, oth_float, rtol, atol)) {
last_succeed_atol = atol;
last_succeed_rtol = rtol;
}
}
}
if (last_succeed_atol == 1) {
return false;
}
else {
TUNABLE_LOG3("├──verify numerics: atol=", last_succeed_atol, ", rtol=", last_succeed_rtol);
}
return true;
const bool ok = at::allclose(ref_float, oth_float, config.rtol, config.atol);
if (ok) {
TUNABLE_LOG3("├──verify numerics: PASSED with atol=", config.atol, ", rtol=", config.rtol);
} else {
TUNABLE_LOG3("├──verify numerics: FAILED with atol=", config.atol, ", rtol=", config.rtol);
}
return ok;
}
}
@ -355,8 +349,10 @@ struct GemmParams : OpParams {
}
TuningStatus NumericalCheck(GemmParams<T> *other) {
auto* ctx = getTuningContext();
auto cfg = ctx->GetNumericalCheckConfig();
auto c_dtype = c10::CppTypeToScalarType<T>::value;
return detail::NumericalCheck(c_dtype, c, other->c, GetSizeC()/sizeof(T)) ? OK : FAIL;
return detail::NumericalCheck(c_dtype, c, other->c, GetSizeC()/sizeof(T), cfg) ? OK : FAIL;
}
char transa{};
@ -449,8 +445,10 @@ struct GemmAndBiasParams : OpParams {
}
TuningStatus NumericalCheck(GemmAndBiasParams<T> *other) {
auto* ctx = getTuningContext();
auto cfg = ctx->GetNumericalCheckConfig();
auto c_dtype = c10::CppTypeToScalarType<T>::value;
return detail::NumericalCheck(c_dtype, c, other->c, GetSizeC()/sizeof(T)) ? OK : FAIL;
return detail::NumericalCheck(c_dtype, c, other->c, GetSizeC()/sizeof(T), cfg) ? OK : FAIL;
}
char transa{};
@ -546,8 +544,10 @@ struct GemmStridedBatchedParams : OpParams {
}
TuningStatus NumericalCheck(GemmStridedBatchedParams<T> *other) {
auto* ctx = getTuningContext();
auto cfg = ctx->GetNumericalCheckConfig();
auto c_dtype = c10::CppTypeToScalarType<C_Dtype>::value;
return detail::NumericalCheck(c_dtype, c, other->c, GetSizeC()/sizeof(T)) ? OK : FAIL;
return detail::NumericalCheck(c_dtype, c, other->c, GetSizeC()/sizeof(T), cfg) ? OK : FAIL;
}
char transa{};
@ -663,7 +663,9 @@ struct ScaledGemmParams : OpParams {
}
TuningStatus NumericalCheck(ScaledGemmParams<T> *other) {
return detail::NumericalCheck(c_dtype, c, other->c, GetSizeC()/sizeof(T)) ? OK : FAIL;
auto* ctx = getTuningContext();
auto cfg = ctx->GetNumericalCheckConfig();
return detail::NumericalCheck(c_dtype, c, other->c, GetSizeC()/sizeof(T), cfg) ? OK : FAIL;
}
char transa{};

View File

@ -145,7 +145,7 @@ programmatically since the settings become fixed. Use the C++ or Python APIs ins
| PYTORCH_TUNABLEOP_VERBOSE | Default is 0. Set to 1 to enable basic logging. 2 for basic tuning status. 3 for full trace. |
| PYTORCH_TUNABLEOP_VERBOSE_FILENAME | Default is "err" for stderr. Set to "out" for stdout or a filename for capturing verbose logging. |
| PYTORCH_TUNABLEOP_FILENAME | Default is 'tunableop_results.csv'. |
| PYTORCH_TUNABLEOP_NUMERICAL_CHECK | Default is 0. Set to 1 to enable. |
| PYTORCH_TUNABLEOP_NUMERICAL_CHECK | Default is off. Set 'atol_rtol' to enable, for example "1e-5_1e-5". |
| PYTORCH_TUNABLEOP_ROCBLAS_ENABLED | Default is 1. Set to 0 to disable rocblas being considered during tuning. |
| PYTORCH_TUNABLEOP_HIPBLASLT_ENABLED | Default is 1. Set to 0 to disable hipblaslt being considered during tuning. |
| PYTORCH_TUNABLEOP_MAX_TUNING_DURATION_MS | Default is 30. Unit is milliseconds. |
@ -173,10 +173,9 @@ All python APIs exist in the `torch.cuda.tunable` module.
| get_max_tuning_iterations() -> int | |
| set_filename(filename: str, insert_device_ordinal: bool = False) -> None | |
| get_filename() -> str | |
| set_numerical_check_tolerances(enable: bool, atol: float, rtol: float) -> None | Enable or disable numerical checking; atol and rtol default to 1e-5.
| get_results() -> Tuple[str, str, str, float] | |
| get_validators() -> Tuple[str, str] | |
| write_file_on_exit(val: bool) -> None | Default is True. |
| write_file(filename: Optional[str] = None) -> None | If filename not given, it will call get_filename(). |
| read_file(filename: Optional[str] = None) -> None | If filename not given, it will call get_filename(). |
| tune_gemm_in_file(filename: str) -> None | read an untuned file and tune GEMMs in it. |
| mgpu_tune_gemm_in_file(filename_pattern: str, num_gpus: int) -> None: -> None | read one or more untuned files and tune all unique GEMMs on one or more GPUs. |

View File

@ -107,14 +107,30 @@ void TuningResultsManager::AddImpl(const std::string& op_signature,
}
void TuningResultsManager::Add(const std::string& op_signature, const std::string& params_signature, ResultEntry best) {
std::scoped_lock l{lock_};
bool is_new = false;
ResultEntry inserted = ResultEntry::Null();
auto it = results_.find(op_signature);
if (it == results_.end()) {
it = results_.insert({op_signature, {}}).first;
// ---- mutate maps under results lock ----
{
std::scoped_lock l{lock_};
auto& km = results_[op_signature]; // creates if missing
is_new = (km.find(params_signature) == km.end());
AddImpl(op_signature, params_signature, std::move(best), km);
if (is_new) {
inserted = km.at(params_signature); // snapshot for I/O after unlocking
}
}
if (!is_new) return; // only write once per unique (op, params)
TuningContext* ctx = getTuningContext();
if (ctx->IsTuningEnabled() && !ctx->IsRecordUntunedEnabled()) {
InitRealtimeAppend(ctx->GetFilename(), ctx->GetTuningResultsValidator().GetAllValidators());
if (is_new && realtime_out_ && realtime_out_->good()) {
AppendResultLine(op_signature, params_signature, inserted);
}
}
AddImpl(op_signature, params_signature, std::move(best), it->second);
}
void TuningResultsManager::RecordUntuned( std::ofstream& untuned_file, const std::string& op_signature,
@ -150,6 +166,77 @@ void TuningResultsManager::RecordUntuned( std::ofstream& untuned_file, const std
}
}
void TuningResultsManager::InitRealtimeAppend(const std::string& filename, const std::unordered_map<std::string, std::string>& validators) {
std::scoped_lock fl{realtime_file_mutex_};
if (realtime_out_ && realtime_out_->good() && realtime_filename_ == filename) {
return;
}
if (realtime_out_ && realtime_filename_ != filename) {
realtime_out_->flush();
realtime_out_->close();
realtime_out_.reset();
validators_written_ = false;
}
bool file_exists = false;
bool file_empty = true;
{
std::ifstream check_file(filename);
if (check_file.good()) {
file_exists = true;
file_empty = (check_file.peek() == std::ifstream::traits_type::eof());
}
}
realtime_out_ = std::make_unique<std::ofstream>(filename, std::ios::out | std::ios::app);
if (!realtime_out_->good()) {
TORCH_WARN("TunableOp realtime append: failed to open '", filename,"'");
realtime_out_.reset();
return;
}
if(!file_exists || file_empty) {
for(const auto& [key, val] : validators) {
(*realtime_out_) << "Validator," << key << "," << val << std::endl;
realtime_out_->flush();
}
validators_written_ = true;
TUNABLE_LOG2("Wrote validators to realtime output file");
}
realtime_filename_ = filename;
}
void TuningResultsManager::AppendResultLine(const std::string& op_sig, const std::string& param_sig, const ResultEntry& result) {
std::scoped_lock fl{realtime_file_mutex_};
if(!realtime_out_ || !realtime_out_->good()) {
return;
}
(*realtime_out_) << op_sig << "," << param_sig << "," << result << std::endl;
realtime_out_->flush(); //ensure immediate write to disk
TUNABLE_LOG3("Realtime append: ", op_sig, "(", param_sig, ") -> ", result);
}
void TuningResultsManager::CloseRealtimeAppend() {
std::scoped_lock fl{realtime_file_mutex_};
if(realtime_out_) {
realtime_out_->flush();
realtime_out_->close();
realtime_out_.reset();
TUNABLE_LOG2("Closed realtime output file");
}
}
void TuningResultsManager::Delete(const std::string& op_signature, const std::string& params_signature) {
std::scoped_lock l{lock_};
@ -396,7 +483,6 @@ TuningContext::TuningContext() :
tuning_enable_{true},
record_untuned_enable_{false},
manager_initialized_{false},
write_file_on_exit_{true},
numerics_check_enable_{false},
max_tuning_duration_ms_{30},
max_tuning_iterations_{100},
@ -417,20 +503,8 @@ TuningContext::~TuningContext() {
// but doesn't do any computation itself.
return;
}
auto filename = GetFilename();
if (IsTunableOpEnabled() && IsTuningEnabled() && !filename.empty() && write_file_on_exit_) {
if (results_count_from_input_file_ < GetTuningResultsManager().GetSize()) {
if (results_count_from_input_file_ > 0) {
TUNABLE_LOG1("additional tuning results available, rewriting file ", filename);
}
else {
TUNABLE_LOG1("writing file ", filename);
}
if (!WriteFile(filename)) {
TUNABLE_LOG1("failed to write file ", filename);
}
}
}
TUNABLE_LOG1("Closing File");
GetTuningResultsManager().CloseRealtimeAppend(); // Since, we do instant logging by default now.
if (untuned_file_.good()) {
untuned_file_.close();
@ -511,20 +585,54 @@ std::ofstream& TuningContext::GetUntunedFile(){
return untuned_file_;
}
void TuningContext::WriteFileOnExit(bool value) {
write_file_on_exit_ = value;
}
void TuningContext::EnableNumericsCheck(bool value) {
numerics_check_enable_ = value;
}
bool TuningContext::IsNumericsCheckEnabled() const {
const auto env = c10::utils::get_env("PYTORCH_TUNABLEOP_NUMERICAL_CHECK");
if (env == "1") {
return true;
NumericalCheckConfig TuningContext::GetNumericalCheckConfig() const {
const auto env_opt = c10::utils::get_env("PYTORCH_TUNABLEOP_NUMERICAL_CHECK");
if (!env_opt.has_value()) {
return numerics_cfg_;
}
return numerics_check_enable_;
const std::string& env = env_opt.value();
if (env == "0") {
return NumericalCheckConfig(false, 1e-5, 1e-5);
}
const size_t underscore = env.find('_');
TORCH_CHECK(
underscore != std::string::npos,
"Invalid PYTORCH_TUNABLEOP_NUMERICAL_CHECK format. "
"Expected 'atol_rtol', got: ",
env);
double atol = 0.0;
double rtol = 0.0;
try {
atol = std::stod(env.substr(0, underscore));
rtol = std::stod(env.substr(underscore + 1));
} catch (const std::exception& e) {
TORCH_CHECK(false, "Failed to parse PYTORCH_TUNABLEOP_NUMERICAL_CHECK: ", e.what());
}
TORCH_CHECK( atol > 0.0 && rtol > 0.0, "Tolerance values must be positive. atol=", atol, ", rtol=", rtol);
return NumericalCheckConfig(true, atol, rtol);
}
void TuningContext::SetNumericalCheckConfig(bool enabled, double atol, double rtol) {
TORCH_CHECK(atol > 0.0 && rtol > 0.0, "Numerical check tolerances must be positive");
numerics_cfg_ = {enabled, atol, rtol};
}
bool TuningContext::IsNumericsCheckEnabled() const {
const auto cfg = GetNumericalCheckConfig();
return cfg.enabled || numerics_check_enable_;
}
void TuningContext::SetMaxTuningDurationMs(int max_duration_ms) {
@ -634,11 +742,6 @@ TuningResultsManager& TuningContext::GetTuningResultsManager() {
auto filename = GetFilename();
if (!filename.empty() && !IsRecordUntunedEnabled()) {
ReadFile(filename);
// attempt immediately to open file for writing to catch errors early
std::ofstream file(filename, std::ios::out | std::ios::app);
if (!file.good()) {
TORCH_WARN("failed to open file '", filename, "' for writing; your tuning results will not be saved");
}
}
});
return manager_;
@ -744,27 +847,6 @@ bool TuningContext::ReadFile(const std::string& filename_) {
return true;
}
bool TuningContext::WriteFile(const std::string& filename_) {
std::string filename = filename_.empty() ? GetFilename() : filename_;
std::ofstream file(filename, std::ios::out | std::ios::trunc);
if (!file.good()) {
TUNABLE_LOG1("error opening tuning results file for writing ", filename);
return false;
}
auto validators = GetTuningResultsValidator().GetAllValidators();
for (const auto& [key, val] : validators) {
file << "Validator," << key << "," << val << std::endl;
}
auto results = GetTuningResultsManager().Dump();
for (const auto& [op_sig, kernelmap] : results) {
for (const auto& [param_sig, result] : kernelmap) {
file << op_sig << "," << param_sig << "," << result << std::endl;
}
}
file.close();
return true;
}
namespace {
struct MaybeDelete {

View File

@ -103,10 +103,24 @@ class TORCH_CUDA_CPP_API TuningResultsManager {
void RecordUntuned( std::ofstream& untuned_file, const std::string& op_signature,
const std::string& params_signature, const std::string& blas_signature);
void InitRealtimeAppend(
const std::string& filename,
const std::unordered_map<std::string, std::string>& validators);
void AppendResultLine(const std::string& op_sig,
const std::string& param_sig,
const ResultEntry& result);
void CloseRealtimeAppend(); // For clean shutdown
private:
std::mutex lock_;
std::mutex realtime_file_mutex_;
std::unique_ptr<std::ofstream> realtime_out_;
std::string realtime_filename_;
ResultsMap results_;
UntunedMap untuned_results_;
bool validators_written_ = false;
};
@ -134,6 +148,16 @@ class TORCH_CUDA_CPP_API TuningResultsValidator {
GetValidateFuncs validators_;
};
struct NumericalCheckConfig {
bool enabled{false};
double atol{1e-5};
double rtol{1e-5};
NumericalCheckConfig() = default;
NumericalCheckConfig(bool e, double a, double r) : enabled(e), atol(a), rtol(r) {}
};
class TORCH_CUDA_CPP_API TuningContext {
public:
TuningContext();
@ -155,6 +179,8 @@ class TORCH_CUDA_CPP_API TuningContext {
void EnableNumericsCheck(bool value);
bool IsNumericsCheckEnabled() const;
void SetNumericalCheckConfig(bool enabled, double atol, double rtol);
NumericalCheckConfig GetNumericalCheckConfig() const;
void SetMaxTuningDurationMs(int max_duration_ms);
int GetMaxTuningDurationMs() const;
@ -185,10 +211,7 @@ class TORCH_CUDA_CPP_API TuningContext {
void SetFilename(const std::string& filename, bool insert_device_ordinal=false);
std::string GetFilename() const;
void WriteFileOnExit(bool value);
bool ReadFile(const std::string& filename={});
bool WriteFile(const std::string& filename={});
template<class... Types>
void Log(int level, Types... args) {
@ -207,7 +230,6 @@ class TORCH_CUDA_CPP_API TuningContext {
bool tuning_enable_;
bool record_untuned_enable_;
bool manager_initialized_;
bool write_file_on_exit_;
bool numerics_check_enable_;
int max_tuning_duration_ms_;
int max_tuning_iterations_;
@ -222,6 +244,8 @@ class TORCH_CUDA_CPP_API TuningContext {
std::ofstream untuned_file_;
size_t results_count_from_input_file_;
bool is_shutting_down_;
NumericalCheckConfig numerics_cfg_{};
};
TORCH_CUDA_CPP_API TuningContext* getTuningContext();

View File

@ -267,27 +267,10 @@ class TunableOp {
for (size_t i = 0; i < op_names_.size(); i++) {
auto* candidate = ops_[op_names_[i]].get(); // borrow pointer
if (do_numerics_check) {
ParamsT* numerical_params = params->DeepCopy(false);
auto status = candidate->Call(numerical_params);
if (status != OK) {
numerical_params->Delete();
TUNABLE_LOG3("├──unsupported id=", i, ", ", op_sig, '(', params_sig, ") ", op_names_[i]);
continue;
}
status = reference_params->NumericalCheck(numerical_params);
numerical_params->Delete();
if (status != OK) {
TUNABLE_LOG3("├──numerics check failed for id=", i, ", ", op_sig, '(', params_sig, ") ", op_names_[i]);
continue;
}
}
else {
auto status = candidate->Call(reusable_params[0]);
if (status != OK) {
TUNABLE_LOG3("├──unsupported id=", i, ", ", op_sig, '(', params_sig, ") ", op_names_[i]);
continue;
}
auto status = candidate->Call(reusable_params[0]);
if (status != OK) {
TUNABLE_LOG3("├──unsupported id=", i, ", ", op_sig, '(', params_sig, ") ", op_names_[i]);
continue;
}
// collect a small profile
@ -310,6 +293,22 @@ class TunableOp {
continue;
}
if (do_numerics_check) {
ParamsT* numerical_params = params->DeepCopy(false);
auto status = candidate->Call(numerical_params);
if (status != OK) {
numerical_params->Delete();
TUNABLE_LOG3("├──unsupported id=", i, ", ", op_sig, '(', params_sig, ") ", op_names_[i]);
continue;
}
status = reference_params->NumericalCheck(numerical_params);
numerical_params->Delete();
if (status != OK) {
TUNABLE_LOG3("├──numerics check failed for id=", i, ", ", op_sig, '(', params_sig, ") ", op_names_[i]);
continue;
}
}
// for warmup does user set max duration, max iters, or both?
// warmup is skipped by default, i.e. warmup_iter = 0
// warmup will be set to the non-zero value of max_warmup_duration

View File

@ -39,7 +39,7 @@ Tensor vdot_decomp(const Tensor& A, const Tensor& B) {
// NB: I wrote this like this because we *might* want its for a future matmul
// batch rule that isn't decomposed...
// "tv" = tensor @ vector
static std::tuple<Tensor, std::optional<int64_t>> tv_batch_rule(
std::tuple<Tensor, std::optional<int64_t>> tv_batch_rule(
const Tensor& self, std::optional<int64_t> self_bdim,
const Tensor& other, std::optional<int64_t> other_bdim) {
if (self_bdim && other_bdim) {
@ -66,7 +66,7 @@ static std::tuple<Tensor, std::optional<int64_t>> tv_batch_rule(
TORCH_INTERNAL_ASSERT(false, "can't get here");
}
static std::tuple<Tensor, std::optional<int64_t>> mv_batch_rule(
std::tuple<Tensor, std::optional<int64_t>> mv_batch_rule(
const Tensor& self, std::optional<int64_t> self_bdim,
const Tensor& other, std::optional<int64_t> other_bdim) {
auto self_logical_rank = rankWithoutBatchDim(self, self_bdim);
@ -79,7 +79,7 @@ static std::tuple<Tensor, std::optional<int64_t>> mv_batch_rule(
return tv_batch_rule(self, self_bdim, other, other_bdim);
}
static std::tuple<Tensor, std::optional<int64_t>> mm_batch_rule(
std::tuple<Tensor, std::optional<int64_t>> mm_batch_rule(
const Tensor& self, std::optional<int64_t> self_bdim,
const Tensor& other, std::optional<int64_t> other_bdim) {
auto self_logical_rank = rankWithoutBatchDim(self, self_bdim);
@ -94,7 +94,7 @@ static std::tuple<Tensor, std::optional<int64_t>> mm_batch_rule(
return std::make_tuple( at::matmul(self_, other_), 0 );
}
static std::tuple<Tensor, std::optional<int64_t>> bmm_batch_rule(
std::tuple<Tensor, std::optional<int64_t>> bmm_batch_rule(
const Tensor& self, std::optional<int64_t> self_bdim,
const Tensor& other, std::optional<int64_t> other_bdim) {
auto self_logical_rank = rankWithoutBatchDim(self, self_bdim);
@ -250,7 +250,7 @@ struct LinalgCheckMatrixBinaryRuleHelper<op_name, F, Func, typelist<A, B, T...>>
}
};
static void expect_at_least_rank(
void expect_at_least_rank(
const Tensor& tensor,
std::optional<int64_t> tensor_bdim,
int64_t expected_rank,
@ -472,7 +472,7 @@ atol_rtol_tensor_batch_rule(
return std::make_tuple(Func(input_, atol_, rtol_, hermitian), 0);
}
static std::tuple<Tensor, std::optional<int64_t>>
std::tuple<Tensor, std::optional<int64_t>>
pinv_batch_rule(
const Tensor& input, std::optional<int64_t> input_bdim, const std::optional<Tensor>& atol,
const std::optional<int64_t> atol_bdim, const std::optional<Tensor>& rtol,

View File

@ -213,40 +213,22 @@ static cudnn_grid_sample_backward_batch_rule(
return grid_sample_backward_helper_out(std::move(bw_out), 0, 0, bdim_size);
}
// TODO: replace with targetable functionalization
// uses functional formulation for one_hot under vmap to be compatible with
// fakeTensor/dynamic shapes and compiled functorch transforms.
// mirrors the meta path in aten/src/ATen/native/Onehot.cpp,
// but requires explicit positive num_classes under vmap to avoid
// data-dependent output shapes.
static Tensor one_hot_decomposition_hack(const Tensor &self, int64_t num_classes) {
TORCH_CHECK(self.dtype() == kLong, "one_hot is only applicable to index tensor.");
auto shape = self.sym_sizes().vec();
// empty tensor could be converted to one hot representation,
// but shape inference is not possible.
if (self.sym_numel() == 0) {
if (num_classes <= 0) {
TORCH_CHECK(false, "Can not infer total number of classes from empty tensor.");
} else {
shape.emplace_back(num_classes);
return at::empty_symint(shape, self.options());
}
}
// disallow implicit inference under vmap; this would be data-dependent
// and is intentionally guarded by Dynamo in torch/_dynamo/variables/torch.py.
TORCH_CHECK(num_classes > 0, "When vmap-ing torch.nn.functional.one_hot, please "
"provide an explicit positive num_classes argument.");
// Disabling all of the following checks. This is OK because scatter has checks too.
// Maybe one_hot should be a primitive wrt autograd so we don't have to deal with this.
// // non-empty tensor
// if (self.device().type() != at::kCUDA) {
// //for cuda, rely on device assert thrown by scatter
// TORCH_CHECK(self.min().item().toLong() >= 0, "Class values must be non-negative.");
// }
// if (self.device().type() != at::kCUDA) {
// //rely on device asserts from scatter to avoid sync here
// TORCH_CHECK(num_classes > self.max().item().toLong(), "Class values must be smaller than num_classes.");
// }
shape.emplace_back(num_classes);
Tensor ret = at::zeros_symint(shape, self.options());
return ret.scatter(-1, self.unsqueeze(-1), 1);
const auto options = self.options();
at::Tensor index = at::arange(num_classes, options);
return at::eq(self.unsqueeze(-1), index).to(at::kLong);
}
template <typename A, A a, typename C>

View File

@ -19,7 +19,7 @@
namespace at::functorch {
namespace {
static bool any_has_value(ArrayRef<std::optional<int64_t>> bdims) {
bool any_has_value(ArrayRef<std::optional<int64_t>> bdims) {
for (const auto& bdim : bdims) {
if (bdim.has_value()) {
return true;
@ -28,7 +28,7 @@ static bool any_has_value(ArrayRef<std::optional<int64_t>> bdims) {
return false;
}
static int64_t get_num_leading_nones(ArrayRef<std::optional<Tensor>> indices) {
int64_t get_num_leading_nones(ArrayRef<std::optional<Tensor>> indices) {
int64_t result = 0;
for (const auto& idx : indices) {
if (!idx.has_value() || !idx->defined()) {
@ -40,7 +40,7 @@ static int64_t get_num_leading_nones(ArrayRef<std::optional<Tensor>> indices) {
return result;
}
static int64_t get_max_index_logical_dim(
int64_t get_max_index_logical_dim(
ArrayRef<std::optional<Tensor>> indices,
ArrayRef<std::optional<int64_t>> indices_bdims) {
int64_t max_logical_dim = -1;
@ -57,7 +57,7 @@ static int64_t get_max_index_logical_dim(
return max_logical_dim;
}
static std::vector<std::optional<Tensor>> batchIndices(
std::vector<std::optional<Tensor>> batchIndices(
at::TensorOptions options,
ArrayRef<std::optional<Tensor>> indices,
ArrayRef<std::optional<int64_t>> indices_bdims,
@ -126,7 +126,7 @@ static std::vector<std::optional<Tensor>> batchIndices(
// Define an "advanced index" to be a selection object that is
// a non-trivial Tensor (i.e. it does not represent :).
static bool is_advanced_index(const std::optional<Tensor>& idx) {
bool is_advanced_index(const std::optional<Tensor>& idx) {
if (!idx.has_value()) {
return false;
}
@ -137,7 +137,7 @@ static bool is_advanced_index(const std::optional<Tensor>& idx) {
}
// See NOTE: [advanced indices adjacent] for definition
static bool are_advanced_indices_adjacent(ArrayRef<std::optional<Tensor>> indices) {
bool are_advanced_indices_adjacent(ArrayRef<std::optional<Tensor>> indices) {
int64_t num_advanced_indices_regions = 0;
bool in_advanced_indices_region = false;
for (const auto& idx : indices) {
@ -165,7 +165,7 @@ static bool are_advanced_indices_adjacent(ArrayRef<std::optional<Tensor>> indice
// - result: Tensor[B, 4, 5, 6, 2, 3, 7, 8]
// ------- ----
// region2 region1
static Tensor swap_regions(const Tensor& tensor, int64_t first_region_size, int64_t second_region_size) {
Tensor swap_regions(const Tensor& tensor, int64_t first_region_size, int64_t second_region_size) {
VmapDimVector permutation(tensor.dim(), 0);
std::iota(permutation.begin(), permutation.end(), 0);
std::rotate(
@ -553,7 +553,7 @@ Tensor &_index_put_impl__plumbing(Tensor &self, const List<std::optional<Tensor>
return self;
}
static Tensor maybe_permute_values(
Tensor maybe_permute_values(
const Tensor& values,
ArrayRef<std::optional<Tensor>> orig_indices,
ArrayRef<std::optional<int64_t>> orig_indices_bdims) {
@ -1052,7 +1052,7 @@ std::tuple<Tensor, std::optional<int64_t>> index_add_batch_rule(
other, other_bdim, alpha, false);
}
static std::tuple<Tensor,Tensor> binary_pointwise_align(
std::tuple<Tensor,Tensor> binary_pointwise_align(
const Tensor & self,
std::optional<int64_t> self_bdim,
const Tensor & mask,

View File

@ -346,7 +346,7 @@ std::tuple<Tensor, std::optional<int64_t>> slice_batch_rule(
return std::make_tuple(std::move(result), 0);
}
static bool is_allowed_dim_on_scalar_tensor(int64_t dim) {
bool is_allowed_dim_on_scalar_tensor(int64_t dim) {
return dim == 0 || dim == -1;
}

View File

@ -68,18 +68,18 @@ namespace at::functorch {
namespace{
// PyTorch allows operations to specify dim 0 and dim -1 on a scalar tensor.
static bool is_allowed_dim_on_scalar_tensor(int64_t dim) {
bool is_allowed_dim_on_scalar_tensor(int64_t dim) {
return dim == 0 || dim == -1;
}
static int64_t get_current_level() {
int64_t get_current_level() {
auto maybe_level = maybeCurrentDynamicLayer();
TORCH_INTERNAL_ASSERT(maybe_level.has_value());
return maybe_level->layerId();
}
// This check should probably go into the dispatcher...
static bool participatesInCurrentLevel(const Tensor& self) {
bool participatesInCurrentLevel(const Tensor& self) {
auto current_level = get_current_level();
auto* maybe_batched_impl = maybeGetBatchedImpl(self);
if (!maybe_batched_impl) {
@ -90,7 +90,7 @@ static bool participatesInCurrentLevel(const Tensor& self) {
return self_level == current_level;
}
static bool participatesInCurrentLevel(ITensorListRef self) {
bool participatesInCurrentLevel(ITensorListRef self) {
for (const Tensor& tensor : self) {
if (participatesInCurrentLevel(tensor)) {
return true;
@ -285,7 +285,7 @@ std::vector<Tensor> unbind_batching_rule(const Tensor& self, int64_t dim) {
// given (sizes, strides, storage_offset) returns the maximum location that
// can be indexed (or nullopt if such a location doesn't exist, e.g., tensors
// with zero-size dims).
static std::optional<c10::SymInt> maximum_indexable_location(
std::optional<c10::SymInt> maximum_indexable_location(
c10::SymIntArrayRef sizes, c10::SymIntArrayRef strides, const c10::SymInt& storage_offset) {
auto result = native::storage_size_for(sizes, strides);
if (result == 0) {
@ -298,7 +298,7 @@ static std::optional<c10::SymInt> maximum_indexable_location(
// This checks that the range of possible memory locations accessible by
// x.as_strided(sizes, strides, maybe_storage_offset)
// are within the bounds of possible memory locations accessible by x.
static void checkBasicAsStridedValidForSlice(
void checkBasicAsStridedValidForSlice(
const Tensor& physical_tensor,
int64_t num_batch_dims,
c10::SymIntArrayRef sizes,

View File

@ -71,7 +71,7 @@ Tensor linear_hack(const Tensor& input, const Tensor& weight, const std::optiona
return output;
}
static inline at::Tensor apply_loss_reduction(const at::Tensor& unreduced, int64_t reduction) {
inline at::Tensor apply_loss_reduction(const at::Tensor& unreduced, int64_t reduction) {
if (reduction == at::Reduction::Mean) {
return unreduced.mean();
} else if (reduction == at::Reduction::Sum) {
@ -127,7 +127,7 @@ namespace {
template<bool inplace>
using Ctype = std::conditional_t<inplace, Tensor&, Tensor>;
static Tensor make_feature_noise(const Tensor& input) {
Tensor make_feature_noise(const Tensor& input) {
auto input_sizes = input.sizes();
TORCH_CHECK(input.dim() >= 2, "Feature dropout requires at least 2 dimensions in the input");
std::vector<int64_t> sizes;
@ -141,7 +141,7 @@ static Tensor make_feature_noise(const Tensor& input) {
return at::empty(sizes, input.options());
}
static bool is_fused_kernel_acceptable(const Tensor& input, double p) {
bool is_fused_kernel_acceptable(const Tensor& input, double p) {
return (input.is_cuda() || input.is_xpu() || input.is_lazy() || input.is_privateuseone()) && p > 0 && p < 1 && input.numel() > 0;
}
@ -210,7 +210,7 @@ ALIAS_SPECIALIZATION(_feature_dropout, true, false)
ALIAS_SPECIALIZATION(_alpha_dropout, false, true )
ALIAS_SPECIALIZATION(_feature_alpha_dropout, true, true )
static Tensor dropout(const Tensor& input, double p, bool train) {
Tensor dropout(const Tensor& input, double p, bool train) {
auto result = [&]() {
NoNamesGuard guard;
if (train && is_fused_kernel_acceptable(input, p)) {

View File

@ -24,7 +24,7 @@ namespace at::native {
namespace {
template <typename scalar_t>
static void adaptive_avg_pool3d_out_frame(
void adaptive_avg_pool3d_out_frame(
const scalar_t* input_p,
scalar_t* output_p,
int64_t sizeD,
@ -176,7 +176,7 @@ void adaptive_avg_pool3d_out_cpu_template(
}
template <typename scalar_t>
static void adaptive_avg_pool3d_backward_out_frame(
void adaptive_avg_pool3d_backward_out_frame(
scalar_t* gradInput_p,
const scalar_t* gradOutput_p,
int64_t sizeD,

View File

@ -93,7 +93,7 @@ namespace {
// 5d tensor B x D x T x H x W
template <typename scalar_t>
static void adaptive_max_pool3d_single_out_frame(
void adaptive_max_pool3d_single_out_frame(
const scalar_t *input_p,
scalar_t *output_p,
int64_t *ind_p,
@ -170,7 +170,7 @@ static void adaptive_max_pool3d_single_out_frame(
}
template <typename scalar_t>
static void adaptive_max_pool3d_out_frame(
void adaptive_max_pool3d_out_frame(
const scalar_t *input_data,
scalar_t *output_data,
int64_t *indices_data,
@ -202,7 +202,7 @@ static void adaptive_max_pool3d_out_frame(
}
template <typename scalar_t>
static void adaptive_max_pool3d_backward_single_out_frame(
void adaptive_max_pool3d_backward_single_out_frame(
scalar_t *gradInput_p,
const scalar_t *gradOutput_p,
const int64_t *ind_p,
@ -241,7 +241,7 @@ static void adaptive_max_pool3d_backward_single_out_frame(
}
template <typename scalar_t>
static void adaptive_max_pool3d_backward_out_frame(
void adaptive_max_pool3d_backward_out_frame(
scalar_t *gradInput_data,
const scalar_t *gradOutput_data,
const int64_t *indices_data,

View File

@ -153,7 +153,7 @@ namespace at::native {
namespace {
template <typename scalar_t>
static void avg_pool3d_out_frame(
void avg_pool3d_out_frame(
const scalar_t *input_p,
scalar_t *output_p,
int64_t nslices,
@ -333,7 +333,7 @@ TORCH_IMPL_FUNC(avg_pool3d_out_cpu) (
namespace {
template <typename scalar_t>
static void avg_pool3d_backward_out_frame(
void avg_pool3d_backward_out_frame(
scalar_t *gradInput_p,
const scalar_t *gradOutput_p,
int64_t nslices,

View File

@ -143,13 +143,13 @@ Tensor& cholesky_inverse_kernel_impl(Tensor& result, Tensor& infos, bool upper)
For more info see https://github.com/pytorch/pytorch/issues/145801#issuecomment-2631781776
*/
template <typename T>
static inline
inline
std::enable_if_t<std::is_floating_point_v<T>, int> lapack_work_to_int(const T val) {
const auto next_after = std::nextafter(val, std::numeric_limits<T>::infinity());
return std::max<int>(1, std::ceil(next_after));
}
template <typename T>
static inline
inline
std::enable_if_t<c10::is_complex<T>::value, int> lapack_work_to_int(const T val) {
return lapack_work_to_int(val.real());
}
@ -343,7 +343,7 @@ void linalg_eigh_kernel(const Tensor& eigenvalues, const Tensor& eigenvectors, c
For further details, please see the LAPACK documentation for GEQRF.
*/
template <typename scalar_t>
static void apply_geqrf(const Tensor& input, const Tensor& tau) {
void apply_geqrf(const Tensor& input, const Tensor& tau) {
#if !AT_BUILD_WITH_LAPACK()
TORCH_CHECK(
false,
@ -1039,7 +1039,7 @@ void lu_solve_kernel(const Tensor& LU, const Tensor& pivots, const Tensor& B, Tr
}
template <typename scalar_t>
static void apply_svd(const Tensor& A,
void apply_svd(const Tensor& A,
const bool full_matrices,
const bool compute_uv,
const Tensor& U,

View File

@ -71,7 +71,7 @@
namespace at::native {
namespace {
static void col2im_out_cpu_template(
void col2im_out_cpu_template(
Tensor& output,
const Tensor& input_,
IntArrayRef output_size,

View File

@ -25,7 +25,7 @@ namespace at::native {
namespace {
static Tensor compute_columns2d(
Tensor compute_columns2d(
const Tensor& input,
IntArrayRef padding,
IntArrayRef stride,
@ -93,7 +93,7 @@ static Tensor compute_columns2d(
return columns.contiguous();
}
static inline void slow_conv2d_shape_check(
inline void slow_conv2d_shape_check(
const Tensor& input,
const Tensor& grad_output,
const Tensor& weight,
@ -205,7 +205,7 @@ static inline void slow_conv2d_shape_check(
}
}
static inline Tensor view_weight_2d(const Tensor& weight_,
inline Tensor view_weight_2d(const Tensor& weight_,
at::MemoryFormat memory_format = at::MemoryFormat::Contiguous) {
Tensor weight = weight_.contiguous(memory_format);
if (weight.dim() == 4) {
@ -220,7 +220,7 @@ static inline Tensor view_weight_2d(const Tensor& weight_,
}
template <typename scalar_t>
static void slow_conv2d_update_output_frame(
void slow_conv2d_update_output_frame(
TensorAccessor<const scalar_t, 3> input,
TensorAccessor<scalar_t, 3> output,
TensorAccessor<const scalar_t, 2> weight,
@ -480,7 +480,7 @@ void slow_conv2d_backward_weight_frame(
}
}
static void slow_conv2d_backward_weight_out_cpu_template(
void slow_conv2d_backward_weight_out_cpu_template(
Tensor& grad_weight,
const Tensor& input,
const Tensor& grad_output_,

View File

@ -28,7 +28,7 @@ namespace at::native {
namespace {
static Tensor compute_columns3d(
Tensor compute_columns3d(
const Tensor& input_,
IntArrayRef stride,
IntArrayRef padding,
@ -108,7 +108,7 @@ static Tensor compute_columns3d(
return columns;
}
static inline void slow_conv3d_shape_check(
inline void slow_conv3d_shape_check(
const Tensor& input,
const Tensor& grad_output,
const Tensor& weight,
@ -273,7 +273,7 @@ static inline void slow_conv3d_shape_check(
}
}
static Tensor view_weight_2d(const Tensor& weight_) {
Tensor view_weight_2d(const Tensor& weight_) {
Tensor weight = weight_.contiguous();
if (weight.dim() == 5) {
const int64_t s1 = weight.size(0);
@ -286,7 +286,7 @@ static Tensor view_weight_2d(const Tensor& weight_) {
}
template <typename scalar_t>
static void slow_conv3d_update_output_frame(
void slow_conv3d_update_output_frame(
TensorAccessor<const scalar_t, 4> input,
TensorAccessor<scalar_t, 4> output,
TensorAccessor<const scalar_t, 2> weight,
@ -515,7 +515,7 @@ void slow_conv3d_backward_weight_frame(
grad_weight.data(), ldc, grad_weight.stride(0) * n);
}
static void slow_conv3d_backward_parameters_out_cpu_template(
void slow_conv3d_backward_parameters_out_cpu_template(
Tensor& grad_weight,
const Tensor& input,
const Tensor& grad_output,

View File

@ -108,7 +108,7 @@ bool is_fast_path(const Tensor& src, const std::optional<Tensor>& scale, Tensor&
// index_add (using add_indices as the index), without creating an intermediary
// tensor to hold the selected embeddings
template <typename data_t, typename index_t>
static std::enable_if_t<std::is_same_v<data_t, double>, void>
std::enable_if_t<std::is_same_v<data_t, double>, void>
index_select_add(
const Tensor& select_indices,
const Tensor& add_indices,
@ -494,7 +494,7 @@ index_select_add(const Tensor &select_indices,
// mul (scaling by per_sample_weights)
// index_add (using add_indices as the index)
template <typename data_t, typename index_t>
static std::enable_if_t<std::is_same_v<data_t, double>, void>
std::enable_if_t<std::is_same_v<data_t, double>, void>
index_select_scale_add(
const Tensor& select_indices,
const Tensor& add_indices,

View File

@ -130,7 +130,7 @@ namespace native {
namespace {
template <typename scalar_t>
static void fractional_max_pool2d_out_single_batch_frame(
void fractional_max_pool2d_out_single_batch_frame(
const scalar_t* input,
scalar_t* output,
int64_t* indices,
@ -188,7 +188,7 @@ static void fractional_max_pool2d_out_single_batch_frame(
}
template <typename scalar_t>
static void fractional_max_pool2d_out_frame(
void fractional_max_pool2d_out_frame(
const scalar_t* input,
scalar_t* output,
int64_t* indices,
@ -220,7 +220,7 @@ static void fractional_max_pool2d_out_frame(
}
template <typename scalar_t>
static void fractional_max_pool2d_backward_out_single_batch_frame(
void fractional_max_pool2d_backward_out_single_batch_frame(
scalar_t* gradInput,
const scalar_t* gradOutput,
const int64_t* indices,
@ -247,7 +247,7 @@ static void fractional_max_pool2d_backward_out_single_batch_frame(
}
template <typename scalar_t>
static void fractional_max_pool2d_backward_out_frame(
void fractional_max_pool2d_backward_out_frame(
scalar_t* gradInput,
const scalar_t* gradOutput,
const int64_t* indices,

View File

@ -99,7 +99,7 @@ namespace at::native {
namespace {
template<typename scalar_t>
static void fractional_max_pool3d_out_single_batch_frame(
void fractional_max_pool3d_out_single_batch_frame(
const scalar_t* input,
scalar_t* output,
int64_t* indices,
@ -169,7 +169,7 @@ static void fractional_max_pool3d_out_single_batch_frame(
}
template<typename scalar_t>
static void fractional_max_pool3d_out_frame(
void fractional_max_pool3d_out_frame(
const scalar_t* input,
scalar_t* output,
int64_t* indices,
@ -257,7 +257,7 @@ TORCH_IMPL_FUNC(fractional_max_pool3d_out_cpu)(
namespace {
template<typename scalar_t>
static void fractional_max_pool3d_backward_out_single_batch_frame(
void fractional_max_pool3d_backward_out_single_batch_frame(
scalar_t* gradInput,
const scalar_t* gradOutput,
const int64_t* indices,
@ -287,7 +287,7 @@ static void fractional_max_pool3d_backward_out_single_batch_frame(
}
template<typename scalar_t>
static void fractional_max_pool3d_backward_out_frame(
void fractional_max_pool3d_backward_out_frame(
scalar_t* gradInput,
const scalar_t* gradOutput,
const int64_t* indices,

View File

@ -19,7 +19,7 @@
namespace at::native {
namespace {
static void im2col_out_cpu_template(
void im2col_out_cpu_template(
Tensor& output,
const Tensor& input_,
IntArrayRef kernel_size,

View File

@ -61,7 +61,7 @@
constexpr float EPSILON = 1e-12;
namespace {
static inline at::Tensor apply_loss_reduction(const at::Tensor& unreduced, int64_t reduction) {
inline at::Tensor apply_loss_reduction(const at::Tensor& unreduced, int64_t reduction) {
if (reduction == at::Reduction::Mean) {
return unreduced.mean();
} else if (reduction == at::Reduction::Sum) {

View File

@ -44,7 +44,7 @@ namespace {
// this ad-hoc converts from targets (l in [1]) to augmented targets (l' in [1]) note that no bound-checking is done
template<typename target_t>
static inline int64_t get_target_prime(target_t* target, int64_t offset, int64_t stride, int64_t idx, int64_t BLANK) {
inline int64_t get_target_prime(target_t* target, int64_t offset, int64_t stride, int64_t idx, int64_t BLANK) {
if (idx % 2 == 0) {
return BLANK;
} else {

View File

@ -58,7 +58,7 @@ inline scalar_t multilabel_margin_loss_forward_inner_sum_cpu(
}
template <typename scalar_t>
static void multilabel_margin_loss_forward_out_frame(
void multilabel_margin_loss_forward_out_frame(
const Tensor& input_contiguous,
const Tensor& target_contiguous,
Tensor& output,
@ -108,7 +108,7 @@ static void multilabel_margin_loss_forward_out_frame(
}
}
static void multilabel_margin_loss_forward_out_cpu_template(
void multilabel_margin_loss_forward_out_cpu_template(
const Tensor& input,
const Tensor& target,
Tensor& output,
@ -153,7 +153,7 @@ static void multilabel_margin_loss_forward_out_cpu_template(
}
template <typename scalar_t>
static void multilabel_margin_loss_backward_out_frame(
void multilabel_margin_loss_backward_out_frame(
Tensor& grad_input,
const Tensor& grad_output,
const Tensor& input_contiguous,
@ -222,7 +222,7 @@ static void multilabel_margin_loss_backward_out_frame(
}
}
static void multilabel_margin_loss_backward_out_cpu_template(
void multilabel_margin_loss_backward_out_cpu_template(
Tensor& grad_input,
const Tensor& grad_output,
const Tensor& input,

View File

@ -57,7 +57,7 @@ inline int64_t target_index_checked(
}
template <typename scalar_t>
static inline void multi_margin_loss_cpu_kernel(
inline void multi_margin_loss_cpu_kernel(
Tensor& output,
const scalar_t* input_data,
const int64_t* target_data,
@ -148,7 +148,7 @@ void multi_margin_loss_out_cpu_template(
}
template <typename scalar_t>
static void multi_margin_loss_backward_cpu_kernel(
void multi_margin_loss_backward_cpu_kernel(
scalar_t* grad_input_data,
const Tensor& grad_output,
const scalar_t* input_data,

View File

@ -159,7 +159,7 @@ inline scalar_t* optional_data(const Tensor& source) {
}
template <typename scalar_t, typename target_t>
static void nll_loss_out_frame(
void nll_loss_out_frame(
const Tensor& output,
const Tensor& total_weight,
const Tensor& input,
@ -338,7 +338,7 @@ void nll_loss_forward_out_cpu_template(
}
template <typename scalar_t, typename target_t>
static void nll_loss_backward_out_frame(
void nll_loss_backward_out_frame(
const Tensor& grad_input,
const Tensor& grad_output,
const Tensor& input,

View File

@ -99,7 +99,7 @@ inline void check_gradout_shape_nll_loss2d(
template <typename scalar_t>
static void nll_loss2d_forward_out_frame(
void nll_loss2d_forward_out_frame(
Tensor& output,
Tensor& total_weight,
const Tensor& input,
@ -280,7 +280,7 @@ void nll_loss2d_forward_out_cpu_template(
}
template <typename scalar_t>
static void nll_loss2d_backward_out_frame(
void nll_loss2d_backward_out_frame(
Tensor& grad_input,
const Tensor& grad_output,
const Tensor& input,

View File

@ -24,7 +24,7 @@
namespace at {
namespace {
static inline void slow_conv_transpose2d_shape_check(
inline void slow_conv_transpose2d_shape_check(
const Tensor& input,
const Tensor& grad_output,
const Tensor& weight,
@ -386,7 +386,7 @@ void slow_conv_transpose2d_out_cpu_template(
}
}
static void slow_conv_transpose2d_backward_out_cpu_template(
void slow_conv_transpose2d_backward_out_cpu_template(
const Tensor& input_,
const Tensor& grad_output_,
Tensor& grad_input,

View File

@ -22,7 +22,7 @@ namespace at::native {
namespace {
static inline void slow_conv_transpose3d_shape_check(
inline void slow_conv_transpose3d_shape_check(
const Tensor& input,
const Tensor& grad_output,
const Tensor& weight,

View File

@ -92,7 +92,7 @@ namespace {
arg_name, " should contain ", expected, " elements not ", actual);
}
static inline Tensor repeat_if_defined(const Tensor& t, const SymInt& repeat) {
inline Tensor repeat_if_defined(const Tensor& t, const SymInt& repeat) {
if (t.defined()) {
return t.repeat_symint(repeat);
}

View File

@ -34,16 +34,16 @@ Tensor one_hot(const Tensor &self, int64_t num_classes) {
}
}
auto shape = self.sizes().vec();
auto shape = self.sym_sizes().vec();
// empty tensor could be converted to one hot representation,
// but shape inference is not possible.
if (self.numel() == 0) {
if (self.sym_numel() == 0) {
if (num_classes <= 0) {
TORCH_CHECK(false, "Can not infer total number of classes from empty tensor.");
} else {
shape.push_back(num_classes);
return at::empty(shape, self.options());
shape.emplace_back(num_classes);
return at::empty_symint(shape, self.options());
}
}
@ -66,8 +66,8 @@ Tensor one_hot(const Tensor &self, int64_t num_classes) {
}
}
shape.push_back(num_classes);
Tensor ret = at::zeros(shape, self.options());
shape.emplace_back(num_classes);
Tensor ret = at::zeros_symint(shape, self.options());
ret.scatter_(-1, self.unsqueeze(-1), 1);
return ret;
}

View File

@ -538,7 +538,7 @@ c10::intrusive_ptr<CellParamsBase> make_quantized_cell_params_fp16(
std::move(w_ih_packed), std::move(w_hh_packed));
}
static std::unordered_map<
std::unordered_map<
std::string,
c10::intrusive_ptr<CellParamsBase> (*)(CellParamsSerializationType)>
cell_params_deserializers = {
@ -578,7 +578,7 @@ struct QRNNCellParamsWrapper {
// Gathers every two elements of a vector in a vector of pairs
template<typename T>
static std::vector<pair_of<T>> pair_vec(const std::vector<T>& vals) {
std::vector<pair_of<T>> pair_vec(const std::vector<T>& vals) {
TORCH_CHECK(vals.size() % 2 == 0, "Odd number of params or hiddens given to a bidirectional RNN");
std::vector<pair_of<T>> result;
result.reserve(vals.size() / 2);
@ -590,7 +590,7 @@ static std::vector<pair_of<T>> pair_vec(const std::vector<T>& vals) {
// Flattens a vector of pairs
template<typename T>
static std::vector<T> unpair_vec(std::vector<pair_of<T>>&& vals) {
std::vector<T> unpair_vec(std::vector<pair_of<T>>&& vals) {
std::vector<T> result;
result.reserve(vals.size() * 2);
for (const auto i : c10::irange(vals.size())) {
@ -601,7 +601,7 @@ static std::vector<T> unpair_vec(std::vector<pair_of<T>>&& vals) {
}
// Parses a flat list of parameter tensors into a list of CellParams
static std::vector<CellParams> gather_params(TensorList params, bool has_biases, bool has_projections = false) {
std::vector<CellParams> gather_params(TensorList params, bool has_biases, bool has_projections = false) {
static at::Tensor undefined;
std::vector<CellParams> result;
if (has_biases) {
@ -1894,10 +1894,10 @@ static DEFINE_QUANTIZED_RNN_CELL_DYNAMIC(quantized_rnn_tanh_cell_dynamic, simple
namespace {
[[maybe_unused]] static auto ensure_linear_params_registered =
[[maybe_unused]] auto ensure_linear_params_registered =
register_linear_params();
static auto cell_params_base_registry =
auto cell_params_base_registry =
torch::selective_class_<CellParamsBase>("rnn", TORCH_SELECTIVE_CLASS("CellParamsBase"))
.def_pickle(
[](const c10::intrusive_ptr<CellParamsBase>& self)

View File

@ -2676,7 +2676,7 @@ inline std::tuple<Tensor, Tensor, int64_t> _take_along_dim_helper(
std::move(dim));
}
static inline void checkDevice(CheckedFrom c, const Tensor& t, Device device) {
inline void checkDevice(CheckedFrom c, const Tensor& t, Device device) {
TORCH_CHECK(
!t.defined() || t.device() == device,
"Expected tensor to have ",
@ -2689,7 +2689,7 @@ static inline void checkDevice(CheckedFrom c, const Tensor& t, Device device) {
")");
}
static inline void checkDevice(
inline void checkDevice(
CheckedFrom c,
at::ArrayRef<Tensor> tensors,
Device device) {

View File

@ -3641,7 +3641,7 @@ Tensor& transpose_(Tensor& self, int64_t dim0, int64_t dim1) {
namespace {
// Transpose implementation for sparse compressed layouts
// NB: We assume that dim1,dim0 have already been wrapped
static inline Tensor sparse_compressed_transpose(
inline Tensor sparse_compressed_transpose(
const Tensor& self,
int64_t dim0,
int64_t dim1) {

View File

@ -29,7 +29,7 @@ namespace {
// grad_in does not mean that it is a gradient wrt to input,
// grad_in/grad_out is just an input/output of unfold_backward kernel.
[[maybe_unused]] static TensorIterator _make_unfold_backward_iter_over_grad_out(
[[maybe_unused]] TensorIterator _make_unfold_backward_iter_over_grad_out(
Tensor& grad_out,
const Tensor& grad_in,
int64_t dim,

View File

@ -105,7 +105,7 @@ namespace at::native {
namespace {
template <typename scalar_t>
static void upsample_bicubic2d_backward_out_frame(
void upsample_bicubic2d_backward_out_frame(
const scalar_t* odata,
scalar_t* idata,
int64_t input_height,
@ -177,7 +177,7 @@ static void upsample_bicubic2d_backward_out_frame(
});
}
static void upsample_bicubic2d_backward_kernel(
void upsample_bicubic2d_backward_kernel(
const Tensor& grad_input,
const Tensor& grad_output_,
IntArrayRef output_size,

View File

@ -39,6 +39,6 @@ int register_linear_params() {
}
namespace {
[[maybe_unused]] static auto linear_params = register_linear_params();
[[maybe_unused]] auto linear_params = register_linear_params();
} // namespace
} // namespace ao::sparse

View File

@ -30,7 +30,7 @@ namespace {
// Workaround for gcc-14.2.0 ICE during RTL pass: expand when compiling for NEON
__attribute__((optimize("no-tree-vectorize")))
#endif
static void log_sigmoid_cpu_kernel(TensorBase &output, TensorBase &buffer, const TensorBase &input) {
void log_sigmoid_cpu_kernel(TensorBase &output, TensorBase &buffer, const TensorBase &input) {
if (at::isReducedFloatingType(input.scalar_type())) {
AT_DISPATCH_REDUCED_FLOATING_TYPES(input.scalar_type(), "log_sigmoid_cpu", [&]() {
using Vec = Vectorized<scalar_t>;
@ -96,7 +96,7 @@ static void log_sigmoid_cpu_kernel(TensorBase &output, TensorBase &buffer, const
}
}
static void log_sigmoid_backward_cpu_kernel(TensorIterator& iter) {
void log_sigmoid_backward_cpu_kernel(TensorIterator& iter) {
if (at::isReducedFloatingType(iter.dtype())) {
AT_DISPATCH_REDUCED_FLOATING_TYPES(iter.dtype(), "log_sigmoid_backward_cpu", [&]() {
using Vec = Vectorized<scalar_t>;
@ -150,7 +150,7 @@ static void log_sigmoid_backward_cpu_kernel(TensorIterator& iter) {
}
}
static void threshold_kernel(
void threshold_kernel(
TensorIteratorBase& iter,
const Scalar& threshold_scalar,
const Scalar& value_scalar) {
@ -868,7 +868,7 @@ void hardswish_backward_kernel(TensorIterator& iter) {
}
}
static void leaky_relu_kernel(TensorIteratorBase& iter, const Scalar& negval_) {
void leaky_relu_kernel(TensorIteratorBase& iter, const Scalar& negval_) {
if (at::isReducedFloatingType(iter.dtype())) {
AT_DISPATCH_REDUCED_FLOATING_TYPES(iter.dtype(), "leaky_relu_cpu", [&]() {
auto zero_vec = Vectorized<float>((float)(0));
@ -907,7 +907,7 @@ static void leaky_relu_kernel(TensorIteratorBase& iter, const Scalar& negval_) {
}
}
static void leaky_relu_backward_kernel(TensorIteratorBase& iter, const Scalar& negval_) {
void leaky_relu_backward_kernel(TensorIteratorBase& iter, const Scalar& negval_) {
if (at::isReducedFloatingType(iter.dtype())) {
AT_DISPATCH_REDUCED_FLOATING_TYPES(iter.dtype(), "leaky_relu_backward_cpu", [&]() {
auto zero_vec = Vectorized<float>((float)(0));

View File

@ -369,7 +369,7 @@ void gemm_notrans_(
#endif // defined(__aarch64__) && !defined(C10_MOBILE)
#if !defined(C10_MOBILE)
static float compute_dot(const at::Half* a, const at::Half* b, int64_t len) {
float compute_dot(const at::Half* a, const at::Half* b, int64_t len) {
return at::native::CPU_CAPABILITY::fp16_dot_with_fp32_arith(
a, b, len);
}
@ -406,7 +406,7 @@ void gemm_transa_(
});
}
static float compute_dot(const at::BFloat16* a, const at::BFloat16* b, int64_t len) {
float compute_dot(const at::BFloat16* a, const at::BFloat16* b, int64_t len) {
return at::native::CPU_CAPABILITY::bf16_dot_with_fp32_arith(a, b, len);
}

View File

@ -15,12 +15,12 @@ namespace at::native {
inline namespace CPU_CAPABILITY {
namespace {
static bool reduced_input(ScalarType input_t, ScalarType output_t) {
bool reduced_input(ScalarType input_t, ScalarType output_t) {
return !at::isFloat8Type(input_t) && at::isReducedFloatingType(input_t) &&
output_t == kFloat;
}
static bool reduced_output(ScalarType input_t, ScalarType output_t) {
bool reduced_output(ScalarType input_t, ScalarType output_t) {
return !at::isFloat8Type(output_t) && at::isReducedFloatingType(output_t) &&
input_t == kFloat;
}

View File

@ -15,7 +15,7 @@ namespace at::native {
namespace {
template<typename scalar_t>
static void apply_cross(const Tensor& result, const Tensor& a, const Tensor& b, const int64_t dim) {
void apply_cross(const Tensor& result, const Tensor& a, const Tensor& b, const int64_t dim) {
int64_t total = a.numel() / 3;
int64_t a_stride = a.stride(dim);
int64_t b_stride = b.stride(dim);
@ -68,7 +68,7 @@ static void apply_cross(const Tensor& result, const Tensor& a, const Tensor& b,
});
}
static void cross_kernel_impl(const Tensor& result, const Tensor& a, const Tensor& b, const int64_t dim) {
void cross_kernel_impl(const Tensor& result, const Tensor& a, const Tensor& b, const int64_t dim) {
AT_DISPATCH_ALL_TYPES_AND_COMPLEX_AND2(kBFloat16, kHalf, result.scalar_type(), "cross", [&]() {
apply_cross<scalar_t>(result, a, b, dim);
});

View File

@ -422,19 +422,19 @@ void pdist_forward_kernel_impl(Tensor& result, const Tensor& self, const double
});
}
static void pdist_backward_kernel_impl(Tensor& result, const Tensor& grad, const Tensor& self, const double p, const Tensor& dist) {
void pdist_backward_kernel_impl(Tensor& result, const Tensor& grad, const Tensor& self, const double p, const Tensor& dist) {
AT_DISPATCH_FLOATING_TYPES(self.scalar_type(), "pdist_backward", [&] {
Dist<scalar_t>::apply_backward_pdist(result, grad, self, p, dist);
});
}
static void cdist_kernel_impl(Tensor& result, const Tensor& x1, const Tensor& x2, const double p) {
void cdist_kernel_impl(Tensor& result, const Tensor& x1, const Tensor& x2, const double p) {
AT_DISPATCH_FLOATING_TYPES(result.scalar_type(), "cdist", [&] {
Dist<scalar_t>::apply_cdist(result, x1, x2, p);
});
}
static void cdist_backward_kernel_impl(Tensor& result, const Tensor& grad, const Tensor& x1, const Tensor& x2, const double p, const Tensor& dist) {
void cdist_backward_kernel_impl(Tensor& result, const Tensor& grad, const Tensor& x1, const Tensor& x2, const double p, const Tensor& dist) {
AT_DISPATCH_FLOATING_TYPES(result.scalar_type(), "cdist_backward", [&] {
Dist<scalar_t>::apply_backward_cdist(result, grad, x1, x2, p, dist);
});

View File

@ -27,7 +27,7 @@
namespace at::native {
namespace {
static void cauchy_kernel(TensorIteratorBase& iter, double median, double sigma, std::optional<Generator> gen) {
void cauchy_kernel(TensorIteratorBase& iter, double median, double sigma, std::optional<Generator> gen) {
CPUGeneratorImpl* generator = get_generator_or_default<CPUGeneratorImpl>(gen, detail::getDefaultCPUGenerator());
templates::cpu::cauchy_kernel(iter, median, sigma, generator);
}
@ -101,7 +101,7 @@ void bernoulli_scalar_kernel(const TensorBase &self, double p, std::optional<Gen
}
#endif
static void exponential_kernel_default(TensorIteratorBase& iter, double lambda, std::optional<Generator> gen) {
void exponential_kernel_default(TensorIteratorBase& iter, double lambda, std::optional<Generator> gen) {
CPUGeneratorImpl* generator = get_generator_or_default<CPUGeneratorImpl>(gen, detail::getDefaultCPUGenerator());
templates::cpu::exponential_kernel(iter, lambda, generator);
}
@ -198,12 +198,12 @@ void exponential_kernel(TensorIteratorBase &iter, double lambda, std::optional<G
}
#endif
static void geometric_kernel(TensorIteratorBase& iter, double p, std::optional<Generator> gen) {
void geometric_kernel(TensorIteratorBase& iter, double p, std::optional<Generator> gen) {
CPUGeneratorImpl* generator = get_generator_or_default<CPUGeneratorImpl>(gen, detail::getDefaultCPUGenerator());
templates::cpu::geometric_kernel(iter, p, generator);
}
static void log_normal_kernel(TensorIteratorBase& iter, double mean, double std, std::optional<Generator> gen) {
void log_normal_kernel(TensorIteratorBase& iter, double mean, double std, std::optional<Generator> gen) {
CPUGeneratorImpl* generator = get_generator_or_default<CPUGeneratorImpl>(gen, detail::getDefaultCPUGenerator());
templates::cpu::log_normal_kernel(iter, mean, std, generator);
}
@ -218,12 +218,12 @@ void normal_kernel(const TensorBase &self, double mean, double std, std::optiona
templates::cpu::normal_kernel(self, mean, std, generator);
}
static void random_from_to_kernel(TensorIteratorBase& iter, uint64_t range, int64_t base, std::optional<Generator> gen) {
void random_from_to_kernel(TensorIteratorBase& iter, uint64_t range, int64_t base, std::optional<Generator> gen) {
CPUGeneratorImpl* generator = get_generator_or_default<CPUGeneratorImpl>(gen, detail::getDefaultCPUGenerator());
templates::cpu::random_from_to_kernel(iter, range, base, generator);
}
static void random_kernel(TensorIteratorBase& iter, std::optional<Generator> gen) {
void random_kernel(TensorIteratorBase& iter, std::optional<Generator> gen) {
CPUGeneratorImpl* generator = get_generator_or_default<CPUGeneratorImpl>(gen, detail::getDefaultCPUGenerator());
templates::cpu::random_kernel(iter, generator);
}
@ -231,7 +231,7 @@ static void random_kernel(TensorIteratorBase& iter, std::optional<Generator> gen
// This is the special kernel to handle single specific case:
// from(inclusive) = std::numeric_limits<int64_t>::lowest()
// to(exclusive) = None (= std::numeric_limits<int64_t>::max() + 1)
static void random_full_64_bits_range_kernel(TensorIteratorBase& iter, std::optional<Generator> gen) {
void random_full_64_bits_range_kernel(TensorIteratorBase& iter, std::optional<Generator> gen) {
CPUGeneratorImpl* generator = get_generator_or_default<CPUGeneratorImpl>(gen, detail::getDefaultCPUGenerator());
templates::cpu::random_full_64_bits_range_kernel(iter, generator);
}

View File

@ -85,7 +85,7 @@ struct RandomKernel {
// ==================================================== Normal ========================================================
#ifdef CPU_CAPABILITY_AVX2
static void normal_fill_16_AVX2(float *data,
void normal_fill_16_AVX2(float *data,
const __m256* two_pi,
const __m256* one,
const __m256* minus_two,
@ -136,7 +136,7 @@ void normal_fill_AVX2(const TensorBase &self, const float mean, const float std,
#endif
template <typename scalar_t>
static void normal_fill_16(scalar_t *data, const scalar_t mean, const scalar_t std) {
void normal_fill_16(scalar_t *data, const scalar_t mean, const scalar_t std) {
for (const auto j : c10::irange(8)) {
const scalar_t u1 = 1 - data[j]; // [0, 1) -> (0, 1] for log.
const scalar_t u2 = data[j + 8];

View File

@ -158,14 +158,14 @@ inline void _mul_reduce_max_fusion_kernel(
}
template <typename scalar_t>
static inline scalar_t* conditional_data_ptr(scalar_t* ptr, scalar_t* ptr2) {
inline scalar_t* conditional_data_ptr(scalar_t* ptr, scalar_t* ptr2) {
TORCH_CHECK(ptr2 == nullptr);
return ptr;
}
template <typename scalar_t,
typename std::enable_if_t<is_reduced_floating_point_v<scalar_t>, int> = 0>
static inline scalar_t* conditional_data_ptr(float* ptr, scalar_t* ptr2) {
inline scalar_t* conditional_data_ptr(float* ptr, scalar_t* ptr2) {
return ptr2;
}

View File

@ -441,7 +441,7 @@ struct ComputeLocation<scalar_t, GridSamplerPadding::Reflection, align_corners>
// See NOTE [ Grid Sample CPU Kernels ] for details.
template<typename scalar_t>
static inline void
inline void
mask_scatter_add(const scalar_t *src, scalar_t* base_addr,
const int_same_size_t<scalar_t> *offsets,
const int_same_size_t<scalar_t> *mask, int64_t len) {
@ -1030,7 +1030,7 @@ struct ApplyGridSample<scalar_t, 2, GridSamplerInterpolation::Bicubic,
// See NOTE [ Grid Sample CPU Kernels ] for details.
template<typename scalar_t, typename ApplyFn>
static inline void grid_sample_2d_grid_slice_iterator(
inline void grid_sample_2d_grid_slice_iterator(
const TensorAccessor<const scalar_t, 3>& grid_slice, const ApplyFn &apply_fn) {
int64_t out_H = grid_slice.size(0);
int64_t out_W = grid_slice.size(1);

View File

@ -259,7 +259,7 @@ void histogramdd_out_cpu_template(const Tensor& self, const std::optional<Tensor
*
* Refer to histogramdd_out_cpu_template for more details.
*/
static void histogramdd_kernel_impl(const Tensor& self, const std::optional<Tensor>& weight, bool density,
void histogramdd_kernel_impl(const Tensor& self, const std::optional<Tensor>& weight, bool density,
Tensor& hist, const TensorList& bin_edges) {
histogramdd_out_cpu_template<BINARY_SEARCH>(self, weight, density, hist, bin_edges);
}
@ -269,7 +269,7 @@ static void histogramdd_kernel_impl(const Tensor& self, const std::optional<Tens
*
* Refer to histogramdd_out_cpu_template for more details.
*/
static void histogramdd_linear_kernel_impl(const Tensor& self, const std::optional<Tensor>& weight,
void histogramdd_linear_kernel_impl(const Tensor& self, const std::optional<Tensor>& weight,
bool density, Tensor& hist, const TensorList& bin_edges, bool local_search) {
if (local_search) {
// histogramdd codepath: both hist and bin_edges are eventually returned as output,
@ -298,7 +298,7 @@ void infer_bin_edges_from_input(const Tensor& input, const int64_t N,
std::copy(max_data, max_data + N, rightmost_edges.begin());
}
static void histogram_select_outer_bin_edges_impl(const Tensor& input, const int64_t N,
void histogram_select_outer_bin_edges_impl(const Tensor& input, const int64_t N,
std::vector<double> &leftmost_edges, std::vector<double> &rightmost_edges) {
AT_DISPATCH_FLOATING_TYPES(input.scalar_type(), "histogramdd", [&]() {
infer_bin_edges_from_input<scalar_t>(input, N, leftmost_edges, rightmost_edges);

Some files were not shown because too many files have changed in this diff Show More