Compare commits

859 Commits

Author SHA1 Message Date
239ea930a3 Remove pytorch quant docs since we are moving to torchao
Summary:
As titled.

Test Plan:
doc page generated from CI

ghstack-source-id: 14150e4e3a5d65f1a9eb5a4201fcd4bac09920f1
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157766
2025-07-09 15:31:30 -07:00
edf7bb4f51 Fix unbound local when an error occurs before pool is initialized (#156750)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156750
Approved by: https://github.com/jamesjwu
2025-07-08 00:28:21 +00:00
bbb930aba2 Bump urllib3 from 2.2.2 to 2.5.0 in /tools/build/bazel (#156390)
Bumps [urllib3](https://github.com/urllib3/urllib3) from 2.2.2 to 2.5.0.
- [Release notes](https://github.com/urllib3/urllib3/releases)
- [Changelog](https://github.com/urllib3/urllib3/blob/main/CHANGES.rst)
- [Commits](https://github.com/urllib3/urllib3/compare/2.2.2...2.5.0)

---
updated-dependencies:
- dependency-name: urllib3
  dependency-version: 2.5.0
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2025-07-07 17:13:21 -07:00
60b41de0ca remove allow-untyped-defs from torch/ao/nn/quantized/modules/rnn.py (#157234)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157234
Approved by: https://github.com/jingsh
ghstack dependencies: #157231, #157232
2025-07-08 00:11:52 +00:00
e38a335d7f remove allow-untyped-defs from torch/backends/cusparselt/__init__.py (#157232)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157232
Approved by: https://github.com/jingsh
ghstack dependencies: #157231
2025-07-08 00:11:52 +00:00
9d8cf24b3b remove allow-untyped-defs from torch/_classes.py (#157231)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157231
Approved by: https://github.com/jingsh
2025-07-08 00:11:52 +00:00
be56a8d7ac Automatically load and save dynamo entries via caching_precompile (#155913)
This PR adds a new config option, `caching_precompile`, and a `DynamoCache`, which loads and saves Dynamo Cache entries automatically. It also hooks up DynamoCache to PrecompileContext, so that we can save multiple cache entries.

When this configuration is turned on, we:
- Automatically create and initialize a CompilePackage on every torch.compile
- Automatically use BundledAutogradcache
- Automatically save the CompilePackage entry to DynamoCache after every compile

You can also use PrecompileContext.serialize() to manually serialize a full object.

I've added unit tests to exhibit this behavior.
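
A minimal usage sketch (assuming the flag lives on `torch._dynamo.config`; this is not the PR's test code):

```python
import torch
import torch._dynamo.config as dynamo_config

dynamo_config.caching_precompile = True  # enable automatic DynamoCache save/load

@torch.compile
def f(x):
    return torch.sin(x) + x

f(torch.randn(8))  # per the description above, this compile also saves a CompilePackage entry
```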

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155913
Approved by: https://github.com/zhxchen17
2025-07-07 23:57:17 +00:00
3effe0c293 [SymmMem] find_path does not search /usr/local/lib (#157695)
This PR uses `find_library` to replace `find_path`.
It also searches for NVSHMEM host lib and device lib separately.

Tested against system install location: /usr/local/lib and /usr/local/include.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157695
Approved by: https://github.com/Skylion007
ghstack dependencies: #157513
2025-07-07 23:16:45 +00:00
2fde2090d0 [inductor_collectives] Make reorder_collectives_preserve_peak pass grouping nodes (#157706)
Differential Revision: [D77861765](https://our.internmc.facebook.com/intern/diff/D77861765)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157706
Approved by: https://github.com/wconstab
2025-07-07 23:13:58 +00:00
5d8d126249 Fix einops x torch.compile interaction (#157600)
Fixes https://github.com/pytorch/pytorch/issues/157451

If/when einops releases a version greater than 0.8.1, it will just break
(without this patch).

The history is:
- Between 2.6 and 2.7, we tried to delete the einops import (#142847)
- That didn't work so well, so we applied a hotfix in 2.7.1. (#153925)
- The hotfix wasn't completely correct (0.8.1 is the latest version of
  einops, so the condition in the hotfix just always evaluates to True!)
- It turns out we didn't need to delete the einops import. We already
  do not eagerly import einops.
- I reverted the code back to the state it was in in 2.6.
  https://github.com/pytorch/pytorch/blob/release/2.6/torch/_dynamo/decorators.py

Test Plan:
- We have testing in CI for einops 0.6.1, 0.7.0, and 0.8.1. Wait for CI.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157600
Approved by: https://github.com/guilhermeleobas, https://github.com/anijain2305
ghstack dependencies: #157416
2025-07-07 23:04:02 +00:00
378c121d5e Remove unnecessary warnings during the ATen compilation process. (#157703)
Comparing uint32_t(num_threads()) with int(kCUDABlockReduceMaxThreads) always results in a compilation warning. Just change the return type of kCUDABlockReduceMaxThreads to uint32_t to avoid it.
Fixes https://github.com/pytorch/pytorch/issues/157701

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157703
Approved by: https://github.com/malfet, https://github.com/Skylion007
2025-07-07 22:49:38 +00:00
7e83d50845 Inductor logging + analysis of torch.profile (#149697)
Prereqs:
 - https://github.com/pytorch/pytorch/pull/152708

Features:
1. Adds inductor's estimate of flops and bandwidth to the json trace events that perfetto uses.
1. Only use the tflops estimation from triton if we don't have the info from the datasheet because Triton's estimates are inaccurate. I have a backlog item to fix triton flops estimation upstream. New `DeviceInfo` class, and new function `get_device_tflops`.
1. New helpers `countable_fx` and `count_flops_fx` help get the flops of an `fx.Node`.
1. Extends Triton `torch.profiler` logging to `DebugAutotuner`.
1. New script `profile_analysis.py`: `--augment_trace` adds perf estimates to any perfetto json trace, `--analyze` creates a summary table of these perf estimates, and `--diff` will compare two traces side by side:
```
Device(NVIDIA H100, 0):
 Kernel Name                              | resnet Kernel Count | resnet FLOPS       | resnet bw gbps        | resnet Dur (ms)    | resnet Achieved FLOPS % | resnet Achieved Bandwidth % | newresnet Kernel Count | newresnet FLOPS    | newresnet bw gbps     | newresnet Dur (ms) | newresnet Achieved FLOPS % | newresnet Achieved Bandwidth %
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 triton_poi_fused__native_batch_norm_legi | 24                  | 0                  | 0.11395268248131513   | 2.5919166666666666 | 0                       | 0.003401572611382541        | 24                     | 0                  | 0.11395268248131513   | 2.5919166666666666 | 0                          | 0.003401572611382541
 sm90_xmma_fprop_implicit_gemm_f32f32_tf3 | 142                 | 16932673552.422373 | 0.2585007824198784    | 12.441619718309857 | 0.08683422334575583     | 0.007716441266265022        | 142                    | 16932673552.422373 | 0.2585007824198784    | 12.441619718309857 | 0.08683422334575583        | 0.007716441266265022
 triton_red_fused__native_batch_norm_legi | 39                  | 0                  | 0.13990024992108846   | 5.752589743589743  | 0                       | 0.004176126863316074        | 39                     | 0                  | 0.13990024992108846   | 5.752589743589743  | 0                          | 0.004176126863316074
 triton_poi_fused__native_batch_norm_legi | 25                  | 0                  | 0.31824055917536503   | 2.5291999999999994 | 0                       | 0.009499718184339253        | 25                     | 0                  | 0.31824055917536503   | 2.5291999999999994 | 0                          | 0.009499718184339253
 void cutlass::Kernel2<cutlass_80_tensoro | 98                  | 16211056473.596165 | 0.42972434051025826   | 7.130408163265306  | 0.08313362294151874     | 0.012827592254037562        | 98                     | 16211056473.596165 | 0.42972434051025826   | 7.130408163265306  | 0.08313362294151874        | 0.012827592254037562
 triton_red_fused__native_batch_norm_legi | 73                  | 0                  | 0.3225381327611705    | 9.987068493150682  | 0                       | 0.009628003963020014        | 73                     | 0                  | 0.3225381327611705    | 9.987068493150682  | 0                          | 0.009628003963020014
 triton_poi_fused__native_batch_norm_legi | 15                  | 0                  | 1.4491211346487216    | 4.439333333333333  | 0                       | 0.043257347302946926        | 15                     | 0                  | 1.4491211346487216    | 4.439333333333333  | 0                          | 0.043257347302946926
 void cutlass::Kernel2<cutlass_80_tensoro | 186                 | 14501701145.337954 | 0.2667131401910989    | 7.873865591397849  | 0.07436769818122027     | 0.007961586274361157        | 186                    | 14501701145.337954 | 0.2667131401910989    | 7.873865591397849  | 0.07436769818122027        | 0.007961586274361157
 triton_poi_fused__native_batch_norm_legi | 33                  | 0                  | 1.4924556538193923    | 4.3101515151515155 | 0                       | 0.044550915039384846        | 33                     | 0                  | 1.4924556538193923    | 4.3101515151515155 | 0                          | 0.044550915039384846
 triton_red_fused__native_batch_norm_legi | 29                  | 0                  | 0.25562590522631107   | 6.296275862068965  | 0                       | 0.007630624036606301        | 29                     | 0                  | 0.25562590522631107   | 6.296275862068965  | 0                          | 0.007630624036606301
 triton_poi_fused__native_batch_norm_legi | 13                  | 0                  | 0.5870562174192726    | 2.7397692307692307 | 0                       | 0.01752406619162008         | 13                     | 0                  | 0.5870562174192726    | 2.7397692307692307 | 0                          | 0.01752406619162008
 triton_poi_fused__native_batch_norm_legi | 34                  | 0                  | 0.41409928846284      | 2.853588235294117  | 0                       | 0.012361172789935523        | 34                     | 0                  | 0.41409928846284      | 2.853588235294117  | 0                          | 0.012361172789935523
 triton_per_fused__native_batch_norm_legi | 34                  | 0                  | 0.11705315007018151   | 3.460647058823529  | 0                       | 0.0034941238826919864       | 34                     | 0                  | 0.11705315007018151   | 3.460647058823529  | 0                          | 0.0034941238826919864
 triton_poi_fused__native_batch_norm_legi | 16                  | 0                  | 0.17207853197124584   | 2.3459375000000002 | 0                       | 0.005136672596156592        | 16                     | 0                  | 0.17207853197124584   | 2.3459375000000002 | 0                          | 0.005136672596156592
 triton_per_fused__native_batch_norm_legi | 30                  | 0                  | 0.2639714322022256    | 6.131199999999999  | 0                       | 0.007879744244842555        | 30                     | 0                  | 0.2639714322022256    | 6.131199999999999  | 0                          | 0.007879744244842555
 sm90_xmma_fprop_implicit_gemm_f32f32_tf3 | 100                 | 11875430356.891787 | 0.19494470869421385   | 16.36534           | 0.06089964285585531     | 0.005819245035648175        | 100                    | 11875430356.891787 | 0.19494470869421385   | 16.36534           | 0.06089964285585531        | 0.005819245035648175
 triton_poi_fused__native_batch_norm_legi | 8                   | 0                  | 0.9854096626224687    | 3.2757500000000004 | 0                       | 0.029415213809625928        | 8                      | 0                  | 0.9854096626224687    | 3.2757500000000004 | 0                          | 0.029415213809625928
 void cublasLt::splitKreduce_kernel<32, 1 | 56                  | 34377923395.147064 | 0.8310300045762317    | 3.4199999999999986 | 0.17629704305203628     | 0.024806865808245714        | 56                     | 34377923395.147064 | 0.8310300045762317    | 3.4199999999999986 | 0.17629704305203628        | 0.024806865808245714
 triton_poi_fused__native_batch_norm_legi | 23                  | 0                  | 0.9944002965861103    | 3.2431304347826084 | 0                       | 0.02968359094286896         | 23                     | 0                  | 0.9944002965861103    | 3.2431304347826084 | 0                          | 0.02968359094286896
 triton_per_fused__native_batch_norm_legi | 10                  | 0                  | 0.1826801058931057    | 4.428800000000001  | 0                       | 0.00545313748934644         | 10                     | 0                  | 0.1826801058931057    | 4.428800000000001  | 0                          | 0.00545313748934644
 triton_poi_fused__native_batch_norm_legi | 10                  | 0                  | 0.3168973585366449    | 2.5471999999999997 | 0                       | 0.009459622642884923        | 10                     | 0                  | 0.3168973585366449    | 2.5471999999999997 | 0                          | 0.009459622642884923
 triton_poi_fused__native_batch_norm_legi | 34                  | 0                  | 1.1463614897015777    | 4.124323529411764  | 0                       | 0.03421974596124114         | 34                     | 0                  | 1.1463614897015777    | 4.124323529411764  | 0                          | 0.03421974596124114
 void cask_plugin_cudnn::xmma_cudnn::init | 44                  | 44045510816.64277  | 2.0661232850348643    | 3.6887499999999993 | 0.22587441444432194     | 0.06167532194133924         | 44                     | 44045510816.64277  | 2.0661232850348643    | 3.6887499999999993 | 0.22587441444432194        | 0.06167532194133924
 sm90_xmma_fprop_implicit_gemm_f32f32_tf3 | 95                  | 7876855400.165316  | 0.4694941555946739    | 18.224315789473682 | 0.04039413025725802     | 0.014014750913273854        | 95                     | 7876855400.165316  | 0.4694941555946739    | 18.224315789473682 | 0.04039413025725802        | 0.014014750913273854
 triton_per_fused__native_batch_norm_legi | 41                  | 0                  | 0.06825669875995298   | 3.0384146341463416 | 0                       | 0.002037513395819492        | 41                     | 0                  | 0.06825669875995298   | 3.0384146341463416 | 0                          | 0.002037513395819492
 triton_poi_fused__native_batch_norm_legi | 23                  | 0                  | 0.08808154712430301   | 2.3275652173913044 | 0                       | 0.0026292999141582997       | 23                     | 0                  | 0.08808154712430301   | 2.3275652173913044 | 0                          | 0.0026292999141582997
 triton_per_fused__native_batch_norm_legi | 40                  | 0                  | 0.18179321034952417   | 4.556825           | 0                       | 0.005426662995508183        | 40                     | 0                  | 0.18179321034952417   | 4.556825           | 0                          | 0.005426662995508183
 triton_poi_fused__native_batch_norm_legi | 15                  | 0                  | 0.5887415155454232    | 2.783866666666667  | 0                       | 0.017574373598370836        | 15                     | 0                  | 0.5887415155454232    | 2.783866666666667  | 0                          | 0.017574373598370836
 void cutlass::Kernel2<cutlass_80_tensoro | 38                  | 14242013806.264643 | 0.256592404353939     | 7.217631578947369  | 0.0730359682372546      | 0.007659474756834           | 38                     | 14242013806.264643 | 0.256592404353939     | 7.217631578947369  | 0.0730359682372546         | 0.007659474756834
 triton_poi_fused__native_batch_norm_legi | 21                  | 0                  | 0.5842860973430516    | 2.7779047619047623 | 0                       | 0.017441376040091088        | 21                     | 0                  | 0.5842860973430516    | 2.7779047619047623 | 0                          | 0.017441376040091088
 triton_per_fused__native_batch_norm_legi | 16                  | 0                  | 0.11509365173486417   | 3.5959375000000002 | 0                       | 0.0034356313950705724       | 16                     | 0                  | 0.11509365173486417   | 3.5959375000000002 | 0                          | 0.0034356313950705724
 triton_poi_fused__native_batch_norm_legi | 14                  | 0                  | 0.1704672000243914    | 2.4044285714285714 | 0                       | 0.00508857313505646         | 14                     | 0                  | 0.1704672000243914    | 2.4044285714285714 | 0                          | 0.00508857313505646
 triton_poi_fused__native_batch_norm_legi | 58                  | 0                  | 2.307520779930795     | 8.190706896551722  | 0                       | 0.06888121731136704         | 58                     | 0                  | 2.307520779930795     | 8.190706896551722  | 0                          | 0.06888121731136704
 triton_per_fused__native_batch_norm_legi | 29                  | 0                  | 0.037243248971881276  | 3.0277586206896556 | 0                       | 0.001111738775280038        | 29                     | 0                  | 0.037243248971881276  | 3.0277586206896556 | 0                          | 0.001111738775280038
 triton_poi_fused__native_batch_norm_legi | 20                  | 0                  | 0.04741699795428918   | 2.2911500000000005 | 0                       | 0.0014154327747549007       | 20                     | 0                  | 0.04741699795428918   | 2.2911500000000005 | 0                          | 0.0014154327747549007
 triton_per_fused__native_batch_norm_legi | 25                  | 0                  | 0.13357016893727824   | 3.37536            | 0                       | 0.003987169222008305        | 25                     | 0                  | 0.13357016893727824   | 3.37536            | 0                          | 0.003987169222008305
 triton_poi_fused__native_batch_norm_legi | 13                  | 0                  | 0.3089862268300253    | 2.8111538461538457 | 0                       | 0.009223469457612694        | 13                     | 0                  | 0.3089862268300253    | 2.8111538461538457 | 0                          | 0.009223469457612694
 triton_poi_fused__native_batch_norm_legi | 17                  | 0                  | 0.3129385387909844    | 2.673              | 0                       | 0.009341448919133863        | 17                     | 0                  | 0.3129385387909844    | 2.673              | 0                          | 0.009341448919133863
 triton_per_fused__native_batch_norm_legi | 19                  | 0                  | 0.2215568162533158    | 3.8837368421052636 | 0                       | 0.0066136363060691275       | 19                     | 0                  | 0.2215568162533158    | 3.8837368421052636 | 0                          | 0.0066136363060691275
 std::enable_if<!(false), void>::type int | 23                  | 504916805.19297093 | 1.0118296096314707    | 8.113913043478261  | 0.0025893169497075447   | 0.030203868944223014        | 23                     | 504916805.19297093 | 1.0118296096314707    | 8.113913043478261  | 0.0025893169497075447      | 0.030203868944223014
 triton_poi_fused_add_copy__38            | 56                  | 0                  | 0                     | 2.132482142857143  | 0                       | 0                           | 56                     | 0                  | 0                     | 2.132482142857143  | 0                          | 0
 triton_poi_fused_convolution_0           | 18                  | 0                  | 0.43458610794936897   | 2.773333333333334  | 0                       | 0.012972719640279667        | 18                     | 0                  | 0.43458610794936897   | 2.773333333333334  | 0                          | 0.012972719640279667
 triton_poi_fused_convolution_1           | 17                  | 0                  | 0.028816312469162712  | 2.6145882352941174 | 0                       | 0.0008601884319153051       | 17                     | 0                  | 0.028816312469162712  | 2.6145882352941174 | 0                          | 0.0008601884319153051
 void convolve_common_engine_float_NHWC<f | 44                  | 8641868995.31118   | 0.024730540008465626  | 25.87327272727273  | 0.04431727689903169     | 0.0007382250748795709       | 44                     | 8641868995.31118   | 0.024730540008465626  | 25.87327272727273  | 0.04431727689903169        | 0.0007382250748795709
 triton_per_fused__native_batch_norm_legi | 12                  | 0                  | 0.6809930918986744    | 4.82675            | 0                       | 0.020328151996975356        | 12                     | 0                  | 0.6809930918986744    | 4.82675            | 0                          | 0.020328151996975356
 triton_per_fused__native_batch_norm_legi | 14                  | 0                  | 0.02883030597936608   | 2.6651428571428575 | 0                       | 0.0008606061486377935       | 14                     | 0                  | 0.02883030597936608   | 2.6651428571428575 | 0                          | 0.0008606061486377935
 triton_per_fused__native_batch_norm_legi | 16                  | 0                  | 0.0014658988233201874 | 2.098              | 0                       | 4.375817383045335e-05       | 16                     | 0                  | 0.0014658988233201874 | 2.098              | 0                          | 4.375817383045335e-05
 triton_poi_fused__native_batch_norm_legi | 13                  | 0                  | 0.9926297180284697    | 3.2367692307692306 | 0                       | 0.02963073785159611         | 13                     | 0                  | 0.9926297180284697    | 3.2367692307692306 | 0                          | 0.02963073785159611
 triton_poi_fused__native_batch_norm_legi | 9                   | 0                  | 1.3008817095666507    | 3.0863333333333336 | 0                       | 0.03883228983781048         | 9                      | 0                  | 1.3008817095666507    | 3.0863333333333336 | 0                          | 0.03883228983781048
 void at::native::(anonymous namespace):: | 98                  | 0                  | 0.09174335613709389   | 4.408520408163265  | 0                       | 0.0027386076458833994       | 98                     | 0                  | 0.09174335613709389   | 4.408520408163265  | 0                          | 0.0027386076458833994
 void at::native::vectorized_elementwise_ | 7                   | 0                  | 0                     | 1.7278571428571428 | 0                       | 0                           | 7                      | 0                  | 0                     | 1.7278571428571428 | 0                          | 0
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149697
Approved by: https://github.com/eellison, https://github.com/shunting314
2025-07-07 22:13:34 +00:00
6f05d58f2b [AOTI] Split aoti_runtime/model.h to prepare for model static linking (#157592)
Summary:
Prepare for https://github.com/pytorch/pytorch/pull/157129.

We split the file so we can re-use the `model.h` part to codegen a separate header for each model with static linkage.

Test Plan:
CI

Rollback Plan:

Differential Revision: D77761249

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157592
Approved by: https://github.com/desertfire
2025-07-07 22:13:22 +00:00
a7eb153bba [MemoryViz] Add file selector button (#157647)
In some Linux desktop environments like mine, there is no drag-and-dropping of files, which made MemoryViz impossible for me to use. So this adds a file selector button as an alternative. Tested that it works locally, and also works with multiple files.

![image](https://github.com/user-attachments/assets/dcb61d68-6c6f-42f6-a075-1783d747d1b0)

And the button remains when something is loaded, to allow loading something else, but it moves out of the way to save vertical space:

![image](https://github.com/user-attachments/assets/4239d13c-3d80-4790-9696-0906c75e14e6)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157647
Approved by: https://github.com/sraikund16
2025-07-07 22:03:51 +00:00
ed6df0e324 correctly import torch.version (#157584)
The structure is

```
torch/
  __init__.py
  version.py
```

When we import torch, only `torch/__init__.py` is executed by default.

The submodules like `version.py` are not automatically imported or attached to the torch module.

Without anything in `__init__.py`, `torch.version` may not be found, so in this PR we make the import explicit.
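
A sketch of what the explicit import amounts to (the PR's exact statement may differ):

```python
# In torch/__init__.py: make the submodule an explicit attribute, so that
# `torch.version` is always available after `import torch`.
from torch import version as version  # noqa: F401
```
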
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157584
Approved by: https://github.com/ezyang
2025-07-07 21:43:35 +00:00
5c79a55e7e [oss] Add version to metadata (#155343)
Summary: We want to add versioning to the DCP metadata so that whenever planner logic changes, we can use the version recorded on save to determine how to load the data.
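
As a conceptual illustration of the version-gated load described above (the field name and loader split are assumptions, not DCP's actual schema):

```python
def load_checkpoint(metadata: dict, payload: bytes):
    # Dispatch on the version stamped into the metadata at save time.
    saved_version = metadata.get("dcp_version", 0)
    if saved_version >= 2:
        return deserialize_v2(payload)  # newer planner layout
    return deserialize_v1(payload)      # legacy planner layout

def deserialize_v1(payload: bytes):
    ...  # legacy planner logic

def deserialize_v2(payload: bytes):
    ...  # current planner logic
```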

Test Plan:
added a test

Rollback Plan:

Differential Revision: D76135887

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155343
Approved by: https://github.com/teja-rao
2025-07-07 20:57:30 +00:00
3d06ff82a8 [release] Triton pin update to 3.4 (#156664)
Triton pin update issue: https://github.com/pytorch/pytorch/issues/154206
Please see post: https://dev-discuss.pytorch.org/t/2-8-final-rc-release-postponed-by-a-week/3101

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156664
Approved by: https://github.com/davidberard98
2025-07-07 20:52:25 +00:00
2efa5eaa65 swa avoid stream sync (#157705)
Summary:
When `AveragedModel.update_parameters` runs, it evaluates `self.n_averaged == 0` for each parameter, where `n_averaged` is a buffer on GPU. Moving the check before the loop means the sync happens only once.

This improves `update_parameters` from 74ms to 57ms, a ~22% improvement.
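
A minimal sketch of the pattern (illustrative; not the actual diff in `AveragedModel`):

```python
import torch

def update_parameters(averaged_params, model_params, n_averaged: torch.Tensor):
    # Read the GPU-resident counter once (a single device->host sync) instead of
    # evaluating `n_averaged == 0` inside the per-parameter loop.
    count = int(n_averaged)
    for p_avg, p in zip(averaged_params, model_params):
        if count == 0:
            p_avg.detach().copy_(p)
        else:
            p_avg.detach().lerp_(p.detach(), 1.0 / (count + 1))
    n_averaged += 1
```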

Test Plan:
CI

Rollback Plan:

Differential Revision: D77723025

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157705
Approved by: https://github.com/albanD, https://github.com/Skylion007, https://github.com/janeyx99
2025-07-07 20:47:35 +00:00
c2510fcd86 Fix index_put propagate strategy arg unpack error (#157671)
Fix: the `index_put` propagation strategy didn't consider the optional arg `accumulate`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157671
Approved by: https://github.com/fmassa, https://github.com/wconstab
2025-07-07 20:18:18 +00:00
510c398a4f Add max_pool3d backward pass for MPS (#157498)
Note on backward precision over fp16:

A float16 number has 10 bits of mantissa, 5 bits of exponent, and 1 bit for the sign. If the sign bit is positive, then with a mantissa $m$ and exponent $e$ represented in base 10, the number that the float16 format represents is $(1 + m / 1024)  \exp2(e)$. ([source](https://en.wikipedia.org/wiki/Half-precision_floating-point_format))

Consider adding two numbers $a$ and $b$ which have arbitrary mantissas, and say their exponents are $e_a = 1$ (so $2 \le a \lt 4$) and $e_b=-3$ (so $0.125 \le b \lt 0.25$). Assume that the result has the same exponent as $a$. Since the exponents differ by 4, we'll effectively need to truncate the 4 rightmost bits of $b$'s mantissa, which would introduce a maximum error on the order of $(2^4 / 1024)  \exp2(-3) \approx 0.002$.

The error is nearly the same if $e_b = -2$ (so $0.25 \le b \lt 0.5$), where the 3 rightmost bits are truncated, giving a maximum error on the order of $(2^3 / 1024)  \exp2(-2) \approx 0.002$. Same for $e_b=-1$.

So if we're adding up nine different numbers that all have exponents -3, -2, or -1, and they sum to a number with exponent 1, then we would expect a maximum error of several times greater than 0.002. In my comments above, summing those particular nine numbers in different ways gave results that ranged between 3.1816 and 3.1758, a difference of $0.0058 \approx 2.9  * 0.002$.
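
A quick numeric check of those bounds (just the arithmetic from the paragraphs above):

```python
per_add_err = (2**4 / 1024) * 2**-3   # truncating 4 mantissa bits at exponent -3
print(per_add_err)                    # ~0.00195, i.e. on the order of 0.002
print(3.1816 - 3.1758)                # ~0.0058, roughly 2.9 * 0.002, a few times the per-add bound
```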

That's within the acceptable bounds, and we can safely just increase the error tolerance used in test_output_grad_match for the case of max_pool3d_backward with float16.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157498
Approved by: https://github.com/malfet
2025-07-07 19:46:44 +00:00
63a96eaeb8 [DeviceMesh] Add error when users try to slice non contiguous flattened dim submesh (#157523)
With https://github.com/pytorch/pytorch/issues/157393, we want to first throw a clearer error for users and then fix it in the long term.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157523
Approved by: https://github.com/fegin
ghstack dependencies: #157501
2025-07-07 19:43:51 +00:00
2b8d3b1b2b [DeviceMesh] Use user set backend and pg option even for the global mesh (#157501)
Short term solution to https://github.com/pytorch/pytorch/issues/156593.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157501
Approved by: https://github.com/fegin, https://github.com/lw
2025-07-07 19:43:51 +00:00
bf1ebe0531 Fix typo: 'paramter' → 'parameter' in dynamo variable comment (#157651)
This PR fixes a minor typo in a comment in `torch/_dynamo/variables/torch.py`, changing 'paramter' to the correct spelling 'parameter'.

These small but meaningful changes help improve code readability and maintain the overall quality of the codebase.

Thanks for your time and review!

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157651
Approved by: https://github.com/Skylion007
2025-07-07 19:42:44 +00:00
433a247102 [logging] [redo] dynamo_timed for CachingAutotuner.coordinate_descent_tuning (#156840)
Summary: This is a redo of https://github.com/pytorch/pytorch/pull/156517, but with pt2_compile_events logging disabled.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156840
Approved by: https://github.com/jamesjwu
2025-07-07 19:09:48 +00:00
8a47f9d03b [CI] Fix xpu ci test sccache issue (#157693)
With PR #157341 landed, the XPU CI test broke on sccache, which had been disabled by #143851. Re-disable it.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157693
Approved by: https://github.com/atalman, https://github.com/huydhn
2025-07-07 18:29:38 +00:00
9e5f4a844c [FSDP2] Fix issue with set_reduce_scatter_divide_factor errors and MixedPrecisionPolicy (#155964)
fix https://github.com/pytorch/pytorch/issues/155223

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155964
Approved by: https://github.com/weifengpy
2025-07-07 17:09:29 +00:00
cyy
7c1f627828 Fix 'dllimport attribute ignored on inline function' (#157670)
There are lots of warnings in builds:
```
 2025-07-05T16:59:46.9208806Z C:\actions-runner\_work\pytorch\pytorch\build\aten\src\ATen\core\TensorBody.h(5043,29): warning: 'at::Tensor::less_' redeclared inline; 'dllimport' attribute ignored [-Wignored-attributes]
2025-07-05T16:59:46.9209030Z  5043 | inline at::Tensor & Tensor::less_(const at::Scalar & other) const {
2025-07-05T16:59:46.9209104Z       |                             ^
2025-07-05T16:59:46.9209671Z C:\actions-runner\_work\pytorch\pytorch\build\aten\src\ATen\core\TensorBody.h(5048,29): warning: 'at::Tensor::less_' redeclared inline; 'dllimport' attribute ignored [-Wignored-attributes]
2025-07-05T16:59:46.9209860Z  5048 | inline at::Tensor & Tensor::less_(const at::Tensor & other) const
```
This PR has fixed them and turned the warning into an error.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157670
Approved by: https://github.com/albanD
2025-07-07 16:57:48 +00:00
b3b4d28f4c [submodule][cutlass] Update pin to b995f93 v4.0.0 (#157376)
@Skylion007 seems afk. https://github.com/pytorch/pytorch/pull/153541

https://github.com/NVIDIA/cutlass/releases/tag/v4.0.0
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157376
Approved by: https://github.com/drisspg, https://github.com/Skylion007
2025-07-07 16:55:47 +00:00
ae1094b72b Revert "[WIP] Automatically load and save dynamo entries via caching_precompile (#155913)"
This reverts commit e466dab164d9236bfe5817ec8e4d24c7b9d3e392.

Reverted https://github.com/pytorch/pytorch/pull/155913 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it seems to fail a test in trunk ([comment](https://github.com/pytorch/pytorch/pull/155913#issuecomment-3045914878))
2025-07-07 16:53:35 +00:00
eda0a9cc90 [list] Add list.__delitem__ (#156339)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156339
Approved by: https://github.com/zou3519
ghstack dependencies: #153969, #156148, #156242, #156270, #156271
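
This commit tops a stack of list-method support PRs (see the ghstack dependencies above). A rough, purely illustrative sketch of the kind of code the stack lets Dynamo trace; whether each method is handled depends on the individual PRs:

```python
import torch

@torch.compile(fullgraph=True)
def f(x):
    xs = [x, x + 1]
    xs = xs + [x + 2]   # list.__add__
    xs *= 2             # list.__imul__
    del xs[0]           # list.__delitem__
    return sum(xs)

print(f(torch.randn(4)))
```
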
2025-07-07 14:51:32 +00:00
d74ccf4ffe [list] Add list.__mul__ and list.__imul__ (#156271)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156271
Approved by: https://github.com/zou3519
ghstack dependencies: #153969, #156148, #156242, #156270
2025-07-07 14:51:32 +00:00
689fba032d Implement list.__add__ and list.__iadd__ (#156270)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156270
Approved by: https://github.com/Skylion007, https://github.com/zou3519
ghstack dependencies: #153969, #156148, #156242
2025-07-07 14:51:25 +00:00
c1d69d5dd5 [list] Implement list.remove (#156242)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156242
Approved by: https://github.com/Skylion007, https://github.com/zou3519
ghstack dependencies: #153969, #156148
2025-07-07 14:51:17 +00:00
e49acfc5c5 [list] Raise exception in invalid list method call (#156148)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156148
Approved by: https://github.com/zou3519
ghstack dependencies: #153969
2025-07-07 14:51:10 +00:00
034e996d37 [list] Implement list.count (#153969)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153969
Approved by: https://github.com/zou3519, https://github.com/XuehaiPan
2025-07-07 14:51:03 +00:00
16c3b4143b [gtest][listing] Enable gtest json listing for the fbcode/caffe2 project (#156816)
***SUMMARY***

The main function in this test overrides that of the GTest framework, which contains its `RUN_ALL_TESTS()` function. The main function in this test is called conditionally, in this case when the C10_MOBILE directive is provided. This is wrong, as we always want to call the `RUN_ALL_TESTS()` function.

In this PR, we only make the test suite available for cases that apply, i.e. if the C10_MOBILE directive exists, which represents the caching allocator and is only exposed on mobile.

***TEST PLAN***

This test should run in modes where it applies, which should be covered in the CI run.

Below is a sample run in dev-nosan mode, which does not have the caching allocator.

BEFORE
```
buck test fbcode//caffe2:cpu_caching_allocator_test
Discovered 0. Pass 0. Fail 0. Fatal 0. Skip 0. Timeout 0
⚠ Listing failed: caffe2:cpu_caching_allocator_test
Listing tests failed with error:
Failed to read from /data/users/ysuleiman/fbsource/buck-out/v2/test/buck-out/v2/test_discovery/fbcode/6dcc55a61c1b90b3/default/tpx_execution_dir/gtest_output_file.json. Listing process stdout: , stderr:
```

AFTER
```
buck test '@fbcode//mode/dev-nosan' fbcode//caffe2:cpu_caching_allocator_test
Analyzing targets. Remaining      0/46242                                                                                1871690 actions, 2251668 artifacts declared
Executing actions. Remaining      0/257870                                                                               83:28:24.4s exec time total
Command: test.     Finished 10 remote, 112314 cache (99% hit)                                                            83:22:43.5s exec time cached (99%)
Time elapsed: 2:57.7s
Tests finished: Pass 0. Fail 0. Fatal 0. Skip 0. Build failure 0
NO TESTS RAN
```

Rollback Plan:
steps:
  - manual.note:
      content: Revert this diff

Reviewed By: patskovn

Differential Revision: D77229077
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156816
Approved by: https://github.com/kimishpatel
2025-07-07 14:16:43 +00:00
54a4d34d10 [fbcode] switch to cutlass-4 (#157579)
Summary: Update the cutlass version to 4 for most use cases.

Test Plan:
testing in progress

Rollback Plan:

Differential Revision: D77605011

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157579
Approved by: https://github.com/drisspg, https://github.com/Skylion007
2025-07-07 14:12:33 +00:00
78684e27ac [xla hash update] update the pinned xla hash (#156584)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned xla hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156584
Approved by: https://github.com/pytorchbot
2025-07-07 12:09:20 +00:00
40e39ae21f Update slow tests (#157696)
This PR is auto-generated weekly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/weekly.yml).
Update the list of slow tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157696
Approved by: https://github.com/pytorchbot
2025-07-07 12:09:06 +00:00
e466dab164 [WIP] Automatically load and save dynamo entries via caching_precompile (#155913)
This PR adds a new config option, `caching_precompile`, and a `DynamoCache`, which loads and saves Dynamo Cache entries automatically. It also hooks up DynamoCache to PrecompileContext, so that we can save multiple cache entries.

When this configuration is turned on, we:
- Automatically create and initialize a CompilePackage on every torch.compile
- Automatically use BundledAutogradcache
- Automatically save the CompilePackage entry to DynamoCache after every compile

You can also use PrecompileContext.serialize() to manually serialize a full object.

I've added unit tests to exhibit this behavior.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155913
Approved by: https://github.com/zhxchen17
2025-07-07 11:56:30 +00:00
d27d36136c Don't try installing missing cuda dependencies on s390x (#157540)

Fixes #157409

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157540
Approved by: https://github.com/seemethere, https://github.com/huydhn
2025-07-07 09:16:38 +00:00
815545f2dd [inductor] enable bf32 for mkldnn linear pointwise/binary in inductor (#127294)
When `torch.backends.mkldnn.matmul.fp32_precision == 'bf16'`, we also enable mkldnn linear in the inductor path and allow it to run with the bf16 computation data type.
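
A hedged sketch of the configuration described above (a minimal example assuming an mkldnn-enabled CPU build with this change; not the PR's test code):

```python
import torch

# Allow bf16 computation for fp32 matmul/linear on the mkldnn backend.
torch.backends.mkldnn.matmul.fp32_precision = "bf16"

m = torch.nn.Linear(64, 64)
x = torch.randn(8, 64)
with torch.no_grad():
    y = torch.compile(m)(x)  # inductor may now lower this to an mkldnn linear with bf16 compute
```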

Test Plan:
```
python test/inductor/test_mkldnn_pattern_matcher.py -k test_linear_unary
python test/inductor/test_mkldnn_pattern_matcher.py -k test_linear_fp32
python test/inductor/test_mkldnn_pattern_matcher.py -k test_multi_linear_share_same_input
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127294
Approved by: https://github.com/jgong5, https://github.com/jansel

Co-authored-by: Jiang, Yanbing <yanbing.jiang@intel.com>
2025-07-07 06:03:41 +00:00
d26ca5de05 Support transpose and pack for bit8 (#156065)
To be used by CPU INT8 SDPA in torchao. https://github.com/pytorch/ao/pull/2380

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156065
Approved by: https://github.com/mingfeima, https://github.com/ezyang
2025-07-07 01:40:47 +00:00
Lei
2022588295 Fix: Ensure writeback handles NO_SHARD correctly by flattening tensors before copying (#154369)
Fixes #151223

Because FSDP stores original parameters as views into a flattened tensor, changing the flattened parameter’s tensor directly can desynchronize the views. With the NO_SHARD strategy this caused a shape mismatch error when writing back modified parameters.

Ensured writeback handles NO_SHARD correctly by flattening tensors before copying: the logic now flattens the source parameter or gradient when the strategy is unsharded, to maintain the expected 1-D shape for writeback operations.
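
A minimal sketch of the idea (illustrative; not the PR's code, and the helper name is made up):

```python
import torch

def writeback(flat_param_view: torch.Tensor, src: torch.Tensor, sharded: bool) -> None:
    # With NO_SHARD, the destination is the 1-D flattened parameter, so flatten
    # the source before copying to keep the shapes consistent.
    if not sharded:
        src = src.flatten()
    flat_param_view.copy_(src)
```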

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154369
Approved by: https://github.com/weifengpy
2025-07-06 09:20:31 +00:00
02715d0876 [BE][5/6] fix typos in test/ (test/dynamo/) (#157639)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157639
Approved by: https://github.com/yewentao256, https://github.com/jansel
ghstack dependencies: #157638
2025-07-06 06:34:25 +00:00
17687eb792 [BE][4/6] fix typos in test/ (test/inductor/) (#157638)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157638
Approved by: https://github.com/yewentao256, https://github.com/jansel
2025-07-06 06:34:25 +00:00
7cda4017dd Fix torch.utils.cpp_extension parser for clang version 20.1.7+libcxx (#157666)
When the CC and CXX compilers are set to clang, and clang was compiled with libc++, compilation of torchvision fails with:

```
  File "/usr/lib/python3.12/site-packages/torch/utils/cpp_extension.py", line 585, in build_extensions
    compiler_name, compiler_version = self._check_abi()
                                      ^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/site-packages/torch/utils/cpp_extension.py", line 1034, in _check_abi
    _, version = get_compiler_abi_compatibility_and_version(compiler)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/site-packages/torch/utils/cpp_extension.py", line 449, in get_compiler_abi_compatibility_and_version
    if tuple(map(int, version)) >= minimum_required_version:
       ^^^^^^^^^^^^^^^^^^^^^^^^
ValueError: invalid literal for int() with base 10: '7+libcxx'
```

Compiler identification is a valid semantic version:
```
$ clang -dumpfullversion -dumpversion
20.1.7+libcxx
```

After adjusting the version parser, clang is able to compile extensions successfully.
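
A sketch of a tolerant parse for version strings like `20.1.7+libcxx` (illustrative; the PR's actual change in `cpp_extension.py` may differ):

```python
import re

def parse_clang_version(version_string: str) -> tuple[int, ...]:
    # Keep only the leading numeric dotted part, dropping suffixes like "+libcxx".
    match = re.match(r"(\d+)\.(\d+)\.(\d+)", version_string)
    if match is None:
        raise ValueError(f"cannot parse compiler version: {version_string!r}")
    return tuple(int(part) for part in match.groups())

print(parse_clang_version("20.1.7+libcxx"))  # (20, 1, 7)
```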

Fixes #157665

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157666
Approved by: https://github.com/msaroufim
2025-07-06 01:35:00 +00:00
3e56a9cdfb More testing of Python arithmetic operators between tensors and scalars (see 157266) (#157632)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157632
Approved by: https://github.com/ezyang, https://github.com/Skylion007
2025-07-05 17:48:27 +00:00
ee9ac36c23 Fixing misspelling in documentation (#157565)
Fixes #157564

Fixes misspelling of the word parameter in documentation

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157565
Approved by: https://github.com/awgu, https://github.com/cyyever
2025-07-05 17:04:13 +00:00
9be5860bc3 [dynamo] Fix dynamic shapes handling in after_aot repro generation (#157136)
Summary:
- Extract symbolic variables directly from graph placeholders and arguments
- Add symbolic variable definitions to generated repro code
- Add unit tests with ToyModel for testing

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157136
Approved by: https://github.com/xmfan
ghstack dependencies: #157021
2025-07-05 15:38:41 +00:00
548c9d8281 Fix typo: 'paramter' → 'parameter' in quantization model report test (#157646)
This PR addresses a minor typo in the file `test/quantization/fx/test_model_report_fx.py`:

- Corrected the word "paramter" to "parameter" for better readability and accuracy.

While it's a small change, correcting such typographical errors contributes to maintaining the overall quality and professionalism of the codebase.

Thank you for your time and consideration in reviewing this PR. I'm happy to make any further adjustments if needed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157646
Approved by: https://github.com/yewentao256, https://github.com/ezyang
2025-07-05 12:28:36 +00:00
71a650ad56 Fix typo: 'Intializing' → 'Initializing' in test_parametrization.py (#157362)
This pull request fixes a minor typo in the doc comments of `test/nn/test_parametrization.py`.

- Replaced `'Intializing'` with `'Initializing'` in two docstring comments to improve clarity and maintain consistency across the codebase.

This is a non-functional change and does not impact behavior or test outcomes.

Thank you for maintaining such a high-quality codebase. Please let me know if any adjustments are needed. I'd be happy to help!

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157362
Approved by: https://github.com/ezyang
2025-07-05 12:21:15 +00:00
2471cc3355 [pc] verify max autotune is in generated source code (#157650)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157650
Approved by: https://github.com/aorenste
ghstack dependencies: #157305, #157614, #157619
2025-07-05 07:55:11 +00:00
db00e1699a [pc] introduce ProgressiveCompilationState and clear callback (#157619)
Follow-up from https://github.com/pytorch/pytorch/pull/157305, where
@aorenste correctly suggested clearing the callback. This refactor
introduces a new dataclass so we don't need to check nullability for
each field.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157619
Approved by: https://github.com/aorenste
ghstack dependencies: #157305, #157614
2025-07-05 07:55:11 +00:00
5ea832e5f6 [pc] migrate progression futures from list to deque (#157614)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157614
Approved by: https://github.com/aorenste
ghstack dependencies: #157305
2025-07-05 07:55:03 +00:00
a952956d05 Add isnan exit condition to special ops (#157464)
They might have been slow on CUDA-11.3, but this version of CUDA is long gone. The more fundamental underlying issue was the linear complexity of the recursive polynomial definitions for higher-order polynomials; for example, see this loop from the implementation of the Chebyshev polynomial of the first kind
7081b8233a/aten/src/ATen/native/Math.h (L2969-L2973)
which was tested by `test_compare_cpu` using the following values (as sample index 16)
7081b8233a/torch/testing/_internal/opinfo/core.py (L2079)

Luckily, Chebyshev polynomials for argument absolute values higher than 1 pretty quickly reach infinity; see below:
```
python3 -c "import torch;print(torch.special.chebyshev_polynomial_v(torch.nextafter(torch.tensor(1.0), torch.tensor(2.0)), torch.tensor(1e6)))"
tensor(nan)
```
This is not the case for Laguerre polynomials, but it's probably fine to just limit it to 1e7.
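
A Python sketch of the early-exit idea (the real change is in the C++ recurrences in `aten/src/ATen/native/Math.h`; this only illustrates the shape of the fix):

```python
import math

def chebyshev_t(x: float, n: int) -> float:
    # Recurrence for the Chebyshev polynomial of the first kind:
    # T_0 = 1, T_1 = x, T_{k+1} = 2x*T_k - T_{k-1}.
    p, q = 1.0, x
    for _ in range(1, n):
        if not math.isfinite(q):  # early exit once the value saturates
            return q              # (the PR's actual check uses isnan)
        p, q = q, 2.0 * x * q - p
    return q if n > 0 else p
```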

Before
```
$ PYTORCH_TEST_WITH_SLOW=1 python test_ops.py -k chebyshev_polynomial_
ssssssss..ssssss..ssssss..ssssssssssssssssssssss..ssssss/home/ubuntu/py3.10-nightly/lib/python3.10/site-packages/torch/backends/cuda/__init__.py:131: UserWarning: This API is going to be deprecated, please see https://pytorch.org/docs/main/notes/cuda.html#tensorfloat-32-tf32-on-ampere-and-later-devices (Triggered internally at /pytorch/aten/src/ATen/Context.cpp:78.)
  return torch._C._get_cublas_allow_tf32()
....ssssssssssss..ssssss..ssssss............ssssssssssssssssssssssssssssssssssss..ssssssssssssss..ssssss..ssssssssssssssssssssssssssssss..ssssss....ssssssssssss..ssssss..ssssss............ssssssssssssssssssssssssssssssssssss..ssssss..ssssssssssssss..ssssss..ssssss..ssssssssssssss..ssssss..ssssss..ssssss..ssssss..ssssss..ssssss..ssssss..ssssss..ssssss..ssssss..ssssssssssssss
----------------------------------------------------------------------
Ran 432 tests in 8.575s

OK (skipped=344)
```
After
```
$ PYTORCH_TEST_WITH_SLOW=1 python test_ops.py -k chebyshev_polynomial_
ssssssss........................ssssssssssssssss......../home/ubuntu/pytorch/torch/backends/cuda/__init__.py:131: UserWarning: This API is going to be deprecated, please see https://pytorch.org/docs/main/notes/cuda.html#tensorfloat-32-tf32-on-ampere-and-later-devices (Triggered internally at /home/ubuntu/pytorch/aten/src/ATen/Context.cpp:78.)
  return torch._C._get_cublas_allow_tf32()
........................................................................................xxxxxxxx................ssssssssssssssssssssssss........................................................................................................ssssssss........................ssssssss........................................................................................ssssssss
----------------------------------------------------------------------
Ran 432 tests in 45.580s

OK (skipped=72, expected failures=8)
```

Fixes https://github.com/pytorch/pytorch/issues/79528

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157464
Approved by: https://github.com/Skylion007, https://github.com/dcci
ghstack dependencies: #157488
2025-07-05 04:19:50 +00:00
63e87d6d05 [Refactor] Add maybe unused flag to remove warning (#157655)
Fixes #157653

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157655
Approved by: https://github.com/Skylion007, https://github.com/cyyever
2025-07-05 03:23:39 +00:00
f7127b9b94 [Refactor] Remove unused variables (#157654)
Fixes #157653

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157654
Approved by: https://github.com/Skylion007, https://github.com/malfet
2025-07-05 02:12:15 +00:00
44f5b93122 fix: correct sentence punctuation in cuDNN note (#157623)
This PR fixes a small punctuation issue in the PyTorch README.

Specifically:

Added a missing full stop at the end of the sentence:
"Note: You could refer to the cuDNN Support Matrix for cuDNN versions with the various supported CUDA, CUDA driver and NVIDIA hardware."

Added comma for clarity between "CUDA driver" and "NVIDIA hardware".

These edits improve the readability and grammatical correctness of the documentation.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157623
Approved by: https://github.com/Skylion007
2025-07-05 01:37:33 +00:00
e0fd48be7d Fix typo: 'occurances' → 'occurrences' in mobile model test (#157629)
This PR addresses a typo in the file `test/mobile/model_test/gen_test_model.py`.

### Changes:
- Corrected "occurances" to the correct spelling "occurrences"
- Renamed associated variables to reflect this change for consistency and clarity

This is a non-functional, cleanup-only PR to improve code readability.

Thanks to the PyTorch team for maintaining such a high-quality codebase

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157629
Approved by: https://github.com/Skylion007
2025-07-05 01:36:42 +00:00
43f7216327 Fix typo: 'paramters' → 'parameters' in ATen tunable README (#157575)
This PR addresses a minor typo in the documentation file aten/src/ATen/cuda/tunable/README.md, where paramters has been corrected to parameters for improved clarity and consistency.

Context
Accurate and clear documentation is crucial for helping developers and contributors understand PyTorch internals. This small fix contributes to the overall quality and readability of the project.

Thank you to the PyTorch team and maintainers for your continued efforts in building such an incredible framework. I'm happy to contribute in any way I can — even if just with a small doc improvement like this one.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157575
Approved by: https://github.com/eqy
2025-07-05 01:14:45 +00:00
8a8fac1131 [SymmMem] Move code to where it is used (#157611)
`maybe_initialize_env_vars` and `initialize_nvshmem_with_store` are only used in `NVSHMEMSymmetricMemory.cu`. Moving them there.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157611
Approved by: https://github.com/Skylion007
ghstack dependencies: #157513
2025-07-04 23:37:49 +00:00
bcc98bb2a4 Update _linux-test to support B200 runner (#157341)
This unblocks https://github.com/pytorch/test-infra/issues/6869.  The key changes to call out:

* B200 needs OIDC to access ECR and upload stats to S3, so we need to set `id-token: write` in `_linux-test`.  All workflows calling `_linux-test` also need to be updated accordingly
* Connecting sccache to S3 on B200 doesn't seem to work, so I disable it.  It still works locally though.

### Testing

https://github.com/pytorch/pytorch/actions/runs/16055549292/job/45312298376
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157341
Approved by: https://github.com/nWEIdia, https://github.com/atalman, https://github.com/malfet
2025-07-04 23:19:24 +00:00
524e827095 [build] modernize build-backend: setuptools.build_meta:__legacy__ -> setuptools.build_meta (#155998)
Change `build-system.build-backend`: `setuptools.build_meta:__legacy__` -> `setuptools.build_meta`. Also, move static package info from `setup.py` to `pyproject.toml`.

Now the repo can be installed from source via `pip` command instead of `python setup.py develop`:

```bash
python -m pip install --verbose --editable .

python -m pip install --verbose --no-build-isolation --editable .
```

In addition, the SDist is also buildable:

```bash
python -m build --sdist
python -m pip install dist/torch-*.tar.gz  # build from source using SDist
```

Note that we should build the SDist with a fresh git clone if we will upload the output to PyPI, because all files under `third_party` will be included in the SDist; the SDist file will be huge if the git submodules are initialized.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155998
Approved by: https://github.com/ezyang, https://github.com/cyyever, https://github.com/atalman
ghstack dependencies: #157557
2025-07-04 19:25:14 +00:00
9968edd002 Fix #153942 (#153943)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153943
Approved by: https://github.com/malfet
2025-07-04 18:25:18 +00:00
7275f28045 Fix cuda 12.9 aarch64 GPU builds. Update CUDA_STABLE variable. (#157630)
This contains 2 fixes that are required in main and will need to be cherry-picked to the Release 2.8 branch:
1. PR https://github.com/pytorch/pytorch/pull/155819 missed including the triton change.
2. The CUDA_STABLE variable needs to be set to 12.8. Updating CUDA stable updates the full static build.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157630
Approved by: https://github.com/Skylion007, https://github.com/jeanschmidt
2025-07-04 18:08:31 +00:00
7be862ab8f [dynamo] Relax DUPLICATED_INPUT to be serializable. (#157492)
Since we don't actually rely on any real data while building DUPLICATE_INPUT guard, we can safely serialize it with sources and it should be able to reconstruct the guard correctly in the new process. Therefore we don't really need to prevent serializing it.

Differential Revision: [D77683302](https://our.internmc.facebook.com/intern/diff/D77683302/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157492
Approved by: https://github.com/jamesjwu, https://github.com/jansel
2025-07-04 15:19:34 +00:00
336f1e2d35 [AOTI] Fix AOT inductor CMake build dependency order (#157557)
compile_model.py -> aoti_custom_class -> torch

The custom command requires `torch` to be installed.

8408522976/test/cpp/aoti_inference/compile_model.py (L1-L7)

Fixes CI failure on trunk:

- https://github.com/pytorch/pytorch/actions/runs/16041370426/job/45275085572#step:22:18348

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157557
Approved by: https://github.com/Skylion007, https://github.com/cyyever
2025-07-04 14:33:36 +00:00
a46ea8a364 Fix typo: 'initalized' → 'initialized' in alias analysis test (#157628)
This PR corrects a small spelling error in `test/jit/test_alias_analysis.py`.

- "initalized" → "initialized"

This is a minor comment correction and does not affect functionality or logic.

Thank you for maintaining this amazing codebase.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157628
Approved by: https://github.com/Skylion007
2025-07-04 13:41:53 +00:00
f41d017aa6 Add device check in mse_loss (#155089)
Fixes #154978

## Test Result

```python
>>> import torch
>>> import numpy as np
>>> import torch.nn as nn
>>> import torch.distributions.normal as norm
>>> device = torch.device(('cuda' if torch.cuda.is_available() else 'cpu'))
>>> print('Using {}'.format(device))
Using cuda
>>> m = nn.Sequential(nn.Linear(1, 128).cuda(), nn.Tanh(), nn.Linear(128, 128).cuda(), nn.Tanh(), nn.Linear(128, 128).cuda(), nn.Tanh())
>>> m.to(device, dtype=None, non_blocking=False)
Sequential(
  (0): Linear(in_features=1, out_features=128, bias=True)
  (1): Tanh()
  (2): Linear(in_features=128, out_features=128, bias=True)
  (3): Tanh()
  (4): Linear(in_features=128, out_features=128, bias=True)
  (5): Tanh()
)
>>> opt = torch.optim.Adam(m.parameters(), lr=0.001)
>>> print('Number of trainable parameters: ', sum((p.numel() for p in m.parameters() if p.requires_grad)))
Number of trainable parameters:  33280
>>> input_tensor = torch.tensor(77.0, device=device)
>>> target = torch.tensor(66.0)
>>> loss_function = nn.MSELoss()
>>> print('Loss Function: ', loss_function)
Loss Function:  MSELoss()
>>> loss = loss_function(input_tensor, target)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/zong/code/pytorch/torch/nn/modules/module.py", line 1767, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/zong/code/pytorch/torch/nn/modules/module.py", line 1778, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/zong/code/pytorch/torch/nn/modules/loss.py", line 610, in forward
    return F.mse_loss(input, target, reduction=self.reduction)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/zong/code/pytorch/torch/nn/functional.py", line 3903, in mse_loss
    return torch._C._nn.mse_loss(
           ^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!

```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155089
Approved by: https://github.com/cyyever, https://github.com/albanD
2025-07-04 12:37:48 +00:00
52e4e41cbc [dynamo] do not issue lru_cache warning for functions in the top-level torch namespace (#157598)
`lru_cache` usage warning was being raised for `torch.get_device_module()`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157598
Approved by: https://github.com/Sidharth123-cpu
2025-07-04 08:17:50 +00:00
64f2ec77f8 [inductor] Fix fractional_max_pool2d 3D input causing assertion error (#156912)
Fixes #156682

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156912
Approved by: https://github.com/angelayi
2025-07-04 06:09:28 +00:00
fdc5b42a8f _broadcast_shapes gso generalizations (#157008)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157008
Approved by: https://github.com/ColinPeppler
ghstack dependencies: #155590
2025-07-04 05:56:42 +00:00
d58ed04d89 [async-compile] add progressive compile mode (#157305)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157305
Approved by: https://github.com/aorenste
2025-07-04 04:18:50 +00:00
386bc9e2e9 [audio hash update] update the pinned audio hash (#156905)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned audio hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156905
Approved by: https://github.com/pytorchbot
2025-07-04 04:06:59 +00:00
f2e712ca14 Revert "Fix is_unaligned usage of statically_known_true (#157400)"
This reverts commit b359571c6043b40c4ae4fbb07135fd0f04902e21.

Reverted https://github.com/pytorch/pytorch/pull/157400 on behalf of https://github.com/malfet due to It break tests, see 99c1a6bdd9/1 ([comment](https://github.com/pytorch/pytorch/pull/157400#issuecomment-3034353539))
2025-07-04 03:57:08 +00:00
99c1a6bdd9 [SymmMem] Find NVSHMEM from system installation (#157513)
Previously we only search for NVSHMEM from pip install location.
This PR adds search in system locations deemed default by CMake.
Related: #157453 untars NVSHMEM into `/usr/local` on our CI machines.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157513
Approved by: https://github.com/atalman, https://github.com/Skylion007
2025-07-04 03:34:44 +00:00
4ed1b03f72 Add missing graph and memory related symbols to cuda_to_hip_mappings (#157435) (#157573)
Summary: This PR adds missing CUDA symbols in `cuda_to_hip_mappings`.

Test Plan: Tested in D77642700.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157573
Approved by: https://github.com/Skylion007

Co-authored-by: Geon-Woo Kim <gwkim@meta.com>
2025-07-04 03:03:04 +00:00
8f9a191db6 [SymmMem] Fix CI name mismatch; remove TORCH_SYMMMEM requirement (#157597)
Thanks @huydhn for spotting two name mismatches in the CI configs.
We were matching against "test_h100_symm_mem" instead of "h100-symm-mem".

Also, replaced the `TORCH_SYMMMEM` env setting with a programmatic method:
`symm_mem.set_backend(...)`
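A minimal usage sketch of the programmatic selection (assuming `symm_mem` refers to `torch.distributed._symmetric_memory` and that "NVSHMEM" is the backend name; both are assumptions based on this commit, not a definitive reference):

```python
# Sketch only: module path and backend string are assumptions.
import torch.distributed._symmetric_memory as symm_mem

# Previously selected via the TORCH_SYMMMEM environment variable;
# now chosen programmatically before the symmetric-memory tests run.
symm_mem.set_backend("NVSHMEM")
```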

Further, skips a hung test in `test_nvshmem_trion.py`. (TODO: @codingwithsurya)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157597
Approved by: https://github.com/fduwjj, https://github.com/huydhn
2025-07-04 01:43:08 +00:00
ef97bd4713 [torch] Add MTIA to the list of devices supporting foreach/fused kernels (#157583)
Summary: We currently have foreach kernel implementations for MTIA, and when we don't, we internally decompose the ops. Anyone using this list for compatibility checks should be sending through the foreach kernels.

Reviewed By: egienvalue, scottxu0730

Differential Revision: D77751248

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157583
Approved by: https://github.com/egienvalue
2025-07-04 01:15:24 +00:00
f0b388665e Add dynamo_timed to bytecode hook (#157587)
Test Plan:
- ran tlparse on vLLM and saw this

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157587
Approved by: https://github.com/jingsh, https://github.com/BoyuanFeng
2025-07-04 01:11:03 +00:00
c9a5bf09ba [FP8] FP8 for SwishLayerNorm (#157574)
Summary: Add a pass `use_triton_fp8_swish_replace_normal_swish` to replace `_triton_swish_rms_norm` with its counterpart that supports fp8, `triton_swish_rms_norm`, and turn on fp8 during inference.

Test Plan:
```
buck2 run mode/opt  mode/inplace -c fbcode.platform010_cuda_version=12.4 -c fbcode.nvcc_arch=h100 caffe2/torch/fb/model_transform/experimental/benchmark:mts_gpu_benchmark -- --lower-backend=AOT_INDUCTOR   --model-snapshot-id=899072727_0 --node-replacement-dict="{}" --gpu-trace --add-passes=use_triton_fp8_swish_replace_normal_swish
```
The perf improvement on the 100x model with this pass is roughly ~7%, details are recorded [here](https://docs.google.com/document/d/1eIV_OTQyQcf_DlEDxwycTwhyGxT5OJkLzs8cPL6EMYc/edit?tab=t.0)

Rollback Plan:

Reviewed By: frank-wei

Differential Revision: D76531303

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157574
Approved by: https://github.com/frank-wei
2025-07-04 01:06:21 +00:00
dfcda613b6 Ensure Dynamo can trace through explicit dunder method call (#154366)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154366
Approved by: https://github.com/zou3519
ghstack dependencies: #153150, #152991, #154539, #153553, #154063, #154064, #154065, #154066, #154263
2025-07-04 00:46:05 +00:00
0e7f02fe2e [Dynamo] [FrozensetSubclass] Add support for user defined frozensets (#154263)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154263
Approved by: https://github.com/williamwen42
ghstack dependencies: #153150, #152991, #154539, #153553, #154063, #154064, #154065, #154066
2025-07-04 00:46:05 +00:00
308b88bde9 [Dynamo] [Set] Add comparison for set subclass (#154066)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154066
Approved by: https://github.com/Skylion007
ghstack dependencies: #153150, #152991, #154539, #153553, #154063, #154064, #154065
2025-07-04 00:45:58 +00:00
c51da57b55 [Dynamo] [Set] Raise TypeError in set.union(...) and "__or__" (#154065)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154065
Approved by: https://github.com/williamwen42
ghstack dependencies: #153150, #152991, #154539, #153553, #154063, #154064
2025-07-04 00:45:50 +00:00
f9544f1f0c [Dynamo] [Set] Raise TypeError if object is unhashable (#154064)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154064
Approved by: https://github.com/Skylion007
ghstack dependencies: #153150, #152991, #154539, #153553, #154063
2025-07-04 00:45:42 +00:00
11c71053e0 [Dynamo] [Set] Implement some binop operators for dict/set/frozenset/dict_keys (#154063)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154063
Approved by: https://github.com/williamwen42, https://github.com/zou3519
ghstack dependencies: #153150, #152991, #154539, #153553
2025-07-04 00:45:34 +00:00
22abe6ded4 [Dynamo] [SetSubclass] Add support for user defined sets (#153553)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153553
Approved by: https://github.com/williamwen42, https://github.com/zou3519
ghstack dependencies: #153150, #152991, #154539
2025-07-04 00:45:25 +00:00
2b82c61f04 [Generator] Implement generator.__contains__ (#154539)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154539
Approved by: https://github.com/williamwen42, https://github.com/zou3519
ghstack dependencies: #153150, #152991
2025-07-04 00:45:18 +00:00
f651e28f80 [FrozenSet] Fixes for FrozenSet (#152991)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152991
Approved by: https://github.com/zou3519
ghstack dependencies: #153150
2025-07-04 00:45:11 +00:00
e7167dbacf [Set] Support sets in VariableBuilder (#153150)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153150
Approved by: https://github.com/zou3519
2025-07-04 00:45:03 +00:00
6c42afe196 Introduce sync_cross_rank_decision (#156287)
Summary:
This is an improvement over `_broadcast_rank0_decision`, where we use rank0's decision and broadcast it to every rank. The issue with `_broadcast_rank0_decision` is that we observed a large variance in peak memory usage. One cause is that different ranks receive different dynamically shaped tensors, and the hints for those tensors differ across ranks. If we rely only on rank0's decision and it is unlucky enough to get unrepresentative hints, then the decision it makes may not be suitable for the other ranks.

Here, we introduce `sync_cross_rank_decision`, which comes up with the decision after comparing all ranks' local decisions (a rough sketch in code follows the list). It will:
1. all-gather decisions from all ranks;
2. test each decision on the current rank and get its estimated memory usage;
3. all-reduce the estimated memory usage with ReduceOp.MAX, so that we know the maximum memory usage of each decision across all ranks;
4. pick the decision that gives us the minimum of these maximum memory usages.
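A rough sketch of the four steps above (hypothetical helper names; the real pass is part of Inductor's memory-estimation code and is not reproduced here):

```python
import torch
import torch.distributed as dist

def sync_cross_rank_decision(local_decision, estimate_peak_memory, group=None):
    world_size = dist.get_world_size(group)

    # 1. all-gather every rank's local decision
    decisions = [None] * world_size
    dist.all_gather_object(decisions, local_decision, group=group)

    # 2. estimate the peak memory of each candidate decision on *this* rank
    costs = torch.tensor([estimate_peak_memory(d) for d in decisions],
                         dtype=torch.float64)

    # 3. all-reduce with MAX so every rank sees each decision's worst case
    dist.all_reduce(costs, op=dist.ReduceOp.MAX, group=group)

    # 4. pick the decision whose worst-case memory is smallest
    return decisions[int(torch.argmin(costs))]
```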

A graph to show more details
https://internalfb.com/excalidraw/EX484509

After applying sync_cross_rank_decision, we observed that the variance is much smaller.

Rollback Plan:

Differential Revision: D76714005

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156287
Approved by: https://github.com/fmassa, https://github.com/bdhirsh
2025-07-03 23:43:53 +00:00
f7130c097e [nativert] Move Executor to PyTorch core (#157514)
Test Plan:
CI

Rollback Plan:

Differential Revision: D77693984

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157514
Approved by: https://github.com/zhxchen17
2025-07-03 23:31:54 +00:00
ad86c05b78 efficient zero_mask implementation for vec128_*_neon (#155766)
Differential Revision: [D76481039](https://our.internmc.facebook.com/intern/diff/D76481039/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155766
Approved by: https://github.com/malfet
2025-07-03 23:27:03 +00:00
b359571c60 Fix is_unaligned usage of statically_known_true (#157400)
Summary:
- The symbolic-shapes `statically_known_true` usage is wrong; that API is meant to be used with SymNodes. What is needed is `V.graph.sizevars.statically_known_true`, `V.graph.sizevars.statically_known_equals`, or ideally `V.graph.sizevars.statically_known_multiple_of`.

- The construction using `== 0` is not symbolic; it used to always return false for symbolic inputs.

Differential Revision: D77619293

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157400
Approved by: https://github.com/ColinPeppler
2025-07-03 23:26:36 +00:00
a6fab82b16 [BE]: Fix NVSHMEM builds, add missing 12.9 dependency and update to latest for 2.8RC (#157453)
Fixes our broken NVSHMEM builds (we were not building or testing it before) and also updates to the latest version. The newest version has critical support for features that actually make it useful, like bfloat16 and float16.

This is a proper fix for: https://github.com/pytorch/pytorch/pull/157411
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157453
Approved by: https://github.com/kwen2501, https://github.com/atalman
2025-07-03 22:55:18 +00:00
dd3e7170c2 Add async checkpointing impl to experimental checkpointer and add a builder API (#156927)
1. Adds an AsyncCheckpointer with out-of-process checkpointing, and a state_dict_stager with shared memory, pinned memory and Zero Overhead support.

2. Adds two convenient functions to create sync/async checkpointers.

Differential Revision: [D77336833](https://our.internmc.facebook.com/intern/diff/D77336833/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156927
Approved by: https://github.com/pradeepfn
2025-07-03 22:49:20 +00:00
7081b8233a [BE] Accelerator agnostic timer.py (#157131)
Farewell to a lot of if statements; a benefit is that this now also supports MPS synchronization.

Still need to think of a good test strategy for the privateUse1 removal; granted, I'm not sure what the semantics of something like https://docs.pytorch.org/docs/stable/generated/torch.cpu.synchronize.html actually are, since CPU is probably synchronous.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157131
Approved by: https://github.com/albanD
2025-07-03 22:23:04 +00:00
7b392bac13 all_gather_bucketing fx pass (#157396)
Porting passes to bucket all_gathers

The main logic of the pass is done via
1. Searching for all all_gathers from the buckets

Copying tests from @wconstab PR to test compatibility with reordering.
The test checks only compatibility because, as noted in (3), the joint all_gather will already be scheduled as early as possible, leaving no room for reordering.

Pass changes:
Using mutation ops to match the performance of FSDP; in the future the ideal scenario is to have a purely functional graph, where Inductor does all memory optimizations on its own without mutable ops.

Inductor changes:
Adding foreach_copy_ lowering

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157396
Approved by: https://github.com/wconstab
2025-07-03 22:07:42 +00:00
19ae5afdaa Fix typo: 'recieve' → 'receive' in comments (#157544)
This PR corrects minor typos in developer-facing comments:

- Replaces 'recieve' with 'receive' in:
  - `FunctionalTensorWrapper.cpp`
  - `make_boxed_from_unboxed_functor.h`

These changes improve code readability and maintain comment correctness.

Thank you for reviewing!

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157544
Approved by: https://github.com/soulitzer
2025-07-03 19:11:15 +00:00
3fd84a8592 [BE][PYFMT] migrate PYFMT for torch/[a-c]*/ to ruff format (#144554)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144554
Approved by: https://github.com/soulitzer
2025-07-03 18:56:07 +00:00
d56f11a1f2 [MPS] Implement logcumsumexp metal kernel (#156858)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156858
Approved by: https://github.com/malfet
ghstack dependencies: #157512
2025-07-03 18:16:25 +00:00
794b95d54b Enable Half dtype for logcumsumexp_backward (#157512)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157512
Approved by: https://github.com/malfet
2025-07-03 18:13:38 +00:00
e3fe001d9e Add einops x torch.compile testing in PyTorch CI (#157416)
Fixes #146782. This PR adds testing for multiple einops versions in
PyTorch CI. This occurs in a new "einops" CI job that runs for both
Python 3.9 and 3.13 (aka, what we test Dynamo over).

Test Plan:
- wait for CI

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157416
Approved by: https://github.com/guilhermeleobas, https://github.com/arogozhnikov, https://github.com/anijain2305
2025-07-03 17:36:39 +00:00
660dbea909 [cutlass backend] modify presets ahead of cutlass 4 upgrade (#157522)
Differential Revision: [D77707409](https://our.internmc.facebook.com/intern/diff/D77707409/)

Also asking in https://github.com/NVIDIA/cutlass/issues/2435

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157522
Approved by: https://github.com/coconutruben
2025-07-03 17:13:24 +00:00
5cfe4377d6 [dtensor] Rework partial propagation in pointwise op and support mul (#157340)
I am trying to see if I can easily add the linearity support for aten.mul to allow Partial placement to propagate through. But it turns out that I have to completely rework the current linearity propagation.

In short, before this PR, linearity mainly supported aten.add and some trivial ops. It was done by allowing Partial inputs to propagate while redistributing Replicate inputs to Partial to preserve the single-device semantics. For example, suppose we want to execute `aten.add(lhs, rhs)` on 2 ranks:
* `lhs` is partial; its value on rank 0 is `r0` and its value on rank 1 is `r1`
* `rhs` is replicate, with value `a`

Then, in order to preserve the single-device semantics (which should produce the value `a + r0 + r1`), we do `rhs / world_size` first and then add `rhs` to `lhs`. This means every operand first needs to be partial; then we can add them together.

But this no longer holds for multiplicative operations like `aten.mul`: assuming the same `aten.mul(lhs, rhs)` and values, we don't need to divide the replicated operand by world_size to preserve the single-device semantics, because `a * (r0 + r1) = a * r0 + a * r1`.
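As a concrete check of the two cases (plain Python with made-up values, 2 ranks):

```python
r0, r1 = 3.0, 5.0      # lhs is partial: shard values on rank 0 and rank 1
a = 2.0                # rhs is replicate: same value on both ranks
world_size = 2

# aten.add: divide the replicated operand first, then add per rank.
add_shards = [r0 + a / world_size, r1 + a / world_size]
assert sum(add_shards) == (r0 + r1) + a      # matches the single-device result

# aten.mul: no division needed, multiplication distributes over the partial sum.
mul_shards = [r0 * a, r1 * a]
assert sum(mul_shards) == (r0 + r1) * a      # matches the single-device result
```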

So to accommodate the difference between add and mul, in this PR I:
* change linearity to be an int so it can represent different linearity types; additive and multiplicative linearity are tracked separately
* add checks to ensure only a subset of partial types can support linearity (namely partial-sum/avg)
* handle the linearity-type plumbing through the pointwise ops
* add `mul.Tensor/Scalar` as multiplicative linearity
* add tests to show that Partial placements can be propagated with `aten.mul`

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157340
Approved by: https://github.com/zpcore
2025-07-03 17:04:08 +00:00
898179331e [cutlass backend] fix CutlassTensor post-renaming (#157408)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157408
Approved by: https://github.com/mlazos
ghstack dependencies: #157402
2025-07-03 17:02:21 +00:00
2e64e45b0b Revert "[build] modernize build-backend: setuptools.build_meta:__legacy__ -> setuptools.build_meta (#155998)"
This reverts commit 404008e3efdabeaf5b140a3aff77131461c33a0a.

Reverted https://github.com/pytorch/pytorch/pull/155998 on behalf of https://github.com/malfet due to Broke inductor_cpp, wrapper see e472daa809/1 ([comment](https://github.com/pytorch/pytorch/pull/155998#issuecomment-3032915058))
2025-07-03 16:47:07 +00:00
e472daa809 [dynamo] Add fx_graph_runnable test coverage (#157021)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157021
Approved by: https://github.com/StrongerXi, https://github.com/xmfan

Co-authored-by: Simon Fan <xmfan@meta.com>
2025-07-03 16:42:06 +00:00
ec816d73b4 [MPS] Add shifted_chebyshev_polynomial_[tuvw] (#157488)
For eager and inductor

As for all other chebyshev ops, logic is simply compiled from 94716db222/aten/src/ATen/native/cuda/Math.cuh (L2821)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157488
Approved by: https://github.com/dcci
2025-07-03 15:48:37 +00:00
f17f658125 [profiler] add more CUDA API for kernel launcher (#156016)
Add more kernel detection options, resolving a TODO.
- Reference: [NVIDIA docs](https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__EXECUTION.html)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156016
Approved by: https://github.com/albanD

Co-authored-by: albanD <desmaison.alban@gmail.com>
2025-07-03 15:26:42 +00:00
c9174a20f7 Revert "[BE] Unskip special ops (#157464)"
This reverts commit e124a0d88ca2aa04bfaca2dcabf5de6244048e45.

Reverted https://github.com/pytorch/pytorch/pull/157464 on behalf of https://github.com/clee2000 due to caused slow test config to time out [GH job link](https://github.com/pytorch/pytorch/actions/runs/16037776972/job/45254574100) [HUD commit link](e124a0d88c) ([comment](https://github.com/pytorch/pytorch/pull/157464#issuecomment-3032676989))
2025-07-03 15:24:15 +00:00
b6276a425f Revert "[MPS] Add shifted_chebyshev_polynomial_[tuvw] (#157488)"
This reverts commit 9620994067b18e846a097d1e99af85ec2426ef0a.

Reverted https://github.com/pytorch/pytorch/pull/157488 on behalf of https://github.com/clee2000 due to caused slow test config to time out [GH job link](https://github.com/pytorch/pytorch/actions/runs/16037776972/job/45254574100) [HUD commit link](e124a0d88c) ([comment](https://github.com/pytorch/pytorch/pull/157464#issuecomment-3032676989))
2025-07-03 15:24:15 +00:00
a0e0abd037 Fix typo: 'intialized' → 'initialized' in test_modules.py (#157226)
This PR fixes a minor typo in `test/jit/test_modules.py`:

- Before: `intialized`
- After:  `initialized`

There are no functional code changes — this is a comment-only fix to improve clarity and consistency.

Thank you to the PyTorch team for maintaining this outstanding project.
Please let me know if anything else is needed.

With appreciation,
Abhishek Nandy
[@abhitorch81](https://github.com/abhitorch81)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157226
Approved by: https://github.com/Skylion007
2025-07-03 14:56:02 +00:00
b221be9140 Fix typo: 'intial_query_grad' → 'initial_query_grad' in test_transformers.py (#157306)
This is a minor typo fix in `test/test_transformers.py`:

- Renamed `intial_query_grad` to `initial_query_grad` for improved clarity and correctness in test variable naming.

There are **no functional or logic changes** — this PR is aimed purely at improving readability and maintaining code quality.

Thanks to the PyTorch team for their work and review time
Please feel free to suggest if this needs any adjustment.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157306
Approved by: https://github.com/Skylion007
2025-07-03 14:08:12 +00:00
8408522976 Remove +PTX from CUDA 12.8 builds (#157516)
Remove +PTX from CUDA 12.8 builds and do a small refactor in build_cuda.sh.
Removing +PTX reduces the binary size, which is required to be able to upload the binaries to PyPI.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157516
Approved by: https://github.com/malfet, https://github.com/ptrblck, https://github.com/tinglvv
2025-07-03 13:19:19 +00:00
c329a8f19c Fix CPU bitwise shifts for out-of-limit values in VSX-vec (#157463)
Similar to #96659, this implements the conditionals handling out-of-limit values in the shift amounts (rhs) for the vectorized VSX code, using the same logic as the scalar code.

Fixes #109777

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157463
Approved by: https://github.com/jgong5
2025-07-03 10:41:33 +00:00
5dfd8a9c7a Remove is_jit_trace option (#157387)
Summary: Title

Test Plan:
CI

Rollback Plan:

Differential Revision: D77319249

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157387
Approved by: https://github.com/pianpwk
2025-07-03 09:20:27 +00:00
8c2e450082 [PT][FSDP] fail set_allocate_memory_from_process_group if used together with custom comm hooks (#157487)
Summary:
This is a follow-up to the PR that added comm override support: https://github.com/pytorch/pytorch/pull/155189

The previous PR loosely checks the allocation mixin classes, which isn't really safe, as the actual hook may still override the behavior.
This may lead to unnecessary confusion for no good use case. So for now we just make the 2 sets of APIs largely incompatible:
1. setting custom comms after `set_allocate_memory_from_process_group_for_comm()` is ok.
2. setting `set_allocate_memory_from_process_group_for_comm()` after custom comms is not ok.

Basically, `set_allocate_memory_from_process_group_for_comm()` is like a drop-in hammer, while `set_custom_all_gather()`/`set_custom_reduce_scatter()` are finer-grained scalpels that require more code to be crafted.

We can revisit this if a use case in between shows up, but for now the two can largely be viewed as independent from each other (even though they share some of the underlying pieces for now; that could be subject to change and should not be exposed to end users).

Test Plan: added UT

Differential Revision: D77681620

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157487
Approved by: https://github.com/weifengpy
2025-07-03 07:00:35 +00:00
2bb33e7a08 Fixed triton kernel in ET due to Triton version change. (#157484)
Summary: Fixed triton kernel in ET due to Triton version change.

Test Plan:
buck2 run mode/opt param_bench/fb/integration_tests:test_et_replay

Rollback Plan:

Differential Revision: D77398841

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157484
Approved by: https://github.com/davidberard98
2025-07-03 06:16:23 +00:00
4ce6e6ec88 XCCL changes for DDP (#155497)
Add XCCL documentation for DDP

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155497
Approved by: https://github.com/guangyey, https://github.com/AlannaBurke

Co-authored-by: Yu, Guangye <106960996+guangyey@users.noreply.github.com>
2025-07-03 05:18:08 +00:00
382598ef87 Fix unsafe collective reorder past wait (#157489)
Covers the case where the output of one collective feeds the input of another collective.
e.g. TP + FSDP - all_gather(tp+dp sharded param on TP dim) -> allgather dp_sharded buffer on DP dim

Fixes a bug where the reordering pass specifically exempted wait nodes from dependencies.
Note:  this exemption was incorrect, so it should be removed. But it was also put there for a reason, to help move collectives past wait nodes that are not related to that collective.  After this fix, reordering performance may be worse and we need to find a smarter way to decide if a particular wait node is a blocker for a given collective.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157489
Approved by: https://github.com/IvanKobzarev
ghstack dependencies: #156879
2025-07-03 05:04:19 +00:00
dc524efb4d Move logging into inner method for reorder pass (#156879)
The reason for the inner/outer method split is to keep the outer method conforming
to the typedef for a comms graph pass, which returns one object, while
allowing unit tests to call the inner method, which returns more metadata
useful for testing the pass. The logs should be in the inner part, so
they also work during unit testing.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156879
Approved by: https://github.com/IvanKobzarev
2025-07-03 05:04:19 +00:00
5d5a5b3501 Fix GITHUB_OUTPUT syntax in create_release.yml workflow (#157466)
#149919 fixed a number of linting issues; however, the conversion of the deprecated `::set-output` command to the new `>> $GITHUB_OUTPUT` redirect syntax went wrong, resulting in [failing uploads of the 2.8.0 rc1-rc3 pre-release tarballs](https://github.com/pytorch/pytorch/actions/runs/15892205745/job/44816789782).

This PR fixes that.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157466
Approved by: https://github.com/clee2000, https://github.com/atalman
2025-07-03 04:57:52 +00:00
404008e3ef [build] modernize build-backend: setuptools.build_meta:__legacy__ -> setuptools.build_meta (#155998)
Change `build-system.build-backend`: `setuptools.build_meta:__legacy__` -> `setuptools.build_meta`. Also, move static package info from `setup.py` to `pyproject.toml`.

Now the repo can be installed from source via the `pip` command instead of `python setup.py develop`:

```bash
python -m pip install --verbose --editable .

python -m pip install --verbose --no-build-isolation --editable .
```

In addition, the SDist is also buildable:

```bash
python -m build --sdist
python -m pip install dist/torch-*.tar.gz  # build from source using SDist
```

Note that we should build the SDist from a fresh git clone if we intend to upload the output to PyPI, because all files under `third_party` are included in the SDist. The SDist file will be huge if the git submodules are initialized.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155998
Approved by: https://github.com/ezyang, https://github.com/cyyever, https://github.com/atalman
2025-07-03 04:10:44 +00:00
b642a5c118 [cutlass backend] Add dynamo timed (#157410)
Differential Revision: [D77631592](https://our.internmc.facebook.com/intern/diff/D77631592/)

Before:
![Screenshot 2025-07-01 at 4 08 06 PM](https://github.com/user-attachments/assets/8f6445aa-50c7-456f-b5ac-b2749eb9bf40)

After (different run):
![Screenshot 2025-07-01 at 5 11 09 PM](https://github.com/user-attachments/assets/7513d312-c4dc-4e39-9718-c63eb641bc30)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157410
Approved by: https://github.com/jingsh
2025-07-03 04:03:20 +00:00
493f42a541 [symm_mem] Create a one side get api for symm mem (#157294)
Doing something similar to what we did in https://github.com/pytorch/pytorch/pull/156443 so that we can also have a one-sided get API for symmetric memory.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157294
Approved by: https://github.com/kwen2501
2025-07-03 03:52:05 +00:00
662c1cfed2 [c10d][PGNCCL] Add waitcounter for watchdog and heartbeat monitoring thread (#157480)
We want to have a wait counter for both side threads so that we can monitor their lifecycle.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157480
Approved by: https://github.com/d4l3k
2025-07-03 02:47:06 +00:00
5cc4e856fd Add device_id to XPU device properties (#156481)
# Motivation

Some older Intel iGPUs may share the same device name across different hardware products.
(See [device name example](aaa01c06f9/shared/source/dll/devices/devices_base.inl (L190-L199)))
To help disambiguate which specific iGPU product is being used, we introduce the use of a
[device id](https://github.com/intel/llvm/blob/sycl/sycl/doc/extensions/supported/sycl_ext_intel_device_info.md#device-id). This device id corresponds to the Device ID in [official Intel product specification](https://www.intel.com/content/www/us/en/products/sku/232155/intel-core-i71360p-processor-18m-cache-up-to-5-00-ghz/specifications.html) and enables more accurate identification and troubleshooting for user issues.
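A minimal usage sketch (assuming an XPU-enabled build that includes this change; `device_id` is the new property added here):

```python
import torch

if torch.xpu.is_available():
    props = torch.xpu.get_device_properties(0)
    # device_id maps to the Device ID listed on Intel's product specification
    # pages, which disambiguates iGPUs that share the same device name.
    print(props.name, hex(props.device_id))
```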

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156481
Approved by: https://github.com/EikanWang, https://github.com/albanD
2025-07-03 01:22:11 +00:00
7597988f1b [fake tensor] fix issue of no attribute tags (#156689)
Fixes #156688

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156689
Approved by: https://github.com/leslie-fang-intel, https://github.com/atalman
2025-07-03 01:16:01 +00:00
9620994067 [MPS] Add shifted_chebyshev_polynomial_[tuvw] (#157488)
For eager and inductor

As for all other chebyshev ops, logic is simply compiled from 94716db222/aten/src/ATen/native/cuda/Math.cuh (L2821)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157488
Approved by: https://github.com/dcci
ghstack dependencies: #157464
2025-07-02 23:29:35 +00:00
e124a0d88c [BE] Unskip special ops (#157464)
They were slow on CUDA 11.3, which has long been gone; let's see if they work now.

Before
```
$ python test_ops.py -k chebyshev_polynomial_
ssssssss..ssssss..ssssss..ssssssssssssssssssssss..ssssss/home/ubuntu/py3.10-nightly/lib/python3.10/site-packages/torch/backends/cuda/__init__.py:131: UserWarning: This API is going to be deprecated, please see https://pytorch.org/docs/main/notes/cuda.html#tensorfloat-32-tf32-on-ampere-and-later-devices (Triggered internally at /pytorch/aten/src/ATen/Context.cpp:78.)
  return torch._C._get_cublas_allow_tf32()
....ssssssssssss..ssssss..ssssss............ssssssssssssssssssssssssssssssssssss..ssssssssssssss..ssssss..ssssssssssssssssssssssssssssss..ssssss....ssssssssssss..ssssss..ssssss............ssssssssssssssssssssssssssssssssssss..ssssss..ssssssssssssss..ssssss..ssssss..ssssssssssssss..ssssss..ssssss..ssssss..ssssss..ssssss..ssssss..ssssss..ssssss..ssssss..ssssss..ssssssssssssss
----------------------------------------------------------------------
Ran 432 tests in 8.575s

OK (skipped=344)
```
After
```
$ python test_ops.py -k chebyshev_polynomial_
ssssssss........................ssssssssssssssss......../home/ubuntu/py3.10-nightly/lib/python3.10/site-packages/torch/backends/cuda/__init__.py:131: UserWarning: This API is going to be deprecated, please see https://pytorch.org/docs/main/notes/cuda.html#tensorfloat-32-tf32-on-ampere-and-later-devices (Triggered internally at /pytorch/aten/src/ATen/Context.cpp:78.)
  return torch._C._get_cublas_allow_tf32()
........................................................................................ssssssss................ssssssssssssssssssssssss........................................................................................................ssssssss........................ssssssss........................................................................................ssssssss
----------------------------------------------------------------------
Ran 432 tests in 42.379s

OK (skipped=80)
```

Fixes https://github.com/pytorch/pytorch/issues/79528

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157464
Approved by: https://github.com/Skylion007
2025-07-02 23:16:52 +00:00
7cfd054075 [attempt 2] Compute contiguity symbolically to avoid dde, and introduce c++ sym_is_contiguous (#157472)
Summary:
When we compute contiguity for a tensor with dynamic shapes, we first:
1) Try to compute it without guarding.
2) If all shapes are hinted, compute it, potentially adding guards.
3) If any input is not hinted, compute it symbolically.

`sym_is_contiguous` returns a SymBool that is then either evaluated, or `guard_or_false` can be called on it to avoid data-dependent errors.

For example:
 bool is_contiguous = input.sym_is_contiguous().guard_or_false(__FILE__, __LINE__);
`is_contiguous_or_false` is a helper function that does exactly this.

In this PR I only handle default contiguity; I will follow up with changes for other formats like `channels_last`.
We use this pattern in several locations in this PR to avoid DDEs.

Test Plan:
contbuild & OSS CI,

Rollback Plan:

Reviewed By: malfet

Differential Revision: D77639021

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157472
Approved by: https://github.com/aorenste
2025-07-02 23:12:29 +00:00
d40aaa42ee [BE][16/16] fix typos in torch/ (torch/utils/) (#156606)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156606
Approved by: https://github.com/albanD
ghstack dependencies: #156318, #156320, #156602, #156604
2025-07-02 22:55:29 +00:00
11c07c848c [BE][14/16] fix typos in torch/ (torch/fx/) (#156604)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156604
Approved by: https://github.com/jingsh
ghstack dependencies: #156318, #156320, #156602
2025-07-02 22:55:29 +00:00
db259bd6b8 [BE][12/16] fix typos in torch/ (#156602)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156602
Approved by: https://github.com/justinchuby, https://github.com/albanD
ghstack dependencies: #156318, #156320
2025-07-02 22:55:29 +00:00
d5cdc36943 [BE][10/16] fix typos in torch/ (torch/csrc/jit/) (#156320)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156320
Approved by: https://github.com/albanD
ghstack dependencies: #156318
2025-07-02 22:55:29 +00:00
541584d22e [BE][8/16] fix typos in torch/ (torch/csrc/jit/) (#156318)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156318
Approved by: https://github.com/albanD
2025-07-02 22:55:29 +00:00
c0e155a8d2 [cutlass backend] Use alignment of D for EVT / Float8 (#157402)
I encountered a C++ compile error when running cutlass backend tests while upgrading the cutlass version. It seems NVIDIA added
"static_assert(detail::is_aligned<ElementC_, AlignmentC, ElementD_, AlignmentD>(),"

b995f93317/include/cutlass/epilogue/collective/builders/sm90_builder.inl (L297)

However, it seems the codegen has the wrong alignment for D. For C, an alignment of 1 is okay since it is void; but for D, this is probably wrong.
```
    void, cutlass::layout::ColumnMajor, 1,
    cutlass::bfloat16_t, cutlass::layout::RowMajor, 1,
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157402
Approved by: https://github.com/ColinPeppler, https://github.com/mlazos
2025-07-02 22:55:00 +00:00
48560eef80 [dynamo] Fix bug in dict(mapping_proxy) (#157467)
Fixes https://github.com/pytorch/pytorch/issues/157284

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157467
Approved by: https://github.com/jansel, https://github.com/StrongerXi

Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com>
2025-07-02 22:13:02 +00:00
fd4f704905 [ez][CI] Print set output in CI (#157477)
Print the output that is getting set, for better debugging.

It's probably bad that there are 4 of these, but I'm also not sure whether imports will behave correctly.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157477
Approved by: https://github.com/huydhn
2025-07-02 21:47:19 +00:00
60e66d11ab [CI] Keep-going on main (#157470)
Run an experiment where we turn on keep going on main.  Revert this PR to cancel the experiment

There have been a couple of changes that make it so that HUD will show the failure early even while the job is in progress, so triaging for reverts should still be able to happen quickly
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157470
Approved by: https://github.com/huydhn, https://github.com/ZainRizvi, https://github.com/malfet
2025-07-02 21:42:46 +00:00
4b4c2a7b1d Support complex numbers in DTensor redistribute (#157329)
Add complex number unwrapping in functional collectives used by DTensor.

Complex tensors are not directly supported by the underlying comm kernels
(e.g. NCCL), but a complex tensor can be viewed as a real tensor of a
higher rank (the added size-2 trailing dim represents the real vs. imaginary
components). The collective output is then viewed as complex to restore the
original/expected shape and dtype.
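A minimal sketch of the unwrap/rewrap idea (illustrative only; the actual change lives inside DTensor's functional collectives, not in user code):

```python
import torch
import torch.distributed as dist

def all_reduce_complex(t: torch.Tensor, group=None) -> torch.Tensor:
    # View the complex tensor as a real tensor with a trailing size-2 dim
    # (real/imag), run the comm kernel on that view, then view back as complex.
    assert t.is_complex()
    as_real = torch.view_as_real(t).contiguous()
    dist.all_reduce(as_real, group=group)
    return torch.view_as_complex(as_real)
```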

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157329
Approved by: https://github.com/XilunWu
2025-07-02 21:37:16 +00:00
af9c92b4cb [CI] Remove redundant accuracy benchmarks for cpp_wrapper (#155966)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155966
Approved by: https://github.com/desertfire
2025-07-02 20:58:08 +00:00
c09cf29d7d [ez][BE] Tag deletion script to delete any old ciflow + autorevert tags (#157468)
Change the branch/tag deletion script that runs once per day to delete more tags

Previous: only delete ciflow tags that didn't correspond to an open PR
New: delete ciflow tags attached to commits that are > 7 days old.  Also delete `trunk/<sha>` (I think they are for autorevert) tags that are attached to commits that are > 7 days old

It's hard to figure out when the actual tag was pushed or created, so instead it looks at the commit date, which might lead to unexpected behavior if the tag was pushed much later than the commit (ex triggering periodic later to bisect).  I think it's ok though since you don't really need the tag after the workflow runs
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157468
Approved by: https://github.com/izaitsevfb
2025-07-02 20:42:32 +00:00
6f60cfe9b1 [ez] Add super().setUp() in test_ops::TestFakeTensor (#157475)
Noticed some disable issues getting a bunch of comments, so I took a look

One day I'll write a better check for this
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157475
Approved by: https://github.com/huydhn
2025-07-02 20:34:00 +00:00
e20784f228 [dynamo] Support BUILTIN_MATCH serialization. (#157016)
Serialize BUILTIN_MATCH since they are all stored in __builtin__ dict.

Also fixed an issue where the wrong global scope was passed to CheckFunctionManager while loading guards. Previously we could always reuse the compile-time global scope for evaluating guards because the compile-time and runtime global scopes were always the same.

For precompile, we need to serialize the compile-time global scope for loading only. We need to point the CheckFunctionManager to the new global scope after loading is finished for evaluating guards.

Differential Revision: [D77159313](https://our.internmc.facebook.com/intern/diff/D77159313/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157016
Approved by: https://github.com/jansel, https://github.com/jamesjwu
2025-07-02 20:24:24 +00:00
172853547a [inductor] more size_hint_or_throw usage (#157394)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157394
Approved by: https://github.com/jingsh
2025-07-02 20:20:59 +00:00
e0ab1b538a [ez][BE] Remove max jobs override for CI build jobs (#157473)
Basically reverts #147487 since it's not needed anymore

Not an exact revert because some things have already been removed in a different PR
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157473
Approved by: https://github.com/huydhn
2025-07-02 20:12:28 +00:00
3f569f9af7 [BE] Remove extra semicolon (#157486)
Fixes
```
/Users/nshulga/git/pytorch/pytorch/torch/nativert/executor/GraphExecutorBase.cpp:16:58: warning: extra ';' outside of a function is incompatible with C++98 [-Wc++98-compat-extra-semi]
   16 |       execPlan_(ExecutionPlanner{graph_}.createPlan()) {};
      |                                                          ^
1 warning generated.

```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157486
Approved by: https://github.com/seemethere, https://github.com/atalman, https://github.com/Skylion007
2025-07-02 19:56:21 +00:00
94716db222 [BE][DCE] eliminate remnants of global gemm cache (#157327)
Summary: The global gemm cache has not been maintained in ~1 year, and the only entry point (`search_autotune_cache`) was recently deprecated. This means it is now dead code that we can remove.

Test Plan:
CI

Rollback Plan:

Differential Revision: D77520979

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157327
Approved by: https://github.com/jansel
2025-07-02 19:52:35 +00:00
06f39a71b6 Add Release 2.8 CUDA matrix. Update Release schedule for 2.7.1 and 2.9 (#157482)
This PR:
- Adds the Release 2.8 CUDA matrix
- Updates the Release 2.9 schedule to make it more similar to the 2.5 release schedule (mid-October release)
- Updates the 2.7.1 release day
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157482
Approved by: https://github.com/Camyll
2025-07-02 19:52:24 +00:00
36dd598bda layernorm tests: Tweak test thresholds for comparing tensors (#156699)
After I landed this PR: https://github.com/pytorch/pytorch/pull/156600, this test was failing internally on large tensors because the differences were greater than tolerances on some cuda devices.

We now raise the tolerances for larger tensors.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156699
Approved by: https://github.com/eqy, https://github.com/ngimel
2025-07-02 19:33:38 +00:00
32983ea698 [nativert] continue to move generated static dispatch kernels (#157460)
Summary: att

Test Plan:
ci

Rollback Plan:

Differential Revision: D77623080

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157460
Approved by: https://github.com/zhxchen17
2025-07-02 19:28:13 +00:00
5e636d664a [BE] @serialTest decorator must be called (#157388)
Otherwise it turns the test into a trivial one (that always succeeds), as the following example demonstrates
```python
import torch
from torch.testing._internal.common_utils import serialTest, run_tests, TestCase

class MegaTest(TestCase):
    @serialTest
    def test_foo(self):
        if hasattr(self.test_foo, "pytestmark"):
            print("foo has attr and it is", self.test_foo.pytestmark)
        print("foo")

    @serialTest()
    def test_bar(self):
        if hasattr(self.test_bar, "pytestmark"):
            print("bar has attr and it is", self.test_bar.pytestmark)
        print("bar")

if __name__ == "__main__":
    run_tests()
```

That will print
```
test_bar (__main__.MegaTest.test_bar) ... bar has attr and it is [Mark(name='serial', args=(), kwargs={})]
bar
ok
test_foo (__main__.MegaTest.test_foo) ... ok

----------------------------------------------------------------------
Ran 2 tests in 0.013s

```

Added an assert that the arg is a boolean in the decorator, to prevent such silent skips in the future

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157388
Approved by: https://github.com/clee2000
2025-07-02 19:15:19 +00:00
eaf32fffb7 fixed a tiny typo in torch.compiler.md (#157462)
Fixes #157444

There was a typo in [docs/source/torch.compiler.md](https://github.com/pytorch/pytorch/blob/main/docs/source/torch.compiler.md): see -> seen
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157462
Approved by: https://github.com/Skylion007, https://github.com/svekars
2025-07-02 19:15:15 +00:00
0e9d8032a3 [build] remove cmake cache and reconfigure again if it is invalid (#156958)
See also:

- astral-sh/uv#14269

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156958
Approved by: https://github.com/Skylion007
ghstack dependencies: #156742
2025-07-02 18:46:32 +00:00
0105cd89ab [ONNX] Fix conversion of attention - 4D (#157130)
Fixes a wrong conversion to ONNX found while investigating #149662.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157130
Approved by: https://github.com/gramalingam, https://github.com/justinchuby, https://github.com/titaiwangms

Co-authored-by: Justin Chu <justinchuby@users.noreply.github.com>
2025-07-02 18:05:10 +00:00
d5d14ee823 [nativert] create persistent value helper (#157286)
Summary: att

Test Plan: CI

Reviewed By: georgiaphillips

Differential Revision: D74300519

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157286
Approved by: https://github.com/SherlockNoMad
2025-07-02 17:15:52 +00:00
156bc243f0 Back out "Include c++ stack traces when we hit constraint violation (#155603)" (#157406)
Summary:
Original commit changeset: 4b3fdaa8f2c6

Original Phabricator Diff: D76434787

Meta:
https://fb.workplace.com/groups/1286739428954016/permalink/1535462614081695/

Test Plan:
Meta:
Revert D76434787 for S536719

Rollback Plan:

Differential Revision: D77626334

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157406
Approved by: https://github.com/bobrenjc93
2025-07-02 16:51:07 +00:00
bd6b5fddbf [Precompile] [easy] Serialize requires_grad for tensors when serializing guards (#157372)
Need to keep requires_grad on the tensor when serializing/deserializing guards. This matters when there's a TENSOR_MATCH guard on a tensor that requires_grad. Added a unit test.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157372
Approved by: https://github.com/jansel, https://github.com/zhxchen17
ghstack dependencies: #156433
2025-07-02 16:34:37 +00:00
54701a0c94 Add is_hidden_event method to KinetoEvent Python interface (#155214)
Fixes #155213

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155214
Approved by: https://github.com/sraikund16
2025-07-02 16:29:21 +00:00
0edc1b91f7 [Inductor] Disable decompose_k for AMD (#157283)
Differential Revision: D77544250

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157283
Approved by: https://github.com/bdhirsh
2025-07-02 15:21:46 +00:00
9f5276dc07 Fix typo: 'Intializes' → 'Initializes' in _distributed_c10d.pyi docst… (#157455)
Description:

This PR fixes a small documentation typo in torch/_C/_distributed_c10d.pyi, correcting:

Intializes → Initializes

This helps improve clarity in internal docstrings for maintainers and contributors.
Let me know if further changes are needed. Thanks for your time and the amazing work on PyTorch!

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157455
Approved by: https://github.com/Skylion007, https://github.com/malfet
2025-07-02 15:19:05 +00:00
9d175bc7e6 Fixes for CPython int/float tests (#155978)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155978
Approved by: https://github.com/zou3519
2025-07-02 15:04:00 +00:00
b096341963 [BE] use pathlib.Path instead of os.path.* in setup.py (#156742)
Resolves:

- https://github.com/pytorch/pytorch/pull/155998#discussion_r2164376634

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156742
Approved by: https://github.com/malfet
2025-07-02 14:57:58 +00:00
82eefaedd9 [inductor][user triton] sanitize triple-quoted docstrings in kernel definitions (#157322)
Fixes #155006

Inductor sometimes codegens triton kernel definitions into a triple-quoted text block. If the text block itself contains triple-quotes, this breaks. Notably, this can happen for user-defined triton kernels, where the user may have added a docstring in their triton kernel.
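A small sketch of the failure mode and one possible sanitization (illustrative; not the exact Inductor implementation):

```python
def sanitize_for_triple_quotes(src: str) -> str:
    # Escape triple quotes so the kernel source can be embedded verbatim
    # inside another triple-quoted string in the generated module.
    return src.replace('"""', '\\"\\"\\"')

kernel_src = 'def kernel():\n    """my docstring"""\n    pass\n'
# Without sanitizing, the inner triple quotes would terminate the outer
# string early and the generated module would fail to parse.
generated = f'''triton_src = """{sanitize_for_triple_quotes(kernel_src)}"""'''
compile(generated, "<generated>", "exec")  # parses fine after sanitizing
```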

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157322
Approved by: https://github.com/zou3519, https://github.com/drisspg
2025-07-02 14:02:01 +00:00
c553c55be7 Revert "Fix full_like decomposition to preserve strides (#144765)"
This reverts commit 01b0f09931d47bd2716398a0c335b2807dc3074d.

Reverted https://github.com/pytorch/pytorch/pull/144765 on behalf of https://github.com/jeanschmidt due to Seems to be breaking internal tests see [D77652778](https://www.internalfb.com/diff/D77652778), @jansel may you help get this PR merged? ([comment](https://github.com/pytorch/pytorch/pull/144765#issuecomment-3027975098))
2025-07-02 13:56:03 +00:00
d5a89178b0 Revert "[dynamo] Add fx_graph_runnable test coverage (#157021)"
This reverts commit 77676753ecabf6a6645bdd3abfe01939e5751e76.

Reverted https://github.com/pytorch/pytorch/pull/157021 on behalf of https://github.com/jeanschmidt due to New tests are red internally, more details on [D77652793](https://www.internalfb.com/diff/D77652793). Maybe codev could be a better strategy to merge this PR faster... ([comment](https://github.com/pytorch/pytorch/pull/157021#issuecomment-3027952946))
2025-07-02 13:48:41 +00:00
bdb7819166 [dynamo, nested graph breaks] remove recursive cell/freevar in instruction tx (#154078)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154078
Approved by: https://github.com/StrongerXi, https://github.com/jansel
2025-07-02 13:36:14 +00:00
34c8033fd3 Fix a div_mod bug in generic_math.h (#157383)
Summary: There is a bug in integer div_mod: when the remainder is 0 and the divisor is negative, the mod operation produces a negative number. Fixed in this PR.
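One plausible way such a bug arises, sketched in Python (the actual fix is in the C++ helper in generic_math.h and may differ in detail):

```python
def div_mod(a: int, b: int):
    # Truncated quotient/remainder, then a correction toward floored semantics.
    q = int(a / b)
    r = a - b * q
    # A buggy correction adjusts whenever the signs differ, even when r == 0,
    # so div_mod(6, -3) would return r = -3 instead of 0.
    # Correct version: only adjust when there is a nonzero remainder.
    if r != 0 and (r < 0) != (b < 0):
        q, r = q - 1, r + b
    return q, r

assert div_mod(6, -3) == (-2, 0)   # remainder 0, negative divisor: stays 0
assert div_mod(7, -3) == (-3, -2)  # floored semantics for nonzero remainders
```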
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157383
Approved by: https://github.com/angelayi, https://github.com/jingsh
2025-07-02 12:22:57 +00:00
ab2294d828 [dynamo] fix _torchdynamo_orig_callable naming issues (#156901)
`_torchdynamo_orig_callable` was being used in two distinct places:
- to get the original user function from nested eval_frame.py decorators
- to get the original backend from nested convert_frame.py callbacks

We rename ~the first usage to `_torchdynamo_orig_fn`~ and the second to `_torchdynamo_orig_backend` in order to distinguish these cases.

UPDATE: it seems like both internal and OSS users depend on `_torchdynamo_orig_callable`, but seemingly only in the first context. We should thus keep the original name for the first case.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156901
Approved by: https://github.com/StrongerXi, https://github.com/jansel
2025-07-02 09:53:55 +00:00
3173616532 [nativert] start to move generated static dispatch kernels (#157403)
Summary: att

Test Plan:
ci

Rollback Plan:

Differential Revision: D77622952

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157403
Approved by: https://github.com/georgiaphillips
2025-07-02 08:42:01 +00:00
8c0df6fe17 Revert "[dynamo][fsdp] Consistent behavior of int attributes (#157262)"
This reverts commit 42b48ee67229286127390000f103a11dfc8901f5.

Reverted https://github.com/pytorch/pytorch/pull/157262 on behalf of https://github.com/jeanschmidt due to Newly introduced tests are red in internal runs, check D77593713 ([comment](https://github.com/pytorch/pytorch/pull/157262#issuecomment-3026944993))
2025-07-02 08:30:39 +00:00
0364db7cd1 [PT] support custom all_gather and reduce_scatter comms (#155189)
Summary:
This change introduces 2 comm override APIs, `set_custom_all_gather` and `set_custom_reduce_scatter`, to allow for custom behavior of the respective collectives.

This allows users to control how the comm buffers are allocated and the exact comm implementation, for flexibility.
For details, see the docstring of `Comm` in `_fsdp_api.py`

Related PR:
https://github.com/pytorch/pytorch/pull/150564

Test Plan: CI

Differential Revision: D75714362

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155189
Approved by: https://github.com/weifengpy
2025-07-02 06:58:45 +00:00
f8c0a4bd28 [inductor] enable bf32 test for mkldnn conv (#127293)
Enable more tests for inductor conv + bf32.
Test plan:
```
python test/inductor/test_mkldnn_pattern_matcher.py -k test_conv2d_unary_cpu
python test/inductor/test_mkldnn_pattern_matcher.py -k test_conv3d_unary_cpu
python test/inductor/test_mkldnn_pattern_matcher.py -k test_conv_transpose2d_unary
python test/inductor/test_mkldnn_pattern_matcher.py -k test_conv2d_binary
python test/inductor/test_mkldnn_pattern_matcher.py -k test_conv3d_binary
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127293
Approved by: https://github.com/jgong5
ghstack dependencies: #126050, #126054

Co-authored-by: Jiang, Yanbing <yanbing.jiang@intel.com>
2025-07-02 01:49:01 +00:00
4c8eb65efb allow to use bf16 as fp32 internal precision for mkldnn conv backward (#126054)
Used for CI since it depends on an ideep update.

Allow using `BF16` as the internal computation data type via `torch.backends.mkldnn.conv.fp32_precision="bf16"`
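A minimal usage sketch (assuming a CPU build with mkldnn/oneDNN enabled):

```python
import torch
import torch.nn as nn

# Opt in to BF16 as the internal compute precision for FP32 mkldnn convolutions.
torch.backends.mkldnn.conv.fp32_precision = "bf16"

conv = nn.Conv2d(64, 256, kernel_size=1)
x = torch.randn(8, 64, 56, 56, requires_grad=True)
conv(x).sum().backward()  # forward and backward take the BF16 internal path
```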

### TestPlan
python test/test_mkldnn.py -k conv

### Benchmarking

FP32 conv2d backward vs. BF16 internal computation conv backward on SPR

Single core:

Input | fp32 ms | bf16 internal  ms | Speed up
-- | -- | -- | --
IC:   64, OC: 256, kernel: 1, stride: 1, N: 256, H: 56, W: 56, G: 1, pad: 0 | 461.6734| 358.3779| 1.48
IC:   128, OC: 512, kernel: 1, stride: 1, N: 256, H: 28, W: 28, G: 1, pad: 0 | 358.3779 | 247.8631| 1.46
IC: 256, OC: 256, kernel: 3, stride: 1,   N: 1, H: 16, W: 16, G: 1, pad: 0 | 4.3783| 3.8513| 1.14

56 cores:
Input | fp32 ms | bf16 internal ms | Speed up
-- | -- | -- | --
IC:   64, OC: 256, kernel: 1, stride: 1, N: 256, H: 28, W: 28, G: 1, pad: 0 | 16.6119 | 12.2047 | 1.38
IC:   128, OC: 512, kernel: 1, stride: 1, N: 256, H: 28, W: 28, G: 1, pad: 0 | 12.0016 | 8.6711 | 1.38
IC:   256, OC: 1024, kernel: 1, stride: 1, N: 256, H: 14, W: 14, G: 1, pad: 0 | 20.5947 | 15.9366 | 1.29
IC: 1024, OC: 256, kernel: 1, stride: 1,   N: 256, H: 14, W: 14, G: 1, pad: 0 | 40.0952 | 32.2222 | 1.24
IC: 256, OC: 256, kernel: 3, stride: 1,   N: 1, H: 16, W: 16, G: 1, pad: 0 | 162.7449 | 142.3054 | 1.14

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126054
Approved by: https://github.com/jgong5
ghstack dependencies: #126050

Co-authored-by: Jiang, Yanbing <yanbing.jiang@intel.com>
2025-07-02 01:40:13 +00:00
5a2db5152d allow to use bf16 as fp32 internal precision for mkldnn conv (#126050)
Allow using `BF16` as the internal computation data type via `torch.backends.mkldnn.conv.fp32_precision="bf16"`

### TestPlan
python test/test_mkldnn.py -k conv

### Benchmarking

FP32 conv2d vs. BF16 internal computation conv2d on SPR

Single core:

Input | fp32 ms | bf16 internal  ms | Speed up
-- | -- | -- | --
IC:   64, OC: 256, kernel: 1, stride: 1, N: 256, H: 56, W: 56, G: 1, pad: 0 | 185.5071 | 83.4749 | 2.22
IC:   128, OC: 512, kernel: 1, stride: 1, N: 256, H: 28, W: 28, G: 1, pad: 0 | 194.7558 | 79.1683| 2.46
IC: 256, OC: 256, kernel: 3, stride: 1,   N: 1, H: 16, W: 16, G: 1, pad: 0 | 1.9213 | 1.3690 | 1.40

56 cores:
Input | fp32 ms | bf16 internal ms | Speed up
-- | -- | -- | --
IC:   64, OC: 256, kernel: 1, stride: 1, N: 256, H: 28, W: 28, G: 1, pad: 0 | 6.5804  | 7.4349 | 0.89
IC:   128, OC: 512, kernel: 1, stride: 1, N: 256, H: 28, W: 28, G: 1, pad: 0 | 4.9940  | 3.8093 | 1.31
IC:   256, OC: 1024, kernel: 1, stride: 1, N: 256, H: 14, W: 14, G: 1, pad: 0 | 8.8359 | 5.5802 | 1.58
IC: 1024, OC: 256, kernel: 1, stride: 1,   N: 256, H: 14, W: 14, G: 1, pad: 0 | 16.5800 | 9.2367 | 1.80
IC: 256, OC: 256, kernel: 3, stride: 1,   N: 1, H: 16, W: 16, G: 1, pad: 0 | 79.5436 | 38.3861  | 2.07

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126050
Approved by: https://github.com/jgong5, https://github.com/jansel

Co-authored-by: Jiang, Yanbing <yanbing.jiang@intel.com>
2025-07-02 01:31:23 +00:00
0a63053fe9 Don't store flamegraph to tmp folder (#157374)
Where it's accessible (and mutable) by multiple users. Use the `~/.cache` folder instead.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157374
Approved by: https://github.com/eqy
ghstack dependencies: #157373
2025-07-02 00:46:51 +00:00
bb476310a4 [dynamo][guards] Stash root guard manager pointer in the LeafGuard (#157325)
Preparing to simplify the recompilation reason codebase. This PR was 95% done by using AI tools.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157325
Approved by: https://github.com/jansel
2025-07-02 00:42:43 +00:00
fa1c20ae92 Fix test consolidate hf safetensors (#157386)
Updates an argument name that was changed so that the test doesn't throw.

Differential Revision: [D77604210](https://our.internmc.facebook.com/intern/diff/D77604210/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157386
Approved by: https://github.com/meetv18
ghstack dependencies: #154743, #156705
2025-07-02 00:16:21 +00:00
77676753ec [dynamo] Add fx_graph_runnable test coverage (#157021)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157021
Approved by: https://github.com/StrongerXi, https://github.com/xmfan

Co-authored-by: Simon Fan <xmfan@meta.com>
2025-07-02 00:10:01 +00:00
617e3f69f8 [FP8] Fix Benchmarking for certain Priors (#155722)
Summary: For priors like layer norm, the weight quantization kernel may appear in a different order and therefore carry a different suffix, so we match it with a regular expression instead.
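A toy illustration of the idea (the kernel names below are made up, not the actual generated names): match the weight quantization kernel by pattern instead of by one exact, order-dependent suffix.

```py
# Sketch only: regex matching tolerates the kernel appearing in a different order
# and therefore carrying a different numeric suffix.
import re

kernel_names = [
    "triton_per_fused_layer_norm_quantize_weight_12",  # hypothetical
    "triton_per_fused_quantize_weight_layer_norm_7",   # hypothetical
]
pattern = re.compile(r"fused.*quantize_weight.*_\d+$")
matches = [name for name in kernel_names if pattern.search(name)]
print(matches)  # both orderings are picked up
```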

Test Plan:
Trying this on model id 737772166 with
```
buck2 run mode/opt  mode/inplace -c fbcode.platform010_cuda_version=12 -c fbcode.nvcc_arch=h100 caffe2/torch/fb/model_transform/experimental/benchmark:mts_gpu_benchmark -- --lower-backend=AOT_INDUCTOR   --model-snapshot-id=737772166_0 --trace-aot-inductor-module=True --disable-acc-tracer=False --batch-size=1024 --node_replacement_dict "{'(autotune)':{'(1000+,1000+)':'fp8_float_model_dynamic_quantization_rowwise'}"
```
will allow more linears to be correctly replaced with fp8.
An example of the gpu trace can be found in https://www.internalfb.com/intern/perfdoctor/trace_view?filepath=tree/hpc/new/models/feed/benchmark/libkineto_activities_773108_f58b57e208c04787acd3bcb01a3e8771.json.gz&bucket=gpu_traces.

Rollback Plan:

Differential Revision: D76092551

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155722
Approved by: https://github.com/Skylion007
2025-07-02 00:01:23 +00:00
ab6cb34480 Revert "[inductor][user triton] sanitize triple-quoted docstrings in kernel definitions (#157322)"
This reverts commit 563fd95563c5edd732ae260b3bd3d0c38822ab57.

Reverted https://github.com/pytorch/pytorch/pull/157322 on behalf of https://github.com/davidberard98 due to fails on rocm ([comment](https://github.com/pytorch/pytorch/pull/157322#issuecomment-3025826951))
2025-07-01 23:21:37 +00:00
c6a27bae36 Revert "[do not revert] Compute contiguity symbolically to avoid dde, and introduce c++ sym_is_contiguous (#155590)"
This reverts commit d0a9629435aaceb5acbf31aad70f2109cb8a3ea2.

Reverted https://github.com/pytorch/pytorch/pull/155590 on behalf of https://github.com/laithsakka due to was asked by to land this internally  ([comment](https://github.com/pytorch/pytorch/pull/155590#issuecomment-3025796794))
2025-07-01 22:58:14 +00:00
563fd95563 [inductor][user triton] sanitize triple-quoted docstrings in kernel definitions (#157322)
Fixes #155006

Inductor sometimes codegens triton kernel definitions into a triple-quoted text block. If the text block itself contains triple-quotes, this breaks. Notably, this can happen for user-defined triton kernels, where the user may have added a docstring in their triton kernel.
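A minimal sketch of the failure mode and one way to sanitize it (not necessarily the exact approach taken in this PR):

```py
# Sketch only: kernel source that itself contains a triple-quoted docstring breaks
# if it is embedded verbatim inside another triple-quoted block.
kernel_src = '''
def add_kernel(x_ptr, y_ptr, out_ptr):
    """User docstring with triple quotes inside."""
    ...
'''

def sanitize(src: str) -> str:
    # Escape inner triple quotes so the source can be wrapped in another
    # triple-quoted string without terminating it early.
    return src.replace('"""', '\\"\\"\\"')

wrapped = f'"""\n{sanitize(kernel_src)}\n"""'
print(wrapped)
```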

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157322
Approved by: https://github.com/zou3519, https://github.com/drisspg
2025-07-01 22:51:11 +00:00
6ef70edd9a Revert "Inductor logging + analysis of torch.profile (#149697)"
This reverts commit 47f10d0ad0dda281c886ff08ac2f938207027316.

Reverted https://github.com/pytorch/pytorch/pull/149697 on behalf of https://github.com/malfet due to Looks like it's breaking ROCM tests, see https://hud.pytorch.org/hud/pytorch/pytorch/main/1?per_page=50&name_filter=rocm%20%2F%20linux-jammy ([comment](https://github.com/pytorch/pytorch/pull/149697#issuecomment-3025673908))
2025-07-01 22:11:53 +00:00
3df6360e8c [BE][Easy][setup] use super().method(...) in command subclasses in setup.py (#156044)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156044
Approved by: https://github.com/albanD
ghstack dependencies: #156741
2025-07-01 22:09:10 +00:00
d0a9629435 [do not revert] Compute contiguity symbolically to avoid dde, and introduce c++ sym_is_contiguous (#155590)
When we compute contiguity for a tensor with dynamic shapes we first:
1) Try to compute it without guarding.
2) If all shapes are hinted, compute it with potentially adding guards.
3) If any input is not hinted, compute it symbolically.

sym_is_contiguous returns a SymBool that is then either evaluated, or guard_or_false can be called
on it to avoid data dependent errors.

For example: `bool is_contiguous = input.sym_is_contiguous().guard_or_false(__FILE__, __LINE__);`
is_contiguous_or_false is a helper function that does that.

In this PR I only handle default contiguity; a follow-up will cover other formats like channels_last.
We use this pattern in several locations in this PR to avoid DDEs.
Differential Revision: [D77183032](https://our.internmc.facebook.com/intern/diff/D77183032)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155590
Approved by: https://github.com/ezyang
2025-07-01 21:39:38 +00:00
22edb457c9 [invoke_subgraph][partitioner] Add meta val on run_and_save_rng ops (#157319)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157319
Approved by: https://github.com/zou3519
2025-07-01 21:02:08 +00:00
e5f6ffd810 [BE] Replace checkcall("chmod") with os.chmod (#157373)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157373
Approved by: https://github.com/clee2000, https://github.com/eqy, https://github.com/Skylion007
2025-07-01 20:46:25 +00:00
019e30e3b8 [BE] Decorate LargeTensorTest with serialTests (#157382)
Maybe it'll help make M2-15 jobs more stable, as that was the last test run before the OOM
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157382
Approved by: https://github.com/clee2000
2025-07-01 20:35:42 +00:00
4500a4aa50 remove allow-untyped-defs from torch/backends/mps/__init__.py (#157227)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157227
Approved by: https://github.com/Skylion007
2025-07-01 20:00:19 +00:00
6bc263809d [SymmMem] Add NVSHMEM_CHECK macro (#157174)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157174
Approved by: https://github.com/fduwjj, https://github.com/fegin
2025-07-01 19:50:28 +00:00
ffac0de07e [export] Remove stack trace from input/output (#157302)
Fixes https://github.com/pytorch/pytorch/issues/157183

https://github.com/pytorch/pytorch/pull/156257 consolidated the path for saving stack traces, but missed the part where stacktraces are not added to placeholder/output nodes in proxy_tensor tracing [(code)](https://github.com/pytorch/pytorch/pull/156257/files#diff-6960ce90e7162c0953b1ca07e92e7f0f2f6ba63b427b42df593e20cc6a096bb7L1107).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157302
Approved by: https://github.com/yushangdi
2025-07-01 19:16:28 +00:00
01b0f09931 Fix full_like decomposition to preserve strides (#144765)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144765
Approved by: https://github.com/amjames, https://github.com/jansel
2025-07-01 19:13:22 +00:00
6401d1d53d Revert "Fused RMSNorm implementation (#153666)"
This reverts commit e1aee86646aa6d1b9cb9d34351e43936401c5efc.

Reverted https://github.com/pytorch/pytorch/pull/153666 on behalf of https://github.com/davidberard98 due to causing build failures on main branch [GH job link](https://github.com/pytorch/pytorch/actions/runs/16007148842/job/45156382001) [HUD commit link](e1aee86646) ([comment](https://github.com/pytorch/pytorch/pull/153666#issuecomment-3025146176))
2025-07-01 18:46:45 +00:00
3a5677a380 Revert "ci: Add ability to test images for build-triton-wheel (#156894)"
This reverts commit 0e47312ae5a687f0aed61db753d03180118cddc4.

Reverted https://github.com/pytorch/pytorch/pull/156894 on behalf of https://github.com/seemethere due to causing issues in downstream builds see https://github.com/pytorch/pytorch/pull/156664 for more info ([comment](https://github.com/pytorch/pytorch/pull/156894#issuecomment-3025137790))
2025-07-01 18:43:34 +00:00
02608e560a [ROCm] Add more shards for inductor dashboard, more frequent runs (#157288)
Also increases regularity of dashboard runs on ROCm.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157288
Approved by: https://github.com/jeffdaily
2025-07-01 18:27:30 +00:00
e1aee86646 Fused RMSNorm implementation (#153666)
Relevant #72643

Benchmarked against the unfused torch implementation and a torch.compile implementation: around a 9x speedup vs. the unfused implementation on CUDA, and slightly faster than the inductor-compiled version on a 5090.

```py
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    def __init__(self, dim, eps=1e-5):
        super().__init__()
        self.eps = eps
        self.scale = nn.Parameter(torch.ones(dim))

    def forward(self, x):
        norm_x = x.norm(2, dim=-1, keepdim=True)
        rms_x = norm_x * torch.rsqrt(torch.tensor(x.shape[-1], dtype=x.dtype))
        x_normed = x / (rms_x + self.eps)
        return self.scale * x_normed

def benchmark_rmsnorm_cuda(input_shape, normalized_dim, num_iterations=100, warmup_iterations=10, dtype=torch.float16):
    rms_norm_layer = torch.nn.RMSNorm(normalized_dim, device='cuda', dtype=dtype)
    input_data = torch.randn(input_shape, device='cuda', dtype=dtype)

    for _ in range(warmup_iterations):
        _ = rms_norm_layer(input_data)
    torch.cuda.synchronize()

    start_event = torch.cuda.Event(enable_timing=True)
    end_event = torch.cuda.Event(enable_timing=True)
    start_event.record()
    for _ in range(num_iterations):
        _ = rms_norm_layer(input_data)

    end_event.record()
    torch.cuda.synchronize()
    elapsed_time_ms = start_event.elapsed_time(end_event)
    avg_time_ms = elapsed_time_ms / num_iterations

    print(f"--- RMSNorm CUDA Benchmark ---")
    print(f"Input Shape: {input_shape}")
    print(f"Normalized Dimension: {normalized_dim}")
    print(f"Benchmark Iterations: {num_iterations}")
    print(f"--- Fused Implementation ---")
    print(f"Average Time per Iteration: {avg_time_ms:.4f} ms")
    print(f"Total Time for {num_iterations} Iterations: {elapsed_time_ms:.3f} ms")

    compiled_rms_norm = torch.compile(RMSNorm(dim=normalized_dim)).cuda()
    for _ in range(warmup_iterations):
        _ = compiled_rms_norm(input_data)
    torch.cuda.synchronize()

    start_event = torch.cuda.Event(enable_timing=True)
    end_event = torch.cuda.Event(enable_timing=True)
    start_event.record()
    for _ in range(num_iterations):
        _ = compiled_rms_norm(input_data)
    end_event.record()
    torch.cuda.synchronize()
    elapsed_time_ms = start_event.elapsed_time(end_event)
    avg_time_ms = elapsed_time_ms / num_iterations

    print(f"--- TorchCompile Implementation ---")
    print(f"Average Time per Iteration: {avg_time_ms:.4f} ms")
    print(f"Total Time for {num_iterations} Iterations: {elapsed_time_ms:.3f} ms")

    print("-" * 50)

if __name__ == '__main__':
    parameter_sets = [
        {'batch_size': 16, 'sequence_length': 256, 'hidden_features': 512, 'dtype': torch.float16},
        {'batch_size': 32, 'sequence_length': 512, 'hidden_features': 768, 'dtype': torch.float16},
        {'batch_size': 64, 'sequence_length': 1024, 'hidden_features': 1024, 'dtype': torch.float16},
        {'batch_size': 32, 'sequence_length': 512, 'hidden_features': 768, 'dtype': torch.float32},
        {'batch_size': 8, 'sequence_length': 2048, 'hidden_features': 2048, 'dtype': torch.float16},
    ]

    num_benchmark_iterations = 200
    num_warmup_iterations = 20

    for params in parameter_sets:
        batch_size = params['batch_size']
        sequence_length = params['sequence_length']
        hidden_features = params['hidden_features']
        data_type = params.get('dtype', torch.float16)

        shape = (batch_size, sequence_length, hidden_features)
        norm_dim_to_normalize = hidden_features

        print(f"Benchmarking with: BS={batch_size}, SeqLen={sequence_length}, Hidden={hidden_features}, DType={data_type}")
        benchmark_rmsnorm_cuda(input_shape=shape,
                               normalized_dim=norm_dim_to_normalize,
                               num_iterations=num_benchmark_iterations,
                               warmup_iterations=num_warmup_iterations,
                               dtype=data_type)
```

Here are the triton compile tests ran on a 5090 (comparing this branch vs main)
```py
import torch
import torch.nn as nn
from torch._inductor.utils import run_and_get_code, run_fw_bw_and_get_code

torch.manual_seed(0)

device = torch.device("cuda")

for batch in range(0, 9):
    for i in range(9, 16):
        normalized_shape_arg = (2**batch, 2**i)
        input_tensor = torch.randn(2**batch, 2**i, device=device, requires_grad=True)
        weight_tensor = torch.randn(2**batch, 2**i,device=device, requires_grad=True)

        model = torch.nn.functional.rms_norm
        compiled_model = torch.compile(model)
        loss = torch.randn_like(input_tensor)

        num_iter = 5
        for j in range(num_iter):
            output = compiled_model(input_tensor, normalized_shape_arg, weight_tensor)
            output.backward(loss)

        start_event = torch.cuda.Event(enable_timing=True)
        end_event = torch.cuda.Event(enable_timing=True)
        start_event.record()
        num_iter = 10
        for j in range(num_iter):
            output = compiled_model(input_tensor, normalized_shape_arg, weight_tensor)
            output.backward(loss)

        end_event.record()
        torch.cuda.synchronize()

        elapsed_time_ms = start_event.elapsed_time(end_event)
        avg_time_ms = round(elapsed_time_ms / num_iter, 5)
        print(2**batch, 2**i, avg_time_ms)
```
main
```
32 512 0.1812
32 1024 0.19021
32 2048 0.18871
32 4096 0.17019
32 8192 0.21944
32 16384 0.38871
32 32768 0.83282
64 512 0.14705
64 1024 0.13987
64 2048 0.14111
64 4096 0.21699
64 8192 0.43141
64 16384 0.90652
64 32768 2.18573
128 512 0.19361
128 1024 0.1963
128 2048 0.20122
128 4096 0.38888
128 8192 0.93795
128 16384 2.23437
128 32768 5.50079
256 512 0.16722
256 1024 0.22856
256 2048 0.39421
256 4096 0.96621
256 8192 2.48746
256 16384 5.53571
256 32768 11.97932
```
current branch
```
32 512 0.16328
32 1024 0.18104
32 2048 0.15508
32 4096 0.14356
32 8192 0.20111
32 16384 0.45974
32 32768 0.94799
64 512 0.16874
64 1024 0.18701
64 2048 0.16107
64 4096 0.20152
64 8192 0.46568
64 16384 0.96599
64 32768 2.21661
128 512 0.14982
128 1024 0.15565
128 2048 0.22241
128 4096 0.46128
128 8192 0.88883
128 16384 2.3097
128 32768 5.84448
256 512 0.14346
256 1024 0.2007
256 2048 0.45927
256 4096 0.87876
256 8192 2.10571
256 16384 5.73948
256 32768 12.98581
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153666
Approved by: https://github.com/ngimel
2025-07-01 18:22:24 +00:00
1c8844d9e7 [MPS] Switch Cholesky decomp to column wise (#157014)
Everything should go through generalized kernels, and Metal kernels should work with the same sizes and strides as the CPU or CUDA backends to avoid problems with `torch.compile`, which relies on the meta kernels to tell what its output is going to look like.

To avoid returning tensors with a different layout depending on whether the upper parameter is true or false, templatize `factorDiagonalBlock`, `applyTRSM` and `applySYRK` to take upper/lower (actually row-wise vs column-wise) as a template argument and call the appropriate templates from the host

TODOs:
 - Rename upper parameter to something more sensible and add comments
 - Use simd_groupsize instead of hardcoded 32 everywhere

Fixes https://github.com/pytorch/pytorch/issues/156658

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157014
Approved by: https://github.com/Skylion007, https://github.com/dcci
ghstack dependencies: #157179
2025-07-01 18:00:59 +00:00
720c2c46b1 [Inductor UT][XPU] Reduce the runtime of the test case test_comprehensive_nn_functional_max_pool2d_xpu. (#157357)
This test case has over a thousand input samples, causing it to run for more than 30 minutes, which triggers the timeout mechanism and breaks the XPU CI. This PR limits the number of samples to one for this XPU case.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157357
Approved by: https://github.com/chuanqi129, https://github.com/jansel
2025-07-01 17:47:49 +00:00
3bc6bdc866 [BE] add type annotations and run mypy on setup.py (#156741)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156741
Approved by: https://github.com/aorenste
2025-07-01 17:09:05 +00:00
47f10d0ad0 Inductor logging + analysis of torch.profile (#149697)
Prereqs:
 - https://github.com/pytorch/pytorch/pull/152708

Features:
1. Adds inductor's estimate of flops and bandwidth to the json trace events that perfetto uses.
2. Only use the tflops estimation from triton if we don't have the info from the datasheet, because Triton's estimates are inaccurate. I have a backlog item to fix triton flops estimation upstream. New `DeviceInfo` class, and new function `get_device_tflops`.
3. New helpers `countable_fx` and `count_flops_fx` help get the flops of an `fx.Node`.
4. Extends Triton `torch.profiler` logging to `DebugAutotuner`.
5. New script `profile_analysis.py`: `--augment_trace` adds perf estimates to any perfetto json trace, `--analyze` creates a summary table of these perf estimates, and `--diff` will compare two traces side by side:
```
Device(NVIDIA H100, 0):
 Kernel Name                              | resnet Kernel Count | resnet FLOPS       | resnet bw gbps        | resnet Dur (ms)    | resnet Achieved FLOPS % | resnet Achieved Bandwidth % | newresnet Kernel Count | newresnet FLOPS    | newresnet bw gbps     | newresnet Dur (ms) | newresnet Achieved FLOPS % | newresnet Achieved Bandwidth %
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 triton_poi_fused__native_batch_norm_legi | 24                  | 0                  | 0.11395268248131513   | 2.5919166666666666 | 0                       | 0.003401572611382541        | 24                     | 0                  | 0.11395268248131513   | 2.5919166666666666 | 0                          | 0.003401572611382541
 sm90_xmma_fprop_implicit_gemm_f32f32_tf3 | 142                 | 16932673552.422373 | 0.2585007824198784    | 12.441619718309857 | 0.08683422334575583     | 0.007716441266265022        | 142                    | 16932673552.422373 | 0.2585007824198784    | 12.441619718309857 | 0.08683422334575583        | 0.007716441266265022
 triton_red_fused__native_batch_norm_legi | 39                  | 0                  | 0.13990024992108846   | 5.752589743589743  | 0                       | 0.004176126863316074        | 39                     | 0                  | 0.13990024992108846   | 5.752589743589743  | 0                          | 0.004176126863316074
 triton_poi_fused__native_batch_norm_legi | 25                  | 0                  | 0.31824055917536503   | 2.5291999999999994 | 0                       | 0.009499718184339253        | 25                     | 0                  | 0.31824055917536503   | 2.5291999999999994 | 0                          | 0.009499718184339253
 void cutlass::Kernel2<cutlass_80_tensoro | 98                  | 16211056473.596165 | 0.42972434051025826   | 7.130408163265306  | 0.08313362294151874     | 0.012827592254037562        | 98                     | 16211056473.596165 | 0.42972434051025826   | 7.130408163265306  | 0.08313362294151874        | 0.012827592254037562
 triton_red_fused__native_batch_norm_legi | 73                  | 0                  | 0.3225381327611705    | 9.987068493150682  | 0                       | 0.009628003963020014        | 73                     | 0                  | 0.3225381327611705    | 9.987068493150682  | 0                          | 0.009628003963020014
 triton_poi_fused__native_batch_norm_legi | 15                  | 0                  | 1.4491211346487216    | 4.439333333333333  | 0                       | 0.043257347302946926        | 15                     | 0                  | 1.4491211346487216    | 4.439333333333333  | 0                          | 0.043257347302946926
 void cutlass::Kernel2<cutlass_80_tensoro | 186                 | 14501701145.337954 | 0.2667131401910989    | 7.873865591397849  | 0.07436769818122027     | 0.007961586274361157        | 186                    | 14501701145.337954 | 0.2667131401910989    | 7.873865591397849  | 0.07436769818122027        | 0.007961586274361157
 triton_poi_fused__native_batch_norm_legi | 33                  | 0                  | 1.4924556538193923    | 4.3101515151515155 | 0                       | 0.044550915039384846        | 33                     | 0                  | 1.4924556538193923    | 4.3101515151515155 | 0                          | 0.044550915039384846
 triton_red_fused__native_batch_norm_legi | 29                  | 0                  | 0.25562590522631107   | 6.296275862068965  | 0                       | 0.007630624036606301        | 29                     | 0                  | 0.25562590522631107   | 6.296275862068965  | 0                          | 0.007630624036606301
 triton_poi_fused__native_batch_norm_legi | 13                  | 0                  | 0.5870562174192726    | 2.7397692307692307 | 0                       | 0.01752406619162008         | 13                     | 0                  | 0.5870562174192726    | 2.7397692307692307 | 0                          | 0.01752406619162008
 triton_poi_fused__native_batch_norm_legi | 34                  | 0                  | 0.41409928846284      | 2.853588235294117  | 0                       | 0.012361172789935523        | 34                     | 0                  | 0.41409928846284      | 2.853588235294117  | 0                          | 0.012361172789935523
 triton_per_fused__native_batch_norm_legi | 34                  | 0                  | 0.11705315007018151   | 3.460647058823529  | 0                       | 0.0034941238826919864       | 34                     | 0                  | 0.11705315007018151   | 3.460647058823529  | 0                          | 0.0034941238826919864
 triton_poi_fused__native_batch_norm_legi | 16                  | 0                  | 0.17207853197124584   | 2.3459375000000002 | 0                       | 0.005136672596156592        | 16                     | 0                  | 0.17207853197124584   | 2.3459375000000002 | 0                          | 0.005136672596156592
 triton_per_fused__native_batch_norm_legi | 30                  | 0                  | 0.2639714322022256    | 6.131199999999999  | 0                       | 0.007879744244842555        | 30                     | 0                  | 0.2639714322022256    | 6.131199999999999  | 0                          | 0.007879744244842555
 sm90_xmma_fprop_implicit_gemm_f32f32_tf3 | 100                 | 11875430356.891787 | 0.19494470869421385   | 16.36534           | 0.06089964285585531     | 0.005819245035648175        | 100                    | 11875430356.891787 | 0.19494470869421385   | 16.36534           | 0.06089964285585531        | 0.005819245035648175
 triton_poi_fused__native_batch_norm_legi | 8                   | 0                  | 0.9854096626224687    | 3.2757500000000004 | 0                       | 0.029415213809625928        | 8                      | 0                  | 0.9854096626224687    | 3.2757500000000004 | 0                          | 0.029415213809625928
 void cublasLt::splitKreduce_kernel<32, 1 | 56                  | 34377923395.147064 | 0.8310300045762317    | 3.4199999999999986 | 0.17629704305203628     | 0.024806865808245714        | 56                     | 34377923395.147064 | 0.8310300045762317    | 3.4199999999999986 | 0.17629704305203628        | 0.024806865808245714
 triton_poi_fused__native_batch_norm_legi | 23                  | 0                  | 0.9944002965861103    | 3.2431304347826084 | 0                       | 0.02968359094286896         | 23                     | 0                  | 0.9944002965861103    | 3.2431304347826084 | 0                          | 0.02968359094286896
 triton_per_fused__native_batch_norm_legi | 10                  | 0                  | 0.1826801058931057    | 4.428800000000001  | 0                       | 0.00545313748934644         | 10                     | 0                  | 0.1826801058931057    | 4.428800000000001  | 0                          | 0.00545313748934644
 triton_poi_fused__native_batch_norm_legi | 10                  | 0                  | 0.3168973585366449    | 2.5471999999999997 | 0                       | 0.009459622642884923        | 10                     | 0                  | 0.3168973585366449    | 2.5471999999999997 | 0                          | 0.009459622642884923
 triton_poi_fused__native_batch_norm_legi | 34                  | 0                  | 1.1463614897015777    | 4.124323529411764  | 0                       | 0.03421974596124114         | 34                     | 0                  | 1.1463614897015777    | 4.124323529411764  | 0                          | 0.03421974596124114
 void cask_plugin_cudnn::xmma_cudnn::init | 44                  | 44045510816.64277  | 2.0661232850348643    | 3.6887499999999993 | 0.22587441444432194     | 0.06167532194133924         | 44                     | 44045510816.64277  | 2.0661232850348643    | 3.6887499999999993 | 0.22587441444432194        | 0.06167532194133924
 sm90_xmma_fprop_implicit_gemm_f32f32_tf3 | 95                  | 7876855400.165316  | 0.4694941555946739    | 18.224315789473682 | 0.04039413025725802     | 0.014014750913273854        | 95                     | 7876855400.165316  | 0.4694941555946739    | 18.224315789473682 | 0.04039413025725802        | 0.014014750913273854
 triton_per_fused__native_batch_norm_legi | 41                  | 0                  | 0.06825669875995298   | 3.0384146341463416 | 0                       | 0.002037513395819492        | 41                     | 0                  | 0.06825669875995298   | 3.0384146341463416 | 0                          | 0.002037513395819492
 triton_poi_fused__native_batch_norm_legi | 23                  | 0                  | 0.08808154712430301   | 2.3275652173913044 | 0                       | 0.0026292999141582997       | 23                     | 0                  | 0.08808154712430301   | 2.3275652173913044 | 0                          | 0.0026292999141582997
 triton_per_fused__native_batch_norm_legi | 40                  | 0                  | 0.18179321034952417   | 4.556825           | 0                       | 0.005426662995508183        | 40                     | 0                  | 0.18179321034952417   | 4.556825           | 0                          | 0.005426662995508183
 triton_poi_fused__native_batch_norm_legi | 15                  | 0                  | 0.5887415155454232    | 2.783866666666667  | 0                       | 0.017574373598370836        | 15                     | 0                  | 0.5887415155454232    | 2.783866666666667  | 0                          | 0.017574373598370836
 void cutlass::Kernel2<cutlass_80_tensoro | 38                  | 14242013806.264643 | 0.256592404353939     | 7.217631578947369  | 0.0730359682372546      | 0.007659474756834           | 38                     | 14242013806.264643 | 0.256592404353939     | 7.217631578947369  | 0.0730359682372546         | 0.007659474756834
 triton_poi_fused__native_batch_norm_legi | 21                  | 0                  | 0.5842860973430516    | 2.7779047619047623 | 0                       | 0.017441376040091088        | 21                     | 0                  | 0.5842860973430516    | 2.7779047619047623 | 0                          | 0.017441376040091088
 triton_per_fused__native_batch_norm_legi | 16                  | 0                  | 0.11509365173486417   | 3.5959375000000002 | 0                       | 0.0034356313950705724       | 16                     | 0                  | 0.11509365173486417   | 3.5959375000000002 | 0                          | 0.0034356313950705724
 triton_poi_fused__native_batch_norm_legi | 14                  | 0                  | 0.1704672000243914    | 2.4044285714285714 | 0                       | 0.00508857313505646         | 14                     | 0                  | 0.1704672000243914    | 2.4044285714285714 | 0                          | 0.00508857313505646
 triton_poi_fused__native_batch_norm_legi | 58                  | 0                  | 2.307520779930795     | 8.190706896551722  | 0                       | 0.06888121731136704         | 58                     | 0                  | 2.307520779930795     | 8.190706896551722  | 0                          | 0.06888121731136704
 triton_per_fused__native_batch_norm_legi | 29                  | 0                  | 0.037243248971881276  | 3.0277586206896556 | 0                       | 0.001111738775280038        | 29                     | 0                  | 0.037243248971881276  | 3.0277586206896556 | 0                          | 0.001111738775280038
 triton_poi_fused__native_batch_norm_legi | 20                  | 0                  | 0.04741699795428918   | 2.2911500000000005 | 0                       | 0.0014154327747549007       | 20                     | 0                  | 0.04741699795428918   | 2.2911500000000005 | 0                          | 0.0014154327747549007
 triton_per_fused__native_batch_norm_legi | 25                  | 0                  | 0.13357016893727824   | 3.37536            | 0                       | 0.003987169222008305        | 25                     | 0                  | 0.13357016893727824   | 3.37536            | 0                          | 0.003987169222008305
 triton_poi_fused__native_batch_norm_legi | 13                  | 0                  | 0.3089862268300253    | 2.8111538461538457 | 0                       | 0.009223469457612694        | 13                     | 0                  | 0.3089862268300253    | 2.8111538461538457 | 0                          | 0.009223469457612694
 triton_poi_fused__native_batch_norm_legi | 17                  | 0                  | 0.3129385387909844    | 2.673              | 0                       | 0.009341448919133863        | 17                     | 0                  | 0.3129385387909844    | 2.673              | 0                          | 0.009341448919133863
 triton_per_fused__native_batch_norm_legi | 19                  | 0                  | 0.2215568162533158    | 3.8837368421052636 | 0                       | 0.0066136363060691275       | 19                     | 0                  | 0.2215568162533158    | 3.8837368421052636 | 0                          | 0.0066136363060691275
 std::enable_if<!(false), void>::type int | 23                  | 504916805.19297093 | 1.0118296096314707    | 8.113913043478261  | 0.0025893169497075447   | 0.030203868944223014        | 23                     | 504916805.19297093 | 1.0118296096314707    | 8.113913043478261  | 0.0025893169497075447      | 0.030203868944223014
 triton_poi_fused_add_copy__38            | 56                  | 0                  | 0                     | 2.132482142857143  | 0                       | 0                           | 56                     | 0                  | 0                     | 2.132482142857143  | 0                          | 0
 triton_poi_fused_convolution_0           | 18                  | 0                  | 0.43458610794936897   | 2.773333333333334  | 0                       | 0.012972719640279667        | 18                     | 0                  | 0.43458610794936897   | 2.773333333333334  | 0                          | 0.012972719640279667
 triton_poi_fused_convolution_1           | 17                  | 0                  | 0.028816312469162712  | 2.6145882352941174 | 0                       | 0.0008601884319153051       | 17                     | 0                  | 0.028816312469162712  | 2.6145882352941174 | 0                          | 0.0008601884319153051
 void convolve_common_engine_float_NHWC<f | 44                  | 8641868995.31118   | 0.024730540008465626  | 25.87327272727273  | 0.04431727689903169     | 0.0007382250748795709       | 44                     | 8641868995.31118   | 0.024730540008465626  | 25.87327272727273  | 0.04431727689903169        | 0.0007382250748795709
 triton_per_fused__native_batch_norm_legi | 12                  | 0                  | 0.6809930918986744    | 4.82675            | 0                       | 0.020328151996975356        | 12                     | 0                  | 0.6809930918986744    | 4.82675            | 0                          | 0.020328151996975356
 triton_per_fused__native_batch_norm_legi | 14                  | 0                  | 0.02883030597936608   | 2.6651428571428575 | 0                       | 0.0008606061486377935       | 14                     | 0                  | 0.02883030597936608   | 2.6651428571428575 | 0                          | 0.0008606061486377935
 triton_per_fused__native_batch_norm_legi | 16                  | 0                  | 0.0014658988233201874 | 2.098              | 0                       | 4.375817383045335e-05       | 16                     | 0                  | 0.0014658988233201874 | 2.098              | 0                          | 4.375817383045335e-05
 triton_poi_fused__native_batch_norm_legi | 13                  | 0                  | 0.9926297180284697    | 3.2367692307692306 | 0                       | 0.02963073785159611         | 13                     | 0                  | 0.9926297180284697    | 3.2367692307692306 | 0                          | 0.02963073785159611
 triton_poi_fused__native_batch_norm_legi | 9                   | 0                  | 1.3008817095666507    | 3.0863333333333336 | 0                       | 0.03883228983781048         | 9                      | 0                  | 1.3008817095666507    | 3.0863333333333336 | 0                          | 0.03883228983781048
 void at::native::(anonymous namespace):: | 98                  | 0                  | 0.09174335613709389   | 4.408520408163265  | 0                       | 0.0027386076458833994       | 98                     | 0                  | 0.09174335613709389   | 4.408520408163265  | 0                          | 0.0027386076458833994
 void at::native::vectorized_elementwise_ | 7                   | 0                  | 0                     | 1.7278571428571428 | 0                       | 0                           | 7                      | 0                  | 0                     | 1.7278571428571428 | 0                          | 0
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149697
Approved by: https://github.com/eellison, https://github.com/shunting314
2025-07-01 16:51:03 +00:00
0f9c1b374f [dynamo] Ensure global state guard is preserved across serialization. (#157285)
Currently, every time we construct a GLOBAL_STATE guard, we create a fresh guard based on the current global state. For precompile, we want to be able to create the GLOBAL_STATE guard from some external source, e.g. serialized global state. This also covers the normal case where we just pass in the global state guard from Python.

Differential Revision: [D77400988](https://our.internmc.facebook.com/intern/diff/D77400988/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157285
Approved by: https://github.com/jansel
2025-07-01 15:46:34 +00:00
b146e1a264 [BE] remove duplicates in generated torch._VF.__all__ (#157365)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157365
Approved by: https://github.com/Skylion007
2025-07-01 15:43:20 +00:00
c78fce9e79 [dynamo] show frame information when recompilation is triggered on fail_on_recompile (#156433)
Adds more information to the error message for debugging.

Example error message:
```
Detected recompile when torch.compile stance is 'fail_on_recompile'. filename: 'caffe2/test/dynamo/test_misc.py', function name: 'fn', line number: 0
Failed on the following precompiled guards:

TREE_GUARD_MANAGER:
+- RootGuardManager
| +- LAMBDA_GUARD: isinstance(L['x'], bool)
GuardDebugInfo(
result=0,
verbose_code_parts=["isinstance(L['x'], bool)"],
num_guards_executed=1)
```
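For context, a minimal sketch of how this stance gets hit (uses `torch.compiler.set_stance`; the exact exception text above comes from this PR):

```py
# Sketch only: force a recompile after setting the fail_on_recompile stance.
import torch

@torch.compile
def fn(x, flag):
    return x + 1 if flag else x - 1

fn(torch.randn(4), True)                        # first compile, guards on flag=True
torch.compiler.set_stance("fail_on_recompile")
try:
    fn(torch.randn(4), False)                   # guard fails -> recompile attempt
except RuntimeError as e:
    print(e)                                    # now includes frame info and failed guards
```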

Differential Revision: [D76987126](https://our.internmc.facebook.com/intern/diff/D76987126/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156433
Approved by: https://github.com/jamesjwu
2025-07-01 15:15:58 +00:00
023887fc5a Revert "Switch to standard pep517 sdist generation (#152098)"
This reverts commit f16053f0c9a09fa337fbf85aaf64f88712b8dcdb.

Reverted https://github.com/pytorch/pytorch/pull/152098 on behalf of https://github.com/malfet due to IMO this PR needs to be split into few helper ones, with better test plan ([comment](https://github.com/pytorch/pytorch/pull/152098#issuecomment-3024223880))
2025-07-01 14:14:52 +00:00
1586521461 Revert "Compute contiguity symbolically to avoid dde, and introduce c++ sym_is_contiguous (#155590)"
This reverts commit 2c76f31221e117b217b8a6a96a5405f626d2218a.

Reverted https://github.com/pytorch/pytorch/pull/155590 on behalf of https://github.com/jeanschmidt due to Breaking 1000s of internal builds, it cant be properly landed internally, there are no options except revert and codev. ([comment](https://github.com/pytorch/pytorch/pull/155590#issuecomment-3023503929))
2025-07-01 11:23:00 +00:00
534c454e77 Revert "[xla hash update] update the pinned xla hash (#156584)"
This reverts commit b1a54fab9bcb0cc167773f9a885d4170447e1c68.

Reverted https://github.com/pytorch/pytorch/pull/156584 on behalf of https://github.com/jeanschmidt due to Need to revert in order to revert https://github.com/pytorch/pytorch/pull/155590 ([comment](https://github.com/pytorch/pytorch/pull/156584#issuecomment-3023492421))
2025-07-01 11:20:05 +00:00
13bf2655c1 Revert "HF loads dcp - don't do a full deserialize on every file (#155942)"
This reverts commit 117db5601d78cbc746b35eef71fc815e042e903f.

Reverted https://github.com/pytorch/pytorch/pull/155942 on behalf of https://github.com/jeanschmidt due to Newly introduced tests are red internally, more details on D76442012 ([comment](https://github.com/pytorch/pytorch/pull/155942#issuecomment-3023473036))
2025-07-01 11:15:08 +00:00
0bce390269 Revert "[dynamo] Add fx_graph_runnable test coverage (#157021)"
This reverts commit 20e40492b046b9287726d3ec656117e4dc38f0e2.

Reverted https://github.com/pytorch/pytorch/pull/157021 on behalf of https://github.com/jeanschmidt due to New tests are red internally, more details on D77471538 ([comment](https://github.com/pytorch/pytorch/pull/157021#issuecomment-3023455082))
2025-07-01 11:10:45 +00:00
a767e50adc remove allow-untyped-defs from torch/fx/experimental/migrate_gradual_types/util.py (#157236)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157236
Approved by: https://github.com/ezyang
2025-07-01 10:36:48 +00:00
210632fae1 [ROCm] support experimental CU carveout (#149466)
Fixes #149280.  Follow up to #147966, but now available for ROCm.

Since hipblaslt does not support HIPBLASLT_MATMUL_DESC_CU_COUNT_TARGET, we instead create a hipStream that has a CU mask applied.  We pass this masked stream to hipblaslt instead of pytorch's current stream.  We ensure stream ordering between streams using hipEvents and stream synchronization.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149466
Approved by: https://github.com/malfet, https://github.com/atalman
2025-07-01 08:54:52 +00:00
0596323c35 Better fix for __index__ SymInt issue (#157201)
This improves on #156928

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157201
Approved by: https://github.com/ezyang
2025-07-01 07:06:46 +00:00
c202a7329a Revert "Fixes for CPython int/float tests (#155978)"
This reverts commit 23491519d288dedb2a54cfad5fef7fcb2ad8eade.

Reverted https://github.com/pytorch/pytorch/pull/155978 on behalf of https://github.com/XuehaiPan due to sys.get_int_max_str_digits is not always available ([comment](https://github.com/pytorch/pytorch/pull/155978#issuecomment-3021990027))
2025-07-01 06:16:49 +00:00
754699610b [BE] always use uv pip if possible in pip_init.py for lintrunner init (#157199)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157199
Approved by: https://github.com/ezyang
2025-07-01 06:07:29 +00:00
8f0998aafe Check F2C BLAS for OpenBLAS and other vendors (#143846)
This issue came from https://github.com/conda-forge/pytorch-cpu-feedstock/issues/180. MKL follows the F2C convention for returning single precision floats as doubles and uses the G77 convention for returning complex valued scalars. OpenBLAS does the opposite. There is a check for this already, but it is done only when the Generic BLAS vendor code path is used; this PR moves that code to `Dependencies.cmake` so that it also works when the BLAS vendor is OpenBLAS or another vendor.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/143846
Approved by: https://github.com/rgommers, https://github.com/atalman
2025-07-01 05:56:24 +00:00
04bd7e6850 [ROCm] Remove use of warpsize on host-side compilation (#156979)
Changes needed for ROCm7.0:
* `warpSize` is _not_ a compile-time constant on device-side compilation for ROCm anymore
* `warpSize` is _not_ defined on host-side compilation, hence `at::cuda::warp_size()` must be used to query warpsize at runtime
* Redefining `C10_WARP_SIZE` to be a compile-time constant, with a reasonable value for device-side compilation, but an unreasonable value of 1 for host-side compilation

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156979
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-07-01 04:55:31 +00:00
c811f41cf5 [BE] Remove unused variable from Pooling.metal (#157332)
Fixes following compilation warning
```
/Users/nshulga/git/pytorch/pytorch/aten/src/ATen/native/mps/kernels/Pooling.metal:101:21: warning: unused variable 'indices_sizes' [-Wunused-variable]
  constant int64_t* indices_sizes = params.indices_sizes.data();
                    ^

```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157332
Approved by: https://github.com/clee2000, https://github.com/huydhn, https://github.com/dcci
2025-07-01 04:28:04 +00:00
4d5d627e5f Remove super spammy log (#157157)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157157
Approved by: https://github.com/davidberard98
2025-07-01 03:51:58 +00:00
b40981c630 Fix incorrect stride handling in adaptive_avg_pool3d (#157326)
Fixes #157248

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157326
Approved by: https://github.com/eqy
ghstack dependencies: #157242
2025-07-01 03:03:48 +00:00
b5ce77c1f5 [ROCm] Initial AITER Integration for mha_bwd asm kernels (#152630)
Generates AITER plumbing via cmake. Calls into fav3 asm bwd CK kernels.

Update submodule composable kernel for this change

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152630
Approved by: https://github.com/xw285cornell, https://github.com/yoyoyocmu
2025-07-01 02:53:27 +00:00
f40efde2a4 [CI] Add prebuild command option, set prebuild command option for CI to build flash attention (#156236)
Build flash attention separately, using 2 jobs since it OOMs with more; the rest of the build uses 6.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156236
Approved by: https://github.com/malfet
2025-07-01 02:53:22 +00:00
3ed4384f5b [dynamo] temporarily disabling generation of weblinks for torch v2.8 release (#157299)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157299
Approved by: https://github.com/williamwen42
2025-07-01 02:31:17 +00:00
c174f3a6a5 [ONNX] Delete deprecated tutorial page link (#157310)
Related to https://github.com/pytorch/tutorials/issues/3420

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157310
Approved by: https://github.com/justinchuby
2025-07-01 01:18:26 +00:00
6dc2b22269 [ROCm][SymmetricMemory] Performance improvements for two-shot allreduce (#156746)
The biggest bottleneck that we found with two-shot allreduce was that the compiler was serializing all the load operations for some reason. To avoid these load delays, we've added de-serialization of loads. Along with this improvement, we also found that on AMD GPUs a different block and thread size gives a nice performance boost. Here are the bandwidth numbers I am getting with this PR:
![image](https://github.com/user-attachments/assets/57005856-4cb5-43cd-8e9c-46869f75ab0b)

The rows that are green are the tensor sizes that we are interested in, because two-shot is only used for bigger sizes (one-shot is used for smaller sizes). As we can see, our baseline numbers were consistently underperforming with respect to the fbgemm numbers. However, with this deserialize change, most of the tensor sizes have a performance boost (positive %) for the green tensors. There's one tensor with negative performance, but that's within the error margin.

co-authored by: @amd-hhashemi
https://github.com/pytorch/FBGEMM/issues/4072

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156746
Approved by: https://github.com/jeffdaily

Co-authored-by: Hashem Hashemi <hashem.hashemi@amd.com>
2025-07-01 00:37:30 +00:00
f860992db5 Add a custom profiler configuration option (#151656)
We aim to pass some configuration options to our custom Kineto backend via ExperimentalConfig, so we added a `custom_profiler_config` parameter.
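A sketch of how such an option might be threaded through (the `custom_profiler_config` keyword is the addition from this PR; treat its exact name and type as an assumption):

```py
# Sketch only: pass a backend-specific option string through the profiler's
# experimental config. `custom_profiler_config` is the new knob from this PR.
import torch
from torch._C._profiler import _ExperimentalConfig
from torch.profiler import ProfilerActivity, profile

cfg = _ExperimentalConfig(custom_profiler_config="my_backend_option=1")  # assumed kwarg
with profile(activities=[ProfilerActivity.CPU], experimental_config=cfg) as prof:
    torch.randn(128, 128) @ torch.randn(128, 128)
print(prof.key_averages().table(row_limit=5))
```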

Requires https://github.com/pytorch/kineto/pull/1077.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151656
Approved by: https://github.com/sraikund16
2025-07-01 00:36:09 +00:00
b60569ed94 HF - consolidate shards of safetensors files to full tensors in finish step (#156705)
We can consolidate the shards into full tensors, optionally behind a flag, in the finish step of DCP.save.
This also adds a thread count argument that is configurable by users; before, we were just using the default of 1.
Re-creating https://github.com/pytorch/pytorch/pull/155940 because it got into a bad detached state

Differential Revision: [D77231774](https://our.internmc.facebook.com/intern/diff/D77231774/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156705
Approved by: https://github.com/saumishr
ghstack dependencies: #154743
2025-07-01 00:30:48 +00:00
4ebd269065 [Testing] Remove duplicate MPSInductor tests (#157328)
They were added there before test_torchinductor was running in CI, but
the same cases are now covered by `GPUTests.test_pointwise_*_mps`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157328
Approved by: https://github.com/huydhn
2025-07-01 00:21:22 +00:00
7709ff5512 [remove untyped defs] batch 1 (#157011)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157011
Approved by: https://github.com/Skylion007
2025-06-30 23:54:40 +00:00
fee2377f9e Reapply D77381084 / #156964: Rename torch::standalone to headeronly (#157251)
Was reverted due to an internal failure, which should be fixed now. I believe Jane wants this reapplied and picked to release, and she's out this week.

Original summary:

headeronly is more clear, let's change the name before anyone depends on standalone

Differential Revision: [D77520173](https://our.internmc.facebook.com/intern/diff/D77520173/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157251
Approved by: https://github.com/janeyx99, https://github.com/Skylion007, https://github.com/desertfire
2025-06-30 23:25:30 +00:00
3dda80e990 Overload mul_overflows for size_t (#155736)
Partially fixes https://github.com/pytorch/executorch/pull/11537.

We want to extend `mul_overflows` to support `size_t` in ExecuTorch. The current workflow in ET checks that the `c10` code mirrors PT exactly, so the tests are failing.

See comment: https://github.com/pytorch/executorch/pull/11537#issuecomment-2963821312
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155736
Approved by: https://github.com/swolchok
2025-06-30 22:57:28 +00:00
42b48ee672 [dynamo][fsdp] Consistent behavior of int attributes (#157262)
Reimpl of https://github.com/pytorch/pytorch/pull/150954

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157262
Approved by: https://github.com/bdhirsh
2025-06-30 22:32:52 +00:00
a9352bd25e Script for consolidation of sharded safetensor files (#154743)
Script to consolidate sharded safetensors files with DCP into full tensors. This relies on file system operations to read and copy bytes directly instead of the traditional approach of loading and re-sharding and then saving again, because users will have models that are larger than allotted memory.

Differential Revision: [D75536985](https://our.internmc.facebook.com/intern/diff/D75536985/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154743
Approved by: https://github.com/saumishr
2025-06-30 22:25:58 +00:00
f096820d0f [precompile] Detect source code changes for save/load. (#156432)
Go through all dynamo-traced functions and compute a checksum for them. When loading a precompiled artifact back into memory, we always check the checksum and refuse to load if
source code changes are detected.
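A generic sketch of the checksum idea (this is not dynamo's internal implementation): hash the source of the traced functions at save time and compare at load time.

```py
# Sketch only: refuse to reuse a precompiled artifact if the source changed.
import hashlib
import inspect

def source_checksum(*fns) -> str:
    h = hashlib.sha256()
    for fn in fns:
        h.update(inspect.getsource(fn).encode("utf-8"))
    return h.hexdigest()

def my_model(x):
    return x * 2

saved = source_checksum(my_model)   # stored alongside the precompiled artifact
# ... later, at load time ...
if source_checksum(my_model) != saved:
    raise RuntimeError("source code changed; refusing to load precompiled artifact")
```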

Differential Revision: [D76987123](https://our.internmc.facebook.com/intern/diff/D76987123/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156432
Approved by: https://github.com/jansel, https://github.com/jamesjwu
2025-06-30 21:16:15 +00:00
d3efd73234 Revert "[cutlass backend][BE][ez] Make matmul layouts be row x column (#156656)"
This reverts commit 84c588e5eada9e7921608065edc444a15c22cb1c.

Reverted https://github.com/pytorch/pytorch/pull/156656 on behalf of https://github.com/henrylhtsang due to breaking fbcode A100 tests ([comment](https://github.com/pytorch/pytorch/pull/156656#issuecomment-3020769914))
2025-06-30 21:16:04 +00:00
3684be056d [dynamo] Fix source for lru_cache method (#157292)
Fixes - https://github.com/pytorch/pytorch/issues/157273

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157292
Approved by: https://github.com/zou3519, https://github.com/malfet, https://github.com/jansel
2025-06-30 20:53:57 +00:00
23491519d2 Fixes for CPython int/float tests (#155978)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155978
Approved by: https://github.com/zou3519
2025-06-30 19:42:11 +00:00
f16053f0c9 Switch to standard pep517 sdist generation (#152098)
Generate source tarball with PEP 517 conform build tools instead of the custom routine in place right now.

Closes #150461.

The current procedure for generating the source tarball consists of creating a source tree by manually copying and pruning source files.

This PR replaces that with a call to the standard [build tool](https://build.pypa.io/en/stable/), which works with the build backend to produce an sdist. For that to work correctly, the build backend also needs to be configured. In the case of Pytorch, the backend currently is (the legacy version of) the setuptools backend, the source dist part of which is mostly configured via the `MANIFEST.in` file.

The resulting source distribution can be used to install directly from source with `pip install ./torch-{version}.tar.gz` or to build wheels directly from source with `pip wheel ./torch-{version}.tar.gz`; both should be considered experimental for now.

## Issues

### sdist name
According to PEP 517, the name of the source distribution file must coincide with the project name, or [more precisely](https://peps.python.org/pep-0517/#source-distributions), the source distribution of a project that generates `{NAME}-{...}.whl` wheels is required to be named `{NAME}-{...}.tar.gz`. Currently, the source tarball is called `pytorch-{...}.tar.gz`, but the generated wheels and python package are called `torch-{...}`.

### Symbolic Links
The source tree at the moment contains a small number of symbolic links. This [has been seen as problematic](https://github.com/pypa/pip/issues/5919) largely because of lack of support on Windows, but also because of [a problem in setuptools](https://github.com/pypa/setuptools/issues/4937). Particularly unfortunate is a circular symlink in the third party `ittapi` module, which can not be resolved by replacing it with a copy.

PEP 721 (now integrated in the [Source Distribution Format Specification](https://packaging.python.org/en/latest/specifications/source-distribution-format/#source-distribution-archive-features)) allows for symbolic links, but only if they don't point outside the destination directory and if they don't contain `../` in their target.

The list of symbolic links currently is as follows:

<details>

|source|target|problem|solution|
|-|-|-|-|
| `.dockerignore` | `.gitignore` |  ok (individual file) ||
| `docs/requirements.txt` | `../.ci/docker/requirements-docs.txt` |`..` in target|swap source and target[^1]|
| `functorch/docs/source/notebooks` | `../../notebooks/` |`..` in target|swap source and target[^1]|
| `.github/ci_commit_pins/triton.txt` | `../../.ci/docker/ci_commit_pins/triton.txt` |  ok (omitted from sdist)||
| `third_party/flatbuffers/docs/source/CONTRIBUTING.md` | `../../CONTRIBUTING.md` |`..` in target|omit from sdist[^2]|
| `third_party/flatbuffers/java/src/test/java/DictionaryLookup` | `../../../../tests/DictionaryLookup` |`..` in target|omit from sdist[^3]|
| `third_party/flatbuffers/java/src/test/java/MyGame` | `../../../../tests/MyGame` |`..` in target|omit from sdist[^3]|
| `third_party/flatbuffers/java/src/test/java/NamespaceA` | `../../../../tests/namespace_test/NamespaceA` |`..` in target|omit from sdist[^3]|
| `third_party/flatbuffers/java/src/test/java/NamespaceC` | `../../../../tests/namespace_test/NamespaceC` |`..` in target|omit from sdist[^3]|
| `third_party/flatbuffers/java/src/test/java/optional_scalars` | `../../../../tests/optional_scalars` |`..` in target|omit from sdist[^3]|
| `third_party/flatbuffers/java/src/test/java/union_vector` | `../../../../tests/union_vector` |`..` in target|omit from sdist[^3]|
| `third_party/flatbuffers/kotlin/benchmark/src/jvmMain/java` | `../../../../java/src/main/java` |`..` in target|omit from sdist[^3]|
| `third_party/ittapi/rust/ittapi-sys/c-library` | `../../` |`..` in target|omit from sdist[^4]|
| `third_party/ittapi/rust/ittapi-sys/LICENSES` | `../../LICENSES` |`..` in target|omit from sdist[^4]|
| `third_party/opentelemetry-cpp/buildscripts/pre-merge-commit` | `./pre-commit` | ok (individual file)||
| `third_party/opentelemetry-cpp/third_party/prometheus-cpp/cmake/project-import-cmake/sample_client.cc` | `../../push/tests/integration/sample_client.cc` |`..` in target|omit from sdist[^5]|
| `third_party/opentelemetry-cpp/third_party/prometheus-cpp/cmake/project-import-cmake/sample_server.cc` | `../../pull/tests/integration/sample_server.cc` |`..` in target|omit from sdist[^5]|
| `third_party/opentelemetry-cpp/third_party/prometheus-cpp/cmake/project-import-pkgconfig/sample_client.cc` | `../../push/tests/integration/sample_client.cc` |`..` in target|omit from sdist[^5]|
| `third_party/opentelemetry-cpp/third_party/prometheus-cpp/cmake/project-import-pkgconfig/sample_server.cc` | `../../pull/tests/integration/sample_server.cc` |`..` in target|omit from sdist[^5]|
| `third_party/XNNPACK/tools/xngen` | `xngen.py` |  ok (individual file)||

</details>

The introduction of symbolic links inside the `.ci/docker` folder creates a new problem, however, because Docker's `COPY` command does not allow symlinks in this way. We work around that by using `tar ch` to dereference the symlinks before handing them over to `docker build`.

[^1]: These resources can be naturally considered to be part of the docs, so moving the actual files into the place of the current symlinks and replacing them with (unproblematic) symlinks can be said to improve semantics as well.

[^2]: The flatbuffers docs already actually use the original file, not the symlink and in the most recent releases, starting from flatbuffers-25.1.21 the symlink is replaced by the actual file thanks to a documentation overhaul.

[^3]: These resources are flatbuffers tests for java and kotlin and can be omitted from our sdist.

[^4]: We don't need to ship the rust bindings for ittapi.

[^5]: These are demonstration examples for how to link to prometheus-cpp using cmake and can be omitted.

### Nccl
NCCL used to be included as a submodule. However, with #146073 (first released in v2.7.0-rc1), the submodule was removed and replaced with a build-time checkout procedure in `tools/build_pytorch_libs.py`, which checks out the required version of NCCL from the upstream repository based on a commit pin recorded in `.ci/docker/ci_commit_pins/nccl-cu{11,12}.txt`.
This means that a crucial third-party dependency is missing from the source distribution, and since the `.ci` folder is omitted from the source distribution, the build-time download cannot be used either.
However, it *is* possible to use a system-provided NCCL via the `USE_SYSTEM_NCCL` environment variable, which is now also the default for the official PyTorch wheels.
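
As a rough sketch, a wheel build from the sdist against a system NCCL might look like the following (assumptions: a system NCCL is installed in a default prefix and the classic `setup.py` wheel build is used; the commented hint variables are hypothetical and depend on the local install):

```python
import os
import subprocess

env = dict(os.environ)
env["USE_SYSTEM_NCCL"] = "1"  # use the system NCCL instead of the build-time checkout
# env["NCCL_INCLUDE_DIR"] = "/usr/include"  # hypothetical hint for a non-default prefix
# env["NCCL_LIB_DIR"] = "/usr/lib"          # hypothetical hint for a non-default prefix
subprocess.check_call(["python", "setup.py", "bdist_wheel"], env=env)
```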

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152098
Approved by: https://github.com/atalman
2025-06-30 19:07:34 +00:00
c7b6c98d10 [tp] improve parallelize_module API to support more cases (#157182)
This PR improves the parallelize_module API to support more corner cases:
1. if a plan entry is specified as "", the style is applied to the current module
2. if a plan entry does not have a corresponding submodule to apply to, a warning is raised and the entry is ignored

While working on this PR, I also found that the while-loop inside is actually not necessary and could produce some nasty modify-while-iterating behavior, so I removed the while loop.
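
A hedged sketch of the two cases above (assumptions: a 1-D device mesh can be initialized in this environment, and the module and submodule names are made up for illustration):

```python
import torch.nn as nn
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.tensor.parallel import (
    ColwiseParallel,
    RowwiseParallel,
    parallelize_module,
)

mesh = init_device_mesh("cuda", (8,))

class Block(nn.Module):
    def __init__(self):
        super().__init__()
        self.up = nn.Linear(128, 512)
        self.down = nn.Linear(512, 128)

block = Block()
parallelize_module(block, mesh, {
    "up": ColwiseParallel(),
    "down": RowwiseParallel(),
    "does_not_exist": RowwiseParallel(),  # no matching submodule: warns and is ignored
})

# "" as a plan key targets the passed-in module itself:
parallelize_module(nn.Linear(128, 512), mesh, {"": ColwiseParallel()})
```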

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157182
Approved by: https://github.com/tianyu-l
2025-06-30 18:10:44 +00:00
d5e6f42094 Revert "Use std::string_view in torchgen (#157050)"
This reverts commit 064288cbab94c9931ca2296a2b9723e864f9050a.

Reverted https://github.com/pytorch/pytorch/pull/157050 on behalf of https://github.com/jeanschmidt due to Seems to have broken internal builds, more details on D77449943. @ezyang may I count on your help to get those changes merged? ([comment](https://github.com/pytorch/pytorch/pull/157050#issuecomment-3020222668))
2025-06-30 18:08:54 +00:00
efbf07e7ea Revert "[dynamo] Fix issue with tensors passed as view() shapes (#156928)"
This reverts commit 75f3e5a88df60caef27fd9c9df3fd51161378fcc.

Reverted https://github.com/pytorch/pytorch/pull/156928 on behalf of https://github.com/jeanschmidt due to Breaks a internal test, more details can be found on D77449971 ([comment](https://github.com/pytorch/pytorch/pull/156928#issuecomment-3020186268))
2025-06-30 17:56:01 +00:00
5e18bc3331 [PowerPC] Fixed build issue for vsx vec256 complexfloat and scaled_mm_out_cpu (#155255)
PyTorch build is failing on Power systems since this commit: ec24f8f58a74502c5a2488f5d9e85a817616dda0

***Build Failure Logs***

**Error related to mkldnn**
```
pytorch/aten/src/ATen/native/Blas.cpp:302:26: error: ‘cpuinfo_has_x86_amx_int8’ was not declared in this scope
  302 |     if ((!mixed_dtype && cpuinfo_has_x86_amx_int8()) ||
      |                          ^~~~~~~~~~~~~~~~~~~~~~~~
pytorch/aten/src/ATen/native/Blas.cpp:303:25: error: ‘cpuinfo_has_x86_amx_fp16’ was not declared in this scope
  303 |         (mixed_dtype && cpuinfo_has_x86_amx_fp16())) {
      |                         ^~~~~~~~~~~~~~~~~~~~~~~~

```

**Error related to vec256 complex float redefinition**
```
aten/src/ATen/cpu/vec/vec256/vsx/vec256_complex_float_vsx.h:19:7: error: specialization of ‘at::vec::DEFAULT::Vectorized<c10::complex<float> >’ after instantiation
   19 | class Vectorized<ComplexFlt> {
      |       ^~~~~~~~~~~~~~~~~~~~~~
aten/src/ATen/cpu/vec/vec256/vsx/vec256_complex_float_vsx.h:19:7: error: redefinition of ‘class at::vec::DEFAULT::Vectorized<c10::complex<float> >’

aten/src/ATen/cpu/vec/vec256/vsx/vec256_complex_float_vsx.h:633:18: error: ‘const class at::vec::DEFAULT::Vectorized<c10::complex<float> >’ has no member named ‘abs_2_’
  633 |   auto abs_a = a.abs_2_();
      |                  ^~~~~~
aten/src/ATen/cpu/vec/vec256/vsx/vec256_complex_float_vsx.h:634:18: error: ‘const class at::vec::DEFAULT::Vectorized<c10::complex<float> >’ has no member named ‘abs_2_’
  634 |   auto abs_b = b.abs_2_();
      |                  ^~~~~~

/aten/src/ATen/cpu/vec/vec256/vsx/vec256_complex_float_vsx.h:666:17: error: ‘const class at::vec::DEFAULT::Vectorized<c10::complex<float> >’ has no member named ‘vec0’
  666 |       vec_add(a.vec0(), b.vec0()), vec_add(a.vec1(), b.vec1())};
aten/src/ATen/cpu/vec/vec256/vsx/vec256_complex_float_vsx.h:673:17: error: ‘const class at::vec::DEFAULT::Vectorized<c10::complex<float> >’ has no member named ‘vec0’
  673 |       vec_sub(a.vec0(), b.vec0()), vec_sub(a.vec1(), b.vec1())};
      |                 ^~~~
aten/src/ATen/cpu/vec/vec256/vsx/vec256_complex_float_vsx.h:680:27: error: ‘const class at::vec::DEFAULT::Vectorized<c10::complex<float> >’ has no member named ‘vec0’
  680 |       vec_and(a.vec0(), b.vec0()), vec_and(a.vec1(), b.vec1())};
```

***Build logs with these changes***
```
Building wheel torch-2.8.0a0+gita3098a7
-- Building version 2.8.0a0+gita3098a7
-- Checkout nccl release tag: v2.26.5-1
cmake -GNinja -DBLAS=OpenBLAS -DBUILD_PYTHON=True -DBUILD_TEST=True -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX=/home/avanish/OfficeWork2025/JuneWork/pytorch_5Jun/pack/torch_night_5Jun/pytorch/torch -DCMAKE_PREFIX_PATH=/home/avanish/OfficeWork2025/JuneWork/pyenv/pytorch_5Jun/lib/python3.12/site-packages -DPython_EXECUTABLE=/home/avanish/OfficeWork2025/JuneWork/pyenv/pytorch_5Jun/bin/python -DTORCH_BUILD_VERSION=2.8.0a0+gita3098a7 -DUSE_MKLDNN=ON -DUSE_MKLDNN_CBLAS=ON -DUSE_NUMPY=True -DUSE_OPENMP=ON /home/avanish/OfficeWork2025/JuneWork/pytorch_5Jun/pack/torch_night_5Jun/pytorch
cmake --build . --target install --config Release
running build_ext
-- Building with NumPy bindings
-- Not using cuDNN
-- Not using CUDA
-- Not using XPU
-- Using MKLDNN
-- Not using Compute Library for the Arm architecture with MKLDNN
-- Using CBLAS in MKLDNN
-- Not using NCCL
-- Building with distributed package:
  -- USE_TENSORPIPE=True
  -- USE_GLOO=True
  -- USE_MPI=False
-- Building Executorch
-- Not using ITT
Copying functorch._C from functorch/functorch.so to /home/avanish/OfficeWork2025/JuneWork/pytorch_5Jun/pack/torch_night_5Jun/pytorch/build/lib.linux-ppc64le-cpython-312/functorch/_C.cpython-312-powerpc64le-linux-gnu.so
copying functorch/functorch.so -> /home/avanish/OfficeWork2025/JuneWork/pytorch_5Jun/pack/torch_night_5Jun/pytorch/build/lib.linux-ppc64le-cpython-312/functorch/_C.cpython-312-powerpc64le-linux-gnu.so
building 'torch._C' extension
creating build/temp.linux-ppc64le-cpython-312/torch/csrc

```

This patch fixes the PyTorch build issue on Power, and I am able to build successfully.

Hi @malfet  @albanD

Please review this PR for the PyTorch build issue that we are observing on Power.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155255
Approved by: https://github.com/albanD, https://github.com/malfet
2025-06-30 17:54:37 +00:00
2815eea0d0 [dtensor] relax device_mesh argument constraint in local_map (#157049)
This PR relaxes the device_mesh argument constraint in the local_map API. The current restriction is too strict, i.e. all the input arguments must have the same device mesh if they are DTensors. But many times the user might want to pass DTensors to this function that live on different device meshes, e.g. weight and activation could live on different meshes.

When using local_map, we extract the local tensors from the DTensors, and as long as the placements the user specified match the actual DTensor placements, the user knows clearly that the inputs are intended to live on different meshes. So this PR removes the same-mesh check and updates the doc to clearly document the behavior.

The `device_mesh` argument now serves one main purpose: it allows the user to specify the device_mesh for the output DTensor reconstruction.

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157049
Approved by: https://github.com/Chillee, https://github.com/zpcore
2025-06-30 17:51:48 +00:00
f8cc4c0af8 [inductor] Update triton_key import to support latest Triton (#157242)
With Triton main things were failing with:
```py
  File "/home/jansel/pytorch/torch/_inductor/codecache.py", line 205, in get_system
    from triton.compiler.compiler import triton_key
torch._dynamo.exc.BackendCompilerFailed: backend='inductor' raised:
ImportError: cannot import name 'triton_key' from 'triton.compiler.compiler' (/home/jansel/pytorch/triton/compiler/compiler.py)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157242
Approved by: https://github.com/aorenste
2025-06-30 17:51:43 +00:00
117db5601d HF loads dcp - don't do a full deserialize on every file (#155942)
Differential Revision: [D76442012](https://our.internmc.facebook.com/intern/diff/D76442012/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155942
Approved by: https://github.com/saumishr
ghstack dependencies: #155707
2025-06-30 17:45:10 +00:00
ed5d6d2a20 python definitely_contiguous-> is_contiguous_or_false (#156515)
We probably can avoid having those in python as well and  just depend on c++ impl after we land https://github.com/pytorch/pytorch/pull/155590 but that is for a different PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156515
Approved by: https://github.com/bobrenjc93
2025-06-30 17:31:51 +00:00
c038719731 Revert "Inductor logging + analysis of torch.profile (#149697)"
This reverts commit 347ace4c7ac2dbb14799089c30bd01a9ac312791.

Reverted https://github.com/pytorch/pytorch/pull/149697 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it seems to fail on ROCm ([comment](https://github.com/pytorch/pytorch/pull/149697#issuecomment-3020006655))
2025-06-30 16:58:54 +00:00
b54eac2a5e Upgrade to DLPack 1.0. (#145000)
This PR makes the necessary changes in order to upgrade PyTorch DLPack
support to version 1.0. In summary, we add support for the following:

- Support both `DLManagedTensor` and `DLManagedTensorVersioned` when
  producing and consuming DLPack capsules
- New parameter for `__dlpack__` method: `max_version`
- Version checks:
    - Fallback to old implementation if no `max_version` or if version
      lower than 1.0
    - Check that the to-be-consumed capsule is of version up to 1.X

In order to accommodate these new specifications, this PR adds the
following main changes:

- `torch._C._to_dlpack_versioned` Python API (Module.cpp): new Python
API for creating a versioned DLPack capsule (called by `__dlpack__`
method)
- `DLPackTraits<T>` class (DLConvertor.h): select the correct
traits (e.g. capsule name, conversion functions) depending on which
DLPack tensor class is being used
- `toDLPackImpl<T>` function (DLConvertor.cpp): populates the
common fields of both classes
- `fromDLPackImpl<T>` function (DLConvertor.cpp): constructs a tensor
from a DLPack capsule
- `fillVersion<T>` function (DLConvertor.cpp): populates the version
field for `DLManagedTensorVersioned` (no-op for `DLManagedTensor`)
- `tensor_fromDLPackImpl<T>` function (tensor_new.cpp): outer function
for constructing a tensor out of a DLPack capsule that also marks the
capsule as used
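
A hedged sketch of the resulting user-facing surface (assumptions: a build containing this PR, with the `max_version` keyword following the DLPack 1.0 protocol):

```python
import torch
from torch.utils.dlpack import from_dlpack

x = torch.arange(6, dtype=torch.float32).reshape(2, 3)

# Consumer path: torch.from_dlpack accepts any object implementing __dlpack__.
y = torch.from_dlpack(x)

# Producer path: request a versioned (DLPack 1.0) capsule; producers fall back to
# the legacy DLManagedTensor when max_version is absent or below 1.0.
capsule = x.__dlpack__(max_version=(1, 0))
z = from_dlpack(capsule)
```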

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145000
Approved by: https://github.com/albanD
2025-06-30 16:58:06 +00:00
39b71d11fc [Inductor] add pedantic to limit inductor code follow standard. (#156914)
### Background:

During my development work, I found that Windows MSVC doesn't support compiling zero-size arrays; please see: https://github.com/pytorch/pytorch/issues/153180

As discussed with an MSFT engineer, we found that zero-size arrays don't conform to the C++ standard, even though gcc/clang support them. When we add the `-pedantic` option to gcc, it checks and enforces the C++ standard strictly. Reference: https://github.com/pytorch/pytorch/issues/153180#issuecomment-2986676878

So this PR adds `-pedantic` to the torch inductor build option list to constrain codegen to generate standard-conforming C++ code.
Additionally, it also fixes a Halide zero-size array usage.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156914
Approved by: https://github.com/jansel
2025-06-30 16:29:08 +00:00
e3afbb0362 [inductor] Add typing to _inductor/ir.py (#149958)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149958
Approved by: https://github.com/Skylion007
2025-06-30 15:56:35 +00:00
eqy
3b4b5f8d47 [SDPA] Fix alloc_with_matching_layout stride sorting (#157145)
Otherwise dims with "zero" stride get moved before contiguous dims (stride 1).

Need to move the fix from #149282 to here as #154340 moved the original definition from `MHA.cpp`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157145
Approved by: https://github.com/Skylion007
2025-06-30 15:43:29 +00:00
da1f337bc4 Revert "Fixes for CPython int/float tests (#155978)"
This reverts commit fab53dfdf1d89cecd5e82b12cced9b6dd217e87c.

Reverted https://github.com/pytorch/pytorch/pull/155978 on behalf of https://github.com/guilhermeleobas due to failing in trunk ([comment](https://github.com/pytorch/pytorch/pull/155978#issuecomment-3019457531))
2025-06-30 14:49:44 +00:00
fab53dfdf1 Fixes for CPython int/float tests (#155978)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155978
Approved by: https://github.com/zou3519
2025-06-30 14:15:47 +00:00
ffaed8c569 Update slow tests (#155448)
This PR is auto-generated weekly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/weekly.yml).
Update the list of slow tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155448
Approved by: https://github.com/pytorchbot
2025-06-30 12:08:52 +00:00
b1a54fab9b [xla hash update] update the pinned xla hash (#156584)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned xla hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156584
Approved by: https://github.com/pytorchbot
2025-06-30 11:23:06 +00:00
ccb67f39b4 Enable the AMP precision with freezing for CPU nightly test (#152298)
Hi, @desertfire. Since we recommend that users use AMP precision and run with `--freezing` for CPU x86 Inductor inference, we suggest adding the AMP freezing test to the CPU nightly tests.
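
For illustration only, a minimal sketch of that recommended combination (assumptions: freezing is toggled via `torch._inductor.config.freezing`, which is what the `--freezing` flag maps to in the benchmark scripts, and the toy model is made up):

```python
import torch
import torch.nn as nn
import torch._inductor.config as inductor_config

inductor_config.freezing = True  # assumed equivalent of the --freezing benchmark flag

model = nn.Sequential(nn.Linear(64, 64), nn.ReLU()).eval()
x = torch.randn(8, 64)

with torch.no_grad(), torch.autocast("cpu", dtype=torch.bfloat16):
    out = torch.compile(model)(x)
```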

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152298
Approved by: https://github.com/desertfire, https://github.com/huydhn

Co-authored-by: zengxian <xiangdong.zeng@intel.com>
2025-06-30 09:17:17 +00:00
f79689bd3d updated matplotlib version in docs requirements (#155931)
Fixes #155199

The issue on main is due to an outdated version of matplotlib. I have bumped the version so that it is compatible with NumPy 2.0.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155931
Approved by: https://github.com/malfet
2025-06-30 02:05:53 +00:00
a1282b1823 [MPS] Add boilerplate sparse code support (#157238)
This PR makes minimal changes to support sparse tensors on MPS. In follow-up PRs I'll start adding different operations slowly so we can fix
https://github.com/pytorch/pytorch/issues/129842
which is highly requested (I assume because Whisper uses sparse tensors).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157238
Approved by: https://github.com/malfet
2025-06-30 01:53:45 +00:00
771be85704 [AOTI] Print out error msg when nvcc compiler fails (#157203)
Summary: To debug https://github.com/pytorch/pytorch/issues/156930. Not able to reproduce the problem locally.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157203
Approved by: https://github.com/jansel

Co-authored-by: Jason Ansel <jansel@meta.com>
2025-06-30 01:30:55 +00:00
86ced14453 increment pending_callbacks_counter before initiating the pt2 compile callbacks (#157185)
Summary: Since we increment the counter after performing the callback, we hit an assertion error when the callback raises an error and the increment never happens. Let's increment first to avoid that.

Test Plan:
tba

Rollback Plan:

Differential Revision: D77475650

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157185
Approved by: https://github.com/xmfan
2025-06-30 01:23:59 +00:00
12cb06e574 [inductor] Increase tolerance for test_comprehensive_nn_functional_linear_cuda_float16 (#156962)
Fixes #156514

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156962
Approved by: https://github.com/jamesjwu
2025-06-30 00:54:20 +00:00
cyy
c27f83dd91 Remove old ASAN Docker images (#157197)
The old ASAN jobs have been replaced.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157197
Approved by: https://github.com/Skylion007
2025-06-30 00:30:56 +00:00
11f7e2f145 [caffe][executorch] rename to avoid shadow in irange (#157107)
Summary:
D76832520 switched Executorch to use the caffe c10 headers. This copy contains a shadow, which is treated as an error for certain embedded compile flows.

Simple rename to avoid.

Test Plan:
CI

Rollback Plan:

Differential Revision: D77446104

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157107
Approved by: https://github.com/Skylion007
2025-06-30 00:17:09 +00:00
018e9826a2 [nativert] hook up memory planning to execution frame (#157053)
Summary: Pretty simple. If the planner exists, which implies that planning is enabled, create a manager for each frame. The associated serial executor will use the withMemoryPlannner fn to ensure the deallocation is done after execution completes.

Test Plan: CI

Differential Revision: D73635809

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157053
Approved by: https://github.com/henryoier, https://github.com/georgiaphillips
2025-06-30 00:06:37 +00:00
41f6acef83 Update pr_time_benchmarks expected results (#157214)
The job has been unstable

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157214
Approved by: https://github.com/laithsakka
2025-06-29 19:12:13 +00:00
29f76ec0f3 Revert "[BE] use pathlib.Path instead of os.path.* in setup.py (#156742)"
This reverts commit 2380115f9738f97cf706affefd647d2cb6dfbb3f.

Reverted https://github.com/pytorch/pytorch/pull/156742 on behalf of https://github.com/malfet due to Looks like it broke all ROCM tests, see 721d2580db/1 ([comment](https://github.com/pytorch/pytorch/pull/156742#issuecomment-3016937704))
2025-06-29 18:10:03 +00:00
721d2580db [dynamo][callbacks] temporarily disable TRITON_AUTOTUNING (#157186)
Differential Revision: D77476551

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157186
Approved by: https://github.com/burak-turk
2025-06-29 17:20:55 +00:00
aec569da23 [Triton] [Inductor] Add tt.descriptor_store to get_tma_stores (#157212)
Summary: Fixes a gap in the Triton update where the traversal would break because `get_tma_stores` didn't handle both TMA APIs.

Test Plan:
`buck test -m ovr_config//triton:beta  'fbcode//mode/dev-nosan' fbcode//ads_mkl/ops/tests:gdpa_dcpp_test -- --exact 'ads_mkl/ops/tests:gdpa_dcpp_test - test_gdpa_dcpp (ads_mkl.ops.tests.gdpa_dcpp_test.GdpaDCPPTest)'`

Rollback Plan:

Differential Revision: D77501582

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157212
Approved by: https://github.com/davidberard98
2025-06-29 16:44:52 +00:00
b147b6c0e3 Increase tolerance for test_corrcoef_cuda_int32 (#157206)
Fixes #156988
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157206
Approved by: https://github.com/Skylion007
2025-06-29 16:30:54 +00:00
e959dd017d [TSAN][live speech translation] Fix a data race in caffe2 (#156378)
Summary: Noticed that the context's quantized_engine is accessed and written from multiple threads.

Test Plan:
➜  fbsource buck test --flagfile fbcode/mode/dev-tsan //xplat/assistant/integration_test/tests/supernova/speechtranslation:live_speech_translation_en_fr_tests -- --exact 'fbsource//xplat/assistant/integration_test/tests/supernova/speechtranslation:live_speech_translation_en_fr_tests - Translate/LiveSpeechTranslationTests.LiveSpeechTranslationEnFr/silence___fr_en'

Rollback Plan:

Differential Revision: D76921416

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156378
Approved by: https://github.com/jerryzh168, https://github.com/cyyever
2025-06-29 07:23:20 +00:00
9d677389cb [async compile] make it more obvious that we support backwards (#157204)
currently failing with

```
(/home/bobren/local/a/pytorch-env) [13:02] devgpu009:/home/bobren/local/a/pytorch python test/inductor/test_compile_subprocess.py -k GPUTests.test_async
/home/bobren/local/a/pytorch/torch/backends/cudnn/__init__.py:115: UserWarning: PyTorch was compiled without cuDNN/MIOpen support. To use cuDNN/MIOpen, rebuild PyTorch making sure the library is visible to the build system.
  warnings.warn(
/home/bobren/local/a/pytorch/torch/_inductor/ops_handler.py:741: UserWarning: undefined OpHandler.__getstate__, please add missing op schema
  warnings.warn(f"undefined OpHandler.{name}, please add missing op schema")
/home/bobren/local/a/pytorch/torch/_inductor/ops_handler.py:741: UserWarning: undefined OpHandler.__getstate__, please add missing op schema
  warnings.warn(f"undefined OpHandler.{name}, please add missing op schema")
W0628 13:02:30.666000 3610483 torch/_inductor/compile_fx_ext.py:491] [0/0] Unable to pickle input graph or example inputs
W0628 13:02:30.666000 3610483 torch/_inductor/compile_fx_ext.py:491] [0/0] Traceback (most recent call last):
W0628 13:02:30.666000 3610483 torch/_inductor/compile_fx_ext.py:491] [0/0]   File "/home/bobren/local/a/pytorch/torch/_inductor/compile_fx_ext.py", line 484, in serialize_compile
W0628 13:02:30.666000 3610483 torch/_inductor/compile_fx_ext.py:491] [0/0]     ).serialize()
W0628 13:02:30.666000 3610483 torch/_inductor/compile_fx_ext.py:491] [0/0]   File "/home/bobren/local/a/pytorch/torch/_inductor/compile_fx_ext.py", line 210, in serialize
W0628 13:02:30.666000 3610483 torch/_inductor/compile_fx_ext.py:491] [0/0]     return _WireProtocolPickledInput(GraphPickler.dumps(self))
W0628 13:02:30.666000 3610483 torch/_inductor/compile_fx_ext.py:491] [0/0]   File "/home/bobren/local/a/pytorch/torch/fx/_graph_pickler.py", line 124, in dumps
W0628 13:02:30.666000 3610483 torch/_inductor/compile_fx_ext.py:491] [0/0]     pickler.dump(obj)
W0628 13:02:30.666000 3610483 torch/_inductor/compile_fx_ext.py:491] [0/0] AttributeError: Can't pickle local object 'make_opaque_bitwise_fn.<locals>.BitwiseFn'
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157204
Approved by: https://github.com/aorenste
2025-06-29 05:38:54 +00:00
347ace4c7a Inductor logging + analysis of torch.profile (#149697)
Prereqs:
 - https://github.com/pytorch/pytorch/pull/152708

Features:
1. Adds inductor's estimate of flops and bandwidth to the json trace events that perfetto uses.
1. Only use the tflops estimation from triton if we don't have the info from the datasheet because Triton's estimates are inaccurate. I have a backlog item to fix triton flops estimation upstream. New `DeviceInfo` class, and new function `get_device_tflops`.
1. New helpers `countable_fx` and `count_flops_fx` helps get the flops of an `fx.Node`.
1. Extends Triton `torch.profiler` logging to `DebugAutotuner`.
1. New script `profile_analysis.py`: `--augment_trace` adds perf estimates to any perfetto json trace, `--analyze` creates a summary table of these perf estimates, and `--diff` will compare two traces side by side:
```python
Device(NVIDIA H100, 0):
 Kernel Name                              | resnet Kernel Count | resnet FLOPS       | resnet bw gbps        | resnet Dur (ms)    | resnet Achieved FLOPS % | resnet Achieved Bandwidth % | newresnet Kernel Count | newresnet FLOPS    | newresnet bw gbps     | newresnet Dur (ms) | newresnet Achieved FLOPS % | newresnet Achieved Bandwidth %
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 triton_poi_fused__native_batch_norm_legi | 24                  | 0                  | 0.11395268248131513   | 2.5919166666666666 | 0                       | 0.003401572611382541        | 24                     | 0                  | 0.11395268248131513   | 2.5919166666666666 | 0                          | 0.003401572611382541
 sm90_xmma_fprop_implicit_gemm_f32f32_tf3 | 142                 | 16932673552.422373 | 0.2585007824198784    | 12.441619718309857 | 0.08683422334575583     | 0.007716441266265022        | 142                    | 16932673552.422373 | 0.2585007824198784    | 12.441619718309857 | 0.08683422334575583        | 0.007716441266265022
 triton_red_fused__native_batch_norm_legi | 39                  | 0                  | 0.13990024992108846   | 5.752589743589743  | 0                       | 0.004176126863316074        | 39                     | 0                  | 0.13990024992108846   | 5.752589743589743  | 0                          | 0.004176126863316074
 triton_poi_fused__native_batch_norm_legi | 25                  | 0                  | 0.31824055917536503   | 2.5291999999999994 | 0                       | 0.009499718184339253        | 25                     | 0                  | 0.31824055917536503   | 2.5291999999999994 | 0                          | 0.009499718184339253
 void cutlass::Kernel2<cutlass_80_tensoro | 98                  | 16211056473.596165 | 0.42972434051025826   | 7.130408163265306  | 0.08313362294151874     | 0.012827592254037562        | 98                     | 16211056473.596165 | 0.42972434051025826   | 7.130408163265306  | 0.08313362294151874        | 0.012827592254037562
 triton_red_fused__native_batch_norm_legi | 73                  | 0                  | 0.3225381327611705    | 9.987068493150682  | 0                       | 0.009628003963020014        | 73                     | 0                  | 0.3225381327611705    | 9.987068493150682  | 0                          | 0.009628003963020014
 triton_poi_fused__native_batch_norm_legi | 15                  | 0                  | 1.4491211346487216    | 4.439333333333333  | 0                       | 0.043257347302946926        | 15                     | 0                  | 1.4491211346487216    | 4.439333333333333  | 0                          | 0.043257347302946926
 void cutlass::Kernel2<cutlass_80_tensoro | 186                 | 14501701145.337954 | 0.2667131401910989    | 7.873865591397849  | 0.07436769818122027     | 0.007961586274361157        | 186                    | 14501701145.337954 | 0.2667131401910989    | 7.873865591397849  | 0.07436769818122027        | 0.007961586274361157
 triton_poi_fused__native_batch_norm_legi | 33                  | 0                  | 1.4924556538193923    | 4.3101515151515155 | 0                       | 0.044550915039384846        | 33                     | 0                  | 1.4924556538193923    | 4.3101515151515155 | 0                          | 0.044550915039384846
 triton_red_fused__native_batch_norm_legi | 29                  | 0                  | 0.25562590522631107   | 6.296275862068965  | 0                       | 0.007630624036606301        | 29                     | 0                  | 0.25562590522631107   | 6.296275862068965  | 0                          | 0.007630624036606301
 triton_poi_fused__native_batch_norm_legi | 13                  | 0                  | 0.5870562174192726    | 2.7397692307692307 | 0                       | 0.01752406619162008         | 13                     | 0                  | 0.5870562174192726    | 2.7397692307692307 | 0                          | 0.01752406619162008
 triton_poi_fused__native_batch_norm_legi | 34                  | 0                  | 0.41409928846284      | 2.853588235294117  | 0                       | 0.012361172789935523        | 34                     | 0                  | 0.41409928846284      | 2.853588235294117  | 0                          | 0.012361172789935523
 triton_per_fused__native_batch_norm_legi | 34                  | 0                  | 0.11705315007018151   | 3.460647058823529  | 0                       | 0.0034941238826919864       | 34                     | 0                  | 0.11705315007018151   | 3.460647058823529  | 0                          | 0.0034941238826919864
 triton_poi_fused__native_batch_norm_legi | 16                  | 0                  | 0.17207853197124584   | 2.3459375000000002 | 0                       | 0.005136672596156592        | 16                     | 0                  | 0.17207853197124584   | 2.3459375000000002 | 0                          | 0.005136672596156592
 triton_per_fused__native_batch_norm_legi | 30                  | 0                  | 0.2639714322022256    | 6.131199999999999  | 0                       | 0.007879744244842555        | 30                     | 0                  | 0.2639714322022256    | 6.131199999999999  | 0                          | 0.007879744244842555
 sm90_xmma_fprop_implicit_gemm_f32f32_tf3 | 100                 | 11875430356.891787 | 0.19494470869421385   | 16.36534           | 0.06089964285585531     | 0.005819245035648175        | 100                    | 11875430356.891787 | 0.19494470869421385   | 16.36534           | 0.06089964285585531        | 0.005819245035648175
 triton_poi_fused__native_batch_norm_legi | 8                   | 0                  | 0.9854096626224687    | 3.2757500000000004 | 0                       | 0.029415213809625928        | 8                      | 0                  | 0.9854096626224687    | 3.2757500000000004 | 0                          | 0.029415213809625928
 void cublasLt::splitKreduce_kernel<32, 1 | 56                  | 34377923395.147064 | 0.8310300045762317    | 3.4199999999999986 | 0.17629704305203628     | 0.024806865808245714        | 56                     | 34377923395.147064 | 0.8310300045762317    | 3.4199999999999986 | 0.17629704305203628        | 0.024806865808245714
 triton_poi_fused__native_batch_norm_legi | 23                  | 0                  | 0.9944002965861103    | 3.2431304347826084 | 0                       | 0.02968359094286896         | 23                     | 0                  | 0.9944002965861103    | 3.2431304347826084 | 0                          | 0.02968359094286896
 triton_per_fused__native_batch_norm_legi | 10                  | 0                  | 0.1826801058931057    | 4.428800000000001  | 0                       | 0.00545313748934644         | 10                     | 0                  | 0.1826801058931057    | 4.428800000000001  | 0                          | 0.00545313748934644
 triton_poi_fused__native_batch_norm_legi | 10                  | 0                  | 0.3168973585366449    | 2.5471999999999997 | 0                       | 0.009459622642884923        | 10                     | 0                  | 0.3168973585366449    | 2.5471999999999997 | 0                          | 0.009459622642884923
 triton_poi_fused__native_batch_norm_legi | 34                  | 0                  | 1.1463614897015777    | 4.124323529411764  | 0                       | 0.03421974596124114         | 34                     | 0                  | 1.1463614897015777    | 4.124323529411764  | 0                          | 0.03421974596124114
 void cask_plugin_cudnn::xmma_cudnn::init | 44                  | 44045510816.64277  | 2.0661232850348643    | 3.6887499999999993 | 0.22587441444432194     | 0.06167532194133924         | 44                     | 44045510816.64277  | 2.0661232850348643    | 3.6887499999999993 | 0.22587441444432194        | 0.06167532194133924
 sm90_xmma_fprop_implicit_gemm_f32f32_tf3 | 95                  | 7876855400.165316  | 0.4694941555946739    | 18.224315789473682 | 0.04039413025725802     | 0.014014750913273854        | 95                     | 7876855400.165316  | 0.4694941555946739    | 18.224315789473682 | 0.04039413025725802        | 0.014014750913273854
 triton_per_fused__native_batch_norm_legi | 41                  | 0                  | 0.06825669875995298   | 3.0384146341463416 | 0                       | 0.002037513395819492        | 41                     | 0                  | 0.06825669875995298   | 3.0384146341463416 | 0                          | 0.002037513395819492
 triton_poi_fused__native_batch_norm_legi | 23                  | 0                  | 0.08808154712430301   | 2.3275652173913044 | 0                       | 0.0026292999141582997       | 23                     | 0                  | 0.08808154712430301   | 2.3275652173913044 | 0                          | 0.0026292999141582997
 triton_per_fused__native_batch_norm_legi | 40                  | 0                  | 0.18179321034952417   | 4.556825           | 0                       | 0.005426662995508183        | 40                     | 0                  | 0.18179321034952417   | 4.556825           | 0                          | 0.005426662995508183
 triton_poi_fused__native_batch_norm_legi | 15                  | 0                  | 0.5887415155454232    | 2.783866666666667  | 0                       | 0.017574373598370836        | 15                     | 0                  | 0.5887415155454232    | 2.783866666666667  | 0                          | 0.017574373598370836
 void cutlass::Kernel2<cutlass_80_tensoro | 38                  | 14242013806.264643 | 0.256592404353939     | 7.217631578947369  | 0.0730359682372546      | 0.007659474756834           | 38                     | 14242013806.264643 | 0.256592404353939     | 7.217631578947369  | 0.0730359682372546         | 0.007659474756834
 triton_poi_fused__native_batch_norm_legi | 21                  | 0                  | 0.5842860973430516    | 2.7779047619047623 | 0                       | 0.017441376040091088        | 21                     | 0                  | 0.5842860973430516    | 2.7779047619047623 | 0                          | 0.017441376040091088
 triton_per_fused__native_batch_norm_legi | 16                  | 0                  | 0.11509365173486417   | 3.5959375000000002 | 0                       | 0.0034356313950705724       | 16                     | 0                  | 0.11509365173486417   | 3.5959375000000002 | 0                          | 0.0034356313950705724
 triton_poi_fused__native_batch_norm_legi | 14                  | 0                  | 0.1704672000243914    | 2.4044285714285714 | 0                       | 0.00508857313505646         | 14                     | 0                  | 0.1704672000243914    | 2.4044285714285714 | 0                          | 0.00508857313505646
 triton_poi_fused__native_batch_norm_legi | 58                  | 0                  | 2.307520779930795     | 8.190706896551722  | 0                       | 0.06888121731136704         | 58                     | 0                  | 2.307520779930795     | 8.190706896551722  | 0                          | 0.06888121731136704
 triton_per_fused__native_batch_norm_legi | 29                  | 0                  | 0.037243248971881276  | 3.0277586206896556 | 0                       | 0.001111738775280038        | 29                     | 0                  | 0.037243248971881276  | 3.0277586206896556 | 0                          | 0.001111738775280038
 triton_poi_fused__native_batch_norm_legi | 20                  | 0                  | 0.04741699795428918   | 2.2911500000000005 | 0                       | 0.0014154327747549007       | 20                     | 0                  | 0.04741699795428918   | 2.2911500000000005 | 0                          | 0.0014154327747549007
 triton_per_fused__native_batch_norm_legi | 25                  | 0                  | 0.13357016893727824   | 3.37536            | 0                       | 0.003987169222008305        | 25                     | 0                  | 0.13357016893727824   | 3.37536            | 0                          | 0.003987169222008305
 triton_poi_fused__native_batch_norm_legi | 13                  | 0                  | 0.3089862268300253    | 2.8111538461538457 | 0                       | 0.009223469457612694        | 13                     | 0                  | 0.3089862268300253    | 2.8111538461538457 | 0                          | 0.009223469457612694
 triton_poi_fused__native_batch_norm_legi | 17                  | 0                  | 0.3129385387909844    | 2.673              | 0                       | 0.009341448919133863        | 17                     | 0                  | 0.3129385387909844    | 2.673              | 0                          | 0.009341448919133863
 triton_per_fused__native_batch_norm_legi | 19                  | 0                  | 0.2215568162533158    | 3.8837368421052636 | 0                       | 0.0066136363060691275       | 19                     | 0                  | 0.2215568162533158    | 3.8837368421052636 | 0                          | 0.0066136363060691275
 std::enable_if<!(false), void>::type int | 23                  | 504916805.19297093 | 1.0118296096314707    | 8.113913043478261  | 0.0025893169497075447   | 0.030203868944223014        | 23                     | 504916805.19297093 | 1.0118296096314707    | 8.113913043478261  | 0.0025893169497075447      | 0.030203868944223014
 triton_poi_fused_add_copy__38            | 56                  | 0                  | 0                     | 2.132482142857143  | 0                       | 0                           | 56                     | 0                  | 0                     | 2.132482142857143  | 0                          | 0
 triton_poi_fused_convolution_0           | 18                  | 0                  | 0.43458610794936897   | 2.773333333333334  | 0                       | 0.012972719640279667        | 18                     | 0                  | 0.43458610794936897   | 2.773333333333334  | 0                          | 0.012972719640279667
 triton_poi_fused_convolution_1           | 17                  | 0                  | 0.028816312469162712  | 2.6145882352941174 | 0                       | 0.0008601884319153051       | 17                     | 0                  | 0.028816312469162712  | 2.6145882352941174 | 0                          | 0.0008601884319153051
 void convolve_common_engine_float_NHWC<f | 44                  | 8641868995.31118   | 0.024730540008465626  | 25.87327272727273  | 0.04431727689903169     | 0.0007382250748795709       | 44                     | 8641868995.31118   | 0.024730540008465626  | 25.87327272727273  | 0.04431727689903169        | 0.0007382250748795709
 triton_per_fused__native_batch_norm_legi | 12                  | 0                  | 0.6809930918986744    | 4.82675            | 0                       | 0.020328151996975356        | 12                     | 0                  | 0.6809930918986744    | 4.82675            | 0                          | 0.020328151996975356
 triton_per_fused__native_batch_norm_legi | 14                  | 0                  | 0.02883030597936608   | 2.6651428571428575 | 0                       | 0.0008606061486377935       | 14                     | 0                  | 0.02883030597936608   | 2.6651428571428575 | 0                          | 0.0008606061486377935
 triton_per_fused__native_batch_norm_legi | 16                  | 0                  | 0.0014658988233201874 | 2.098              | 0                       | 4.375817383045335e-05       | 16                     | 0                  | 0.0014658988233201874 | 2.098              | 0                          | 4.375817383045335e-05
 triton_poi_fused__native_batch_norm_legi | 13                  | 0                  | 0.9926297180284697    | 3.2367692307692306 | 0                       | 0.02963073785159611         | 13                     | 0                  | 0.9926297180284697    | 3.2367692307692306 | 0                          | 0.02963073785159611
 triton_poi_fused__native_batch_norm_legi | 9                   | 0                  | 1.3008817095666507    | 3.0863333333333336 | 0                       | 0.03883228983781048         | 9                      | 0                  | 1.3008817095666507    | 3.0863333333333336 | 0                          | 0.03883228983781048
 void at::native::(anonymous namespace):: | 98                  | 0                  | 0.09174335613709389   | 4.408520408163265  | 0                       | 0.0027386076458833994       | 98                     | 0                  | 0.09174335613709389   | 4.408520408163265  | 0                          | 0.0027386076458833994
 void at::native::vectorized_elementwise_ | 7                   | 0                  | 0                     | 1.7278571428571428 | 0                       | 0                           | 7                      | 0                  | 0                     | 1.7278571428571428 | 0                          | 0
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149697
Approved by: https://github.com/eellison, https://github.com/shunting314
2025-06-29 05:00:47 +00:00
f8293116f5 [BE][13/16] fix typos in torch/ (torch/ao/) (#156603)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156603
Approved by: https://github.com/msaroufim
2025-06-29 04:34:04 +00:00
1913c915e0 Fixes issue #156414: Fixes bug in implementation of _combine_histograms. (#156457)
Fixes #156414

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156457
Approved by: https://github.com/jerryzh168
2025-06-29 04:30:28 +00:00
2796f31b5e [DCP] OSS Zero Overhead Checkpointing Implementation (#156207)
Summary: This diff updates DCP driver code/APIs to support Zero Overhead Checkpointing

Test Plan: Test with TorchTitan on this PR: https://github.com/pytorch/torchtitan/pull/1287

Differential Revision: D72391401

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156207
Approved by: https://github.com/teja-rao
2025-06-29 03:19:48 +00:00
bccb8473fe [ROCm] Allow use of rocSOLVER for Cholesky inversion. (#157154)
Fixes https://github.com/pytorch/pytorch/issues/155046

This change allows Cholesky inversion to use rocSOLVER. This is now also the default on ROCm for Cholesky inversion which aligns with the behavior on NVIDIA (which defaults to cuSOLVER for this linear algebra operation). This fix also gets around a memory access fault encountered in MAGMA for large matrices.

MAGMA can still be forced on ROCm by doing:
```
torch.backends.cuda.preferred_linalg_library(backend='magma')
```

Ran all Cholesky UT on ROCm and there were no regressions.
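
For reference, a small sketch of the operation this affects (assumption: a ROCm build that includes this change; sizes are illustrative):

```python
import torch

A = torch.randn(512, 512, device="cuda")
A = A @ A.mT + 512 * torch.eye(512, device="cuda")  # make it symmetric positive definite
L = torch.linalg.cholesky(A)
Ainv = torch.cholesky_inverse(L)  # now dispatches to rocSOLVER by default on ROCm
```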

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157154
Approved by: https://github.com/jeffdaily
2025-06-29 01:53:02 +00:00
6cc490d40b simplify max(1,x) to x when x known >=1 (#157189)
Creating contiguous strides creates an expression max(1, x). Often we know that x >= 1, in
which case we should simplify max(1, x) to x.

This appeared in two situations:
1) An internal user complained about statically_known_true(x == max(1, x)) failing (internal link: https://fb.workplace.com/groups/1028545332188949/permalink/1232958568414290).
This https://github.com/pytorch/pytorch/pull/155938 won't be needed with this.

2) Not simplifying the above could result in wrong ConstraintViolationErrors,
because we assume non-trivial single-arg guards shall evaporate; see the logic in the function
issue_guard in symbolic_shapes.py.

With this change we no longer throw ConstraintViolationErrors for the program below.
Not simplifying was blocking this [PR](https://github.com/pytorch/pytorch/pull/155590) from landing
internally, due to internal export tests throwing ConstraintViolationErrors
like
```
Constraints violated (width)!
  - Not all values of width = L['x'].size()[3] in the specified range 224 <= width <= 455 satisfy the generated guard max(1, 1 + (((-1) + L['x'].size()[3]) // 2)) == (1 + (((-1) + L['x'].size()[3]) // 2)).
```

```
x = torch.rand(10)
torch._dynamo.mark_dynamic(x, 0, max=20, min=5)

@torch.compile(fullgraph=True, dynamic=True)
def func(x):
    if max(1, (-1 + x.size()[0]//2)) == (-1+x.size()[0]//2):
        return x*400
    else:
        return (x*10)*100

func(x)

```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157189
Approved by: https://github.com/pianpwk
2025-06-29 01:16:30 +00:00
836bb1941b [hop] support torch.func.functional_call in hop subgraph (#155886)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155886
Approved by: https://github.com/zou3519
2025-06-28 23:47:46 +00:00
2380115f97 [BE] use pathlib.Path instead of os.path.* in setup.py (#156742)
Resolves:

- https://github.com/pytorch/pytorch/pull/155998#discussion_r2164376634

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156742
Approved by: https://github.com/malfet
2025-06-28 23:31:15 +00:00
90b973a2e2 [BE] parse CMake version from cmake -E capabilities instead of cmake --version (#157073)
`cmake -E capabilities` produces a JSON format that is more machine-friendly.

```console
$ cmake --version
cmake version 4.0.3

CMake suite maintained and supported by Kitware (kitware.com/cmake).
$ cmake -E capabilities | jq '.version.string'
"4.0.3"
$ cmake -E capabilities | jq
{
  "debugger": true,
  "fileApi": {
    "requests": [
      {
        "kind": "codemodel",
        "version": [
          {
            "major": 2,
            "minor": 8
          }
        ]
      },
      {
        "kind": "configureLog",
        "version": [
          {
            "major": 1,
            "minor": 0
          }
        ]
      },
      {
        "kind": "cache",
        "version": [
          {
            "major": 2,
            "minor": 0
          }
        ]
      },
      {
        "kind": "cmakeFiles",
        "version": [
          {
            "major": 1,
            "minor": 1
          }
        ]
      },
      {
        "kind": "toolchains",
        "version": [
          {
            "major": 1,
            "minor": 0
          }
        ]
      }
    ]
  },
  "generators": [
    {
      "extraGenerators": [],
      "name": "Watcom WMake",
      "platformSupport": false,
      "toolsetSupport": false
    },
    {
      "extraGenerators": [
        "Kate"
      ],
      "name": "Ninja Multi-Config",
      "platformSupport": false,
      "toolsetSupport": false
    },
    {
      "extraGenerators": [
        "CodeBlocks",
        "CodeLite",
        "Eclipse CDT4",
        "Kate",
        "Sublime Text 2"
      ],
      "name": "Ninja",
      "platformSupport": false,
      "toolsetSupport": false
    },
    {
      "extraGenerators": [],
      "name": "Xcode",
      "platformSupport": false,
      "toolsetSupport": true
    },
    {
      "extraGenerators": [
        "CodeBlocks",
        "CodeLite",
        "Eclipse CDT4",
        "Kate",
        "Sublime Text 2"
      ],
      "name": "Unix Makefiles",
      "platformSupport": false,
      "toolsetSupport": false
    }
  ],
  "serverMode": false,
  "tls": true,
  "version": {
    "isDirty": false,
    "major": 4,
    "minor": 0,
    "patch": 3,
    "string": "4.0.3",
    "suffix": ""
  }
}
```
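
As a minimal sketch of the parsing approach (assumption: `cmake` is on `PATH`; the field names follow the JSON shown above):

```python
import json
import subprocess

caps = json.loads(subprocess.check_output(["cmake", "-E", "capabilities"], text=True))
version_string = caps["version"]["string"]                      # e.g. "4.0.3"
major, minor = caps["version"]["major"], caps["version"]["minor"]
print(version_string, major, minor)
```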

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157073
Approved by: https://github.com/Skylion007
2025-06-28 23:20:10 +00:00
772d590415 [CUTLASS] [CUDA] SM100 GroupMM (#156203)
Closes https://github.com/pytorch/pytorch/issues/156202

This PR adds Blackwell support for GroupMM.

Most of the code used for SM90 can be reused; the kernel schedule has to be changed in accordance with https://docs.nvidia.com/cutlass/media/docs/cpp/blackwell_functionality.html

Did some preliminary benchmarking of H200 vs B200

Script
```py
import torch
print(torch.__file__)
device = torch.device("cuda")
dtype = torch.bfloat16

shapes = [
    (16, 128000, 7168, 7168),
    (128, 1, 2048, 7168)
]

for batch, M, N, K in shapes:
    a = torch.randn(batch, M, K, device=device, dtype=dtype)
    b = torch.randn(batch, N, K, device=device, dtype=dtype)

    start_event = torch.cuda.Event(enable_timing=True)
    end_event = torch.cuda.Event(enable_timing=True)
    for i in range(5): c = torch._grouped_mm(a, b)

    num_iter = 50
    start_event.record()

    for i in range(num_iter): c = torch._grouped_mm(a, b)
    end_event.record()

    torch.cuda.synchronize()
    elapsed_time_ms = start_event.elapsed_time(end_event)
    avg_time_ms = elapsed_time_ms / num_iter
    print(f"batch: {batch}\tM: {M}\tN: {N}\tK: {K}")
    print(f"Time per Iteration:\t {avg_time_ms:.4f} ms")
```

On H200
```
batch: 16	M: 128000	N: 7168	K: 7168
Time per Iteration:	 298.6668 ms
batch: 128	M: 1	N: 2048	K: 7168
Time per Iteration:	 4.1462 ms
```

B200
```
batch: 16       M: 128000       N: 7168 K: 7168
Time per Iteration:      190.7458 ms
batch: 128      M: 1    N: 2048 K: 7168
Time per Iteration:      3.0680 ms
```
nsys nvprof
```
root@16930b42ffc6:/workspace/pytorch# nsys nvprof python gemm_test.py
WARNING: python and any of its children processes will be profiled.

Collecting data...
batch: 16	M: 128000	N: 7168	K: 7168
Time per Iteration:	 192.6420 ms
batch: 128	M: 1	N: 2048	K: 7168
Time per Iteration:	 1.2255 ms
Generating '/tmp/nsys-report-6a53.qdstrm'
[1/7] [========================100%] report1.nsys-rep
[2/7] [========================100%] report1.sqlite
[3/7] Executing 'nvtx_sum' stats report
SKIPPED: /workspace/pytorch/report1.sqlite does not contain NV Tools Extension (NVTX) data.
[4/7] Executing 'cuda_api_sum' stats report

 Time (%)  Total Time (ns)  Num Calls    Avg (ns)      Med (ns)    Min (ns)   Max (ns)    StdDev (ns)                 Name
 --------  ---------------  ---------  ------------  ------------  --------  -----------  ------------  ---------------------------------
     98.9      10586895744          2  5293447872.0  5293447872.0  73786464  10513109280  7381715954.2  cudaDeviceSynchronize
      1.0        104084608          5    20816921.6    33552480.0    100800     34786208    18048125.3  cudaMalloc
      0.1          5694304          4     1423576.0     1416656.0   1258560      1602432      181668.1  cudaGetDeviceProperties_v2_v12000
      0.1          5430496        130       41773.0        4560.0      2496      3854368      345761.8  cudaLaunchKernel
      0.0           587584        110        5341.7        4992.0      4224        16992        1482.0  cudaLaunchKernelExC_v11060
      0.0           119200        660         180.6         128.0        96         4128         206.7  cudaGetDriverEntryPoint_v11030
      0.0            68352        660         103.6          64.0        32         4928         224.6  cuTensorMapEncodeTiled
      0.0            34976         49         713.8         224.0       160         6720        1343.4  cudaStreamIsCapturing_v10000
      0.0            32992          4        8248.0        7456.0      4128        13952        4804.4  cudaEventRecord
      0.0            16928          4        4232.0        3600.0      1728         8000        2764.7  cudaEventQuery
      0.0            16288          4        4072.0        3568.0      1952         7200        2396.1  cudaEventCreateWithFlags
      0.0            13632          4        3408.0        2672.0       544         7744        3408.7  cudaEventDestroy
      0.0             1056          1        1056.0        1056.0      1056         1056           0.0  cuModuleGetLoadingMode

[5/7] Executing 'cuda_gpu_kern_sum' stats report

 Time (%)  Total Time (ns)  Instances   Avg (ns)     Med (ns)    Min (ns)   Max (ns)   StdDev (ns)                                                  Name
 --------  ---------------  ---------  -----------  -----------  ---------  ---------  -----------  ----------------------------------------------------------------------------------------------------
     99.0      10549232845         55  191804233.5  192944479.0  165746368  203645313    5353204.3  void cutlass::device_kernel<at::cuda::detail::enable_3x_kernel_for_sm10<cutlass::gemm::kernel::Gemm…
      0.6         67327135         55    1224129.7    1330656.0     924320    1364928     182180.4  void cutlass::device_kernel<at::cuda::detail::enable_3x_kernel_for_sm10<cutlass::gemm::kernel::Gemm…
      0.3         34854783         20    1742739.1    1597856.0      10080    3899616     818421.2  void at::native::<unnamed>::distribution_elementwise_grid_stride_kernel<float, (int)4, void at::nat…
      0.0           354880        110       3226.2       3296.0       1920       4160        554.4  void at::cuda::detail::prepare_grouped_gemm_data<cutlass::bfloat16_t, cutlass::bfloat16_t, cutlass:…
```

The kernel names are too long to be shown via nvprof, I pasted this from nsight systems
```
small kernel 1SM
100.0%	1.286 ms	1	1.286 ms	1.286 ms	1.286 ms	1.286 ms	0 ns	void cutlass::device_kernel<at::cuda::detail::enable_3x_kernel_for_sm10<cutlass::gemm::kernel::GemmUniversal<cutlass::gemm::GroupProblemShape<cute::tuple<int, int, int>>, cutlass::gemm::collective::CollectiveMma<cutlass::gemm::MainloopSm100ArrayTmaUmmaWarpSpecialized<(int)3, (int)8, (int)2, cute::tuple<cute::C<(int)2>, cute::C<(int)1>, cute::C<(int)1>>>, cute::tuple<cute::C<(int)128>, cute::C<(int)256>, cute::C<(int)64>>, cutlass::bfloat16_t, cute::tuple<long, cute::C<(int)1>, cute::C<(int)0>> *, cutlass::bfloat16_t, cute::tuple<cute::C<(int)1>, long, cute::C<(int)0>> *, cute::TiledMMA<cute::MMA_Atom<cute::SM100_MMA_F16BF16_SS<cutlass::bfloat16_t, cutlass::bfloat16_t, float, (int)128, (int)256, (cute::UMMA::Major)0, (cute::UMMA::Major)1, (cute::UMMA::ScaleIn)0, (cute::UMMA::ScaleIn)0>>, cute::Layout<cute::tuple<cute::C<(int)1>, cute::C<(int)1>, cute::C<(int)1>>, cute::tuple<cute::C<(int)0>, cute::C<(int)0>, cute::C<(int)0>>>, cute::tuple<cute::Underscore, cute::Underscore, cute::Underscore>>, cute::SM90_TMA_LOAD, cute::ComposedLayout<cute::Swizzle<(int)3, (int)4, (int)3>, cute::smem_ptr_flag_bits<(int)16>, cute::Layout<cute::tuple<cute::C<(int)8>, cute::C<(int)64>>, cute::tuple<cute::C<(int)64>, cute::C<(int)1>>>>, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout<cute::Swizzle<(int)3, (int)4, (int)3>, cute::smem_ptr_flag_bits<(int)16>, cute::Layout<cute::tuple<cute::C<(int)64>, cute::C<(int)8>>, cute::tuple<cute::C<(int)1>, cute::C<(int)64>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue<cutlass::epilogue::Sm100PtrArrayTmaWarpSpecialized<(int)4, (int)2, (int)64, (bool)1, (bool)0>, cute::tuple<cute::C<(int)128>, cute::C<(int)256>, cute::C<(int)64>>, cute::tuple<cute::Layout<cute::C<(int)128>, cute::C<(int)1>>, cute::Layout<cute::C<(int)64>, cute::C<(int)1>>>, cutlass::bfloat16_t, cute::tuple<long, cute::C<(int)1>, cute::C<(int)0>> *, cutlass::bfloat16_t, cute::tuple<long, cute::C<(int)1>, cute::C<(int)0>> *, cutlass::epilogue::fusion::FusionCallbacks<cutlass::epilogue::Sm100PtrArrayTmaWarpSpecialized<(int)4, (int)2, (int)64, (bool)1, (bool)0>, cutlass::epilogue::fusion::LinearCombination<cutlass::bfloat16_t, float, cutlass::bfloat16_t, float, (cutlass::FloatRoundStyle)2>, cute::tuple<cute::C<(int)128>, cute::C<(int)256>, cute::C<(int)64>>, cute::tuple<cute::Layout<cute::C<(int)128>, cute::C<(int)1>>, cute::Layout<cute::C<(int)64>, cute::C<(int)1>>>, >, cute::SM100::TMEM::LOAD::SM100_TMEM_LOAD_32dp32b64x, cute::SM90_TMA_LOAD, cute::ComposedLayout<cute::Swizzle<(int)3, (int)4, (int)3>, cute::smem_ptr_flag_bits<(int)16>, cute::Layout<cute::tuple<cute::C<(int)8>, cute::C<(int)64>>, cute::tuple<cute::C<(int)64>, cute::C<(int)1>>>>, cute::AutoVectorizingCopyWithAssumedAlignment<(int)128>, cute::SM90_TMA_STORE, cute::ComposedLayout<cute::Swizzle<(int)3, (int)4, (int)3>, cute::smem_ptr_flag_bits<(int)16>, cute::Layout<cute::tuple<cute::C<(int)8>, cute::C<(int)64>>, cute::tuple<cute::C<(int)64>, cute::C<(int)1>>>>, cute::AutoVectorizingCopyWithAssumedAlignment<(int)128>, cute::AutoVectorizingCopyWithAssumedAlignment<(int)128>>, void, void>>>(T1::Params)

large kernel 2SM
100.0%	194.178 ms	1	194.178 ms	194.178 ms	194.178 ms	194.178 ms	0 ns	void cutlass::device_kernel<at::cuda::detail::enable_3x_kernel_for_sm10<cutlass::gemm::kernel::GemmUniversal<cutlass::gemm::GroupProblemShape<cute::tuple<int, int, int>>, cutlass::gemm::collective::CollectiveMma<cutlass::gemm::MainloopSm100ArrayTmaUmmaWarpSpecialized<(int)5, (int)8, (int)2, cute::tuple<cute::C<(int)2>, cute::C<(int)1>, cute::C<(int)1>>>, cute::tuple<cute::C<(int)256>, cute::C<(int)256>, cute::C<(int)64>>, cutlass::bfloat16_t, cute::tuple<long, cute::C<(int)1>, cute::C<(int)0>> *, cutlass::bfloat16_t, cute::tuple<cute::C<(int)1>, long, cute::C<(int)0>> *, cute::TiledMMA<cute::MMA_Atom<cute::SM100_MMA_F16BF16_2x1SM_SS<cutlass::bfloat16_t, cutlass::bfloat16_t, float, (int)256, (int)256, (cute::UMMA::Major)0, (cute::UMMA::Major)1, (cute::UMMA::ScaleIn)0, (cute::UMMA::ScaleIn)0>>, cute::Layout<cute::tuple<cute::C<(int)1>, cute::C<(int)1>, cute::C<(int)1>>, cute::tuple<cute::C<(int)0>, cute::C<(int)0>, cute::C<(int)0>>>, cute::tuple<cute::Underscore, cute::Underscore, cute::Underscore>>, cute::SM100_TMA_2SM_LOAD, cute::ComposedLayout<cute::Swizzle<(int)3, (int)4, (int)3>, cute::smem_ptr_flag_bits<(int)16>, cute::Layout<cute::tuple<cute::C<(int)8>, cute::C<(int)64>>, cute::tuple<cute::C<(int)64>, cute::C<(int)1>>>>, void, cute::identity, cute::SM100_TMA_2SM_LOAD, cute::ComposedLayout<cute::Swizzle<(int)3, (int)4, (int)3>, cute::smem_ptr_flag_bits<(int)16>, cute::Layout<cute::tuple<cute::C<(int)64>, cute::C<(int)8>>, cute::tuple<cute::C<(int)1>, cute::C<(int)64>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue<cutlass::epilogue::Sm100PtrArrayTmaWarpSpecialized<(int)4, (int)2, (int)64, (bool)1, (bool)0>, cute::tuple<cute::C<(int)128>, cute::C<(int)256>, cute::C<(int)64>>, cute::tuple<cute::Layout<cute::C<(int)128>, cute::C<(int)1>>, cute::Layout<cute::C<(int)64>, cute::C<(int)1>>>, cutlass::bfloat16_t, cute::tuple<long, cute::C<(int)1>, cute::C<(int)0>> *, cutlass::bfloat16_t, cute::tuple<long, cute::C<(int)1>, cute::C<(int)0>> *, cutlass::epilogue::fusion::FusionCallbacks<cutlass::epilogue::Sm100PtrArrayTmaWarpSpecialized<(int)4, (int)2, (int)64, (bool)1, (bool)0>, cutlass::epilogue::fusion::LinearCombination<cutlass::bfloat16_t, float, cutlass::bfloat16_t, float, (cutlass::FloatRoundStyle)2>, cute::tuple<cute::C<(int)128>, cute::C<(int)256>, cute::C<(int)64>>, cute::tuple<cute::Layout<cute::C<(int)128>, cute::C<(int)1>>, cute::Layout<cute::C<(int)64>, cute::C<(int)1>>>, >, cute::SM100::TMEM::LOAD::SM100_TMEM_LOAD_32dp32b64x, cute::SM90_TMA_LOAD, cute::ComposedLayout<cute::Swizzle<(int)3, (int)4, (int)3>, cute::smem_ptr_flag_bits<(int)16>, cute::Layout<cute::tuple<cute::C<(int)8>, cute::C<(int)64>>, cute::tuple<cute::C<(int)64>, cute::C<(int)1>>>>, cute::AutoVectorizingCopyWithAssumedAlignment<(int)128>, cute::SM90_TMA_STORE, cute::ComposedLayout<cute::Swizzle<(int)3, (int)4, (int)3>, cute::smem_ptr_flag_bits<(int)16>, cute::Layout<cute::tuple<cute::C<(int)8>, cute::C<(int)64>>, cute::tuple<cute::C<(int)64>, cute::C<(int)1>>>>, cute::AutoVectorizingCopyWithAssumedAlignment<(int)128>, cute::AutoVectorizingCopyWithAssumedAlignment<(int)128>>, void, void>>>(T1::Params)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156203
Approved by: https://github.com/syed-ahmed, https://github.com/drisspg
2025-06-28 23:02:00 +00:00
996206e66f cublaslt/hipblaslt persistent workspace (#156495)
Similar to cublas/hipblas, LT now allocates one workspace per handle+stream combo.

- fixes hipblaslt issue where memory use increased during graph capture
- preserves CUDA env var TORCH_CUBLASLT_UNIFIED_WORKSPACE
- moves LT workspace and size from CUDABlas.cpp into CublasHandlePool.cpp, new APIs
  - size_t getCUDABlasLtWorkspaceSize()
  - void* getCUDABlasLtWorkspace()

Fixes https://github.com/ROCm/pytorch/issues/2286.
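
For illustration, a rough sketch (assuming a CUDA/ROCm build; sizes, dtype, and the warm-up pattern are illustrative only, not part of this PR) of the graph-capture scenario the fix targets, where LT matmuls should reuse one workspace per handle+stream instead of growing memory during capture:

```python
import torch

a = torch.randn(2048, 2048, device="cuda", dtype=torch.bfloat16)
b = torch.randn(2048, 2048, device="cuda", dtype=torch.bfloat16)

# Warm up on a side stream (standard CUDA graph practice) so the
# per-handle+stream LT workspace is allocated before capture.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    for _ in range(3):
        a @ b
torch.cuda.current_stream().wait_stream(s)

g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):
    for _ in range(10):
        a @ b  # previously each LT call on hipblaslt could grow memory here
g.replay()
```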

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156495
Approved by: https://github.com/eqy
2025-06-28 22:38:43 +00:00
0629dfb860 Fix FSDP offload pin_memory bug (#157147)
Fixes #157146

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157147
Approved by: https://github.com/weifengpy
2025-06-28 21:09:11 +00:00
67f8270516 [ROCm] test_hip_device_count safely runs on 1 GPU systems (#156398)
Fixes test_cuda.py::TestCuda::test_hip_device_count in a single-GPU scenario

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156398
Approved by: https://github.com/jeffdaily
2025-06-28 20:17:26 +00:00
aeffb68d34 [schema_upgrader] add C++ upgrader for json based upgrading (#156761)
Differential Revision: [D77459912](https://our.internmc.facebook.com/intern/diff/D77459912)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156761
Approved by: https://github.com/angelayi
2025-06-28 18:15:06 +00:00
064a7db7fc [invoke_subgraph] turn on supports_input_mutation by default (#157177)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157177
Approved by: https://github.com/anijain2305
2025-06-28 18:14:47 +00:00
2eb744c08d Revert "[BE] parse CMake version from cmake -E capabilities instead of cmake --version (#157073)"
This reverts commit 0c58bdd8fb5f269aef100af8e2c43cfcf5f1f9dd.

Reverted https://github.com/pytorch/pytorch/pull/157073 on behalf of https://github.com/XuehaiPan due to break libtorch build on Windows ([comment](https://github.com/pytorch/pytorch/pull/157073#issuecomment-3015273679))
2025-06-28 13:40:19 +00:00
0c58bdd8fb [BE] parse CMake version from cmake -E capabilities instead of cmake --version (#157073)
`cmake -E capabilities` produces a JSON format that is more machine-friendly.

```console
$ cmake --version
cmake version 4.0.3

CMake suite maintained and supported by Kitware (kitware.com/cmake).
$ cmake -E capabilities | jq '.version.string'
"4.0.3"
$ cmake -E capabilities | jq
{
  "debugger": true,
  "fileApi": {
    "requests": [
      {
        "kind": "codemodel",
        "version": [
          {
            "major": 2,
            "minor": 8
          }
        ]
      },
      {
        "kind": "configureLog",
        "version": [
          {
            "major": 1,
            "minor": 0
          }
        ]
      },
      {
        "kind": "cache",
        "version": [
          {
            "major": 2,
            "minor": 0
          }
        ]
      },
      {
        "kind": "cmakeFiles",
        "version": [
          {
            "major": 1,
            "minor": 1
          }
        ]
      },
      {
        "kind": "toolchains",
        "version": [
          {
            "major": 1,
            "minor": 0
          }
        ]
      }
    ]
  },
  "generators": [
    {
      "extraGenerators": [],
      "name": "Watcom WMake",
      "platformSupport": false,
      "toolsetSupport": false
    },
    {
      "extraGenerators": [
        "Kate"
      ],
      "name": "Ninja Multi-Config",
      "platformSupport": false,
      "toolsetSupport": false
    },
    {
      "extraGenerators": [
        "CodeBlocks",
        "CodeLite",
        "Eclipse CDT4",
        "Kate",
        "Sublime Text 2"
      ],
      "name": "Ninja",
      "platformSupport": false,
      "toolsetSupport": false
    },
    {
      "extraGenerators": [],
      "name": "Xcode",
      "platformSupport": false,
      "toolsetSupport": true
    },
    {
      "extraGenerators": [
        "CodeBlocks",
        "CodeLite",
        "Eclipse CDT4",
        "Kate",
        "Sublime Text 2"
      ],
      "name": "Unix Makefiles",
      "platformSupport": false,
      "toolsetSupport": false
    }
  ],
  "serverMode": false,
  "tls": true,
  "version": {
    "isDirty": false,
    "major": 4,
    "minor": 0,
    "patch": 3,
    "string": "4.0.3",
    "suffix": ""
  }
}
```
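
For illustration, a minimal sketch of consuming that JSON from Python (the helper name below is made up and is not the function used by the build scripts):

```python
import json
import subprocess

def cmake_version() -> tuple[int, int, int]:
    # `cmake -E capabilities` emits machine-readable JSON, including the version.
    out = subprocess.check_output(["cmake", "-E", "capabilities"], text=True)
    ver = json.loads(out)["version"]
    return ver["major"], ver["minor"], ver["patch"]

print(cmake_version())  # e.g. (4, 0, 3)
```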

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157073
Approved by: https://github.com/Skylion007
2025-06-28 13:35:30 +00:00
cdb144fcf0 Display a warning when overwriting CMAKE_CUDA_ARCHITECTURES (#156123)
Really, pytorch shouldn't be messing with basic _global_ cmake configuration like this, but without a careful analysis of what depends on this behaviour, I'm not confident proposing a change.
But at least notifying the user that something wonky is going on seems like a good idea.
@drisspg
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156123
Approved by: https://github.com/drisspg, https://github.com/msaroufim

Co-authored-by: Mark Saroufim <marksaroufim@meta.com>
2025-06-28 11:22:09 +00:00
8147c4a904 [symm_mem] Create a dedicated ci flow for symmetric memory and only use 4 GPUs (#157181)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157181
Approved by: https://github.com/kwen2501, https://github.com/huydhn
2025-06-28 08:33:50 +00:00
88c6199db0 [nativert] Move KernelFactory to PyTorch core (#156913)
Summary: Kernel factory handles kernel node initialization and the execution of different kernel types.

Test Plan:
CI

Rollback Plan:

Differential Revision: D77346836

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156913
Approved by: https://github.com/zhxchen17
2025-06-28 06:34:24 +00:00
51eb8e8f84 [ATen][CUDA][CUB] Implement changes to CCCL (CUB/Thrust/LibCUDACXX) usage in ATen (#153373)
A major release of CCCL 3.0.0 will introduce some bc-breaking changes. Namely, iterators like TransformInputIterator and ConstantInputIterator were moved from CUB to Thrust, and some operators like Max and Sum were moved to LibCUDACXX.

For the more info on changes please visit: https://nvidia.github.io/cccl/cccl/3.0_migration_guide.html

This is a follow up to PR #147493. A description from the original PR:
> Several cub iterators have been deprecated and removed in the latest CCCL (cub) development https://github.com/NVIDIA/cccl/pull/3831. This PR replaced the usage of those cub iterators with thrust iterators.
>
> Some cub thread operators were also deprecated and removed in https://github.com/NVIDIA/cccl/pull/3918. This PR replaced those operators with libcudacxx ops.
>
> This might also affect ROCM usability a bit.
>
> This patch is tested to work with CCCL commit at 82befb0894
>
> Tracking of CCCL/CUB deprecations in the most recent development https://github.com/NVIDIA/cccl/issues/101

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153373
Approved by: https://github.com/cyyever, https://github.com/atalman
2025-06-28 05:44:52 +00:00
a92b24cd83 Prevent cudaStreamSync when indexing GPU tensors with boolean CPU mask (#156384)
`index_put` with a boolean mask (`target[mask] = src`) causes a `cudaStreamSynchronize`. When both `mask` and `target` tensors are on GPU this is expected.

However, the sync can be prevented if the `mask` is a CPU tensor.
Internally a new index tensor is created with `mask.nonzero()` so we can use a non-blocking copy to transfer it to the GPU since it cannot be accidentally mutated by the user between its creation and the device copy. @ngimel Let me know if I'm missing something.

I think this is useful since users can't prevent a sync simply by making sure all tensors are on the same device, as they can with other ops. Instead, one would need to do something like this, which is much less readable:
```python
indices = mask.nonzero().squeeze(1).to("cuda", non_blocking=True)
target[indices] = src
```
Fixes #12461
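
For reference, a minimal sketch (assuming a CUDA build) of the pattern that benefits: the mask stays on CPU while the target lives on GPU, and with this change the internal `nonzero()` result is moved with a non-blocking copy instead of forcing a stream sync:

```python
import torch

target = torch.zeros(1024, device="cuda")
mask = torch.rand(1024) > 0.5   # boolean mask deliberately kept on CPU

target[mask] = 1.0              # boolean index_put; no longer triggers cudaStreamSynchronize
```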

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156384
Approved by: https://github.com/ngimel
2025-06-28 05:41:16 +00:00
5692cbb818 [ONNX] Delete symbolic caffe2 (#157102)
Caffe2 is removed from pytorch. This is a clean up.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157102
Approved by: https://github.com/titaiwangms, https://github.com/cyyever
2025-06-28 05:22:02 +00:00
cyy
30d2648a4a Install nvperf_host together with cupti (#156668)
Because cupti depends on nvperf_host, as discussed in https://github.com/pytorch/pytorch/pull/154595

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156668
Approved by: https://github.com/Skylion007
2025-06-28 04:26:36 +00:00
adf6dd1e44 Fix aten::index_put args Dtensor type mismatch and add a propagation strategy (#156240)
We noticed that model code contains indexing syntax like the [nanogpt model code](f144fe9095/torchbenchmark/models/nanogpt/model.py (L240)), which causes training to fail in the backward pass when using DTensor.

In the code, `x = x[:, [-1], :]` calls the index op, and in the backward pass it triggers `aten.index_put.default` with the second argument of type `torch::List<std::optional<Tensor>>`, e.g., `[None, tensor([-1], device='cuda:0')]`. We are unable to unwrap the op info into DTensor based on the current logic [here](2625c70aec/torch/distributed/tensor/_dispatch.py (L339-L358)). We need to set runtime_schema_info for the op and enable needs_pytree to support the conversion of the tensor list arg.
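
A small sketch (plain tensors, no DTensor) of the indexing pattern and the backward op it triggers:

```python
import torch

x = torch.randn(2, 8, 16, requires_grad=True)
y = x[:, [-1], :]         # forward: aten.index with indices [None, tensor([-1])]
y.sum().backward()        # backward: aten.index_put.default with the same optional-tensor list
print(x.grad.shape)       # torch.Size([2, 8, 16])
```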

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156240
Approved by: https://github.com/wanchaol
2025-06-28 04:09:41 +00:00
f810480dbe Revert "[schema_upgrader] add C++ upgrader for json based upgrading (#156761)"
This reverts commit 61712e6f2ba58cce354a742d918934ec7293ee43.

Reverted https://github.com/pytorch/pytorch/pull/156761 on behalf of https://github.com/ydwu4 due to break linter test, which doesn't show up in the pr ([comment](https://github.com/pytorch/pytorch/pull/156761#issuecomment-3014918800))
2025-06-28 03:58:25 +00:00
0e47312ae5 ci: Add ability to test images for build-triton-wheel (#156894)
This wasn't available before, making it difficult to test whether manywheel
image changes would affect triton wheel builds.

Signed-off-by: Eli Uriegas <eliuriegas@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156894
Approved by: https://github.com/atalman, https://github.com/clee2000, https://github.com/malfet
ghstack dependencies: #156893
2025-06-28 03:41:18 +00:00
ef6dfa06a9 Create a base Checkpointer and SyncCheckpointer and add dist barrier impl and (#156926)
In preparation for adding async checkpointing, this diff:
1. Changes Checkpointer to an abstract base class and adds a sync checkpointer implementation.
2. Adds torch.distributed.barrier() as one of the barrier choices.

Differential Revision: [D77341314](https://our.internmc.facebook.com/intern/diff/D77341314/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156926
Approved by: https://github.com/pradeepfn
2025-06-28 02:48:29 +00:00
e8217ad8be [inductor][static launcher] Skip correctness test for test_floats (#157023)
https://github.com/triton-lang/triton/issues/6176 causes kernels that take fp64 scalar inputs to generate wrong results. Until we get around to fixing this, just skip the accuracy check (it'll fail on Triton's launcher anyway).

Differential Revision: [D77407307](https://our.internmc.facebook.com/intern/diff/D77407307)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157023
Approved by: https://github.com/jamesjwu
2025-06-28 02:19:10 +00:00
e3320965b4 [sym_mem] Further Fix NCCL symm mem unit test (#157156)
We still see CI failures because of the error "RuntimeError: CUDA driver error: invalid device ordinal". Upon discussion, we might also need a GPU-count skip macro for the test itself.

Fixes #156569

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157156
Approved by: https://github.com/kwen2501, https://github.com/fegin
2025-06-28 02:17:13 +00:00
a1e4f1f98a [MPS] Reimplement tri[ul] as Metal shaders (#157179)
And add in-place flavor, as it is currently broken for non-contig tensors
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157179
Approved by: https://github.com/dcci
2025-06-28 01:33:18 +00:00
c14110056f [caffe2] Allow the elimination of implicit calls to strlen when using the RECORD_FUNCTION macros (#153567)
Summary:
With the way these were written, any string literals that were being passed in, like `__func__`, were only ever passed down as a `const char*`, so this switches it over to take a `std::string_view` at the deepest part.

This also has the side effect of allowing `std::string_view` to be passed to the `RECORD_FUNCTION` macros as well.

Test Plan:
contbuilds

Rollback Plan:

Differential Revision: D74681042

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153567
Approved by: https://github.com/Skylion007, https://github.com/swolchok
2025-06-28 01:11:00 +00:00
1e4c5b666a Revert "[dynamo] fix _torchdynamo_orig_callable naming issues (#156901)"
This reverts commit eb9efb37c8f315f1d30e86d5797490c6a8666889.

Reverted https://github.com/pytorch/pytorch/pull/156901 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it seems to break some internal tests D77411594 ([comment](https://github.com/pytorch/pytorch/pull/156901#issuecomment-3014734151))
2025-06-28 00:37:01 +00:00
61712e6f2b [schema_upgrader] add C++ upgrader for json based upgrading (#156761)
Differential Revision: [D77459912](https://our.internmc.facebook.com/intern/diff/D77459912)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156761
Approved by: https://github.com/angelayi
2025-06-27 23:50:19 +00:00
2815ade9a8 updated adafactor doc #154862 (#155248)
updated adafactor doc to reflect difference in implementation vs original paper

Fixes #154862

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155248
Approved by: https://github.com/janeyx99

Co-authored-by: Jane (Yuan) Xu <31798555+janeyx99@users.noreply.github.com>
2025-06-27 23:23:19 +00:00
feea575082 [MTIA ATen Backend] Add dispatch keys for add.out (#156952)
Migrate add.out

Differential Revision: [D77352482](https://our.internmc.facebook.com/intern/diff/D77352482/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156952
Approved by: https://github.com/malfet, https://github.com/huydhn
ghstack dependencies: #156944, #156945, #156946, #156947, #156948, #156949, #156950, #156951
2025-06-27 22:49:00 +00:00
253cbadade [MTIA ATen Backend] Add dispatch keys for rsub.Tensor / rsub.Scalar / sub.out (#156951)
Migrate rsub.Tensor / rsub.Scalar / sub.out

Differential Revision: [D77015033](https://our.internmc.facebook.com/intern/diff/D77015033/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156951
Approved by: https://github.com/malfet
ghstack dependencies: #156944, #156945, #156946, #156947, #156948, #156949, #156950
2025-06-27 22:49:00 +00:00
b6b2871555 [MTIA ATen Backend] Add dispatch keys for fmod / abs.out / logical_not.out (#156950)
Migrate fmod / abs.out / logical_not.out

Differential Revision: [D77220217](https://our.internmc.facebook.com/intern/diff/D77220217/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156950
Approved by: https://github.com/malfet
ghstack dependencies: #156944, #156945, #156946, #156947, #156948, #156949
2025-06-27 22:48:48 +00:00
a95bee9ed6 [MTIA ATen Backend] Add dispatch key for div.out (#156949)
Migrate div.out

Differential Revision: [D77063371](https://our.internmc.facebook.com/intern/diff/D77063371/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156949
Approved by: https://github.com/malfet
ghstack dependencies: #156944, #156945, #156946, #156947, #156948
2025-06-27 22:48:39 +00:00
f30e072cb4 [MTIA ATen Backend] Add dispatch keys for mul.Scalar_out / mul.out (#156948)
Migrate mul.Scalar_out / mul.out

Differential Revision: [D77011801](https://our.internmc.facebook.com/intern/diff/D77011801/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156948
Approved by: https://github.com/malfet
ghstack dependencies: #156944, #156945, #156946, #156947
2025-06-27 22:48:32 +00:00
66ad843583 [MTIA ATen Backend] Add dispatch keys for gt.Tensor_out / gt.Scalar_out (#156947)
Migrate gt.Tensor_out / gt.Scalar_out

Differential Revision: [D77009468](https://our.internmc.facebook.com/intern/diff/D77009468/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156947
Approved by: https://github.com/malfet
ghstack dependencies: #156944, #156945, #156946
2025-06-27 22:48:25 +00:00
f0a5a3b453 [MTIA ATen Backend] Add dispatch keys for ne.Tensor_out / ne.Scalar_out (#156946)
Migrate ne.Tensor_out / ne.Scalar_out

Differential Revision: [D77008139](https://our.internmc.facebook.com/intern/diff/D77008139/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156946
Approved by: https://github.com/malfet
ghstack dependencies: #156944, #156945
2025-06-27 22:48:18 +00:00
cd1a924dba [nativert] get rid of sigmoid naming (#157134)
Summary: att

Test Plan:
ci

Rollback Plan:

Differential Revision: D77451215

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157134
Approved by: https://github.com/zhxchen17, https://github.com/jingsh
2025-06-27 22:41:52 +00:00
d283fc79b1 chunk_size should always be int64_t for Foreach functors (#156872)
See https://github.com/pytorch/pytorch/issues/156261#issuecomment-3002394773

Testing is a valid question--it is pretty expensive to test such large tensors for all these ops.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156872
Approved by: https://github.com/Skylion007, https://github.com/eqy
ghstack dependencies: #156876, #156871
2025-06-27 22:35:34 +00:00
5a0926a26e Stop skipping entire foreach tests, just skip the profiler portion (#156871)
Instead of skipping the whole test as the CUPTI team figures out what is wrong, let's temporarily skip the profiler check portion. It is high pri to add it back to ensure foreach ops are actually performant.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156871
Approved by: https://github.com/albanD
ghstack dependencies: #156876
2025-06-27 22:35:34 +00:00
20e40492b0 [dynamo] Add fx_graph_runnable test coverage (#157021)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157021
Approved by: https://github.com/StrongerXi, https://github.com/xmfan
2025-06-27 21:35:56 +00:00
130d4973bd Documentation update torch.clone #156644 (#157007)
updated torch clone docs to reflect implemented memory behavior

Fixes #156644

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157007
Approved by: https://github.com/malfet, https://github.com/svekars

Co-authored-by: Svetlana Karslioglu <svekars@meta.com>
2025-06-27 21:10:09 +00:00
3ee75b7eac [MTIA ATen Backend] Add dispatch keys for le.Tensor_out / le.Scalar_out (#156945)
Migrate le.Tensor_out / le.Scalar_out

Differential Revision: [D77002317](https://our.internmc.facebook.com/intern/diff/D77002317/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156945
Approved by: https://github.com/malfet
ghstack dependencies: #156944
2025-06-27 21:03:19 +00:00
6b7767fc8d [MTIA ATen Backend] Add dispatch keys for ge.Tensor_out / ge.Scalar_out (#156944)
Migrate ge.Tensor_out / ge.Scalar_out

Differential Revision: [D77002145](https://our.internmc.facebook.com/intern/diff/D77002145/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156944
Approved by: https://github.com/malfet
2025-06-27 21:02:27 +00:00
0decd966af Revert "Fixes for CPython int/float tests (#155978)"
This reverts commit 216bd6091ec52865052282eced7e6d5d2a4b4fb4.

Reverted https://github.com/pytorch/pytorch/pull/155978 on behalf of https://github.com/huydhn due to Some tests are still failing in trunk ([comment](https://github.com/pytorch/pytorch/pull/155978#issuecomment-3014185210))
2025-06-27 19:39:41 +00:00
7c51619e7f Fix Float16 CooperativeReduction Test Failure (#154516)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154516
Approved by: https://github.com/jansel, https://github.com/jeffdaily
2025-06-27 19:31:49 +00:00
4048a144ab Address richard's comments on libtorch_stable_abi note (#156324)
Followups from #155984

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156324
Approved by: https://github.com/zou3519
2025-06-27 19:19:12 +00:00
dcb97cd519 Remove unnecessary code to check autograd state (#156855)
Summary: Title

Test Plan:
CI

Rollback Plan:

Differential Revision: D77317627

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156855
Approved by: https://github.com/zhxchen17

Co-authored-by: Camyll Harajli <camyllh@meta.com>
2025-06-27 19:18:06 +00:00
8a88c6e85a [nit] fix xavier init doc (#157100)
Remove part of the documentation that is irrelevant and confusing at best, probably a copy-paste mistake:

<img src="https://github.com/user-attachments/assets/77fa5734-5a5a-4f8d-80a5-bc3269668e07" width="500">
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157100
Approved by: https://github.com/mikaylagawarecki
2025-06-27 19:13:40 +00:00
75a7d9e868 Revert "python definitely_contiguous-> is_contiguous_or_false (#156515)"
This reverts commit 4c0091fda65b714fa73671a15e379f814af153e0.

Reverted https://github.com/pytorch/pytorch/pull/156515 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it seems to cause some torch.export failures internally ([comment](https://github.com/pytorch/pytorch/pull/156515#issuecomment-3014104570))
2025-06-27 19:07:06 +00:00
2860f5c4f5 Remove mentioning of TorchScript in Export doc (#156969)
Remove mentioning of TorchScript

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156969
Approved by: https://github.com/angelayi

Co-authored-by: Angela Yi <yiangela7@gmail.com>
2025-06-27 17:59:15 +00:00
456b7451c7 Minor error message fix in device_mesh.py (#157096)
Fixed error message:
On main:
```
KeyError: ("Invalid mesh_dim_names ('dp_shard', 'dp_shard') specified. ", 'Found mesh dim indices to slice: [(1,), (1,)]. ', 'Mesh dim indices should be in ascending order.')
```
On PR:
```
KeyError: Invalid mesh_dim_names ('dp_shard', 'dp_shard') specified. Found mesh dim indices to slice: [(1,), (1,)]. Mesh dim indices should be in ascending order.'
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157096
Approved by: https://github.com/Skylion007
2025-06-27 17:42:29 +00:00
36fd1ac932 [ONNX] Bump onnxscript api for torch 2.8 (#157017)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157017
Approved by: https://github.com/titaiwangms, https://github.com/malfet
2025-06-27 17:39:17 +00:00
84c588e5ea [cutlass backend][BE][ez] Make matmul layouts be row x column (#156656)
Differential Revision: [D77184232](https://our.internmc.facebook.com/intern/diff/D77184232/)

Motivation:
* This is the case we care the most.
* We are caching the kernels for this row x column layout. So testing on them can potentially make ci run faster.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156656
Approved by: https://github.com/ColinPeppler
2025-06-27 17:15:45 +00:00
b22b93a6ba [2/n] rewrite load balancing and sharding in context parallel (#155442)
This PR rewrite how load balancing and sharding works in the current
context parallel implementation.

Why the changes? We should NOT expose another layer of "sharding"
concept as it would confuse the user about its difference with DTensor
sharding. The current CP performs sharding weirdly simply because it
mixes the concepts of load balancing and sharding.

I think load balancing and sharding need to be decoupled into separate
layers:

* The load balancing layer is responsible for reordering the input sequence
so that the attention computation is evenly balanced across rows/ranks.
* Sharding is a separate layer after it; it simply takes the input reordered by
the load balancer and shards it exactly as DTensor shards tensors sequentially.

In this PR:
* I removed the mixed "Sharder" and "LoadBalancer" usage, and
simply generate round-robin indices when the mask is a causal mask
* use `distribute_tensor` to perform the sharding. We still keep the local
shard instead of the DTensor objects to allow maximum compatibility with
arbitrary model architecture given DTensor op coverage is not high
enough.

One alternative design is to still keep the LoadBalancer and add the indices
generation and restore to be the protocol of the LoadBalancer. I thought through
it and think we might want to directly expose the load_balancing indices as
an argument instead of a dedicated class interface, so I removed it here. More
discussion on this is welcomed.
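
To make the decoupling concrete, here is a hand-written sketch of round-robin (zig-zag) index generation for a causal mask followed by plain sequential sharding; the helper below is illustrative only and is not the API added in this PR:

```python
import torch

def round_robin_indices(seq_len: int, world_size: int) -> torch.Tensor:
    # Pair chunk i with chunk (2*world_size - 1 - i) so every rank gets one
    # "cheap" and one "expensive" causal-attention chunk.
    chunk_ids = torch.arange(2 * world_size).view(2, world_size)
    chunk_ids[1] = chunk_ids[1].flip(0)
    chunk_len = seq_len // (2 * world_size)
    offsets = chunk_ids.t().reshape(-1)[:, None] * chunk_len
    return (offsets + torch.arange(chunk_len)).reshape(-1)

seq = torch.arange(16)
balanced = seq[round_robin_indices(16, world_size=2)]  # load-balancing layer
shards = balanced.chunk(2)                             # sharding layer (sequential, DTensor-style)
```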

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155442
Approved by: https://github.com/XilunWu
ghstack dependencies: #155441
2025-06-27 17:06:42 +00:00
f7c730107e [1/n] refactor the ring attention implementation (#155441)
As titled, I'm working on a series of changes to make the ring attention
impl and DTensor work better together. This PR specifically refactors the
current implementation to:

* remove dead/unused code
* restructure the functions to make them stay organized
* refactor to remove or improve error messages

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155441
Approved by: https://github.com/fegin
2025-06-27 17:06:42 +00:00
eeaefa1336 Fix UnbackedSymint rebinding - check unbacked before renaming (#156911)
Differential Revision: D77249427

Due to memoization and graph order updates, it can happen that a backed symbol is passed into compute_unbacked_bindings and leads to a failure. An example is as follows:

- There are 2 boolean indexing operators (e.g. op1 and op2) with the same mask.
- An unbacked symint is generated from op1, and then op2 reuses the unbacked symint due to a nonzero_memo in nonzero's fake implementation, so no rebinding is needed for op2.
- Since op1 generated the unbacked symint, its meta has the "unbacked_bindings" field filled in, and op2's meta doesn't have it.
- Outputs from op1 and op2 are later concatenated with others that have a backed symint, so that the unbacked symint can be replaced by a backed symint.
- In Inductor, during fake tensor prop, there is no memoization because a new fake tensor is always generated (for the same node). op1 generates an unbacked symint and it can be rebound successfully to the backed symint. Since there is no memoization, op2 also generates a new unbacked symint, but no rebinding can happen because op2's meta doesn't have "unbacked_bindings", and "compute_unbacked_bindings/_rename_unbacked_to" fails when asserting that op2's old symbol is unbacked.

From discussion with [@ezyang](https://www.internalfb.com/intern/profile/?id=503862770), there is no easy way to fix this issue.

- We can try to enable memoization for fake tensor prop in Inductor; however, we need to ensure that op1 is visited before op2 during Inductor fake tensor prop for this to work (op2's meta doesn't have "unbacked_bindings", so no rebinding can happen and we need to do the rebinding from op1). But there are passes such as reorder_for_locality that can change the graph order, so this doesn't work.
- A simple hack is to just replace the unbacked symbol in op2 by the backed symbol.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156911
Approved by: https://github.com/ezyang
2025-06-27 16:57:04 +00:00
216bd6091e Fixes for CPython int/float tests (#155978)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155978
Approved by: https://github.com/zou3519
2025-06-27 16:41:00 +00:00
d0cfa3e5bf [c10d] Move the include of header file of TraceUtils.h into NCCLUtil.cpp instead of keeping in hpp (#156909)
We have seen complaints about compilation failures of `NCCLSymmetricMemory.cu`, and the reason is that we include <torch/csrc/distributed/c10d/TraceUtils.h> inside NCCLUtil.hpp. This is not necessary, so we move the include to the cpp file.

Differential Revision: [D77346675](https://our.internmc.facebook.com/intern/diff/D77346675)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156909
Approved by: https://github.com/kwen2501
2025-06-27 16:30:49 +00:00
21b5dc7a6a [CD] Add python-3.14.0b3 to docker image (#156889)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156889
Approved by: https://github.com/albanD, https://github.com/atalman
ghstack dependencies: #157033
2025-06-27 16:24:39 +00:00
d158e9ea82 Update nightly PyTorch version to 2.8.0->2.9.0 (#156965)
Same as https://github.com/pytorch/pytorch/pull/149038

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156965
Approved by: https://github.com/Camyll, https://github.com/malfet
2025-06-27 16:22:08 +00:00
60abb0d327 [dynamo] Better error for invalid @contextlib.contextmanager usage (#156924)
Fixes #156716

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156924
Approved by: https://github.com/williamwen42
2025-06-27 15:50:36 +00:00
ff8b53c056 [Kineto] Add MTIA_INSIGHT to kineto_shim (#156853)
Summary:
Add MTIA_INSIGHT to kMtiaTypes in kineto_shim.cpp

For insight, users can use MTIA_INSIGHT_VERBOSE_TRACES=0 to disable the profiler, so we can enable it by default.

Test Plan:
{F1979756361}
When the environment var isn't set, it uses 0.

Rollback Plan:

Differential Revision: D77315882

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156853
Approved by: https://github.com/sraikund16
2025-06-27 15:30:14 +00:00
5118a8f8a5 Rename mm_scaled_grouped.py to mm_grouped.py (#156849)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156849
Approved by: https://github.com/amjames, https://github.com/Skylion007
2025-06-27 15:02:22 +00:00
aa2d54148d Add AOTDispatcher config to set backward autocast behavior (#156356)
This PR adds a new config `backward_pass_autocast`, to set the backward autocast
behavior. It does not change the existing behavior.

The reason why we need this is that torch.compile acquires a forward and
backward graph at the time of the forward pass. This means that
implemented naively, if there are any context managers active outside
the call to torch.compile, the backward graph will also get the
behaviors from those context managers. This PR gives users a way to
tweak the autocast behavior of the backward pass.

Please see torch._functorch.config for the options to the
`backward_pass_autocast` config.
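
A small sketch of why this matters: any autocast context active around the compiled call is baked into the backward graph, since that graph is traced at forward time. The accepted values of `backward_pass_autocast` are documented in torch._functorch.config and are not shown here:

```python
import torch

@torch.compile
def f(x, w):
    return (x @ w).sum()

x = torch.randn(8, 8, requires_grad=True)
w = torch.randn(8, 8, requires_grad=True)

with torch.autocast("cpu", dtype=torch.bfloat16):
    loss = f(x, w)   # forward AND backward graphs are captured here
loss.backward()       # the backward runs the graph captured under autocast above
```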
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156356
Approved by: https://github.com/bdhirsh
ghstack dependencies: #155354
2025-06-27 14:58:58 +00:00
adf9644440 Add pg transport and tests (#154653)
Add PG transport and tests under `torch/distributed/checkpoint/`

### API:
```python
def send_checkpoint(self, dst_ranks: list[int], state_dict: object) -> None:
def recv_checkpoint(self, src_rank: int) -> object:
```

### Tests:
```
python test/distributed/checkpoint/test_pg_transport.py
```

### Example:
Under `_pg_transport_example.py` (in https://github.com/pytorch/pytorch/pull/155810)
```
torchrun --nproc_per_node=2 -m torch.distributed.checkpoint._pg_transport_example -- --device cuda
```

Differential Revision: [D76044919](https://our.internmc.facebook.com/intern/diff/D76044919)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154653
Approved by: https://github.com/meetv18
2025-06-27 14:53:34 +00:00
414ad47045 revamp dtype documentation for 2025 (#156087)
The dtype documentation has not been updated in a while; let's do a revamp.

1. combine the duplicated docs for dtypes from `tensors.rst` and `tensor_attributes.rst` to live in `tensor_attributes.rst`, and link to that page from `tensors.rst`
2. split the dtype table into floating point and integer dtypes
3. add the definition of shell dtype
4. add the float8 and MX dtypes as shell dtypes to the dtype table
5. remove legacy quantized dtypes from the table
6. add the definition of various dtype suffixes ("fn", etc)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156087
Approved by: https://github.com/albanD
2025-06-27 13:10:23 +00:00
43523bf168 Fix silent incorrectness arising from incorrect alias information (#152011)
Fixes #136662

There are two problems:
1) canonicalize_view_scatter_ops adds some new nodes into the graph.
   These new nodes cause the alias info on the graph to be wrong. To fix
   this, we try to run FakeTensorUpdater on the graph again.
2) FakeTensorUpdater's alias information is wrong. It tries to skip
   nodes that it thinks have "equivalent" FakeTensor metadata.
   It should not be allowed to do this if any users of the node can
   alias the node. The example
   is if we have `x = foo(...); y = x.view(...)`. If the user replaces
   `foo` with a new `bar` node and sets bar.meta["val"] correctly, then
   FakeTensorUpdater still needs to update y's meta["val"] to be a view
   of the new bar node.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152011
Approved by: https://github.com/yf225
2025-06-27 12:45:03 +00:00
75f3e5a88d [dynamo] Fix issue with tensors passed as view() shapes (#156928)
Fixes #156720

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156928
Approved by: https://github.com/ezyang
2025-06-27 08:52:31 +00:00
588b5fb94b Optimize TorchHigherOrderOperatorVariable.make() with lookup table (#157022)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157022
Approved by: https://github.com/zou3519
2025-06-27 07:36:12 +00:00
968f90ce73 [ROCm][Windows] Fixing undefined symbol linker error after exposing MIOpen symbols (#156479)
Fixing undefined symbol linker error after [exposing MIOpen symbols](https://github.com/pytorch/pytorch/pull/154545).
This fix:

- Hipifies `aten/src/ATen/miopen` and `aten/src/ATen/native/miopen` files
- Adds `aten/src/ATen/miopen` and `aten/src/ATen/native/miopen` hipified source files to `all_hip_cpp` list

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156479
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-06-27 07:23:32 +00:00
4a80ddfbe7 Revert "Fix reinplace pass handling of view input + mutable custom op (#156729)"
This reverts commit b754b1fa43d20f5b31e17c396487ab56991912da.

Reverted https://github.com/pytorch/pytorch/pull/156729 on behalf of https://github.com/davidberard98 due to breaks lint: [GH job link](https://github.com/pytorch/pytorch/actions/runs/15918483073/job/44900430950) [HUD commit link](b754b1fa43) ([comment](https://github.com/pytorch/pytorch/pull/156729#issuecomment-3011867746))
2025-06-27 06:38:58 +00:00
cyy
064288cbab Use std::string_view in torchgen (#157050)
Let the generated code use std::sv

Pull Request resolved: https://github.com/pytorch/pytorch/pull/157050
Approved by: https://github.com/ezyang
2025-06-27 06:36:10 +00:00
cc3ea2d840 remove gso from Linear.cpp (#156899)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156899
Approved by: https://github.com/ColinPeppler
2025-06-27 06:30:50 +00:00
cf0749c92f Use expecttest in test_compiled_optimizers.py (#155308)
Fixes #141262

## Test Result

```bash
pytest test/inductor/test_compiled_optimizers.py -vv
```

![image](https://github.com/user-attachments/assets/1886fb71-ff05-46e7-988c-82d36358a834)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155308
Approved by: https://github.com/mlazos, https://github.com/msaroufim

Co-authored-by: Mark Saroufim <marksaroufim@gmail.com>
2025-06-27 06:29:51 +00:00
cbcffce48a address remaining straight forward gso in meta_registrations (#156902)
Those are all straightforward generalizations of existing checks.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156902
Approved by: https://github.com/ColinPeppler
2025-06-27 06:19:54 +00:00
640703d95f add torch.concat to normalization pass (#156574)
Summary: In the normalization pass, we also handle torch.concat and normalize it to torch.cat.

Test Plan:
```
buck2 test 'fbcode//mode/dev-nosan' fbcode//caffe2/test/inductor:split_cat_fx_passes -- test_cat_normalization
```

Buck UI: https://www.internalfb.com/buck2/597fd4f1-0aa7-4372-8a66-5a690d9b63a4
Test UI: https://www.internalfb.com/intern/testinfra/testrun/1688850152284203
Network: Up: 84KiB  Down: 34KiB  (reSessionID-3916e009-7117-41ce-b6f9-089873aa50dd)
Executing actions. Remaining     0/3                                                                                              1.1s exec time total
Command: test.     Finished 2 local
Time elapsed: 3:47.1s
Tests finished: Pass 2. Fail 0. Fatal 0. Skip 0. Build failure 0

Rollback Plan:

Differential Revision: D77125331

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156574
Approved by: https://github.com/Mingming-Ding
2025-06-27 06:07:26 +00:00
1155c53e7d Port three dynamo test to Intel GPU (#156575)
For https://github.com/pytorch/pytorch/issues/114850, we will port test cases to Intel GPU. Two dynamo test files were ported in PR [#156056](https://github.com/pytorch/pytorch/pull/156056). In this PR we will port 3 more dynamo test files.
We enable Intel GPU with the following methods while trying our best to keep the original code style:

- instantiate_device_type_tests()
- use "torch.accelerator.current_accelerator()" to determine the accelerator backend
- add XPU support in decorators like @requires_gpu
- enable XPU for some test paths.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156575
Approved by: https://github.com/guangyey, https://github.com/jansel

Co-authored-by: Yu, Guangye <106960996+guangyey@users.noreply.github.com>
2025-06-27 05:56:22 +00:00
51853b358e [dynamo] Improve error message for cond aliasing (#156963)
See #156724

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156963
Approved by: https://github.com/zou3519, https://github.com/williamwen42
2025-06-27 05:31:46 +00:00
6b05842e47 [test][inductor] fix test_conv_cat failure (#155852)
This test is currently failing because triton_poi_fused_cat_2 has changed to triton_poi_fused_cat_3. I have not investigated why the extra kernel is generated, but this test has been failing on trunk for a while (and I verified locally that it is failing).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155852
Approved by: https://github.com/FindHao, https://github.com/Skylion007
2025-06-27 05:11:11 +00:00
2c76f31221 Compute contiguity symbolically to avoid dde, and introduce c++ sym_is_contiguous (#155590)
When we compute contiguity for a tensor with dynamic shapes we first:
1) Try to compute it without guarding.
2) If all shapes are hinted, compute it, potentially adding guards.
3) If any input is not hinted, compute it symbolically.

sym_is_contiguous returns a SymBool that is then either evaluated, or guard_or_false can be called
on it to avoid data-dependent errors.

ex:
 bool is_contiguous = input.sym_is_contiguous().guard_or_false(__FILE__, __LINE__);
is_contiguous_or_false is a helper function that does that.

In this PR I only handle default contiguity; I will follow up with changes for other formats like channel_last.
We use this pattern in several locations in this PR to avoid DDEs.
Differential Revision: [D77183032](https://our.internmc.facebook.com/intern/diff/D77183032)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155590
Approved by: https://github.com/ezyang
2025-06-27 04:59:52 +00:00
b754b1fa43 Fix reinplace pass handling of view input + mutable custom op (#156729)
Fixes #153389.

Using approach https://github.com/pytorch/pytorch/issues/153389#issuecomment-3006049928 suggested by Richard.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156729
Approved by: https://github.com/zou3519
2025-06-27 04:54:17 +00:00
e6d8ed02cb PyTorch Data Sampler benchmark (#156974)
## Motivation
Many PRs optimizing samplers (e.g. https://github.com/pytorch/pytorch/pull/147706, https://github.com/pytorch/pytorch/pull/137423) leverage an ad hoc script for benchmarking samplers. The script and outputs are often copied over in PRs. We want to begin centralizing benchmarks for torch.utils.data components.

## What ?
* This PR adds a new sub-folder in `benchmarks` for `data`. It is aimed at covering benchmarking scripts for torch.utils.data components like dataloader and sampler.
* Specifically, this PR includes a simple script to time samplers. This is often "copy-pasted" in PRs optimizing samplers. Having it in a centralized location should prevent that, and allow a common standard.

## Output
```
Benchmark Results:
+--------------+-------------+----------------+-----------+-----------+
|   Batch Size | Drop Last   |   Original (s) |   New (s) | Speedup   |
+==============+=============+================+===========+===========+
|            4 | True        |         0.004  |    0.0088 | -119.62%  |
+--------------+-------------+----------------+-----------+-----------+
|            4 | False       |         0.0083 |    0.009  | -9.23%    |
+--------------+-------------+----------------+-----------+-----------+
|            8 | True        |         0.003  |    0.0074 | -147.64%  |
+--------------+-------------+----------------+-----------+-----------+
|            8 | False       |         0.0054 |    0.0075 | -38.72%   |
+--------------+-------------+----------------+-----------+-----------+
|           64 | True        |         0.0021 |    0.0056 | -161.92%  |
+--------------+-------------+----------------+-----------+-----------+
|           64 | False       |         0.0029 |    0.0055 | -92.50%   |
+--------------+-------------+----------------+-----------+-----------+
|          640 | True        |         0.002  |    0.0055 | -168.75%  |
+--------------+-------------+----------------+-----------+-----------+
|          640 | False       |         0.0024 |    0.0062 | -161.35%  |
+--------------+-------------+----------------+-----------+-----------+
|         6400 | True        |         0.0021 |    0.0055 | -160.13%  |
+--------------+-------------+----------------+-----------+-----------+
|         6400 | False       |         0.0021 |    0.0068 | -215.46%  |
+--------------+-------------+----------------+-----------+-----------+
|        64000 | True        |         0.0042 |    0.0065 | -55.29%   |
+--------------+-------------+----------------+-----------+-----------+
|        64000 | False       |         0.0029 |    0.0077 | -169.56%  |
+--------------+-------------+----------------+-----------+-----------+
```
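
For a sense of what such a script measures, here is a minimal timing sketch (not the exact script added under `benchmarks/data`; parameters are illustrative):

```python
import time
from torch.utils.data import BatchSampler, RandomSampler

def time_sampler(n: int, batch_size: int, drop_last: bool, iters: int = 5) -> float:
    # Time a full pass over the batch sampler, averaged over `iters` runs.
    sampler = BatchSampler(RandomSampler(range(n)), batch_size, drop_last)
    start = time.perf_counter()
    for _ in range(iters):
        for _ in sampler:
            pass
    return (time.perf_counter() - start) / iters

print(f"{time_sampler(64_000, 64, True):.4f} s per pass")
```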
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156974
Approved by: https://github.com/ramanishsingh
2025-06-27 04:49:43 +00:00
195ef1bce8 [SymmMem] Refactor NVSHMEM tests: separate Triton tests into dedicated file (#156685)
## Summary

Moved the Triton-specific NVSHMEM tests in `test_nvshmem.py` into a dedicated `test_nvshmem_triton.py` file. Also put the shared Triton JIT kernels at the top-level of new file for reusability.

## Testing

```bash
TORCH_SYMMMEM=NVSHMEM python test/distributed/test_nvshmem.py
TORCH_SYMMMEM=NVSHMEM python test/distributed/test_nvshmem_triton.py
```

All 16 original tests pass with no functionality changes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156685
Approved by: https://github.com/mandroid6, https://github.com/kwen2501
ghstack dependencies: #156684
2025-06-27 04:38:37 +00:00
b6c00dfe24 [user triton] AOT inductor support for device-side TMA (#155896)
Tests: `python test/inductor/test_aot_inductor.py -vvv -k device_tma`

Device-side TMA in Triton allows the kernel author to construct the TMA descriptor on the device (which composes with things like autotuning much better). However, it also requires a scratch space to be provided into which the TMA descriptor will be constructed. In the new TMA API (tl.make_tensor_descriptor), this is implemented using a "global scratch space" - a tensor which is allocated beforehand and then passed in as an argument for the kernel.

To support this in AOTI, this PR:
* records the global scratch space needed (triton_heuristics.py), so that it can be used during AOTI codegen
* allocates global scratch, if needed (cuda/device_op_overrides.py)
* plumbs `device_idx_` into the triton caller function, so that global scratch can be allocated on the right device
* updates tests to verify this works for dynamically shaped inputs

This PR should support both inductor-generated device-side TMA (e.g. persistent TMA mm) and user-defined triton kernels that contain device-side TMA (which is the test I ran to verify this works)

Note: this overrides any user-provided allocator function (typically with eager triton code, the user must provide their own custom allocator function that is used to allocate scratch space).

For Meta reviewers, here is a tlparse from running `python test/inductor/test_aot_inductor.py -vvv -k test_triton_kernel_on_device_tma_dynamic_True_tma_version_new_cuda` https://manifold.edge.x2p.facebook.net/v0/read/tree/logs/.tmpFg13g1/index.html?bucketName=tlparse_reports&apiKey=tlparse_reports-key&withPayload=1&timeoutMsec=10000

Differential Revision: [D77352139](https://our.internmc.facebook.com/intern/diff/D77352139)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155896
Approved by: https://github.com/desertfire
2025-06-27 04:28:04 +00:00
710b92cf3b [BE][BugFix] Install Python-3.13 correctly (#157033)
Fixes temporary workaround introduced by https://github.com/pytorch/builder/pull/1827

I.e. it has been downloading the latest 3.13 branch rather than the 3.13.0 release.

Simplify nogil version handling
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157033
Approved by: https://github.com/wdvr, https://github.com/huydhn
2025-06-27 04:19:59 +00:00
1eea2c4fe3 [Inductor][CPP] Fix perf regression of functorch_maml_omniglot (#156526)
**Summary**
Fix the performance regression of `functorch_maml_omniglot` in TorchBench. The issue reported in [#151523](https://github.com/pytorch/pytorch/issues/151523) occurs only when a parallel reduction is performed under the vectorized loop and a scalar kernel is used for the tail loop. Previously, we addressed this regression in [#151887](https://github.com/pytorch/pytorch/pull/151887) by disabling all cases where a parallel reduction occurs under the vectorized  loop. However, for `functorch_maml_omniglot`, we found that a masked vector kernel is used in the tail loop instead of the scalar kernel in the job of `inductor_torchbench_cpu_smoketest_perf`. In this PR, we refine the fix by excluding the cases where a masked vector kernel is used in the tail loop, rather than disabling all such scenarios.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156526
Approved by: https://github.com/CaoE
2025-06-27 03:09:24 +00:00
7392470da4 [nativert] alias analyzer + layout planner/manager to pytorch core (#156897)
Summary: att

Test Plan:
ci - unit tests still have some unresolved deps but will move them later.

Rollback Plan:

Differential Revision: D77320950

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156897
Approved by: https://github.com/zhxchen17
2025-06-27 03:01:22 +00:00
382c6190c1 complex.pow(2) on GPU by replacing with complex * complex to avoid numerical instability (#152373)
Fixes #150951
Summary:
For complex.pow(2) on GPU:

Uses complex * complex directly.
Produces results consistent with CPU implementation.
Eliminates spurious imaginary components for real inputs.

🧪 Tests
Added unit tests to verify correctness of the new kernel path.
Verified numerical consistency with CPU results.

This change is backward-compatible and only affects the specific case of pow(2) on complex tensors on GPU.
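
A quick consistency sketch of the behavior described above (assumes a CUDA build; the checks are illustrative, not the unit tests added in this PR):

```python
import torch

z = torch.randn(1024).to(torch.complex64)   # real-valued complex input
z_cuda = z.cuda()

via_pow = z_cuda.pow(2)
via_mul = z_cuda * z_cuda

print(torch.allclose(via_pow, via_mul))         # GPU pow(2) now matches complex * complex
print(torch.allclose(via_pow.cpu(), z.pow(2)))  # and agrees with the CPU result
print(via_pow.imag.abs().max())                 # ~0: no spurious imaginary part for real inputs
```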
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152373
Approved by: https://github.com/ezyang
2025-06-27 02:21:59 +00:00
e290a4c645 Revert "Rename torch::standalone to headeronly (#156964)"
This reverts commit 7e54c02a35b905e758497b856a1953eb009ba836.

Reverted https://github.com/pytorch/pytorch/pull/156964 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/156964#issuecomment-3011136947))
2025-06-27 02:20:33 +00:00
4ab4d29cbe [BE] Remove SymmMem allocator destruct log (#157020)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157020
Approved by: https://github.com/fduwjj
2025-06-27 02:10:54 +00:00
56c69bedcc Revert "[dynamo] Better error for invalid @contextlib.contextmanager usage (#156924)"
This reverts commit 863327ae496471654344e1e04ccaa713a44a135d.

Reverted https://github.com/pytorch/pytorch/pull/156924 on behalf of https://github.com/jansel due to Likely same issue as #156963 ([comment](https://github.com/pytorch/pytorch/pull/156924#issuecomment-3011087802))
2025-06-27 01:57:05 +00:00
8e8bbfc803 Remove ts to export retracer (#156857)
Summary: This is probably not used anymore

Test Plan:
CI

Rollback Plan:

Reviewed By: SherlockNoMad

Differential Revision: D77318582

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156857
Approved by: https://github.com/SherlockNoMad
2025-06-27 01:54:24 +00:00
a4b59498c5 Fix fake kernel for the out=... variant of unbind_copy (#156643)
`unbind_copy(..., out=...)` returns None rather than the `out` argument
(see https://github.com/pytorch/pytorch/issues/130829#issuecomment-2283936222),
but the old fake kernel didn't account for that and caused an assertion
failure in `pushPyOutToStack`. This patch fixes that.
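
A small sketch of the out= behavior the fake kernel now models (per the linked issue comment, the call returns None rather than the out tensors):

```python
import torch

x = torch.arange(6).reshape(2, 3)
outs = (torch.empty(3, dtype=x.dtype), torch.empty(3, dtype=x.dtype))
ret = torch.unbind_copy(x, dim=0, out=outs)
print(ret)       # None
print(outs[0])   # tensor([0, 1, 2])
```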

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156643
Approved by: https://github.com/zou3519, https://github.com/jansel, https://github.com/bdhirsh
ghstack dependencies: #156642
2025-06-27 01:34:07 +00:00
89aa708b39 [core] Dispatch to at::nansum_out rather than at::native::nansum_out (#156642)
Calling `at::native::nansum_out` causes the fake kernel to dispatch to a
`make_reduction` call and then segfaults later due to the
`mutable_data_ptr` call in `TensorIteratorBase::build`. It also causes
fake tensor propagation issue in Dynamo. The added tests demonstrate the
aforementioned 2 issues.

This patch fixes it by dispatching to `at::nansum_out` instead.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156642
Approved by: https://github.com/zou3519
2025-06-27 01:34:07 +00:00
863327ae49 [dynamo] Better error for invalid @contextlib.contextmanager usage (#156924)
Fixes #156716

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156924
Approved by: https://github.com/williamwen42
2025-06-27 01:02:01 +00:00
7e54c02a35 Rename torch::standalone to headeronly (#156964)
Summary: headeronly is more clear, let's change the name before anyone depends on standalone

Test Plan:
CI should pass!

Rollback Plan:

Differential Revision: D77381084

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156964
Approved by: https://github.com/swolchok, https://github.com/albanD, https://github.com/desertfire
2025-06-27 01:00:14 +00:00
3bdd5ae334 [PT2] deprecate force_same_precision, guarded by JK (#156789)
Summary:
cuBLAS used to have strict alignment requirements for TF32 usage, even if TF32 was enabled by users; this caused a numeric SEV in the past, when Triton would use TF32 even if cuBLAS could not due to failing the alignment checks

we believe that cuBLAS no longer has alignment requirements for TF32 usage, based on some testing in D77265581; we'd like to deprecate `force_same_precision` since it no longer functions as expected

changing the default to False in fbcode, guarded by a jk so that we can quickly revert to the original behavior if needed

Test Plan:
CI

Rollback Plan:

Differential Revision: D77265930

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156789
Approved by: https://github.com/jhadidjojo, https://github.com/masnesral
2025-06-27 00:43:06 +00:00
6215e90b7b Revert "[dynamo] Improve error message for cond aliasing (#156963)"
This reverts commit 9c39bc24807a5843f8affdf56bd71836760dc554.

Reverted https://github.com/pytorch/pytorch/pull/156963 on behalf of https://github.com/huydhn due to Sorry for reverting your PR, but the failures are legit ([comment](https://github.com/pytorch/pytorch/pull/156963#issuecomment-3010870664))
2025-06-27 00:31:00 +00:00
e3977e843d Revert "Fix silent incorrectness arising from incorrect alias information (#152011)"
This reverts commit 2d39a48d524021995269411bd49fe792e59d9f94.

Reverted https://github.com/pytorch/pytorch/pull/152011 on behalf of https://github.com/Camyll due to cannot land internally. owner will update and reland to fix ([comment](https://github.com/pytorch/pytorch/pull/152011#issuecomment-3010723960))
2025-06-26 23:54:13 +00:00
eb9efb37c8 [dynamo] fix _torchdynamo_orig_callable naming issues (#156901)
`_torchdynamo_orig_callable` was being used in two distinct places:
- to get the original user function from nested eval_frame.py decorators
- to get the original backend from nested convert_frame.py callbacks

We rename the first usage to `_torchdynamo_orig_fn` and the second to `_torchdynamo_orig_backend` in order to distinguish these cases.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156901
Approved by: https://github.com/StrongerXi, https://github.com/jansel
ghstack dependencies: #156527
2025-06-26 23:51:08 +00:00
6089ebcf6d [dynamo] fix segfault due to dangling CacheEntry backend pointer (#156527)
Fixes https://github.com/pytorch/pytorch/issues/155057

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156527
Approved by: https://github.com/anijain2305, https://github.com/jansel
2025-06-26 23:51:08 +00:00
e0447bb5f8 Add max_pool3d for MPS (#156467)
Fixes #100674

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156467
Approved by: https://github.com/malfet
2025-06-26 23:33:50 +00:00
1fff6356d9 [MPS] Optimize cummin/cummax metal kernels (#156794)
Performance improvement (M4 Max 64GB, macOS 15.5):
```
                                              | Current | Previous
      cummin-dim0-32x32 (torch.float16)       |  103.4  |   102.5
      cummin-dim0-128x128 (torch.float16)     |  112.2  |   133.6
      cummin-dim0-512x512 (torch.float16)     |  146.9  |   233.1
      cummin-dim0-1024x1024 (torch.float16)   |  193.6  |   364.2
      cummin-dim1-32x32 (torch.float16)       |  102.0  |    94.4
      cummin-dim1-128x128 (torch.float16)     |  103.0  |   109.9
      cummin-dim1-512x512 (torch.float16)     |  109.1  |   227.0
      cummin-dim1-1024x1024 (torch.float16)   |  140.5  |   985.1
      cummin-1d-100 (torch.float16)           |  101.8  |   100.7
      cummin-1d-10000 (torch.float16)         |  112.8  |   805.0
      cummin-1d-1000000 (torch.float16)       | 1343.8  | 70545.6
      cummin-dim0-32x32 (torch.float32)       |  104.6  |   102.7
      cummin-dim0-128x128 (torch.float32)     |  112.3  |   137.2
      cummin-dim0-512x512 (torch.float32)     |  146.6  |   209.7
      cummin-dim0-1024x1024 (torch.float32)   |  194.0  |   340.1
      cummin-dim1-32x32 (torch.float32)       |  100.1  |    99.2
      cummin-dim1-128x128 (torch.float32)     |  101.4  |   111.9
      cummin-dim1-512x512 (torch.float32)     |  110.3  |   250.7
      cummin-dim1-1024x1024 (torch.float32)   |  141.4  |   987.9
      cummin-1d-100 (torch.float32)           |  101.0  |   100.6
      cummin-1d-10000 (torch.float32)         |  112.9  |   794.7
      cummin-1d-1000000 (torch.float32)       | 1311.7  | 71995.3
      cummin-dim0-32x32 (torch.bfloat16)      |  105.8  |   105.9
      cummin-dim0-128x128 (torch.bfloat16)    |  111.9  |   135.7
      cummin-dim0-512x512 (torch.bfloat16)    |  147.1  |   231.9
      cummin-dim0-1024x1024 (torch.bfloat16)  |  191.2  |   327.7
      cummin-dim1-32x32 (torch.bfloat16)      |  101.8  |    91.3
      cummin-dim1-128x128 (torch.bfloat16)    |  100.2  |   108.5
      cummin-dim1-512x512 (torch.bfloat16)    |  108.9  |   222.0
      cummin-dim1-1024x1024 (torch.bfloat16)  |  140.1  |   936.9
      cummin-1d-100 (torch.bfloat16)          |  103.0  |   106.6
      cummin-1d-10000 (torch.bfloat16)        |  113.1  |   795.8
      cummin-1d-1000000 (torch.bfloat16)      | 1296.8  | 68667.4
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156794
Approved by: https://github.com/malfet
ghstack dependencies: #156860
2025-06-26 23:30:20 +00:00
9c39bc2480 [dynamo] Improve error message for cond aliasing (#156963)
See #156724

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156963
Approved by: https://github.com/zou3519, https://github.com/williamwen42
2025-06-26 23:12:00 +00:00
e6ed4074e8 update expected results (#157010)
<img width="1490" alt="Screenshot 2025-06-26 at 12 30 46 PM" src="https://github.com/user-attachments/assets/4df626d4-3010-4362-974c-fb96fa68b29f" />

<img width="904" alt="Screenshot 2025-06-26 at 12 28 29 PM" src="https://github.com/user-attachments/assets/42626892-27e1-4e69-9efc-c9baf80c5384" />

<img width="752" alt="Screenshot 2025-06-26 at 12 29 05 PM" src="https://github.com/user-attachments/assets/0b1afb30-5868-4ba6-9985-2cc7994a4227" />
PR https://github.com/pytorch/pytorch/pull/152011 added a slight regression.


Pull Request resolved: https://github.com/pytorch/pytorch/pull/157010
Approved by: https://github.com/zou3519
2025-06-26 21:56:57 +00:00
80d89974c1 [dynamo] raise hard error if error is encountered while tracing resume function prologue (#154564)
This should prevent bad resume function prologues from slipping by. In particular, graph breaks in resume function prologues will now hard error.

Implementation details:
- The resume function prologue is surrounded by `LOAD_CONST arg, STORE_FAST __is_tracing_resume_prologue` instructions. The first sequence has `arg=True` and the second sequence has `arg=False`.
- InstructionTranslator will know when it is tracing a resume function prologue when it detects `STORE_FAST __is_tracing_resume_prologue`. The top of stack will be True to mark the start of the prologue, False to mark the end.
- When `convert_frame.py` detects that an error occurred while the InstructionTranslator was tracing a resume function prologue, we will wrap the exception and hard error

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154564
Approved by: https://github.com/jansel
ghstack dependencies: #154283, #154289, #154782, #156762, #155166
2025-06-26 21:40:38 +00:00
6df6eacce8 [dynamo] handle fullgraph toggle using nested torch.compile (#155166)
See the added test for the case that this PR handles. In particular, the semantics of nested torch.compile with toggled fullgraph settings were inconsistent before: `@torch.compile(fullgraph=True)` overrides the existing fullgraph setting, while `@torch.compile(fullgraph=False)` does not.

Note that this change will add an extra frame to any inlined torch.compile'd function (which I don't expect to happen frequently).
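
A toy illustration of the situation described above (my own example, not from the PR): an inner compiled function whose fullgraph setting differs from the caller's.

```python
import torch

@torch.compile(fullgraph=False)
def inner(x):
    return x + 1

@torch.compile(fullgraph=True)
def outer(x):
    # nested torch.compile with a toggled fullgraph setting; this PR makes
    # the semantics of which setting wins consistent
    return inner(x) * 2

outer(torch.randn(4))
```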

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155166
Approved by: https://github.com/jansel
ghstack dependencies: #154283, #154289, #154782, #156762
2025-06-26 21:40:38 +00:00
dcb8982969 [dynamo] move error_on_graph_break out of config (#156762)
error_on_graph_break doesn't need to be in config, so we move it out. It should make the functorch_maml_omniglot regression less severe.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156762
Approved by: https://github.com/jansel
ghstack dependencies: #154283, #154289, #154782
2025-06-26 21:40:38 +00:00
36666033ab [dynamo] fix set_fullgraph for nested calls (#154782)
- Make the fullgraph argument of set_fullgraph a positional argument
- Fix behavior on nested calls by updating `tracer.error_on_graph_break` in more places. In particular, a tracer's `error_on_graph_break` is set to the inlined tracer's `error_on_graph_break` when the latter exits. We also track `error_on_graph_break` in the speculation log now: if we encounter a nested graph break, we restart analysis and need to remember the `error_on_graph_break` setting after attempting to run the nested function (since we don't actually trace into it during the restarted analysis).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154782
Approved by: https://github.com/jansel
ghstack dependencies: #154283, #154289
2025-06-26 21:40:38 +00:00
7b7eafe7ba [dynamo] add set_fullgraph decorator/context manager (#154289)
Implements https://github.com/pytorch/pytorch/issues/144908.
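
A hedged usage sketch of the new API (the import location below is an assumption; the PR only names `set_fullgraph` as a decorator/context manager):

```python
import torch
from torch._dynamo import set_fullgraph  # assumed import path

@torch.compile(fullgraph=True)
def fn(x):
    x = x + 1
    with set_fullgraph(False):  # graph breaks are tolerated inside this block
        x = x * 2
    return x

fn(torch.randn(4))
```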

Implementation notes:
- `set_fullgraph` is implemented using `patch_config`, which changes config correctly during runtime and tracing.
- Moved setting `config.error_on_graph_break` from convert_frame.py to eval_frame.py. This is because this should only be done at the top-level decorated function. If we kept this in convert_frame.py, we would be changing `config.error_on_graph_break` on every top-level frame, which causes confusing behavior (see added test for example).
- InstructionTranslator reads from `config.error_on_graph_break` every `step()`. This is to determine the value of `config.error_on_graph_break` at the time of the graph break, because tracer cleanup will restore the value of `config.error_on_graph_break`.
- `convert_frame.py` determines whether we should abort tracing (fullgraph=True) or continue (fullgraph=False) by reading the value of the tracer's `error_on_graph_break`. If there is no tracer (failed to initialize), then default to reading `config.error_on_graph_break`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154289
Approved by: https://github.com/jansel, https://github.com/zou3519
ghstack dependencies: #154283
2025-06-26 21:40:38 +00:00
1c3f5e902d [dynamo] control one_graph behavior additionally through config (#154283)
`torch.compile` now always goes through `torch._dynamo._optimize`. fullgraph is now implemented in `torch.compile` by looking at `config.error_on_graph_break`. Export still goes through `torch._dynamo._optimize_assert`, which uses `tx.one_graph` instead of `config.error_on_graph_break`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154283
Approved by: https://github.com/jansel, https://github.com/anijain2305
2025-06-26 21:40:38 +00:00
fc10d4b1d6 [SymmMem] Allow selection of allocation backend (#156661)
Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at bottom):

Today the only way to choose the allocation backend is via the env var `TORCH_SYMMMEM=...`.
This is hard to set on a per-test-file basis in CI (the env var has to be set before the program is loaded).

This PR added a programmatic way -- a `set_backend` API.

Implementation:
Since this API is slightly more dynamic than static registration, at static time each backend registers its availability rather than filling itself as **the** allocator directly. Later when `set_backend` is called, the allocator would actually fill in the device-to-allocation `map_`.

Though added, `set_backend` is **not** a necessary API for the user to call -- one backend is still registered as the default at static time. A programmatic usage sketch follows.
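
A hedged sketch of the programmatic selection (the module path below is an assumption; the commit only names the `set_backend` API and the `TORCH_SYMMMEM` env var):

```python
import torch.distributed._symmetric_memory as symm_mem  # assumed module path

# Per-process alternative to setting TORCH_SYMMMEM=NVSHMEM before launch
symm_mem.set_backend("NVSHMEM")
```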

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156661
Approved by: https://github.com/ngimel, https://github.com/fduwjj
2025-06-26 21:37:44 +00:00
262654ee51 [nativert] move constantfolder to libtorch (#156918)
Summary: att -- unit tests will be migrated later, since they still have unresolved deps.

Test Plan:
ci

Rollback Plan:

Differential Revision: D77159278

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156918
Approved by: https://github.com/henryoier, https://github.com/zhxchen17
2025-06-26 21:26:37 +00:00
7f6e7103a3 Convert to markdown: jit_python_reference.rst, jit_unsupported.rst, jit_utils.rst, library.rst (#155404)
Fixes #155024

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155404
Approved by: https://github.com/svekars
2025-06-26 21:09:46 +00:00
aff9c1eec5 [aoti][mps] Add fused_rms and sdpa_mps fallback ops (#156844)
Needed for llama3.1

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156844
Approved by: https://github.com/desertfire
ghstack dependencies: #156843
2025-06-26 21:03:05 +00:00
17dab018e3 [aoti][mps] Fix deduplication of kernels (#156843)
Previously I was not correctly deduplicating kernels generated by MPS, so it would generate multiple copies of the same kernel.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156843
Approved by: https://github.com/desertfire
2025-06-26 21:03:05 +00:00
977abe786d fix 'register_forward_pre_hook not supported on ScriptModule' error (#156904)
Summary:
Encountered a 'register_forward_pre_hook not supported on ScriptModule' error when trying to publish CFR MTML with the remote_ro module placed in remote. The issue may come from the fact that the local net from torchArrow is already a ScriptModule before the gen_app_graph pass.
{F1979770267}

Test Plan:
hg checkout 1ff14dfaade4ac1f3cbbf38fbd72f7fdd5cdcd16
bash hstu_blocker.sh

Rollback Plan:

Reviewed By: RenfeiChen-FB

Differential Revision: D77341370

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156904
Approved by: https://github.com/jingsh
2025-06-26 20:59:24 +00:00
81759afed4 [nativert] clean up some migration side-effects (#156919)
Summary: explicit torch::nativert namespace usage + // manual declarations

Test Plan:
ci

Rollback Plan:

Differential Revision: D77328855

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156919
Approved by: https://github.com/zhxchen17
2025-06-26 20:28:32 +00:00
b6e625e34f [SymmMem] Remove redundant dist.barrier in Triton NVSHMEM tests & add device‐side signal_op support (#156684)
## Summary

This PR removes unnecessary `dist.barrier` calls up in our Triton NVSHMEM test suite and adds signal_op support, which is a lightweight device-side signaling mechanism. Added test for this in our `wait_until` kernel and corresponding `core.extern` wrapper.

**Why did we drop the `dist.barrier()` calls?**
We dropped the host‐side dist.barrier() in all Triton NVSHMEM tests (except the raw put/get cases) because every other test already uses NVSHMEM collectives or device‐side sync primitives (fence/quiet/signal/wait), making the extra barrier redundant. This keeps synchronization entirely on the GPU and leverages NVSHMEM’s native ordering guarantees for clearer, more efficient tests.

**`test_triton_wait_until` update**
- **Rank 1**: after `put_kernel` writes the data, launches `signal_op_kernel` to atomically set Rank 0's flag via `nvshmemx_signal_op`
- **Rank 0**: drops its old `dist.barrier()` and simply calls `wait_until_kernel` to spin-wait on the device flag, then asserts data correctness
- Changes made per [this comment](https://github.com/pytorch/pytorch/pull/156472#discussion_r2159734046)

## Testing

```bash
TORCH_SYMMMEM=NVSHMEM python test/distributed/test_nvshmem.py
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156684
Approved by: https://github.com/kwen2501, https://github.com/mandroid6
2025-06-26 20:16:06 +00:00
43a09189c6 [MPS] Add benchmark for scan with indices (#156860)
Baseline performance on M4 Max 64GB (macOS 15.5):
```
[--------------------------------  --------------------------------]
                                              |   eager   |  compile
1 threads: ---------------------------------------------------------
      cummin-dim0-32x32 (torch.float16)       |    102.5  |    115.0
      cummin-dim0-128x128 (torch.float16)     |    133.6  |    147.8
      cummin-dim0-512x512 (torch.float16)     |    233.1  |    243.1
      cummin-dim0-1024x1024 (torch.float16)   |    364.2  |    385.2
      cummin-dim1-32x32 (torch.float16)       |     94.4  |    109.8
      cummin-dim1-128x128 (torch.float16)     |    109.9  |    122.5
      cummin-dim1-512x512 (torch.float16)     |    227.0  |    233.8
      cummin-dim1-1024x1024 (torch.float16)   |    985.1  |   1010.5
      cummin-1d-100 (torch.float16)           |    100.7  |    114.3
      cummin-1d-10000 (torch.float16)         |    805.0  |    879.1
      cummin-1d-1000000 (torch.float16)       |  70545.6  |  71310.3
      cummin-dim0-32x32 (torch.float32)       |    102.7  |    115.5
      cummin-dim0-128x128 (torch.float32)     |    137.2  |    143.8
      cummin-dim0-512x512 (torch.float32)     |    209.7  |    222.0
      cummin-dim0-1024x1024 (torch.float32)   |    340.1  |    389.9
      cummin-dim1-32x32 (torch.float32)       |     99.2  |    107.8
      cummin-dim1-128x128 (torch.float32)     |    111.9  |    119.3
      cummin-dim1-512x512 (torch.float32)     |    250.7  |    255.1
      cummin-dim1-1024x1024 (torch.float32)   |    987.9  |   1013.2
      cummin-1d-100 (torch.float32)           |    100.6  |    114.6
      cummin-1d-10000 (torch.float32)         |    794.7  |    862.2
      cummin-1d-1000000 (torch.float32)       |  71995.3  |  71963.5
      cummin-dim0-32x32 (torch.bfloat16)      |    105.9  |    113.9
      cummin-dim0-128x128 (torch.bfloat16)    |    135.7  |    147.9
      cummin-dim0-512x512 (torch.bfloat16)    |    231.9  |    240.7
      cummin-dim0-1024x1024 (torch.bfloat16)  |    327.7  |    366.9
      cummin-dim1-32x32 (torch.bfloat16)      |     91.3  |    103.3
      cummin-dim1-128x128 (torch.bfloat16)    |    108.5  |    117.4
      cummin-dim1-512x512 (torch.bfloat16)    |    222.0  |    233.6
      cummin-dim1-1024x1024 (torch.bfloat16)  |    936.9  |    982.5
      cummin-1d-100 (torch.bfloat16)          |    106.6  |    112.4
      cummin-1d-10000 (torch.bfloat16)        |    795.8  |    819.6
      cummin-1d-1000000 (torch.bfloat16)      |  68667.4  |  68557.9

Times are in microseconds (us).
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156860
Approved by: https://github.com/malfet
2025-06-26 18:44:16 +00:00
9fe2d156a9 Revert "[dynamo] fix segfault due to dangling CacheEntry backend pointer (#156527)"
This reverts commit 5ad2bee2c8a7defd2580bb138145a49c37146fcc.

Reverted https://github.com/pytorch/pytorch/pull/156527 on behalf of https://github.com/Camyll due to failing test assertions ([comment](https://github.com/pytorch/pytorch/pull/156527#issuecomment-3009231797))
2025-06-26 17:32:34 +00:00
13efb2c858 [BE] Deprecate search_autotune_cache (#155302)
We haven't had the offline cache populated in > 1 year, so this *should* be safe; if this passes, we can finally go through and rip out the offline cache logic.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155302
Approved by: https://github.com/masnesral
2025-06-26 17:30:08 +00:00
039a1ce0eb [BE] Remove CXX11_ABI references from cpp_builder.py (#156896)
As all Linux builds are CXX11_ABI compatible at this point

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156896
Approved by: https://github.com/desertfire, https://github.com/jansel
2025-06-26 17:28:01 +00:00
e15ea965a1 remove guard_size_oblivious from unbind. (#148815)
unbind always specializes on the size of `dim`, because that size determines the number of output tensors.
guard_size_oblivious is not useful there and is probably just confusing for readers of the code.
Added a comment and a test that verifies the specialization; a small illustration follows.
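
A small illustration of the specialization (my own example): the number of outputs of `unbind` is exactly the size of the unbound dimension, so that size must be a concrete integer.

```python
import torch

x = torch.randn(3, 4)
outs = torch.unbind(x, dim=0)
assert len(outs) == x.size(0) == 3  # the output count is determined by size(0)
```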
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148815
Approved by: https://github.com/pianpwk
2025-06-26 17:16:32 +00:00
61eaaa21a4 Better error message when no .so/cpp files are found (#156863)
Summary:
Sample error message:

```
RuntimeError: Failed to find a generated cpp file or so file for model 'forward' in the zip archive.

Available models in the archive:
model

To load a specific model, please provide its name using the `model_name` parameter when calling AOTIModelPackageLoader()  or torch._inductor.package.load_package.

The following files were loaded from the archive:
c7l7jkswdq7ud6gpvpmunx76hi3c357l7epyc7oofeemzeoy7euo.wrapper/data/aotinductor/model/cqdxv6zki2oiiytjeqrg774uxlxgqdemhdxn5dycn4nnc3rmcd7w.cubin
c7l7jkswdq7ud6gpvpmunx76hi3c357l7epyc7oofeemzeoy7euo.wrapper/data/aotinductor/model/c7l7jkswdq7ud6gpvpmunx76hi3c357l7epyc7oofeemzeoy7euo.wrapper.cpp
c7l7jkswdq7ud6gpvpmunx76hi3c357l7epyc7oofeemzeoy7euo.wrapper/data/aotinductor/model/ctmp7adn3spwyscdotllyj4yx3vrqcnxk3thkpgdcax7zvqmyyp3.kernel.cpp
c7l7jkswdq7ud6gpvpmunx76hi3c357l7epyc7oofeemzeoy7euo.wrapper/data/aotinductor/model/c7l7jkswdq7ud6gpvpmunx76hi3c357l7epyc7oofeemzeoy7euo.wrapper_metadata.json
c7l7jkswdq7ud6gpvpmunx76hi3c357l7epyc7oofeemzeoy7euo.wrapper/data/aotinductor/model/ctmp7adn3spwyscdotllyj4yx3vrqcnxk3thkpgdcax7zvqmyyp3.kernel_metadata.json
c7l7jkswdq7ud6gpvpmunx76hi3c357l7epyc7oofeemzeoy7euo.wrapper/data/aotinductor/model/c7l7jkswdq7ud6gpvpmunx76hi3c357l7epyc7oofeemzeoy7euo.wrapper.so
c7l7jkswdq7ud6gpvpmunx76hi3c357l7epyc7oofeemzeoy7euo.wrapper/archive_format
c7l7jkswdq7ud6gpvpmunx76hi3c357l7epyc7oofeemzeoy7euo.wrapper/archive_version
c7l7jkswdq7ud6gpvpmunx76hi3c357l7epyc7oofeemzeoy7euo.wrapper/.data/version
c7l7jkswdq7ud6gpvpmunx76hi3c357l7epyc7oofeemzeoy7euo.wrapper/byteorder
c7l7jkswdq7ud6gpvpmunx76hi3c357l7epyc7oofeemzeoy7euo.wrapper/.data/serialization_id

```

Test Plan:
```
buck2 run @//mode/dev-nosan //caffe2/test/inductor:aot_inductor_package -- -r "test_loading_wrong_model"
```

Rollback Plan:

Differential Revision: D77320485

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156863
Approved by: https://github.com/tugsbayasgalan
2025-06-26 17:13:29 +00:00
21990fbad9 Revert "[cond] support gen_schema for cond (#154193)"
This reverts commit 6de41ce0f899604c3f8b33e1f8d37eb89b3a963e.

Reverted https://github.com/pytorch/pytorch/pull/154193 on behalf of https://github.com/Camyll due to issue landing internally, discussed with Yidi offline ([comment](https://github.com/pytorch/pytorch/pull/154193#issuecomment-3009160081))
2025-06-26 17:10:00 +00:00
c808af514d Support deterministic upsample trilinear backward (#154239)
Fixes https://github.com/pytorch/pytorch/issues/154183
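
A hedged sketch of exercising the newly deterministic path (my own example, assuming a CUDA device):

```python
import torch

torch.use_deterministic_algorithms(True)
x = torch.randn(1, 1, 4, 4, 4, device="cuda", requires_grad=True)
y = torch.nn.functional.interpolate(
    x, scale_factor=2, mode="trilinear", align_corners=False
)
y.sum().backward()  # the backward pass now has a deterministic implementation
```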
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154239
Approved by: https://github.com/eellison, https://github.com/albanD
2025-06-26 15:02:27 +00:00
2f94f69b7c [aotd] Support mutations of the same input in fw and bw (#155354)
Original issue: https://github.com/pytorch/pytorch/issues/154820

The issue happens when there is a mutation for the same input in forward AND in backward.

AOTD emitted the copy_ after joint_function tracing. This made that single fx node correspond to the side effects of both mutations (the one in forward and the one in backward).
After that, the partitioner could place it either in forward or in backward.
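
A toy, eager-runnable illustration of the pattern at issue (my own sketch, not the original repro from the linked issue): the same input tensor is mutated both in forward and in backward.

```python
import torch

class MutateTwice(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, state):
        state.add_(1)                  # mutation of `state` in forward
        ctx.save_for_backward(state)
        return x * state

    @staticmethod
    def backward(ctx, grad_out):
        (state,) = ctx.saved_tensors
        state.add_(1)                  # mutation of the same `state` in backward
        return grad_out * state, None

x = torch.randn(4, requires_grad=True)
state = torch.zeros(4)
MutateTwice.apply(x, state).sum().backward()
```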

The fix:

1/ Introduce `joint_function.handle`, which allows setting a "post_forward" callback so that we can inspect the inputs' state after forward.

We do not want to apply the mutation after the joint graph if we have already applied it in forward. For that we need a "mutation_counter" and to memorize the mutation version that was applied for the forward mutation.

2/ Expose mutation_counter to Python.

We want to keep the invariant that copy_ exists only at the end of the joint graph.

3/ Memorize mutation_counter and the state of the inputs after forward, using the post_forward handle.
Emit post_forward mutations after the joint graph is fully traced.

Add a "must_be_in_forward" tag for post_forward mutations (similar to the existing "must_be_in_backward") to keep them in forward.

4/ Ban recompute of the source of a mutation. Recompute can apply the same op (e.g. add) in forward and backward.
For this, set MUST_SAVE for the source of the mutation in forward.

proxy_tensor changes:

By default, proxy_tensor updates the tensor_tracker, in which case applied mutations would be chained.
But we want this copy_ to be independent and applied just to primals.
For this, introduce a context manager that can disable updates to the tensor_tracker when adding forward mutations.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155354
Approved by: https://github.com/bdhirsh
2025-06-26 14:05:54 +00:00
197c1869f5 [Inductor][CLN] Remove unused default configs in flex_attention.py (#156700)
They probably became unusable after 03023f178c

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156700
Approved by: https://github.com/jataylo, https://github.com/drisspg
2025-06-26 13:24:09 +00:00
2d39a48d52 Fix silent incorrectness arising from incorrect alias information (#152011)
Fixes #136662

There are two problems:
1) canonicalize_view_scatter_ops adds some new nodes into the graph.
   These new nodes cause the alias info on the graph to be wrong. To fix
   this, we try to run FakeTensorUpdater on the graph again.
2) FakeTensorUpdater's alias information is wrong. It tries to skip
   nodes that it thinks have "equivalent" FakeTensor metadata.
   It should not be allowed to do this if any users of the node can
   alias the node. The example
   is if we have `x = foo(...); y = x.view(...)`. If the user replaces
   `foo` with a new `bar` node and sets bar.meta["val"] correctly, then
   FakeTensorUpdater still needs to update y's meta["val"] to be a view
   of the new bar node.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152011
Approved by: https://github.com/yf225
2025-06-26 13:05:08 +00:00
53e0b9c393 refine fp32 precision api (#125888)
Based on the [conversation](https://github.com/pytorch/pytorch/issues/121791), we plan to drop "highest, high, medium" as the way to represent fp32 internal computation data types. Instead, we will directly use the algorithm name to represent it.

### Design Choice: Directly use algorithms name like "TF32", "BF16".
#### Pros
 - The names are more informative. 'tf32' is more informative than a simple "high".
 - Easier to extend to new algorithms like `tf32x3`
#### Cons
 - "HIGHEST, HIGH, MEDIUM" indicated the relative precision between different algorithms. However, we can have more documents to discuss them.

### We provide a layered structure for backends/operators.
('f32' is short for 'fp32_precision')
![image](https://github.com/user-attachments/assets/f89143e5-d6a1-4865-9351-9a50439f5067)

### We provide 3 fp32 compute precisions that can be set:
 - **"ieee"**: Not allowed to use any other internal computation data type.
 - **"tf32"**: Allowed to use tf32 as the internal computation data type.
 - **"bf16"**: Allowed to use bf16 as the internal computation data type.
 - **"none"**: Precision is not set; it can be overridden by its parent node.

### Overriding Precision Settings
A child node can be overridden by its parent node if it is set to the default.
The current default settings are:
```
backend = generic, op = all, precision setting = none
    backend = cuda, op = all, precision setting = none
        backend = cuda, op = conv, precision setting = tf32
        backend = cuda, op = rnn, precision setting = tf32
        backend = cuda, op = matmul, precision setting = none
    backend = matmul, op = all, precision setting = none
        backend = matmul, op = conv, precision setting = none
        backend = matmul, op = rnn, precision setting = none
        backend = matmul, op = matmul, precision setting = none
```
 - If the user sets `torch.backends.mkldnn.fp32_precision="bf16"`, its child nodes `torch.backends.mkldnn.matmul.fp32_precision` / `torch.backends.mkldnn.conv.fp32_precision` / `torch.backends.mkldnn.rnn.fp32_precision` will also be overridden to "bf16".
 - If the user sets `torch.backends.fp32_precision="bf16"`, `torch.backends.mkldnn.fp32_precision` and its child nodes will also be overridden to "bf16" (see the sketch after this list).
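
A short sketch of how these settings compose, using only attribute names mentioned above (intended usage, not a definitive reference):

```python
import torch

# Parent setting: children still set to "none" inherit it, so
# mkldnn matmul/conv/rnn effectively resolve to "bf16".
torch.backends.mkldnn.fp32_precision = "bf16"

# Per-op overrides on the cuDNN side.
torch.backends.cudnn.conv.fp32_precision = "tf32"  # allow TF32 for convolutions
torch.backends.cudnn.rnn.fp32_precision = "ieee"   # strict fp32 for RNNs
```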

### Backward Compatible
Since the new API allows more fine-grained control, there can be conflicts. For example, the previous `torch.backends.cudnn.allow_tf32` is not enough to represent a state where `torch.backends.cudnn.rnn.fp32_precision="ieee"` and `torch.backends.cudnn.conv.fp32_precision="tf32"`. Therefore, our goals for backward compatibility are:
 - If the user only uses the previous APIs, they work as previously expected.
 - If the user uses the **new** API to change the state to one that is **not representable** by the old API, and then tries to read that state via the **old** API, we raise a RuntimeError and point the user to the documentation.

### Test Plan
```
python test/test_cuda.py -k test_fp32_precision_with_tf32
python test/test_cuda.py -k test_fp32_precision_with_float32_matmul_precision
python test/test_cuda.py -k test_invalid_status_for_legacy_api
python test/test_mkldnn.py -k test_mlkdnn_get_set
python test/test_mkldnn.py -k test_generic_precision
python test/test_mkldnn.py -k test_invalid
python test/test_mkldnn.py -k test_default_use_parent
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125888
Approved by: https://github.com/jgong5, https://github.com/albanD

Co-authored-by: Jiang, Yanbing <yanbing.jiang@intel.com>
2025-06-26 10:32:20 +00:00
de45c5f673 [aarch64] Add back NCCL lib to cuda arm wheel (#156888)
We discovered that the latest 12.9 arm nightly wheel is missing the NCCL lib when imported. With USE_SYSTEM_NCCL=1, we need to copy libnccl.so into our big wheel environment so that it can be dynamically linked at runtime.

https://github.com/pytorch/pytorch/pull/152835 enabled USE_SYSTEM_NCCL=1, which would use the system NCCL by default, and it would no longer use the one built from libtorch_cuda.so. With this PR, we add back the libnccl.so to be used at runtime. In this way, we also provide the flexibility to use different versions of NCCL from what came with the original pytorch build.

related - https://github.com/pytorch/pytorch/issues/144768

```
Python 3.12.3 (main, Jun 18 2025, 17:59:45) [GCC 13.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.12/dist-packages/torch/__init__.py", line 417, in <module>
    from torch._C import *  # noqa: F403
    ^^^^^^^^^^^^^^^^^^^^^^
ImportError: libnccl.so.2: cannot open shared object file: No such file or directory
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156888
Approved by: https://github.com/atalman
2025-06-26 10:24:18 +00:00
18b01afa9e load inline user overridable gencode (#156850)
Fixes https://github.com/pytorch/pytorch/issues/156815

As far as testing goes
* I tried to use cuobjdump, but that was kinda goofy (bccd9393a5); the problem was that the name of the cubin always carries a single gencode
* Another idea was to read stderr and check that the right number of gencodes is present (0beadc01b3); this helped a lot to convince me locally that the test works, but while it passed on my dev GPU it failed in CI, and I suspect a bad interaction with subprocesses
* The last approach was a simpler unit test that checks which flags get added by default; this is not as comprehensive as the previous ideas, but it works and is fast, so I'll opt for it, since I'm convinced testing is working per my own experiments and customer reports

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156850
Approved by: https://github.com/malfet
2025-06-26 10:15:08 +00:00
bbf1a6feac Add dist_info to non-building setup.py commands (#156709)
This adds the `dist_info` command to the list of non-building commands of `setup.py`, which avoids the current situation where simple metadata generation with any packaging tool already triggers a build.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156709
Approved by: https://github.com/Skylion007
2025-06-26 08:38:39 +00:00
455dfd2589 Fix macOS build with USE_MPS=OFF (#156847)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156847
Approved by: https://github.com/angelayi
2025-06-26 07:15:41 +00:00
50b2069b61 Move out super large one off foreach_copy test (#156876)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156876
Approved by: https://github.com/albanD, https://github.com/jeffdaily
2025-06-26 06:02:38 +00:00
dfc31b3345 [BE] comments + try to get rid of secondary make_autotune_fn (#156358)
Not sure this will work, but let's try it on the unit tests. The only thing I am worried about is the counters drifting off from their true values, so let the unit tests check that.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156358
Approved by: https://github.com/masnesral
2025-06-26 05:54:01 +00:00
0d01bafc34 remove gso from set_storage_meta__symint (#156525)
We already check that inputs are hinted; I don't see the value of keeping it here.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156525
Approved by: https://github.com/pianpwk
2025-06-26 05:42:05 +00:00
127695eb5c ci: Add ciflow trigger for build-triton-wheel (#156893)
Signed-off-by: Eli Uriegas <eliuriegas@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156893
Approved by: https://github.com/malfet
2025-06-26 04:38:38 +00:00
0a16818d5b [OpenReg] Remove the unit.skip for test_serialization (#156804)
This bug was fixed by this [PR](https://github.com/pytorch/pytorch/pull/147095).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156804
Approved by: https://github.com/albanD
ghstack dependencies: #156588, #156589
2025-06-26 03:59:50 +00:00
98e594b565 [OpenReg][2/N] Migrate cpp_extensions_open_device_registration to OpenReg (#156589)
----

- serialization
- dlpack

**Next Steps**:

- The rest of `test/test_cpp_extensions_open_device_registration.py` is about the fallback mechanism. In order to keep it consistent with other accelerator usage (C++ registration), the implementation of OpenReg needs to be refactored:

    * Simulate multiple device memory in a single process (a brief RFC will be submitted this week)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156589
Approved by: https://github.com/albanD
ghstack dependencies: #156588
2025-06-26 03:59:50 +00:00
a730c65fe3 [OpenReg][1/N] Migrate cpp_extensions_open_device_registration to OpenReg (#156588)
----

- fake tensor
- named tensor
- custom autograd function
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156588
Approved by: https://github.com/albanD
2025-06-26 03:59:50 +00:00
4585c33e74 [symm_mem] Fix nccl test for symm mem (#156752)
Try not to call set_device. Fixes #156569

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156752
Approved by: https://github.com/kwen2501
2025-06-26 02:59:38 +00:00
7521cd9111 [BE] Typo fix (#156836)
Signed-off-by: Edward Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156836
Approved by: https://github.com/albanD, https://github.com/jingsh, https://github.com/Skylion007
ghstack dependencies: #156830, #156831
2025-06-26 02:48:55 +00:00
68e023cbbb [BE] Add missing type for storage dict (#156831)
For some reason, this one always bleats when I run mypy on OSX, so shut it up.

Signed-off-by: Edward Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156831
Approved by: https://github.com/mikaylagawarecki, https://github.com/atalman, https://github.com/malfet
ghstack dependencies: #156830
2025-06-26 02:48:55 +00:00
df9e5a276b [BE] Add type and docs for _process_export_inputs (#156830)
Done using claude code and manual review.

Signed-off-by: Edward Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156830
Approved by: https://github.com/tugsbayasgalan, https://github.com/malfet
2025-06-26 02:48:55 +00:00
81bf278537 [cutlass] rename cutlass python lib to python-cutlass (#156655)
Differential Revision: [D77173366](https://our.internmc.facebook.com/intern/diff/D77173366/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156655
Approved by: https://github.com/Skylion007
2025-06-26 02:47:14 +00:00
8da774d81f [ez] Add docblock for SchedulerNode.codegen (#156718)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156718
Approved by: https://github.com/BoyuanFeng
ghstack dependencies: #156466, #156445, #156625, #156717
2025-06-26 02:43:50 +00:00
78ee2ee90e Fix environment and push env var for docker image builds for binary builds (#156910)
Changes WITH_PUSH and the environment check to be OK with giving credentials to push to docker.io if it's on the main branch, a tag starting with v, or the release branch.

Credentials for pushing to docker io are in the environment, so without the environment, you can't push to docker io.  You also don't do the push unless WITH_PUSH is true

Binary builds on the release branch were failing because they pull from docker.io, but the docker build wasn't pushing to docker.io, because it was either on the release branch (which didn't have credentials, https://github.com/pytorch/pytorch/actions/runs/15888166271/job/44813180986) or on the tag (which doesn't have WITH_PUSH).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156910
Approved by: https://github.com/atalman
2025-06-26 02:06:57 +00:00
5db9a2b54a [BE] Install Helion without dependencies (#156706)
After: https://github.com/pytorch/pytorch/pull/155513
Please see comment: https://github.com/pytorch/pytorch/pull/155513#issuecomment-2998085740

Here are the logs: https://github.com/pytorch/pytorch/actions/runs/15838529400/job/44646874281?pr=156664#step:6:16372

Looks like the current workflow is:
1. Build triton - triton-3.4.0+git5389ed79-cp310-cp310-linux_x86_64.whl
2. Install Helion - overwrite triton with production 3.3.1 and install production torch
3. Reinstall triton as the final docker build step - triton-3.4.0+git5389ed79-cp310-cp310-linux_x86_64.whl

This makes things somewhat messy since we install both torch and triton from prod, which is something we want to avoid when building the underlying docker images for CI.

Log:
```
#55 311.4 + pip_install helion
#55 311.4 + as_jenkins conda run -n py_3.10 pip install --progress-bar off helion
#55 311.4 + sudo -E -H -u jenkins env -u SUDO_UID -u SUDO_GID -u SUDO_COMMAND -u SUDO_USER env PATH=/usr/local/nvidia/bin:/usr/local/cuda/bin:/opt/conda/envs/py_3.10/bin:/opt/conda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin LD_LIBRARY_PATH= conda run -n py_3.10 pip install --progress-bar off helion
#55 393.6 Collecting helion
#55 393.6   Downloading helion-0.0.7-py3-none-any.whl.metadata (14 kB)
#55 393.6 Collecting filecheck (from helion)
#55 393.6   Downloading filecheck-1.0.2-py3-none-any.whl.metadata (5.8 kB)
#55 393.6 Collecting torch>=2.7.0 (from helion)
#55 393.6   Downloading torch-2.7.1-cp310-cp310-manylinux_2_28_x86_64.whl.metadata (29 kB)
#55 393.6 Requirement already satisfied: typing-extensions>=4.0.0 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages (from helion) (4.14.0)
#55 393.6 Requirement already satisfied: filelock in /opt/conda/envs/py_3.10/lib/python3.10/site-packages (from torch>=2.7.0->helion) (3.18.0)
#55 393.6 Requirement already satisfied: sympy>=1.13.3 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages (from torch>=2.7.0->helion) (1.13.3)
#55 393.6 Requirement already satisfied: networkx in /opt/conda/envs/py_3.10/lib/python3.10/site-packages (from torch>=2.7.0->helion) (2.8.8)
#55 393.6 Requirement already satisfied: jinja2 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages (from torch>=2.7.0->helion) (3.1.6)
#55 393.6 Requirement already satisfied: fsspec in /opt/conda/envs/py_3.10/lib/python3.10/site-packages (from torch>=2.7.0->helion) (2025.5.1)
#55 393.6 Collecting nvidia-cuda-nvrtc-cu12==12.6.77 (from torch>=2.7.0->helion)
#55 393.6   Downloading nvidia_cuda_nvrtc_cu12-12.6.77-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
#55 393.6 Collecting nvidia-cuda-runtime-cu12==12.6.77 (from torch>=2.7.0->helion)
#55 393.6   Downloading nvidia_cuda_runtime_cu12-12.6.77-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.metadata (1.5 kB)
#55 393.6 Collecting nvidia-cuda-cupti-cu12==12.6.80 (from torch>=2.7.0->helion)
#55 393.6   Downloading nvidia_cuda_cupti_cu12-12.6.80-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.metadata (1.6 kB)
#55 393.6 Collecting nvidia-cudnn-cu12==9.5.1.17 (from torch>=2.7.0->helion)
#55 393.6   Downloading nvidia_cudnn_cu12-9.5.1.17-py3-none-manylinux_2_28_x86_64.whl.metadata (1.6 kB)
#55 393.6 Collecting nvidia-cublas-cu12==12.6.4.1 (from torch>=2.7.0->helion)
#55 393.6   Downloading nvidia_cublas_cu12-12.6.4.1-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.metadata (1.5 kB)
#55 393.6 Collecting nvidia-cufft-cu12==11.3.0.4 (from torch>=2.7.0->helion)
#55 393.6   Downloading nvidia_cufft_cu12-11.3.0.4-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.metadata (1.5 kB)
#55 393.6 Collecting nvidia-curand-cu12==10.3.7.77 (from torch>=2.7.0->helion)
#55 393.6   Downloading nvidia_curand_cu12-10.3.7.77-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.metadata (1.5 kB)
#55 393.6 Collecting nvidia-cusolver-cu12==11.7.1.2 (from torch>=2.7.0->helion)
#55 393.6   Downloading nvidia_cusolver_cu12-11.7.1.2-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.metadata (1.6 kB)
#55 393.6 Collecting nvidia-cusparse-cu12==12.5.4.2 (from torch>=2.7.0->helion)
#55 393.6   Downloading nvidia_cusparse_cu12-12.5.4.2-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.metadata (1.6 kB)
#55 393.6 Collecting nvidia-cusparselt-cu12==0.6.3 (from torch>=2.7.0->helion)
#55 393.6   Downloading nvidia_cusparselt_cu12-0.6.3-py3-none-manylinux2014_x86_64.whl.metadata (6.8 kB)
#55 393.6 Collecting nvidia-nccl-cu12==2.26.2 (from torch>=2.7.0->helion)
#55 393.6   Downloading nvidia_nccl_cu12-2.26.2-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.metadata (2.0 kB)
#55 393.6 Collecting nvidia-nvtx-cu12==12.6.77 (from torch>=2.7.0->helion)
#55 393.6   Downloading nvidia_nvtx_cu12-12.6.77-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.metadata (1.6 kB)
#55 393.6 Collecting nvidia-nvjitlink-cu12==12.6.85 (from torch>=2.7.0->helion)
#55 393.6   Downloading nvidia_nvjitlink_cu12-12.6.85-py3-none-manylinux2010_x86_64.manylinux_2_12_x86_64.whl.metadata (1.5 kB)
#55 393.6 Collecting nvidia-cufile-cu12==1.11.1.6 (from torch>=2.7.0->helion)
#55 393.6   Downloading nvidia_cufile_cu12-1.11.1.6-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.metadata (1.5 kB)
#55 393.6 Collecting triton==3.3.1 (from torch>=2.7.0->helion)
#55 393.6   Downloading triton-3.3.1-cp310-cp310-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl.metadata (1.5 kB)
#55 393.6 Requirement already satisfied: setuptools>=40.8.0 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages (from triton==3.3.1->torch>=2.7.0->helion) (80.9.0)
#55 393.6 Requirement already satisfied: mpmath<1.4,>=1.1.0 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages (from sympy>=1.13.3->torch>=2.7.0->helion) (1.3.0)
#55 393.6 Requirement already satisfied: MarkupSafe>=2.0 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages (from jinja2->torch>=2.7.0->helion) (3.0.2)
#55 393.6 Downloading helion-0.0.7-py3-none-any.whl (149 kB)
#55 393.6 Downloading torch-2.7.1-cp310-cp310-manylinux_2_28_x86_64.whl (821.2 MB)
#55 393.6 Downloading nvidia_cublas_cu12-12.6.4.1-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (393.1 MB)
#55 393.6 Downloading nvidia_cuda_cupti_cu12-12.6.80-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (8.9 MB)
#55 393.6 Downloading nvidia_cuda_nvrtc_cu12-12.6.77-py3-none-manylinux2014_x86_64.whl (23.7 MB)
#55 393.6 Downloading nvidia_cuda_runtime_cu12-12.6.77-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (897 kB)
#55 393.6 Downloading nvidia_cudnn_cu12-9.5.1.17-py3-none-manylinux_2_28_x86_64.whl (571.0 MB)
#55 393.6 Downloading nvidia_cufft_cu12-11.3.0.4-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (200.2 MB)
#55 393.6 Downloading nvidia_cufile_cu12-1.11.1.6-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (1.1 MB)
#55 393.6 Downloading nvidia_curand_cu12-10.3.7.77-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (56.3 MB)
#55 393.6 Downloading nvidia_cusolver_cu12-11.7.1.2-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (158.2 MB)
#55 393.6 Downloading nvidia_cusparse_cu12-12.5.4.2-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (216.6 MB)
#55 393.6 Downloading nvidia_cusparselt_cu12-0.6.3-py3-none-manylinux2014_x86_64.whl (156.8 MB)
#55 393.6 Downloading nvidia_nccl_cu12-2.26.2-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (201.3 MB)
#55 393.6 Downloading nvidia_nvjitlink_cu12-12.6.85-py3-none-manylinux2010_x86_64.manylinux_2_12_x86_64.whl (19.7 MB)
#55 393.6 Downloading nvidia_nvtx_cu12-12.6.77-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (89 kB)
#55 393.6 Downloading triton-3.3.1-cp310-cp310-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl (155.6 MB)
#55 393.6 Downloading filecheck-1.0.2-py3-none-any.whl (23 kB)
#55 393.6 Installing collected packages: nvidia-cusparselt-cu12, triton, nvidia-nvtx-cu12, nvidia-nvjitlink-cu12, nvidia-nccl-cu12, nvidia-curand-cu12, nvidia-cufile-cu12, nvidia-cuda-runtime-cu12, nvidia-cuda-nvrtc-cu12, nvidia-cuda-cupti-cu12, nvidia-cublas-cu12, filecheck, nvidia-cusparse-cu12, nvidia-cufft-cu12, nvidia-cudnn-cu12, nvidia-cusolver-cu12, torch, helion
#55 393.6   Attempting uninstall: triton
#55 393.6     Found existing installation: triton 3.4.0+git5389ed79
#55 393.6     Uninstalling triton-3.4.0+git5389ed79:
#55 393.6       Successfully uninstalled triton-3.4.0+git5389ed79
#55 393.6 Successfully installed filecheck-1.0.2 helion-0.0.7 nvidia-cublas-cu12-12.6.4.1 nvidia-cuda-cupti-cu12-12.6.80 nvidia-cuda-nvrtc-cu12-12.6.77 nvidia-cuda-runtime-cu12-12.6.77 nvidia-cudnn-cu12-9.5.1.17 nvidia-cufft-cu12-11.3.0.4 nvidia-cufile-cu12-1.11.1.6 nvidia-curand-cu12-10.3.7.77 nvidia-cusolver-cu12-11.7.1.2 nvidia-cusparse-cu12-12.5.4.2 nvidia-cusparselt-cu12-0.6.3 nvidia-nccl-cu12-2.26.2 nvidia-nvjitlink-cu12-12.6.85 nvidia-nvtx-cu12-12.6.77 torch-2.7.1 triton-3.3.1
#55 393.6
#55 DONE 428.8s

#56 [final  1/30] COPY --from=triton-builder /opt/triton /opt/triton
#56 DONE 0.0s

#57 [final  2/30] RUN if [ -n "yes" ] || [ -n "" ]; then pip install /opt/triton/*.whl; chown -R jenkins:jenkins /opt/conda; fi
#57 0.823 Processing /opt/triton/triton-3.4.0+git5389ed79-cp310-cp310-linux_x86_64.whl
#57 2.263 Requirement already satisfied: setuptools>=40.8.0 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages (from triton==3.4.0+git5389ed79) (80.9.0)
#57 2.589 Installing collected packages: triton
#57 6.405 Successfully installed triton-3.4.0+git5389ed79
#57 6.405 WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager, possibly rendering your system unusable. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv. Use the --root-user-action option if you know what you are doing and want to suppress this warning.
#57 DONE 86.5s
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156706
Approved by: https://github.com/oulgen, https://github.com/malfet
2025-06-26 02:05:47 +00:00
b50075343a [distributed] Enable H100 test for all distributed related changes (#156721)
We want to run H100 CI for distributed-related changes. We already apply the oncall:distributed label when distributed-related code is touched: 4491326fb0/.github/labeler.yml (L94). So we want to leverage that.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156721
Approved by: https://github.com/huydhn
2025-06-26 01:51:41 +00:00
e581f015ee Bump STATIC_CUDA_LAUNCHER_VERSION to 2 (#156726)
Differential Revision: [D77241813](https://our.internmc.facebook.com/intern/diff/D77241813)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156726
Approved by: https://github.com/oulgen
2025-06-26 01:50:51 +00:00
b5bfbba184 [Quant][CPU] fix fake_quantize_per_tensor_affine of inf values (#155109)
Fixes #154328

**Summary**
Reason for failure:
The input value is infinity in float, and converting it to int64_t is undefined behavior. On x86, it is converted to the minimum value of int64_t, which is not expected.

Fix:
Clamp `(input * inv_scale + zero_point)` to `[quant_min, quant_max]` before converting it to int64_t.
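
A hedged illustration of the case being fixed (my own example): with the clamp in place, infinite inputs saturate at `quant_max`/`quant_min` instead of overflowing the int64 conversion on x86.

```python
import torch

x = torch.tensor([float("inf"), -float("inf"), 1.0])
y = torch.fake_quantize_per_tensor_affine(x, 0.1, 0, 0, 255)
print(y)  # expected: [25.5, 0.0, 1.0] once the value is clamped before the cast
```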

**Test plan**
```
pytest test/quantization/core/test_workflow_ops.py -k test_fake_quantize_per_tensor_affine_inf
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155109
Approved by: https://github.com/leslie-fang-intel, https://github.com/jerryzh168
2025-06-26 01:24:36 +00:00
214e2959dc Cleanup leftover miniconda brew installation (#156898)
That results in torch.compile being unable to produce working artifacts

Should fix https://github.com/pytorch/pytorch/issues/156833

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156898
Approved by: https://github.com/seemethere, https://github.com/atalman
2025-06-26 01:02:04 +00:00
4c0091fda6 python definitely_contiguous-> is_contiguous_or_false (#156515)
We can probably avoid having those in Python as well and just depend on the C++ impl after we land https://github.com/pytorch/pytorch/pull/155590, but that is for a different PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156515
Approved by: https://github.com/bobrenjc93
2025-06-26 00:47:14 +00:00
85df746892 refresh expected numbers (#156877)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156877
Approved by: https://github.com/huydhn
2025-06-26 00:03:09 +00:00
2c6324a1eb Delete sections referencing torchscript in serialization docs (#156648)
Address [T228333890](https://www.internalfb.com/intern/tasks/?t=228333890)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156648
Approved by: https://github.com/svekars
2025-06-25 23:41:24 +00:00
a25d1443fa Mark TorchServe as all emeritus (#156865)
As per title and to follow the broader tutorial cleanup work.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156865
Approved by: https://github.com/svekars, https://github.com/malfet, https://github.com/seemethere
2025-06-25 23:34:57 +00:00
451b525bf0 [ez] add docblock and comments to simd.split_and_set_ranges (#156717)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156717
Approved by: https://github.com/BoyuanFeng
ghstack dependencies: #156445
2025-06-25 23:07:28 +00:00
204db27a0c Consolidate stack trace in Tracer (#156257)
Summary:
- Consolidate the stack trace recording code in TracerBase and PythonKeyTracer
- Change `make_fx`'s arg name to be consistent with TracerBase member name `record_stack_traces`

We move the stack trace logic from `create_proxy` to `create_node` so that all classes inheriting from TracerBase re-use the same stack trace logic.

Test Plan:
```
buck run caffe2/test:test_export -- -r  test_stack_trace
```

Rollback Plan:

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156257
Approved by: https://github.com/angelayi, https://github.com/zou3519
2025-06-25 23:07:10 +00:00
653c52fe52 [MPS] Fix batch norm incorrect gradient (#156867)
Fixes #156555

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156867
Approved by: https://github.com/malfet
2025-06-25 23:05:49 +00:00
acaf6ba3c6 Organize BUCK for torch/standalone (#156503)
Summary: Undo highlevel BUCKification in favor of something more organized by moving it to the dir itself

Test Plan:
CI

Rollback Plan:

Reviewed By: swolchok

Differential Revision: D76920013

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156503
Approved by: https://github.com/swolchok
2025-06-25 22:56:15 +00:00
d98fa4a103 implement SR's storage group planning algorithm (#156715)
Summary: att

Test Plan:
Tested on a localnet: it uses ~15% more memory than greedy-by-size (see the byte counts below), but is more performant.

local:
gbs: 110656b
dsg: 131584b

local_ro:
gbs: 38208
dsg: 44544

Differential Revision: D75653840

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156715
Approved by: https://github.com/zhxchen17
2025-06-25 22:43:40 +00:00
1e7e21ec5d unify dynamic shapes API namings 3 (guard_int, guard_int_seq) (#155973)
evaluate_static_shape -> guard_int
evaluate_static_shapes -> guard_int_seq

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155973
Approved by: https://github.com/bobrenjc93
2025-06-25 22:40:28 +00:00
61f6aa36b9 [resubmit][export] add _union_dataclass to support comparing dataclasses that inherits from union. (#156765)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156765
Approved by: https://github.com/zhxchen17
2025-06-25 22:32:12 +00:00
53057fc16a [dynamo] update base variable call_method hint with note on comprehensions (#156769)
Internal xref: https://fb.workplace.com/groups/1075192433118967/permalink/1696822194289318/

List/dict comprehensions in Python <= 3.11 result in potentially weird graph breaking behavior because comprehensions result in implicit function calls, which Dynamo may end up tracing as top-level frames, resulting in iterators being passed as arguments to the compiled region.
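
A toy illustration of the note above (my own example): on Python <= 3.11 the list comprehension compiles to an implicit nested function, so a graph break near it can leave Dynamo tracing that implicit frame on its own.

```python
import torch

@torch.compile
def fn(xs):
    ys = [x + 1 for x in xs]     # implicit <listcomp> call on Python <= 3.11
    torch._dynamo.graph_break()  # force a graph break for illustration
    return [y * 2 for y in ys]

fn([torch.randn(2) for _ in range(3)])
```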

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156769
Approved by: https://github.com/StrongerXi
2025-06-25 21:55:55 +00:00
95a7d1912a [sigmoid] add layout planner to executor (#156852)
Summary: if memory planning is enabled in the runtime config, we will create a copy in the executor here.

Test Plan: ci

Differential Revision: D73635622

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156852
Approved by: https://github.com/zhxchen17
2025-06-25 21:41:09 +00:00
6de41ce0f8 [cond] support gen_schema for cond (#154193)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154193
Approved by: https://github.com/zou3519
ghstack dependencies: #155644
2025-06-25 21:19:58 +00:00
3257c8f74c [cond] preserve merged phs meta for subgraph (#155644)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155644
Approved by: https://github.com/zou3519
2025-06-25 21:19:58 +00:00
e7a66166ce [precompile] When using BundledAOTAutogradCache, disable FXGraphCache (#156611)
The goal of this PR is to fix a specific bug when turning precompile on/off between caching runs.

If you try to turn on BundledAOTAutogradCacheEntry today between local runs, the FXGraphCache may randomly hit *between* the two runs, because FXGraphCache knows nothing about AOTAutogradCache's config. When FXGraphCache hits, it will immediately call make_launchers() on the triton code it launches, which then causes an assertion failure because pickle should not be called after make_launchers.

One way to resolve the bug is just to add whether precompile is enabled to the FxGraph cache key. The better fix, however, is higher level/philosophical:

When using BundledAOTAutogradCacheEntry, the entire CompiledFxGraph is saved directly to the cache entry, and we expect the two caches to work in sync, i.e. as one cache. So to simplify the programming model, we disable FxGraphCache when BundledAOTAUtogradCache is turned on.

BundledAOTAutogradCacheEntry is only used for precompile use cases now; if we wanted to use BundledAOTAutogradCache for traditional caching use cases, there's a bunch of further work, one of which would be to re-enable FxGraphCache in the event that BundledAOTAutogradCache has to bypass. However, for precompile, this is not a scenario that should happen: we should always expect the entire callable to be saveable, and we should expect to never bypass. So we don't do that change for now.

Added a unit test demonstrating this behavior. Also updated existing unit tests to show that all fx graph cache operations are now 0 (but all tests still pass).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156611
Approved by: https://github.com/zhxchen17
2025-06-25 21:01:42 +00:00
fe1f1a38df add test_batchnorn_2D and 3D tests (#156498)
New set of batchnorm tests to verify NCHW 2D/3D BatchNorm.
These tests also allow adding and configuring different BatchNorm tests (dtypes, NCHW/NHWC, mixed) in the future.
Based on:
- Train [test_batchnorm_cudnn_nhwc](1051b93192/test/test_nn.py (L4985))
- Inference [test_batchnorm_nhwc_cuda](1051b93192/test/test_nn.py (L5130))

```
test_batchnorm_3D_inference_NCHW_vs_cpu_float32 (__main__.TestNN.test_batchnorm_3D_inference_NCHW_vs_cpu_float32) ... ok (0.113s)
test_batchnorm_3D_inference_NCHW_vs_cpu_mixed_bfloat16 (__main__.TestNN.test_batchnorm_3D_inference_NCHW_vs_cpu_mixed_bfloat16) ... ok (0.057s)
test_batchnorm_3D_inference_NCHW_vs_cpu_mixed_float16 (__main__.TestNN.test_batchnorm_3D_inference_NCHW_vs_cpu_mixed_float16) ... ok (0.063s)
test_batchnorm_3D_inference_NCHW_vs_native_float32 (__main__.TestNN.test_batchnorm_3D_inference_NCHW_vs_native_float32) ... ok (0.059s)
test_batchnorm_3D_inference_NCHW_vs_native_mixed_bfloat16 (__main__.TestNN.test_batchnorm_3D_inference_NCHW_vs_native_mixed_bfloat16) ... ok (0.006s)
test_batchnorm_3D_inference_NCHW_vs_native_mixed_float16 (__main__.TestNN.test_batchnorm_3D_inference_NCHW_vs_native_mixed_float16) ... ok (0.006s)
test_batchnorm_3D_train_NCHW_vs_cpu_float32 (__main__.TestNN.test_batchnorm_3D_train_NCHW_vs_cpu_float32) ... ok (0.007s)
test_batchnorm_3D_train_NCHW_vs_cpu_mixed_bfloat16 (__main__.TestNN.test_batchnorm_3D_train_NCHW_vs_cpu_mixed_bfloat16) ... ok (0.005s)
test_batchnorm_3D_train_NCHW_vs_cpu_mixed_float16 (__main__.TestNN.test_batchnorm_3D_train_NCHW_vs_cpu_mixed_float16) ... ok (0.005s)
test_batchnorm_3D_train_NCHW_vs_native_float32 (__main__.TestNN.test_batchnorm_3D_train_NCHW_vs_native_float32) ... ok (0.003s)
test_batchnorm_3D_train_NCHW_vs_native_mixed_bfloat16 (__main__.TestNN.test_batchnorm_3D_train_NCHW_vs_native_mixed_bfloat16) ... skip: bfloat16 NCHW train failed due to native tolerance issue (0.001s)
test_batchnorm_3D_train_NCHW_vs_native_mixed_float16 (__main__.TestNN.test_batchnorm_3D_train_NCHW_vs_native_mixed_float16) ... skip: 3D float16 NCHW train failed on ROCm<7.0 (0.001s)

test_batchnorm_2D_inference_NCHW_vs_cpu_float32 (__main__.TestNN.test_batchnorm_2D_inference_NCHW_vs_cpu_float32) ... ok (0.016s)
test_batchnorm_2D_inference_NCHW_vs_cpu_mixed_bfloat16 (__main__.TestNN.test_batchnorm_2D_inference_NCHW_vs_cpu_mixed_bfloat16) ... ok (0.003s)
test_batchnorm_2D_inference_NCHW_vs_cpu_mixed_float16 (__main__.TestNN.test_batchnorm_2D_inference_NCHW_vs_cpu_mixed_float16) ... ok (0.003s)
test_batchnorm_2D_inference_NCHW_vs_native_float32 (__main__.TestNN.test_batchnorm_2D_inference_NCHW_vs_native_float32) ... ok (0.054s)
test_batchnorm_2D_inference_NCHW_vs_native_mixed_bfloat16 (__main__.TestNN.test_batchnorm_2D_inference_NCHW_vs_native_mixed_bfloat16) ... ok (0.002s)
test_batchnorm_2D_inference_NCHW_vs_native_mixed_float16 (__main__.TestNN.test_batchnorm_2D_inference_NCHW_vs_native_mixed_float16) ... ok (0.001s)
test_batchnorm_2D_train_NCHW_vs_cpu_float32 (__main__.TestNN.test_batchnorm_2D_train_NCHW_vs_cpu_float32) ... ok (0.007s)
test_batchnorm_2D_train_NCHW_vs_cpu_mixed_bfloat16 (__main__.TestNN.test_batchnorm_2D_train_NCHW_vs_cpu_mixed_bfloat16) ... ok (0.004s)
test_batchnorm_2D_train_NCHW_vs_cpu_mixed_float16 (__main__.TestNN.test_batchnorm_2D_train_NCHW_vs_cpu_mixed_float16) ... ok (0.004s)
test_batchnorm_2D_train_NCHW_vs_native_float32 (__main__.TestNN.test_batchnorm_2D_train_NCHW_vs_native_float32) ... ok (0.003s)
test_batchnorm_2D_train_NCHW_vs_native_mixed_bfloat16 (__main__.TestNN.test_batchnorm_2D_train_NCHW_vs_native_mixed_bfloat16) ... skip: bfloat16 NCHW train failed due to native tolerance issue (0.001s)
test_batchnorm_2D_train_NCHW_vs_native_mixed_float16 (__main__.TestNN.test_batchnorm_2D_train_NCHW_vs_native_mixed_float16) ... ok (0.002s)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156498
Approved by: https://github.com/jeffdaily
2025-06-25 20:38:02 +00:00
48e7b62d3a [dynamo] Add immutable pytree to trace_rules (#156772)
Fixes https://github.com/pytorch/pytorch/issues/155426

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156772
Approved by: https://github.com/williamwen42
2025-06-25 20:08:47 +00:00
e99a2a2dba [PG/nccl] Simplify uniqueHash management (#156790)
Summary:

ncclUniqueID is only relevant when a comm is created using ncclCommCreate or ncclCommCreateConfig.  If a comm is created with ncclCommSplit, this field is unset, causing its usage to create unexpected behavior.

This patch creates a unique hash key for each comm, irrespective of how the comm is created.

Test Plan:

CI

Reviewers:

Subscribers:

Tasks:

Tags:

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156790
Approved by: https://github.com/fduwjj, https://github.com/kwen2501
2025-06-25 20:06:08 +00:00
070aa59e49 Refactor DynamoStore into disk and in memory implementations (#155818)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155818
Approved by: https://github.com/zhxchen17
2025-06-25 18:24:28 +00:00
6c24c6633a [torch][test] skip test_transformer_backend_inductor_fullgraph_True (#156763)
Summary: "Traceable FSDP2" is not being maintained anymore.

Test Plan:
```
buck test @//mode/opt caffe2/test/distributed/_composable:fully_shard_compile -- test_transformer_backend_inductor_fullgraph_True
```
https://www.internalfb.com/intern/testinfra/testconsole/testrun/16044073764394232/

Rollback Plan:

Differential Revision: D77264408

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156763
Approved by: https://github.com/xunnanxu, https://github.com/yf225
2025-06-25 18:15:23 +00:00
09ffba3cf7 [docs] Decorator to create a deprecation warning (#155127)
This PR adds the `@deprecate` decorator for internal functions that we are prepping for deprecation. Add it on top of an internal function to emit a deprecation warning and allow backward compatibility with the non-internal version of the function.
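
A minimal sketch of the decorator pattern described here (illustrative only; the real decorator lives in torch internals and its exact signature may differ):

```python
import functools
import warnings

def deprecate(fn):
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        warnings.warn(
            f"{fn.__name__} is DEPRECATED, please consider using an alternative API(s).",
            UserWarning,
            stacklevel=2,
        )
        return fn(*args, **kwargs)
    return wrapper

@deprecate
def deprecated_api(a, b):
    return a + b

deprecated_api(1, 2)  # emits the UserWarning shown in the snippet below
```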

Tested with `python test/test_utils.py TestDeprecate.test_deprecated `

Furthermore, testing with a modified version of the test in the PR gives something like this, which is what we want:

```
/home/sahanp/repos/pytorch/test/test_utils.py:1239: UserWarning: deprecated_api is DEPRECATED, please consider using an alternative API(s).
  deprecated_api(1, 2)
```
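
For reference, a minimal sketch of what such a decorator can look like (illustrative only; the actual implementation and warning text in this PR may differ):

```python
import functools
import warnings

def deprecate(fn):
    # Wrap the internal function and emit a UserWarning pointing at the caller.
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        warnings.warn(
            f"{fn.__name__} is DEPRECATED, please consider using an alternative API(s).",
            UserWarning,
            stacklevel=2,
        )
        return fn(*args, **kwargs)
    return wrapper

@deprecate
def deprecated_api(a, b):
    return a + b

deprecated_api(1, 2)  # emits the UserWarning shown above
```
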
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155127
Approved by: https://github.com/albanD

Co-authored-by: albanD <desmaison.alban@gmail.com>
2025-06-25 18:09:04 +00:00
4bc3e4b497 [cutlass backend] Move cutlass key to cutlass_library (#156654)
Differential Revision: [D77188311](https://our.internmc.facebook.com/intern/diff/D77188311/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156654
Approved by: https://github.com/ColinPeppler, https://github.com/jingsh
ghstack dependencies: #156651
2025-06-25 17:55:57 +00:00
c1a629f76d Update device for perf dashboard on AMD runners (#156809)
Uses arch_device naming convention for storing perf dashboard logs on AMD runners based on the following PR
https://github.com/pytorch/test-infra/pull/6793

Updated from zen_cpu_x86 to cpu_x86_zen

Fixes https://github.com/pytorch/test-infra/issues/6823

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156809
Approved by: https://github.com/desertfire, https://github.com/malfet
2025-06-25 17:34:49 +00:00
e071837594 [cutlass backend] compile and link for .so files (#155876)
Differential Revision: [D76482736](https://our.internmc.facebook.com/intern/diff/D76482736/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155876
Approved by: https://github.com/coconutruben, https://github.com/ColinPeppler
2025-06-25 17:01:56 +00:00
1051b93192 [export] Implement _compile_and_package for ExportPackage. (#156638)
add a method to implement weight sharing.

Differential Revision: [D76132005](https://our.internmc.facebook.com/intern/diff/D76132005/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156638
Approved by: https://github.com/tugsbayasgalan
2025-06-25 16:00:40 +00:00
8eb3c5b7a1 [release] delete tag-docker-images.sh as not required anymore (#156737)
Thanks to @clee2000. This is no longer required since the docker images use the hash as tag: https://github.com/pytorch/pytorch/actions/runs/15844298044/job/44662813176#step:15:92

```
Login Succeeded
++ docker manifest inspect docker.io/pytorch/manylinux2_28-builder:cuda12.9-5011468da53e13424002bd211cc919a0ec0e8b09
++ jq '[.layers[].size, .config.size] | add / 1024 / 1024'
+ IMAGE_SIZE=9322.26076889038
+ echo 'Compressed size of image in MB: 9322.26076889038'
+ set -e
+ docker inspect --type=image docker.io/pytorch/manylinux2_28-builder:cuda12.9-5011468da53e13424002bd211cc919a0ec0e8b09
Compressed size of image in MB: 9322.26076889038
+ retry docker pull docker.io/pytorch/manylinux2_28-builder:cuda12.9-5011468da53e13424002bd211cc919a0ec0e8b09
+ docker pull docker.io/pytorch/manylinux2_28-builder:cuda12.9-5011468da53e13424002bd211cc919a0ec0e8b09
cuda12.9-5011468da53e13424002bd211cc919a0ec0e8b09: Pulling from pytorch/manylinux2_28-builder
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156737
Approved by: https://github.com/clee2000
2025-06-25 15:17:06 +00:00
029e2b05c2 Revert "[Quant][CPU] fix fake_quantize_per_tensor_affine of inf values (#155109)"
This reverts commit 19ffb5e6f7606436249742b0f3efc0bab244dc55.

Reverted https://github.com/pytorch/pytorch/pull/155109 on behalf of https://github.com/albanD due to The corresponding test still breaks on rocm ([comment](https://github.com/pytorch/pytorch/pull/155109#issuecomment-3004698438))
2025-06-25 13:05:40 +00:00
c2185dc4a5 [Quant][CPU] Enable fp8 qlinear (#155678)
**Summary**
Enable fp8 qlinear on CPU. It's part of the plan to enable fp8 static quantization on CPU. This PR only adds FP8 support to the existing int8 qlinear op. It does not add a new op, nor does it affect the frontend or the quantization flow. The schema of the qlinear op is not changed either.

So, the FP8 qlinear shares the same op as INT8 qlinear and the difference is that src/wei dtype is fp8 instead of int8. The output dtype can be fp8/float32/bfloat16. The implementation uses the oneDNN library.

The differences between qlinear and `_scaled_mm` are:
- Qlinear supports post op fusion while `_scaled_mm` does not
- Weights are prepacked for qlinear

**Test plan**
```
pytest test/quantization/core/test_quantized_op.py -k "qlinear and fp8"
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155678
Approved by: https://github.com/leslie-fang-intel, https://github.com/jerryzh168
2025-06-25 10:01:08 +00:00
19ffb5e6f7 [Quant][CPU] fix fake_quantize_per_tensor_affine of inf values (#155109)
Fixes #154328

**Summary**
Fail reason:
The input value is infinity in float, and converting it to int64_t is undefined behavior. On x86, it is converted to the minimum value of int64_t, which is not expected.

Fix:
Clamping `(input * inv_scale + zero_point)` to `[quant_min, quant_max]` before converting it to int64_t.
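
A minimal reference sketch of the fix in Python (the actual change is in the C++ kernel; names here are illustrative):

```python
import torch

def fake_quant_per_tensor_ref(x, scale, zero_point, quant_min, quant_max):
    # Clamp before the integer conversion so that +/-inf maps to
    # quant_min / quant_max instead of triggering undefined int64 conversion.
    q = torch.round(torch.clamp(x / scale + zero_point, quant_min, quant_max))
    return (q - zero_point) * scale

x = torch.tensor([0.5, float("inf"), float("-inf")])
print(fake_quant_per_tensor_ref(x, scale=0.1, zero_point=0, quant_min=-128, quant_max=127))
# tensor([  0.5000,  12.7000, -12.8000])
```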

**Test plan**
```
pytest test/quantization/core/test_workflow_ops.py -k test_fake_quantize_per_tensor_affine_inf
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155109
Approved by: https://github.com/leslie-fang-intel, https://github.com/jerryzh168
2025-06-25 09:28:54 +00:00
0ab075a69e Fix docker image build for s390x (#156687)
Add an upstream patch for onnxruntime that updates the eigen dependency URL and hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156687
Approved by: https://github.com/seemethere
2025-06-25 09:09:22 +00:00
4918502d2e bug fix for losing shape on wrapper tensor for DTensor (#156774)
Summary: Wrapper tensor for DTensor is losing shape in offload_tensor. This PR fixes this bug.

Test Plan:
updated the test. Test fails with old code and passes with the fix.

Rollback Plan:

Differential Revision: D77269733

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156774
Approved by: https://github.com/mikaylagawarecki
2025-06-25 08:14:16 +00:00
d9577df312 [ROCm] Bump AOTriton to 0.10b (#156499)
Notable new features/optimizations for SDPA operators on AMD systems from AOTriton 0.10b:

* Official support of gfx950/gfx1201
* Experimental support of gfx1101/gfx1151/gfx1150/gfx1200
* Reduce libaotriton.so binary size by over 80%.
  + Without this optimization the binary size of `libaotriton.so` could be
    over 100MiB due to 2x more supported architectures compared with 0.9b.
    Now it is only about 11MiB.
* Support sliding window attention (SWA) in
  `_flash_attention_forward/backward`. Should fix #154582

See https://github.com/ROCm/aotriton/releases/tag/0.10b for full details,
including Known Problems.

Notable changes to SDPA backend:

* `std::optional<int64_t>` `window_size_left/right` are directly passed to
  ROCM's SDPA backend, because the default value `-1` is meaningful to
  AOTriton's backend and bottom-right aligned causal mask is implemented with
  negative `window_size_left/right`
* Some code clean up around `USE_CK_FLASH_ATTENTION`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156499
Approved by: https://github.com/jeffdaily, https://github.com/jithunnair-amd
2025-06-25 07:09:03 +00:00
62272d5b24 [BE][Easy][setup] wrap over long error messages and redirect them to stderr in setup.py (#156043)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156043
Approved by: https://github.com/jingsh
2025-06-25 06:57:59 +00:00
6c008e2fb5 [nativert] Move ParallelGraphExecutor to PyTorch core (#156751)
Summary: `ParallelGraphExecutor` inherits from `GraphExecutorBase` and executes all nodes in the graph in a parallel manner

Test Plan:
CI

Rollback Plan:

Differential Revision: D77088996

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156751
Approved by: https://github.com/zhxchen17, https://github.com/dolpm
2025-06-25 06:54:45 +00:00
44a5f93462 [dynamo] allow symints in list.__setitem__ (#156197)
Fixes https://github.com/pytorch/pytorch/issues/155174
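
For illustration, a minimal sketch of the kind of pattern this allows under compilation (hypothetical example, not taken from the PR or the linked issue):

```python
import torch

def f(x, sizes):
    # list.__setitem__ with a SymInt (x.shape[0] is symbolic under dynamic shapes)
    sizes[0] = x.shape[0]
    return x + sizes[0]

x = torch.randn(4)
torch._dynamo.mark_dynamic(x, 0)  # make dim 0 symbolic
print(torch.compile(f)(x, [0]))
```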

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156197
Approved by: https://github.com/StrongerXi
2025-06-25 06:20:35 +00:00
162ca185ff [BE][PYFMT] migrate PYFMT for torch/_[a-h]*/ to ruff format (#144551)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144551
Approved by: https://github.com/ezyang
ghstack dependencies: #148186
2025-06-25 06:16:06 +00:00
9642c75689 added stubs for jit tree views (#156504)
Fixes #156488

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156504
Approved by: https://github.com/ezyang
2025-06-25 06:15:17 +00:00
c60327ba74 avoid to declare an unknown bound array without any element (#156543)
Fixes #153180

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156543
Approved by: https://github.com/jansel

Co-authored-by: Xu Han <xu.han@outlook.com>
2025-06-25 06:14:57 +00:00
4237ee3c33 [XPU] Add periodic run for xpu workflow (#156698)
Enable XPU periodic testing in xpu.yml workflow directly. It works for https://github.com/pytorch/pytorch/issues/114850.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156698
Approved by: https://github.com/atalman, https://github.com/huydhn
2025-06-25 05:57:52 +00:00
194c221e0a Update the UT of test_decompose_mm_cpu (#154100)
**Summary**
Fixes #153616
Based on the latest decomposed heuristic in daca611465/torch/_inductor/fx_passes/decompose_mem_bound_mm.py (L79-L82), for the shape in this test case `[m=1, k=64, n=32]`, the result should be decomposed. The previous CI didn't capture this failure due to the UT skip described in https://github.com/pytorch/pytorch/pull/153245. So this PR should be verified in CI after https://github.com/pytorch/pytorch/pull/153245 landed.

**Test Plan**
```
python -u -m pytest -s -v test/inductor/test_decompose_mem_bound_mm.py -k test_decompose_mm_cpu
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154100
Approved by: https://github.com/jansel
2025-06-25 05:45:58 +00:00
f5f4beaf56 [invoke_subgraph] make collect_meta_analysis fake prop cachable (#156347)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156347
Approved by: https://github.com/anijain2305, https://github.com/zou3519
ghstack dependencies: #156260
2025-06-25 04:29:22 +00:00
558d7f7db0 [invoke_subgraph] make same subgraph share get_attr target (#156260)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156260
Approved by: https://github.com/anijain2305, https://github.com/zou3519
2025-06-25 04:29:22 +00:00
568ca89bac Add a crash handler to async compile subprocesses (#155068)
When the async compile subprocesses crash in C++ they tend to just silently die instead of leaving any kind of trace.  This installs a crash handler so that if they SEGV, ILL, or ABRT they'll attempt to output a backtrace instead.

While in there I also cleaned up the CLANGTIDY warnings coming from Module.cpp.
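
For illustration, a rough Python analogue of the idea (the PR installs the handler in C++ inside the compile workers; this is not the actual implementation): Python's `faulthandler` can dump a traceback on fatal signals instead of dying silently.

```python
import faulthandler
import signal

# Dump a traceback on SIGSEGV, SIGFPE, SIGABRT, SIGBUS and SIGILL
# instead of exiting silently.
faulthandler.enable()
# Optionally also cover orderly terminations (not available on Windows).
faulthandler.register(signal.SIGTERM)
```
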

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155068
Approved by: https://github.com/masnesral
2025-06-25 03:27:28 +00:00
beb52f5c0a use more efficient implementation for broadcasted indexing in deterministic scatter_add

per title

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156744
Approved by: https://github.com/suo
2025-06-25 02:59:50 +00:00
9b498d3bb2 Update docs for torch.device (#156686)
# Motivation
Update the doc, to make `torch.device`'s constructor officially support the following methods:
- A device string, which is a string representation of the device type and optionally the device ordinal.
- A device type and a device ordinal.
- A device ordinal, which is treated as the current accelerator type.
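
For example, the three documented construction forms:

```python
import torch

d1 = torch.device("cuda:0")   # device string, with optional ordinal
d2 = torch.device("cuda", 0)  # device type and device ordinal
d3 = torch.device(0)          # bare ordinal, treated as the current accelerator type
assert d1 == d2
```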

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156686
Approved by: https://github.com/albanD
2025-06-25 02:12:36 +00:00
3608737347 [ez] fix typo in comment (#156402)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156402
Approved by: https://github.com/BoyuanFeng
ghstack dependencies: #156397
2025-06-25 02:07:36 +00:00
d06a406656 [dynamo] Graph break on torch.Tensor.data assignment with mismatched dtype (#156623)
Fixes #152162. Discussed with @bdhirsh and decided this is the easiest
workaround for now.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156623
Approved by: https://github.com/bdhirsh
2025-06-25 02:03:04 +00:00
e8cf5ff564 Fix the Problems About Defining Static Variable in Inline Function (#147095)
Refer to https://github.com/pytorch/pytorch/issues/125465 for more informations

- Remove unused header files
- Move common functionality to separate files to reduce dependencies between picklers and unpicklers
- Move the inline function that defines the static variable to .cc

Differential Revision: [D76266755](https://our.internmc.facebook.com/intern/diff/D76266755)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147095
Approved by: https://github.com/cyyever, https://github.com/albanD

Co-authored-by: Edward Yang <ezyang@meta.com>
2025-06-25 01:59:10 +00:00
cyy
41910d7a94 Move use of c10::string_view to std::string_view (#152509)
Eliminate use of c10::string_view in OSS.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152509
Approved by: https://github.com/ezyang
2025-06-25 01:57:49 +00:00
02c7ab2f9b [cpp wrapper] add AOTI shim for collective ops (#154492)
Implementations:
1. Move collective ops to c10d namespace, so that we can call them externally.
2. Add AOTI shims for collective ops.

Testing
1. Add c10d functional UT for cpu.
2. Include the above one in cpp wrapper UT.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154492
Approved by: https://github.com/desertfire
2025-06-25 01:20:05 +00:00
d797038ea9 [dcp_poc] Introduce a new simple rank local checkpointer (#156142)
Summary:
Adds an experimental implementation for a rank local checkpointer with save and load with partial load, blind load and in-place load.

This uses an new API and simpler format.

Plan to add async checkpointing, IO layer, pluggable storage backend, layout customization,  Resharding, deduplication etc are not implemented.

Test Plan: unit tests

Reviewed By: saumishr

Differential Revision: D75426560

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156142
Approved by: https://github.com/saumishr
2025-06-25 01:19:40 +00:00
0d8e4e2327 [PG/nccl] improvements to eager init (#156748)
Summary:

Cleanup eager init management, to detect and throw a warning when multiple p2p are issued on the same PG in eager init mode.

Test Plan: CI

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156748
Approved by: https://github.com/wconstab, https://github.com/kwen2501, https://github.com/Skylion007
2025-06-25 01:04:37 +00:00
06930706a1 Improve documentation for torch.lobpcg (#156139)
The changes are documentation changes to the function lobpcg. There are three changes to the doc:
1. Match the doc arg descriptions to the order of the parameters to the function.
2. Update the documentation for arg `n` to indicate that when arg `x` is specified, the value of `n` is ignored if set.
3. Add a warning that `m` must be bigger than 3 x the number of requested eigenpairs.
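
For context, a minimal usage sketch (illustrative only; the matrix and `k` are arbitrary):

```python
import torch

# Symmetric positive-definite input; request k=2 eigenpairs.
# Per the updated docs, the matrix dimension m must exceed 3x the number
# of requested eigenpairs (here 50 > 3 * 2).
A = torch.randn(50, 50, dtype=torch.float64)
A = A @ A.T + 50 * torch.eye(50, dtype=torch.float64)
eigenvalues, eigenvectors = torch.lobpcg(A, k=2)
print(eigenvalues)
```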

Fixes #152107

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156139
Approved by: https://github.com/soulitzer
2025-06-25 00:39:34 +00:00
3dd872e6d5 Revert "Add DeviceAllocator as the base device allocator (#138222)"
This reverts commit 92409b6c89fbfbd3caa79c81b1e3d9e7917d3bc7.

Reverted https://github.com/pytorch/pytorch/pull/138222 on behalf of https://github.com/Camyll due to internal build failures ([comment](https://github.com/pytorch/pytorch/pull/138222#issuecomment-3002206756))
2025-06-25 00:11:35 +00:00
6459a5c7a9 Revert "Add unified memory APIs for torch.accelerator (#152932)"
This reverts commit 35e44067c4d9cc9be2652c0b9098885c5a321029.

Reverted https://github.com/pytorch/pytorch/pull/152932 on behalf of https://github.com/Camyll due to internal build failures ([comment](https://github.com/pytorch/pytorch/pull/138222#issuecomment-3002206756))
2025-06-25 00:11:35 +00:00
fd4bb29410 Revert "[logging] dynamo_timed for CachingAutotuner.coordinate_descent_tuning (#156517)"
This reverts commit fb75dea2c1b93c78dccf08d5fd5e20b362ecd405.

Reverted https://github.com/pytorch/pytorch/pull/156517 on behalf of https://github.com/Camyll due to internal reverted ([comment](https://github.com/pytorch/pytorch/pull/156517#issuecomment-3002172049))
2025-06-24 23:45:13 +00:00
313a6a8ef9 [pt2][pr_time_benchmarks] Refresh instructions count after disabled test (#156738)
https://github.com/pytorch/pytorch/issues/153987

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156738
Approved by: https://github.com/laithsakka
2025-06-24 23:45:02 +00:00
4bd18e31e5 Revert "Add fx_graph_runnable tests boilerplate (#156552)"
This reverts commit 0a2ec7681d2af973d8daaf7905431a088739dc90.

Reverted https://github.com/pytorch/pytorch/pull/156552 on behalf of https://github.com/Camyll due to breaking internal ([comment](https://github.com/pytorch/pytorch/pull/156552#issuecomment-3002159473))
2025-06-24 23:34:21 +00:00
2ff3280c77 [ez] Disable some failing periodic tests (#156731)
test_torch.py::TestTorchDeviceTypeCUDA::test_storage_use_count_cuda:
Added in https://github.com/pytorch/pytorch/pull/150059
Fails in debug mode [GH job link](https://github.com/pytorch/pytorch/actions/runs/15856606665/job/44706020831) [HUD commit link](4491326fb0)

inductor/test_inductor_freezing.py::FreezingGpuTests::test_cpp_wrapper_cuda:
[GH job link](https://github.com/pytorch/pytorch/actions/runs/15856606665/job/44707119967) [HUD commit link](4491326fb0)
started failing after moving to new cuda version https://github.com/pytorch/pytorch/pull/155234

I'll ping people if this gets merged

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156731
Approved by: https://github.com/huydhn
2025-06-24 23:02:21 +00:00
d8bb5ac260 [ez] fix typo in select_algorithm.py (#156625)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156625
Approved by: https://github.com/Skylion007, https://github.com/BoyuanFeng
ghstack dependencies: #156445
2025-06-24 23:01:58 +00:00
ce97a5dcfa [Inductor] Restrict block analysis to only match integer dims and strides (#149615)
Restrict block analysis to only match dimension sizes and strides that are integers. E.g. `sympy` can match index expressions like `(ModularIndexing(xindex, 4, 4)) + 4*(ModularIndexing(xindex, 32, 2))` with the candidate below, which is invalid (note the non-integer `dim_mod2_: 1/16`):
```python
match_expr = stride_mod0_*((xindex//(dim_mod1_*dim_mod2_*dim_mod3_*dim_mod4_))) + stride_mod1_*(ModularIndexing(xindex, dim_mod2_*dim_mod3_*dim_mod4_, dim_mod1_)) + stride_mod2_*(ModularIndexing(xindex, dim_mod3_*dim_mod4_, dim_mod2_)) + stride_mod3_*(ModularIndexing(xindex, dim_mod4_, dim_mod3_)) + stride_mod4_*(ModularIndexing(xindex, 1, dim_mod4_))
match = {
    dim_mod4_: 32, dim_mod3_: 2, stride_mod3_: 4, dim_mod2_: 1/16,
    dim_mod1_: 4, stride_mod1_: 1, stride_mod4_: 0, stride_mod2_: 0, stride_mod0_: 0,
}
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149615
Approved by: https://github.com/blaine-rister
2025-06-24 22:43:12 +00:00
c48d0f4643 [Inductor] Fix epilogue fusion decision with 1 Triton caller as choice (#156500)
Differential Revision: D76904773

In the current scheduler logic, if a template buffer is only a Triton template, which can result from only 1 Triton choice in the autotuning, the fusion won't be benchmarked.

This can lead to an edge case in which a Triton GEMM template from the autotune lookup table can have a problematic fusion, leading to shared memory requirements above the hardware limit. `(256, 128, 64, 4, 8, 8)` is such a config, where we have seen fusion with a `.to(torch.float32)` can lead to this issue, `out of resource: shared memory, Required: 264224, Hardware limit: 232448`. We benchmark the fusion for this case to ensure it's safe.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156500
Approved by: https://github.com/jansel
2025-06-24 22:33:47 +00:00
e96f530af5 Remove unnecessary use of c10::SmallVector from moments_utils (#156714)
It's just making arrays of a particular size. (If it was resizing the vectors, we'd see compile errors.)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156714
Approved by: https://github.com/Skylion007
2025-06-24 22:30:10 +00:00
4ee4863232 Fix #156261 _foreach_copy indexing (#156719)
Fixes #156261

Thanks to @ngimel's fast eyes

For testing, I had experimented with a broader test case change but found that creating a tensor of 2**31+1 size was too expensive to do more than just a few times. Note that while the test case does not run in CI, I did run it locally to ensure it passes with new changes and fails without.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156719
Approved by: https://github.com/albanD
2025-06-24 21:58:44 +00:00
310e8361c5 [nativert] Move PrimKernelRegistry to PyTorch core (#156506)
Summary:
Torch Native Runtime RFC: pytorch/rfcs#72
PrimKernelRegistry manages a small subset of the kernel registry in NativeRT, including ListPack, ListUnpack, Input, Output, VarConcat, and VarStack.

Test Plan: Internal unittests

Differential Revision: D77034945

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156506
Approved by: https://github.com/zhxchen17
2025-06-24 21:42:41 +00:00
fa0ea57f5e [ROCm][CD] upgrade to 6.4.1 patch release (#156636)
During https://github.com/pytorch/pytorch/pull/156112, we missed upgrading the manylinux and libtorch docker images.

Fixes #155292

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156636
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-06-24 21:41:42 +00:00
3efb22e091 Enable C++ dynamic shape guards by default (#140756)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140756
Approved by: https://github.com/anijain2305, https://github.com/laithsakka
2025-06-24 21:10:17 +00:00
26f7ca3972 Unify dynamic shapes APIs naming 2 (expect_true and check) attempt2 (#156518)
Summary:
The functions guard_lt, guard_equals, and guard_leq work similarly to torch.check and expect_true, but they operate on SymPy expressions. Notably, guard_equals applies local replacements before comparison, which might be better extracted into a separate function.

This pull request standardizes naming conventions to match symbolic_shapes.py. Specifically:
- it introduces size_vars.expect_true and size_vars.check
- guard_lt becomes check_lt
- guard_leq becomes check_leq
- guard_equals becomes check_equals

I am also seeing a couple of wrong usages that I will fix in the next PR.

Test Plan:
OSS and cont

Rollback Plan:

Differential Revision: D77054177

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156518
Approved by: https://github.com/bobrenjc93
2025-06-24 21:01:38 +00:00
dfef1e4408 Optimize dim description in torch.max (#156153)
Fixes #156071

## Test Result

### Before

![image](https://github.com/user-attachments/assets/8dd0d952-277a-4197-b323-d68ae1438171)

### After

![image](https://github.com/user-attachments/assets/4af5388e-ca9e-4268-a7c4-cf16b09b899f)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156153
Approved by: https://github.com/albanD
2025-06-24 20:50:40 +00:00
1dc1eedd43 Revert "[dynamo] Graph break on torch.Tensor.data assignment with mismatched dtype (#156623)"
This reverts commit c1ad4b8e7a16f54c35a3908b56ed7d9f95eef586.

Reverted https://github.com/pytorch/pytorch/pull/156623 on behalf of https://github.com/albanD due to Breaks Dynamo tests in trunk ([comment](https://github.com/pytorch/pytorch/pull/156623#issuecomment-3001806841))
2025-06-24 20:44:42 +00:00
aa280ea19f Revert "Remove remaining CUDA 12.4 CI code (#155412)"
This reverts commit 9fed2addedb42da86b657165fe14eadc911232cf.

Reverted https://github.com/pytorch/pytorch/pull/155412 on behalf of https://github.com/Camyll due to cuda 12.4 still needed ([comment](https://github.com/pytorch/pytorch/pull/155412#issuecomment-3001711830))
2025-06-24 20:05:39 +00:00
19f851ce10 Revert "Simplify nvtx3 CMake handling, always use nvtx3 (#153784)"
This reverts commit 099d0d6121125062ebc05771c8330cb7cd8d053a.

Reverted https://github.com/pytorch/pytorch/pull/153784 on behalf of https://github.com/Camyll due to breaking internal tests and cuda 12.4 builds still used in CI ([comment](https://github.com/pytorch/pytorch/pull/153784#issuecomment-3001702310))
2025-06-24 20:02:07 +00:00
376c16703c Document each of the private member variables on ExportedProgram (#156704)
Authored with claude code and then reviewed by hand. If you don't like it, tell me.

Signed-off-by: Edward Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156704
Approved by: https://github.com/albanD, https://github.com/zhxchen17, https://github.com/jingsh
2025-06-24 19:56:40 +00:00
c1ad4b8e7a [dynamo] Graph break on torch.Tensor.data assignment with mismatched dtype (#156623)
Fixes #152162. Discussed with @bdhirsh and decided this is the easiest
workaround for now.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156623
Approved by: https://github.com/bdhirsh
2025-06-24 19:33:11 +00:00
f97f03c7ef [cutlass backend] delete pip cutlass path since nvidia stops supporting nvidia-cutlass (#156651)
Differential Revision: [D77186982](https://our.internmc.facebook.com/intern/diff/D77186982/)

source: https://pypi.org/project/nvidia-cutlass/

If users want to use it, they can install PyTorch from a wheel, git clone CUTLASS, and specify the CUTLASS path via TORCHINDUCTOR_CUTLASS_DIR, as sketched below.
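
A minimal sketch of that setup (the checkout path below is a placeholder):

```python
import os

# Point Inductor's CUTLASS backend at a local CUTLASS checkout,
# e.g. one cloned from https://github.com/NVIDIA/cutlass.
os.environ["TORCHINDUCTOR_CUTLASS_DIR"] = "/path/to/cutlass"

import torch  # import after setting the env var
# torch.compile(..., mode="max-autotune") can then consider CUTLASS GEMM
# choices, subject to the usual backend configuration.
```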

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156651
Approved by: https://github.com/mlazos
2025-06-24 18:32:15 +00:00
a00a697c17 [dynamo] updated version of detecting any differences between PRs unimplemented_v2() callsites and graph_break_registry json file (#156237)
This PR runs an automatic check as part of dynamo_wrapped to make sure that all unimplemented_v2() callsites are mapped to the JSON file; if any are not, the dev gets a message with instructions on how to update the JSON file. It also fixes the issue of the CI not being able to expand the hints, which was the root cause of the previous workflow failure. I also updated a dynamic gb_type to static and updated its test_error_message to include the GBID link for the graph break (before, the link would not be produced).

Testing:
I ran the file with the argument to ensure all cases were covered, and also tested the test in CI.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156237
Approved by: https://github.com/williamwen42
2025-06-24 18:12:23 +00:00
2d7e6c6241 [MPS] Revert cumsum/cumprod to MPSGraph implementation (#156708)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156708
Approved by: https://github.com/malfet
2025-06-24 18:12:18 +00:00
af284b45d5 [sigmoid] layout planner alias analyzer (#156676)
Summary: we need a mechanism that, given the function schemas for each kernel, can trace aliasing behavior so that we have correct value lifetimes when we plan.

Test Plan: ci + unit tests

Reviewed By: SherlockNoMad

Differential Revision: D73635213

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156676
Approved by: https://github.com/zhxchen17
2025-06-24 18:11:03 +00:00
644cc58dff Add CPython exception tests (#150789)
----

* test_baseexception.py
* test_exceptions.py
* test_exception_variations.py
* test_raise.py
* test_sys.py

Minor changes were made to each test to run them inside Dynamo

One can reproduce the changes by downloading the tests from CPython and applying the diff:
```bash
for f in "test_raise" "test_sys" "test_exceptions" "test_baseexception" "test_exception_variations"; do
	wget -O "test/dynamo/cpython/3_13/${f}.py" "https://raw.githubusercontent.com/python/cpython/refs/heads/3.13/Lib/test/${f}.py"
	git apply "test/dynamo/cpython/3_13/${f}.diff"
done
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150789
Approved by: https://github.com/zou3519
2025-06-24 18:06:42 +00:00
5ad2bee2c8 [dynamo] fix segfault due to dangling CacheEntry backend pointer (#156527)
Fixes https://github.com/pytorch/pytorch/issues/155057

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156527
Approved by: https://github.com/anijain2305, https://github.com/jansel
2025-06-24 17:57:14 +00:00
4491326fb0 [inductor] select_algorithm: add preprocessing fns (#156464)
Summary:
# Why

- keep code cleaner
- modular way to hook up preprocessing steps
- expand testability of flows that change which choices are provided e.g. to test performance models and lookup tables by running torch.compile

# What

- similar to feedback_saver_fns, now there are preprocessing_fns
- the existing regex logic is exported into those as a proof of concept

Test Plan:
```
buck2 run mode/opt scripts/coconutruben/torchmm:experiment 2>&1 | tee /tmp/epx038
```

This does not exercise the logic, it just shows that it's safe right now

Rollback Plan:

Differential Revision: D76946993

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156464
Approved by: https://github.com/masnesral
2025-06-24 16:44:40 +00:00
6e17315cd3 Skip FSDP tests if device count is less then requested world_size value (#155836)
Usually `world_size=torch.cuda.device_count()` for FSDPTest-based tests.
But the distributed test class `TestFullyShardAllGatherExtensionsMultiProcess` [forces `world_size=2`](0a6e1d6b9b/test/distributed/_composable/fsdp/test_fully_shard_extensions.py (L170)) even for 1 GPU.

Then NCCL fails with errors:
```
HIP_VISIBLE_DEVICES=0 python distributed/_composable/fsdp/test_fully_shard_extensions.py -v -k test_all_gather_extensions_train_parity
...
ncclInvalidUsage: This usually reflects invalid usage of NCCL library.
Duplicate GPU detected : rank 1 and rank 0 both on CUDA device c000
Duplicate GPU detected : rank 0 and rank 1 both on CUDA device c000
```
The test method [has the `@skip_if_lt_x_gpu(2)` decorator](0a6e1d6b9b/test/distributed/_composable/fsdp/test_fully_shard_extensions.py (L209)), but the test fails during test class initialization, before the decorator takes effect.

This PR makes FSDPTest-based tests skip if `world_size > torch.cuda.device_count()`:
```
HIP_VISIBLE_DEVICES=0 python distributed/_composable/fsdp/test_fully_shard_extensions.py -v -k test_all_gather_extensions_train_parity
...
dist init r=0, world=2
dist init r=1, world=2
SKIPPED [15.5507s] (Need at least 2 CUDA devices)
```
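
Conceptually, the guard amounts to something like the following sketch (not the exact change inside FSDPTest):

```python
import unittest
import torch

def require_world_size(world_size: int):
    # Skip when fewer devices are visible than the requested world size,
    # instead of letting NCCL fail with "Duplicate GPU detected".
    return unittest.skipIf(
        torch.cuda.device_count() < world_size,
        f"Need at least {world_size} CUDA devices",
    )
```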

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155836
Approved by: https://github.com/jeffdaily
2025-06-24 16:38:23 +00:00
e2c9d8d641 Fix non-bitwise type annotations for Tensor operators (see #145838) (#146845)
Fix https://github.com/pytorch/pytorch/issues/145838

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146845
Approved by: https://github.com/Skylion007
2025-06-24 15:41:34 +00:00
cb853945a7 [ez][CI] Update viable strict: change concurrency group to cancel in progress (#156619)
Should help with https://github.com/pytorch/pytorch/issues/156425

The one I saw today was because the job was waiting for an environment deployment approval for the mergebot environment. I think this comes from something like a temporary GitHub outage or a dropped webhook, since the job should have had permissions (it was on the main branch) and other runs are fine.
The run is https://github.com/pytorch/pytorch/actions/runs/15820977440, but you can't see anything about waiting for deployment anymore.

My solution is to change the concurrency group so that it will cancel in-progress jobs if there is one. My hope is that if one gets stuck, the next one will cancel it and redo the environment check. I don't know how to replicate this because apparently you're just supposed to fail if you don't match the protection rules https://github.com/pytorch/pytorch/actions/runs/15830920815

The job runs every 30 minutes, so there might be an issue if this job needs to run for >30 minutes to find a green sha, but it usually takes <5 minutes to run, so I think it's ok.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156619
Approved by: https://github.com/atalman
2025-06-24 15:37:43 +00:00
4c59edf0c5 [nativert] Move call_torchbind_kernel (#156571)
Summary: Move call_torchbind_kernel target from internal sigmoid to pytorch

Test Plan:
Test Internally:

buck2 test mode/dev-nosan caffe2/test/cpp/nativert:op_kernel_test
buck build //sigmoid/core/kernels:kernel_factory
and all  sandcastle tests

Rollback Plan:

Differential Revision: D77118592

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156571
Approved by: https://github.com/zhxchen17
2025-06-24 15:24:06 +00:00
795a6a0aff Update github first merge rule (#156583)
**Summary**
Update the merge rules for `CPU Frontend` and `Autocast`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156583
Approved by: https://github.com/atalman
2025-06-24 14:04:22 +00:00
dd78d6e7ea Add CPython generator/contextlib tests (#150796)
Tests:
* test_generator.py
* test_generator_stop.py
* test_contextlib.py

Minor changes were made to each test to run them inside Dynamo. We
intentionally didn't copy the binary files stored in
`python/Lib/test/archivetestdata` for security reasons. There's a single
test that requires a binary file and it is skipped because of that.

The tests were downloaded from CPython 3.13 and the diff was generated
using `git diff` to apply the changes:

```bash
for f in "test_contextlib" "test_generators" "test_generator_stop"; do
	wget -O "test/dynamo/cpython/3_13/${f}.py" "https://raw.githubusercontent.com/python/cpython/refs/heads/3.13/Lib/test/${f}.py"
	git apply "test/dynamo/cpython/3_13/${f}.diff"
done
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150796
Approved by: https://github.com/williamwen42
2025-06-24 13:15:04 +00:00
3a7ff829c5 Fix MacOS MP hang in Python-3.12+ (#155698)
By leaking the resource_tracker destructor (introduced by https://github.com/python/cpython/issues/88887) at exit, as at this point the handle to the child process might no longer be valid.

Also, switch CI from using `setup-miniconda` to `setup-python` as an integration test for the fix, as all data loader tests would hang otherwise.
- Remove `CONDA_RUN` macro...
- Hack the search path in `macos-test.sh` to put both python and python3 aliases first in the path (not sure what other actions are messing with the PATH environment variable)

Fixes https://github.com/pytorch/pytorch/issues/153050

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155698
Approved by: https://github.com/atalman
2025-06-24 12:13:35 +00:00
f5e6e52f25 [BE][PYFMT] migrate PYFMT for test/inductor/ to ruff format (#148186)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148186
Approved by: https://github.com/jansel
2025-06-24 11:12:11 +00:00
4e8dd11be1 simplify nvrtc discovery logic in compile_kernel (#156674)
Followup from https://github.com/pytorch/pytorch/pull/156332

Tested a bunch while I was working on https://github.com/pytorch/pytorch/pull/156380

Works just fine on dev gpus
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156674
Approved by: https://github.com/malfet
2025-06-24 08:55:40 +00:00
ce73b0c53f Validate custom op support for compile_kernel (#156332)
Follow-up work from #151484 - just makes sure that compile_kernel composes nicely with custom ops by writing some new tests, no new code functionality is added

benchmark failure in CI is unrelated to this change, CI is green
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156332
Approved by: https://github.com/zou3519, https://github.com/malfet
2025-06-24 08:21:21 +00:00
35e44067c4 Add unified memory APIs for torch.accelerator (#152932)
# Motivation
The following APIs will be put under torch.accelerator:
- empty_cache
- max_memory_allocated
- max_memory_reserved
- memory_allocated
- memory_reserved
- memory_stats
- reset_accumulated_memory_stats
- reset_peak_memory_stats
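
A usage sketch of the listed APIs, assuming they are exposed directly under `torch.accelerator` as this PR proposes (signatures here are illustrative, the device string is a placeholder, and an accelerator backend must be available):

```python
import torch

# Names taken from the list above; behavior mirrors the per-backend
# memory utilities (e.g. torch.cuda.*) but in a backend-agnostic namespace.
torch.accelerator.reset_peak_memory_stats()
x = torch.randn(1024, 1024, device="cuda")  # placeholder device
print(torch.accelerator.memory_allocated())
print(torch.accelerator.max_memory_allocated())
del x
torch.accelerator.empty_cache()
stats = torch.accelerator.memory_stats()
```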

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152932
Approved by: https://github.com/albanD
ghstack dependencies: #138222
2025-06-24 07:57:48 +00:00
cyy
ce1a07570d Fix TORCH_CUDA_ARCH_LIST (#156667)
Before the fix, the `TORCH_CUDA_ARCH_LIST` variable contained the literal string `TORCH_CUDA_ARCH_LIST`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156667
Approved by: https://github.com/ngimel
2025-06-24 07:27:53 +00:00
04178d347c [Reland] [Intel GPU] Make SDPA output has the same stride as Query. (#154340)
Fixes [#153903](https://github.com/pytorch/pytorch/issues/153903).

Currently the output tensor of SDPA XPU is always defined with a contiguous stride, while CPU/CUDA flash_attention and cudnn_attention allocate the output tensor with the same stride as Query.

This PR aligns XPU's behavior with CUDA/CPU to make XPU compatible with CPU/CUDA modeling code.

The function `alloc_with_matching_layout` is copied from cudnn 8c16d0e404/aten/src/ATen/native/cudnn/MHA.cpp (L874)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154340
Approved by: https://github.com/guangyey, https://github.com/drisspg
2025-06-24 06:09:59 +00:00
a7b29c88b1 [ONNX] Preserve all legacy exporter params in fallback (#156659)
Fixes #151693

Prior to this PR, the fallback did not preserve all user parameters. This PR preserves them to ensure a smooth transition for users.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156659
Approved by: https://github.com/justinchuby
2025-06-24 05:28:55 +00:00
a6a8641c8a Fix UT failure on non-cuda backend (#156577)
# Motivation
`HAS_TRITON` is a generic API that can return `True` on the XPU backend, which results in these cases failing on XPU. So we should use `HAS_CUDA` (equivalently `torch.cuda.is_available() and HAS_TRITON`) to avoid these failures.

Please refer to https://github.com/pytorch/pytorch/actions/runs/15813693789/job/44569593370#step:15:2129

# Additional Context
This PR aims to fix the CI failure soon. We will have a dedicated PR to generalize these UT to be generic. cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @chenyang78 @kadeng @chauhang @amjames @daisyden
Fix https://github.com/pytorch/pytorch/issues/156576

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156577
Approved by: https://github.com/jansel
2025-06-24 05:24:24 +00:00
495c317005 Replace deprecated is_compiling method (#154476)
Replace deprecated `is_compiling` in `torch._dynamo` with `torch.compiler`
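
A minimal before/after sketch:

```python
import torch

def f(x):
    # Old (deprecated): torch._dynamo.is_compiling()
    # New:
    if torch.compiler.is_compiling():
        return x + 1
    return x - 1

print(f(torch.zeros(3)))                 # eager path
print(torch.compile(f)(torch.zeros(3)))  # compiled path
```
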
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154476
Approved by: https://github.com/eellison
2025-06-24 05:16:40 +00:00
1044934878 [CUDAGraph] add config cudagraph_capture_sizes (#156551)
Users may want CUDAGraph for certain sizes and fallback for other sizes.

As discussed in Issue #121968, we would like to use cudagraph for [batch size [1,2,3,...,16]](https://github.com/pytorch/pytorch/issues/121968#issuecomment-2259942345) and fallback for others.

Another use case is [vllm](https://github.com/vllm-project/vllm/blob/main/vllm/compilation/cuda_piecewise_backend.py#L114-L119), where 67 batch sizes (i.e., [1,2,4,8,16,24,32,...,512]) are captured and all other sizes fallback.

This PR implements the feature with `torch._inductor.config.triton.cudagraph_capture_sizes`. When it is specified, we only capture cudagraph for these shapes. When it is None (by default), we capture cudagraph for all shapes.

Example:
```python
import torch

torch._inductor.config.triton.cudagraph_capture_sizes = [(2,3), (4,5), (6, 2), (7,3)]

def f(x):
    return x + 1

f = torch.compile(f, mode="reduce-overhead", dynamic=False)

def run(batch_size, seq_len, d):
    x = torch.randn((batch_size, seq_len, d), device="cuda")
    # Need to mark the dimension as dynamic. Automated-dynamic
    # may have some ux issues on matching `cudagraph_capture_sizes`
    # with the actual dynamic shapes, since there are specialization and
    # multiple dynamo graphs.
    torch._dynamo.mark_dynamic(x, 0)
    torch._dynamo.mark_dynamic(x, 1)
    for _ in range(3):
        f(x)

for i in range(2, 10):
    for j in range(2, 10):
        run(i, j, 8)

num_cudagraph = torch._inductor.cudagraph_trees.get_container(0).tree_manager.new_graph_id()
assert num_cudagraph.id == 4
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156551
Approved by: https://github.com/bobrenjc93
2025-06-24 05:14:49 +00:00
899d3d3e9e Don't call sum() on a tensor that is not summable in layer_norm (#156600)
Don't call `sum()` on a tensor that is default constructed.

Previously we could call `sum()` on a tensor that was default-constructed. That would lead to an error like this:

```
Traceback (most recent call last):
  File "/home/ahmads/.conda/envs/pt3/lib/python3.12/unittest/case.py", line 58, in testPartExecutor
    yield
  File "/home/ahmads/.conda/envs/pt3/lib/python3.12/unittest/case.py", line 634, in run
    self._callTestMethod(testMethod)
  File "/home/ahmads/.conda/envs/pt3/lib/python3.12/unittest/case.py", line 589, in _callTestMethod
    if method() is not None:
       ^^^^^^^^
  File "/home/ahmads/personal/pytorch/torch/testing/_internal/common_utils.py", line 3191, in wrapper
    method(*args, **kwargs)
  File "/home/ahmads/personal/pytorch/test/test_nn.py", line 7235, in test_layer_norm_backwards_eps
    ln_out_cuda.backward(grad_output_cuda)
  File "/home/ahmads/personal/pytorch/torch/_tensor.py", line 647, in backward
    torch.autograd.backward(
  File "/home/ahmads/personal/pytorch/torch/autograd/__init__.py", line 354, in backward
    _engine_run_backward(
  File "/home/ahmads/personal/pytorch/torch/autograd/graph.py", line 829, in _engine_run_backward
    return Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: tensor does not have a device
Exception raised from device_default at /home/ahmads/personal/pytorch/c10/core/TensorImpl.h:1265 (most recent call first):
C++ CapturedTraceback:
#4 std::_Function_handler<std::shared_ptr<c10::LazyValue<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > const> (), c10::SetStackTraceFetcher(std::function<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > ()>)::{lambda()#1}>::_M_invoke(std::_Any_data const&) from Logging.cpp:0
#5 c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) from ??:0
#6 c10::detail::torchCheckFail(char const*, char const*, unsigned int, char const*) from ??:0
#7 at::TensorBase::options() const from :0
#8 at::meta::resize_reduction(at::impl::MetaBase&, at::Tensor const&, c10::OptionalArrayRef<long>, bool, c10::ScalarType, bool) from :0
#9 at::meta::structured_sum_dim_IntList::meta(at::Tensor const&, c10::OptionalArrayRef<long>, bool, std::optional<c10::ScalarType>) from ??:0
#10 at::(anonymous namespace)::wrapper_CompositeExplicitAutogradNonFunctional_sum_dim_IntList(at::Tensor const&, c10::OptionalArrayRef<long>, bool, std::optional<c10::ScalarType>) from RegisterCompositeExplicitAutogradNonFunctional_0.cpp:0
#11 c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<at::Tensor (at::Tensor const&, c10::OptionalArrayRef<long>, bool, std::optional<c10::ScalarType>), &at::(anonymous namespace)::wrapper_CompositeExplicitAutogradNonFunctional_sum_dim_IntList>, at::Tensor, c10::guts::typelist::typelist<at::Tensor const&, c10::OptionalArrayRef<long>, bool, std::optional<c10::ScalarType> > >, at::Tensor (at::Tensor const&, c10::OptionalArrayRef<long>, bool, std::optional<c10::ScalarType>)>::call(c10::OperatorKernel*, c10::DispatchKeySet, at::Tensor const&, c10::OptionalArrayRef<long>, bool, std::optional<c10::ScalarType>) from RegisterCompositeExplicitAutogradNonFunctional_0.cpp:0
#12 at::_ops::sum_dim_IntList::call(at::Tensor const&, c10::OptionalArrayRef<long>, bool, std::optional<c10::ScalarType>) from ??:0
#13 void at::native::(anonymous namespace)::LaunchGammaBetaBackwardCUDAKernel<float, float>(float const*, float const*, float const*, float const*, long, long, at::Tensor*, at::Tensor*, CUstream_st*) from ??:0
#14 void at::native::(anonymous namespace)::LayerNormBackwardKernelImplInternal<float>(at::Tensor const&, at::Tensor const&, at::Tensor const&, at::Tensor const&, at::Tensor const&, long, long, at::Tensor*, at::Tensor*, at::Tensor*) from ??:0
#15 at::native::(anonymous namespace)::LayerNormBackwardKernelImpl(at::Tensor const&, at::Tensor const&, at::Tensor const&, at::Tensor const&, at::Tensor const&, long, long, at::Tensor*, at::Tensor*, at::Tensor*) from ??:0
#16 at::native::layer_norm_backward_cuda(at::Tensor const&, at::Tensor const&, c10::ArrayRef<long>, at::Tensor const&, at::Tensor const&, std::optional<at::Tensor> const&, std::optional<at::Tensor> const&, std::array<bool, 3ul>) from ??:0
#17 at::(anonymous namespace)::(anonymous namespace)::wrapper_CUDA__native_layer_norm_backward(at::Tensor const&, at::Tensor const&, c10::ArrayRef<c10::SymInt>, at::Tensor const&, at::Tensor const&, std::optional<at::Tensor> const&, std::optional<at::Tensor> const&, std::array<bool, 3ul>) from RegisterCUDA_0.cpp:0

```

Now we only call `sum(0)` on tensors that are defined and properly guard the `sum(0)` and assignment.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156600
Approved by: https://github.com/eqy, https://github.com/ngimel
2025-06-24 05:00:42 +00:00
17eb649d55 Implement guard collectives (optimized version) (#156562)
This is a remix of https://github.com/pytorch/pytorch/pull/155558

Instead of mediating the guard collective via a config option, in this one it's done via a `set_stance`-like API. The motivation is that checking for the config value on entry to torch.compile is apparently quite expensive, according to functorch_maml_omniglot. So this makes it a bit cheaper.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156562
Approved by: https://github.com/Microve
2025-06-24 04:59:49 +00:00
73772919d2 remove deprecated numpy.typing.mypy_plugin in mypy.ini (#156601)
Fixes #156489
removed deprecated numpy plugin in mypy.ini
 @ezyang
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156601
Approved by: https://github.com/ezyang
2025-06-24 04:56:08 +00:00
6d5c789ad5 [BE][PYFMT] migrate PYFMT for test/[a-h]*/ to ruff format (#144555)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144555
Approved by: https://github.com/ezyang
ghstack dependencies: #144551, #144554
2025-06-24 04:53:54 +00:00
e600e044a7 Revert "[aotd] Support mutations of the same input in fw and bw (#155354)"
This reverts commit 3f920f3d8f5bd15d2222758f21f9a5d36e4dad1f.

Reverted https://github.com/pytorch/pytorch/pull/155354 on behalf of https://github.com/malfet due to Not sure why CI was green, but it breaks tons of tests, see 930b575389/1 ([comment](https://github.com/pytorch/pytorch/pull/155354#issuecomment-2998780884))
2025-06-24 04:42:14 +00:00
930b575389 [symm_mem] Add sym mem test into ptd h100 ci (#156634)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156634
Approved by: https://github.com/ngimel, https://github.com/mori360
2025-06-24 03:43:22 +00:00
b2d473c8f8 [ROCm][Windows] Fix rocsolver undefined symbol error (#156591)
Fix undefined symbol error while using `rocsolver_ssyevd_strided_batched` call in `aten/src/ATen/native/cuda/linalg/BatchLinearAlgebraLib.cpp`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156591
Approved by: https://github.com/jeffdaily
2025-06-24 03:28:45 +00:00
87d615efab [fr] Use a vector to temporarily keep the reference to future object to avoid block (#156653)
At the end of the scope in which std::async is launched, a wait is called, which could make the code blocking; this is not expected for the monitoring thread. Instead, let's use a vector to keep the reference to the future, so no blocking happens. At the end of the loop, wait will still be called, but that is ok since all the checks or dumps have already finished.

Differential Revision: [D77190380](https://our.internmc.facebook.com/intern/diff/D77190380)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156653
Approved by: https://github.com/kwen2501
2025-06-24 03:25:04 +00:00
cyy
b09bd414a6 Deprecate c10::string (#155084)
Now there is no mention of c10::string in OSS.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155084
Approved by: https://github.com/ezyang
2025-06-24 03:03:06 +00:00
0a2ec7681d Add fx_graph_runnable tests boilerplate (#156552)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156552
Approved by: https://github.com/StrongerXi
2025-06-24 02:41:38 +00:00
9665702c64 [nativert] reland D76832891 remove designated initializer cpp20 (#156565)
Summary: fix windows build broke in https://github.com/pytorch/pytorch/pull/156508

Test Plan:
ci

Rollback Plan:

Differential Revision: D77080420

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156565
Approved by: https://github.com/zhxchen17
2025-06-24 02:38:08 +00:00
6a3d00aa3b Add Windows cuda 12.9.1 build (#156630)
Without support for SegmentReduce.cu.
A test PR confirmed that, with SegmentReduce.cu removed, the Windows build for CUDA 12.9 can succeed.

Related to: https://github.com/pytorch/pytorch/issues/156181
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156630
Approved by: https://github.com/malfet

Co-authored-by: Ting Lu <tingl@nvidia.com>
Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2025-06-24 02:15:49 +00:00
a9ef7c4d04 [dynamo] update to lru_cache message and updated user stack trace in debug mode (#156639)
I had to create a new PR for this because of @atalman request of temporary reverting the previous PR to restore diff train sync. Nothing has changed from this PR and the original one.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156639
Approved by: https://github.com/atalman
2025-06-24 01:52:13 +00:00
86996c15dc [Inductor] Allow exhaustive autotuning across all GEMM options (#156610)
Differential Revision: D76843916

Exhaustive autotuning is meant to autotune GEMM configs across the entire search space of possible configs. Some of these configs can cause extremely long compilation times and OOMs, especially configs of the following nature:
- Excessive register spillage
- Using much larger amounts of shared memory than available on the hardware

This diff prunes out those configs to make exhaustive autotuning more viable, along with supporting exhaustive autotuning for the persistent+tma template and decompose_k. Previously, exhaustive autotuning would hang; now we are able to tune shapes in ~5 minutes. Below is a sample log for autotuning with exhaustive:

```
  AUTOTUNE mm(1152x21504, 21504x1024)
  strides: [21504, 1], [1, 21504]
  dtypes: torch.bfloat16, torch.bfloat16
  mm 0.1167 ms 100.0%
  triton_mm_6270 0.1172 ms 99.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=256, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=4, num_consumer_groups=0, num_buffers_warp_spec=0
  triton_mm_6522 0.1183 ms 98.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=4, num_consumer_groups=0, num_buffers_warp_spec=0
  triton_mm_persistent_tma_7482 0.1190 ms 98.1% ACC_TYPE='tl.float32', ALLOW_TF32=False, A_ROW_MAJOR=True, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, B_ROW_MAJOR=False, EVEN_K=True, GROUP_M=8, NUM_SMS=132, TMA_SIZE=128, USE_FAST_ACCUM=False, num_stages=5, num_warps=4, num_consumer_groups=0, num_buffers_warp_spec=0
  triton_mm_persistent_tma_7483 0.1195 ms 97.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, A_ROW_MAJOR=True, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, B_ROW_MAJOR=False, EVEN_K=True, GROUP_M=8, NUM_SMS=132, TMA_SIZE=128, USE_FAST_ACCUM=False, num_stages=5, num_warps=8, num_consumer_groups=0, num_buffers_warp_spec=0
  triton_mm_6523 0.1274 ms 91.6% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=5, num_warps=8, num_consumer_groups=0, num_buffers_warp_spec=0
  triton_mm_6267 0.1285 ms 90.8% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=64, BLOCK_N=256, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=4, num_consumer_groups=0, num_buffers_warp_spec=0
  triton_mm_6519 0.1287 ms 90.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, EVEN_K=True, GROUP_M=8, USE_FAST_ACCUM=False, num_stages=4, num_warps=4, num_consumer_groups=0, num_buffers_warp_spec=0
  triton_mm_persistent_tma_7480 0.1298 ms 89.9% ACC_TYPE='tl.float32', ALLOW_TF32=False, A_ROW_MAJOR=True, BLOCK_K=64, BLOCK_M=128, BLOCK_N=128, B_ROW_MAJOR=False, EVEN_K=True, GROUP_M=8, NUM_SMS=132, TMA_SIZE=128, USE_FAST_ACCUM=False, num_stages=4, num_warps=4, num_consumer_groups=0, num_buffers_warp_spec=0
  triton_mm_persistent_tma_7312 0.1302 ms 89.7% ACC_TYPE='tl.float32', ALLOW_TF32=False, A_ROW_MAJOR=True, BLOCK_K=64, BLOCK_M=64, BLOCK_N=256, B_ROW_MAJOR=False, EVEN_K=True, GROUP_M=8, NUM_SMS=132, TMA_SIZE=128, USE_FAST_ACCUM=False, num_stages=4, num_warps=4, num_consumer_groups=0, num_buffers_warp_spec=0
  SingleProcess AUTOTUNE benchmarking takes 298.7185 seconds and 21.2569 seconds precompiling for 2210 choices
  INFO:tritonbench.utils.triton_op:Took 333894.46ms to get benchmark function for pt2_matmul_maxautotune
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156610
Approved by: https://github.com/jansel
2025-06-24 01:42:05 +00:00
40a785103c [dynamo] fix debugging code_parts for relational guards (#154753)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154753
Approved by: https://github.com/anijain2305
ghstack dependencies: #154772
2025-06-24 01:38:29 +00:00
849468034d [dynamo] fix selecting shape guards (#154772)
Not all LAMBDA_GUARDs are shape guards. Only the epilogue guards
are lambda guards

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154772
Approved by: https://github.com/anijain2305
2025-06-24 01:38:29 +00:00
5dd9652389 Clean up HF components (#155707)
Differential Revision: [D76427358](https://our.internmc.facebook.com/intern/diff/D76427358/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155707
Approved by: https://github.com/saumishr
2025-06-24 00:07:37 +00:00
ca5a40395d [partitioner] Fix _broadcast_on_rank0 to use deterministic hash function (#153734)
Summary:
I was using python's hash, which is not deterministic across different interpreter runs.

Use hashlib instead.
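
A minimal sketch of the difference (illustrative, not the exact partitioner code):

```python
import hashlib

def stable_hash(s: str) -> int:
    # hashlib digests are deterministic across interpreter runs, unlike the
    # builtin hash(), which is salted per process (PYTHONHASHSEED) and can
    # therefore disagree across ranks.
    return int.from_bytes(hashlib.sha256(s.encode("utf-8")).digest()[:8], "big")

print(stable_hash("fx_node_name"))  # same value in every process
```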

Test Plan:
Run using it

https://www.internalfb.com/mlhub/pipelines/runs/mast/aps-rebase_sanity_128bs_8t_cc-8e17be61ce?job_attempt=1&version=0&tab=summary&env=prod

Differential Revision: D74882405

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153734
Approved by: https://github.com/Microve
2025-06-24 00:06:23 +00:00
24063ad109 Fix native static dispatch kernels (#156331)
Summary: Fix for native static dispatch kernels not taking effect

Test Plan:
```
buck2 test //sigmoid/backend/test:static_kernels_ops_test

buck2 run mode/opt caffe2/torch/fb/model_transform/fx2trt/packaging:load_net_predictor -- --loadMode=BenchmarkByOp --inputNetFile=/data/users/$USER/models/${MODEL_ENTITY_ID}/${SNAPSHOT_ID}/${MODEL_ENTITY_ID}_${SNAPSHOT_ID}${SUFFIX} --moduleName=${MODULE} --submodToDevice "" --pytorch_predictor_sigmoid_static_dispatch_enable=true --pytorch_predictor_sigmoid_graph_passes_enable=true --benchmarkEnableProfiling=true --load_lowered_merge=3 --using_aoti_lowering_allowlist=false --requestFilePath=/data/users/georgiaphillips/replayer/inputs/742055223/0/mix/742055223_0_mix.inputs.recordio --benchmarkNumIterations=2
```

Rollback Plan:

Reviewed By: dolpm

Differential Revision: D76559764

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156331
Approved by: https://github.com/Skylion007, https://github.com/jingsh
2025-06-24 00:05:49 +00:00
380e30a723 [EZ/Profiler] Change 'b' to 'B' in FunctionEvent Frontend (#156250)
Summary: Fixes https://github.com/pytorch/pytorch/issues/149311

Test Plan:
Just changes string output

```
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg       CPU Mem  Self CPU Mem    # of Calls
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------
void at::native::vectorized_elementwise_kernel<4, at...         0.00%       0.000us         0.00%       0.000us       0.000us      60.993us         0.97%      60.993us       1.848us           0 B           0 B            33
...
```

Rollback Plan:

Differential Revision: D76857251

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156250
Approved by: https://github.com/sanrise
2025-06-23 23:25:04 +00:00
07bb097698 Fix clang-tidy bugprone* warnings (#148529)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148529
Approved by: https://github.com/ezyang
2025-06-23 23:09:56 +00:00
3f920f3d8f [aotd] Support mutations of the same input in fw and bw (#155354)
Original issue: https://github.com/pytorch/pytorch/issues/154820

The issue happens when there is a mutation of the same input in forward AND in backward.

AOTD emitted the copy_ after joint_function tracing. This made that fx node correspond to the side effects of both mutations (in forward and in backward).
After that, the partitioner can put it either in forward or in backward.

The fix:

1/ Introduce joint_function.handle, which allows setting a "post_forward" callback so we can check the inputs' state after forward.

We do not want to apply the mutation after the joint graph if we already applied it in forward. For that we need a "mutation_counter" and to memorize the mutation version that we applied for the forward mutation.

2/ Expose mutation_counter to Python.

We want to keep the invariant that copy_ exists only at the end of the joint graph.

3/ Memorize the mutation_counter and the state of the inputs after forward, using the post_forward handle.
Emit post_forward mutations after the joint graph is fully traced.

Add a "must_be_in_forward" tag for post_forward mutations (similar to the existing "must_be_in_backward") to keep them in forward.

4/ Ban recompute of the source of a mutation. Recompute can apply the same op (e.g. add) in forward and backward.
For this, set MUST_SAVE for the source of the mutation in forward.

proxy_tensor changes:

By default, proxy tensor updates tensor_tracker; in that case applied mutations would be chained.
But we want this copy_ to be independent and applied just to primals.
For this, introduce a context manager that disables the tensor_tracker update while adding forward mutations.
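
For context, a rough eager-mode sketch of the kind of program this refers to, i.e. one that mutates the same input in forward and in backward (illustrative only, not the repro from the linked issue):

```python
import torch

class MutateInFwAndBw(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, buf):
        buf.add_(1.0)                 # mutation of `buf` in forward
        ctx.save_for_backward(buf)
        return x * 2

    @staticmethod
    def backward(ctx, grad):
        (buf,) = ctx.saved_tensors
        buf.add_(1.0)                 # mutation of the same input in backward
        return grad * 2, None

x = torch.randn(3, requires_grad=True)
buf = torch.zeros(3)                  # does not require grad, so in-place ops are fine
MutateInFwAndBw.apply(x, buf).sum().backward()
```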

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155354
Approved by: https://github.com/bdhirsh
2025-06-23 22:25:45 +00:00
c82a174cea Extract CPU log_softmax kernels to header (#156243)
This allows sharing them with ExecuTorch.

Differential Revision: [D76830114](https://our.internmc.facebook.com/intern/diff/D76830114/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156243
Approved by: https://github.com/janeyx99
2025-06-23 21:31:16 +00:00
96e4c95cd8 [Inductor] Subgraph as a choice symbolic expression as input (#156185)
Differential Revision: D76514984

Fix subgraph as a choice for when a symbolic shape is passed in as an expression, e.g. 256 * s0, which typically happens in the backward pass. The current logic assumes that all symbolic shapes are single inputs, i.e. a standalone s0.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156185
Approved by: https://github.com/masnesral
2025-06-23 21:29:17 +00:00
b1d62febd0 Revert "Use official CUDAToolkit module in CMake (#154595)"
This reverts commit 08dae945ae380d80efbaf140a95abfc5d96e5100.

Reverted https://github.com/pytorch/pytorch/pull/154595 on behalf of https://github.com/malfet due to It breaks on some local setup with no clear diagnostic, but looks like it fails to find cuFile ([comment](https://github.com/pytorch/pytorch/pull/154595#issuecomment-2997959344))
2025-06-23 21:15:31 +00:00
31e1274597 [MTIA Aten Backend] Migrate max.dim_max / min.dim_min (#156568)
# Context

See the first PR https://github.com/pytorch/pytorch/pull/153670

# This diff

 Migrate max.dim_max / min.dim_min to in-tree.

Differential Revision: [D77095185](https://our.internmc.facebook.com/intern/diff/D77095185/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156568
Approved by: https://github.com/malfet
ghstack dependencies: #156502, #156539, #156554
2025-06-23 20:43:39 +00:00
dfdd636cfa [aoti] Check longlong upperbound for codegening input size check (#156522)
Summary:
Fixes
```
error: integer literal is too large to be represented in any integer type
 38979 |     if (arg410_1_size[0] > 1171368248680556527362) {
```

Test Plan: ci

Differential Revision: D77057898

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156522
Approved by: https://github.com/jingsh, https://github.com/desertfire
2025-06-23 20:38:34 +00:00
edd9c09e73 [MTIA Aten Backend] Migrate isnan (#156554)
# Context

See the first PR https://github.com/pytorch/pytorch/pull/153670

# This diff

 Migrate isnan to in-tree.

Differential Revision: [D77094811](https://our.internmc.facebook.com/intern/diff/D77094811/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156554
Approved by: https://github.com/malfet
ghstack dependencies: #156502, #156539
2025-06-23 20:22:32 +00:00
070e580d30 [MTIA Aten Backend] Migrate _log_softmax.out / _log_softmax_backward_data.out (#156539)
# Context

See the first PR https://github.com/pytorch/pytorch/pull/153670

# This diff

 Migrate _log_softmax.out / _log_softmax_backward_data.out to in-tree.

Differential Revision: [D77044380](https://our.internmc.facebook.com/intern/diff/D77044380/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156539
Approved by: https://github.com/malfet
ghstack dependencies: #156502
2025-06-23 19:56:01 +00:00
93cd16512f [MTIA Aten Backend] Migrate maximum.out / minimum.out / cos.out / erf.out / exp.out (#156502)
# Context

See the first PR https://github.com/pytorch/pytorch/pull/153670

# This diff

 Migrate maximum.out / minimum.out / cos.out / erf.out / exp.out to in-tree.

Differential Revision: [D76917384](https://our.internmc.facebook.com/intern/diff/D76917384/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156502
Approved by: https://github.com/malfet
2025-06-23 19:56:01 +00:00
ee4d343499 Revert "[dynamo] handle fullgraph toggle using nested torch.compile (#155166)" (#156624)
This reverts changes to [test/dynamo/test_repros.py](https://github.com/pytorch/pytorch/compare/main...atalman:revert_only_portion_of_file?expand=1#diff-4c82a5798a61d4cceb176b2700ba6fdd7c3e72d575b8e7e22458589139459caa)

Missed by: ee3d9969cc (diff-036cb21341ff8e390cc250e74fe9e3f0f15f259ea4bec4abcce49d95febf1553)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156624
Approved by: https://github.com/Camyll
2025-06-23 19:30:08 +00:00
56b3bf0c74 [nativert] Move HigherOrderKernel (#156507)
Summary:
Torch Native Runtime RFC: https://github.com/pytorch/rfcs/pull/72
As part of the effort to open source TorchNativeRuntime (or what we call Sigmoid), we are moving the implementation to torch/:
fbcode/sigmoid/kernels -> fbcode/caffe2/torch/nativert/kernels

Test Plan: CI

Differential Revision: D77032074

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156507
Approved by: https://github.com/zhxchen17
2025-06-23 19:29:27 +00:00
d061a02e6e Revert "[invoke_subgraph] make same subgraph share get_attr target (#156260)"
This reverts commit 39dd2f4d7defc63164a7969bfac0d0c62ffac900.

Reverted https://github.com/pytorch/pytorch/pull/156260 on behalf of https://github.com/ydwu4 due to no signal, it breaks linter tests. ([comment](https://github.com/pytorch/pytorch/pull/156260#issuecomment-2997478798))
2025-06-23 18:24:10 +00:00
35d03398e5 Revert "[invoke_subgraph] make collect_meta_analysis fake prop cachable (#156347)"
This reverts commit f179b7198522e6d93bd103efba1a1ebd5a2cf891.

Reverted https://github.com/pytorch/pytorch/pull/156347 on behalf of https://github.com/ydwu4 due to no signal, it breaks linter tests. ([comment](https://github.com/pytorch/pytorch/pull/156347#issuecomment-2997453729))
2025-06-23 18:19:29 +00:00
98a34e8d4b Move code out of individual token linters (#152256)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152256
Approved by: https://github.com/Skylion007
2025-06-23 18:16:33 +00:00
da910e603a [ROCm] update state check for test_trace_while_active* (#153545)
When timing is enabled, the ROCR runtime used to sleep for a small amount, which ensured that the application saw the correct state. However, for perf reasons this sleep was removed, and now the state is not guaranteed to be "started". That's why I updated the test state check to accept either "started" or "scheduled".

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153545
Approved by: https://github.com/jeffdaily, https://github.com/pruthvistony

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-06-23 17:58:14 +00:00
55ef7b15e0 Revert "[dynamo] fixes to lru_cache message and adding user stack trace in debug mode (#156463)"
This reverts commit afbf5420b8745099bf7d871f5a4fb6dec338f825.

Reverted https://github.com/pytorch/pytorch/pull/156463 on behalf of https://github.com/atalman due to This is temoprary revert, to restore diff train sync. We should be good to reland this change ([comment](https://github.com/pytorch/pytorch/pull/156463#issuecomment-2997335541))
2025-06-23 17:44:36 +00:00
a95504b10f [torchbench] update environment setup script (#156465)
The existing torchbench `Makefile` installs all models from torchbench, which could easily take 30 minutes, even if a developer only wants to run 1 model.

This PR adds a config to only install torchbench models we want to run.

Example usage:
```
# Install 1 torchbench model
make build-deps TORCHBENCH_MODELS="alexnet"

# Install 3 torchbench models
make build-deps TORCHBENCH_MODELS="alexnet basic_gnn_gcn BERT_pytorch"

# Install all models
make build-deps

# Install all models
make build-deps TORCHBENCH_MODELS=""
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156465
Approved by: https://github.com/ezyang
2025-06-23 17:41:29 +00:00
e583b88819 Revert "[Draft][CUDA] Use runtime driver API for cuStreamWriteValue32 (#156097)"
This reverts commit ac86ec0e60370c037e018137f2048cafd47c5c28.

Reverted https://github.com/pytorch/pytorch/pull/156097 on behalf of https://github.com/atalman due to internal breakage ([comment](https://github.com/pytorch/pytorch/pull/156097#issuecomment-2997314638))
2025-06-23 17:36:44 +00:00
f179b71985 [invoke_subgraph] make collect_meta_analysis fake prop cachable (#156347)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156347
Approved by: https://github.com/anijain2305, https://github.com/zou3519
ghstack dependencies: #156260
2025-06-23 17:10:07 +00:00
39dd2f4d7d [invoke_subgraph] make same subgraph share get_attr target (#156260)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156260
Approved by: https://github.com/anijain2305, https://github.com/zou3519
2025-06-23 17:10:07 +00:00
276c790010 [ROCm][SymmetricMemory] Avoid bf16 to float conversion during reduce (#155587)
This PR helps improve the performance of one-shot and two-shot allreduce as reported here: https://github.com/pytorch/FBGEMM/issues/4072

One-Shot:
![image](https://github.com/user-attachments/assets/69fe0d53-6636-42e1-90e0-e5efb989f59f)
As shown in the numbers presented above, symmetric memory performance prior to the PR (baseline) was on average about 26% less than fbgemm's number reported in the issue above. After this PR, we are seeing 16% improvement on average as compared to fbgemm and 59% as compared to our baseline numbers.

Two-Shot:
![image](https://github.com/user-attachments/assets/e5c8a288-303e-4d50-814b-4348e589e1fc)
Similarly, in two-shot, we were originally underperforming by 12%. We have improved by 22% after this PR as compared to symmetric memory performance prior to this PR. However, two-shot performance is still about 23% lower than fbgemm. This work is still in progress and will be pushing those changes through a separate PR.

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155587
Approved by: https://github.com/jeffdaily
2025-06-23 16:14:01 +00:00
5a533f74a1 Checkout optional submodules when publishing a release tarball (#156615)
This includes Eigen and nccl for now

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156615
Approved by: https://github.com/huydhn
2025-06-23 16:08:22 +00:00
6835ba1b34 Register hpu device to fake backend (#156076)
## MOTIVATION

This PR adds hpu (Intel Gaudi) to the list of devices supported by the "fake" distributed backend and the process group that will be created.

## CHANGES
- Add "hpu" to the list of devices

@ankurneog, @EikanWang

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156076
Approved by: https://github.com/d4l3k, https://github.com/EikanWang, https://github.com/albanD
2025-06-23 16:08:08 +00:00
cc410d3761 [SymmMem] Rename all_to_all_vdev ops (#156582)
The `all_to_all_vdev` ops are not bindings of NVSHMEM APIs, so remove the `nvshmem_` prefix.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156582
Approved by: https://github.com/fduwjj
ghstack dependencies: #155134
2025-06-23 15:57:36 +00:00
640f5a7090 [dynamo] Support builtin bool on non-constant VTs (#155863)
In practice `bool(...)` is either constant folded by Dynamo or used for
branching, so most of its emulation logic lived in
`InstructionTranslator.generic_jump`.

This patch adds a dedicated `bool` handler (only for symbolic
bool/int/float for now), and fixes #136075.
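
A minimal sketch of the kind of call this covers (the function and shapes are illustrative; whether a given `bool(...)` is constant folded or hits the new handler depends on the tracing context):

```python
import torch

def f(x):
    # bool() on a comparison over a symbolic int
    if bool(x.shape[0] % 2 == 0):
        return x + 1
    return x - 1

# dynamic=True makes x.shape[0] symbolic during tracing
out = torch.compile(f, backend="eager", dynamic=True)(torch.randn(4, 3))
```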

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155863
Approved by: https://github.com/williamwen42
2025-06-23 15:53:15 +00:00
6b45af38a5 [easy] better copy_misaligned_inputs assertion failure message (#154472)
internal xref: https://fb.workplace.com/groups/1075192433118967/permalink/688540560729579/

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154472
Approved by: https://github.com/williamwen42
2025-06-23 15:39:15 +00:00
2e9bd03f60 Implemented Size.__radd__ (#152554)
Fixes #144334
Builds on top of #146834 by @khushi-411

The needed trick was to add `PyNumberMethods` because these Number Protocol appears to be responsible for `__radd__` (see https://stackoverflow.com/q/18794169)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152554
Approved by: https://github.com/albanD

Co-authored-by: Khushi Agrawal <khushiagrawal411@gmail.com>
Co-authored-by: albanD <desmaison.alban@gmail.com>
2025-06-23 15:38:37 +00:00
3cbae6dde8 [MPSInductor][BE] Fix multistage reduction check (#156567)
Change the condition from less than the max threadgroup size to less than or equal to it, which eliminates redundant trivial loops.

I.e. it changes shader code generated for
```python
import torch

def f(x):
    var, mean = torch.var_mean(x, dim=2, keepdim = True)
    return x / var, var

torch.compile(f)(torch.rand(1, 16, 1024, dtype=torch.float32, device='mps'))

```

from
```metal
[[max_total_threads_per_threadgroup(1024)]]
kernel void generated_kernel(
    device float* out_ptr1,
    device float* out_ptr2,
    constant float* in_ptr0,
    uint2 thread_pos [[thread_position_in_grid]],
    uint2 group_pos [[thread_position_in_threadgroup]]
) {
    auto xindex = thread_pos.x;
    auto r0_index = thread_pos.y;
    int x0 = xindex;
    threadgroup float3 tmp_acc_0[1024];
    tmp_acc_0[r0_index * 1] = 0.0;
    for(auto r0_1_cnt = 0; r0_1_cnt < 1; ++r0_1_cnt) {
        int r0_1 = 1 * r0_index + r0_1_cnt;
        auto tmp0 = in_ptr0[r0_1 + 1024*x0];
        tmp_acc_0[r0_index * 1] = ::c10::metal::welford_combine(tmp_acc_0[r0_index * 1], float3(tmp0, 0.0, 1.0));
    }
    auto tmp1 = c10::metal::threadgroup_welford_combine(tmp_acc_0, 1024);
    auto tmp2 = 1023.0;
    auto tmp3 = tmp1.y / tmp2;
    out_ptr1[x0] = static_cast<float>(tmp3);
    for(auto r0_1_cnt = 0; r0_1_cnt < 1; ++r0_1_cnt) {
        int r0_1 = 1 * r0_index + r0_1_cnt;
        auto tmp4 = in_ptr0[r0_1 + 1024*x0];
        auto tmp5 = tmp4 / tmp3;
        out_ptr2[r0_1 + 1024*x0] = static_cast<float>(tmp5);
    }
}
```
to
```metal
[[max_total_threads_per_threadgroup(1024)]]
kernel void generated_kernel(
    device float* out_ptr1,
    device float* out_ptr2,
    constant float* in_ptr0,
    uint2 thread_pos [[thread_position_in_grid]],
    uint2 group_pos [[thread_position_in_threadgroup]]
) {
    auto xindex = thread_pos.x;
    auto r0_index = thread_pos.y;
    int r0_1 = r0_index;
    int x0 = xindex;
    threadgroup float tmp_acc_0[1024];
    auto tmp0 = in_ptr0[r0_1 + 1024*x0];
    tmp_acc_0[r0_index * 1] = tmp0;
    auto tmp1 = c10::metal::threadgroup_welford_reduce(tmp_acc_0, 1024);
    auto tmp2 = 1023.0;
    auto tmp3 = tmp1.y / tmp2;
    out_ptr1[x0] = static_cast<float>(tmp3);
    auto tmp4 = tmp0 / tmp3;
    out_ptr2[r0_1 + 1024*x0] = static_cast<float>(tmp4);
}

```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156567
Approved by: https://github.com/dcci
ghstack dependencies: #156566
2025-06-23 14:49:26 +00:00
e28925aa75 [MPS] Activation kernels: do compute at float precision (#155735)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155735
Approved by: https://github.com/malfet
ghstack dependencies: #155304, #155316, #155462, #155479, #155571, #155586
2025-06-23 14:48:57 +00:00
f5e1b24945 Revert "Enable Leak Sanitizer (#154584)"
This reverts commit c79c7bbe615265b6b3d7df39d6d5a68afd7d6b2a.

Reverted https://github.com/pytorch/pytorch/pull/154584 on behalf of https://github.com/cyyever due to Need to suppress more output ([comment](https://github.com/pytorch/pytorch/pull/154584#issuecomment-2995792265))
2025-06-23 10:08:40 +00:00
4f70fbbd16 Revert "Use CMake wholearchive group (#156393)"
This reverts commit d1b4e0fa9a5feb22fc6de1d36dc4c9dac685caed.

Reverted https://github.com/pytorch/pytorch/pull/156393 on behalf of https://github.com/etaf due to This PR is breaking XPU windows build. ([comment](https://github.com/pytorch/pytorch/pull/156393#issuecomment-2995576362))
2025-06-23 09:03:19 +00:00
92409b6c89 Add DeviceAllocator as the base device allocator (#138222)
# Motivation
In line with the RFC [A device-agnostic Python device memory related API design for stream-based accelerators](https://github.com/pytorch/pytorch/issues/134978), some memory-related APIs are widely used in popular repositories such as HuggingFace ([many if-else conditional code paths](https://github.com/search?q=repo%3Ahuggingface%2Faccelerate%20torch.cuda.empty_cache&type=code)). We would like to introduce a generic API set under the torch.accelerator namespace to generalize these use cases.

| Device-specific memory APIs `torch.xxx.foo` | Device-agnostic memory APIs `torch.accelerator.foo` |
| --- | --- |
| `torch.xxx.empty_cache` | `torch.accelerator.empty_cache` |
| `torch.xxx.reset_peak_memory_stats` | `torch.accelerator.reset_peak_memory_stats` |
| `torch.xxx.reset_accumulated_memory_stats` | `torch.accelerator.reset_accumulated_memory_stats` |
| `torch.xxx.memory_stats` | `torch.accelerator.memory_stats` |
| `torch.xxx.memory_allocated` | `torch.accelerator.memory_allocated` |
| `torch.xxx.max_memory_allocated` | `torch.accelerator.max_memory_allocated` |
| `torch.xxx.memory_reserved` | `torch.accelerator.memory_reserved` |
| `torch.xxx.max_memory_reserved` | `torch.accelerator.max_memory_reserved` |

# Solution
This design follows a similar pattern to `HostAllocator`. We're introducing a base class `DeviceAllocator`, from which `CUDAAllocator` and `XPUAllocator` will inherit. This allows us to provide a unified call path like: `torch.accelerator.empty_cache()` -> `GetDeviceAllocator(allocator)->empty_cache()`.
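
A minimal usage sketch of the device-agnostic path described above (the `is_available` guard is illustrative; the memory APIs are the ones listed in the table):

```python
import torch

if torch.accelerator.is_available():
    torch.accelerator.empty_cache()              # instead of torch.cuda.empty_cache()
    torch.accelerator.reset_peak_memory_stats()  # instead of torch.cuda.reset_peak_memory_stats()
    print(torch.accelerator.memory_allocated())  # instead of torch.cuda.memory_allocated()
```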

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138222
Approved by: https://github.com/albanD
2025-06-23 08:49:30 +00:00
d5781c8d21 remove allow-untyped-defs from torch/fx/passes/utils/fuser_utils.py (#156538)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156538
Approved by: https://github.com/ezyang
2025-06-23 08:18:16 +00:00
e0ae4ecca8 Refactor cpp codegen to support overridable class attributes. (#155553)
- Refactored CppKernelProxy and CppScheduling to use class-level attributes (kernel_cls, kernel_proxy_cls) for backend-specific kernel customization (see the sketch below).
- Avoids method duplication (e.g., codegen_functions, codegen_node) for backend-specific overrides, thus reducing downstream maintenance when upgrading Torch.
- Ensures type safety with annotations while keeping core logic centralized and extensible.
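
A rough sketch of the override pattern this enables (the backend subclass names are made up, and which class holds which attribute is an assumption; only `kernel_cls`/`kernel_proxy_cls` come from the description above):

```python
from torch._inductor.codegen.cpp import CppKernel, CppKernelProxy, CppScheduling

class MyBackendKernel(CppKernel):
    # backend-specific codegen tweaks would live here
    pass

class MyBackendKernelProxy(CppKernelProxy):
    kernel_cls = MyBackendKernel          # assumed attribute placement

class MyBackendScheduling(CppScheduling):
    kernel_proxy_cls = MyBackendKernelProxy
```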

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155553
Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5
2025-06-23 07:36:30 +00:00
cyy
67ee0c6725 Remove outdated Android workarounds of nearbyintf (#151292)
This PR uses std::nearbyint on all supported platforms.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151292
Approved by: https://github.com/ezyang
2025-06-23 06:28:15 +00:00
cyy
d1b4e0fa9a Use CMake wholearchive group (#156393)
Use CMake wholearchive group to simplify code. It may also support more OSes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156393
Approved by: https://github.com/ezyang
2025-06-23 06:22:34 +00:00
cyy
099d0d6121 Simplify nvtx3 CMake handling, always use nvtx3 (#153784)
Fall back to third-party NVTX3 if system NVTX3 doesn't exist. We also reuse the `CUDA::nvtx3` target for better interoperability.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153784
Approved by: https://github.com/ezyang
2025-06-23 06:12:46 +00:00
31659964a5 [Cutlass] Fix buffer missing issues (#155897)
Handles constants and constant folding with aoti.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155897
Approved by: https://github.com/henrylhtsang
2025-06-23 05:58:39 +00:00
cyy
c79c7bbe61 Enable Leak Sanitizer (#154584)
It enables Leak Sanitizer and also provides a suppression file.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154584
Approved by: https://github.com/ezyang
2025-06-23 05:20:27 +00:00
9fed2added Remove remaining CUDA 12.4 CI code (#155412)
Because no 12.4 job.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155412
Approved by: https://github.com/ezyang
2025-06-23 05:16:38 +00:00
4cd6e96bf0 [MPSInductor] Fix nested loop var elimination (#156566)
As reduction resuts must be kept around
Add regression test that is specific for this issue

Fixes https://github.com/pytorch/pytorch/issues/156426

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156566
Approved by: https://github.com/dcci
2025-06-23 04:35:16 +00:00
d55dc00f84 [BE][11/16] fix typos in torch/ (torch/csrc/distributed/) (#156321)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156321
Approved by: https://github.com/jingsh
ghstack dependencies: #156313, #156314, #156315, #156316, #156317, #156319
2025-06-23 02:57:50 +00:00
5b210bb3a6 [BE][9/16] fix typos in torch/ (torch/csrc/) (#156319)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156319
Approved by: https://github.com/albanD
ghstack dependencies: #156313, #156314, #156315, #156316, #156317
2025-06-23 02:57:50 +00:00
ced90016c1 [BE][7/16] fix typos in torch/ (torch/csrc/) (#156317)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156317
Approved by: https://github.com/albanD
ghstack dependencies: #156313, #156314, #156315, #156316
2025-06-23 02:57:41 +00:00
cec2977ed2 [BE][6/16] fix typos in torch/ (#156316)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156316
Approved by: https://github.com/albanD
ghstack dependencies: #156313, #156314, #156315
2025-06-23 02:57:34 +00:00
4ccc0381de [BE][5/16] fix typos in torch/ (torch/distributed/) (#156315)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156315
Approved by: https://github.com/Skylion007, https://github.com/albanD
ghstack dependencies: #156313, #156314
2025-06-23 02:57:28 +00:00
1b2146fc6d [BE][4/16] fix typos in torch/ (torch/_dynamo/) (#156314)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156314
Approved by: https://github.com/jingsh
ghstack dependencies: #156313
2025-06-23 02:57:19 +00:00
6ff6630375 [BE][3/16] fix typos in torch/ (torch/_inductor/) (#156313)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156313
Approved by: https://github.com/jingsh
2025-06-23 02:57:12 +00:00
c55eef79f8 [Inductor][CPP] Enable a config to use a small dequant buffer for woq int4 (#156395)
**Summary**
Add a configuration option to enable a smaller dequantization buffer for WOQ INT4 CPP GEMM template. This can improve the performance of the WOQ INT4 GEMM template in cases where M is small. In such scenarios, matrix B cannot be effectively reused across matrix A, and we found that reducing the Kc block size can lead to better performance.

**Test Plan**
```
python test/inductor/test_cpu_select_algorithm.py -k test_int4_woq_mm_with_small_buffer_config
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156395
Approved by: https://github.com/jansel
ghstack dependencies: #156407, #156387
2025-06-23 02:00:42 +00:00
3c7079959c [Inductor][CPP] Enable WOQ int4 concat linear (#156387)
**Summary**
Enable the concat linear optimization pass in Inductor for woq int4 linear.

**Test Plan**
```
 python test/inductor/test_cpu_select_algorithm.py -k test_int4_concat_woq_mm
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156387
Approved by: https://github.com/CaoE, https://github.com/jansel
ghstack dependencies: #156407
2025-06-23 01:52:00 +00:00
03023f178c FlexAttn config refactor + ROCm optimisations (#156307)
This PR primarily unifies the flex attention config logic with the GEMM/Conv config approach (https://github.com/pytorch/pytorch/pull/147452); this will make it much easier to handle optimisation pathways for particular triton backends.

This PR also:
1. Introduces an exhaustive tuning mode for flex attention via TORCHINDUCTOR_MAX_AUTOTUNE_FLEX_SEARCH_SPACE="EXHAUSTIVE" to allow wide-scale benchmarking for perf investigation use cases.
2. Updates configs for the ROCm flex autotune path, providing perf optimisations.

AMD perf numbers on score mod benchmark (default inputs)
flex_attn | mode | Speedup (Avg) | Speedup (Max)
-- | -- | -- | --
fwd | autotune before PR | 2.608 | 20.56
fwd | autotune after PR | 2.862 | 22
fwd | exhaustive_autotune | 2.943 | 22.471
bwd | autotune before PR | 2.196 | 9.831
bwd | autotune after PR | 2.423 | 11.331
bwd | exhaustive_autotune | 2.566 | 13.87

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156307
Approved by: https://github.com/drisspg, https://github.com/jansel
2025-06-22 22:27:38 +00:00
a5cbb2bcb3 Improve All to All Perf for inter-node use-case (#156376) (#156389)
Summary:

For the 16 GPU use-case, NVSHMEM can drive only up to 49GB/s with 8 thread blocks per peer for the all-to-all-v use-case. Increasing that to 16 is able to max out the perf.

Test Plan:
Verify on two hosts
Host1:
TORCH_SYMMMEM=NVSHMEM torchrun --nnodes=2 --nproc_per_node=8 --master_addr ${master_ip}  --node_rank=0  comms.py --master-ip ${master_ip} --b 4 --e 256M --n 500 --f 2 --z 1 --collective all_to_allv --backend nccl --device cuda
Host2:
TORCH_SYMMMEM=NVSHMEM torchrun --nnodes=2 --nproc_per_node=8 --master_addr ${master_ip}  --node_rank=1  comms.py --master-ip ${master_ip} --b 4 --e 256M --n 100 --f 2 --z 1 --collective all_to_allv --backend nccl --device cuda

Rollback Plan:

Differential Revision: D76937048

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156389
Approved by: https://github.com/kwen2501
2025-06-22 20:45:46 +00:00
a28e6ae38f [OpenReg][2/N] Migrate cpp_extensions_open_device_registration to OpenReg (#156401)
As the title stated.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156401
Approved by: https://github.com/albanD
ghstack dependencies: #156400
2025-06-22 18:40:38 +00:00
1d522325b4 [OpenReg][1/N] Migrate cpp_extensions_open_device_registration to OpenReg (#156400)
As the title stated.

**Changes:**

- add resize_ for OpenReg
- migrate related tests into test_openreg.py
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156400
Approved by: https://github.com/albanD
2025-06-22 18:40:38 +00:00
54b8087f63 Improve torch.ops typing (#154555)
Summary:
Cloned https://github.com/pytorch/pytorch/pull/153558 from benjaminglass1 and fixed internal typing errors.

Fixes longstanding issue where direct references to aten operations are seen as untyped by type checkers. This is accomplished by setting attributes on several classes more consistently, so that `__getattr__` can return a single type in all other cases.

Decisions made along the way:

1. `torch.ops.higher_order` is now implemented by a single-purpose class. This was effectively true before, but the class implementing it attempted to be generalized unnecessarily. Fixing this simplified typing for the `_Ops` class.
2. `__getattr__` is only called when all other lookup methods have failed, so several constant special-cases in the function could be implemented as class variables.

The remainder of this PR is fixing up all the bugs exposed by the updated typing, as well as all the nitpicky typing issues.
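
For illustration, the sort of reference that type checkers can now resolve (the printed type is illustrative; exact annotations may differ):

```python
import torch

op = torch.ops.aten.mul.Tensor   # previously seen as untyped/Any by type checkers
print(type(op))                  # torch._ops.OpOverload
print(op(torch.ones(2), torch.full((2,), 3.0)))
```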

Test Plan: CI

Differential Revision: D75497142

Co-authored-by: Benjamin Glass <bglass@quansight.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154555
Approved by: https://github.com/Skylion007, https://github.com/malfet, https://github.com/zou3519, https://github.com/benjaminglass1
2025-06-22 15:52:27 +00:00
10fb98a004 [Precompile] Hook up backend="inductor" (#155387)
This PR adds the necessary things to register and record backend ids from BundledAOTAutogradCacheEntry.

One TODO to point out: in this diff, if there are multiple backends that would have the same AOTAutogradCache key (traditional cache key, not backend_id), we just end up serializing the same BundledAOTAutogradCache entry multiple times. This is obviously not ideal, so we'll want to deduplicate these and just track the different keys that one BundledAOTAutogradCacheEntry is associated with instead. This shouldn't be super hard to do, though, as we just need to run a deduplication step on the call to `serialize()`, I think.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155387
Approved by: https://github.com/oulgen
2025-06-22 15:05:08 +00:00
b5c8b8d09f Revert "[dynamo] control one_graph behavior additionally through config (#154283)"
This reverts commit b46eb1ccaff944cdcd43e9ce3958819226d2952f.

Reverted https://github.com/pytorch/pytorch/pull/154283 on behalf of https://github.com/ezyang due to All of this is responsible for regression, see https://github.com/pytorch/pytorch/pull/156561 ([comment](https://github.com/pytorch/pytorch/pull/154283#issuecomment-2994242583))
2025-06-22 14:22:07 +00:00
5e56db59d4 Revert "[dynamo] add set_fullgraph decorator/context manager (#154289)"
This reverts commit 2c372a0502578e0136a84423c3f49c19c26d6bb7.

Reverted https://github.com/pytorch/pytorch/pull/154289 on behalf of https://github.com/ezyang due to All of this is responsible for regression, see https://github.com/pytorch/pytorch/pull/156561 ([comment](https://github.com/pytorch/pytorch/pull/154283#issuecomment-2994242583))
2025-06-22 14:22:07 +00:00
c10eeb5bad Revert "[dynamo] fix set_fullgraph for nested calls (#154782)"
This reverts commit 537b0877a87948bc221301a518fdbc1cf772bc7e.

Reverted https://github.com/pytorch/pytorch/pull/154782 on behalf of https://github.com/ezyang due to All of this is responsible for regression, see https://github.com/pytorch/pytorch/pull/156561 ([comment](https://github.com/pytorch/pytorch/pull/154283#issuecomment-2994242583))
2025-06-22 14:22:07 +00:00
ee3d9969cc Revert "[dynamo] handle fullgraph toggle using nested torch.compile (#155166)"
This reverts commit 24dc33b37b50ec92da08fc693dd83e7c87b74f8b.

Reverted https://github.com/pytorch/pytorch/pull/155166 on behalf of https://github.com/ezyang due to All of this is responsible for regression, see https://github.com/pytorch/pytorch/pull/156561 ([comment](https://github.com/pytorch/pytorch/pull/154283#issuecomment-2994242583))
2025-06-22 14:22:07 +00:00
f1331f3f1b Revert "[BE][3/16] fix typos in torch/ (torch/_inductor/) (#156313)"
This reverts commit 3627270bdf17b0fb6f528ca1cb87d6f2ec32680a.

Reverted https://github.com/pytorch/pytorch/pull/156313 on behalf of https://github.com/atalman due to export/test_torchbind.py::TestCompileTorchbind::test_compile_error_on_input_aliasing_contents_backend_aot_eager [GH job link](https://github.com/pytorch/pytorch/actions/runs/15804799771/job/44548489912) [HUD commit link](c95f7fa874) ([comment](https://github.com/pytorch/pytorch/pull/156313#issuecomment-2994171213))
2025-06-22 12:31:57 +00:00
5b427c92a8 Revert "[BE][4/16] fix typos in torch/ (torch/_dynamo/) (#156314)"
This reverts commit ead741c5fb0036e0fc95b79d4fe1af3a426e1306.

Reverted https://github.com/pytorch/pytorch/pull/156314 on behalf of https://github.com/atalman due to export/test_torchbind.py::TestCompileTorchbind::test_compile_error_on_input_aliasing_contents_backend_aot_eager [GH job link](https://github.com/pytorch/pytorch/actions/runs/15804799771/job/44548489912) [HUD commit link](c95f7fa874) ([comment](https://github.com/pytorch/pytorch/pull/156313#issuecomment-2994171213))
2025-06-22 12:31:57 +00:00
145d4cdc11 Revert "[BE][5/16] fix typos in torch/ (torch/distributed/) (#156315)"
This reverts commit c2f0292bd5b4b3206f5b295e96f81cd6c178eb18.

Reverted https://github.com/pytorch/pytorch/pull/156315 on behalf of https://github.com/atalman due to export/test_torchbind.py::TestCompileTorchbind::test_compile_error_on_input_aliasing_contents_backend_aot_eager [GH job link](https://github.com/pytorch/pytorch/actions/runs/15804799771/job/44548489912) [HUD commit link](c95f7fa874) ([comment](https://github.com/pytorch/pytorch/pull/156313#issuecomment-2994171213))
2025-06-22 12:31:57 +00:00
3f44fdc03d Revert "[BE][6/16] fix typos in torch/ (#156316)"
This reverts commit b210cf1ea56bcd9f937a2805d9e70d8684d25ee4.

Reverted https://github.com/pytorch/pytorch/pull/156316 on behalf of https://github.com/atalman due to export/test_torchbind.py::TestCompileTorchbind::test_compile_error_on_input_aliasing_contents_backend_aot_eager [GH job link](https://github.com/pytorch/pytorch/actions/runs/15804799771/job/44548489912) [HUD commit link](c95f7fa874) ([comment](https://github.com/pytorch/pytorch/pull/156313#issuecomment-2994171213))
2025-06-22 12:31:57 +00:00
035a68d25a Revert "[BE][7/16] fix typos in torch/ (torch/csrc/) (#156317)"
This reverts commit ee72815f1180fe2d8bcdb23493999256169ac2fa.

Reverted https://github.com/pytorch/pytorch/pull/156317 on behalf of https://github.com/atalman due to export/test_torchbind.py::TestCompileTorchbind::test_compile_error_on_input_aliasing_contents_backend_aot_eager [GH job link](https://github.com/pytorch/pytorch/actions/runs/15804799771/job/44548489912) [HUD commit link](c95f7fa874) ([comment](https://github.com/pytorch/pytorch/pull/156313#issuecomment-2994171213))
2025-06-22 12:31:56 +00:00
1d3bca40ed Revert "[BE][9/16] fix typos in torch/ (torch/csrc/) (#156319)"
This reverts commit a23ccaa8479e038e79532759a64e9947c0fac43d.

Reverted https://github.com/pytorch/pytorch/pull/156319 on behalf of https://github.com/atalman due to export/test_torchbind.py::TestCompileTorchbind::test_compile_error_on_input_aliasing_contents_backend_aot_eager [GH job link](https://github.com/pytorch/pytorch/actions/runs/15804799771/job/44548489912) [HUD commit link](c95f7fa874) ([comment](https://github.com/pytorch/pytorch/pull/156313#issuecomment-2994171213))
2025-06-22 12:31:56 +00:00
4b55871e06 Revert "[BE][11/16] fix typos in torch/ (torch/csrc/distributed/) (#156321)"
This reverts commit c95f7fa874a3116f1067f9092456ee7281003614.

Reverted https://github.com/pytorch/pytorch/pull/156321 on behalf of https://github.com/atalman due to export/test_torchbind.py::TestCompileTorchbind::test_compile_error_on_input_aliasing_contents_backend_aot_eager [GH job link](https://github.com/pytorch/pytorch/actions/runs/15804799771/job/44548489912) [HUD commit link](c95f7fa874) ([comment](https://github.com/pytorch/pytorch/pull/156321#issuecomment-2994163667))
2025-06-22 12:27:36 +00:00
afbf5420b8 [dynamo] fixes to lru_cache message and adding user stack trace in debug mode (#156463)
This PR refers to the issue: https://github.com/pytorch/pytorch/issues/155352

This PR uses torch._dynamo.utils.warn_once so that this warning only emits once, clarifies in the warning that silent incorrectness is potential (not observed), and doesn't warn for functions that come from torch.*.

As of right now, with this code change, the terminal outputs the following.

If the code came from torch.*:
Nothing, as we shouldn't warn for functions that come from torch.*

else:
/data/users/ssubbarao8/pytorch/torch/_dynamo/variables/functions.py:1565: UserWarning: Dynamo detected a call to a `functools.lru_cache`-wrapped function. Dynamo ignores the cache wrapper and directly traces the wrapped function. Silent incorrectness is only a *potential* risk, not something we have observed. Enable TORCH_LOGS="+dynamo" for a DEBUG stack trace.
  torch._dynamo.utils.warn_once(msg)

If the user runs the command 'TORCH_LOGS="+dynamo" python foo4.py', the debug logs show the following (the log below is based on chillee's repro):
/data/users/ssubbarao8/pytorch/torch/_dynamo/variables/functions.py:1565: UserWarning: Dynamo detected a call to a `functools.lru_cache`-wrapped function. Dynamo ignores the cache wrapper and directly traces the wrapped function. Silent incorrectness is only a *potential* risk, not something we have observed. Enable TORCH_LOGS="+dynamo" for a DEBUG stack trace.
  torch._dynamo.utils.warn_once(msg)
V0619 21:00:16.504000 956424 torch/_dynamo/variables/functions.py:1575] [0/0] call to a lru_cache` wrapped function from user code at: /data/users/ssubbarao8/pytorch/foo4.py:9
V0619 21:00:16.504000 956424 torch/_dynamo/variables/functions.py:1575] [0/0]   File "/data/users/ssubbarao8/pytorch/foo4.py", line 9, in <module>
V0619 21:00:16.504000 956424 torch/_dynamo/variables/functions.py:1575] [0/0]     torch.compile(foo, backend="eager")(torch.randn(4))
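
For reference, a minimal repro sketch reconstructed from the log above (the file name and tensor shape come from the log; the body of `foo` is illustrative):

```python
import functools
import torch

@functools.lru_cache
def foo(x):
    return x + 1

# Emits the warn_once message above; run with TORCH_LOGS="+dynamo" to also get
# the user stack trace in the debug logs.
torch.compile(foo, backend="eager")(torch.randn(4))
```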

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156463
Approved by: https://github.com/williamwen42
2025-06-22 11:40:28 +00:00
aeaf6b59e2 [dynamo] Weblink generation when unimplemented_v2() is called (#156033)
This PR includes the GBID weblink whenever a user encounters a graph break. I also had to include the JSON file in setup.py, so it can be part of the files that are packaged in during CI. It also fixes the issue of the hardcoded error messages stripping away one of the '/' in 'https'.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156033
Approved by: https://github.com/williamwen42
2025-06-22 11:39:31 +00:00
c95f7fa874 [BE][11/16] fix typos in torch/ (torch/csrc/distributed/) (#156321)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156321
Approved by: https://github.com/jingsh
ghstack dependencies: #156313, #156314, #156315, #156316, #156317, #156319
2025-06-22 08:43:49 +00:00
a23ccaa847 [BE][9/16] fix typos in torch/ (torch/csrc/) (#156319)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156319
Approved by: https://github.com/albanD
ghstack dependencies: #156313, #156314, #156315, #156316, #156317
2025-06-22 08:43:49 +00:00
ee72815f11 [BE][7/16] fix typos in torch/ (torch/csrc/) (#156317)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156317
Approved by: https://github.com/albanD
ghstack dependencies: #156313, #156314, #156315, #156316
2025-06-22 08:43:41 +00:00
b210cf1ea5 [BE][6/16] fix typos in torch/ (#156316)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156316
Approved by: https://github.com/albanD
ghstack dependencies: #156313, #156314, #156315
2025-06-22 08:43:33 +00:00
c2f0292bd5 [BE][5/16] fix typos in torch/ (torch/distributed/) (#156315)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156315
Approved by: https://github.com/Skylion007, https://github.com/albanD
ghstack dependencies: #156313, #156314
2025-06-22 08:43:26 +00:00
ead741c5fb [BE][4/16] fix typos in torch/ (torch/_dynamo/) (#156314)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156314
Approved by: https://github.com/jingsh
ghstack dependencies: #156313
2025-06-22 08:43:18 +00:00
3627270bdf [BE][3/16] fix typos in torch/ (torch/_inductor/) (#156313)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156313
Approved by: https://github.com/jingsh
2025-06-22 08:43:09 +00:00
cyy
08dae945ae Use official CUDAToolkit module in CMake (#154595)
Use CUDA language in CMake and remove forked FindCUDAToolkit.cmake.
Some CUDA targets are also renamed with `torch::` prefix.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154595
Approved by: https://github.com/albanD
2025-06-22 05:44:29 +00:00
1d993fa309 Don't change set_skip_guard_eval_unsafe for DisableContext, since compiler won't run (#156490)
Signed-off-by: Edward Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156490
Approved by: https://github.com/anijain2305
2025-06-22 00:51:32 +00:00
333e0e6147 Make build-deps drop builds into current venv again (#156200)
Signed-off-by: Edward Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156200
Approved by: https://github.com/malfet
2025-06-22 00:45:02 +00:00
74ebd8d14e use guard_or_false for expand utils reduction (#155868)
This is classic broadcast like pattern.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155868
Approved by: https://github.com/bobrenjc93
2025-06-21 23:42:19 +00:00
f70c80105e Enables NCCL symmetric memory kernels through mempool registration (#155134)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155134
Approved by: https://github.com/kwen2501

Co-authored-by: Ke Wen <kw2501@meta.com>
2025-06-21 23:24:04 +00:00
9e132b770e [CUDA] Skip test on low vram machines (#156548)
I noticed some jobs error out after merging #155397 because the test requires >15GB of GPU memory to execute and some of the machines it's running on have 8GB GPUs. This PR adds the skip option on those machines.

CC: @eqy @ngimel

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156548
Approved by: https://github.com/eqy, https://github.com/malfet
2025-06-21 22:32:57 +00:00
e4ae60a413 [SymmMem] Add NVSHMEM Quiet support to Triton (#156475)
This PR introduces device-side NVSHMEM completion guarantees via the quiet API in Triton, enabling GPU kernels to ensure all pending remote memory operations are fully complete before proceeding with subsequent operations.

Changes:
- Added a new `core.extern` wrapper for `nvshmem_quiet` in `nvshmem_triton.py`
- Implemented `test_triton_quiet` in `test/distributed/test_nvshmem.py`, including:
  - A Triton kernel that performs `putmem_block` followed by `quiet()` to ensure completion
  - Flag-based signaling only after `quiet()` completes, guaranteeing data delivery
  - Consumer validation that when the completion flag arrives, all data transfers are guaranteed complete

Tests:
`$ TORCH_SYMMMEM=NVSHMEM python test/distributed/test_nvshmem.py -k test_triton_quiet`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156475
Approved by: https://github.com/kwen2501
ghstack dependencies: #156472, #156473, #156474
2025-06-21 22:19:58 +00:00
c2d1b225e6 [PT2][partitioners] raise getitems in partitioners to allow earlier release of buffers (#155809)
**Problem & Solution:**
Assume we have something like:
```
x = some_op(...)
x0 = x[0]
do_something_with_and_is_last_use_of(x0)
do_a_bunch_of_other_things()
x1 = x[1]
```
In this case, the memory associated with `x0` cannot be released until `x1 = x[1]`. Since `x1 = x[1]` does not use additional memory, it would be beneficial to move `x1 = x[1]` and all such `getitem` operations to be immediately after `x = some_op(...)`, such as
```
x = some_op(...)
x0 = x[0]
x1 = x[1]
do_something_with_and_is_last_use_of(x0)
do_a_bunch_of_other_things()
```

**Results:**
For instance, for the `res2net101_26w_4s` model in pytorch benchmark, when running with `aot_eager` backend and with `activation_memory_budget=0.4`, the peak memory are
* baseline: 7.73GiB
* with the chage: 6.45GiB

As a sanity check, for the same setting with `inductor` backend, the peak memory is not regressed.

cc and credit to @ShatianWang for noticing this issue.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155809
Approved by: https://github.com/fmassa, https://github.com/bdhirsh
2025-06-21 19:57:21 +00:00
04b91a9e43 [SymmMem] Add NVSHMEM Fence support to Triton (#156474)
This PR introduces device-side NVSHMEM memory ordering via the fence API in Triton, enabling GPU kernels to enforce completion and ordering of remote memory operations before subsequent operations proceed.

 Changes:
- Added a new `core.extern` wrapper for `nvshmem_fence` in `nvshmem_triton.py`
- Implemented `test_triton_fence` in `test/distributed/test_nvshmem.py`, including:
  - A Triton kernel that performs two ordered `putmem_block` operations separated by `fence()` calls
  - Final fence before flag update to ensure all data transfers complete before signaling
  - Consumer validation that both buffers contain expected values when flag arrives, proving ordering guarantees

 Tests:
`$ TORCH_SYMMMEM=NVSHMEM python test/distributed/test_nvshmem.py -k test_triton_fence`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156474
Approved by: https://github.com/mandroid6, https://github.com/kwen2501
ghstack dependencies: #156472, #156473
2025-06-21 18:57:05 +00:00
c06c2569ee [ca] Support TorchDispatchMode via pass through (#156516)
The CA initial trace just proxies nodes without dispatching any ops, so we should hide it from ambient TorchDispatchModes.

In terms of differences from the eager autograd engine:
- For function mode, CA additionally disables/re-enables `_set_multithreading_enabled`
- For dispatch mode:
  - accumulate grad doesn't go down the stealing path (inaccurate compile-time refcount), so the grad `detach` ops are `copy_` instead
  - Since we always do the initial trace with dynamic shapes, and we filter out sizes, there's 1 aten.empty.memory_format for each mark_dynamic'd scalar

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156516
Approved by: https://github.com/jansel
ghstack dependencies: #156374, #156509
2025-06-21 18:33:47 +00:00
5f2f343e1e [ca] suggest to disable compiled autograd for trace-time NotImplementedErrors (#156509)
Example:

```python
  File "/home/xmfan/core/a/pytorch/torch/autograd/graph.py", line 829, in _engine_run_backward
    return Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
NotImplementedError: TorchDispatchMode not yet implemented for compiled autograd.
  You can disable compiled autograd for this operation by:
  1.  Relocating the unsupported autograd call outside the compiled region.
  2.  Wrapping the unsupported autograd call within a scope that disables compiled autograd.
  3.  Configuring the specific compilation unit to disable compiled autograd.
  4.  Globally disabling compiled autograd at the application's initialization.
```

No duplicate error messages for python side trace-time errors
```python
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/xmfan/core/a/pytorch/torch/_dynamo/compiled_autograd.py", line 344, in begin_capture
    raise NotImplementedError(
NotImplementedError: Found tensor of type <class 'torch.nn.utils._expanded_weights.expanded_weights_impl.ExpandedWeight'>, which is not supported by FakeTensorMode. You can turn off compiled autograd by either:
1. Moving the unsupported autograd call outside of the torch.compile'd region.
2. Wrapping the unsupported autograd call in the torch._dynamo.compiled_autograd._disable() context manager.
3. Setting torch._dynamo.config.compiled_autograd=False for the torch.compile call containing the unsupported autograd call.
4. Setting torch._dynamo.config.compiled_autograd=False at the start of the program.
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156509
Approved by: https://github.com/jansel
ghstack dependencies: #156374
2025-06-21 18:33:46 +00:00
f1968a5e76 [ca] skip on some PYTORCH_TEST_WITH_DYNAMO=1 autograd tests (#156374)
These aren't supported. Not sure how they passed CI

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156374
Approved by: https://github.com/jansel
2025-06-21 18:33:38 +00:00
fab85fc5f9 [compile][hierarchical compilation] Release nested_compile_region API (#156449)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156449
Approved by: https://github.com/zou3519, https://github.com/jansel
2025-06-21 15:14:59 +00:00
fb75dea2c1 [logging] dynamo_timed for CachingAutotuner.coordinate_descent_tuning (#156517)
Summary: Discussed internally at https://fburl.com/workplace/v3hllrs9. With coordinate descent tuning enabled, we're missing the dynamo_timed logging.

Test Plan:
`TORCHINDUCTOR_FORCE_DISABLE_CACHES=1 TORCHINDUCTOR_COORDINATE_DESCENT_TUNING=1 buck run mode/opt caffe2/benchmarks/dynamo:torchbench -- --training --backend=inductor --only nanogpt --repeat 1 --performance --cold-start-latency`
* tlparse: https://fburl.com/bh2hxw4z
* dynamo_compile: https://fburl.com/scuba/dynamo_compile/sandbox/u88ogw39
* pt2_compile_events: https://fburl.com/scuba/pt2_compile_events/yqljow6c

Rollback Plan:

Differential Revision: D77053918

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156517
Approved by: https://github.com/mengluy0125
2025-06-21 14:17:19 +00:00
a47ca4fc74 Revert "[dynamo] Weblink generation when unimplemented_v2() is called (#156033)" (#156546)
Broke multiple CI jobs: dynamo/test_reorder_logs.py::ReorderLogsTests::test_constant_mutation [GH job link](https://github.com/pytorch/pytorch/actions/runs/15792695433/job/44521220864) [HUD commit link](9de23d0c29)

This reverts commit 9de23d0c29dfac8dc0f6f234bdbcd85a6375fa81.

PyTorch bot revert failed: https://github.com/pytorch/pytorch/pull/156033

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156546
Approved by: https://github.com/jansel
2025-06-21 14:10:12 +00:00
d846e21355 Revert "[nativert] move layout planner algorithms to libtorch (#156508)"
This reverts commit eab45643f22e58ee12d95d8b0162d51ca0a50801.

Reverted https://github.com/pytorch/pytorch/pull/156508 on behalf of https://github.com/atalman due to [GH job link](https://github.com/pytorch/pytorch/actions/runs/15793524714/job/44524067679) [HUD commit link](eab45643f2) ([comment](https://github.com/pytorch/pytorch/pull/156508#issuecomment-2993589983))
2025-06-21 13:42:40 +00:00
1cfdcb975a [CUDA] fix illegal memory access in attention (#155397)
Fixes https://github.com/pytorch/pytorch/issues/150054

CI seemed to be messed up in the old one, old PR:
https://github.com/pytorch/pytorch/pull/155145

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155397
Approved by: https://github.com/ngimel
2025-06-21 12:32:00 +00:00
cd75cf3cab [symm_mem] Add one side put API for nvshmem (#156443)
`nvshmem_put(Tensor tensor, int peer)`, where `tensor` must be a symmetric tensor, i.e. rendezvoused before this call.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156443
Approved by: https://github.com/kwen2501

Co-authored-by: Ke Wen <kw2501@meta.com>
2025-06-21 12:16:36 +00:00
4ff0e033c1 [SymmMem] Add NVSHMEM signal_wait_until support to Triton (#156473)
This PR introduces device-side NVSHMEM signal synchronization via the signal_wait_until API in Triton, enabling GPU kernels to block until a signal variable meets a specified condition. This replaces previous barrier-based synchronization patterns with more efficient signal-based coordination between PEs.

Changes:
- Added a new `core.extern` wrapper for `nvshmem_signal_wait_until` in `nvshmem_triton.py`
- Updated existing `test_triton_put_signal` and `test_triton_put_signal_add` tests to use `signal_wait_until` instead of `dist.barrier()` for proper device-side synchronization ([per feedback](https://github.com/pytorch/pytorch/pull/156211#discussion_r2153035675))
- Implemented `test_triton_signal_wait_until` with:
  - Producer-consumer pattern where Rank 0 puts data and signals completion via `putmem_signal_block`
  - Consumer (Rank 1) uses `signal_wait_until` to block until the signal variable reaches the expected value
  - End-to-end validation of both data transfer and signal synchronization

Tests:
`$ TORCH_SYMMMEM=NVSHMEM python test/distributed/test_nvshmem.py -k test_triton_signal_wait_until`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156473
Approved by: https://github.com/kwen2501, https://github.com/mandroid6
ghstack dependencies: #156472
2025-06-21 10:55:40 +00:00
8485f19507 remove gso from vector_norm (#156530)
guard_or_false here does the same thing that guard_size_oblivious does; note that
size is >= 0, and this is size-like by definition since it is a tensor size.
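
For illustration, a small sketch of the idiom (the helper below is hypothetical; `guard_or_false` lives in torch.fx.experimental.symbolic_shapes):

```python
from torch.fx.experimental.symbolic_shapes import guard_or_false

def dim_is_zero(size) -> bool:
    # For a SymInt, if the answer can't be determined without guarding,
    # assume False and take the general path; similar in spirit to
    # guard_size_oblivious for size-like symbols, which are known to be >= 0.
    return guard_or_false(size == 0)
```
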
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156530
Approved by: https://github.com/bobrenjc93
2025-06-21 08:42:36 +00:00
6ffa03ef9e [Inductor-CPU] int8 WoQ concat linear (#153004)
### Summary

int8 WoQ GEMM concat linear optimization pertaining to the same activation applied to 3 sets of weights of the same shape.

### Perf data

GPT-J 128 input tokens, 128 output tokens.
32 physical cores of one socket of Intel(R) Xeon(R) 6972P (Xeon Gen 5). tcmalloc & Intel OpenMP were preloaded.

| May 8 nightly first token latency | First token latency with this implementation | Rest token latency with May 8 nightly | Rest token latency with this implementation combined with #149373  |
|---|---|---|---|
|202 ms | 190 ms | 33 ms | 30 ms|

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153004
Approved by: https://github.com/leslie-fang-intel, https://github.com/chunyuan-w, https://github.com/jansel

Co-authored-by: Anthony Shoumikhin <anthony@shoumikh.in>
2025-06-21 08:40:09 +00:00
35321b2ad6 remove make_fast_binary_impl from make_fast_binary_impl (#156528)
This was added in https://github.com/pytorch/pytorch/pull/133584.
Take the slow path when we can't determine that the fast path is valid.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156528
Approved by: https://github.com/bobrenjc93
2025-06-21 08:27:54 +00:00
eab45643f2 [nativert] move layout planner algorithms to libtorch (#156508)
Summary: tt

Test Plan:
ci

Rollback Plan:

Differential Revision: D76832891

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156508
Approved by: https://github.com/zhxchen17
2025-06-21 07:35:40 +00:00
bf50d71553 Add missing inline namespace CPU_CAPABILITY to Gelu/Elu.h (#156512)
As I recently learned the hard way (#156243), it is necessary to put kernel code that uses Vectorized in headers in this namespace.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156512
Approved by: https://github.com/malfet
2025-06-21 06:26:23 +00:00
e3b44edfd8 [SymmMem] Add NVSHMEM wait_until support to Triton (#156472)
This PR introduces device-side NVSHMEM synchronization via the wait_until API in Triton, enabling GPU kernels to block until a remote flag reaches a specified value. It also adds a corresponding end-to-end test to validate correct behavior across PEs.

 Changes:
- Added a new `core.extern` wrapper for `nvshmem_longlong_wait_until` in `nvshmem_triton.py`.
- Implemented `test_triton_wait_until` in `test/distributed/test_nvshmem.py`, including:
  - A simple Triton kernel that calls `nvshmem.wait_until` on a symmetric memory flag.
  - Coordination logic where Rank 0 blocks until Rank 1 atomically sets the flag and transfers data.

Tests:
`$ TORCH_SYMMMEM=NVSHMEM python test/distributed/test_nvshmem.py -k test_triton_wait_until`

```python
@triton.jit
def put_kernel(dst_ptr, src_ptr, numel: tl.constexpr, peer: tl.constexpr):
    nvshmem.putmem_block(dst_ptr, src_ptr, numel, peer)

@triton.jit
def wait_until_kernel(ivar_ptr, cmp_op: tl.constexpr, cmp_val: tl.constexpr):
    nvshmem.wait_until(ivar_ptr, cmp_op, cmp_val)

...

if rank == 0:
    print(f"[RANK 0] About to call wait_until_kernel - this will BLOCK until rank 1 sets flag to 21")
    wait_until_kernel[(1, 1, 1)](ivar_ptr, cmp_op=NVSHMEM_CMP_EQ, cmp_val=flag_val, extern_libs=nvshmem_lib)
    print(f"[RANK 0] WAIT IS OVER! Flag was set, checking data now...")
    print(f"[RANK 0] Current out buffer contents: {out.tolist()}")
    torch.testing.assert_close(out, val * torch.ones(numel, dtype=dtype, device=self.device))
    print(f"[RANK 0] ✓ DATA VERIFICATION PASSED! Got expected values.")

if rank == 1:
    print(f"[RANK 1] About to PUT 8 elements of value 13 to rank 0")
    put_kernel[(1, 1, 1)](dst_ptr, src_ptr, numel=numel, peer=peer, extern_libs=nvshmem_lib)
    print(f"[RANK 1] About to PUT flag value 21 to wake up rank 0")
    put_kernel[(1, 1, 1)](dst_ptr, src_ptr, numel=1, peer=peer, extern_libs=nvshmem_lib)
    print(f"[RANK 1] FLAG PUT complete! Rank 0 should wake up now.")

...
```
Output:
```
[RANK 0] About to call wait_until_kernel - this will BLOCK until rank 1 sets flag to 21
[RANK 1] About to PUT 8 elements of value 13 to rank 0
[RANK 1] About to PUT flag value 21 to wake up rank 0
[RANK 1] FLAG PUT complete! Rank 0 should wake up now.
[RANK 0] WAIT IS OVER! Flag was set, checking data now...
[RANK 0] Current out buffer contents: [13, 13, 13, 13, 13, 13, 13, 13]
[RANK 0] ✓ DATA VERIFICATION PASSED! Got expected values.
[RANK 0] Test completed successfully! 🎉
[RANK 1] Test completed successfully! 🎉

...

----------------------------------------------------------------------
Ran 1 test in 18.773s
OK
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156472
Approved by: https://github.com/kwen2501
2025-06-21 06:18:31 +00:00
92c79f36db [PGO] frame-specific whitelist logging (#155959)
Summary:
In D75617963, we started logging dynamic whitelist suggestions to PT2 Compile Events. The whitelists were aggregated across all frames, intending to avoid manual work for the user (e.g. if frame 0/1 saw L['x'] turn dynamic, and later 1/1 saw L['y'], we'd log "L['x'],L['y']" on frame 1/1).

This switches to frame-specific whitelists, as attributing dynamism changes to certain frames was difficult, and suggestions are sometimes polluted by problematic frames (e.g. optimizer states).

The globally aggregated whitelist is still available in tlparse, by looking at the final `put_local_code_state_*` entry.

Test Plan:
loggercli codegen GeneratedPt2CompileEventsLoggerConfig

Rollback Plan:

Differential Revision: D76628834

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155959
Approved by: https://github.com/bobrenjc93
2025-06-21 06:15:51 +00:00
9de23d0c29 [dynamo] Weblink generation when unimplemented_v2() is called (#156033)
This PR includes the GBID weblink whenever a user encounters a graph break. I also had to include the JSON file in setup.py, so it can be part of the files that are packaged in during CI. It also fixes the issue of the hardcoded error messages stripping away one of the '/' in 'https'.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156033
Approved by: https://github.com/williamwen42
2025-06-21 05:47:54 +00:00
b8ace6f951 Make dtensor tests device agnostic (#155687)
## MOTIVATION
This PR is a continuation of https://github.com/pytorch/pytorch/pull/154840 and we are trying to make the tests more device agnostic by removing hard coded references to any particular device.
Please refer to this RFC as well: https://github.com/pytorch/rfcs/pull/66

## CHANGES
1. test_convolution_ops.py:
    - Replace "cuda" with self.device_type
2. test_random_ops.py:
    - Remove setting and using TYPE_DEVICE variable since device_type is set as per the environment (device) in DTensorTestBase class.
    - Replace "cuda" with self.device_type

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155687
Approved by: https://github.com/EikanWang, https://github.com/d4l3k
2025-06-21 04:51:59 +00:00
f3ec16c26a [MTIA Aten Backend][3/n] Migrate mm.out from out-of-tree to in-tree (#154393)
# Context

See the first PR https://github.com/pytorch/pytorch/pull/153670

# This diff

Migrate mm.out from out-of-tree to in-tree.

We dispatch mm.out to MTIA separately from CPU/CUDA. So this diff adds the file `MTIAOps.cpp` under `ATen/native/mtia` to hold the dispatched functions. In the future we can split `MTIAOps.cpp` into categorized ops files.

Differential Revision: [D74743849](https://our.internmc.facebook.com/intern/diff/D74743849/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154393
Approved by: https://github.com/albanD, https://github.com/egienvalue, https://github.com/nautsimon
2025-06-21 04:31:04 +00:00
88b9c285e0 Workaround for e4m2 dtype (#156461)
Found in: https://github.com/pytorch/ao/pull/2408

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156461
Approved by: https://github.com/vkuzo
2025-06-21 04:01:44 +00:00
554b568040 Add internal use only utility to allow externally visible side effects within HOPs (#155715)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155715
Approved by: https://github.com/zou3519
2025-06-21 03:55:28 +00:00
c09b054878 Add runtime profiler info for AOTDispatcher prologue (#155785)
Fixes #155721

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155785
Approved by: https://github.com/bdhirsh
2025-06-21 03:34:07 +00:00
fd8ea3c8a3 [symm_mem] Add nccl as a backend for symmetric memory (#155740)
Running unit test:

 TORCH_SYMMMEM=NCCL TORCH_DISTRIBUTED_DEBUG=INFO TORCH_CPP_LOG_LEVEL=INFO pytest test/distributed/test_nccl.py -k test_nccl_symmem_alloc

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155740
Approved by: https://github.com/kwen2501
2025-06-21 03:22:23 +00:00
ee56e9f8a8 [BE] Make Eigen an optional dependency (#155955)
Whose version is controlled by `eigen_pin.txt`, but which will be installed only if BLAS providers could not be found.
Why this is good for CI: we don't really build with Eigen, and GitLab can be down when GitHub is up, which has caused spurious CI failures in the past.

Remove eigen submodule and replace it with eigen_pin.txt

Fixes https://github.com/pytorch/pytorch/issues/108773
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155955
Approved by: https://github.com/atalman
2025-06-21 03:02:02 +00:00
b4228a94d1 Split the exclude pattern for CODESPELL linter (#156229)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156229
Approved by: https://github.com/albanD
ghstack dependencies: #156080, #156081
2025-06-21 02:47:40 +00:00
e3507c3777 [BE] fix typos in functorch/ and scripts/ (#156081)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156081
Approved by: https://github.com/albanD
ghstack dependencies: #156080
2025-06-21 02:47:40 +00:00
2ccfd14e23 [BE] fix typos in docs/ (#156080)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156080
Approved by: https://github.com/cyyever, https://github.com/albanD
2025-06-21 02:47:32 +00:00
clr
9aaa184105 dynamo: Don't crash when someone tries to access a non existent list member (#156335)
dynamo: Don't crash when someone tries to access a non existent list member

Test added which reproduces the failure. Note that I'm using the new
unimplemented_v2 API. Let me know if people have a strong preference that I use
something else.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156335
Approved by: https://github.com/jansel
2025-06-21 02:26:31 +00:00
ac86ec0e60 [Draft][CUDA] Use runtime driver API for cuStreamWriteValue32 (#156097)
Fixes  #154073

Reference: https://github.com/NVIDIA/Fuser/pull/4197

See PR #154097

@nWEIdia is currently out of the office, so I’ve temporarily taken over his work.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156097
Approved by: https://github.com/ngimel

Co-authored-by: Wei Wang <weiwan@nvidia.com>
2025-06-21 01:34:41 +00:00
e98dd95446 [nativert] Move SerialGraphExecutor to PyTorch core (#156459)
Summary: `SerialGraphExecutor` inherits from `GraphExecutorBase` and executes all nodes in the graph in a serial manner

Test Plan:
CI

Rollback Plan:

Differential Revision: D76917966

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156459
Approved by: https://github.com/zhxchen17, https://github.com/jingsh
2025-06-21 01:32:06 +00:00
a67eb1a0d6 [ez] remove unused functions (#156466)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156466
Approved by: https://github.com/jingsh
2025-06-21 00:38:34 +00:00
2ee23175d9 [dynamo][guards] Catch exception and return false in the backend match (#156341)
It's difficult to write a test. I found this while debugging a segfault.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156341
Approved by: https://github.com/williamwen42
2025-06-21 00:13:26 +00:00
0f0c010714 [c10d] init_process_group supports index-only device id (#156214)
Before:
```
acc = torch.accelerator.current_accelerator()
if acc:
  local_idx = ...
  dist.init_process_group(
    device_id=torch.device(acc.type, local_idx)
  )
```
After:
```
dist.init_process_group(device_id=local_idx)
```

That is, `init_process_group` checks `torch.accelerator.current_accelerator()` internally.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156214
Approved by: https://github.com/guangyey, https://github.com/albanD
2025-06-21 00:02:37 +00:00
fbbab794ef [ONNX] Implement Attention-23 (#156431)
Implement Attention-23 using sdpa and flexattention.

- I used copilot for this.
- Also updated the conversion logic to remove trailing None inputs.

@gramalingam @kunal-vaishnavi @titaiwangms
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156431
Approved by: https://github.com/titaiwangms

Co-authored-by: kunal-vaishnavi <115581922+kunal-vaishnavi@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
2025-06-20 23:54:57 +00:00
0ad88a2224 Support environment var for autotune log (#156254)
Summary: As titled.

Test Plan:
See the Sandcastle signal

Rollback Plan:

Differential Revision: D76860928

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156254
Approved by: https://github.com/Mingming-Ding
2025-06-20 23:06:33 +00:00
6098209bff [BE][5/X] Phase out usage of use_max_autotune() (#156269)
These look to be the last call sites using `use_max_autotune(...)`, so remove them along with `use_max_autotune(...)` itself.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156269
Approved by: https://github.com/masnesral
2025-06-20 22:37:45 +00:00
5ab257c74c [invoke_subgraph] Make invoke_subgraph cacheable (#156448)
It's unclear to me what happens if the subgraph itself is not cacheable. IMO, there is nothing special about invoke_subgraph that would prevent caching.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156448
Approved by: https://github.com/oulgen, https://github.com/zou3519
2025-06-20 21:20:23 +00:00
e2351f2dcf fix apparent copy-paste bug in log_softmax reduced-precision fp kernel (#156379)
This looks like a bug. Check whether fixing it breaks existing tests; if not, we will look into why no test coverage caught it.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156379
Approved by: https://github.com/janeyx99
2025-06-20 20:54:53 +00:00
b8fc5e0c0d skip flaky test in CPython 3.13 tests (#155561)
Changed files:
* test_math.py

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155561
Approved by: https://github.com/zou3519
2025-06-20 20:25:35 +00:00
754c04aa06 Revert "[dynamo] raise hard error if error is encountered while tracing resume function prologue (#154564)"
This reverts commit 0aed855b2bde6d9bd045bb20cc24544a9f2fb72b.

Reverted https://github.com/pytorch/pytorch/pull/154564 on behalf of https://github.com/ezyang due to regresses functorch_maml_omniglot ([comment](https://github.com/pytorch/pytorch/pull/154564#issuecomment-2992685744))
2025-06-20 20:18:24 +00:00
de1930a429 Add ONNX dynamo metadata documentation (#155816)
Describe auto-generated metadata when calling torch.onnx.export

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155816
Approved by: https://github.com/justinchuby, https://github.com/titaiwangms

Co-authored-by: Justin Chu <justinchuby@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
2025-06-20 20:12:22 +00:00
a69e27ca5a Remove unused MultiKernelCall import from inductor codegen (#156158)
The import is now done locally within `async_compile.multi_kernel`, so the module-level import is unused:

```
    def multi_kernel(self, *args, **kwargs) -> Any:
        from torch._inductor.codegen.multi_kernel import MultiKernelCall

        # no need to call this in parallel since the sub-kernels are already parallel tasks
        return MultiKernelCall(*args, **kwargs)

```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156158
Approved by: https://github.com/jansel, https://github.com/shunting314
2025-06-20 19:55:24 +00:00
e5ea24fb27 [nativert] Move auto_functionalize_kernel (#156454)
Summary:
Torch Native Runtime RFC: https://github.com/pytorch/rfcs/pull/72

As part of the effort to open source TorchNativeRuntime (or what we call Sigmoid), we are moving the Pytree implementation to torch/:

fbcode/sigmoid/kernels -> fbcode/caffe2/torch/nativert/kernels

Copied from original auto_functionalize Diff Summary D53776805:

This is a non-functional kernel implementation for auto_functionalize

In AutoFunctionalizeKernel, I directly call the underlying target without making a clone of mutating inputs.

This mutates the input tensors in place, which is unsafe in general.

However, Sigmoid is not doing any graph optimization or node reordering at the moment, so it's OK to take this shortcut.

The proper functional implementation will:

- make a clone of each mutating input tensor
- return these new tensor instances as the AutoFunctionalizeKernel output.

If the original exported program has some "bufferMutation" or "userInputMutation" fields, it will also need to honor such mutations in Sigmoid.
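
A schematic of the two variants described above (illustrative only; `op` and the argument handling are placeholders, not the actual nativert kernel code):

```python
import torch
from typing import Callable, Sequence

def auto_functionalize_inplace(op: Callable, mutating_inputs: Sequence[torch.Tensor], *args):
    # Non-functional shortcut: call the op directly, mutating inputs in place.
    # Safe only because no graph optimization / node reordering happens afterwards.
    op(*mutating_inputs, *args)
    return list(mutating_inputs)

def auto_functionalize_functional(op: Callable, mutating_inputs: Sequence[torch.Tensor], *args):
    # Proper functional version: clone the mutating inputs, run the op on the
    # clones, and return the clones as the kernel's outputs.
    clones = [t.clone() for t in mutating_inputs]
    op(*clones, *args)
    return clones
```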

Test Plan: See internal for test plan

Differential Revision: D76926383

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156454
Approved by: https://github.com/zhxchen17
2025-06-20 19:53:16 +00:00
eb331b59fe Add shim fallback for narrow (#156496)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156496
Approved by: https://github.com/albanD
2025-06-20 19:47:00 +00:00
6ed85bfe6a Refine alignment check along dynamic dimension for grouped MMs (#155466)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155466
Approved by: https://github.com/ngimel
2025-06-20 19:42:57 +00:00
ef6d2cee7a [BE][MPS] Refactor core matmul logic into matmul_core (#155969)
In preparation for adding integer addmm, move the matmul computation into a matmul_inner function.

Change the arguments passed down the call stack from (group_id, thread_id_in_group) to (thread_id, thread_id_in_group), which eliminates the need to calculate the index.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155969
Approved by: https://github.com/Skylion007
2025-06-20 18:54:38 +00:00
18e4c461fb Update index.md (#155143)
Related to: https://github.com/pytorch/pytorch/issues/152134
Update to index.md to add language for Stable and Unstable

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155143
Approved by: https://github.com/AlannaBurke, https://github.com/atalman

Co-authored-by: Svetlana Karslioglu <svekars@meta.com>
2025-06-20 18:53:32 +00:00
502486d946 [PT2]Add weight and constant config path template (#156359)
Summary: As titled.

Test Plan:
N/A

Rollback Plan:

Differential Revision: D76925510

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156359
Approved by: https://github.com/SherlockNoMad
2025-06-20 18:46:01 +00:00
4b6cbf528b Add C shim fallback for fill_ (#156245)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156245
Approved by: https://github.com/desertfire
2025-06-20 18:45:48 +00:00
208ec60e72 Revert "[BE] Make Eigen an optional dependency (#155955)"
This reverts commit 1b50c12584909bda00009f4f0fd0d38ec792d019.

Reverted https://github.com/pytorch/pytorch/pull/155955 on behalf of https://github.com/atalman due to need to revert eigen test ([comment](https://github.com/pytorch/pytorch/pull/155955#issuecomment-2992512124))
2025-06-20 18:43:52 +00:00
d309cd1d50 Revert "[BE][MPS] Refactor core matmul logic into matmul_core (#155969)"
This reverts commit 769d754ab2469813a3b790ec58c25c466099dd3d.

Reverted https://github.com/pytorch/pytorch/pull/155969 on behalf of https://github.com/atalman due to need to revert eigen test ([comment](https://github.com/pytorch/pytorch/pull/155969#issuecomment-2992502683))
2025-06-20 18:40:38 +00:00
96d082d06b Revert "[InductorBench] Fix accuracy validation logic for MPS (#156385)"
This reverts commit 242eb19c8383b4b197963a8a564475d52c85ac66.

Reverted https://github.com/pytorch/pytorch/pull/156385 on behalf of https://github.com/malfet due to Has some bug in error handling ([comment](https://github.com/pytorch/pytorch/pull/156385#issuecomment-2992441769))
2025-06-20 18:17:18 +00:00
39270430c9 [inductor] force min num-split (off by default) (#155941)
This is a fix for the 10% QPS regression of some internal model (internal doc: [here](https://docs.google.com/document/d/19EiSZSS_SNUNfRg3jmevyrDs9nVpyvyGX_LHfiz-SbU/edit?tab=t.0#heading=h.dim0r28ztzu5) and [here](https://docs.google.com/document/d/1DjRWJPl1cgpceaj8YXTyw6FubGb43Vw-lTAETF9XXnI/edit?tab=t.0#heading=h.ld0vvn8o77sp) ).

The regression is caused by un-representable example inputs for compilation with dynamic shapes. While the general problem is hard to solve and requires more work, for this specific one, there is a quick fix. When we compile LayerNormBackward with small xnumel and large rnumel, we do split reduction. With un-representative inputs, rnumel may be something in the range like 4K and we pick a small num-split (9 in this specific case). Later on when we get an inputs with larger rnumel (100K range. no recompile due to dynamic shape enabled), the small num-split does not introduce enough parallelism and cause sub-optimal performance.

The quick fix is to force a minimum value for num_split. Say we split a reduction [xnumel, rnumel] into two in this order:
- [xnumel * num_split, rnumel / num_split]
- [xnumel, num_split]

A larger num_split always introduces more parallelism for kernel 1. It may result in more work in kernel 2. But if we set the minimum num_split to something not too large (like 256), each row in kernel 2 may still be handled by a reduction with a few or even a single warp, so there may be no slowdown for kernel 2.

Here are some benchmarking results.
```
import torch
from triton.testing import do_bench
import functools
from torch._inductor import config
from torch._dynamo.decorators import mark_dynamic
import os

@torch.compile(dynamic=True)
def f(x):
    return x.sum(dim=0)

N = 512
C = functools.partial(torch.randn, device="cuda")
x_small = C(4096, N)
x_large = C(4096 * 1000, N)

if os.getenv("HINT_WITH_SMALL_INPUT") == "1":
    x = x_small
else:
    x = x_large

mark_dynamic(x, 0)
f(x)

ms = do_bench(lambda: f(x_large))

# 4.03ms if hint with large input. Output code: https://gist.github.com/shunting314/0be562a0c14f8ec0852b12bbf53d7a15
# 8.32ms if hint with small input. Output code: https://gist.github.com/shunting314/79b924c266d5c562703c3bdfb48d8272
# 3.92ms if hint with small input, and force min num split: Output code: https://gist.github.com/shunting314/c82917a1849b698bf4d2be2fde2fd2ba
print(ms)
```
This test mimic what we see in the original problem.

- If we compile with large inputs and benchmark for large inputs, latency is 4.03ms
- if we compile with small input but benchmark for large inputs, we get more than 2x slowdown. latency is 8.32ms
- with the fix, even if we compile with small input and benchmark for large inputs, latency is 3.92ms. The perf is slightly better than the first case. So it's possible that the heuristic to decide num-split has room to improve

The minimum num-split restriction could be applied for dynamic shape case solely, but I found it can also help for static shape cases a little bit. So I plan to apply it without checking dynamic shape for now unless I see red signals in thorough perf test.
- Outer reduction with static shape: https://gist.github.com/shunting314/6a670a818e63533479399c4dbea5b29a . The fix improve perf from 0.01 ms to 0.009 ms
- Inner reduction with static shape: https://gist.github.com/shunting314/f12f20099126130b953e55ad325c0f62  Perf is neutral (0.011 ms v.s. 0.011ms)

A thorough perf test is running here: https://github.com/pytorch/pytorch/actions/runs/15642912325

# Update for not applying the change to static shape:
from the perf test result [here](https://hud.pytorch.org/benchmark/compilers?dashboard=torchinductor&startTime=Mon%2C%2009%20Jun%202025%2020%3A57%3A15%20GMT&stopTime=Mon%2C%2016%20Jun%202025%2020%3A57%3A15%20GMT&granularity=hour&mode=training&dtype=amp&deviceName=cuda%20(h100)&lBranch=gh/shunting314/210/head&lCommit=62b8e191e027842d402fb046a429732616f87570&rBranch=main&rCommit=5b9db4335e61c1c903cb0769282cbea588e49036), it looks like the change hurts perf for static shape case. I think one reason is the change may increase the number of kernels and lose some fusion opportunities. Check the following code for example:
```
import torch
from torch._inductor import config

aten = torch.ops.aten

def f(x):
    return aten.bernoulli(x).sum()

x = torch.randn(8000 * 3, dtype=torch.bfloat16, device="cuda")
torch.compile(f)(x)
```

With the change the bernoulli kernel would NOT be able to fuse with the first layer reduction due to 8000 * 3 is not divisible by 256. Potentially we could improve the change to always pick num-split greater than 256 and divisible by rnumel . But I'll simply apply the change for dynamic shape for now since that's the original issue.

Another perf test only applying min-num-split to dynamic shape [here](https://hud.pytorch.org/benchmark/compilers?dashboard=torchinductor&startTime=Wed%2C%2011%20Jun%202025%2018%3A14%3A04%20GMT&stopTime=Wed%2C%2018%20Jun%202025%2018%3A14%3A04%20GMT&granularity=hour&mode=training&dtype=amp&deviceName=cuda%20(h100)&lBranch=gh/shunting314/210/head&lCommit=e7b2cf55f30a585acd4d907fc9127fcb30a256cc&rBranch=main&rCommit=d3d655ad14ee4cd1c135ac57bbf75d5623fc9fa6)

Differential Revision: [D76625617](https://our.internmc.facebook.com/intern/diff/D76625617)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155941
Approved by: https://github.com/jansel, https://github.com/bobrenjc93
2025-06-20 18:01:28 +00:00
55dae0bf7a Add a basic shim and stable::Tensor is_contiguous API (#156228)
Add a limited is_contiguous in shim, stable::Tensor API with a test case
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156228
Approved by: https://github.com/desertfire
2025-06-20 17:59:52 +00:00
49ee1e7106 [CI] Reuse old whl: loosen check for deleted files, do not handle renames (#156138)
Make the check for deleted files apply only to files in the torch folder, since docs-only changes could not get through it otherwise.
Use `--no-renames` so that both the old name and the new name show up in the diff. Without it, I think only the new name shows up in git diff.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156138
Approved by: https://github.com/huydhn, https://github.com/malfet, https://github.com/cyyever
2025-06-20 17:58:04 +00:00
e31f205292 [Inductor] Adjust boundary checking of dimensions using YBLOCK (#149504)
Apply the same logic introduced in https://github.com/pytorch/pytorch/pull/139751 to triton kernels using block ptrs. Here, if ynumel / YBLOCK > max_y_grids, dimensions dependent on YBLOCK need to be boundary checked, even if the block shape in such dimensions is a multiple of an expression in YBLOCK. This is because ynumel / YBLOCK % get_max_y_grids() may not be zero, so redundant programs will be launched that will attempt to read / write OOB.
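
A small arithmetic sketch of the condition (assuming the usual CUDA limit of 65535 on gridDim.y; the actual value comes from `get_max_y_grids()`, and this is not the Inductor code itself):

```python
import math

ynumel, YBLOCK = 4_000_000, 32
MAX_Y_GRIDS = 65535  # assumption: typical CUDA limit on gridDim.y

programs_needed = math.ceil(ynumel / YBLOCK)   # 125000 > 65535
# The y-programs must be folded into the allowed grid, and because
# programs_needed is not a multiple of MAX_Y_GRIDS, the folded launch
# includes redundant programs that would read/write out of bounds
# without an explicit boundary check on the YBLOCK-dependent dimension.
needs_boundary_check = programs_needed > MAX_Y_GRIDS
print(needs_boundary_check)  # True
```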

Pull Request resolved: https://github.com/pytorch/pytorch/pull/149504
Approved by: https://github.com/blaine-rister

Co-authored-by: blaine-rister <145300525+blaine-rister@users.noreply.github.com>
2025-06-20 17:43:38 +00:00
d83ff89d3b Add toggle functionality for XPU profiler (#155135)
Fixes #154898 by adding the ability to toggle the XPU profiler on and off (which has already been added in pytorch/kineto#1088).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155135
Approved by: https://github.com/guangyey, https://github.com/sraikund16
2025-06-20 17:27:48 +00:00
1b50c12584 [BE] Make Eigen an optional dependency (#155955)
Whose version is controlled by `eigen_pin.txt`, but which will be installed only if BLAS providers could not be found.
Why this is good for CI: we don't really build with Eigen, and GitLab can be down when GitHub is up, which has caused spurious CI failures in the past.

Remove eigen submodule and replace it with eigen_pin.txt

Fixes https://github.com/pytorch/pytorch/issues/108773
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155955
Approved by: https://github.com/atalman
ghstack dependencies: #155947, #155954
2025-06-20 17:21:27 +00:00
63360e64da [BE][Easy] do not install yanked types-pkg-resources in lint environment (#156462)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156462
Approved by: https://github.com/ezyang
2025-06-20 16:00:43 +00:00
1036f6d114 Revert "[ROCm] Bump AOTriton to 0.10b (#156290)"
This reverts commit 34d8e64ef64d88324092a2028884c54c13e086b3.

Reverted https://github.com/pytorch/pytorch/pull/156290 on behalf of https://github.com/atalman due to failing multiple internal tests ([comment](https://github.com/pytorch/pytorch/pull/156290#issuecomment-2992072727))
2025-06-20 15:35:25 +00:00
b4442f42a9 Revert "Upgrade to DLPack 1.0. (#145000)"
This reverts commit 6e185c53124e1b5a0fe391959060c1249178bcb6.

Reverted https://github.com/pytorch/pytorch/pull/145000 on behalf of https://github.com/atalman due to failing internal tests ([comment](https://github.com/pytorch/pytorch/pull/145000#issuecomment-2992055400))
2025-06-20 15:32:47 +00:00
edd45f3a02 Revert "[Precompile] Hook up backend="inductor" (#155387)"
This reverts commit 2c68c3e8d5e9a235f5861be6486de4959f80c840.

Reverted https://github.com/pytorch/pytorch/pull/155387 on behalf of https://github.com/atalman due to dynamo/test_precompile_context.py::PrecompileContextTests::test_basic [GH job link](https://github.com/pytorch/pytorch/actions/runs/15772892021/job/44464141039) [HUD commit link](2c68c3e8d5) ([comment](https://github.com/pytorch/pytorch/pull/155387#issuecomment-2992044073))
2025-06-20 15:30:04 +00:00
e1f28fe17b add device generalisation support for distributed tests (#152471)
### MOTIVATION
To generalize Distributed test cases for non-CUDA devices

### CHANGES

- test/distributed/optim/test_zero_redundancy_optimizer.py
- test/distributed/test_c10d_logger.py
- test/distributed/test_compute_comm_reordering.py

Replaced hard coded device names with get_devtype from torch.testing._internal.common_fsdp.
DistributedTestBase is used instead of MultiProcessTestCase, to make use of helper functions.

- torch/testing/_internal/common_distributed.py

extended common utility functions

Pull Request resolved: https://github.com/pytorch/pytorch/pull/152471
Approved by: https://github.com/d4l3k
2025-06-20 07:35:42 +00:00
0aed855b2b [dynamo] raise hard error if error is encountered while tracing resume function prologue (#154564)
This should prevent bad resume function prologues from slipping by. In particular, graph breaks in resume function prologues will now hard error.

Implementation details:
- The resume function prologue is surrounded by `LOAD_CONST arg, STORE_FAST __is_tracing_resume_prologue` instructions. The first sequence has `arg=True` and the second sequence has `arg=False`.
- InstructionTranslator will know when it is tracing a resume function prologue when it detects `STORE_FAST __is_tracing_resume_prologue`. The top of stack will be True to mark the start of the prologue, False to mark the end.
- When `convert_frame.py` detects that an error occurred while the InstructionTranslator was tracing a resume function prologue, we will wrap the exception and hard error

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154564
Approved by: https://github.com/jansel
ghstack dependencies: #154283, #154289, #154782, #155166
2025-06-20 07:03:29 +00:00
24dc33b37b [dynamo] handle fullgraph toggle using nested torch.compile (#155166)
See added test for the case that this PR handles. In particular, the semantics for nested torch.compile with toggled fullgraph settings was strange before - `@torch.compile(fullgraph=True)` overrides the existing fullgraph setting, while `@torch.compile(fullgraph=False)` does not.

Note that this change will add an extra frame to any inlined torch.compile'd function (which I don't expect to happen frequently).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155166
Approved by: https://github.com/jansel
ghstack dependencies: #154283, #154289, #154782
2025-06-20 07:03:29 +00:00
537b0877a8 [dynamo] fix set_fullgraph for nested calls (#154782)
- Make the fullgraph argument of set_fullgraph a positional argument
- Fix behavior on nested calls by updating `tracer.error_on_graph_break` in more places. In particular, a tracer's error_on_graph_break is set to the inlined tracer's error_on_graph_break upon the latter's exit. We also track error_on_graph_break in the speculation log now, since if we encounter a nested graph break, we will restart analysis and we need to somehow remember the error_on_graph_break setting after attempting to run the nested function (but we don't actually trace into it in the restart analysis).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154782
Approved by: https://github.com/jansel
ghstack dependencies: #154283, #154289
2025-06-20 07:03:16 +00:00
2c372a0502 [dynamo] add set_fullgraph decorator/context manager (#154289)
Implements https://github.com/pytorch/pytorch/issues/144908.

Implementation notes:
- `set_fullgraph` is implemented using `patch_config`, which changes config correctly during runtime and tracing.
- Moved setting `config.error_on_graph_break` from convert_frame.py to eval_frame.py. This is because this should only be done at the top-level decorated function. If we kept this in convert_frame.py, we would be changing `config.error_on_graph_break` on every top-level frame, which causes confusing behavior (see added test for example).
- InstructionTranslator reads from `config.error_on_graph_break` every `step()`. This is to determine the value of `config.error_on_graph_break` at the time of the graph break, because tracer cleanup will restore the value of `config.error_on_graph_break` .
- `convert_frame.py` determines whether we should abort tracing (fullgraph=True) or continue (fullgraph=False) by reading the value of the tracer's `error_on_graph_break`. If there is no tracer (failed to initialize), then default to reading `config.error_on_graph_break`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154289
Approved by: https://github.com/jansel, https://github.com/zou3519
ghstack dependencies: #154283
2025-06-20 07:03:07 +00:00
b46eb1ccaf [dynamo] control one_graph behavior additionally through config (#154283)
`torch.compile` now always goes through `torch._dynamo._optimize`. fullgraph is now implemented in `torch.compile` by looking at `config.error_on_graph_break`. Export still goes through `torch._dynamo._optimize_assert`, which uses `tx.one_graph` instead of `config.error_on_graph_break`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154283
Approved by: https://github.com/jansel, https://github.com/anijain2305
2025-06-20 07:02:57 +00:00
2c68c3e8d5 [Precompile] Hook up backend="inductor" (#155387)
This PR adds the necessary things to register and record backend ids from BundledAOTAutogradCacheEntry.

One TODO to point out: in this diff, if there are multiple backends that would have the same AOTAutogradCache key (traditional cache key, not backend_id), we just end up serializing the same BundledAOTAutogradCacheEntry multiple times. This is obviously not ideal, so we'll want to deduplicate these and just track the different keys that one BundledAOTAutogradCacheEntry is associated with instead. This shouldn't be super hard to do, though, as we just need to run a deduplication step on call to `serialize()`, I think.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155387
Approved by: https://github.com/oulgen
2025-06-20 06:38:29 +00:00
d5b4a32960 [BE] fix PYPROJECT linting errors in test/ and tools/ (#156021)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156021
Approved by: https://github.com/Skylion007
2025-06-20 06:19:05 +00:00
4cbbc8b458 [MPS] Implement backward pass for interpolate_trilinear (#156373)
The backward pass simply iterates over all 8 points the current point contributed to and back-propagates them with the respective weights.

TODO: Benchmark the performance of a similar loop for the forward pass (i.e. the compiler should be able to do loop unrolling, so there is no point in unrolling it by hand)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156373
Approved by: https://github.com/dcci
ghstack dependencies: #156375
2025-06-20 05:41:24 +00:00
c37ddcaefb Fix torchgen update-aoti-shim (#156323)
will remove the fill changes before landing and let Jane merge her changes!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156323
Approved by: https://github.com/janeyx99
2025-06-20 05:23:06 +00:00
f7a5ad6c29 [Inductor][CPP] Fix WOQ int4 accuracy issue when NC large than one (#156407)
**Summary**
There is an accuracy issue when `Nc_block` is greater than 1 in WOQ int4 GEMM. Previously, we used the slice `{%- set tile_W = kernel.slice_nd(W, [("n_start", "n_start + n_size"), ("k_start * Nr / 2", "k_end * Nr / 2")]) %}`, which means that each `ni` in `Nc_block` takes the exact same N slice from `n_start` to `n_start + n_size`, leading to the accuracy problem. This accuracy issue is exposed by [PR #156174](https://github.com/pytorch/pytorch/pull/156174), which changes `block_N` from 64 to 32. This change increases the likelihood of `Nc_block` being greater than 1, making it more likely to trigger the issue. This PR will fix this accuracy issue.

**Test Plan**
```
python test/inductor/test_cpu_select_algorithm.py -k test_int4_woq_mm_amx_Nc_larger_than_one
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156407
Approved by: https://github.com/CaoE
2025-06-20 03:08:02 +00:00
72c8751b61 Align meta deducing for fft_r2c with fft_r2c_mkl on XPU (#156048)
There is a memory layout mismatch between the `fft_r2c` XPU implementation and Inductor's meta deduction.
The original `fft_r2c` Inductor meta deduction for the XPU backend was aligned with CPU (fallback). This PR corrects the Inductor meta deduction and updates the torch-xpu-ops commit to [intel/torch-xpu-ops@`3a9419c`](3a9419c8bb).
The XPU implementation first performs the R2C transform on the last dimension, followed by iterative C2C transforms on the remaining dimensions.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156048
Approved by: https://github.com/guangyey, https://github.com/etaf, https://github.com/jansel
2025-06-20 01:41:03 +00:00
159a39ad34 Add an option for cpp_wrapper to compile entry and kernel separately (#156050)
Fixes #156037.
Compiling the entry and the kernel separately has a non-negligible impact on performance. This PR adds an option for cpp_wrapper to control whether to compile the entry and the kernel separately, and turns it off by default.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156050
Approved by: https://github.com/leslie-fang-intel, https://github.com/benjaminglass1, https://github.com/jansel
2025-06-20 01:11:16 +00:00
ebab279942 Forward fix inductor benchmark after #150287 (#156455)
Looks like https://github.com/pytorch/pytorch/pull/150287 stack fixed some inductor tests
HUD: https://hud.pytorch.org/hud/pytorch/pytorch/main/1?per_page=50&name_filter=inductor-periodic%20%2F%20linux-jammy-cpu-py3.9-gcc11-inductor

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156455
Approved by: https://github.com/huydhn
2025-06-20 00:04:15 +00:00
cyy
3c2324c64a [2/N] Fix cppcoreguidelines-init-variables suppression (#146237)
This PR removes all `cppcoreguidelines-init-variables` suppressions.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/146237
Approved by: https://github.com/ezyang
2025-06-19 23:26:42 +00:00
52f873adc2 Add logging for async compile worker statistics (#155820)
Add some on-exit logging to the async compile workers. When you use `TORCH_LOGS=async_compile` (or `all`), it will now report how many workers were enqueued and dequeued (should be the same), as well as queuing time (how long workers sat on the queue before starting to run) and maximum depth (how many workers were waiting to start).

Tested manually by running a larger internal model and then lowering the number of available workers to see the time and depth get longer.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155820
Approved by: https://github.com/masnesral
2025-06-19 23:10:15 +00:00
c60d8188d2 [nativert] Move GraphExecutorBase to PyTorch core (#156196)
Summary:
Moves GraphExecutorBase class to PyTorch core.
GraphExecutorBase is a lightweight abstraction to execute a graph with execution frames without actually owning the graph or the weights. It is introduced to decouple the state management of the top-level runtime from kernel execution so that subgraphs from higher-order ops can be supported.

Torch Native Runtime RFC: pytorch/rfcs#72

Test Plan:
CI

Rollback Plan:

Differential Revision: D76830436

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156196
Approved by: https://github.com/zhxchen17
2025-06-19 22:42:35 +00:00
34d8e64ef6 [ROCm] Bump AOTriton to 0.10b (#156290)
Notable new features/optimizations for SDPA operators on AMD systems from AOTriton 0.10b:

* Official support of gfx950/gfx1201
* Experimental support of gfx1101/gfx1151/gfx1150/gfx1200
* Reduce libaotriton.so binary size by over 80%.
  + Without this optimization the binary size of `libaotriton.so` could be
    over 100MiB due to 2x more supported architectures compared with 0.9b.
    Now it is only about 11MiB.
* Support sliding window attention (SWA) in
  `_flash_attention_forward/backward`. Should fix #154582

See https://github.com/ROCm/aotriton/releases/tag/0.10b for full details,
including Known Problems.

Notable changes to SDPA backend:

* `std::optional<int64_t>` `window_size_left/right` are directly passed to
  ROCM's SDPA backend, because the default value `-1` is meaningful to
  AOTriton's backend and bottom-right aligned causal mask is implemented with
  negative `window_size_left/right`
* Some code clean up around `USE_CK_FLASH_ATTENTION`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156290
Approved by: https://github.com/jithunnair-amd, https://github.com/jeffdaily
2025-06-19 21:13:58 +00:00
3644b41a7c [ONNX] Note on attention op symbolic function (#156441)
Follow up https://github.com/pytorch/pytorch/pull/156367
Explain why num_heads is provided when ONNX Attention op does not need it in torch case: The thread: https://github.com/pytorch/pytorch/pull/156367#discussion_r2155727038

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156441
Approved by: https://github.com/justinchuby
2025-06-19 21:00:05 +00:00
443b5b43c3 xpu: fix AOT compilation in sycl cpp extension (#156364)
This commit fixes AOT compilation in the SYCL cpp extension, which was accidentally dropped in aca2c99a652 (causing a fallback to JIT compilation). It also fixes the override logic for the default SYCL targets, allowing targets to be specified externally. Further, it extends test coverage to cover such a case and fixes an issue in the test where subsequent tests executed the same (first) compiled extension due to name conflicts.

Fixes: #156249
Fixes: aca2c99a652 ("xpu: get xpu arch flags at runtime in cpp_extensions (#152192)")

CC: @pengxin99, @guangyey

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156364
Approved by: https://github.com/ezyang
2025-06-19 20:11:38 +00:00
d32deb664a [c10d] Disable NCCL NVLS when using deterministic mode (#156381)
via setting env `NCCL_ALGO=^NVLS`.

Note that this setting must be made before the first NCCL init. Otherwise, it won't take effect.
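
A minimal sketch of the ordering constraint from the user's side (this PR sets the variable internally when deterministic mode is enabled; the snippet below shows the equivalent manual setup and assumes rank/world size come from the usual launcher env vars):

```python
import os

# Must be set before the first NCCL communicator is created;
# setting it after the first NCCL init has no effect.
os.environ["NCCL_ALGO"] = "^NVLS"

import torch.distributed as dist

dist.init_process_group(backend="nccl")
```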

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156381
Approved by: https://github.com/ngimel
2025-06-19 20:09:24 +00:00
69f2e09cc2 Add more shards to H100 benchmark, and also run it more frequently (#156429)
There are 32 H100 `linux.aws.h100` runners and they are still not fully utilized, with more than half staying idle, so we can add more shards to finish the whole suite within 4 hours. I added 1 more for `TIMM` and 3 more for `TorchBench`, using the duration from a sample run https://github.com/pytorch/pytorch/actions/runs/15753185459/job/44411825090

With this computing power, we can also run the whole suite every 4 hours now. I can run it less frequently later if I see queueing

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156429
Approved by: https://github.com/atalman
2025-06-19 20:02:56 +00:00
aac0e8f0e9 [build] Create target for flash attention (#156235)
Create a target for flash attention so it can be built using `ninja flash_attention`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156235
Approved by: https://github.com/Skylion007, https://github.com/cyyever
2025-06-19 20:02:38 +00:00
c2f4cc59a7 [MPS] Fix bug in 3d coords calculation (#156375)
Which was not caught by CI beforehand, as all 3D examples right now are symmetric, so add an uneven shape to `sample_inputs_interpolate`

Though it's indirectly tested by `test_upsample_nearest3d` inductor test
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156375
Approved by: https://github.com/atalman
2025-06-19 19:56:15 +00:00
c0ee01c2fb tools/nightly.py: only download torch via pip and install dependencies via uv (#156409)
Setup time (cpu-only): 70s -> 27.6s -> 17.4s

The tool can setup the pinned NVIDIA dependencies correctly:

```console
$ make setup-env-cuda PYTHON="${HOMEBREW_PREFIX}/bin/python3.13" && source venv/bin/activate
make setup-env PYTHON="/home/linuxbrew/.linuxbrew/bin/python3.13" NIGHTLY_TOOL_OPTS="pull --cuda"
make[1]: Entering directory '/home/PanXuehai/Projects/pytorch'
/home/linuxbrew/.linuxbrew/bin/python3.13 tools/nightly.py pull --cuda
log file: /home/PanXuehai/Projects/pytorch/nightly/log/2025-06-19_21h16m16s_94cd1471-4d0f-11f0-b120-b88584c06696/nightly.log
Creating virtual environment
Removing existing venv: /home/PanXuehai/Projects/pytorch/venv
Creating venv (Python 3.13.4): /home/PanXuehai/Projects/pytorch/venv
Installing packages
Upgrading package(s) (https://download.pytorch.org/whl/nightly/cu128):
  - uv
  - pip
  - setuptools
  - packaging
  - wheel
  - build[uv]
Looking in indexes: https://pypi.tuna.tsinghua.edu.cn/simple, https://download.pytorch.org/whl/nightly/cu128
Collecting uv
  Using cached f2e96cec5e/uv-0.7.13-py3-none-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (17.8 MB)
Requirement already satisfied: pip in ./venv/lib/python3.13/site-packages (25.1.1)
Collecting setuptools
  Using cached 17031897da/setuptools-80.9.0-py3-none-any.whl (1.2 MB)
Collecting packaging
  Using cached 38679034af/packaging-25.0-py3-none-any.whl (66 kB)
Collecting wheel
  Using cached 87f3254fd8/wheel-0.45.1-py3-none-any.whl (72 kB)
Collecting build[uv]
  Using cached 80633736cd/build-1.2.2.post1-py3-none-any.whl (22 kB)
Collecting pyproject_hooks (from build[uv])
  Using cached 12818598c3/pyproject_hooks-1.2.0-py3-none-any.whl (10 kB)
Installing collected packages: wheel, uv, setuptools, pyproject_hooks, packaging, build
Successfully installed build-1.2.2.post1 packaging-25.0 pyproject_hooks-1.2.0 setuptools-80.9.0 uv-0.7.13 wheel-0.45.1
Installing packages took 6.251 [s]
Creating virtual environment took 9.050 [s]
Downloading packages
Downloading package(s) (https://download.pytorch.org/whl/nightly/cu128): torch
Looking in indexes: https://pypi.tuna.tsinghua.edu.cn/simple, https://download.pytorch.org/whl/nightly/cu128
Collecting torch
  Using cached https://download.pytorch.org/whl/nightly/cu128/torch-2.8.0.dev20250619%2Bcu128-cp313-cp313-manylinux_2_28_x86_64.whl.metadata (30 kB)
Using cached https://download.pytorch.org/whl/nightly/cu128/torch-2.8.0.dev20250619%2Bcu128-cp313-cp313-manylinux_2_28_x86_64.whl (1040.3 MB)
Saved /tmp/pip-download-xeqmhrww/torch-2.8.0.dev20250619+cu128-cp313-cp313-manylinux_2_28_x86_64.whl
Successfully downloaded torch
Downloaded 1 file(s) to /tmp/pip-download-xeqmhrww:
  - torch-2.8.0.dev20250619+cu128-cp313-cp313-manylinux_2_28_x86_64.whl
Downloading packages took 6.284 [s]
Unpacking wheel file
Unpacking to: /tmp/wheel-kugk2os0/torch-2.8.0.dev20250619+cu128...OK
Unpacking wheel file took 15.107 [s]
Installing dependencies
Installing packages
Installing package(s) (https://download.pytorch.org/whl/nightly/cu128):
  - filelock
  - typing-extensions>=4.10.0
  - setuptools; python_version >= "3.12"
  - sympy>=1.13.3
  - networkx
  - jinja2
  - fsspec
  - nvidia-cuda-nvrtc-cu12==12.8.93; platform_system == "Linux" and platform_machine == "x86_64"
  - nvidia-cuda-runtime-cu12==12.8.90; platform_system == "Linux" and platform_machine == "x86_64"
  - nvidia-cuda-cupti-cu12==12.8.90; platform_system == "Linux" and platform_machine == "x86_64"
  - nvidia-cudnn-cu12==9.10.2.21; platform_system == "Linux" and platform_machine == "x86_64"
  - nvidia-cublas-cu12==12.8.4.1; platform_system == "Linux" and platform_machine == "x86_64"
  - nvidia-cufft-cu12==11.3.3.83; platform_system == "Linux" and platform_machine == "x86_64"
  - nvidia-curand-cu12==10.3.9.90; platform_system == "Linux" and platform_machine == "x86_64"
  - nvidia-cusolver-cu12==11.7.3.90; platform_system == "Linux" and platform_machine == "x86_64"
  - nvidia-cusparse-cu12==12.5.8.93; platform_system == "Linux" and platform_machine == "x86_64"
  - nvidia-cusparselt-cu12==0.7.1; platform_system == "Linux" and platform_machine == "x86_64"
  - nvidia-nccl-cu12==2.27.3; platform_system == "Linux" and platform_machine == "x86_64"
  - nvidia-nvshmem-cu12==3.2.5; platform_system == "Linux" and platform_machine == "x86_64"
  - nvidia-nvtx-cu12==12.8.90; platform_system == "Linux" and platform_machine == "x86_64"
  - nvidia-nvjitlink-cu12==12.8.93; platform_system == "Linux" and platform_machine == "x86_64"
  - nvidia-cufile-cu12==1.13.1.3; platform_system == "Linux" and platform_machine == "x86_64"
  - pytorch-triton==3.3.1+gitc8757738; platform_system == "Linux"
  - numpy
  - cmake
  - ninja
  - packaging
  - ruff
  - mypy
  - pytest
  - hypothesis
  - ipython
  - rich
  - clang-format
  - clang-tidy
  - sphinx
Using Python 3.13.4 environment at: venv
Resolved 78 packages in 2.95s
Installed 76 packages in 93ms
 + alabaster==1.0.0
 + asttokens==3.0.0
 + attrs==24.2.0
 + babel==2.17.0
 + certifi==2024.8.30
 + charset-normalizer==3.3.2
 + clang-format==20.1.6
 + clang-tidy==20.1.0
 + cmake==3.25.0
 + decorator==5.2.1
 + docutils==0.21.2
 + executing==2.2.0
 + filelock==3.18.0
 + fsspec==2025.5.1
 + hypothesis==6.135.11
 + idna==3.10
 + imagesize==1.4.1
 + iniconfig==2.1.0
 + ipython==9.3.0
 + ipython-pygments-lexers==1.1.1
 + jedi==0.19.2
 + jinja2==3.1.6
 + markdown-it-py==3.0.0
 + markupsafe==2.1.5
 + matplotlib-inline==0.1.7
 + mdurl==0.1.2
 + mpmath==1.3.0
 + mypy==1.16.1
 + mypy-extensions==1.0.0
 + networkx==3.5
 + ninja==1.11.1.4
 + numpy==2.3.0
 + nvidia-cublas-cu12==12.8.4.1
 + nvidia-cuda-cupti-cu12==12.8.90
 + nvidia-cuda-nvrtc-cu12==12.8.93
 + nvidia-cuda-runtime-cu12==12.8.90
 + nvidia-cudnn-cu12==9.10.2.21
 + nvidia-cufft-cu12==11.3.3.83
 + nvidia-cufile-cu12==1.13.1.3
 + nvidia-curand-cu12==10.3.9.90
 + nvidia-cusolver-cu12==11.7.3.90
 + nvidia-cusparse-cu12==12.5.8.93
 + nvidia-cusparselt-cu12==0.7.1
 + nvidia-nccl-cu12==2.27.3
 + nvidia-nvjitlink-cu12==12.8.93
 + nvidia-nvshmem-cu12==3.2.5
 + nvidia-nvtx-cu12==12.8.90
 + parso==0.8.4
 + pathspec==0.12.1
 + pexpect==4.9.0
 + pluggy==1.6.0
 + prompt-toolkit==3.0.51
 + ptyprocess==0.7.0
 + pure-eval==0.2.3
 + pygments==2.19.1
 + pytest==8.4.1
 + pytorch-triton==3.3.1+gitc8757738
 + requests==2.32.3
 + rich==14.0.0
 + roman-numerals-py==3.1.0
 + ruff==0.12.0
 + snowballstemmer==3.0.1
 + sortedcontainers==2.4.0
 + sphinx==8.2.3
 + sphinxcontrib-applehelp==2.0.0
 + sphinxcontrib-devhelp==2.0.0
 + sphinxcontrib-htmlhelp==2.1.0
 + sphinxcontrib-jsmath==1.0.1
 + sphinxcontrib-qthelp==2.0.0
 + sphinxcontrib-serializinghtml==2.0.0
 + stack-data==0.6.3
 + sympy==1.14.0
 + traitlets==5.14.3
 + typing-extensions==4.14.0
 + urllib3==2.2.3
 + wcwidth==0.2.13
Installing packages took 3.080 [s]
Installing dependencies took 3.080 [s]
Pulling nightly PyTorch
Found released git version 5622038e20ddb12b9a011c9a9128190d71a21cba
Found nightly release version 2625c70aecc6eced1dbe108279feab7509733bef
Already up to date.
Pulling nightly PyTorch took 0.017 [s]
Moving nightly files into repo
Moving nightly files into repo took 4.898 [s]
Writing pytorch-nightly.pth
Writing pytorch-nightly.pth took 0.021 [s]
-------
PyTorch Development Environment set up!
Please activate to enable this environment:

  $ source /home/PanXuehai/Projects/pytorch/venv/bin/activate

make[1]: Leaving directory '/home/PanXuehai/Projects/pytorch'
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156409
Approved by: https://github.com/ezyang
ghstack dependencies: #156408
2025-06-19 19:42:15 +00:00
71faa7e5b9 tools/nightly.py: use uv pip install instead of pip install (#156408)
Setup time: 70s -> 27.6s

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156408
Approved by: https://github.com/ezyang
2025-06-19 19:42:15 +00:00
134dfb3fe6 [dynamo] Fix cycle reference problem caused by recursive collect_temp_source in codegen (#155791)
The recursive function collect_temp_source, defined as a closure in PyCodegen, caused a reference-cycle issue when torch.compile is used.
This issue can cause large tensors not to be freed in a timely manner even when there are no user references to them.

We saw OOM issues because of this problem in many cases, including training and inference using torch.compile.
The fix replaces the recursive implementation with an iterative one.
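
A generic illustration of the kind of change described (not the actual PyCodegen code; `Node` is a hypothetical stand-in): a recursive closure keeps itself alive through its own closure cell, while a worklist avoids the cycle.

```python
from dataclasses import dataclass, field

@dataclass
class Node:                      # hypothetical stand-in for the real structure
    value: str
    children: list = field(default_factory=list)

# Recursive closure: `visit` captures itself in its own closure cell,
# creating a reference cycle that is only reclaimed by the GC.
def collect_recursive(root, out):
    def visit(n):
        out.append(n.value)
        for child in n.children:
            visit(child)
    visit(root)

# Iterative version: a plain worklist, no self-referencing closure.
def collect_iterative(root, out):
    stack = [root]
    while stack:
        n = stack.pop()
        out.append(n.value)
        stack.extend(n.children)
```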

Fixes #155778

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155791
Approved by: https://github.com/ezyang
2025-06-19 19:37:44 +00:00
e4c9f6d9a2 [nativert] Move c10_kernel (#156208)
Summary:
Torch Native Runtime RFC: https://github.com/pytorch/rfcs/pull/72

As part of the effort to open source TorchNativeRuntime (or what we call Sigmoid), we are moving the Pytree implementation to torch/:

fbcode/sigmoid/kernels -> fbcode/caffe2/torch/nativert/kernels

Test Plan:
```
buck run fbcode//mode/dev-nosan  //caffe2/test/cpp/nativert:c10_kernel_test
```

Differential Revision: D76825830

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156208
Approved by: https://github.com/zhxchen17
2025-06-19 17:36:23 +00:00
f402eed4d9 [ROCm] Enable BF16 NCHW Mixed batchnorm on MIOpen if ROCm>=6.4 (#154611)
This PR enables MIOpen for BF16 NCHW Mixed batchnorm if MIOpen version >=3.4 (ROCm >= 6.4)

CUDAHooks::versionMIOpen() was added to detect MIOpen version

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154611
Approved by: https://github.com/jeffdaily, https://github.com/jithunnair-amd
2025-06-19 17:22:37 +00:00
085f270a00 [ROCm] Enable more parallelism for multi-dimensional reductions (#155806)
Enable more parallelism for multi-dimensional reductions. In the case of multi-dimensional reductions, the grid often starts with a single active block. In such cases, we need to allow the parallelism to be extended along the y-direction of the grid to avoid having only a single block running.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155806
Approved by: https://github.com/Skylion007, https://github.com/jeffdaily
2025-06-19 17:19:40 +00:00
eaf704914e [aoti] package weights to disk and dedup (#155241)
We package the weights and save them in `data/weights/` (`WEIGHTS_DIR`). In addition, we store a `weights_config.json` in the model folder for each model to specify which weight file corresponding to which weight name.

Models can share weights. We dedup the weights based on their underlying storage (`tensor.untyped_storage()`).
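
A sketch of what dedup keyed on the underlying storage can look like (illustrative; the actual packaging code may differ):

```python
import torch

def dedup_by_storage(named_weights):
    # Map each distinct untyped storage to the first weight name that uses it,
    # so aliased / shared weights are written to disk only once.
    storage_to_name, mapping = {}, {}
    for name, tensor in named_weights.items():
        key = tensor.untyped_storage().data_ptr()
        storage_to_name.setdefault(key, name)
        mapping[name] = storage_to_name[key]
    return mapping

a = torch.randn(4)
weights = {"model1.w": a, "model2.w": a, "model1.b": torch.randn(4)}
print(dedup_by_storage(weights))
# {'model1.w': 'model1.w', 'model2.w': 'model1.w', 'model1.b': 'model1.b'}
```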

- Use `"aot_inductor.package_constants_on_disk": True` config to produce the `Weights` in aot_compile
- If we see `Weights` in aoti_files, we'll automatically package them to disk
- `"aot_inductor.package_constants_on_disk"` config and `"aot_inductor.package_constants_in_so"` config work independently.
- Use `load_pt2(package_path, load_weights_from_disk=True)` to load the weights from disk. `load_weights_from_disk` defaults to False.

Test Plan:
```
buck2 run @//mode/dev-nosan //caffe2/test/inductor:aot_inductor_package -- -r "test_package_shared_weights"
```

Tested with whisper at https://github.com/pytorch-labs/torchnative/pull/7

Rollback Plan:

Differential Revision: D74747190

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155241
Approved by: https://github.com/desertfire
2025-06-19 17:17:17 +00:00
6e185c5312 Upgrade to DLPack 1.0. (#145000)
This PR makes the necessary changes in order to upgrade PyTorch DLPack
support to version 1.0. In summary, we add support for the following:

- Support both `DLManagedTensor` and `DLManagedTensorVersioned` when
  producing and consuming DLPack capsules
- New parameter for `__dlpack__` method: `max_version`
- Version checks:
    - Fallback to old implementation if no `max_version` or if version
      lower than 1.0
    - Check that the to-be-consumed capsule is of version up to 1.X

In order to accommodate these new specifications, this PR adds the
following main changes:

- `torch._C._to_dlpack_versioned` Python API (Module.cpp): new Python
API for creating a versioned DLPack capsule (called by `__dlpack__`
method)
- `DLPackTraits<T>` class (DLConvertor.h): select the correct
traits (e.g. capsule name, conversion functions) depending on which
DLPack tensor class is being used
- `toDLPackImpl<T>` function (DLConvertor.cpp): populates the
common fields of both classes
- `fromDLPackImpl<T>` function (DLConvertor.cpp): constructs a tensor
from a DLPack capsule
- `fillVersion<T>` function (DLConvertor.cpp): populates the version
field for `DLManagedTensorVersioned` (no-op for `DLManagedTensor`)
- `tensor_fromDLPackImpl<T>` function (tensor_new.cpp): outer function
for constructing a tensor out of a DLPack capsule that also marks the
capsule as used
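
A small consumer-side usage sketch of the `max_version` negotiation (hedged: this requires a build containing this change, and keyword handling follows the DLPack/array-API protocol):

```python
import torch

x = torch.arange(6, dtype=torch.float32)

# Consumers that understand DLPack >= 1.0 can request a versioned capsule;
# producers that only know the old protocol fall back to DLManagedTensor.
capsule = x.__dlpack__(max_version=(1, 0))

# Round-trip through the public API; torch.from_dlpack negotiates the
# version internally when given an object with a __dlpack__ method.
y = torch.from_dlpack(x)
assert torch.equal(x, y)
```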

Pull Request resolved: https://github.com/pytorch/pytorch/pull/145000
Approved by: https://github.com/albanD
2025-06-19 16:27:42 +00:00
6eb6f198e1 update codebase structure documentation to include mps (#156297)
📚 The doc update

Adds a description of the mps folder to the code structure guide.

@albanD @malfet @svekars @sekyondaMeta
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156297
Approved by: https://github.com/ezyang
2025-06-19 16:16:29 +00:00
7f0cddfb55 [dynamo] Add documentation for guard_filter_fn (#156114)
Summary: Adding a section of doc for guard_filter_fn.

Test Plan:
CI

Rollback Plan:

Differential Revision: D76756743

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156114
Approved by: https://github.com/jansel
2025-06-19 16:13:12 +00:00
c9afcffed0 [AOTInductor] Call most runtime fallback ops without calling into Python (#154142)
Uses the new aoti_torch_call_dispatcher interface to call runtime fallback ops without calling back into Python.  This supports a limited subset of input and output datatypes, but a significant majority of remaining fallback ATen ops are covered.

Fixes #150988
Fixes #153478

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154142
Approved by: https://github.com/desertfire
2025-06-19 15:27:15 +00:00
317af4c87b Revert "[cuDNN][64-bit indexing] update conv depthwise 64bit indexing dispatch condition to match native kernel (#156140)"
This reverts commit a5f59cc2eab3a5201712c52fe48c268357ba4f3c.

Reverted https://github.com/pytorch/pytorch/pull/156140 on behalf of https://github.com/atalman due to breaks internal builds ([comment](https://github.com/pytorch/pytorch/pull/156140#issuecomment-2988441548))
2025-06-19 15:09:29 +00:00
ab3393e923 [ROCm][CI] fix mi300 test failure after 6.4.1 update (#156368)
Fixes failures such as https://github.com/pytorch/pytorch/actions/runs/15739699156/job/44365395854: `test/test_linalg.py::TestLinalgCUDA::test_broadcast_batched_matmul_cuda`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156368
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-06-19 15:02:40 +00:00
0b62465b99 Revert "Refine alignment check along dynamic dimension for grouped MMs (#155466)"
This reverts commit 830a335a7da5fec00395d440ba568749cb4e2e9e.

Reverted https://github.com/pytorch/pytorch/pull/155466 on behalf of https://github.com/atalman due to breaks internal builds ([comment](https://github.com/pytorch/pytorch/pull/155466#issuecomment-2988285117))
2025-06-19 14:25:38 +00:00
fec8af8b98 [bugfix] [build] guard cuda version for ipc with fabric handle (#156394)
https://github.com/pytorch/pytorch/pull/156074 adds support for IPC with fabric handles, but the code does not compile for CUDA < 12.3 (e.g., CUDA 11.8).

This PR improves the support by adding compile-time checks against the CUDA version.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156394
Approved by: https://github.com/ngimel
2025-06-19 13:54:01 +00:00
769d754ab2 [BE][MPS] Refactor core matmul logic into matmul_core (#155969)
In preparation for adding integer addmm, move the matmul computation part into a matmul_inner function.

Change the callstack from group_id, thread_id_in_group to thread_id, thread_id_in_group, which eliminates the need to calculate the index.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155969
Approved by: https://github.com/Skylion007
2025-06-19 13:22:41 +00:00
8cb0c4a4da [Intel GPU][AOTI] Add xpu mkldnn ops support for AOTInductor. (#154586)
This PR is closely related to the previous one in the stack (https://github.com/pytorch/pytorch/pull/150287). The previous PR enabled MKLDNN ops for XPU, which caused several test cases to fail in test_aot_inductor.py. This PR addresses those failing cases.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154586
Approved by: https://github.com/EikanWang, https://github.com/desertfire
ghstack dependencies: #150287
2025-06-19 13:17:22 +00:00
83259cf7a7 [Inductor][Intel GPU] Support mkldnn Conv post op fusion for XPU. (#150287)
This PR adds support for MKLDNN Conv post-op fusion in the Inductor Intel GPU backend under freezing mode.
The implementation reuses the CPU's MKLDNN pattern fusion mechanism, as well as the corresponding Inductor unit tests for CPU MKLDNN pattern fusion.

The performance improvement:

| Suite       | Inductor Speedup (Baseline) | Inductor Speedup (Compared) | Acc Failed | Perf Failed | Inductor Perf Ratio | Speedup  |
|-------------|-----------------------------|------------------------------|------------|--------------|----------------------|----------|
| Huggingface | 2.134838                    | 2.125740314                  | 0          | 0            | 1.001462504          | 100.43%  |
| Torchbench  | 1.808558                    | 1.675100479                  | 0          | 0            | 1.075722187          | 107.97%  |
| Timm        | 2.343893                    | 2.070476653                  | 0          | 0            | 1.131023832          | 113.21%  |

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150287
Approved by: https://github.com/ZhiweiYan-96, https://github.com/EikanWang, https://github.com/jansel
2025-06-19 13:17:22 +00:00
0504480f37 Add CUDA 12.9 libtorch nightly (#155895)
https://github.com/pytorch/pytorch/issues/155196

With the libtorch Docker image added, we can add the build script.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155895
Approved by: https://github.com/atalman
2025-06-19 13:15:42 +00:00
ccb1f687d6 Port two dynamo test cases for Intel GPU (#156056)
For https://github.com/pytorch/pytorch/issues/114850, we will port more cases to Intel GPU. This PR covers 2 dynamo cases. We adopted "torch.accelerator.current_accelerator()" to determine the backend, added XPU support in decorators like @requires_gpu, and enabled XPU for some test paths.
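For reference, a minimal sketch of the backend-detection idiom (the actual decorator wiring in the test suite is more involved):

```python
import torch

# Pick the device type from the current accelerator (e.g. "cuda" or "xpu"),
# falling back to "cpu" when no accelerator is present.
acc = torch.accelerator.current_accelerator()
GPU_TYPE = acc.type if acc is not None else "cpu"
HAS_GPU = GPU_TYPE != "cpu"
```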

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156056
Approved by: https://github.com/guangyey, https://github.com/jansel
2025-06-19 12:49:04 +00:00
a8fe982993 Revert "[build] Create target for flash attention (#156235)"
This reverts commit 6d02321472ee0761092166dd273eb3ec386cf0c0.

Reverted https://github.com/pytorch/pytorch/pull/156235 on behalf of https://github.com/ZainRizvi due to Weird, but seems to have broken trunk: test_jit_fuser_te.py::TestTEFuserDynamic::test_skip_grad_in_check [GH job link](https://github.com/pytorch/pytorch/actions/runs/15748768079/job/44390494621) [HUD commit link](6d02321472) ([comment](https://github.com/pytorch/pytorch/pull/156235#issuecomment-2987784207))
2025-06-19 11:47:27 +00:00
4da98351b9 [SymmMem] Add NVSHMEM PUT with Signal support to Triton (#156211)
Adds NVSHMEM PUT with Signal operation support for Triton kernels:

- Added `putmem_signal_block` core.extern wrapper for nvshmemx_putmem_signal_block
- Added kernel for 2-rank PUT operation with atomic SET signaling (`test_triton_put_signal_set`)
- Added kernel for 2-rank PUT operation with atomic ADD signaling (`test_triton_put_signal_add`)

**Tests:**
`$ TORCH_SYMMMEM=NVSHMEM python test/distributed/test_nvshmem.py`

`TORCH_SYMMMEM=NVSHMEM python test/distributed/test_nvshmem.py -k test_triton_put_signal_set`
`TORCH_SYMMMEM=NVSHMEM python test/distributed/test_nvshmem.py -k test_triton_put_signal_add`

```python
@skipIfRocm
@requires_triton()
def test_triton_put_signal_set(self) -> None:
    @triton.jit
    def put_signal_kernel(dst_ptr, src_ptr, numel: tl.constexpr, sig_ptr,
                         signal_val: tl.constexpr, sig_op: tl.constexpr, peer: tl.constexpr):
        nvshmem.putmem_signal_block(dst_ptr, src_ptr, numel, sig_ptr, signal_val, sig_op, peer)

    # ... setup code ...

    val = 11
    inp = symm_mem.empty(numel, dtype=dtype, device=self.device).fill_(val)
    out = symm_mem.empty(numel, dtype=dtype, device=self.device).fill_(-1)  # destination buffer

    # Signal flag buffer - starts at 0, will be set to 1 upon completion
    flag = symm_mem.empty(1, dtype=torch.int64, device=self.device).fill_(0)

    peer = 1 - rank
    NVSHMEM_SIGNAL_SET = 0  # atomic set operation
    SIGNAL_VAL = 1  # completion signal value

    if rank == 0:
        # Rank 0 atomically: (1) puts data to rank 1, (2) sets rank 1's flag to 1
        put_signal_kernel[(1, 1, 1)](dst_ptr, src_ptr, numel=numel, sig_ptr=sig_ptr,
                                    signal_val=SIGNAL_VAL, sig_op=NVSHMEM_SIGNAL_SET,
                                    peer=peer, extern_libs=nvshmem_lib)

    dist.barrier()
    # Rank 1 can check flag to know data transfer completed!
    print(f"[Rank {rank}] inp buffer: {inp}")
    print(f"[Rank {rank}] out buffer: {out}")
    print(f"[Rank {rank}] flag buffer: {flag}")
```

```
[Rank 0] inp buffer: tensor([11, 11, 11, 11, 11, 11, 11, 11], device='cuda:0', dtype=torch.int8)
[Rank 0] out buffer: tensor([-1, -1, -1, -1, -1, -1, -1, -1], device='cuda:0', dtype=torch.int8)
[Rank 0] got data from peer 1
[Rank 0] flag buffer: tensor([0], device='cuda:0')
[Rank 1] inp buffer: tensor([11, 11, 11, 11, 11, 11, 11, 11], device='cuda:1', dtype=torch.int8)
[Rank 1] out buffer: tensor([11, 11, 11, 11, 11, 11, 11, 11], device='cuda:1', dtype=torch.int8)
[Rank 1] got data from peer 0
[Rank 1] flag buffer: tensor([1], device='cuda:1')

----------------------------------------------------------------------
Ran 2 tests in 17.046s

OK
```

Working as expected! Data is received, and flag set to 1 for completion signal!

```python
@skipIfRocm
@requires_triton()
def test_triton_put_signal_add(self) -> None:
   @triton.jit
   def put_signal_kernel(dst_ptr, src_ptr, numel: tl.constexpr, sig_ptr,
                        signal_val: tl.constexpr, sig_op: tl.constexpr, peer: tl.constexpr):
       nvshmem.putmem_signal_block(dst_ptr, src_ptr, numel, sig_ptr, signal_val, sig_op, peer)

   # ... setup code ...

   # Signal buffer (uint64 flag)
   flag = symm_mem.empty(1, dtype=torch.int64, device=self.device).fill_(0)

   peer = 1 - rank
   NVSHMEM_SIGNAL_ADD = 5  # atomic add operation
   SIGNAL_VAL = 16  # Signal value to add

   if rank == 0:
       # Rank 0 puts into Rank 1 and adds to signal
       put_signal_kernel[(1, 1, 1)](dst_ptr, src_ptr, numel=numel, sig_ptr=sig_ptr,
                                   signal_val=SIGNAL_VAL, sig_op=NVSHMEM_SIGNAL_ADD,
                                   peer=peer, extern_libs=nvshmem_lib)

   dist.barrier()
   print(f"[Rank {rank}] inp buffer: {inp}")
   print(f"[Rank {rank}] out buffer: {out}")
   print(f"[Rank {rank}] flag buffer: {flag}")

```

```
[Rank 0] inp buffer: tensor([11, 11, 11, 11, 11, 11, 11, 11], device='cuda:0', dtype=torch.int8)
[Rank 0] out buffer: tensor([-1, -1, -1, -1, -1, -1, -1, -1], device='cuda:0', dtype=torch.int8)
[Rank 0] got data from peer 1
[Rank 0] flag buffer: tensor([0], device='cuda:0')
[Rank 1] inp buffer: tensor([11, 11, 11, 11, 11, 11, 11, 11], device='cuda:1', dtype=torch.int8)
[Rank 1] out buffer: tensor([11, 11, 11, 11, 11, 11, 11, 11], device='cuda:1', dtype=torch.int8)
[Rank 1] got data from peer 0
[Rank 1] flag buffer: tensor([16], device='cuda:1')

----------------------------------------------------------------------
Ran 1 test in 17.145s

OK
```

The flag transition from [0] → [16] confirms both data delivery and atomic signal completion in a single operation!

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156211
Approved by: https://github.com/kwen2501, https://github.com/mandroid6
2025-06-19 10:24:30 +00:00
348e2a76df s/defer_runtime_assert/guard_or_defer_runtime_assert (#156397)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156397
Approved by: https://github.com/laithsakka
2025-06-19 10:18:28 +00:00
02080c2cd9 Fix num_heads inference in ONNX Attention-23 exporter (#156367)
Fixes issue in torch-onnx exporter for Attention: https://github.com/pytorch/pytorch/issues/156105

Previously, the number-of-heads attribute inferred by the exporter was incorrect: it should be read from input dimension -3, not dimension 3:

![image](https://github.com/user-attachments/assets/26f10e15-bc98-42ac-807a-2e089a7d996a)

But in fact, [torch sdpa](https://docs.pytorch.org/docs/stable/generated/torch.nn.functional.scaled_dot_product_attention.html) doesn't support combined num_heads and head_size dimensions like [ONNX](https://onnx.ai/onnx/operators/onnx__Attention.html) does, so this num_heads attribute is not needed.

Extending support to rank>4 can be left as future work if there is use case for that. The translation logic will look like: Reshape(Q,K,V to 4d) -> Attention -> Reshape(Y to original rank).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156367
Approved by: https://github.com/justinchuby, https://github.com/titaiwangms
2025-06-19 09:40:01 +00:00
8fcda2c60d [SymmMem] Add runtime detection of NVSHMEM (#156291)
so that we can pick the default backend for SymmetricMemory without
fully relying on env var `TORCH_SYMMMEM=CUDA | NVSHMEM`

On Python side, the following API is added:
`torch.distributed._symmetric_memory.is_nvshmem_available()`
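A minimal usage sketch (the backend-selection line is illustrative, not the actual selection code):

```python
import torch.distributed._symmetric_memory as symm_mem

# Prefer NVSHMEM when it is available at runtime instead of relying
# solely on the TORCH_SYMMMEM env var.
backend = "NVSHMEM" if symm_mem.is_nvshmem_available() else "CUDA"
```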

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156291
Approved by: https://github.com/Skylion007
ghstack dependencies: #155506, #155835, #155968, #155971, #155975, #156116, #156117
2025-06-19 08:26:11 +00:00
eabf7cd3c5 [export] update docs for Dims (#156262)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156262
Approved by: https://github.com/angelayi
2025-06-19 06:25:21 +00:00
ec0276103f [PGO] fix whitelist scalar bug (#156194)
Test Plan:
test_pgo

Rollback Plan:

Differential Revision: D76830552

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156194
Approved by: https://github.com/bobrenjc93
2025-06-19 05:51:21 +00:00
1c960c5638 [Makefile] lazily setup lintrunner on first make lint run (#156058)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156058
Approved by: https://github.com/ezyang
2025-06-19 05:43:35 +00:00
242eb19c83 [InductorBench] Fix accuracy validation logic for MPS (#156385)
As it does not support full fp64, validate against float32

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156385
Approved by: https://github.com/Skylion007
2025-06-19 05:37:51 +00:00
ce8180a61d [c10d] Disable stack trace call in logging (#156362)
Summary: We noticed std::future_error: Broken promise errors in logging, so let's disable for now and will investigate more.

Test Plan:
CI

Rollback Plan:

Differential Revision: D76929722

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156362
Approved by: https://github.com/fegin
2025-06-19 05:11:57 +00:00
a21806f038 [ez][export] Better error message for schema check in torch.export.load (#156361)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/156354

torch.export.load() only supports files generated by torch.export.save()

Test Plan:
CI

Rollback Plan:

Differential Revision: D76928725

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156361
Approved by: https://github.com/zhxchen17
2025-06-19 04:50:56 +00:00
3f69e3b3a0 Add view_simple as meta function for view, and avoid calling reshape_view_helper for unbacked (#154757)
Addresses https://github.com/pytorch/pytorch/issues/153303

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154757
Approved by: https://github.com/bobrenjc93, https://github.com/leslie-fang-intel
2025-06-19 04:50:18 +00:00
3bec588bf5 [aot][ca] save bw_module in AOTAutogradCache (#151860)
Compiled Autograd retraces AOT's bw_module at backward runtime into a larger graph, and today this runs into an issue on warm-cache runs because the bw_module is not restored. This PR adds it to the cache by first stripping it of unserializable metadata. I also intentionally differentiate the cached and non-cached versions to avoid accidental attempts at AOT compilation with a restored bw_module (which would probably crash).

The bw_module's generated code is then serialized, and at compiled autograd runtime it is restored via symbolic_trace. This also means that tensors created by tensor constructors will be lifted as constants, which we will address separately.

Note that since the cache entry may be used by runs that use compiled autograd and runs that do not, we need to cache both the lowered backward and the bw_module.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151860
Approved by: https://github.com/jamesjwu
ghstack dependencies: #156120
2025-06-19 03:47:41 +00:00
6d02321472 [build] Create target for flash attention (#156235)
Create a target for flash attention so it can be built using `ninja flash_attention`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156235
Approved by: https://github.com/Skylion007, https://github.com/cyyever
2025-06-19 03:35:04 +00:00
77518d1a13 [CI] fix xpu-smi hang in XPU test container (#156171)
Apply same fix #155443 for XPU test container, refer https://github.com/pytorch/pytorch/actions/runs/15589866881/job/43907973867#step:15:911
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156171
Approved by: https://github.com/huydhn
2025-06-19 02:48:11 +00:00
19ffdf4ea0 [dcp] add new checkpoint staging to preserve storage sharing and support mutable state_dicts (#155192)
Summary:
This implements staging in a way that doesn't break checkpointing semantics. We want to stay close to torch.save/load semantics; when async checkpointing is used today, it messes up shared storages and doesn't handle custom objects or tensors well. E.g., a user passes a state_dict containing a CUDA tensor; this is deep-cloned, causing the staging tensor to be created on the GPU, which can cause OOMs that are hard to debug.

This diff hooks into deepcopy of storages to move them to CPU using the cached storages created for async checkpoint staging. This allows reusing storages created for staging, avoiding recreating them on each checkpoint, while also being flexible enough to handle any changes: clean up old storages or create new ones as needed.

The lifetime of a staging storage is tied to the original storage object: when the original storage object is gc-ed, we delete the corresponding staging storage from the cache, possibly causing it to be gc-ed if there are no other references. I am using the data_ptr of the storage to keep track of this. Please share thoughts on this.
The alternative is to use FQNs instead of storage_id and verify the underlying storage object has the same shape/size, etc., to make the caching logic work. The current implementation is much simpler and cleaner.

The API:
```
# construct a stager once per job in checkpointing.
stager = StateDictStager(pin_memory=pin_memory, share_memory=share_memory)

# do this on every checkpoint:
with staging_context(stager):
    cpu_state_dict = copy.deepcopy(state_dict)
```

Also, adds support for pinned-memory.

One problem this implementation does not address is that we lose the original device.

The only alternative here is to pickle synchronously, like torch.save, but with special handling for storages. It is valuable to keep the state_dict available throughout the checkpointing process so users can manipulate and debug it as needed, so we would need to unpickle in the background process. I think this is flexible but not performant, not very different from the current solution, and needs more code. One idea, if we really want to address this, is to stash the original device in a variable on the storage and then use it to recover the device on the load side. I think we do not need this for now and can be explicit about losing the device type for async checkpointing.

Update:
Note: due to reservations about hooking into deepcopy to customize it, the PR is now updated to use deepcopy-like logic to clone the state_dict. There are some caveats to this solution:
1. Duplicated deepcopy code to hook into for tensors. There is a risk of this code getting outdated with Python version changes. This is needed to handle several different types like NamedTuples, frozen dataclasses, and nested dataclasses. The deepcopy logic relies on __reduce_ex__ to get a function with which these can be constructed.
2. Since we are bypassing deepcopy and adding custom logic to clone a tensor, we are missing some of the functionality that exists in deepcopy for torch.Tensor, like _clear_non_serializable_cached_data() or other logic. Would like thoughts on which logic should be copied, or whether everything should be.
3. If any object implements __deepcopy__, we will not be able to handle tensors in its attrs with this logic, because it will likely just call copy.deepcopy on the attrs instead of this deepcopy logic. We handle subclasses of torch.Tensor to work around this.

The new API:
```
# construct a stager once per job in checkpointing.
stager = StateDictStager(pin_memory=pin_memory, share_memory=share_memory)

# do this on every checkpoint:
cpu_state_dict = copy.stage(state_dict)
```

Test Plan:
unit tests

Differential Revision: D75993324

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155192
Approved by: https://github.com/mikaylagawarecki, https://github.com/pradeepfn
2025-06-19 02:04:21 +00:00
d4ad280429 Enable querying the build and runtime NCCL versions (#156305)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156305
Approved by: https://github.com/wconstab, https://github.com/Skylion007, https://github.com/fegin
2025-06-19 02:00:08 +00:00
bc9bd2a766 Use linux.2xlarge runner (#156351)
The CUDA version of this job uses a linux.2xlarge, so match that to see whether this job really needs a 12xlarge system or not.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156351
Approved by: https://github.com/jeffdaily, https://github.com/cyyever
2025-06-19 01:50:56 +00:00
e5a1197191 Fix fx tracing for mark dynamic (#156346)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156346
Approved by: https://github.com/tony-ivchenko
2025-06-19 01:03:09 +00:00
6959b5febe Context on torch.cuda.memory._record_memory_history max_entries (#155889)
Context on torch.cuda.memory._record_memory_history buffer behavior

## Description

Answer questions:
- Can I keep _record_memory_history() always enabled with the default max_entries=sys.maxsize (9223372036854775807)? Will it consume a significant amount of CPU RAM?
- If I set max_entries to a lower value, e.g. 2000, will it keep the first 2000 entries and then stop recording or will it keep the most recent 2000 entries before each snapshot (fifo-style)?
- What is the expected size on disk of the snapshots? Some KBs, MBs?
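For context, the workflow the doc covers looks roughly like this (a sketch; see the added documentation for the exact retention and sizing semantics):

```python
import torch

# Bound the history ring buffer instead of using the sys.maxsize default.
torch.cuda.memory._record_memory_history(max_entries=2000)

# ... run the workload ...
x = torch.randn(1024, 1024, device="cuda")

# Dump a snapshot to disk for the memory visualizer, then stop recording.
torch.cuda.memory._dump_snapshot("memory_snapshot.pickle")
torch.cuda.memory._record_memory_history(enabled=None)
```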

Fixes #129674

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155889
Approved by: https://github.com/ngimel
2025-06-19 00:44:43 +00:00
6303cc41b7 [ROCm] support CUDA_KERNEL_ASSERT using abort() (#155262)
We won't have the full message that __assert_fail would provide, but at least we won't silently do nothing.

Fixes #155045.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155262
Approved by: https://github.com/hongxiayang, https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-06-18 23:52:35 +00:00
b8c2d4c259 add a corner test case of dynamic sizes for combo kernel (#156035)
Summary:
Added a unit test case for a corner case of combo kernel where all below are true:
1. more than 1 dimensions are dynamic size
2. no_x_dim persistent reduce op

Test Plan:
```
buck2 test mode/opt caffe2/test/inductor:combo_kernels -- test_dynamic_shapes_persistent_reduction_no_x_dim_2
```

Rollback Plan:

Differential Revision: D76699002

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156035
Approved by: https://github.com/mlazos
2025-06-18 22:57:09 +00:00
76d07e919f Unbreak //c10/util:base (#156216)
Missing dep.

Differential Revision: [D76840057](https://our.internmc.facebook.com/intern/diff/D76840057/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156216
Approved by: https://github.com/janeyx99, https://github.com/desertfire
2025-06-18 22:44:20 +00:00
9bfefda296 [DCP][PyTorch Staging APIs][2/x] Handle 0-elem case + ShardedTensor copy for staging (#156092)
Summary:
### Diff Context

1. Sometimes, a tensor might have non-zero size and 0 numel. In this case, pinning memory will fail
so we take a best guess at how to replicate the tensor below to maintain symmetry in the returned
state dict.

2. ShardedTensor copying was not handled originally in PyTorch state_dict copy APIs, handled in this diff.

Test Plan: CI

Differential Revision: D75553096

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156092
Approved by: https://github.com/pradeepfn
2025-06-18 22:41:25 +00:00
a5b4463d60 [nativert] session state (#156190)
Summary: att

Test Plan:
ci

Rollback Plan:

Differential Revision: D76827309

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156190
Approved by: https://github.com/zhxchen17
2025-06-18 22:40:44 +00:00
6918758f55 [export] Update documents for ExportGraphSiganture (#156244)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/156184

The current document for ExportGraphSignature doesn't reflect `torch.export.export()` returns non-functional graph by default. And users may get confused.

Test Plan:
Document change only. CI

Rollback Plan:

Differential Revision: D76849097

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156244
Approved by: https://github.com/yushangdi
2025-06-18 22:37:34 +00:00
1e474cc9c8 [ONNX] Fix how shapes are computed for float4 (#156353)
Changed the way we compute shapes for unpacked float4. Previously we always added a last dimension [2] to the existing shape, but this doesn't really make sense because it prevents us from being able to represent any shape other than those with a last dim of [2]. I updated the logic to be `[*shape[:-1], shape[-1]*2]`, which doubles the last dimension. This is more in line with what we see in practice when people are using 4-bit types, and it allows us to represent any shape with an even dimension at the end, which is much more reasonable in my opinion.
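Illustratively, the old and new shape rules are:

```python
def old_unpacked_shape(shape):
    # previously: always append a trailing dimension of 2
    return [*shape, 2]

def new_unpacked_shape(shape):
    # now: double the last dimension instead
    return [*shape[:-1], shape[-1] * 2]

assert old_unpacked_shape([3, 4]) == [3, 4, 2]
assert new_unpacked_shape([3, 4]) == [3, 8]  # any even last dim is representable
```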

Also clarified in https://github.com/pytorch/pytorch/pull/148791#discussion_r2155395647
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156353
Approved by: https://github.com/titaiwangms
2025-06-18 22:28:02 +00:00
9afee0fa96 [inductor] Set num_workers to number of available cpu divided by number of available gpu (#156201)
internal: https://fb.workplace.com/groups/1075192433118967/posts/1689562705015267/?comment_id=1690284241609780&notif_id=1749770611538976&notif_t=work_group_comment&ref=notif

Right now it doesn't have the divide-by-2 logic yet. Not sure how to tell if we are on a dev machine.
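The heuristic in the title amounts to roughly the following sketch (ignoring the not-yet-added divide-by-2 dev-machine case):

```python
import os
import torch

cpu_count = os.cpu_count() or 1
gpu_count = max(torch.cuda.device_count(), 1)

# Compile workers per process: available CPUs split across visible GPUs.
num_workers = max(cpu_count // gpu_count, 1)
```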

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156201
Approved by: https://github.com/masnesral
2025-06-18 22:15:32 +00:00
e5a0b73ce9 [MTIA Aten Backend] Migrate logical_and.out (#156286)
# Context

See the first PR https://github.com/pytorch/pytorch/pull/153670

# This diff

Migrate logical_and.out to in-tree

Differential Revision: [D76874551](https://our.internmc.facebook.com/intern/diff/D76874551/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156286
Approved by: https://github.com/nautsimon, https://github.com/jingsh
ghstack dependencies: #155634, #156046, #156047, #156283, #156284, #156285
2025-06-18 21:57:05 +00:00
bfccfa0b31 Revert "[Draft][CUDA] Use runtime driver API for cuStreamWriteValue32 (#156097)"
This reverts commit cf90c9f8d1632777ec5f4b6ccaa14bc5bf259e9c.

Reverted https://github.com/pytorch/pytorch/pull/156097 on behalf of https://github.com/atalman due to break internal tests ([comment](https://github.com/pytorch/pytorch/pull/156097#issuecomment-2985785811))
2025-06-18 21:48:50 +00:00
f5eb42e4c0 [nativert] move layoutplanneralgorithm to libtorch (#156205)
Summary: att

Test Plan:
ci

Rollback Plan:

Differential Revision: D76831634

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156205
Approved by: https://github.com/zhxchen17
2025-06-18 21:46:38 +00:00
d1c924c68a [MTIA Aten Backend] Migrate lt.Tensor_out / lt.Scalar_out (#156285)
# Context

See the first PR https://github.com/pytorch/pytorch/pull/153670

# This diff

Migrate lt.Tensor_out / lt.Scalar_out to in-tree.

Differential Revision: [D76873997](https://our.internmc.facebook.com/intern/diff/D76873997/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156285
Approved by: https://github.com/nautsimon
ghstack dependencies: #155634, #156046, #156047, #156283, #156284
2025-06-18 21:40:26 +00:00
5c7e1d39ab [MTIA Aten Backend] Migrate logit (#156284)
# Context

See the first PR https://github.com/pytorch/pytorch/pull/153670

# This diff

Migrate logit to in-tree.

Differential Revision: [D76871451](https://our.internmc.facebook.com/intern/diff/D76871451/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156284
Approved by: https://github.com/nautsimon
ghstack dependencies: #155634, #156046, #156047, #156283
2025-06-18 21:36:27 +00:00
706e236b08 [MTIA Aten Backend] Migrate logical_or.out / log.out / log2.out (#156283)
# Context

See the first PR https://github.com/pytorch/pytorch/pull/153670

# This diff

Migrate logical_or.out / log.out / log2.out to in-tree.

Differential Revision: [D76857072](https://our.internmc.facebook.com/intern/diff/D76857072/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156283
Approved by: https://github.com/nautsimon
ghstack dependencies: #155634, #156046, #156047
2025-06-18 21:27:58 +00:00
ab81fb846c [MTIA Aten Backend] Migrate remainder.Tensor_out / reciprocal.out / neg.out (#156047)
Migrate remainder.Tensor_out / reciprocal.out / neg.out

Differential Revision: [D76696710](https://our.internmc.facebook.com/intern/diff/D76696710/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156047
Approved by: https://github.com/nautsimon
ghstack dependencies: #155634, #156046
2025-06-18 21:17:34 +00:00
c26ce593d8 [MTIA Aten Backend] Migrate nan_to_num.out (#156046)
Migrate nan_to_num.out

Differential Revision: [D76696155](https://our.internmc.facebook.com/intern/diff/D76696155/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156046
Approved by: https://github.com/nautsimon
ghstack dependencies: #155634
2025-06-18 21:14:13 +00:00
2f1c5c4131 [MTIA Aten Backend] Achieve CPU fallback by overriding registration (#155634)
# Context

MTIA supports CPU fallback, and people can set it using env vars. By migrating aten backend to in-tree, we also need to provide this support.

# This diff

Suggested by Alban(pytorch core), instead of skipping registration, this diff achieves CPU fallback by doing additional registration and override.

The benefits of this approach:
1. The previous solution has problems handling ops that have a default dispatch key (e.g. CompositeImplicitAutograd), and can't really achieve CPU fallback.
2. The CPU fallback related logic can be aggregated in aten_mtia_cpu_fallback.cpp.

----------------

P.S. D76314740 also tried reusing the YAML parsing logic in MTIA's Python script, but realized that the env vars are only available at runtime, not at compile/codegen time.

Differential Revision: [D76376644](https://our.internmc.facebook.com/intern/diff/D76376644/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155634
Approved by: https://github.com/nautsimon, https://github.com/albanD
2025-06-18 21:10:18 +00:00
e99cc126a4 [AOTInductor] Reuse input information instead of directly applying unbacked_symint_fallback (#156133)
Summary:
When we encounter an unbacked symint during autotuning, we try to reuse existing
symbols from user-provided inputs, then fall back.

Test Plan:
python test/inductor/test_aot_inductor.py -k test_triton_dynamic_launcher_grid

Rollback Plan:

Differential Revision: D76769711

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156133
Approved by: https://github.com/jingsh
2025-06-18 20:53:21 +00:00
728cf6721e Revert "[PT2]load dense delta by trimming prefixes (#155872)"
This reverts commit c74fd35050a7241f0c439501ef735aa6cdde751f.

Reverted https://github.com/pytorch/pytorch/pull/155872 on behalf of https://github.com/malfet due to Broke lint, internal has been backed out ([comment](https://github.com/pytorch/pytorch/pull/155872#issuecomment-2985542895))
2025-06-18 20:05:56 +00:00
c74fd35050 [PT2]load dense delta by trimming prefixes (#155872)
Summary:
In PT2 with GPU with AOTI, weight names are like
```merge.submod_0._run_on_acc_0.main_module.user_embedding_arch.relevance_pmas.ig_feed.pos_emb```

but when publishing delta snapshots, lowering is skipped so weights are like
```merge.main_module.user_embedding_arch.relevance_pmas.ig_feed.pos_emb```

so when loading delta weights in original model runner, we need to:
1. Redo the tensorName -> weight idx lookup, because the weight ordering may be different.
2. Use the trimmed tensorName to find the correct weight path.

Note that with this diff, delta snapshot loading still does NOT use xl weights. This should be fine for now as we are still publishing full model with non-xl weights.
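Schematically, the prefix trimming maps lowered weight names back to un-lowered ones; an illustrative regex (not the actual implementation):

```python
import re

lowered = "merge.submod_0._run_on_acc_0.main_module.user_embedding_arch.relevance_pmas.ig_feed.pos_emb"

# Drop the "submod_*._run_on_acc_*." segments inserted by lowering so the name
# matches what a delta (non-lowered) snapshot publishes.
trimmed = re.sub(r"submod_\d+\._run_on_acc_\d+\.", "", lowered)
assert trimmed == "merge.main_module.user_embedding_arch.relevance_pmas.ig_feed.pos_emb"
```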

Test Plan:
Merge only:
```
MODEL_TYPE=mtml_ctr_instagram_model
MODULE=merge
MODEL_ENTITY_ID=900234243
SNAPSHOT_ID=7
DENSE_DELTA_SNAPSHOT_ID=13

CUDA_VISIBLE_DEVICES=2,3 buck2 run mode/dev-nosan -c fbcode.nvcc_arch=a100,h100 -c fbcode.enable_gpu_sections=true caffe2/torch/fb/model_transform/fx2trt/packaging:load_net_predictor -- --loadMode=DenseOnly --baseNetFile=/data/users/$USER/models/${MODEL_ENTITY_ID}/${SNAPSHOT_ID}/${MODEL_ENTITY_ID}_${SNAPSHOT_ID}.predictor.disagg.gpu.${MODULE}  --moduleName=${MODULE} --predictor_hardware_type 1 --submodToDevice "" --deltaNetFile /data/users/$USER/models/${MODEL_ENTITY_ID}/${SNAPSHOT_ID}/delta_${DENSE_DELTA_SNAPSHOT_ID}/${MODEL_ENTITY_ID}_${SNAPSHOT_ID}.predictor.disagg.gpu.${MODULE}
```

Local replayer:
```
MODEL_TYPE=mtml_ctr_instagram_model
MODEL_ENTITY_ID=900234243
SNAPSHOT_ID=7
DENSE_DELTA_SNAPSHOT_ID=13

USE_SERVABLE=0 HARDWARE_TYPE=0 DENSE_DELTA_IDS=${DENSE_DELTA_SNAPSHOT_ID} ENABLE_REALTIME_UPDATE=1 CUDA_VISIBLE_DEVICES=6,7 sh ./sigrid/predictor/scripts/start_gpu_with_gif.sh ${MODEL_ENTITY_ID}_${SNAPSHOT_ID} /data/users/$USER/models/${MODEL_ENTITY_ID}/${SNAPSHOT_ID} 7455

USE_SERVABLE=0 sh sigrid/predictor/scripts/start_gpu_replayer_localhost_with_gif.sh ${MODEL_ENTITY_ID}_${SNAPSHOT_ID} 10 ${MODEL_TYPE} /data/users/$USER/requests/filter_requests_mtml_ctr_instagram_model_500 localhost /data/users/$USER/models/${MODEL_ENTITY_ID}/${SNAPSHOT_ID} true 7455
```

Rollback Plan:

Differential Revision: D76520301

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155872
Approved by: https://github.com/SherlockNoMad
2025-06-18 19:13:22 +00:00
48de3da253 fix: avoid flamegraph script setup conflicts (#156310)
Fixes #156309

Instead of any kind of locking or busy-waiting, we leave room for multiple script downloads to happen; only one `rename` will succeed, and the others silently fail, removing any temporary files created during the process.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156310
Approved by: https://github.com/malfet

Co-authored-by: Alexander Zhipa <azzhipa@amazon.com>
2025-06-18 19:06:22 +00:00
cbafba5794 Allow forcing FSDP2 to always use SUM reductions (#155915)
NCCL zero-copy support only works for SUM reductions. FSDP2, by default, was preferring AVG reductions or, when using `set_reduce_scatter_divide_factor`, PreMulSum reductions.

Moreover, PreMulSum reductions had a few bugs, such as #155903 and #155904.

This PR adds a flag to always use SUM reductions, potentially requiring separate pre-/post-scaling kernels, and reworks the `set_reduce_scatter_divide_factor` logic to make it safer (and renaming it to avoid confusion).

Differential Revision: [D76895058](https://our.internmc.facebook.com/intern/diff/D76895058)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155915
Approved by: https://github.com/xunnanxu
2025-06-18 18:57:47 +00:00
9944cd0949 Convert to markdown: quantization-accuracy-debugging.rst, quantization-backend-configuration.rst, quantization-support.rst, random.rst (#155520)
Related to #155032

-   quantization-accuracy-debugging.rst: [Preview](https://docs-preview.pytorch.org/pytorch/pytorch/155520/quantization-accuracy-debugging.html) vs [main](https://docs.pytorch.org/docs/main/quantization-accuracy-debugging.html)
-  quantization-backend-configuration.rst: [Preview](https://docs-preview.pytorch.org/pytorch/pytorch/155520/quantization-backend-configuration.html) vs [main](https://docs.pytorch.org/docs/main/quantization-backend-configuration.html)
-  quantization-support.rst: [Preview](https://docs-preview.pytorch.org/pytorch/pytorch/155520/quantization-support.html) vs [main](https://docs.pytorch.org/docs/main/quantization-support.html)
-  random.rst: [Preview](https://docs-preview.pytorch.org/pytorch/pytorch/155520/random.html) vs [main](https://docs.pytorch.org/docs/main/random.html)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155520
Approved by: https://github.com/svekars

Co-authored-by: Svetlana Karslioglu <svekars@meta.com>
2025-06-18 18:46:04 +00:00
30d3cf62fb support CUBLASLT_MATMUL_MATRIX_SCALE_OUTER_VEC_32F (#154680)
Requires CUDA >= 12.9 and sm_90.

hipBLASLt has a similar enum but is not available until ROCm 7.0. Support the new enum early using a cmake test.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154680
Approved by: https://github.com/malfet, https://github.com/atalman
2025-06-18 18:39:01 +00:00
aee2bfc5ba [Intel GPU] Update xpu triton commit pin for PyTorch release 2.8. (#154194)
As title.
Thanks @anmyachev  for the work on compatibility adaptation.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154194
Approved by: https://github.com/jansel
2025-06-18 18:17:07 +00:00
2620361d19 Add batching rule for torch.matrix_exp (#155202)
## Summary

Adds the missing batching rule for `torch.matrix_exp` to enable efficient `vmap` support.
Previously, using `vmap` with `matrix_exp` would trigger a performance warning and fall back to a slow loop-based implementation, even though `matrix_exp` natively supports batched inputs.

Fixes #115992

## Details

`torch.matrix_exp` is an alias for `torch.linalg.matrix_exp`. This PR adds vmap support by registering `matrix_exp` with `OP_DECOMPOSE`, which reuses the existing CompositeImplicitAutograd decomposition to automatically generate batching behavior from the operation's simpler component operations.

## Testing

The existing test suite for vmap and matrix_exp should cover this change. The fix enables:
- No performance warning when using `vmap(torch.matrix_exp)`
- Efficient native batched execution instead of loop-based fallback
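A small sketch of what the fix enables (no fallback warning should be emitted here):

```python
import torch

batch = torch.randn(8, 4, 4)

# Batched matrix exponential via vmap; with the decomposition registered,
# this no longer falls back to a per-sample loop.
out = torch.vmap(torch.matrix_exp)(batch)

# matrix_exp also accepts batched input natively, so the results should agree.
torch.testing.assert_close(out, torch.matrix_exp(batch))
```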

**Edit:** Updated Details section to accurately reflect the implementation approach (decomposition rather than batch rule registration)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155202
Approved by: https://github.com/zou3519
2025-06-18 17:35:35 +00:00
a5f59cc2ea [cuDNN][64-bit indexing] update conv depthwise 64bit indexing dispatch condition to match native kernel (#156140)
The native kernel doesn't support batch splitting so the previous check wasn't aggressive enough in dispatching to cuDNN

https://github.com/pytorch/pytorch/issues/155225

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156140
Approved by: https://github.com/ngimel
2025-06-18 17:32:36 +00:00
94f8679019 Revert "[PT2][partitioners] raise getitems in partitioners to allow earlier release of buffers (#155809)"
This reverts commit 6d3a4356f61b28a14abd95f641e2615deb186365.

Reverted https://github.com/pytorch/pytorch/pull/155809 on behalf of https://github.com/laithsakka due to pr_time_benchmarks ([comment](https://github.com/pytorch/pytorch/pull/155809#issuecomment-2985022572))
2025-06-18 16:52:19 +00:00
36f7a027b5 [MPS] Implement upsample_trilinear as Metal shader (#156263)
But only forward for now
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156263
Approved by: https://github.com/dcci
ghstack dependencies: #156256, #156090
2025-06-18 16:10:02 +00:00
bf06190e21 Integrated AMD AWS runners into Pytorch CI (#153704)
Integrated AMD AWS runners into PyTorch CI, including the linux.24xl.amd for performance tests, the linux.8xl.amd with AVX512 support for unit and periodic tests, and the linux.12xl.amd with AVX2 support for unit and periodic tests.

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153704
Approved by: https://github.com/malfet, https://github.com/jithunnair-amd

Co-authored-by: kiriti-pendyala <kiriti.pendyala@amd.com>
2025-06-18 15:58:22 +00:00
ce3406817d Revert "[dynamo] control one_graph behavior additionally through config (#154283)"
This reverts commit fe37db4f1270745d6c523623143332ddf263af55.

Reverted https://github.com/pytorch/pytorch/pull/154283 on behalf of https://github.com/atalman due to inductor/test_flex_decoding.py::TestFlexDecodingCUDA::test_do_not_trigger_dynamic_shapes_on_empty_block_mask_cuda GH job link HUD commit link ([comment](https://github.com/pytorch/pytorch/pull/154283#issuecomment-2984795214))
2025-06-18 15:53:32 +00:00
c5d3e7a4ff Revert "[dynamo] add set_fullgraph decorator/context manager (#154289)"
This reverts commit 920f6e681ec70b664ed952255b8c1f97962f5de0.

Reverted https://github.com/pytorch/pytorch/pull/154289 on behalf of https://github.com/atalman due to inductor/test_flex_decoding.py::TestFlexDecodingCUDA::test_do_not_trigger_dynamic_shapes_on_empty_block_mask_cuda GH job link HUD commit link ([comment](https://github.com/pytorch/pytorch/pull/154289#issuecomment-2984774814))
2025-06-18 15:51:06 +00:00
408d9884b0 Revert "[dynamo] fix set_fullgraph for nested calls (#154782)"
This reverts commit 3c8c48f79344356c58e91b9c8588f85ff806e1c8.

Reverted https://github.com/pytorch/pytorch/pull/154782 on behalf of https://github.com/atalman due to inductor/test_flex_decoding.py::TestFlexDecodingCUDA::test_do_not_trigger_dynamic_shapes_on_empty_block_mask_cuda GH job link HUD commit link ([comment](https://github.com/pytorch/pytorch/pull/154782#issuecomment-2984764330))
2025-06-18 15:47:21 +00:00
6201981f48 Revert "[dynamo] handle fullgraph toggle using nested torch.compile (#155166)"
This reverts commit 614a41514545cbdd15757ef2586d433d7d34041c.

Reverted https://github.com/pytorch/pytorch/pull/155166 on behalf of https://github.com/atalman due to inductor/test_flex_decoding.py::TestFlexDecodingCUDA::test_do_not_trigger_dynamic_shapes_on_empty_block_mask_cuda [GH job link](https://github.com/pytorch/pytorch/actions/runs/15726606697/job/44333233942) [HUD commit link](a6a3a44144) ([comment](https://github.com/pytorch/pytorch/pull/155166#issuecomment-2984751600))
2025-06-18 15:43:22 +00:00
d290fe7690 Remove legacy export testing path (#156093)
Summary: After this diff stack lands, we are pretty much done with the training IR migration, so there is no need to run the extensive legacy export tests.

Test Plan:
CI

Rollback Plan:

Differential Revision: D76734378

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156093
Approved by: https://github.com/desertfire
2025-06-18 15:36:44 +00:00
7531bd6491 [ROCm] upgrade to 6.4.1 patch release (#156112)
Fixes #155292.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156112
Approved by: https://github.com/jeffdaily

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2025-06-18 15:21:44 +00:00
830a335a7d Refine alignment check along dynamic dimension for grouped MMs (#155466)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155466
Approved by: https://github.com/ngimel
2025-06-18 15:15:05 +00:00
6d3a4356f6 [PT2][partitioners] raise getitems in partitioners to allow earlier release of buffers (#155809)
**Problem & Solution:**
Assume we have something like:
```
x = some_op(...)
x0 = x[0]
do_something_with_and_is_last_use_of(x0)
do_a_bunch_of_other_things()
x1 = x[1]
```
In this case, the memory associated with `x0` cannot be released until `x1 = x[1]`. Since `x1 = x[1]` does not use additional memory, it is beneficial to move `x1 = x[1]` and all such `getitem` operations to immediately after `x = some_op(...)`, such as
```
x = some_op(...)
x0 = x[0]
x1 = x[1]
do_something_with_and_is_last_use_of(x0)
do_a_bunch_of_other_things()
```

**Results:**
For instance, for the `res2net101_26w_4s` model in pytorch benchmark, when running with `aot_eager` backend and with `activation_memory_budget=0.4`, the peak memory are
* baseline: 7.73GiB
* with the chage: 6.45GiB

As a sanity check, for the same setting with `inductor` backend, the peak memory is not regressed.

cc and credit to @ShatianWang for noticing this issue.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155809
Approved by: https://github.com/fmassa, https://github.com/bdhirsh
ghstack dependencies: #155943
2025-06-18 14:38:55 +00:00
c177abd217 Disable pinning check when loading sparse tensors (#154638)
Disables pinning check as unnecessary and to fix https://github.com/pytorch/pytorch/issues/153143 when loading sparse tensor from external storage with sparse tensor invariants check enabled.

Fixes https://github.com/pytorch/pytorch/issues/153143 .

For FC, to be landed two weeks after https://github.com/pytorch/pytorch/pull/154617, see https://github.com/pytorch/pytorch/pull/154617#issuecomment-2919643612.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154638
Approved by: https://github.com/amjames, https://github.com/ngimel
2025-06-18 14:33:36 +00:00
8f02161d10 Revert "[dynamo] raise hard error if error is encountered while tracing resume function prologue (#154564)"
This reverts commit a6a3a441442a96f38d0771c985f753223cea2ba0.

Reverted https://github.com/pytorch/pytorch/pull/154564 on behalf of https://github.com/atalman due to inductor/test_flex_decoding.py::TestFlexDecodingCUDA::test_do_not_trigger_dynamic_shapes_on_empty_block_mask_cuda [GH job link](https://github.com/pytorch/pytorch/actions/runs/15726606697/job/44333233942) [HUD commit link](a6a3a44144) ([comment](https://github.com/pytorch/pytorch/pull/154564#issuecomment-2984409088))
2025-06-18 14:19:39 +00:00
b30e04b3c8 Make the NCCL PG Options and Config copyable and safe to init standalone (#155700)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155700
Approved by: https://github.com/kwen2501
2025-06-18 13:36:27 +00:00
1bb9b1858b [CPU][Inductor] Improve A16W4 GEMM template performance by using block_n=32 (#156174)
**Summary**
We found that using `block_n=32` brings better performance for A16W4 than `block_n=64` because cache locality is better and parallelism is better if N is small and more cores are used.
For example, when running Llama-3.1-8B with A16W4 and batch size = 16 on 43 cores, `block_n=32` is faster by >10% E2E for both first and next token.

**Test plan**
```
pytest test/inductor/test_cpu_select_algorithm.py -k test_int4_woq_mm_amx
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156174
Approved by: https://github.com/leslie-fang-intel
2025-06-18 13:17:46 +00:00
d99cac2816 [Kineto][submodule] Update kineto pin for XPU toggle feature (#155488)
Part of #154898
Update kineto submodule

Summary: We add the toggleCollectionDynamic functionality to XPUPTI in Kineto, so the profiler can be enabled/disabled dynamically.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155488
Approved by: https://github.com/guangyey, https://github.com/sraikund16
2025-06-18 12:39:58 +00:00
c11888e7a6 Skip more tests on s390x (#155210)
Make CI for s390x green before fixing and restoring tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155210
Approved by: https://github.com/seemethere
2025-06-18 12:07:17 +00:00
402ae09e41 [BE] fix typos in c10/ (#156078)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156078
Approved by: https://github.com/malfet, https://github.com/cyyever
2025-06-18 10:24:44 +00:00
f45f483884 [user triton] AOT Inductor support for new host-side TMA api (#155879)
This adds support for the host-side TMA api (TensorDescriptor.from_tensor) for AOTI. Note: this should support all the same features as the old (experimental) TMA api, but not some new features of the new TMA, like mxfp4 support.

Note: one complexity with the new TMA api is that a single TMA descriptor passed to the python kernel turns into 1 + 2 * N args in the cubin function signature, for a rank-N tensor.

What this PR contains:
1) device_op_overrides.py: add a rough copy of fillTMADescriptor from https://github.com/triton-lang/triton/blob/main/third_party/nvidia/backend/driver.c#L283. However, the fillTMADescriptor implementation in Triton is significantly modified, so that much of the computation (about swizzling and data types) is done before the time of the TMA construction. For simplicity, I've moved the computation into the cuda helper kernel (as was the previous strategy with fill2DTMADescriptor); but long term we might want to unify our implementation with the upstream implementation
2) device_op_overrides.py: introduces a struct "StableTMADescriptor" which stores some of the 1 + 2 * N args for the cubin signature (along with the global shape, which is not strictly needed, but this cleans up the call to the triton kernel).
3) plumbing through cpp_wrapper_gpu.py. The main thing to note is: the code generated by cpp_wrapper_gpu.py generally refers to the StableTMADescriptor object when it passes around a "tma descriptor" variable. At the very end (in generate_args_decl), the StableTMADescriptor is unwrapped and the individual arguments are passed into the cubin.

Tests: test_aot_inductor.py's test_triton_kernel_tma_descriptor_{N}d_dynamic_{D}_tma_version_{V}_cuda: for N in {1, 2}  and D in {True, False}, and V = {new, old}, this test passes (or is skipped, if the appropriate TMA API is not available). Tested on H100 for Triton 3.3 and Triton 3.4.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155879
Approved by: https://github.com/desertfire
2025-06-18 09:35:11 +00:00
577baa4116 [c10d] Add a logger for all nccl collectives with its time duration when completed (#156008)
Summary: We want to build a logging table for tracking the collective time spent on GPU for all internal workloads. Since we have a cudaEventQuery for both the start and end of a collective (We rolled out ECudaEventStart (enableTiming) fully already), we plan to add this logging table inside the watchdog of PyTorch ProcessGroupNCCL so that we get to know the duration of collectives.

Test Plan:
CI + dry run.

Rollback Plan:

Differential Revision: D76552340

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156008
Approved by: https://github.com/fegin, https://github.com/eqy
2025-06-18 09:08:42 +00:00
c5a4fe9c17 [CI] fix the ci image name for public copy in ghcr (#156169)
After PR #152209 landed, the name of the CI image public copy in ghcr is not correct. For example, https://github.com/pytorch/pytorch/actions/runs/15698468716/job/44228133522#step:10:8.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156169
Approved by: https://github.com/malfet
2025-06-18 08:16:56 +00:00
a6a3a44144 [dynamo] raise hard error if error is encountered while tracing resume function prologue (#154564)
This should prevent bad resume function prologues from slipping by. In particular, graph breaks in resume function prologues will now hard error.

Implementation details:
- The resume function prologue is surrounded by `LOAD_CONST arg, STORE_FAST __is_tracing_resume_prologue` instructions. The first sequence has `arg=True` and the second sequence has `arg=False`.
- InstructionTranslator will know when it is tracing a resume function prologue when it detects `STORE_FAST __is_tracing_resume_prologue`. The top of stack will be True to mark the start of the prologue, False to mark the end.
- When `convert_frame.py` detects that an error occurred while the InstructionTranslator was tracing a resume function prologue, we will wrap the exception and hard error

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154564
Approved by: https://github.com/jansel
ghstack dependencies: #154283, #154289, #154782, #155166
2025-06-18 07:27:20 +00:00
614a415145 [dynamo] handle fullgraph toggle using nested torch.compile (#155166)
See added test for the case that this PR handles. In particular, the semantics for nested torch.compile with toggled fullgraph settings was strange before - `@torch.compile(fullgraph=True)` overrides the existing fullgraph setting, while `@torch.compile(fullgraph=False)` does not.
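For concreteness, the kind of nesting in question looks like this (a sketch only; the exact graph-break semantics are defined by this PR and its added test):

```python
import torch

@torch.compile(fullgraph=True)
def inner(x):
    # inner requests fullgraph=True while being inlined into a
    # fullgraph=False region; this PR defines how that toggle is honored.
    return x.sin()

@torch.compile(fullgraph=False)
def outer(x):
    return inner(x) + 1

outer(torch.randn(4))
```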

Note that this change will add an extra frame to any inlined torch.compile'd function (which I don't expect to happen frequently).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155166
Approved by: https://github.com/jansel
ghstack dependencies: #154283, #154289, #154782
2025-06-18 07:27:20 +00:00
3c8c48f793 [dynamo] fix set_fullgraph for nested calls (#154782)
- Make the fullgraph argument of set_fullgraph a positional argument
- Fix behavior on nested calls by updating `tracer.error_on_graph_break` in more places. In particular, a tracer's error_on_graph_break is set to the inlined tracer's error_on_graph_break upon the latter's exit. We also track error_on_graph_break in the speculation log now, since if we encounter a nested graph break, we will restart analysis and we need to somehow remember the error_on_graph_break setting after attempting to run the nested function (but we don't actually trace into it in the restart analysis).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154782
Approved by: https://github.com/jansel
ghstack dependencies: #154283, #154289
2025-06-18 07:27:09 +00:00
920f6e681e [dynamo] add set_fullgraph decorator/context manager (#154289)
Implements https://github.com/pytorch/pytorch/issues/144908.

Implementation notes:
- `set_fullgraph` is implemented using `patch_config`, which changes config correctly during runtime and tracing.
- Moved setting `config.error_on_graph_break` from convert_frame.py to eval_frame.py. This is because this should only be done at the top-level decorated function. If we kept this in convert_frame.py, we would be changing `config.error_on_graph_break` on every top-level frame, which causes confusing behavior (see added test for example).
- InstructionTranslator reads from `config.error_on_graph_break` every `step()`. This is to determine the value of `config.error_on_graph_break` at the time of the graph break, because tracer cleanup will restore the value of `config.error_on_graph_break` .
- `convert_frame.py` determines whether we should abort tracing (fullgraph=True) or continue (fullgraph=False) by reading the value of the tracer's `error_on_graph_break`. If there is no tracer (failed to initialize), then default to reading `config.error_on_graph_break`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154289
Approved by: https://github.com/jansel, https://github.com/zou3519
ghstack dependencies: #154283
2025-06-18 07:27:00 +00:00
fe37db4f12 [dynamo] control one_graph behavior additionally through config (#154283)
`torch.compile` now always goes through `torch._dynamo._optimize`. fullgraph is now implemented in `torch.compile` by looking at `config.error_on_graph_break`. Export still goes through `torch._dynamo._optimize_assert`, which uses `tx.one_graph` instead of `config.error_on_graph_break`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154283
Approved by: https://github.com/jansel, https://github.com/anijain2305
2025-06-18 07:26:52 +00:00
ccc6279b40 flex attention: fix dispatch order for tensor subclasses, avoid hardcoding call to faketensor impl in dynamo (#151719)
This is enough to get @XilunWu 's stack in a state where his flex_attention DTensor implementations worked E2E for me. It also required these changes on the DTensor side, to properly add a DTensor rule for flex backward: P1789852198

There are two problems:

(1) in the normal dispatcher, we have a precedence ordering between modes and subclasses. Modes are dispatched to first, but modes are allowed to return NotImplemented, giving subclasses a chance to run.

This normally happens automatically in `FakeTensorMode.__torch_dispatch__` and `FunctionalTensorMode.__torch_dispatch__`. However, since HOPs implement these two modes themselves, HOPs do not get this benefit. For now, I ended up hardcoding this `NotImplemented` logic directly into the functional/fake rules for flex attention.

Having to do this for every HOP seems a bit painful. If we could plumb every HOP through `Fake[|Functional]TensorMode.__torch_dispatch__` then we would get this support. Another option could be to just assume that most HOP <> mode implementations want the same treatment by default, and hardcode this `NotImplemented` logic into `torch/_ops.py`. I'm not sure if we'd need a way for the HOP to opt out of this though.

(2) We were hardcoding a call to flex attention's fake implementation in dynamo to run fake prop. This is technically wrong for subclasses, because it doesn't give subclasses the chance to interpose on the op and desugar it before fake prop runs. I tweaked dynamo's logic to call the op, and let the dispatcher handle invoking the fake implementation.

**Testing** Xilun is adding some DTensor tests in his PR that will end up testing this logic. If folks would prefer, though, I can try to add a test that uses another subclass instead that is maybe more basic.

This is the tlparse that his DTensor test generated for me: https://manifold.edge.x2p.facebook.net/v0/read/tree/logs/hirsheybar/0196c1d3-a9a2-46ea-a46d-aa21618aa060/custom/rank_0/index.html?bucketName=tlparse_reports&apiKey=tlparse_reports-key&withPayload=1&timeoutMsec=10000

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151719
Approved by: https://github.com/ydwu4

Co-authored-by: drisspg <drisspguessous@gmail.com>
2025-06-18 07:02:04 +00:00
bdb1553b77 [inductor][cutlass] binary remote cache (#156248)
Summary:
# Why

speed up cutlass kernel generation and retrieval

# What

using the _ManifoldCache, make a KernelBinaryCache that uploads/downloads kernels and their error files. only register the handler internally

this is the OSS only part of the change, to facilitate integration

Test Plan:
## prove that we can upload successfully
```
buck2 run @mode/opt scripts/coconutruben/torchmm:experiment 2>&1
```

```
manifold ls coconutruben-test-01/tree/cutlass_concept_2
      673184 cfkykew2fw5572hjr4e7jbog7oix7xjkegtn2ovikyhxe6pr4tcw.so
      649776 cpjqda67c6ojj75z3ddnmfbxinpm7yp7rc2q2oxwsrtwsnacklqv.so
```

## prove that we can download successfully
```
buck2 run @mode/opt scripts/coconutruben/torchmm:experiment 2>&1
```

```
I0611 12:48:38.759000 935012 /data/users/coconutruben/fbsource/fbcode/caffe2/torch/_inductor/fb/kernel_binary_remote_cache.py:65] Successfully downloaded /var/tmp/torchinductor_coconutruben/fk/cfkykew2fw5572hjr4e7jbog7oix7xjkegtn2ovikyhxe6pr4tcw.so
I0611 12:48:38.760000 935012 /data/users/coconutruben/fbsource/fbcode/caffe2/torch/_inductor/fb/kernel_binary_remote_cache.py:65] Successfully downloaded /var/tmp/torchinductor_coconutruben/pj/cpjqda67c6ojj75z3ddnmfbxinpm7yp7rc2q2oxwsrtwsnacklqv.so
```

## prove that we can upload errors successfully
```
buck2 run @mode/opt scripts/coconutruben/torchmm:experiment 2>&1
```

```
manifold ls coconutruben-test-01/tree/cutlass_concept_2
        4846 cqiq4vjbvytdofutoxisa3pqjplgpgmt2sh7dtatiw4bqt5rtjgc.so.error
        4846 cqymdwsfsirhkqglv7sbjyvqkrt3ryql4mtb45tekt76347ee6sx.so.error
```

## prove that we can download errors successfully

```
buck2 run @mode/opt scripts/coconutruben/torchmm:experiment 2>&1
```

```
I0611 12:56:14.078000 1001022 /data/users/coconutruben/fbsource/fbcode/caffe2/torch/_inductor/fb/kernel_binary_remote_cache.py:74] Successfully downloaded /var/tmp/torchinductor_coconutruben/qi/cqiq4vjbvytdofutoxisa3pqjplgpgmt2sh7dtatiw4bqt5rtjgc.so.error
I0611 12:56:14.079000 1001022 /data/users/coconutruben/fbsource/fbcode/caffe2/torch/_inductor/fb/kernel_binary_remote_cache.py:74] Successfully downloaded /var/tmp/torchinductor_coconutruben/qy/cqymdwsfsirhkqglv7sbjyvqkrt3ryql4mtb45tekt76347ee6sx.so.error
```

## showing timing information

```
I0616 11:22:29.169000 2249769 /data/users/coconutruben/fbsource/fbcode/caffe2/torch/_inductor/fb/kernel_binary_remote_cache.py:71] Successfully downloaded /var/tmp/torchinductor_coconutruben/fk/cfkykew2fw5572hjr4e7jbog7oix7xjkegtn2ovikyhxe6pr4tcw.so (download: 0.842s, write: 0.000s, total: 0.842s)
I0616 11:22:29.169000 2249769 /data/users/coconutruben/fbsource/fbcode/caffe2/torch/_inductor/fb/kernel_binary_remote_cache.py:71] Successfully downloaded /var/tmp/torchinductor_coconutruben/pj/cpjqda67c6ojj75z3ddnmfbxinpm7yp7rc2q2oxwsrtwsnacklqv.so (download: 0.838s, write: 0.001s, total: 0.838s)
```

Reviewed By: henrylhtsang

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156248
Approved by: https://github.com/henrylhtsang
2025-06-18 06:51:22 +00:00
96df866410 [audio hash update] update the pinned audio hash (#156259)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned audio hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156259
Approved by: https://github.com/pytorchbot
2025-06-18 06:02:46 +00:00
a5df6ffbc2 Improve IPC for Expandable Segments to use fabric handle when possible (#156074)
Improving upon https://github.com/pytorch/pytorch/pull/130890 and inspired by https://github.com/pytorch/pytorch/pull/130890#issuecomment-2278882984, we can automatically use the fabric handle for IPC when possible.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156074
Approved by: https://github.com/ngimel, https://github.com/malfet
2025-06-18 05:22:06 +00:00
29867b211a [cutlass backend] Add __init__.py to cutlass_lib_extensions (#156234)
When using docker with cutlass backend, we can get
```
No module named 'torch._inductor.codegen.cuda.cutlass_lib_extensions'
```
First reported by @nWEIdia in https://github.com/pytorch/pytorch/issues/155888

Evidence that this fixes: https://github.com/pytorch/pytorch/pull/156136

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156234
Approved by: https://github.com/mlazos, https://github.com/Skylion007
2025-06-18 05:03:43 +00:00
c28e74e457 [MPS] Add nearest_3d forward and backward (#156090)
Introduce generalizable `UpsampleParams` structure in `UpSample.h`, which could be shared between CPU and MPS
Delete `upsample_nearest3d` MPS fallback and replace it with proper shader
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156090
Approved by: https://github.com/kulinseth, https://github.com/dcci
ghstack dependencies: #156256
2025-06-18 04:48:15 +00:00
a82c171bb2 remove skipifrocm from composability tests (#156036)
I am porting the DTensor training codebase over to ROCm at the moment, and while reading through the 2D unit tests I noticed that a couple of them already work on ROCm even though they are being skipped. The pipeline parallel tests pass too.

tested locally
<img width="561" alt="image" src="https://github.com/user-attachments/assets/7c40c0f2-2de8-4cf1-8e36-0ba2bba46baa" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156036
Approved by: https://github.com/jeffdaily
2025-06-18 04:24:42 +00:00
9ed0060225 Provide access to the cudaGraph_t underlying a CUDAGraph. (#155164)
There are a few considerations here:

1. A user might want to modify the cudaGraph_t either during the stream capture or after the stream capture (but before instantiation). This draft implements modification after stream capture only, though support could be added for modification during stream capture by applying
https://github.com/pytorch/pytorch/pull/140979/files#diff-d7302d133bb5e0890fc94de9aeea4d9d442555a3b40772c9db10edb5cf36a35cR391-R404

2. Previously, the cudaGraph_t would be destroyed before the end of capture_end() unless the user had previously called enable_debug_mode(). There is no way to implement this correctly without removing this restriction, or forcing the user to always call enable_debug_mode(). However, enable_debug_mode() is a confusing API (despite being an instance method, it would modify a static global variable; thus, putting one CUDAGraph object into debug mode puts all of them into debug mode, which is not acceptable in my opinion). Therefore, I made enable_debug_mode() into a no-op. This means that the CPU memory usage will increase after this change. I think this is likely to be fine.

3. No python bindings yet. These should be easy to add. It is probably worthwhile to take some time to make sure that the returned cudaGraph_t can be converted into the cuda-python cudaGraph_t in a reasonable, hopefully type-safe, manner (but without making cuda-python a dependency of pytorch), since I imagine most users will use the pip cuda-python package to make modifications.

4. There are two foot guns:

   a. The cudaGraph_t returned by raw_cuda_graph() is not owned by the user, so it will be destroyed once the owning CUDAGraph is destroyed (or calls reset()).

   b. The following sequence won't work as intended:

```
g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):
    foo()
g.replay()
raw_graph = g.raw_cuda_graph()
modify(raw_graph)
g.replay()
```

This won't work because the user must call instantiate() again after modifying cudaGraph_t. You could add a "safety" mechanism by traversing the cudaGraph_t to create a hash and seeing if the hash changes between calls to replay(), but this is likely way too expensive.

I think these two foot guns are probably okay given that this a bit of an experts' API.

Fixes #155106

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155164
Approved by: https://github.com/ngimel
2025-06-18 03:39:28 +00:00
17b38b850e [ca] Allow using compiled autograd context managers during backward runtime (#156120)
Added an invariant that nested compiled autograd context managers must exit before their parent context manager. This allows us to defer the thread check.

FIXES https://github.com/pytorch/pytorch/issues/152219

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156120
Approved by: https://github.com/jansel
ghstack dependencies: #155521, #155480
2025-06-18 03:01:15 +00:00
10d41c7d20 Add SDPA patterns for T5 models (#155455)
* Add SDPA patterns for T5 models.
* Remove the stride check of the mask, and make the mask contiguous in flash attention when the stride of its last dim is neither 1 nor 0. This allows more SDPAs with complex masks to be accelerated using flash attention, e.g. for the T5 model, where the generated masks may not be contiguous (see the sketch below).
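
A minimal sketch of the mask handling described in the second bullet (illustrative helper, not the actual kernel-side code):

```python
import torch

def prepare_attn_mask(mask: torch.Tensor) -> torch.Tensor:
    # Only force a copy when the last-dim stride is neither 1 (already contiguous)
    # nor 0 (broadcasted), so e.g. T5-style masks can still hit flash attention.
    if mask.stride(-1) not in (0, 1):
        mask = mask.contiguous()
    return mask
```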

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155455
Approved by: https://github.com/Valentine233, https://github.com/leslie-fang-intel, https://github.com/jansel
2025-06-18 02:09:55 +00:00
4851863e3f fix hack to check if register_buffer has been overridden (#155963)
Followup on https://github.com/pytorch/pytorch/pull/125971

`self.register_buffer` will always be a bound method on the instance (`self`) while `torch.nn.Module.register_buffer` is an unbound class method. `is`-ing these two things will never yield `True`. Instead, let's check the [original function object](https://docs.python.org/3/reference/datamodel.html#method.__func__). Note that the current logic doesn't break anything because the `else` branch will still do the "right thing" in the case where `register_buffer` hasn't been overridden, but it does mean we do less work!

Example demonstration:

```python
class Base:
    def register_buffer(self, buffer):
        pass

class InheritedOk(Base):
    pass

class InheritedOverride(Base):
    def register_buffer(self, buffer):
        pass

b = Base()
ok = InheritedOk()
override = InheritedOverride()

print(f"b.register_buffer is Base.register_buffer: {b.register_buffer is Base.register_buffer}") # False
print(f"ok.register_buffer is Base.register_buffer: {ok.register_buffer is Base.register_buffer}") # False
print(f"override.register_buffer is Base.register_buffer: {override.register_buffer is Base.register_buffer}") # False

print(f"b.register_buffer.__func__ is Base.register_buffer: {b.register_buffer.__func__ is Base.register_buffer}") # True
print(f"ok.register_buffer.__func__ is Base.register_buffer: {ok.register_buffer.__func__ is Base.register_buffer}") # True
print(f"override.register_buffer.__func__ is Base.register_buffer: {override.register_buffer.__func__ is Base.register_buffer}") # False
```

(I can make an associated issue if needed, but didn't see it required [in the contributing guidelines](https://github.com/pytorch/pytorch/blob/main/CONTRIBUTING.md#merging-your-change))

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155963
Approved by: https://github.com/mikaylagawarecki
2025-06-18 01:50:30 +00:00
202d2ae53a Convert rst to md: rpc.rst, signal.rst, size.rst, special.rst (#155430)
Fixes #155033

- [x] [rpc.rst](https://github.com/pytorch/pytorch/tree/main/docs/source/rpc.rst)
- [x] [signal.rst](https://github.com/pytorch/pytorch/tree/main/docs/source/signal.rst)
- [x] [size.rst](https://github.com/pytorch/pytorch/tree/main/docs/source/size.rst)
- [sparse.rst](https://github.com/pytorch/pytorch/tree/main/docs/source/sparse.rst) fixed in #155438 due to large size.
- [x] [special.rst](https://github.com/pytorch/pytorch/tree/main/docs/source/special.rst)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155430
Approved by: https://github.com/svekars

Co-authored-by: Svetlana Karslioglu <svekars@meta.com>
2025-06-18 01:27:04 +00:00
68996dc183 [BE][2/X] Phase out usage of use_max_autotune() (#155848)
See #155847 for context

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155848
Approved by: https://github.com/masnesral
2025-06-18 01:18:09 +00:00
e8bfce9a43 Document how to use stack-based APIs with StableIValue (#155984)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155984
Approved by: https://github.com/albanD, https://github.com/zou3519
2025-06-18 01:10:23 +00:00
541297daae [Build] Allow metal shaders to include ATen headers (#156256)
No-op change that will be used later to share structs between CPU and Metal
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156256
Approved by: https://github.com/dcci
2025-06-18 01:03:25 +00:00
3dabc351bb [Break XPU] Fix XPU UT failures introduced by community. (#156091)
Fixes #15089, Fixes #156063, Fixes #155689, Fixes #155692, Fixes #156146

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156091
Approved by: https://github.com/jansel
2025-06-17 23:43:37 +00:00
38e1e5d54c Add get_pipeline_order() for Gpipe and 1F1B (#155935)
The [schedule visualizer](https://github.com/pytorch/pytorch/blob/main/torch/distributed/pipelining/_schedule_visualizer.py) relies on `self.pipeline_order` to be populated. The `_PipelineScheduleRuntime` also depends on this to run the IR.

The single-stage schedules do not implement this, so this PR adds it. It also fixes a bug in the schedule visualizer.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155935
Approved by: https://github.com/wconstab
2025-06-17 23:39:17 +00:00
5435e75399 [ez] rename choice_timings -> choice_timings_fn (#156099)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156099
Approved by: https://github.com/mlazos
ghstack dependencies: #155982, #155996, #156053
2025-06-17 23:30:27 +00:00
12b02137af [MPS] Add benchmark for scan operations (#156241)
Comparison of cumsum performance before and after the Metal implementation:

Previous performance (using torch==2.7.1):
```
[-------------------------------  -------------------------------]
                                              |  eager  |  compile
1 threads: -------------------------------------------------------
      cumsum-dim0-32x32 (torch.float16)       |  131.0  |   136.9
      cumsum-dim0-128x128 (torch.float16)     |  116.9  |   121.2
      cumsum-dim0-512x512 (torch.float16)     |  132.5  |   151.9
      cumsum-dim0-1024x1024 (torch.float16)   |  150.0  |   163.0
      cumsum-dim1-32x32 (torch.float16)       |  125.9  |   140.9
      cumsum-dim1-128x128 (torch.float16)     |  116.4  |   129.4
      cumsum-dim1-512x512 (torch.float16)     |  135.9  |   150.1
      cumsum-dim1-1024x1024 (torch.float16)   |  139.5  |   154.2
      cumsum-1d-100 (torch.float16)           |  119.5  |   127.1
      cumsum-1d-10000 (torch.float16)         |  128.9  |   142.5
      cumsum-1d-1000000 (torch.float16)       |  140.6  |   145.6
      cumsum-dim0-32x32 (torch.float32)       |  115.7  |   132.5
      cumsum-dim0-128x128 (torch.float32)     |  118.0  |   131.5
      cumsum-dim0-512x512 (torch.float32)     |  138.8  |   151.6
      cumsum-dim0-1024x1024 (torch.float32)   |  155.5  |   164.2
      cumsum-dim1-32x32 (torch.float32)       |  127.2  |   141.7
      cumsum-dim1-128x128 (torch.float32)     |  117.7  |   130.5
      cumsum-dim1-512x512 (torch.float32)     |  138.2  |   152.3
      cumsum-dim1-1024x1024 (torch.float32)   |  144.4  |   158.6
      cumsum-1d-100 (torch.float32)           |  118.6  |   128.0
      cumsum-1d-10000 (torch.float32)         |  125.5  |   141.5
      cumsum-1d-1000000 (torch.float32)       |  143.9  |   158.4
      cumsum-dim0-32x32 (torch.bfloat16)      |  106.6  |   137.6
      cumsum-dim0-128x128 (torch.bfloat16)    |  118.1  |   131.0
      cumsum-dim0-512x512 (torch.bfloat16)    |  140.0  |   154.3
      cumsum-dim0-1024x1024 (torch.bfloat16)  |  153.2  |   164.4
      cumsum-dim1-32x32 (torch.bfloat16)      |  127.9  |   132.6
      cumsum-dim1-128x128 (torch.bfloat16)    |  116.5  |   129.6
      cumsum-dim1-512x512 (torch.bfloat16)    |  136.5  |   151.2
      cumsum-dim1-1024x1024 (torch.bfloat16)  |  139.8  |   144.8
      cumsum-1d-100 (torch.bfloat16)          |  115.7  |   129.4
      cumsum-1d-10000 (torch.bfloat16)        |  125.0  |   143.3
      cumsum-1d-1000000 (torch.bfloat16)      |  127.8  |   143.4

Times are in microseconds (us).
```

Current performance:
```
[--------------------------------  --------------------------------]
                                              |   eager   |  compile
1 threads: ---------------------------------------------------------
      cumsum-dim0-32x32 (torch.float16)       |    107.4  |    123.8
      cumsum-dim0-128x128 (torch.float16)     |    134.2  |    145.8
      cumsum-dim0-512x512 (torch.float16)     |    207.3  |    231.6
      cumsum-dim0-1024x1024 (torch.float16)   |    318.9  |    355.3
      cumsum-dim1-32x32 (torch.float16)       |     98.0  |    114.3
      cumsum-dim1-128x128 (torch.float16)     |    110.8  |    121.6
      cumsum-dim1-512x512 (torch.float16)     |    193.0  |    209.1
      cumsum-dim1-1024x1024 (torch.float16)   |    844.7  |    870.8
      cumsum-1d-100 (torch.float16)           |    108.4  |    125.0
      cumsum-1d-10000 (torch.float16)         |    784.7  |    852.3
      cumsum-1d-1000000 (torch.float16)       |  65855.2  |  66725.9
      cumsum-dim0-32x32 (torch.float32)       |    114.7  |    115.7
      cumsum-dim0-128x128 (torch.float32)     |    139.0  |    151.6
      cumsum-dim0-512x512 (torch.float32)     |    197.3  |    208.0
      cumsum-dim0-1024x1024 (torch.float32)   |    312.7  |    332.9
      cumsum-dim1-32x32 (torch.float32)       |     92.0  |    110.8
      cumsum-dim1-128x128 (torch.float32)     |    114.2  |    125.0
      cumsum-dim1-512x512 (torch.float32)     |    186.2  |    196.1
      cumsum-dim1-1024x1024 (torch.float32)   |    752.0  |    825.0
      cumsum-1d-100 (torch.float32)           |    112.4  |    122.0
      cumsum-1d-10000 (torch.float32)         |    793.5  |    863.5
      cumsum-1d-1000000 (torch.float32)       |  66431.8  |  66040.0
      cumsum-dim0-32x32 (torch.bfloat16)      |    111.6  |    121.6
      cumsum-dim0-128x128 (torch.bfloat16)    |    139.0  |    138.4
      cumsum-dim0-512x512 (torch.bfloat16)    |    217.6  |    230.1
      cumsum-dim0-1024x1024 (torch.bfloat16)  |    305.2  |    325.6
      cumsum-dim1-32x32 (torch.bfloat16)      |    100.5  |    110.9
      cumsum-dim1-128x128 (torch.bfloat16)    |    112.8  |    125.0
      cumsum-dim1-512x512 (torch.bfloat16)    |    187.8  |    208.9
      cumsum-dim1-1024x1024 (torch.bfloat16)  |    790.9  |    864.7
      cumsum-1d-100 (torch.bfloat16)          |    111.6  |    124.6
      cumsum-1d-10000 (torch.bfloat16)        |    778.1  |    844.9
      cumsum-1d-1000000 (torch.bfloat16)      |  64654.3  |  64082.5

Times are in microseconds (us).
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156241
Approved by: https://github.com/malfet
2025-06-17 22:30:22 +00:00
fa4f07b5b8 Revert "[Docs] Convert to markdown to fix 155032 (#155520)"
This reverts commit cd66ff80307862ef8e75520054ecd19a5eff9f7e.

Reverted https://github.com/pytorch/pytorch/pull/155520 on behalf of https://github.com/atalman due to breaks multiple test_quantization.py::TestQuantizationDocs::test_quantization_ ([comment](https://github.com/pytorch/pytorch/pull/155520#issuecomment-2981996091))
2025-06-17 22:22:50 +00:00
54998c2daa Document padding size limitations in nn.modules.padding (#134840) (#155618)
Fixes #134840

Added documentation to clarify padding size constraints for all padding modes in nn.modules.padding:

- Circular padding: size must be less than or equal to the corresponding input dimension
- Reflection padding: size must be less than the corresponding input dimension
- Replication padding: output dimensions must remain positive

These changes help prevent runtime errors when users attempt to use large padding values.
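
A hedged example of the constraints being documented (shapes chosen purely for illustration):

```python
import torch
import torch.nn as nn

x = torch.randn(1, 1, 4, 4)      # 4x4 spatial input

nn.CircularPad2d(4)(x)           # ok: circular padding may be <= the input dimension
nn.ReflectionPad2d(3)(x)         # ok: reflection padding must be < the input dimension

try:
    nn.ReflectionPad2d(4)(x)     # 4 is not < 4, so this raises at runtime
except RuntimeError as e:
    print(e)
```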

## PR Checklist
- [x] The PR title and message follow our [commit guidelines](https://github.com/pytorch/pytorch/blob/main/CONTRIBUTING.md#commit-message-format)
- [x] The PR is made against the correct branch
- [x] The PR is labeled with `docathon`
- [x] The PR is labeled with `module: nn`
- [x] The PR is labeled with `documentation`
- [x] The PR description includes a reference to the issue being fixed
- [x] The PR includes tests if applicable
- [x] The PR includes documentation changes
- [x] The PR has been tested locally

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155618
Approved by: https://github.com/AlannaBurke, https://github.com/malfet
2025-06-17 22:16:48 +00:00
937529f0b3 Pass by const ref instead of by value in StableIValue from (#156126)
I realize I was passing stable::Tensors by value (thus making a copy every time) which is not what I want from the `from` function that converts Ts to StableIValues. `from` should not mutate the input and should be read-only.

I asked an LLM whether this is API BC breaking (with an intuition that it shouldn't be), and it said no, cuz:
1. "Passing by const reference is more permissive than passing by value. e.g., if T is a type that has a deleted or inaccessible copy constructor (e.g., std::unique_ptr), the original code would have been invalid, while the new code would be valid." Nice. We are good with additive.
2. We didn't modify the original input before (cuz we took a copy) and we don't now (cuz we promise const).

Update: The LLM failed to mention primitives, which we should not pass by reference, so we are only changing the signatures for std::optional<T> and stable::Tensor.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156126
Approved by: https://github.com/swolchok
ghstack dependencies: #155367, #155977
2025-06-17 22:11:30 +00:00
4c0aa37dda Support stream capture of event record and wait nodes in cuda graphs (#155372)
These are created by the user passing cudaEventRecordExternal and
cudaEventWaitExternal to cudaEventRecordWithFlags() and
cudaStreamWaitEvent() respectively.

We do this by allowing the user to specify external=True when
constructing a torch.cuda.Event().

If external=False, the cudaEventRecord and cudaStreamWaitEvent APIs
have a different meaning described here:

https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#cross-stream-dependencies-and-events

In short, they will be used to express fork and join operations in
the graph if external=False.

External events can be used for expressing a fine-grained dependency
on the outcome of some nodes in a cuda graph (rather than all
nodes). They can also be used for timing parts of a cuda graph's
execution, rather than timing the entire graph's execution.
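
A rough sketch of the intended usage, assuming the `external=True` flag described above (not taken verbatim from the PR's tests):

```python
import torch

g = torch.cuda.CUDAGraph()
halfway = torch.cuda.Event(external=True)  # recorded with cudaEventRecordExternal

x = torch.zeros(8, device="cuda")
with torch.cuda.graph(g):
    x += 1
    halfway.record()   # becomes an explicit event-record node in the captured graph
    x *= 2

g.replay()
halfway.synchronize()  # depend on (or time) only the work recorded before the event
```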

Finishes #146145

I'm a dummy and don't know how to use ghstack at this time. The first commit is a bug fix for _CudaKernel, which would previously always launch work on the NULL stream, rather than the user-passed stream.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155372
Approved by: https://github.com/ngimel
2025-06-17 21:44:51 +00:00
8e02cd9c5a Skip cache related configs for cache config serialization (#156195)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156195
Approved by: https://github.com/masnesral
2025-06-17 21:24:07 +00:00
3106a33e41 [fr] Fix one error in analysis script when subPG world size is smaller than global size (#156156)
Summary: We ran into an interesting case where we saw many mismatches, yet a lot of those mismatches turned out to be full matches. The reason is that we were using the dumped ranks (which go from 0 to 79) to compare against the local PG ranks (0 to 7), which leads to false-positive mismatches. We can instead just check whether the dumped ranks contain all expected ranks; that should be sufficient.
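
The check reduces to something like this (illustrative names):

```python
def dump_covers_expected(dump_ranks: set[int], expected_pg_ranks: set[int]) -> bool:
    # Only flag a mismatch if some expected local-PG rank is missing from the dump;
    # extra ranks in the dump (global ranks 0-79 vs. local 0-7) are not an error.
    return expected_pg_ranks.issubset(dump_ranks)
```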

Test Plan:
Test with the failed case with the script and we now see the correct behavior + new unit test case.

Rollback Plan:

Differential Revision: D76775373

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156156
Approved by: https://github.com/VieEeEw
2025-06-17 21:17:58 +00:00
bb462a6237 [cutlass backend] Fix prescreening non-deterministic problem (#156144)
Differential Revision: [D76642615](https://our.internmc.facebook.com/intern/diff/D76642615/)

What do we expect to see when we run two identical matmuls back to back? We expect to see the second one spending no time in precompilation, autotuning, and prescreening.

However, the introduction of prescreening brings some non-determinism. Basically, we have:
1. prescreening of first matmul chooses a set of kernels to advance to autotuning
2. autotuning re-does the autotuning of the winners, potentially changing their timings a bit
3. second prescreening results in a slightly different set of kernels
4. since not all timings are present, an autotune is re-done.

With this diff:
```
SingleProcess AUTOTUNE benchmarking takes 3.8633 seconds and 134.7364 seconds precompiling for 32 choices and 24.4472 seconds prescreening
SingleProcess AUTOTUNE benchmarking takes 0.0003 seconds and 0.0027 seconds precompiling for 32 choices and 0.0006 seconds prescreening
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156144
Approved by: https://github.com/mlazos
2025-06-17 20:39:06 +00:00
cd66ff8030 [Docs] Convert to markdown to fix 155032 (#155520)
Fix #155032

-   quantization-accuracy-debugging.rst: [Preview](https://docs-preview.pytorch.org/pytorch/pytorch/155520/quantization-accuracy-debugging.html) vs [main](https://docs.pytorch.org/docs/main/quantization-accuracy-debugging.html)
-  quantization-backend-configuration.rst: [Preview](https://docs-preview.pytorch.org/pytorch/pytorch/155520/quantization-backend-configuration.html) vs [main](https://docs.pytorch.org/docs/main/quantization-backend-configuration.html)
-  quantization-support.rst: [Preview](https://docs-preview.pytorch.org/pytorch/pytorch/155520/quantization-support.html) vs [main](https://docs.pytorch.org/docs/main/quantization-support.html)
-  quantization.rst: [Preview](https://docs-preview.pytorch.org/pytorch/pytorch/155520/quantization.html) vs [main](https://docs.pytorch.org/docs/main/quantization.html)
-  random.rst: [Preview](https://docs-preview.pytorch.org/pytorch/pytorch/155520/random.html) vs [main](https://docs.pytorch.org/docs/main/random.html)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155520
Approved by: https://github.com/svekars

Co-authored-by: Svetlana Karslioglu <svekars@meta.com>
2025-06-17 20:29:45 +00:00
50940270ae [BE][3/X] Phase out usage of use_max_autotune() (#155849)
See #155847 for context

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155849
Approved by: https://github.com/masnesral
2025-06-17 20:26:29 +00:00
b020971e78 [BE] fix typos in torchgen/ (#156083)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156083
Approved by: https://github.com/jingsh
ghstack dependencies: #156079, #156082
2025-06-17 19:25:50 +00:00
a69785b3ec [BE] fix typos in tools/ (#156082)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156082
Approved by: https://github.com/soulitzer
ghstack dependencies: #156079
2025-06-17 19:25:50 +00:00
ccea6ddac3 [BE] fix typos in cmake/ (#156079)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156079
Approved by: https://github.com/Skylion007
2025-06-17 19:25:43 +00:00
5eb5c3700b [ROCm] enable batched eigen decomposition (syevD_batched) on ROCm (#154525)
This PR implements `Batched Eigen Decomposition` (syevD_batched) on ROCm by calling rocSolver directly.
cuSolver doesn't support syevD_batched and neither does hipSolver. Direct call to rocSolver is required.

`syevD_batched` will be used on ROCm if all the following conditions are met:
- `rocSolver version >= 3.26`
- input data type is `float` or `double`
- batch size >= 2

Otherwise, non-batched `syevD` will be used on ROCm (complex data types, batch size==1,  rocSolver <3.26)
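
In pseudocode, the dispatch rule amounts to the following (illustrative, not the actual ATen source):

```python
import torch

def use_syevd_batched(rocsolver_version: tuple[int, int], dtype: torch.dtype, batch_size: int) -> bool:
    return (
        rocsolver_version >= (3, 26)
        and dtype in (torch.float32, torch.float64)
        and batch_size >= 2
    )
```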

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154525
Approved by: https://github.com/Mellonta
2025-06-17 19:20:36 +00:00
ec08eb8ba2 Revert "[inductor][cutlass] binary remote cache (#156106)"
This reverts commit 9a2c669425379eb264f896390b8fcd8d3f2ce959.

Reverted https://github.com/pytorch/pytorch/pull/156106 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/156106#issuecomment-2981533904))
2025-06-17 19:07:49 +00:00
4a26bb8a12 [C10][CUDA] Eagerly create context on torch.cuda.set_device(device) call (#155900)
Fixes #155668

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155900
Approved by: https://github.com/ngimel
2025-06-17 18:59:44 +00:00
fc177801af Enable FP8 row-wise scaled-mm for sm12x (#155991)
## Update using Cutlass 3.x (2025/06/15)

Following @alexsamardzic's advice, I tried out the Cutlass 3.x API and it's impressive (the rated spec is 419 TFLOPS)

 M | N | K | TFLOPS
---|---|---|--------
16|4096|4096|17.56
64|4096|4096|69.63
256|4096|4096|266.57
1024|4096|4096|339.28
4096|4096|4096|388.91

This uses the same SM100 template. The only difference is
- Cluster size is fixed to `<1,1,1>` since sm120 does not have multicast feature
- ~~Tile size is fixed to `<128,128,128>` because the default kernel schedule does not support `<64,128,128>`. I will work a bit on improving perf for small M.~~ Fixed. Use `KernelTmaWarpSpecializedPingpong` when TileShape.M == 64

Perf for small M is still bad since it seems like Cutlass does not support TileShape.M < 64 for this kernel. It's possible to boost perf a bit by using TileShape `<64,64,128>`.

## Original using SM89

I tried using cutlass FP8 row-wise scaled-mm for sm89 on sm120 (5090) and it works. I guess it makes sense because sm120 matmul uses the standard sm80 PTX instructions (`cp.async`+`mma` and friends).

Simple benchmark script

```python
import torch
from torch._inductor.utils import do_bench_using_profiling

N, K = 4096, 4096
for M in [16, 64, 256, 1024, 4096]:
    A = torch.randn(M, K, device="cuda").to(torch.float8_e4m3fn)
    B = torch.randn(N, K, device="cuda").to(torch.float8_e4m3fn).T
    scale_A = torch.ones(M, 1).cuda()
    scale_B = torch.ones(1, N).cuda()

    out = torch._scaled_mm(A, B, scale_A, scale_B, out_dtype=torch.bfloat16)
    out_ref = ((A.float() @ B.float()) * scale_A * scale_B).bfloat16()
    torch.testing.assert_close(out, out_ref)

    latency_us = do_bench_using_profiling(lambda: torch._scaled_mm(A, B, scale_A, scale_B, out_dtype=torch.bfloat16))
    tflops = (2 * M * N * K) / latency_us / 1e9
    print(f"{M=}\t{N=}\t{K=}\t{tflops:.2f} TFLOPS")
```

M | N | K | TFLOPS
---|---|---|---
16 | 4096 | 4096 | 25.73 TFLOPS
64 | 4096 | 4096 | 71.84 TFLOPS
256 | 4096 | 4096 | 86.40 TFLOPS
1024 | 4096 | 4096 | 112.12 TFLOPS
4096 | 4096 | 4096 | 121.24 TFLOPS

According to the [RTX Blackwell Whitepaper](https://images.nvidia.com/aem-dam/Solutions/geforce/blackwell/nvidia-rtx-blackwell-gpu-architecture.pdf), FP8 MMA with FP32 accumulate is 419 TFLOPS. So the result is quite bad here...

However, if I change `ThreadblockSwizzle` to `cutlass::gemm::threadblock::GemmIdentityThreadblockSwizzle<1>`

 M | N | K | TFLOPS
---|---|---|--------
16|4096|4096|27.13 TFLOPS
64|4096|4096|84.84 TFLOPS
256|4096|4096|96.75 TFLOPS
1024|4096|4096|110.21 TFLOPS
4096|4096|4096|122.98 TFLOPS

Small M slightly improves, but large M is still bad.

If I further change `ThreadBlockShape=<128, 64, 128>, WarpShape=<64, 32, 128>, NumStages=3` for M>256, which is taken from [cutlass example 58](https://github.com/NVIDIA/cutlass/blob/v3.9.2/examples/58_ada_fp8_gemm/ada_fp8_gemm.cu), I get the following results

 M | N | K | TFLOPS
---|---|---|--------
1024|4096|4096|313.28
4096|4096|4096|376.73

Which is much closer to hardware limit. And it also means this kernel is sufficient to get the most perf out of sm120. Only need better tuned configs.

To make sure this high perf is only obtainable with `GemmIdentityThreadblockSwizzle<1>` + `ThreadBlockShape=<128, 64, 128>, WarpShape=<64, 32, 128>, NumStages=3`, I also try using `ThreadblockSwizzleStreamK` + `ThreadBlockShape=<128, 64, 128>, WarpShape=<64, 32, 128>, NumStages=3`

 M | N | K | TFLOPS
---|---|---|--------
1024|4096|4096|144.03
4096|4096|4096|156.86

A bit better than current configs, but still very far away from hardware limit.

@alexsamardzic I noticed you chose this configs in #149978. Do you have any numbers how the current configs perform on sm89?

Update: Using the Triton kernels codegen-ed by Inductor via `compiled_scaled_mm = torch.compile(torch._scaled_mm, dynamic=False, mode="max-autotune-no-cudagraphs")`:

 M | N | K | TFLOPS
---|---|---|--------
16|4096|4096|25.60
64|4096|4096|71.74
256|4096|4096|161.64
1024|4096|4096|185.89
4096|4096|4096|215.53

Better than the default configs, but still far away from the config above in the compute-bound regime.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155991
Approved by: https://github.com/drisspg, https://github.com/eqy
2025-06-17 18:52:44 +00:00
e323d46b61 ELU: compute ELU(0) with the cheaper definition (#155765)
Both halves of the ELU definition yield 0 when evaluated at 0. Let's choose the half that doesn't require expm1. (I have no particular evidence that the input is often 0 in any case, but this seems like a free win.)
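
A minimal illustration of the change (not the actual ATen kernel): both branches of ELU agree at `x == 0`, so routing zero through the linear branch skips an unnecessary `expm1`:

```python
import math

def elu_before(x: float, alpha: float = 1.0) -> float:
    return x if x > 0 else alpha * math.expm1(x)   # x == 0 falls into the expm1 branch

def elu_after(x: float, alpha: float = 1.0) -> float:
    return x if x >= 0 else alpha * math.expm1(x)  # x == 0 takes the cheap linear branch

assert elu_before(0.0) == elu_after(0.0) == 0.0
```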

Differential Revision: [D76481038](https://our.internmc.facebook.com/intern/diff/D76481038/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155765
Approved by: https://github.com/ezyang
2025-06-17 18:20:22 +00:00
8b0e0e4f23 [dynamo] Support tracing of functools.lru_cached method (#156125)
Fixes https://github.com/pytorch/pytorch/issues/155841

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156125
Approved by: https://github.com/williamwen42
2025-06-17 18:11:32 +00:00
fc5ae12293 Fix issue with right-nav (#156119)
Enable the on-page right nav. For autosummary, we need to set `"show_toc_level": 2` so that navigation is enabled. Example:
* Main: https://docs.pytorch.org/docs/main/special.html - right nav (under On this page) is empty.
* Preview: https://docs-preview.pytorch.org/pytorch/pytorch/156119/special.html - right nav (under On this page) has all the objects listed
<img width="1125" alt="Screenshot 2025-06-16 at 2 48 16 PM" src="https://github.com/user-attachments/assets/0790bb72-5997-4542-9847-0a89be4598c0" />
vs
<img width="1030" alt="Screenshot 2025-06-16 at 2 48 55 PM" src="https://github.com/user-attachments/assets/4897c49c-044d-4bea-a8cd-490c90cca2b0" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156119
Approved by: https://github.com/albanD
2025-06-17 18:09:51 +00:00
32c1611263 [CI][run_test] Fix rerun logic for failing at exit (#155853)
Sometimes a test file reports success according to pytest, but fails afterwards, and the rerun logic doesn't handle that correctly.

The name of the last-run test is saved in order to do more efficient reruns (target the last-run test for a rerun without rerunning the entire file). This is usually correct: a test fails and pytest catches it -> lastrun = the test that failed; a test segfaults (pytest doesn't catch it) -> lastrun = the test that segfaulted. But sometimes pytest reports a success while the process exits with a non-zero code. The two cases I know of are hangs and double-freeing at exit. In those cases it is unclear which test caused the failure, so lastrun is set to the first test that ran in that session, so the next session starts from the beginning in an attempt to reproduce the error (an alternate solution would be to just fail and not rerun, which might be the better option). But it then reruns with runsingle, which prevents lastrun from being reset (not sure why; I'm pretty sure there's normally no difference between resetting and not resetting), so lastrun becomes the last test that ran, and it's not always true that lastrun is the one that caused the failure. Then on the next run, it starts from that last test and the process now exits cleanly.

Short-term solution here: ensure that lastrun is always set to the initial value if the session succeeds. This is correct even in the normal path because the initial value shouldn't change in that case.

Things that still need to be fixed:
* log says "running single test" which is not true
* no xml reports get generated here
* also no xml reports get generated on segfault
* docs for this

I think I have a PR that fixes the above but its old so I need to take another look

Testing:
This is from when I was based on a commit that had a hang for macs, and before I added the skips in inductor array ref:
cc862d2c14

Pull Request resolved: https://github.com/pytorch/pytorch/pull/155853
Approved by: https://github.com/malfet
2025-06-17 17:51:40 +00:00
6629eaf0c6 [CMAKE] Fix torch_cpu relink logic if metal shaders are recompiled (#156193)
Beforehand, shader recompilation updated `caffe2/aten/src/ATen/metallib_dummy.cpp`, but `torch_cpu` was dependent on `aten/src/ATen/metallib_dummy.cpp`

Test plan: Run `python3 ../tools/build_with_debinfo.py ../aten/src/ATen/native/mps/kernels/UpSample.metal` and observe that torch_cpu is being relinked

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156193
Approved by: https://github.com/manuelcandales
2025-06-17 17:49:33 +00:00
2038 changed files with 76306 additions and 20431 deletions

View File

@ -79,6 +79,7 @@ def package_cuda_wheel(wheel_path, desired_cuda) -> None:
os.system(f"unzip {wheel_path} -d {folder}/tmp")
libs_to_copy = [
"/usr/local/cuda/extras/CUPTI/lib64/libcupti.so.12",
"/usr/local/cuda/extras/CUPTI/lib64/libnvperf_host.so",
"/usr/local/cuda/lib64/libcudnn.so.9",
"/usr/local/cuda/lib64/libcublas.so.12",
"/usr/local/cuda/lib64/libcublasLt.so.12",
@ -88,6 +89,7 @@ def package_cuda_wheel(wheel_path, desired_cuda) -> None:
"/usr/local/cuda/lib64/libcusparseLt.so.0",
"/usr/local/cuda/lib64/libcusolver.so.11",
"/usr/local/cuda/lib64/libcurand.so.10",
"/usr/local/cuda/lib64/libnccl.so.2",
"/usr/local/cuda/lib64/libnvJitLink.so.12",
"/usr/local/cuda/lib64/libnvrtc.so.12",
"/usr/local/cuda/lib64/libcudnn_adv.so.9",

View File

@ -275,17 +275,6 @@ case "$tag" in
VISION=yes
TRITON=yes
;;
pytorch-linux-jammy-py3-clang12-asan)
ANACONDA_PYTHON_VERSION=3.9
CLANG_VERSION=12
VISION=yes
TRITON=yes
;;
pytorch-linux-jammy-py3-clang15-asan)
ANACONDA_PYTHON_VERSION=3.10
CLANG_VERSION=15
VISION=yes
;;
pytorch-linux-jammy-py3-clang18-asan)
ANACONDA_PYTHON_VERSION=3.10
CLANG_VERSION=18

View File

@ -1 +1 @@
b0e26b7359c147b8aa0af686c20510fb9b15990a
ae324eeac8e102a2b40370e341460f3791353398

View File

@ -1 +1 @@
c8757738a7418249896224430ce84888e8ecdd79
ae848267bebc65c6181e8cc5e64a6357d2679260

View File

@ -3,11 +3,10 @@
set -uex -o pipefail
PYTHON_DOWNLOAD_URL=https://www.python.org/ftp/python
PYTHON_DOWNLOAD_GITHUB_BRANCH=https://github.com/python/cpython/archive/refs/heads # @lint-ignore
GET_PIP_URL=https://bootstrap.pypa.io/get-pip.py
# Python versions to be installed in /opt/$VERSION_NO
CPYTHON_VERSIONS=${CPYTHON_VERSIONS:-"3.9.0 3.10.1 3.11.0 3.12.0 3.13.0 3.13.0t"}
CPYTHON_VERSIONS=${CPYTHON_VERSIONS:-"3.9.0 3.10.1 3.11.0 3.12.0 3.13.0 3.13.0t 3.14.0 3.14.0t"}
function check_var {
if [ -z "$1" ]; then
@ -24,9 +23,8 @@ function do_cpython_build {
tar -xzf Python-$py_ver.tgz
local additional_flags=""
if [ "$py_ver" == "3.13.0t" ]; then
if [[ "$py_ver" == *"t" ]]; then
additional_flags=" --disable-gil"
mv cpython-3.13/ cpython-3.13t/
fi
pushd $py_folder
@ -76,24 +74,20 @@ function do_cpython_build {
function build_cpython {
local py_ver=$1
check_var $py_ver
check_var $PYTHON_DOWNLOAD_URL
local py_ver_folder=$py_ver
local py_suffix=$py_ver
local py_folder=$py_ver
if [ "$py_ver" = "3.13.0t" ]; then
PY_VER_SHORT="3.13"
PYT_VER_SHORT="3.13t"
check_var $PYTHON_DOWNLOAD_GITHUB_BRANCH
wget $PYTHON_DOWNLOAD_GITHUB_BRANCH/$PY_VER_SHORT.tar.gz -O Python-$py_ver.tgz
do_cpython_build $py_ver cpython-$PYT_VER_SHORT
elif [ "$py_ver" = "3.13.0" ]; then
PY_VER_SHORT="3.13"
check_var $PYTHON_DOWNLOAD_GITHUB_BRANCH
wget $PYTHON_DOWNLOAD_GITHUB_BRANCH/$PY_VER_SHORT.tar.gz -O Python-$py_ver.tgz
do_cpython_build $py_ver cpython-$PY_VER_SHORT
else
wget -q $PYTHON_DOWNLOAD_URL/$py_ver_folder/Python-$py_ver.tgz
do_cpython_build $py_ver Python-$py_ver
# Special handling for nogil
if [[ "${py_ver}" == *"t" ]]; then
py_suffix=${py_ver::-1}
py_folder=$py_suffix
fi
# Only b3 is available now
if [ "$py_suffix" == "3.14.0" ]; then
py_suffix="3.14.0b3"
fi
wget -q $PYTHON_DOWNLOAD_URL/$py_folder/Python-$py_suffix.tgz -O Python-$py_ver.tgz
do_cpython_build $py_ver Python-$py_suffix
rm -f Python-$py_ver.tgz
}

View File

@ -10,6 +10,8 @@ else
arch_path='sbsa'
fi
NVSHMEM_VERSION=3.3.9
function install_cuda {
version=$1
runfile=$2
@ -40,13 +42,52 @@ function install_cudnn {
rm -rf tmp_cudnn
}
function install_nvshmem {
cuda_major_version=$1 # e.g. "12"
nvshmem_version=$2 # e.g. "3.3.9"
case "${arch_path}" in
sbsa)
dl_arch="aarch64"
;;
x86_64)
dl_arch="x64"
;;
*)
dl_arch="${arch}"
;;
esac
tmpdir="tmp_nvshmem"
mkdir -p "${tmpdir}" && cd "${tmpdir}"
# nvSHMEM license: https://docs.nvidia.com/nvshmem/api/sla.html
filename="libnvshmem_cuda${cuda_major_version}-linux-${arch_path}-${nvshmem_version}"
url="https://developer.download.nvidia.com/compute/redist/nvshmem/${nvshmem_version}/builds/cuda${cuda_major_version}/txz/agnostic/${dl_arch}/${filename}.tar.gz"
# download, unpack, install
wget -q "${url}"
tar xf "${filename}.tar.gz"
cp -a "libnvshmem/include/"* /usr/local/include/
cp -a "libnvshmem/lib/"* /usr/local/lib/
# cleanup
cd ..
rm -rf "${tmpdir}"
echo "nvSHMEM ${nvshmem_version} for CUDA ${cuda_major_version} (${arch_path}) installed."
}
function install_126 {
CUDNN_VERSION=9.10.2.21
echo "Installing CUDA 12.6.3 and cuDNN ${CUDNN_VERSION} and NCCL and cuSparseLt-0.7.1"
echo "Installing CUDA 12.6.3 and cuDNN ${CUDNN_VERSION} and NVSHMEM and NCCL and cuSparseLt-0.7.1"
install_cuda 12.6.3 cuda_12.6.3_560.35.05_linux
install_cudnn 12 $CUDNN_VERSION
install_nvshmem 12 $NVSHMEM_VERSION
CUDA_VERSION=12.6 bash install_nccl.sh
CUDA_VERSION=12.6 bash install_cusparselt.sh
@ -56,13 +97,15 @@ function install_126 {
function install_129 {
CUDNN_VERSION=9.10.2.21
echo "Installing CUDA 12.9.1 and cuDNN ${CUDNN_VERSION} and NCCL and cuSparseLt-0.7.1"
echo "Installing CUDA 12.9.1 and cuDNN ${CUDNN_VERSION} and NVSHMEM and NCCL and cuSparseLt-0.7.1"
# install CUDA 12.9.1 in the same container
install_cuda 12.9.1 cuda_12.9.1_575.57.08_linux
# cuDNN license: https://developer.nvidia.com/cudnn/license_agreement
install_cudnn 12 $CUDNN_VERSION
install_nvshmem 12 $NVSHMEM_VERSION
CUDA_VERSION=12.9 bash install_nccl.sh
CUDA_VERSION=12.9 bash install_cusparselt.sh
@ -106,13 +149,15 @@ function prune_126 {
function install_128 {
CUDNN_VERSION=9.8.0.87
echo "Installing CUDA 12.8.1 and cuDNN ${CUDNN_VERSION} and NCCL and cuSparseLt-0.7.1"
echo "Installing CUDA 12.8.1 and cuDNN ${CUDNN_VERSION} and NVSHMEM and NCCL and cuSparseLt-0.7.1"
# install CUDA 12.8.1 in the same container
install_cuda 12.8.1 cuda_12.8.1_570.124.06_linux
# cuDNN license: https://developer.nvidia.com/cudnn/license_agreement
install_cudnn 12 $CUDNN_VERSION
install_nvshmem 12 $NVSHMEM_VERSION
CUDA_VERSION=12.8 bash install_nccl.sh
CUDA_VERSION=12.8 bash install_cusparselt.sh

View File

@ -20,7 +20,7 @@ pip_install \
pip_install coloredlogs packaging
pip_install onnxruntime==1.18.1
pip_install onnxscript==0.3.0
pip_install onnxscript==0.3.1
# Cache the transformers model to be used later by ONNX tests. We need to run the transformers
# package to download the model. By default, the model is cached at ~/.cache/huggingface/hub/

View File

@ -26,6 +26,11 @@ Pin: release o=repo.radeon.com
Pin-Priority: 600
EOF
# we want the patch version of 6.4 instead
if [[ $(ver $ROCM_VERSION) -eq $(ver 6.4) ]]; then
ROCM_VERSION="${ROCM_VERSION}.1"
fi
# Add amdgpu repository
UBUNTU_VERSION_NAME=`cat /etc/os-release | grep UBUNTU_CODENAME | awk -F= '{print $2}'`
echo "deb [arch=amd64] https://repo.radeon.com/amdgpu/${ROCM_VERSION}/ubuntu ${UBUNTU_VERSION_NAME} main" > /etc/apt/sources.list.d/amdgpu.list
@ -67,19 +72,23 @@ EOF
# ROCm 6.3 had a regression where initializing static code objects had significant overhead
# ROCm 6.4 did not yet fix the regression, also HIP branch names are different
if [[ $(ver $ROCM_VERSION) -eq $(ver 6.3) ]] || [[ $(ver $ROCM_VERSION) -eq $(ver 6.4) ]]; then
if [[ $(ver $ROCM_VERSION) -eq $(ver 6.3) ]]; then
HIP_BRANCH=rocm-6.3.x
VER_STR=6.3
if [[ $(ver $ROCM_VERSION) -ge $(ver 6.3) ]] && [[ $(ver $ROCM_VERSION) -lt $(ver 7.0) ]]; then
if [[ $(ver $ROCM_VERSION) -eq $(ver 6.4.1) ]]; then
HIP_BRANCH=release/rocm-rel-6.4
VER_STR=6.4
VER_PATCH=.1
elif [[ $(ver $ROCM_VERSION) -eq $(ver 6.4) ]]; then
HIP_BRANCH=release/rocm-rel-6.4
VER_STR=6.4
elif [[ $(ver $ROCM_VERSION) -eq $(ver 6.3) ]]; then
HIP_BRANCH=rocm-6.3.x
VER_STR=6.3
fi
# clr build needs CppHeaderParser but can only find it using conda's python
/opt/conda/bin/python -m pip install CppHeaderParser
git clone https://github.com/ROCm/HIP -b $HIP_BRANCH
HIP_COMMON_DIR=$(readlink -f HIP)
git clone https://github.com/jeffdaily/clr -b release/rocm-rel-${VER_STR}-statco-hotfix
git clone https://github.com/jeffdaily/clr -b release/rocm-rel-${VER_STR}${VER_PATCH}-statco-hotfix
mkdir -p clr/build
pushd clr/build
cmake .. -DCLR_BUILD_HIP=ON -DHIP_COMMON_DIR=$HIP_COMMON_DIR

View File

@ -5,7 +5,12 @@ set -eou pipefail
function do_install() {
rocm_version=$1
rocm_version_nodot=${1//./}
if [[ ${rocm_version} =~ ^[0-9]+\.[0-9]+\.[0-9]+$ ]]; then
# chop off any patch version
rocm_version="${rocm_version%.*}"
fi
rocm_version_nodot=${rocm_version//./}
# Version 2.7.2 + ROCm related updates
MAGMA_VERSION=a1625ff4d9bc362906bd01f805dbbe12612953f6

View File

@ -98,6 +98,10 @@ fi
if [ -n "${NUMPY_VERSION}" ]; then
pip_install "numpy==${NUMPY_VERSION}"
fi
# IMPORTANT: helion needs to be installed without dependencies.
# It depends on torch and triton. We don't want to install
# triton and torch from production on Docker CI images
if [[ "$ANACONDA_PYTHON_VERSION" != 3.9* ]]; then
pip_install helion
pip_install helion --no-deps
fi

View File

@ -39,6 +39,10 @@ case ${DOCKER_TAG_PREFIX} in
DOCKER_GPU_BUILD_ARG=""
;;
rocm*)
# we want the patch version of 6.4 instead
if [[ $(ver $GPU_ARCH_VERSION) -eq $(ver 6.4) ]]; then
GPU_ARCH_VERSION="${GPU_ARCH_VERSION}.1"
fi
BASE_TARGET=rocm
GPU_IMAGE=rocm/dev-ubuntu-22.04:${GPU_ARCH_VERSION}-complete
PYTORCH_ROCM_ARCH="gfx900;gfx906;gfx908;gfx90a;gfx942;gfx1030;gfx1100;gfx1101;gfx1102;gfx1200;gfx1201"

View File

@ -120,15 +120,19 @@ RUN python3 -mpip install cmake==3.28.0
# so just build it from upstream repository.
# h5py is dependency of onnxruntime_training.
# h5py==3.11.0 builds with hdf5-devel 1.10.5 from repository.
# h5py 3.11.0 doesn't build with numpy >= 2.3.0.
# install newest flatbuffers version first:
# for some reason old version is getting pulled in otherwise.
# packaging package is required for onnxruntime wheel build.
RUN pip3 install flatbuffers && \
pip3 install h5py==3.11.0 && \
pip3 install cython 'pkgconfig>=1.5.5' 'setuptools>=77' 'numpy<2.3.0' && \
pip3 install --no-build-isolation h5py==3.11.0 && \
pip3 install packaging && \
git clone https://github.com/microsoft/onnxruntime && \
cd onnxruntime && git checkout v1.21.0 && \
git submodule update --init --recursive && \
wget https://github.com/microsoft/onnxruntime/commit/f57db79743c4d1a3553aa05cf95bcd10966030e6.patch && \
patch -p1 < f57db79743c4d1a3553aa05cf95bcd10966030e6.patch && \
./build.sh --config Release --parallel 0 --enable_pybind \
--build_wheel --enable_training --enable_training_apis \
--enable_training_ops --skip_tests --allow_running_as_root \

View File

@ -75,6 +75,10 @@ case ${image} in
DOCKERFILE_SUFFIX="_cuda_aarch64"
;;
manylinux2_28-builder:rocm*)
# we want the patch version of 6.4 instead
if [[ $(ver $GPU_ARCH_VERSION) -eq $(ver 6.4) ]]; then
GPU_ARCH_VERSION="${GPU_ARCH_VERSION}.1"
fi
TARGET=rocm_final
MANY_LINUX_VERSION="2_28"
DEVTOOLSET_VERSION="11"

View File

@ -339,7 +339,7 @@ onnx==1.18.0
#Pinned versions:
#test that import:
onnxscript==0.2.6
onnxscript==0.3.1
#Description: Required by mypy and test_public_bindings.py when checking torch.onnx._internal
#Pinned versions:
#test that import:
@ -382,3 +382,7 @@ cmake==4.0.0
tlparse==0.3.30
#Description: required for log parsing
cuda-bindings>=12.0,<13.0 ; platform_machine != "s390x"
#Description: required for testing CUDAGraph::raw_cuda_graph(). See https://nvidia.github.io/cuda-python/cuda-bindings/latest/support.html for how this version was chosen. Note "Any fix in the latest bindings would be backported to the prior major version" means that only the newest version of cuda-bindings will get fixes. Depending on the latest version of 12.x is okay because all 12.y versions will be supported via "CUDA minor version compatibility". Pytorch builds against 13.z versions of cuda toolkit work with 12.x versions of cuda-bindings as well because newer drivers work with old toolkits.
#test that import: test_cuda.py

View File

@ -19,9 +19,10 @@ sphinx_sitemap==2.6.0
#Description: This is used to generate sitemap for PyTorch docs
#Pinned versions: 2.6.0
matplotlib==3.5.3
matplotlib==3.5.3 ; python_version < "3.13"
matplotlib==3.6.3 ; python_version >= "3.13"
#Description: This is used to generate PyTorch docs
#Pinned versions: 3.5.3
#Pinned versions: 3.6.3 if python > 3.12. Otherwise 3.5.3.
tensorboard==2.13.0 ; python_version < "3.13"
tensorboard==2.18.0 ; python_version >= "3.13"

View File

@ -1 +1 @@
3.3.1
3.4.0

View File

@ -1 +1 @@
3.3.1
3.4.0

View File

@ -51,16 +51,22 @@ else
fi
cuda_version_nodot=$(echo $CUDA_VERSION | tr -d '.')
EXTRA_CAFFE2_CMAKE_FLAGS+=("-DATEN_NO_TEST=ON")
TORCH_CUDA_ARCH_LIST="5.0;6.0;7.0;7.5;8.0;8.6"
case ${CUDA_VERSION} in
12.8|12.9)
TORCH_CUDA_ARCH_LIST="7.5;8.0;8.6;9.0;10.0;12.0+PTX" #removing sm_50-sm_70 as these architectures are deprecated in CUDA 12.8/9 and will be removed in future releases
EXTRA_CAFFE2_CMAKE_FLAGS+=("-DATEN_NO_TEST=ON")
#removing sm_50-sm_70 as these architectures are deprecated in CUDA 12.8/9 and will be removed in future releases
12.8)
TORCH_CUDA_ARCH_LIST="7.5;8.0;8.6;9.0;10.0;12.0"
;;
12.9)
TORCH_CUDA_ARCH_LIST="7.5;8.0;8.6;9.0;10.0;12.0+PTX"
# WAR to resolve the ld error in libtorch build with CUDA 12.9
if [[ "$PACKAGE_TYPE" == "libtorch" ]]; then
TORCH_CUDA_ARCH_LIST="7.5;8.0;9.0;10.0;12.0+PTX"
fi
;;
12.6)
TORCH_CUDA_ARCH_LIST="${TORCH_CUDA_ARCH_LIST};9.0"
EXTRA_CAFFE2_CMAKE_FLAGS+=("-DATEN_NO_TEST=ON")
TORCH_CUDA_ARCH_LIST="5.0;6.0;7.0;7.5;8.0;8.6;9.0"
;;
*)
echo "unknown cuda version $CUDA_VERSION"
@ -103,12 +109,11 @@ DEPS_SONAME=(
)
# CUDA_VERSION 12.6, 12.8
# CUDA_VERSION 12.6, 12.8, 12.9
if [[ $CUDA_VERSION == 12* ]]; then
export USE_STATIC_CUDNN=0
# Try parallelizing nvcc as well
export TORCH_NVCC_FLAGS="-Xfatbin -compress-all --threads 2"
if [[ -z "$PYTORCH_EXTRA_INSTALL_REQUIREMENTS" ]]; then
echo "Bundling with cudnn and cublas."
DEPS_LIST+=(
@ -124,11 +129,12 @@ if [[ $CUDA_VERSION == 12* ]]; then
"/usr/local/cuda/lib64/libcublasLt.so.12"
"/usr/local/cuda/lib64/libcusparseLt.so.0"
"/usr/local/cuda/lib64/libcudart.so.12"
"/usr/local/cuda/lib64/libnvToolsExt.so.1"
"/usr/local/cuda/lib64/libnvrtc.so.12"
"/usr/local/cuda/lib64/libnvrtc-builtins.so"
"/usr/local/cuda/lib64/libcufile.so.0"
"/usr/local/cuda/lib64/libcufile_rdma.so.1"
"/usr/local/cuda/extras/CUPTI/lib64/libcupti.so.12"
"/usr/local/cuda/extras/CUPTI/lib64/libnvperf_host.so"
)
DEPS_SONAME+=(
"libcudnn_adv.so.9"
@ -143,12 +149,18 @@ if [[ $CUDA_VERSION == 12* ]]; then
"libcublasLt.so.12"
"libcusparseLt.so.0"
"libcudart.so.12"
"libnvToolsExt.so.1"
"libnvrtc.so.12"
"libnvrtc-builtins.so"
"libcufile.so.0"
"libcufile_rdma.so.1"
"libcupti.so.12"
"libnvperf_host.so"
)
# Add libnvToolsExt only if CUDA version is not 12.9
if [[ $CUDA_VERSION != 12.9* ]]; then
DEPS_LIST+=("/usr/local/cuda/lib64/libnvToolsExt.so.1")
DEPS_SONAME+=("libnvToolsExt.so.1")
fi
else
echo "Using nvidia libs from pypi."
CUDA_RPATHS=(

View File

@ -198,10 +198,8 @@ fi
# We only build FlashAttention files for CUDA 8.0+, and they require large amounts of
# memory to build and will OOM
if [[ "$BUILD_ENVIRONMENT" == *cuda* ]] && [[ 1 -eq $(echo "${TORCH_CUDA_ARCH_LIST} >= 8.0" | bc) ]] && [ -z "$MAX_JOBS_OVERRIDE" ]; then
echo "WARNING: FlashAttention files require large amounts of memory to build and will OOM"
echo "Setting MAX_JOBS=(nproc-2)/3 to reduce memory usage"
export MAX_JOBS="$(( $(nproc --ignore=2) / 3 ))"
if [[ "$BUILD_ENVIRONMENT" == *cuda* ]] && [[ 1 -eq $(echo "${TORCH_CUDA_ARCH_LIST} >= 8.0" | bc) ]]; then
export BUILD_CUSTOM_STEP="ninja -C build flash_attention -j 2"
fi
if [[ "${BUILD_ENVIRONMENT}" == *clang* ]]; then
@ -257,6 +255,7 @@ if [[ "$BUILD_ENVIRONMENT" == *-bazel-* ]]; then
set -e -o pipefail
get_bazel
python3 tools/optional_submodules.py checkout_eigen
# Leave 1 CPU free and use only up to 80% of memory to reduce the change of crashing
# the runner
@ -394,10 +393,8 @@ else
# This is an attempt to mitigate flaky libtorch build OOM error. By default, the build parallelization
# is set to be the number of CPU minus 2. So, let's try a more conservative value here. A 4xlarge has
# 16 CPUs
if [ -z "$MAX_JOBS_OVERRIDE" ]; then
MAX_JOBS=$(nproc --ignore=4)
export MAX_JOBS
fi
MAX_JOBS=$(nproc --ignore=4)
export MAX_JOBS
# NB: Install outside of source directory (at the same level as the root
# pytorch folder) so that it doesn't get cleaned away prior to docker push.

View File

@ -13,6 +13,13 @@ if [[ "$BUILD_ENVIRONMENT" != *win-* ]]; then
fi
if which sccache > /dev/null; then
# Clear SCCACHE_BUCKET and SCCACHE_REGION if they are empty, otherwise
# sccache will complain about invalid bucket configuration
if [[ -z "${SCCACHE_BUCKET:-}" ]]; then
unset SCCACHE_BUCKET
unset SCCACHE_REGION
fi
# Save sccache logs to file
sccache --stop-server > /dev/null 2>&1 || true
rm -f ~/sccache_error.log || true

View File

@ -5,11 +5,6 @@ set -x
# shellcheck source=./macos-common.sh
source "$(dirname "${BASH_SOURCE[0]}")/macos-common.sh"
if [[ -n "$CONDA_ENV" ]]; then
# Use binaries under conda environment
export PATH="$CONDA_ENV/bin":$PATH
fi
# Test that OpenMP is enabled
pushd test
if [[ ! $(python -c "import torch; print(int(torch.backends.openmp.is_available()))") == "1" ]]; then

View File

@ -11,6 +11,8 @@ export TERM=vt100
# shellcheck source=./common.sh
source "$(dirname "${BASH_SOURCE[0]}")/common.sh"
# shellcheck source=./common-build.sh
source "$(dirname "${BASH_SOURCE[0]}")/common-build.sh"
# Do not change workspace permissions for ROCm and s390x CI jobs
# as it can leave workspace with bad permissions for cancelled jobs
@ -163,8 +165,6 @@ elif [[ "$BUILD_ENVIRONMENT" == *xpu* ]]; then
export PYTORCH_TESTING_DEVICE_ONLY_FOR="xpu"
# setting PYTHON_TEST_EXTRA_OPTION
export PYTHON_TEST_EXTRA_OPTION="--xpu"
# Disable sccache for xpu test due to flaky issue https://github.com/pytorch/pytorch/issues/143585
sudo rm -rf /opt/cache
fi
if [[ "$TEST_CONFIG" == *crossref* ]]; then
@ -196,7 +196,7 @@ if [[ "$BUILD_ENVIRONMENT" == *xpu* ]]; then
# shellcheck disable=SC1091
source /opt/intel/oneapi/mpi/latest/env/vars.sh
# Check XPU status before testing
xpu-smi discovery
timeout 30 xpu-smi discovery || true
fi
if [[ "$BUILD_ENVIRONMENT" != *-bazel-* ]] ; then
@ -330,6 +330,15 @@ test_h100_distributed() {
assert_git_not_dirty
}
test_h100_symm_mem() {
# symmetric memory test
time python test/run_test.py --include distributed/test_symmetric_memory.py $PYTHON_TEST_EXTRA_OPTION --upload-artifacts-while-running
time python test/run_test.py --include distributed/test_nvshmem.py $PYTHON_TEST_EXTRA_OPTION --upload-artifacts-while-running
time python test/run_test.py --include distributed/test_nvshmem_triton.py $PYTHON_TEST_EXTRA_OPTION --upload-artifacts-while-running
time python test/run_test.py --include distributed/test_nccl.py $PYTHON_TEST_EXTRA_OPTION --upload-artifacts-while-running
assert_git_not_dirty
}
test_lazy_tensor_meta_reference_disabled() {
export TORCH_DISABLE_FUNCTIONALIZATION_META_REFERENCE=1
echo "Testing lazy tensor operations without meta reference"
@ -344,6 +353,7 @@ test_dynamo_wrapped_shard() {
exit 1
fi
python tools/dynamo/verify_dynamo.py
python tools/dynamo/gb_id_mapping.py verify
# PLEASE DO NOT ADD ADDITIONAL EXCLUDES HERE.
# Instead, use @skipIfTorchDynamo on your tests.
time python test/run_test.py --dynamo \
@ -358,6 +368,17 @@ test_dynamo_wrapped_shard() {
assert_git_not_dirty
}
test_einops() {
pip install einops==0.6.1
time python test/run_test.py --einops --verbose --upload-artifacts-while-running
pip install einops==0.7.0
time python test/run_test.py --einops --verbose --upload-artifacts-while-running
pip install einops==0.8.1
time python test/run_test.py --einops --verbose --upload-artifacts-while-running
assert_git_not_dirty
}
test_inductor_distributed() {
# Smuggle a few multi-gpu tests here so that we don't have to request another large node
echo "Testing multi_gpu tests in test_torchinductor"
@ -415,14 +436,21 @@ test_inductor_aoti() {
python3 tools/amd_build/build_amd.py
fi
if [[ "$BUILD_ENVIRONMENT" == *sm86* ]]; then
BUILD_AOT_INDUCTOR_TEST=1 TORCH_CUDA_ARCH_LIST=8.6 USE_FLASH_ATTENTION=OFF python setup.py develop
BUILD_COMMAND=(TORCH_CUDA_ARCH_LIST=8.6 USE_FLASH_ATTENTION=OFF python setup.py develop)
# TODO: Replace me completely, as one should not use conda libstdc++, nor need special path to TORCH_LIB
LD_LIBRARY_PATH=/opt/conda/envs/py_3.10/lib/:${TORCH_LIB_DIR}:$LD_LIBRARY_PATH
CPP_TESTS_DIR="${BUILD_BIN_DIR}" python test/run_test.py --cpp --verbose -i cpp/test_aoti_abi_check cpp/test_aoti_inference -dist=loadfile
TEST_ENVS=(CPP_TESTS_DIR="${BUILD_BIN_DIR}" LD_LIBRARY_PATH="/opt/conda/envs/py_3.10/lib:${TORCH_LIB_DIR}:${LD_LIBRARY_PATH}")
else
BUILD_AOT_INDUCTOR_TEST=1 python setup.py develop
CPP_TESTS_DIR="${BUILD_BIN_DIR}" LD_LIBRARY_PATH="${TORCH_LIB_DIR}" python test/run_test.py --cpp --verbose -i cpp/test_aoti_abi_check cpp/test_aoti_inference -dist=loadfile
BUILD_COMMAND=(python setup.py develop)
TEST_ENVS=(CPP_TESTS_DIR="${BUILD_BIN_DIR}" LD_LIBRARY_PATH="${TORCH_LIB_DIR}")
fi
# aoti cmake custom command requires `torch` to be installed
# initialize the cmake build cache and install torch
/usr/bin/env "${BUILD_COMMAND[@]}"
# rebuild with the build cache with `BUILD_AOT_INDUCTOR_TEST` enabled
/usr/bin/env CMAKE_FRESH=1 BUILD_AOT_INDUCTOR_TEST=1 "${BUILD_COMMAND[@]}"
/usr/bin/env "${TEST_ENVS[@]}" python test/run_test.py --cpp --verbose -i cpp/test_aoti_abi_check cpp/test_aoti_inference -dist=loadfile
}
test_inductor_cpp_wrapper_shard() {
@ -435,47 +463,26 @@ test_inductor_cpp_wrapper_shard() {
TEST_REPORTS_DIR=$(pwd)/test/test-reports
mkdir -p "$TEST_REPORTS_DIR"
if [[ "$1" -eq "2" ]]; then
# For now, manually put the opinfo tests in shard 2, and all other tests in
# shard 1. Run all CPU tests, as well as specific GPU tests triggering past
# bugs, for now.
python test/run_test.py \
--include inductor/test_torchinductor_opinfo \
-k 'linalg or to_sparse or TestInductorOpInfoCPU' \
--verbose
exit
fi
# Run certain inductor unit tests with cpp wrapper. In the end state, we
# should be able to run all the inductor unit tests with cpp_wrapper.
#
# TODO: I'm pretty sure that "TestInductorOpInfoCPU" is not a valid filter,
# but change that in another PR to more accurately monitor the increased CI
# usage.
python test/run_test.py \
--include inductor/test_torchinductor_opinfo \
-k 'linalg or to_sparse or TestInductorOpInfoCPU' \
--shard "$1" "$NUM_TEST_SHARDS" \
--verbose
python test/run_test.py \
--include inductor/test_torchinductor inductor/test_max_autotune inductor/test_cpu_repro \
--shard "$1" "$NUM_TEST_SHARDS" \
--verbose
python test/run_test.py --inductor \
--include test_torch \
-k 'take' \
--shard "$1" "$NUM_TEST_SHARDS" \
--verbose
python test/run_test.py --inductor --include test_torch -k 'take' --verbose
# Run inductor benchmark tests with cpp wrapper.
# Skip benchmark tests when running in rerun-disabled mode.
if [[ "${PYTORCH_TEST_RERUN_DISABLED_TESTS}" == "1" ]]; then
echo "skip dynamo benchmark tests for rerun-disabled-test"
else
echo "run dynamo benchmark tests with cpp wrapper"
python benchmarks/dynamo/timm_models.py --device cuda --accuracy --amp \
--training --inductor --disable-cudagraphs --only vit_base_patch16_224 \
--output "$TEST_REPORTS_DIR/inductor_cpp_wrapper_training.csv"
python benchmarks/dynamo/check_accuracy.py \
--actual "$TEST_REPORTS_DIR/inductor_cpp_wrapper_training.csv" \
--expected "benchmarks/dynamo/ci_expected_accuracy/${MAYBE_ROCM}inductor_timm_training.csv"
python benchmarks/dynamo/torchbench.py --device cuda --accuracy \
--bfloat16 --inference --inductor --only hf_T5 --output "$TEST_REPORTS_DIR/inductor_cpp_wrapper_inference.csv"
python benchmarks/dynamo/torchbench.py --device cuda --accuracy \
--bfloat16 --inference --inductor --only llama --output "$TEST_REPORTS_DIR/inductor_cpp_wrapper_inference.csv"
python benchmarks/dynamo/torchbench.py --device cuda --accuracy \
--bfloat16 --inference --inductor --only moco --output "$TEST_REPORTS_DIR/inductor_cpp_wrapper_inference.csv"
python benchmarks/dynamo/check_accuracy.py \
--actual "$TEST_REPORTS_DIR/inductor_cpp_wrapper_inference.csv" \
--expected "benchmarks/dynamo/ci_expected_accuracy/${MAYBE_ROCM}inductor_torchbench_inference.csv"
fi
}
# "Global" flags for inductor benchmarking controlled by TEST_CONFIG
@ -596,7 +603,9 @@ test_perf_for_dashboard() {
local device=cuda
if [[ "${TEST_CONFIG}" == *cpu* ]]; then
if [[ "${TEST_CONFIG}" == *cpu_x86* ]]; then
if [[ "${TEST_CONFIG}" == *cpu_x86_zen* ]]; then
device=cpu_x86_zen
elif [[ "${TEST_CONFIG}" == *cpu_x86* ]]; then
device=cpu_x86
elif [[ "${TEST_CONFIG}" == *cpu_aarch64* ]]; then
device=cpu_aarch64
@ -612,7 +621,11 @@ test_perf_for_dashboard() {
for mode in "${modes[@]}"; do
if [[ "$mode" == "inference" ]]; then
dtype=bfloat16
if [[ "$device" == "cpu_x86" ]]; then
dtype=amp
else
dtype=bfloat16
fi
elif [[ "$mode" == "training" ]]; then
dtype=amp
fi
@ -624,6 +637,10 @@ test_perf_for_dashboard() {
target_flag+=( --no-translation-validation)
fi
if [[ "$DASHBOARD_TAG" == *freezing-true* ]]; then
target_flag+=( --freezing)
fi
if [[ "$DASHBOARD_TAG" == *default-true* ]]; then
$TASKSET python "benchmarks/dynamo/$suite.py" \
"${target_flag[@]}" --"$mode" --"$dtype" --backend "$backend" --disable-cudagraphs "$@" \
@ -1137,6 +1154,12 @@ test_custom_backend() {
test_custom_script_ops() {
echo "Testing custom script operators"
if [[ "$BUILD_ENVIRONMENT" == *s390x* ]]; then
echo "Skipping custom script operators until it's fixed"
return 0
fi
CUSTOM_OP_BUILD="${CUSTOM_TEST_ARTIFACT_BUILD_DIR}/custom-op-build"
pushd test/custom_operator
cp -a "$CUSTOM_OP_BUILD" build
@ -1671,11 +1694,11 @@ elif [[ "${TEST_CONFIG}" == *torchbench* ]]; then
PYTHONPATH=$(pwd)/torchbench test_dynamo_benchmark torchbench "$id"
fi
elif [[ "${TEST_CONFIG}" == *inductor_cpp_wrapper* ]]; then
install_torchaudio cuda
install_torchvision
checkout_install_torchbench hf_T5 llama moco
PYTHONPATH=$(pwd)/torchbench test_inductor_cpp_wrapper_shard "$SHARD_NUMBER"
test_inductor_aoti
if [[ "$SHARD_NUMBER" -eq "1" ]]; then
test_inductor_aoti
fi
elif [[ "${TEST_CONFIG}" == *inductor* ]]; then
install_torchvision
test_inductor_shard "${SHARD_NUMBER}"
@ -1684,6 +1707,8 @@ elif [[ "${TEST_CONFIG}" == *inductor* ]]; then
test_inductor_distributed
fi
fi
elif [[ "${TEST_CONFIG}" == *einops* ]]; then
test_einops
elif [[ "${TEST_CONFIG}" == *dynamo_wrapped* ]]; then
install_torchvision
test_dynamo_wrapped_shard "${SHARD_NUMBER}"
@ -1733,6 +1758,8 @@ elif [[ "${TEST_CONFIG}" == smoke ]]; then
test_python_smoke
elif [[ "${TEST_CONFIG}" == h100_distributed ]]; then
test_h100_distributed
elif [[ "${TEST_CONFIG}" == "h100-symm-mem" ]]; then
test_h100_symm_mem
else
install_torchvision
install_monkeytype

View File

@ -52,6 +52,9 @@ python -m pip install parameterized==0.8.1
# Install pulp for testing ilps under torch\distributed\_tools
python -m pip install pulp==2.9.0
# Install expecttest to merge https://github.com/pytorch/pytorch/pull/155308
python -m pip install expecttest==0.3.0
run_tests() {
# Run nvidia-smi if available
for path in '/c/Program Files/NVIDIA Corporation/NVSMI/nvidia-smi.exe' /c/Windows/System32/nvidia-smi.exe; do

View File

@ -29,7 +29,7 @@ IF "%NVTOOLSEXT_PATH%"=="" (
IF "%CUDA_PATH_V129%"=="" (
IF EXIST "C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.9\bin\nvcc.exe" (
set "CUDA_PATH_V128=C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.9"
set "CUDA_PATH_V129=C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.9"
) ELSE (
echo CUDA 12.9 not found, failing
exit /b 1

View File

@ -8,6 +8,7 @@ copy "%CUDA_PATH%\bin\cusolver*64_*.dll*" pytorch\torch\lib
copy "%CUDA_PATH%\bin\cudnn*64_*.dll*" pytorch\torch\lib
copy "%CUDA_PATH%\bin\nvrtc*64_*.dll*" pytorch\torch\lib
copy "%CUDA_PATH%\extras\CUPTI\lib64\cupti64_*.dll*" pytorch\torch\lib
copy "%CUDA_PATH%\extras\CUPTI\lib64\nvperf_host*.dll*" pytorch\torch\lib
copy "C:\Program Files\NVIDIA Corporation\NvToolsExt\bin\x64\nvToolsExt64_1.dll*" pytorch\torch\lib
copy "%PYTHON_LIB_PATH%\libiomp*5md.dll" pytorch\torch\lib

View File

@ -18,3 +18,5 @@ start /wait "" python-amd64.exe /quiet InstallAllUsers=1 PrependPath=0 Include_t
if errorlevel 1 exit /b 1
set "PATH=%CD%\Python\Scripts;%CD%\Python;%PATH%"
%PYTHON_EXEC% -m pip install --upgrade pip setuptools packaging wheel
if errorlevel 1 exit /b 1

View File

@ -75,8 +75,8 @@ TRITON_VERSION=$(cat $PYTORCH_ROOT/.ci/docker/triton_version.txt)
# Here PYTORCH_EXTRA_INSTALL_REQUIREMENTS is already set for all the wheel builds, hence append TRITON_CONSTRAINT
TRITON_CONSTRAINT="platform_system == 'Linux' and platform_machine == 'x86_64'"
# CUDA 12.8 builds have triton for Linux and Linux aarch64 binaries.
if [[ "$DESIRED_CUDA" == cu128 ]]; then
# CUDA 12.9 builds have triton for Linux and Linux aarch64 binaries.
if [[ "$DESIRED_CUDA" == "cu129" ]]; then
TRITON_CONSTRAINT="platform_system == 'Linux'"
fi

View File

@ -14,6 +14,7 @@ self-hosted-runner:
- linux.12xlarge
- linux.24xlarge
- linux.24xlarge.ephemeral
- linux.24xlarge.amd
- linux.arm64.2xlarge
- linux.arm64.2xlarge.ephemeral
- linux.arm64.m7g.4xlarge

View File

@ -125,7 +125,7 @@ runs:
TAG: ${{ steps.parse-ref.outputs.tag }}
EVENT_NAME: ${{ github.event_name }}
SCHEDULE: ${{ github.event.schedule }}
HEAD_BRANCH: ${{ github.event.workflow_run.head_branch }}
HEAD_BRANCH: ${{ steps.parse-ref.outputs.branch }}
id: filter
run: |
echo "Workflow: ${GITHUB_WORKFLOW}"

View File

@ -132,17 +132,26 @@ def check_changed_files(sha: str) -> bool:
# Return true if all the changed files are in the list of allowed files to
# be changed to reuse the old whl
# Removing any files is not allowed since rsync will not remove files
# Removing files in the torch folder is not allowed since rsync will not
# remove files
removed_files = (
subprocess.check_output(
["git", "diff", "--name-only", sha, "HEAD", "--diff-filter=D"],
[
"git",
"diff",
"--name-only",
sha,
"HEAD",
"--diff-filter=D",
"--no-renames",
],
text=True,
stderr=subprocess.DEVNULL,
)
.strip()
.split()
)
if removed_files:
if any(file.startswith("torch/") for file in removed_files):
print(
f"Removed files between {sha} and HEAD: {removed_files}, cannot reuse old whl"
)
@ -150,7 +159,7 @@ def check_changed_files(sha: str) -> bool:
changed_files = (
subprocess.check_output(
["git", "diff", "--name-only", sha, "HEAD"],
["git", "diff", "--name-only", sha, "HEAD", "--no-renames"],
text=True,
stderr=subprocess.DEVNULL,
)
@ -295,8 +304,7 @@ def unzip_artifact_and_replace_files() -> None:
def set_output() -> None:
# Disable for now so we can monitor first
# pass
print("Setting output reuse=true")
if os.getenv("GITHUB_OUTPUT"):
with open(str(os.getenv("GITHUB_OUTPUT")), "a") as env:
print("reuse=true", file=env)

View File

@ -1 +1 @@
1f643d1de912cfef21b1daefc8774ddcace7a0b2
70caf76066ef2c1054d6128b11769dc816a779e7

View File

@ -1 +1 @@
55a75404c9b75cd5fd62ab5d4deafc8c506b3af2
1c00dea2c9adb2137903c86b4191e8c247f8fda9

View File

@ -48,3 +48,12 @@
- "module: dynamic shapes"
then:
- "oncall: pt2"
- any:
- "release notes: distributed (c10d)"
- "release notes: distributed (symm_mem)"
- "release notes: distributed (pipeline)"
- "release notes: distributed (fsdp)"
- "release notes: distributed (dtensor)"
- "oncall: distributed"
then:
- "ciflow/h100-distributed"

View File

@ -384,6 +384,7 @@
- leslie-fang-intel
- jgong5
- EikanWang
- CaoE
mandatory_checks_name:
- EasyCLA
- Lint
@ -435,6 +436,7 @@
approved_by:
- leslie-fang-intel
- jgong5
- CaoE
mandatory_checks_name:
- EasyCLA
- Lint

View File

@ -4,6 +4,7 @@ ciflow_push_tags:
- ciflow/binaries
- ciflow/binaries_libtorch
- ciflow/binaries_wheel
- ciflow/triton_binaries
- ciflow/inductor
- ciflow/inductor-periodic
- ciflow/inductor-rocm
@ -11,6 +12,7 @@ ciflow_push_tags:
- ciflow/inductor-perf-compare
- ciflow/inductor-micro-benchmark
- ciflow/inductor-micro-benchmark-cpu-x86
- ciflow/inductor-perf-test-nightly-x86-zen
- ciflow/inductor-cu126
- ciflow/linux-aarch64
- ciflow/mps
@ -29,6 +31,7 @@ ciflow_push_tags:
- ciflow/pull
- ciflow/h100
- ciflow/h100-distributed
- ciflow/h100-symm-mem
retryable_workflows:
- pull
- trunk

View File

@ -275,7 +275,7 @@ def delete_branches() -> None:
delete_branch(git_repo, branch)
def delete_old_ciflow_tags() -> None:
def delete_old_tags() -> None:
# Deletes ciflow tags if they are associated with a closed PR or a specific
# commit. Lightweight tags don't have information about the date they were
# created, so we can't check how old they are. The script just assumes that
@ -288,23 +288,29 @@ def delete_old_ciflow_tags() -> None:
delete_branch(git_repo, f"refs/tags/{tag}")
tags = git_repo._run_git("tag").splitlines()
open_pr_numbers = [x["number"] for x in get_open_prs()]
CIFLOW_TAG_REGEX = re.compile(r"^ciflow\/.*\/(\d{5,6}|[0-9a-f]{40})$")
AUTO_REVERT_TAG_REGEX = re.compile(r"^trunk\/[0-9a-f]{40}$")
for tag in tags:
try:
if ESTIMATED_TOKENS[0] > 400:
print("Estimated tokens exceeded, exiting")
break
if not tag.startswith("ciflow/"):
if not CIFLOW_TAG_REGEX.match(tag) and not AUTO_REVERT_TAG_REGEX.match(tag):
continue
re_match_pr = re.match(r"^ciflow\/.*\/(\d{5,6})$", tag)
re_match_sha = re.match(r"^ciflow\/.*\/([0-9a-f]{40})$", tag)
if re_match_pr:
pr_number = int(re_match_pr.group(1))
if pr_number in open_pr_numbers:
continue
delete_tag(tag)
elif re_match_sha:
# This checks the date of the commit associated with the tag instead
# of the tag itself since lightweight tags don't have this
# information. I think it should be ok since this only runs once a
# day
tag_info = git_repo._run_git("show", "-s", "--format=%ct", tag)
tag_timestamp = int(tag_info.strip())
# Maybe some timezone issues, but a few hours shouldn't matter
tag_age_days = (datetime.now().timestamp() - tag_timestamp) / SEC_IN_DAY
if tag_age_days > 7:
print(f"[{tag}] Tag is older than 7 days, deleting")
delete_tag(tag)
except Exception as e:
print(f"Failed to check tag {tag}: {e}")
@ -312,4 +318,4 @@ def delete_old_ciflow_tags() -> None:
if __name__ == "__main__":
delete_branches()
delete_old_ciflow_tags()
delete_old_tags()
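A minimal sketch of the age check the new delete_old_tags applies to sha-pinned ciflow/ and trunk/ tags, assuming plain subprocess calls in place of the script's GitRepo helper; PR-number tags (deleted when the PR is closed) are left out, and SEC_IN_DAY plus the 7-day cutoff mirror the code above:
# Hypothetical sketch: lightweight tags carry no creation date, so the commit
# timestamp (git show -s --format=%ct) stands in for the tag's age.
import re
import subprocess
from datetime import datetime

SEC_IN_DAY = 60 * 60 * 24
CIFLOW_TAG_REGEX = re.compile(r"^ciflow\/.*\/(\d{5,6}|[0-9a-f]{40})$")
AUTO_REVERT_TAG_REGEX = re.compile(r"^trunk\/[0-9a-f]{40}$")

def is_stale_sha_tag(tag: str) -> bool:
    if not (CIFLOW_TAG_REGEX.match(tag) or AUTO_REVERT_TAG_REGEX.match(tag)):
        return False
    ts = int(subprocess.check_output(["git", "show", "-s", "--format=%ct", tag], text=True).strip())
    return (datetime.now().timestamp() - ts) / SEC_IN_DAY > 7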

View File

@ -18,6 +18,7 @@ import yaml
REENABLE_TEST_REGEX = "(?i)(Close(d|s)?|Resolve(d|s)?|Fix(ed|es)?) (#|https://github.com/pytorch/pytorch/issues/)([0-9]+)"
MAIN_BRANCH = "main"
PREFIX = "test-config/"
@ -97,7 +98,7 @@ def parse_args() -> Any:
parser.add_argument(
"--branch",
type=str,
default="main",
default=MAIN_BRANCH,
help="the branch name",
)
return parser.parse_args()
@ -456,6 +457,7 @@ def download_json(url: str, headers: dict[str, str], num_retries: int = 3) -> An
def set_output(name: str, val: Any) -> None:
print(f"Setting output {name}={val}")
if os.getenv("GITHUB_OUTPUT"):
with open(str(os.getenv("GITHUB_OUTPUT")), "a") as env:
print(f"{name}={val}", file=env)
@ -495,13 +497,20 @@ def check_for_setting(labels: set[str], body: str, setting: str) -> bool:
def perform_misc_tasks(
labels: set[str], test_matrix: dict[str, list[Any]], job_name: str, pr_body: str
labels: set[str],
test_matrix: dict[str, list[Any]],
job_name: str,
pr_body: str,
branch: Optional[str] = None,
) -> None:
"""
In addition to applying the filter logic, the script also does the following
misc tasks to set the keep-going and is-unstable variables
"""
set_output("keep-going", check_for_setting(labels, pr_body, "keep-going"))
set_output(
"keep-going",
branch == MAIN_BRANCH or check_for_setting(labels, pr_body, "keep-going"),
)
set_output(
"ci-verbose-test-logs",
check_for_setting(labels, pr_body, "ci-verbose-test-logs"),
@ -624,6 +633,7 @@ def main() -> None:
test_matrix=filtered_test_matrix,
job_name=args.job_name,
pr_body=pr_body if pr_body else "",
branch=args.branch,
)
# Set the filtered test matrix as the output
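A minimal sketch of the keep-going decision after this change, with check_for_setting standing in for the real label/PR-body helper shown earlier:
# Hypothetical sketch: on the main branch keep-going is always on; elsewhere
# it still comes from the "keep-going" label or PR-body setting.
MAIN_BRANCH = "main"

def keep_going_enabled(branch, labels, pr_body, check_for_setting):
    return branch == MAIN_BRANCH or check_for_setting(labels, pr_body, "keep-going")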

View File

@ -17,7 +17,7 @@ from typing import Optional
# NOTE: Please also update the CUDA sources in `PIP_SOURCES` in tools/nightly.py when changing this
CUDA_ARCHES = ["12.6", "12.8", "12.9"]
CUDA_STABLE = "12.6"
CUDA_STABLE = "12.8"
CUDA_ARCHES_FULL_VERSION = {
"12.6": "12.6.3",
"12.8": "12.8.1",
@ -54,7 +54,7 @@ PYTORCH_EXTRA_INSTALL_REQUIREMENTS = {
"nvidia-cusparse-cu12==12.5.4.2; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-nccl-cu12==2.27.3; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-nvshmem-cu12==3.2.5; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-nvshmem-cu12==3.3.9; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-nvtx-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-nvjitlink-cu12==12.6.85; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-cufile-cu12==1.11.1.6; platform_system == 'Linux' and platform_machine == 'x86_64'"
@ -71,7 +71,7 @@ PYTORCH_EXTRA_INSTALL_REQUIREMENTS = {
"nvidia-cusparse-cu12==12.5.8.93; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-nccl-cu12==2.27.3; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-nvshmem-cu12==3.2.5; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-nvshmem-cu12==3.3.9; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-nvtx-cu12==12.8.90; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-nvjitlink-cu12==12.8.93; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-cufile-cu12==1.13.1.3; platform_system == 'Linux' and platform_machine == 'x86_64'"
@ -88,6 +88,7 @@ PYTORCH_EXTRA_INSTALL_REQUIREMENTS = {
"nvidia-cusparse-cu12==12.5.10.65; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-nccl-cu12==2.27.3; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-nvshmem-cu12==3.3.9; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-nvtx-cu12==12.9.79; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-nvjitlink-cu12==12.9.86; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-cufile-cu12==1.14.1.1; platform_system == 'Linux' and platform_machine == 'x86_64'"
@ -222,13 +223,8 @@ def generate_libtorch_matrix(
if os == "linux":
arches += CUDA_ARCHES
arches += ROCM_ARCHES
# will add in a separate PR for 12.9
if "12.9" in arches:
arches.remove("12.9")
elif os == "windows":
arches += CUDA_ARCHES
if "12.9" in arches:
arches.remove("12.9")
if libtorch_variants is None:
libtorch_variants = [
"shared-with-deps",
@ -294,9 +290,6 @@ def generate_wheels_matrix(
arches += CUDA_ARCHES + ROCM_ARCHES + XPU_ARCHES
elif os == "windows":
arches += CUDA_ARCHES + XPU_ARCHES
# skip CUDA 12.9 builds on Windows
if "12.9" in arches:
arches.remove("12.9")
elif os == "linux-aarch64":
# Separate new if as the CPU type is different and
# uses different build/test scripts
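A minimal sketch of how the wheel arch list comes together once the Windows 12.9 skip is dropped, using the CUDA_ARCHES constant shown above; the seed list and the other arch constants here are illustrative placeholders, not the script's real values:
# Hypothetical sketch: Windows wheel builds now keep every CUDA_ARCHES entry
# (12.6, 12.8, 12.9) instead of filtering out 12.9.
CUDA_ARCHES = ["12.6", "12.8", "12.9"]
XPU_ARCHES = ["xpu"]          # illustrative placeholder
ROCM_ARCHES = ["6.3", "6.4"]  # illustrative placeholder

def wheel_arches(os_name: str) -> list:
    arches: list = []  # the real script seeds this with CPU and other targets
    if os_name == "linux":
        arches += CUDA_ARCHES + ROCM_ARCHES + XPU_ARCHES
    elif os_name == "windows":
        arches += CUDA_ARCHES + XPU_ARCHES  # 12.9 is no longer removed here
    return arches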

View File

@ -136,10 +136,10 @@ def find_job_id_name(args: Any) -> tuple[str, str]:
def set_output(name: str, val: Any) -> None:
print(f"Setting output {name}={val}")
if os.getenv("GITHUB_OUTPUT"):
with open(str(os.getenv("GITHUB_OUTPUT")), "a") as env:
print(f"{name}={val}", file=env)
print(f"setting {name}={val}")
else:
print(f"::set-output name={name}::{val}")

View File

@ -5,6 +5,7 @@ import re
def set_output(name: str, val: str) -> None:
print(f"Setting output {name}={val}")
if os.getenv("GITHUB_OUTPUT"):
with open(str(os.getenv("GITHUB_OUTPUT")), "a") as env:
print(f"{name}={val}", file=env)

View File

@ -1,64 +0,0 @@
import argparse
import subprocess
import generate_binary_build_matrix
def tag_image(
image: str,
default_tag: str,
release_version: str,
dry_run: str,
tagged_images: dict[str, bool],
) -> None:
if image in tagged_images:
return
release_image = image.replace(f"-{default_tag}", f"-{release_version}")
print(f"Tagging {image} to {release_image} , dry_run: {dry_run}")
if dry_run == "disabled":
subprocess.check_call(["docker", "pull", image])
subprocess.check_call(["docker", "tag", image, release_image])
subprocess.check_call(["docker", "push", release_image])
tagged_images[image] = True
def main() -> None:
parser = argparse.ArgumentParser()
parser.add_argument(
"--version",
help="Version to tag",
type=str,
default="2.2",
)
parser.add_argument(
"--dry-run",
help="No Runtime Error check",
type=str,
choices=["enabled", "disabled"],
default="enabled",
)
options = parser.parse_args()
tagged_images: dict[str, bool] = {}
platform_images = [
generate_binary_build_matrix.WHEEL_CONTAINER_IMAGES,
generate_binary_build_matrix.LIBTORCH_CONTAINER_IMAGES,
]
default_tag = generate_binary_build_matrix.DEFAULT_TAG
for platform_image in platform_images: # type: ignore[attr-defined]
for arch in platform_image.keys(): # type: ignore[attr-defined]
if arch == "cpu-s390x":
continue
tag_image(
platform_image[arch], # type: ignore[index]
default_tag,
options.version,
options.dry_run,
tagged_images,
)
if __name__ == "__main__":
main()

View File

@ -0,0 +1,56 @@
import os
import unittest
from datetime import datetime
from unittest.mock import MagicMock, patch
os.environ["GITHUB_TOKEN"] = "test_token"
from delete_old_branches import delete_old_tags
@patch("delete_old_branches.delete_branch")
@patch("gitutils.GitRepo._run_git")
class TestDeleteTag(unittest.TestCase):
def test_delete_tag(
self, mock_run_git: "MagicMock", mock_delete_tag: "MagicMock"
) -> None:
for tag in [
"ciflow/branch/12345",
"ciflow/commitsha/1234567890abcdef1234567890abcdef12345678",
"trunk/1234567890abcdef1234567890abcdef12345678",
]:
mock_run_git.side_effect = [
tag,
str(int(datetime.now().timestamp() - 8 * 24 * 60 * 60)), # 8 days ago
]
delete_old_tags()
mock_delete_tag.assert_called_once()
mock_delete_tag.reset_mock()
# Don't delete if the tag is not old enough
mock_run_git.side_effect = [
tag,
str(int(datetime.now().timestamp() - 6 * 24 * 60 * 60)), # 6 days ago
]
delete_old_tags()
mock_delete_tag.assert_not_called()
def test_do_not_delete_tag(
self, mock_run_git: "MagicMock", mock_delete_tag: "MagicMock"
) -> None:
for tag in [
"ciflow/doesntseemtomatch",
"trunk/doesntseemtomatch",
"doesntseemtomatch",
]:
mock_run_git.side_effect = [
tag,
str(int(datetime.now().timestamp() - 8 * 24 * 60 * 60)), # 8 days ago
]
delete_old_tags()
mock_delete_tag.assert_not_called()
if __name__ == "__main__":
unittest.main()

View File

@ -69,11 +69,6 @@ on:
required: false
type: string
default: ""
max-jobs:
description: |
Overwrite the number of jobs to use for the build
required: false
type: string
disable-monitor:
description: |
Disable utilization monitoring for build job
@ -190,7 +185,7 @@ jobs:
ECR_DOCKER_IMAGE: ${{ steps.calculate-docker-image.outputs.docker-image }}
shell: bash
run: |
tag=${ECR_DOCKER_IMAGE##*/}
tag=${ECR_DOCKER_IMAGE##*:}
echo "docker pull ghcr.io/pytorch/ci-image:${tag/:/-}"
- name: Pull docker image
@ -266,7 +261,6 @@ jobs:
OUR_GITHUB_JOB_ID: ${{ steps.get-job-id.outputs.job-id }}
HUGGING_FACE_HUB_TOKEN: ${{ secrets.HUGGING_FACE_HUB_TOKEN }}
SCRIBE_GRAPHQL_ACCESS_TOKEN: ${{ secrets.SCRIBE_GRAPHQL_ACCESS_TOKEN }}
MAX_JOBS_OVERRIDE: ${{ inputs.max-jobs }}
run: |
START_TIME=$(date +%s)
if [[ ${BUILD_ENVIRONMENT} == *"s390x"* ]]; then
@ -286,12 +280,6 @@ jobs:
DOCKER_SHELL_CMD=
fi
if [[ ${MAX_JOBS_OVERRIDE} == "" ]]; then
MAX_JOBS="$(nproc --ignore=2)"
else
MAX_JOBS="${MAX_JOBS_OVERRIDE}"
fi
# Leaving 1GB for the runner and other things
TOTAL_AVAILABLE_MEMORY_IN_GB=$(awk '/MemTotal/ { printf "%.3f \n", $2/1024/1024 - 1 }' /proc/meminfo)
# https://docs.docker.com/engine/containers/resource_constraints/#--memory-swap-details, the 3GB swap
@ -303,8 +291,7 @@ jobs:
# shellcheck disable=SC2086
container_name=$(docker run \
-e BUILD_ENVIRONMENT \
-e MAX_JOBS=${MAX_JOBS} \
-e MAX_JOBS_OVERRIDE \
-e MAX_JOBS="$(nproc --ignore=2)" \
-e AWS_DEFAULT_REGION \
-e PR_NUMBER \
-e SHA1 \

View File

@ -90,10 +90,13 @@ jobs:
environment: ${{ github.ref == 'refs/heads/main' && 'scribe-protected' || startsWith(github.ref, 'refs/heads/release/') && 'scribe-protected' || contains(github.event.pull_request.labels.*.name, 'ci-scribe') && 'scribe-pr' || '' }}
runs-on: ${{ matrix.runner }}
timeout-minutes: ${{ matrix.mem_leak_check == 'mem_leak_check' && 600 || inputs.timeout-minutes }}
permissions:
id-token: write
contents: read
steps:
- name: Setup SSH (Click me for login details)
uses: pytorch/test-infra/.github/actions/setup-ssh@main
if: ${{ !contains(matrix.runner, 'gcp.a100') && inputs.build-environment != 'linux-s390x-binary-manywheel' }}
if: ${{ matrix.runner != 'B200' && inputs.build-environment != 'linux-s390x-binary-manywheel' }}
with:
github-secret: ${{ secrets.GITHUB_TOKEN }}
instructions: |
@ -105,18 +108,31 @@ jobs:
with:
no-sudo: true
- name: Setup Python
if: matrix.runner == 'B200'
uses: actions/setup-python@a26af69be951a213d495a4c3e4e4022e16d87065 # v5.6.0
with:
python-version: '3.12'
cache: pip
- name: Setup Linux
uses: ./.github/actions/setup-linux
if: inputs.build-environment != 'linux-s390x-binary-manywheel'
if: inputs.build-environment != 'linux-s390x-binary-manywheel' && matrix.runner != 'B200'
- name: configure aws credentials
if : ${{ inputs.aws-role-to-assume != '' && inputs.build-environment != 'linux-s390x-binary-manywheel' }}
if: ${{ inputs.aws-role-to-assume != '' && inputs.build-environment != 'linux-s390x-binary-manywheel' }}
uses: aws-actions/configure-aws-credentials@ececac1a45f3b08a01d2dd070d28d111c5fe6722 # v4.1.0
with:
role-to-assume: ${{ inputs.aws-role-to-assume }}
role-session-name: gha-linux-test
aws-region: us-east-1
- name: Login to Amazon ECR
if: ${{ inputs.aws-role-to-assume != '' && matrix.runner == 'B200' }}
id: login-ecr
continue-on-error: true
uses: aws-actions/amazon-ecr-login@062b18b96a7aff071d4dc91bc00c4c1a7945b076 # v2.0.1
- name: Calculate docker image
id: calculate-docker-image
uses: pytorch/test-infra/.github/actions/calculate-docker-image@main
@ -131,7 +147,7 @@ jobs:
ECR_DOCKER_IMAGE: ${{ steps.calculate-docker-image.outputs.docker-image }}
shell: bash
run: |
tag=${ECR_DOCKER_IMAGE##*/}
tag=${ECR_DOCKER_IMAGE##*:}
echo "docker pull ghcr.io/pytorch/ci-image:${tag/:/-}"
- name: Pull docker image
@ -148,17 +164,17 @@ jobs:
- name: Install nvidia driver, nvidia-docker runtime, set GPU_FLAG
id: install-nvidia-driver
uses: pytorch/test-infra/.github/actions/setup-nvidia@main
if: ${{ contains(inputs.build-environment, 'cuda') && !contains(matrix.config, 'nogpu') && steps.check_container_runner.outputs.IN_CONTAINER_RUNNER == 'false' }}
if: ${{ contains(inputs.build-environment, 'cuda') && !contains(matrix.config, 'nogpu') && steps.check_container_runner.outputs.IN_CONTAINER_RUNNER == 'false' && matrix.runner != 'B200' }}
- name: Setup GPU_FLAG for docker run
id: setup-gpu-flag
run: echo "GPU_FLAG=--gpus all -e NVIDIA_DRIVER_CAPABILITIES=all" >> "${GITHUB_ENV}"
if: ${{ contains(inputs.build-environment, 'cuda') && !contains(matrix.config, 'nogpu') && steps.check_container_runner.outputs.IN_CONTAINER_RUNNER == 'true' }}
if: ${{ contains(inputs.build-environment, 'cuda') && !contains(matrix.config, 'nogpu') && (steps.check_container_runner.outputs.IN_CONTAINER_RUNNER == 'true' || matrix.runner == 'B200') }}
- name: Setup SCCACHE_SERVER_PORT environment for docker run when on container
id: setup-sscache-port-flag
run: echo "SCCACHE_SERVER_PORT_DOCKER_FLAG=-e SCCACHE_SERVER_PORT=$((RUNNER_UID + 4226))" >> "${GITHUB_ENV}"
if: ${{ steps.check_container_runner.outputs.IN_CONTAINER_RUNNER == 'true' }}
if: ${{ steps.check_container_runner.outputs.IN_CONTAINER_RUNNER == 'true' && matrix.runner != 'B200' }}
- name: Lock NVIDIA A100 40GB Frequency
run: |
@ -225,6 +241,12 @@ jobs:
run: |
echo "timeout=$((JOB_TIMEOUT-30))" >> "${GITHUB_OUTPUT}"
- name: Preserve github env variables for use in docker
shell: bash
run: |
env | grep '^GITHUB' >> "/tmp/github_env_${GITHUB_RUN_ID}"
env | grep '^CI' >> "/tmp/github_env_${GITHUB_RUN_ID}"
- name: Test
id: test
timeout-minutes: ${{ fromJson(steps.test-timeout.outputs.timeout) }}
@ -253,8 +275,8 @@ jobs:
NO_TD: ${{ steps.keep-going.outputs.ci-no-td }}
TD_DISTRIBUTED: ${{ steps.keep-going.outputs.ci-td-distributed }}
# Do not set SCCACHE_S3_KEY_PREFIX to share the cache between all build jobs
SCCACHE_BUCKET: ossci-compiler-cache-circleci-v2
SCCACHE_REGION: us-east-1
SCCACHE_BUCKET: ${{ matrix.runner != 'B200' && 'ossci-compiler-cache-circleci-v2' || '' }}
SCCACHE_REGION: ${{ matrix.runner != 'B200' && 'us-east-1' || '' }}
SHM_SIZE: ${{ contains(inputs.build-environment, 'cuda') && '2g' || '1g' }}
DOCKER_IMAGE: ${{ inputs.docker-image }}
XLA_CUDA: ${{ contains(inputs.build-environment, 'xla') && '0' || '' }}
@ -264,7 +286,6 @@ jobs:
DASHBOARD_TAG: ${{ inputs.dashboard-tag }}
HUGGING_FACE_HUB_TOKEN: ${{ secrets.HUGGING_FACE_HUB_TOKEN }}
SCRIBE_GRAPHQL_ACCESS_TOKEN: ${{ secrets.SCRIBE_GRAPHQL_ACCESS_TOKEN }}
IS_A100_RUNNER: ${{ contains(matrix.runner, 'a100') && '1' || '0' }}
ARTIFACTS_FILE_SUFFIX: ${{ github.job }}-${{ matrix.config }}-${{ matrix.shard }}-${{ matrix.num_shards }}-${{ matrix.runner }}_${{ steps.get-job-id.outputs.job-id }}
run: |
set -x
@ -290,10 +311,6 @@ jobs:
# if for some reason cleanup action doesn't stop container
# when job is cancelled
DOCKER_SHELL_CMD="sleep 12h"
# Since some steps are skipped on s390x, run them here if they are necessary
env | grep '^GITHUB' >> "/tmp/github_env_${GITHUB_RUN_ID}"
env | grep '^CI' >> "/tmp/github_env_${GITHUB_RUN_ID}"
else
SHM_OPTS="--shm-size=${SHM_SIZE}"
JENKINS_USER="--user jenkins"
@ -345,7 +362,6 @@ jobs:
-e HUGGING_FACE_HUB_TOKEN \
-e SCRIBE_GRAPHQL_ACCESS_TOKEN \
-e DASHBOARD_TAG \
-e IS_A100_RUNNER \
-e ARTIFACTS_FILE_SUFFIX \
--memory="${TOTAL_AVAILABLE_MEMORY_IN_GB%.*}g" \
--memory-swap="${TOTAL_MEMORY_WITH_SWAP}g" \
@ -384,8 +400,18 @@ jobs:
test_config: ${{ matrix.config }}
job_identifier: ${{ github.workflow }}_${{ inputs.build-environment }}
- name: Authenticate with AWS
if: ${{ matrix.runner == 'B200' }}
uses: aws-actions/configure-aws-credentials@ececac1a45f3b08a01d2dd070d28d111c5fe6722 # v4.1.0
with:
role-to-assume: arn:aws:iam::308535385114:role/gha_workflow_upload-benchmark-results
# The max duration enforced by the server side
role-duration-seconds: 18000
aws-region: us-east-1
- name: Upload the benchmark results
uses: pytorch/test-infra/.github/actions/upload-benchmark-results@main
if: inputs.build-environment != 'linux-s390x-binary-manywheel'
with:
benchmark-results-dir: test/test-reports
dry-run: false

View File

@ -123,7 +123,7 @@ jobs:
else
# The runner has access to the S3 bucket via IAM profile without the need
# for any credential
echo "SCCACHE_BUCKET=ossci-compiler-cache-circleci-v2" >> "${GITHUB_ENV}"0
echo "SCCACHE_BUCKET=ossci-compiler-cache-circleci-v2" >> "${GITHUB_ENV}"
echo "SCCACHE_S3_KEY_PREFIX=${GITHUB_WORKFLOW}" >> "${GITHUB_ENV}"
fi

View File

@ -60,8 +60,6 @@ jobs:
test:
# Don't run on forked repos or empty test matrix
if: github.repository_owner == 'pytorch' && toJSON(fromJSON(inputs.test-matrix).include) != '[]'
# For setup-miniconda, see https://github.com/conda-incubator/setup-miniconda/issues/179
# Also ensure that we always run with the right architecture
defaults:
run:
shell: bash -e -l {0}
@ -90,6 +88,10 @@ jobs:
pkill "${PROCESS}" || true
done
- name: Clean up leftover miniconda installation
continue-on-error: true
run: brew uninstall miniconda || true
- name: Clean up leftover local python3 site-packages on MacOS pet runner
continue-on-error: true
run: |
@ -124,8 +126,8 @@ jobs:
MONITOR_LOG_INTERVAL: ${{ inputs.monitor-log-interval }}
MONITOR_DATA_COLLECT_INTERVAL: ${{ inputs.monitor-data-collect-interval }}
run: |
${CONDA_RUN} python3 -m pip install psutil==5.9.1 dataclasses_json==0.6.7
${CONDA_RUN} python3 -m tools.stats.monitor --log-interval "$MONITOR_LOG_INTERVAL" --data-collect-interval "$MONITOR_DATA_COLLECT_INTERVAL" > usage_log.txt 2>&1 &
python3 -m pip install psutil==5.9.1 dataclasses_json==0.6.7
python3 -m tools.stats.monitor --log-interval "$MONITOR_LOG_INTERVAL" --data-collect-interval "$MONITOR_DATA_COLLECT_INTERVAL" > usage_log.txt 2>&1 &
echo "monitor-script-pid=${!}" >> "${GITHUB_OUTPUT}"
- name: Download build artifacts
@ -140,11 +142,10 @@ jobs:
with:
use-gha: true
- name: Setup miniconda
uses: pytorch/test-infra/.github/actions/setup-miniconda@main
- name: Setup Python
uses: pytorch/test-infra/.github/actions/setup-python@main
with:
python-version: ${{ inputs.python-version }}
environment-file: .github/requirements/conda-env-macOS-ARM64
pip-requirements-file: .github/requirements/pip-requirements-macOS.txt
default-packages: ""
@ -197,37 +198,32 @@ jobs:
# shellcheck disable=SC1090
set -ex
arch
if [[ -n "$CONDA_ENV" ]]; then
# Use binaries under conda environment
export PATH="$CONDA_ENV/bin":$PATH
fi
# TODO: Remove me later, and properly activate venv
PATH="$(dirname "$(which python)"):$PATH"
export PATH
# Print out some information about the test environment
which conda
conda --version
${CONDA_RUN} which python3
${CONDA_RUN} python3 --version
${CONDA_RUN} which python
${CONDA_RUN} python --version
for tool in python3 python; do
which $tool
$tool --version
done
${CONDA_RUN} python3 -mpip install --no-index --no-deps dist/*.whl
python3 -mpip install --no-index --no-deps dist/*.whl
set +e
pushd "${RUNNER_TEMP}"
# Install pip dependencies if they are not found. This is to mitigate a peculiar
# flaky missing dependencies on MacOS
${CONDA_RUN} python3 -c "import torch"
python3 -c "import torch"
RC=$?
popd
if [ "${RC}" -ne 0 ]; then
${CONDA_RUN} python3 -mpip install --ignore-installed -r "${PIP_REQUIREMENTS_FILE}"
python3 -mpip install --ignore-installed -r "${PIP_REQUIREMENTS_FILE}"
fi
set -e
${CONDA_RUN} .ci/pytorch/macos-test.sh
.ci/pytorch/macos-test.sh
- name: Print remaining test logs
shell: bash
@ -239,11 +235,7 @@ jobs:
shell: bash
if: ${{ contains(steps.get-job-id.outputs.job-name, 'mps') }}
run: |
if [[ -n "$CONDA_ENV" ]]; then
# Use binaries under conda environment
export PATH="$CONDA_ENV/bin":$PATH
fi
${CONDA_RUN} python3 test/bench_mps_ops.py
python3 test/bench_mps_ops.py
- name: Stop monitoring script

View File

@ -105,7 +105,7 @@ jobs:
ECR_DOCKER_IMAGE: ${{ steps.calculate-docker-image.outputs.docker-image }}
shell: bash
run: |
tag=${ECR_DOCKER_IMAGE##*/}
tag=${ECR_DOCKER_IMAGE##*:}
echo "docker pull ghcr.io/pytorch/ci-image:${tag/:/-}"
- name: Pull docker image
@ -191,9 +191,6 @@ jobs:
SHARD_NUMBER: ${{ matrix.shard }}
NUM_TEST_SHARDS: ${{ matrix.num_shards }}
REENABLED_ISSUES: ${{ steps.keep-going.outputs.reenabled-issues }}
SCCACHE_BUCKET: ossci-compiler-cache-circleci-v2
SCCACHE_REGION: us-east-1
SCCACHE_S3_KEY_PREFIX: ${{ github.workflow }}
DOCKER_IMAGE: ${{ inputs.docker-image }}
XLA_CLANG_CACHE_S3_BUCKET_NAME: ossci-compiler-clang-cache-circleci-xla
PYTORCH_TEST_CUDA_MEM_LEAK_CHECK: ${{ matrix.mem_leak_check && '1' || '0' }}

View File

@ -23,7 +23,7 @@ on:
env:
DOCKER_REGISTRY: "docker.io"
DOCKER_BUILDKIT: 1
WITH_PUSH: ${{ github.event_name == 'push' && (github.ref == 'refs/heads/main' || startsWith(github.ref, 'refs/heads/release')) }}
WITH_PUSH: ${{ github.event_name == 'push' && (github.ref == 'refs/heads/main' || startsWith(github.ref, 'refs/heads/release') || startsWith(github.ref, 'refs/tags/v')) }}
concurrency:
group: ${{ github.workflow }}-${{ github.event.pull_request.number || github.sha }}-${{ github.event_name == 'workflow_dispatch' }}
@ -32,7 +32,7 @@ concurrency:
jobs:
build-docker:
if: github.repository_owner == 'pytorch'
environment: ${{ (github.ref == 'refs/heads/main' || startsWith(github.event.ref, 'refs/tags/v')) && 'docker-build' || '' }}
environment: ${{ (github.event_name == 'push' && (github.ref == 'refs/heads/main' || startsWith(github.ref, 'refs/heads/release') || startsWith(github.ref, 'refs/tags/v')) && 'docker-build') || '' }}
runs-on: linux.9xlarge.ephemeral
strategy:
matrix:

View File

@ -22,7 +22,7 @@ on:
env:
DOCKER_REGISTRY: "docker.io"
DOCKER_BUILDKIT: 1
WITH_PUSH: ${{ github.event_name == 'push' && (github.ref == 'refs/heads/main' || startsWith(github.ref, 'refs/heads/release')) }}
WITH_PUSH: ${{ github.event_name == 'push' && (github.ref == 'refs/heads/main' || startsWith(github.ref, 'refs/heads/release') || startsWith(github.ref, 'refs/tags/v')) }}
concurrency:
group: ${{ github.workflow }}-${{ github.event.pull_request.number || github.sha }}-${{ github.event_name == 'workflow_dispatch' }}
@ -40,7 +40,7 @@ jobs:
curr_ref_type: ${{ github.ref_type }}
build:
environment: ${{ (github.ref == 'refs/heads/main' || startsWith(github.event.ref, 'refs/tags/v')) && 'docker-build' || '' }}
environment: ${{ (github.event_name == 'push' && (github.ref == 'refs/heads/main' || startsWith(github.ref, 'refs/heads/release') || startsWith(github.ref, 'refs/tags/v')) && 'docker-build') || '' }}
needs: get-label-type
runs-on: ${{ needs.get-label-type.outputs.label-type }}linux.9xlarge.ephemeral
name: libtorch-cxx11-builder:${{ matrix.tag }}

View File

@ -29,7 +29,7 @@ concurrency:
jobs:
build-linux-magma-rocm:
if: github.repository_owner == 'pytorch'
runs-on: linux.12xlarge
runs-on: linux.2xlarge
permissions:
id-token: write
strategy:

View File

@ -12,7 +12,7 @@ on:
env:
DOCKER_REGISTRY: "docker.io"
DOCKER_BUILDKIT: 1
WITH_PUSH: ${{ github.event_name == 'push' && (github.ref == 'refs/heads/main' || startsWith(github.ref, 'refs/heads/release')) }}
WITH_PUSH: ${{ github.event_name == 'push' && (github.ref == 'refs/heads/main' || startsWith(github.ref, 'refs/heads/release') || startsWith(github.ref, 'refs/tags/v')) }}
concurrency:
group: ${{ github.workflow }}-${{ github.event.pull_request.number || github.sha }}-${{ github.event_name == 'workflow_dispatch' }}
@ -21,7 +21,7 @@ concurrency:
jobs:
build-docker-cpu-s390x:
if: github.repository_owner == 'pytorch'
environment: ${{ (github.ref == 'refs/heads/main' || startsWith(github.event.ref, 'refs/tags/v')) && 'docker-build' || '' }}
environment: ${{ (github.event_name == 'push' && (github.ref == 'refs/heads/main' || startsWith(github.ref, 'refs/heads/release') || startsWith(github.ref, 'refs/tags/v')) && 'docker-build') || '' }}
runs-on: linux.s390x
steps:
- name: Checkout PyTorch

View File

@ -23,8 +23,7 @@ on:
env:
DOCKER_REGISTRY: "docker.io"
DOCKER_BUILDKIT: 1
WITH_PUSH: ${{ github.event_name == 'push' && (github.ref == 'refs/heads/main' || startsWith(github.ref, 'refs/heads/release')) }}
WITH_PUSH: ${{ github.event_name == 'push' && (github.ref == 'refs/heads/main' || startsWith(github.ref, 'refs/heads/release') || startsWith(github.ref, 'refs/tags/v')) }}
concurrency:
group: ${{ github.workflow }}-${{ github.event.pull_request.number || github.sha }}-${{ github.event_name == 'workflow_dispatch' }}
cancel-in-progress: true
@ -41,7 +40,7 @@ jobs:
curr_ref_type: ${{ github.ref_type }}
build:
environment: ${{ (github.ref == 'refs/heads/main' || startsWith(github.event.ref, 'refs/tags/v')) && 'docker-build' || '' }}
environment: ${{ (github.event_name == 'push' && (github.ref == 'refs/heads/main' || startsWith(github.ref, 'refs/heads/release') || startsWith(github.ref, 'refs/tags/v')) && 'docker-build') || '' }}
needs: get-label-type
strategy:
fail-fast: false

View File

@ -8,6 +8,7 @@ on:
# NOTE: Binary build pipelines should only get triggered on release candidate builds
# Release candidate tags look like: v1.11.0-rc1
- v[0-9]+.[0-9]+.[0-9]+-rc[0-9]+
- 'ciflow/triton_binaries/*'
paths:
- .github/workflows/build-triton-wheel.yml
- .github/scripts/build_triton_wheel.py

View File

@ -55,6 +55,8 @@ jobs:
tag_or_branch="${tag_or_branch//\//_}"
echo "PT_RELEASE_NAME=pytorch-$tag_or_branch" >> "$GITHUB_ENV"
echo "PT_RELEASE_FILE=pytorch-$tag_or_branch.tar.gz" >> "$GITHUB_ENV"
- name: Checkout optional submodules
run: python3 tools/optional_submodules.py
- name: Create source distribution
run: |
# Create new folder with specified name so extracting the archive yields that
@ -80,7 +82,7 @@ jobs:
path: ${{ env.PT_RELEASE_FILE }}
- name: Set output
id: release_name
run: echo "name=pt_release_name::${{ env.PT_RELEASE_NAME }}.tar.gz" >> "${GITHUB_OUTPUT}"
run: echo "pt_release_name=${{ env.PT_RELEASE_NAME }}.tar.gz" >> "${GITHUB_OUTPUT}"
upload_source_code_to_s3:
if: ${{ github.repository == 'pytorch/pytorch' && github.event_name == 'push' && startsWith(github.ref, 'refs/tags/v') && contains(github.ref, 'rc') }}

View File

@ -69,7 +69,6 @@ jobs:
pytorch-linux-jammy-py3.12-halide,
pytorch-linux-jammy-xpu-2025.0-py3,
pytorch-linux-jammy-xpu-2025.1-py3,
pytorch-linux-jammy-py3-clang15-asan,
pytorch-linux-jammy-py3-clang18-asan,
pytorch-linux-jammy-py3-clang12-onnx,
pytorch-linux-jammy-linter,

View File

@ -136,7 +136,7 @@ jobs:
ALPINE_IMAGE: "arm64v8/alpine"
build_name: manywheel-py3_9-cuda-aarch64-12_9
build_environment: linux-aarch64-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.9.86; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.9.79; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.9.79; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.9.1.4; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.4.1.4; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.10.19; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.5.82; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.10.65; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.27.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.9.79; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.9.86; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.14.1.1; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.9.86; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.9.79; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.9.79; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.9.1.4; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.4.1.4; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.10.19; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.5.82; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.10.65; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.27.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvshmem-cu12==3.3.9; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.9.79; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.9.86; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.14.1.1; platform_system == 'Linux' and platform_machine == 'x86_64'
timeout-minutes: 420
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
@ -252,7 +252,7 @@ jobs:
ALPINE_IMAGE: "arm64v8/alpine"
build_name: manywheel-py3_10-cuda-aarch64-12_9
build_environment: linux-aarch64-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.9.86; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.9.79; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.9.79; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.9.1.4; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.4.1.4; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.10.19; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.5.82; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.10.65; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.27.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.9.79; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.9.86; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.14.1.1; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.9.86; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.9.79; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.9.79; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.9.1.4; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.4.1.4; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.10.19; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.5.82; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.10.65; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.27.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvshmem-cu12==3.3.9; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.9.79; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.9.86; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.14.1.1; platform_system == 'Linux' and platform_machine == 'x86_64'
timeout-minutes: 420
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
@ -368,7 +368,7 @@ jobs:
ALPINE_IMAGE: "arm64v8/alpine"
build_name: manywheel-py3_11-cuda-aarch64-12_9
build_environment: linux-aarch64-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.9.86; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.9.79; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.9.79; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.9.1.4; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.4.1.4; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.10.19; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.5.82; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.10.65; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.27.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.9.79; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.9.86; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.14.1.1; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.9.86; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.9.79; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.9.79; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.9.1.4; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.4.1.4; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.10.19; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.5.82; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.10.65; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.27.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvshmem-cu12==3.3.9; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.9.79; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.9.86; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.14.1.1; platform_system == 'Linux' and platform_machine == 'x86_64'
timeout-minutes: 420
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
@ -484,7 +484,7 @@ jobs:
ALPINE_IMAGE: "arm64v8/alpine"
build_name: manywheel-py3_12-cuda-aarch64-12_9
build_environment: linux-aarch64-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.9.86; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.9.79; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.9.79; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.9.1.4; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.4.1.4; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.10.19; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.5.82; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.10.65; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.27.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.9.79; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.9.86; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.14.1.1; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.9.86; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.9.79; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.9.79; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.9.1.4; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.4.1.4; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.10.19; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.5.82; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.10.65; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.27.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvshmem-cu12==3.3.9; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.9.79; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.9.86; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.14.1.1; platform_system == 'Linux' and platform_machine == 'x86_64'
timeout-minutes: 420
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
@ -600,7 +600,7 @@ jobs:
ALPINE_IMAGE: "arm64v8/alpine"
build_name: manywheel-py3_13-cuda-aarch64-12_9
build_environment: linux-aarch64-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.9.86; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.9.79; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.9.79; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.9.1.4; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.4.1.4; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.10.19; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.5.82; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.10.65; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.27.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.9.79; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.9.86; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.14.1.1; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.9.86; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.9.79; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.9.79; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.9.1.4; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.4.1.4; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.10.19; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.5.82; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.10.65; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.27.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvshmem-cu12==3.3.9; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.9.79; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.9.86; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.14.1.1; platform_system == 'Linux' and platform_machine == 'x86_64'
timeout-minutes: 420
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
@ -716,7 +716,7 @@ jobs:
ALPINE_IMAGE: "arm64v8/alpine"
build_name: manywheel-py3_13t-cuda-aarch64-12_9
build_environment: linux-aarch64-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.9.86; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.9.79; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.9.79; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.9.1.4; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.4.1.4; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.10.19; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.5.82; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.10.65; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.27.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.9.79; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.9.86; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.14.1.1; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.9.86; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.9.79; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.9.79; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.9.1.4; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.4.1.4; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.10.19; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.5.82; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.10.65; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.27.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvshmem-cu12==3.3.9; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.9.79; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.9.86; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.14.1.1; platform_system == 'Linux' and platform_machine == 'x86_64'
timeout-minutes: 420
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}


@@ -248,6 +248,74 @@ jobs:
github-token: ${{ secrets.GITHUB_TOKEN }}
uses: ./.github/workflows/_binary-upload.yml
libtorch-cuda12_9-shared-with-deps-release-build:
if: ${{ github.repository_owner == 'pytorch' }}
uses: ./.github/workflows/_binary-build-linux.yml
needs: get-label-type
with:
PYTORCH_ROOT: /pytorch
PACKAGE_TYPE: libtorch
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu129
GPU_ARCH_VERSION: 12.9
GPU_ARCH_TYPE: cuda
DOCKER_IMAGE: libtorch-cxx11-builder
DOCKER_IMAGE_TAG_PREFIX: cuda12.9
LIBTORCH_CONFIG: release
LIBTORCH_VARIANT: shared-with-deps
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
build_name: libtorch-cuda12_9-shared-with-deps-release
build_environment: linux-binary-libtorch
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
libtorch-cuda12_9-shared-with-deps-release-test: # Testing
if: ${{ github.repository_owner == 'pytorch' }}
needs:
- libtorch-cuda12_9-shared-with-deps-release-build
- get-label-type
uses: ./.github/workflows/_binary-test-linux.yml
with:
PYTORCH_ROOT: /pytorch
PACKAGE_TYPE: libtorch
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu129
GPU_ARCH_VERSION: 12.9
GPU_ARCH_TYPE: cuda
DOCKER_IMAGE: libtorch-cxx11-builder
DOCKER_IMAGE_TAG_PREFIX: cuda12.9
LIBTORCH_CONFIG: release
LIBTORCH_VARIANT: shared-with-deps
build_name: libtorch-cuda12_9-shared-with-deps-release
build_environment: linux-binary-libtorch
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
runs_on: linux.g4dn.4xlarge.nvidia.gpu # 12.8 and 12.9 build need sm_70+ runner
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
libtorch-cuda12_9-shared-with-deps-release-upload: # Uploading
if: ${{ github.repository_owner == 'pytorch' }}
permissions:
id-token: write
contents: read
needs: libtorch-cuda12_9-shared-with-deps-release-test
with:
PYTORCH_ROOT: /pytorch
PACKAGE_TYPE: libtorch
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu129
GPU_ARCH_VERSION: 12.9
GPU_ARCH_TYPE: cuda
DOCKER_IMAGE: libtorch-cxx11-builder
DOCKER_IMAGE_TAG_PREFIX: cuda12.9
LIBTORCH_CONFIG: release
LIBTORCH_VARIANT: shared-with-deps
build_name: libtorch-cuda12_9-shared-with-deps-release
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
uses: ./.github/workflows/_binary-upload.yml
libtorch-rocm6_3-shared-with-deps-release-build:
if: ${{ github.repository_owner == 'pytorch' }}
uses: ./.github/workflows/_binary-build-linux.yml


@@ -61,7 +61,7 @@ jobs:
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
build_name: manywheel-py3_9-cuda12_6
build_environment: linux-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.6.80; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.6.4.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.3.0.4; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.7.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.1.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.4.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.27.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvshmem-cu12==3.2.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.6.85; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.11.1.6; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.6.80; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.6.4.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.3.0.4; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.7.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.1.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.4.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.27.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvshmem-cu12==3.3.9; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.6.85; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.11.1.6; platform_system == 'Linux' and platform_machine == 'x86_64'
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_9-cuda12_6-test: # Testing
@@ -108,7 +108,7 @@ jobs:
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
build_name: manywheel-py3_9-cuda12_8
build_environment: linux-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.8.93; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.8.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.8.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.8.4.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.3.3.83; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.9.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.3.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.8.93; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.27.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvshmem-cu12==3.2.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.8.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.8.93; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.13.1.3; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.8.93; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.8.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.8.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.8.4.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.3.3.83; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.9.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.3.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.8.93; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.27.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvshmem-cu12==3.3.9; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.8.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.8.93; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.13.1.3; platform_system == 'Linux' and platform_machine == 'x86_64'
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_9-cuda12_8-test: # Testing
@@ -155,7 +155,7 @@ jobs:
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
build_name: manywheel-py3_9-cuda12_9
build_environment: linux-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.9.86; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.9.79; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.9.79; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.9.1.4; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.4.1.4; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.10.19; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.5.82; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.10.65; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.27.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.9.79; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.9.86; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.14.1.1; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.9.86; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.9.79; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.9.79; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.9.1.4; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.4.1.4; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.10.19; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.5.82; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.10.65; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.27.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvshmem-cu12==3.3.9; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.9.79; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.9.86; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.14.1.1; platform_system == 'Linux' and platform_machine == 'x86_64'
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_9-cuda12_9-test: # Testing


@@ -131,7 +131,7 @@ jobs:
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
build_name: manywheel-py3_9-cuda12_6
build_environment: linux-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.6.80; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.6.4.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.3.0.4; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.7.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.1.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.4.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.27.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvshmem-cu12==3.2.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.6.85; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.11.1.6; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.6.80; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.6.4.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.3.0.4; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.7.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.1.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.4.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.27.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvshmem-cu12==3.3.9; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.6.85; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.11.1.6; platform_system == 'Linux' and platform_machine == 'x86_64'
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_9-cuda12_6-test: # Testing
@@ -200,7 +200,7 @@ jobs:
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
build_name: manywheel-py3_9-cuda12_8
build_environment: linux-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.8.93; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.8.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.8.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.8.4.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.3.3.83; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.9.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.3.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.8.93; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.27.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvshmem-cu12==3.2.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.8.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.8.93; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.13.1.3; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.8.93; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.8.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.8.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.8.4.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.3.3.83; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.9.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.3.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.8.93; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.27.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvshmem-cu12==3.3.9; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.8.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.8.93; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.13.1.3; platform_system == 'Linux' and platform_machine == 'x86_64'
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_9-cuda12_8-test: # Testing
@@ -269,7 +269,7 @@ jobs:
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
build_name: manywheel-py3_9-cuda12_9
build_environment: linux-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.9.86; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.9.79; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.9.79; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.9.1.4; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.4.1.4; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.10.19; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.5.82; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.10.65; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.27.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.9.79; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.9.86; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.14.1.1; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.9.86; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.9.79; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.9.79; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.9.1.4; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.4.1.4; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.10.19; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.5.82; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.10.65; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.27.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvshmem-cu12==3.3.9; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.9.79; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.9.86; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.14.1.1; platform_system == 'Linux' and platform_machine == 'x86_64'
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_9-cuda12_9-test: # Testing
@@ -744,7 +744,7 @@ jobs:
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
build_name: manywheel-py3_10-cuda12_6
build_environment: linux-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.6.80; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.6.4.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.3.0.4; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.7.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.1.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.4.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.27.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvshmem-cu12==3.2.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.6.85; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.11.1.6; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.6.80; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.6.4.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.3.0.4; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.7.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.1.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.4.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.27.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvshmem-cu12==3.3.9; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.6.85; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.11.1.6; platform_system == 'Linux' and platform_machine == 'x86_64'
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_10-cuda12_6-test: # Testing
@@ -813,7 +813,7 @@ jobs:
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
build_name: manywheel-py3_10-cuda12_8
build_environment: linux-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.8.93; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.8.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.8.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.8.4.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.3.3.83; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.9.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.3.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.8.93; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.27.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvshmem-cu12==3.2.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.8.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.8.93; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.13.1.3; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.8.93; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.8.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.8.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.8.4.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.3.3.83; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.9.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.3.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.8.93; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.27.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvshmem-cu12==3.3.9; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.8.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.8.93; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.13.1.3; platform_system == 'Linux' and platform_machine == 'x86_64'
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_10-cuda12_8-test: # Testing
@@ -882,7 +882,7 @@ jobs:
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
build_name: manywheel-py3_10-cuda12_9
build_environment: linux-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.9.86; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.9.79; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.9.79; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.9.1.4; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.4.1.4; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.10.19; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.5.82; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.10.65; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.27.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.9.79; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.9.86; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.14.1.1; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.9.86; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.9.79; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.9.79; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.9.1.4; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.4.1.4; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.10.19; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.5.82; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.10.65; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.27.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvshmem-cu12==3.3.9; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.9.79; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.9.86; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.14.1.1; platform_system == 'Linux' and platform_machine == 'x86_64'
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_10-cuda12_9-test: # Testing
@@ -1357,7 +1357,7 @@ jobs:
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
build_name: manywheel-py3_11-cuda12_6
build_environment: linux-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.6.80; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.6.4.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.3.0.4; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.7.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.1.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.4.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.27.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvshmem-cu12==3.2.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.6.85; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.11.1.6; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.6.80; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.6.4.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.3.0.4; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.7.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.1.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.4.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.27.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvshmem-cu12==3.3.9; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.6.85; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.11.1.6; platform_system == 'Linux' and platform_machine == 'x86_64'
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_11-cuda12_6-test: # Testing
@@ -1407,74 +1407,6 @@ jobs:
github-token: ${{ secrets.GITHUB_TOKEN }}
uses: ./.github/workflows/_binary-upload.yml
manywheel-py3_11-cuda12_6-full-build:
if: ${{ github.repository_owner == 'pytorch' }}
uses: ./.github/workflows/_binary-build-linux.yml
needs: get-label-type
with:
PYTORCH_ROOT: /pytorch
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu126
GPU_ARCH_VERSION: 12.6
GPU_ARCH_TYPE: cuda
DOCKER_IMAGE: manylinux2_28-builder
DOCKER_IMAGE_TAG_PREFIX: cuda12.6
use_split_build: False
DESIRED_PYTHON: "3.11"
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
build_name: manywheel-py3_11-cuda12_6-full
build_environment: linux-binary-manywheel
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_11-cuda12_6-full-test: # Testing
if: ${{ github.repository_owner == 'pytorch' }}
needs:
- manywheel-py3_11-cuda12_6-full-build
- get-label-type
uses: ./.github/workflows/_binary-test-linux.yml
with:
PYTORCH_ROOT: /pytorch
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu126
GPU_ARCH_VERSION: 12.6
GPU_ARCH_TYPE: cuda
DOCKER_IMAGE: manylinux2_28-builder
DOCKER_IMAGE_TAG_PREFIX: cuda12.6
use_split_build: False
DESIRED_PYTHON: "3.11"
build_name: manywheel-py3_11-cuda12_6-full
build_environment: linux-binary-manywheel
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
runs_on: linux.4xlarge.nvidia.gpu # for other cuda versions, we use 4xlarge runner
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_11-cuda12_6-full-upload: # Uploading
if: ${{ github.repository_owner == 'pytorch' }}
permissions:
id-token: write
contents: read
needs: manywheel-py3_11-cuda12_6-full-test
with:
PYTORCH_ROOT: /pytorch
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu126
GPU_ARCH_VERSION: 12.6
GPU_ARCH_TYPE: cuda
DOCKER_IMAGE: manylinux2_28-builder
DOCKER_IMAGE_TAG_PREFIX: cuda12.6
use_split_build: False
DESIRED_PYTHON: "3.11"
build_name: manywheel-py3_11-cuda12_6-full
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
uses: ./.github/workflows/_binary-upload.yml
manywheel-py3_11-cuda12_8-build:
if: ${{ github.repository_owner == 'pytorch' }}
uses: ./.github/workflows/_binary-build-linux.yml
@@ -1494,7 +1426,7 @@ jobs:
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
build_name: manywheel-py3_11-cuda12_8
build_environment: linux-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.8.93; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.8.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.8.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.8.4.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.3.3.83; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.9.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.3.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.8.93; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.27.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvshmem-cu12==3.2.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.8.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.8.93; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.13.1.3; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.8.93; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.8.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.8.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.8.4.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.3.3.83; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.9.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.3.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.8.93; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.27.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvshmem-cu12==3.3.9; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.8.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.8.93; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.13.1.3; platform_system == 'Linux' and platform_machine == 'x86_64'
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_11-cuda12_8-test: # Testing
@@ -1544,6 +1476,74 @@ jobs:
github-token: ${{ secrets.GITHUB_TOKEN }}
uses: ./.github/workflows/_binary-upload.yml
manywheel-py3_11-cuda12_8-full-build:
if: ${{ github.repository_owner == 'pytorch' }}
uses: ./.github/workflows/_binary-build-linux.yml
needs: get-label-type
with:
PYTORCH_ROOT: /pytorch
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu128
GPU_ARCH_VERSION: 12.8
GPU_ARCH_TYPE: cuda
DOCKER_IMAGE: manylinux2_28-builder
DOCKER_IMAGE_TAG_PREFIX: cuda12.8
use_split_build: False
DESIRED_PYTHON: "3.11"
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
build_name: manywheel-py3_11-cuda12_8-full
build_environment: linux-binary-manywheel
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_11-cuda12_8-full-test: # Testing
if: ${{ github.repository_owner == 'pytorch' }}
needs:
- manywheel-py3_11-cuda12_8-full-build
- get-label-type
uses: ./.github/workflows/_binary-test-linux.yml
with:
PYTORCH_ROOT: /pytorch
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu128
GPU_ARCH_VERSION: 12.8
GPU_ARCH_TYPE: cuda
DOCKER_IMAGE: manylinux2_28-builder
DOCKER_IMAGE_TAG_PREFIX: cuda12.8
use_split_build: False
DESIRED_PYTHON: "3.11"
build_name: manywheel-py3_11-cuda12_8-full
build_environment: linux-binary-manywheel
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
runs_on: linux.g4dn.4xlarge.nvidia.gpu # 12.8 and 12.9 build need sm_70+ runner
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_11-cuda12_8-full-upload: # Uploading
if: ${{ github.repository_owner == 'pytorch' }}
permissions:
id-token: write
contents: read
needs: manywheel-py3_11-cuda12_8-full-test
with:
PYTORCH_ROOT: /pytorch
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu128
GPU_ARCH_VERSION: 12.8
GPU_ARCH_TYPE: cuda
DOCKER_IMAGE: manylinux2_28-builder
DOCKER_IMAGE_TAG_PREFIX: cuda12.8
use_split_build: False
DESIRED_PYTHON: "3.11"
build_name: manywheel-py3_11-cuda12_8-full
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
uses: ./.github/workflows/_binary-upload.yml
manywheel-py3_11-cuda12_9-build:
if: ${{ github.repository_owner == 'pytorch' }}
uses: ./.github/workflows/_binary-build-linux.yml
@@ -1563,7 +1563,7 @@ jobs:
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
build_name: manywheel-py3_11-cuda12_9
build_environment: linux-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.9.86; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.9.79; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.9.79; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.9.1.4; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.4.1.4; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.10.19; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.5.82; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.10.65; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.27.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.9.79; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.9.86; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.14.1.1; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.9.86; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.9.79; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.9.79; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.9.1.4; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.4.1.4; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.10.19; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.5.82; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.10.65; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.27.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvshmem-cu12==3.3.9; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.9.79; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.9.86; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.14.1.1; platform_system == 'Linux' and platform_machine == 'x86_64'
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_11-cuda12_9-test: # Testing
@@ -2038,7 +2038,7 @@ jobs:
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
build_name: manywheel-py3_12-cuda12_6
build_environment: linux-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.6.80; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.6.4.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.3.0.4; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.7.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.1.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.4.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.27.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvshmem-cu12==3.2.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.6.85; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.11.1.6; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.6.80; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.6.4.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.3.0.4; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.7.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.1.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.4.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.27.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvshmem-cu12==3.3.9; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.6.85; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.11.1.6; platform_system == 'Linux' and platform_machine == 'x86_64'
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_12-cuda12_6-test: # Testing
@@ -2107,7 +2107,7 @@ jobs:
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
build_name: manywheel-py3_12-cuda12_8
build_environment: linux-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.8.93; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.8.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.8.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.8.4.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.3.3.83; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.9.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.3.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.8.93; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.27.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvshmem-cu12==3.2.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.8.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.8.93; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.13.1.3; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.8.93; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.8.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.8.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.8.4.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.3.3.83; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.9.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.3.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.8.93; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.27.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvshmem-cu12==3.3.9; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.8.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.8.93; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.13.1.3; platform_system == 'Linux' and platform_machine == 'x86_64'
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_12-cuda12_8-test: # Testing
@@ -2176,7 +2176,7 @@ jobs:
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
build_name: manywheel-py3_12-cuda12_9
build_environment: linux-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.9.86; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.9.79; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.9.79; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.9.1.4; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.4.1.4; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.10.19; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.5.82; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.10.65; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.27.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.9.79; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.9.86; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.14.1.1; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.9.86; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.9.79; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.9.79; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.9.1.4; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.4.1.4; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.10.19; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.5.82; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.10.65; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.27.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvshmem-cu12==3.3.9; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.9.79; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.9.86; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.14.1.1; platform_system == 'Linux' and platform_machine == 'x86_64'
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_12-cuda12_9-test: # Testing
@@ -2651,7 +2651,7 @@ jobs:
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
build_name: manywheel-py3_13-cuda12_6
build_environment: linux-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.6.80; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.6.4.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.3.0.4; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.7.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.1.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.4.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.27.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvshmem-cu12==3.2.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.6.85; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.11.1.6; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.6.80; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.6.4.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.3.0.4; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.7.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.1.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.4.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.27.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvshmem-cu12==3.3.9; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.6.85; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.11.1.6; platform_system == 'Linux' and platform_machine == 'x86_64'
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_13-cuda12_6-test: # Testing
@@ -2720,7 +2720,7 @@ jobs:
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
build_name: manywheel-py3_13-cuda12_8
build_environment: linux-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.8.93; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.8.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.8.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.8.4.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.3.3.83; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.9.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.3.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.8.93; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.27.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvshmem-cu12==3.2.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.8.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.8.93; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.13.1.3; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.8.93; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.8.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.8.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.8.4.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.3.3.83; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.9.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.3.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.8.93; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.27.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvshmem-cu12==3.3.9; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.8.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.8.93; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.13.1.3; platform_system == 'Linux' and platform_machine == 'x86_64'
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_13-cuda12_8-test: # Testing
@ -2789,7 +2789,7 @@ jobs:
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
build_name: manywheel-py3_13-cuda12_9
build_environment: linux-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.9.86; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.9.79; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.9.79; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.9.1.4; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.4.1.4; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.10.19; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.5.82; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.10.65; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.27.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.9.79; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.9.86; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.14.1.1; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.9.86; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.9.79; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.9.79; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.9.1.4; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.4.1.4; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.10.19; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.5.82; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.10.65; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.27.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvshmem-cu12==3.3.9; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.9.79; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.9.86; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.14.1.1; platform_system == 'Linux' and platform_machine == 'x86_64'
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_13-cuda12_9-test: # Testing
@ -3264,7 +3264,7 @@ jobs:
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
build_name: manywheel-py3_13t-cuda12_6
build_environment: linux-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.6.80; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.6.4.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.3.0.4; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.7.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.1.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.4.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.27.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvshmem-cu12==3.2.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.6.85; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.11.1.6; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.6.80; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.6.4.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.3.0.4; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.7.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.1.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.4.2; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.27.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvshmem-cu12==3.3.9; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.6.77; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.6.85; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.11.1.6; platform_system == 'Linux' and platform_machine == 'x86_64'
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_13t-cuda12_6-test: # Testing
@ -3333,7 +3333,7 @@ jobs:
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
build_name: manywheel-py3_13t-cuda12_8
build_environment: linux-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.8.93; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.8.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.8.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.8.4.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.3.3.83; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.9.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.3.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.8.93; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.27.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvshmem-cu12==3.2.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.8.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.8.93; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.13.1.3; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.8.93; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.8.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.8.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.8.4.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.3.3.83; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.9.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.3.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.8.93; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.27.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvshmem-cu12==3.3.9; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.8.90; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.8.93; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.13.1.3; platform_system == 'Linux' and platform_machine == 'x86_64'
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_13t-cuda12_8-test: # Testing
@ -3402,7 +3402,7 @@ jobs:
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
build_name: manywheel-py3_13t-cuda12_9
build_environment: linux-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.9.86; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.9.79; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.9.79; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.9.1.4; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.4.1.4; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.10.19; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.5.82; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.10.65; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.27.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.9.79; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.9.86; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.14.1.1; platform_system == 'Linux' and platform_machine == 'x86_64'
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.9.86; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.9.79; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.9.79; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.10.2.21; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.9.1.4; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.4.1.4; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.10.19; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.7.5.82; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.5.10.65; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparselt-cu12==0.7.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.27.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvshmem-cu12==3.3.9; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.9.79; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.9.86; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufile-cu12==1.14.1.1; platform_system == 'Linux' and platform_machine == 'x86_64'
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_13t-cuda12_9-test: # Testing
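Note: each PYTORCH_EXTRA_INSTALL_REQUIREMENTS value above is a single string of PEP 508 requirement specifiers joined with '|', each guarded by platform_system/platform_machine environment markers. Below is a minimal bash sketch of how such a string could be split and handed to pip; the loop and variable names are illustrative only and are not taken from the actual build scripts.

# Illustrative only: split the '|'-delimited requirement string and pass each
# PEP 508 specifier (marker included) to pip, which skips entries whose
# environment marker does not match the current platform.
IFS='|' read -r -a extra_reqs <<< "${PYTORCH_EXTRA_INSTALL_REQUIREMENTS}"
for req in "${extra_reqs[@]}"; do
    pip install "${req}"
done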


@ -788,3 +788,253 @@ jobs:
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
uses: ./.github/workflows/_binary-upload.yml
libtorch-cuda12_9-shared-with-deps-debug-build:
if: ${{ github.repository_owner == 'pytorch' }}
needs: get-label-type
runs-on: "${{ needs.get-label-type.outputs.label-type }}windows.4xlarge"
timeout-minutes: 300
env:
PYTORCH_ROOT: ${{ github.workspace }}/pytorch
PACKAGE_TYPE: libtorch
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu129
GPU_ARCH_VERSION: 12.9
GPU_ARCH_TYPE: cuda
SKIP_ALL_TESTS: 1
LIBTORCH_CONFIG: debug
LIBTORCH_VARIANT: shared-with-deps
# This is a dummy value for libtorch to work correctly with our batch scripts
# without this value pip does not get installed for some reason
DESIRED_PYTHON: "3.9"
steps:
# NOTE: These environment variables are put here so that they can be applied on every job equally
# They are also here because setting them at a workflow level doesn't give us access to the
# runner.temp variable, which we need.
- name: Populate binary env
shell: bash
run: |
echo "BINARY_ENV_FILE=${RUNNER_TEMP}/env" >> "${GITHUB_ENV}"
echo "PYTORCH_FINAL_PACKAGE_DIR=${RUNNER_TEMP}/artifacts" >> "${GITHUB_ENV}"
echo "WIN_PACKAGE_WORK_DIR=${RUNNER_TEMP}"
- name: Display EC2 information
shell: bash
run: |
set -euo pipefail
function get_ec2_metadata() {
# Pulled from instance metadata endpoint for EC2
# see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html
category=$1
curl -H "X-aws-ec2-metadata-token: $(curl -s -X PUT "http://169.254.169.254/latest/api/token" -H "X-aws-ec2-metadata-token-ttl-seconds: 30")" -fsSL "http://169.254.169.254/latest/meta-data/${category}"
}
echo "ami-id: $(get_ec2_metadata ami-id)"
echo "instance-id: $(get_ec2_metadata instance-id)"
echo "instance-type: $(get_ec2_metadata instance-type)"
echo "system info $(uname -a)"
- name: "[FB EMPLOYEES] Enable SSH (Click me for login details)"
uses: pytorch/test-infra/.github/actions/setup-ssh@main
continue-on-error: true
with:
github-secret: ${{ secrets.GITHUB_TOKEN }}
- name: Enable git long paths and symlinks on Windows and disable fsmonitor daemon
shell: bash
run: |
git config --global core.longpaths true
git config --global core.symlinks true
# https://git-scm.com/docs/git-fsmonitor--daemon. The daemon could lock
# the directory on Windows and prevent GHA from checking out as reported
# in https://github.com/actions/checkout/issues/1018
git config --global core.fsmonitor false
# Needed for binary builds, see: https://github.com/pytorch/pytorch/issues/73339#issuecomment-1058981560
- name: Enable long paths on Windows
shell: powershell
run: |
Set-ItemProperty -Path "HKLM:\\SYSTEM\CurrentControlSet\Control\FileSystem" -Name "LongPathsEnabled" -Value 1
# Since it's just a defensive command, the workflow should continue even the command fails. This step can be
# removed once Windows Defender is removed from the AMI
- name: Disables Windows Defender scheduled and real-time scanning for files in directories used by PyTorch
continue-on-error: true
shell: powershell
run: |
Add-MpPreference -ExclusionPath $(Get-Location).tostring(),$Env:TEMP -ErrorAction Ignore
# Let's both exclude the path and disable Windows Defender completely just to be sure
# that it doesn't interfere
Set-MpPreference -DisableRealtimeMonitoring $True -ErrorAction Ignore
- name: Checkout PyTorch
uses: actions/checkout@v4
with:
ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }}
submodules: recursive
path: pytorch
show-progress: false
- name: Clean PyTorch checkout
run: |
# Remove any artifacts from the previous checkouts
git clean -fxd
working-directory: pytorch
- name: Populate binary env
shell: bash
run: |
"${PYTORCH_ROOT}/.circleci/scripts/binary_populate_env.sh"
- name: Build PyTorch binary
shell: bash
run: |
"${PYTORCH_ROOT}/.circleci/scripts/binary_windows_build.sh"
- uses: actions/upload-artifact@v4.4.0
if: always()
with:
name: libtorch-cuda12_9-shared-with-deps-debug
retention-days: 14
if-no-files-found: error
path: "${{ env.PYTORCH_FINAL_PACKAGE_DIR }}"
- name: Wait until all sessions have drained
shell: powershell
working-directory: pytorch
if: always()
timeout-minutes: 120
run: |
.github\scripts\wait_for_ssh_to_drain.ps1
- name: Kill active ssh sessions if still around (Useful if workflow was cancelled)
shell: powershell
working-directory: pytorch
if: always()
run: |
.github\scripts\kill_active_ssh_sessions.ps1
libtorch-cuda12_9-shared-with-deps-debug-test: # Testing
if: ${{ github.repository_owner == 'pytorch' }}
needs:
- libtorch-cuda12_9-shared-with-deps-debug-build
- get-label-type
runs-on: "${{ needs.get-label-type.outputs.label-type }}windows.g4dn.xlarge"
timeout-minutes: 300
env:
PYTORCH_ROOT: ${{ github.workspace }}/pytorch
PACKAGE_TYPE: libtorch
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu129
GPU_ARCH_VERSION: 12.9
GPU_ARCH_TYPE: cuda
SKIP_ALL_TESTS: 1
LIBTORCH_CONFIG: debug
LIBTORCH_VARIANT: shared-with-deps
# This is a dummy value for libtorch to work correctly with our batch scripts
# without this value pip does not get installed for some reason
DESIRED_PYTHON: "3.9"
steps:
- name: Display EC2 information
shell: bash
run: |
set -euo pipefail
function get_ec2_metadata() {
# Pulled from instance metadata endpoint for EC2
# see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html
category=$1
curl -H "X-aws-ec2-metadata-token: $(curl -s -X PUT "http://169.254.169.254/latest/api/token" -H "X-aws-ec2-metadata-token-ttl-seconds: 30")" -fsSL "http://169.254.169.254/latest/meta-data/${category}"
}
echo "ami-id: $(get_ec2_metadata ami-id)"
echo "instance-id: $(get_ec2_metadata instance-id)"
echo "instance-type: $(get_ec2_metadata instance-type)"
echo "system info $(uname -a)"
- name: "[FB EMPLOYEES] Enable SSH (Click me for login details)"
uses: pytorch/test-infra/.github/actions/setup-ssh@main
continue-on-error: true
with:
github-secret: ${{ secrets.GITHUB_TOKEN }}
- name: Enable git long paths and symlinks on Windows and disable fsmonitor daemon
shell: bash
run: |
git config --global core.longpaths true
git config --global core.symlinks true
# https://git-scm.com/docs/git-fsmonitor--daemon. The daemon could lock
# the directory on Windows and prevent GHA from checking out as reported
# in https://github.com/actions/checkout/issues/1018
git config --global core.fsmonitor false
# Needed for binary builds, see: https://github.com/pytorch/pytorch/issues/73339#issuecomment-1058981560
- name: Enable long paths on Windows
shell: powershell
run: |
Set-ItemProperty -Path "HKLM:\\SYSTEM\CurrentControlSet\Control\FileSystem" -Name "LongPathsEnabled" -Value 1
# Since it's just a defensive command, the workflow should continue even the command fails. This step can be
# removed once Windows Defender is removed from the AMI
- name: Disables Windows Defender scheduled and real-time scanning for files in directories used by PyTorch
continue-on-error: true
shell: powershell
run: |
Add-MpPreference -ExclusionPath $(Get-Location).tostring(),$Env:TEMP -ErrorAction Ignore
# Let's both exclude the path and disable Windows Defender completely just to be sure
# that it doesn't interfere
Set-MpPreference -DisableRealtimeMonitoring $True -ErrorAction Ignore
- name: Checkout PyTorch
uses: actions/checkout@v4
with:
ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }}
submodules: recursive
path: pytorch
show-progress: false
- name: Clean PyTorch checkout
run: |
# Remove any artifacts from the previous checkouts
git clean -fxd
working-directory: pytorch
# NOTE: These environment variables are put here so that they can be applied on every job equally
# They are also here because setting them at a workflow level doesn't give us access to the
# runner.temp variable, which we need.
- name: Populate binary env
shell: bash
run: |
echo "BINARY_ENV_FILE=${RUNNER_TEMP}/env" >> "${GITHUB_ENV}"
echo "PYTORCH_FINAL_PACKAGE_DIR=${RUNNER_TEMP}/artifacts" >> "${GITHUB_ENV}"
echo "WIN_PACKAGE_WORK_DIR=${RUNNER_TEMP}"
- uses: actions/download-artifact@v4.1.7
name: Download Build Artifacts
with:
name: libtorch-cuda12_9-shared-with-deps-debug
path: "${{ env.PYTORCH_FINAL_PACKAGE_DIR }}"
- name: Populate binary env
shell: bash
run: |
"${PYTORCH_ROOT}/.circleci/scripts/binary_populate_env.sh"
- name: Test PyTorch binary
shell: bash
run: |
"${PYTORCH_ROOT}/.circleci/scripts/binary_windows_test.sh"
- name: Wait until all sessions have drained
shell: powershell
working-directory: pytorch
if: always()
timeout-minutes: 120
run: |
.github\scripts\wait_for_ssh_to_drain.ps1
- name: Kill active ssh sessions if still around (Useful if workflow was cancelled)
shell: powershell
working-directory: pytorch
if: always()
run: |
.github\scripts\kill_active_ssh_sessions.ps1
libtorch-cuda12_9-shared-with-deps-debug-upload: # Uploading
if: ${{ github.repository_owner == 'pytorch' }}
permissions:
id-token: write
contents: read
needs: libtorch-cuda12_9-shared-with-deps-debug-test
with:
PYTORCH_ROOT: ${{ github.workspace }}/pytorch
PACKAGE_TYPE: libtorch
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu129
GPU_ARCH_VERSION: 12.9
GPU_ARCH_TYPE: cuda
LIBTORCH_CONFIG: debug
LIBTORCH_VARIANT: shared-with-deps
# This is a dummy value for libtorch to work correctly with our batch scripts
# without this value pip does not get installed for some reason
DESIRED_PYTHON: "3.9"
build_name: libtorch-cuda12_9-shared-with-deps-debug
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
uses: ./.github/workflows/_binary-upload.yml
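Note: the two "Populate binary env" steps above rely on the GITHUB_ENV mechanism of GitHub Actions: lines appended to the file named by $GITHUB_ENV become environment variables for every later step in the same job, which is why BINARY_ENV_FILE and PYTORCH_FINAL_PACKAGE_DIR are written there instead of at the workflow level (workflow-level env cannot reference runner.temp). A minimal sketch follows; EXAMPLE_DIR is a made-up variable used only for illustration.

# In one step: export a variable to later steps by appending NAME=value to $GITHUB_ENV.
echo "EXAMPLE_DIR=${RUNNER_TEMP}/artifacts" >> "${GITHUB_ENV}"
# In a later step of the same job, the variable is already in the environment:
mkdir -p "${EXAMPLE_DIR}"
ls -la "${EXAMPLE_DIR}"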


@ -788,3 +788,253 @@ jobs:
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
uses: ./.github/workflows/_binary-upload.yml
libtorch-cuda12_9-shared-with-deps-release-build:
if: ${{ github.repository_owner == 'pytorch' }}
needs: get-label-type
runs-on: "${{ needs.get-label-type.outputs.label-type }}windows.4xlarge"
timeout-minutes: 300
env:
PYTORCH_ROOT: ${{ github.workspace }}/pytorch
PACKAGE_TYPE: libtorch
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu129
GPU_ARCH_VERSION: 12.9
GPU_ARCH_TYPE: cuda
SKIP_ALL_TESTS: 1
LIBTORCH_CONFIG: release
LIBTORCH_VARIANT: shared-with-deps
# This is a dummy value for libtorch to work correctly with our batch scripts
# without this value pip does not get installed for some reason
DESIRED_PYTHON: "3.9"
steps:
# NOTE: These environment variables are put here so that they can be applied on every job equally
# They are also here because setting them at a workflow level doesn't give us access to the
# runner.temp variable, which we need.
- name: Populate binary env
shell: bash
run: |
echo "BINARY_ENV_FILE=${RUNNER_TEMP}/env" >> "${GITHUB_ENV}"
echo "PYTORCH_FINAL_PACKAGE_DIR=${RUNNER_TEMP}/artifacts" >> "${GITHUB_ENV}"
echo "WIN_PACKAGE_WORK_DIR=${RUNNER_TEMP}"
- name: Display EC2 information
shell: bash
run: |
set -euo pipefail
function get_ec2_metadata() {
# Pulled from instance metadata endpoint for EC2
# see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html
category=$1
curl -H "X-aws-ec2-metadata-token: $(curl -s -X PUT "http://169.254.169.254/latest/api/token" -H "X-aws-ec2-metadata-token-ttl-seconds: 30")" -fsSL "http://169.254.169.254/latest/meta-data/${category}"
}
echo "ami-id: $(get_ec2_metadata ami-id)"
echo "instance-id: $(get_ec2_metadata instance-id)"
echo "instance-type: $(get_ec2_metadata instance-type)"
echo "system info $(uname -a)"
- name: "[FB EMPLOYEES] Enable SSH (Click me for login details)"
uses: pytorch/test-infra/.github/actions/setup-ssh@main
continue-on-error: true
with:
github-secret: ${{ secrets.GITHUB_TOKEN }}
- name: Enable git long paths and symlinks on Windows and disable fsmonitor daemon
shell: bash
run: |
git config --global core.longpaths true
git config --global core.symlinks true
# https://git-scm.com/docs/git-fsmonitor--daemon. The daemon could lock
# the directory on Windows and prevent GHA from checking out as reported
# in https://github.com/actions/checkout/issues/1018
git config --global core.fsmonitor false
# Needed for binary builds, see: https://github.com/pytorch/pytorch/issues/73339#issuecomment-1058981560
- name: Enable long paths on Windows
shell: powershell
run: |
Set-ItemProperty -Path "HKLM:\\SYSTEM\CurrentControlSet\Control\FileSystem" -Name "LongPathsEnabled" -Value 1
# Since it's just a defensive command, the workflow should continue even the command fails. This step can be
# removed once Windows Defender is removed from the AMI
- name: Disables Windows Defender scheduled and real-time scanning for files in directories used by PyTorch
continue-on-error: true
shell: powershell
run: |
Add-MpPreference -ExclusionPath $(Get-Location).tostring(),$Env:TEMP -ErrorAction Ignore
# Let's both exclude the path and disable Windows Defender completely just to be sure
# that it doesn't interfere
Set-MpPreference -DisableRealtimeMonitoring $True -ErrorAction Ignore
- name: Checkout PyTorch
uses: actions/checkout@v4
with:
ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }}
submodules: recursive
path: pytorch
show-progress: false
- name: Clean PyTorch checkout
run: |
# Remove any artifacts from the previous checkouts
git clean -fxd
working-directory: pytorch
- name: Populate binary env
shell: bash
run: |
"${PYTORCH_ROOT}/.circleci/scripts/binary_populate_env.sh"
- name: Build PyTorch binary
shell: bash
run: |
"${PYTORCH_ROOT}/.circleci/scripts/binary_windows_build.sh"
- uses: actions/upload-artifact@v4.4.0
if: always()
with:
name: libtorch-cuda12_9-shared-with-deps-release
retention-days: 14
if-no-files-found: error
path: "${{ env.PYTORCH_FINAL_PACKAGE_DIR }}"
- name: Wait until all sessions have drained
shell: powershell
working-directory: pytorch
if: always()
timeout-minutes: 120
run: |
.github\scripts\wait_for_ssh_to_drain.ps1
- name: Kill active ssh sessions if still around (Useful if workflow was cancelled)
shell: powershell
working-directory: pytorch
if: always()
run: |
.github\scripts\kill_active_ssh_sessions.ps1
libtorch-cuda12_9-shared-with-deps-release-test: # Testing
if: ${{ github.repository_owner == 'pytorch' }}
needs:
- libtorch-cuda12_9-shared-with-deps-release-build
- get-label-type
runs-on: "${{ needs.get-label-type.outputs.label-type }}windows.g4dn.xlarge"
timeout-minutes: 300
env:
PYTORCH_ROOT: ${{ github.workspace }}/pytorch
PACKAGE_TYPE: libtorch
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu129
GPU_ARCH_VERSION: 12.9
GPU_ARCH_TYPE: cuda
SKIP_ALL_TESTS: 1
LIBTORCH_CONFIG: release
LIBTORCH_VARIANT: shared-with-deps
# This is a dummy value for libtorch to work correctly with our batch scripts
# without this value pip does not get installed for some reason
DESIRED_PYTHON: "3.9"
steps:
- name: Display EC2 information
shell: bash
run: |
set -euo pipefail
function get_ec2_metadata() {
# Pulled from instance metadata endpoint for EC2
# see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html
category=$1
curl -H "X-aws-ec2-metadata-token: $(curl -s -X PUT "http://169.254.169.254/latest/api/token" -H "X-aws-ec2-metadata-token-ttl-seconds: 30")" -fsSL "http://169.254.169.254/latest/meta-data/${category}"
}
echo "ami-id: $(get_ec2_metadata ami-id)"
echo "instance-id: $(get_ec2_metadata instance-id)"
echo "instance-type: $(get_ec2_metadata instance-type)"
echo "system info $(uname -a)"
- name: "[FB EMPLOYEES] Enable SSH (Click me for login details)"
uses: pytorch/test-infra/.github/actions/setup-ssh@main
continue-on-error: true
with:
github-secret: ${{ secrets.GITHUB_TOKEN }}
- name: Enable git long paths and symlinks on Windows and disable fsmonitor daemon
shell: bash
run: |
git config --global core.longpaths true
git config --global core.symlinks true
# https://git-scm.com/docs/git-fsmonitor--daemon. The daemon could lock
# the directory on Windows and prevent GHA from checking out as reported
# in https://github.com/actions/checkout/issues/1018
git config --global core.fsmonitor false
# Needed for binary builds, see: https://github.com/pytorch/pytorch/issues/73339#issuecomment-1058981560
- name: Enable long paths on Windows
shell: powershell
run: |
Set-ItemProperty -Path "HKLM:\\SYSTEM\CurrentControlSet\Control\FileSystem" -Name "LongPathsEnabled" -Value 1
# Since it's just a defensive command, the workflow should continue even the command fails. This step can be
# removed once Windows Defender is removed from the AMI
- name: Disables Windows Defender scheduled and real-time scanning for files in directories used by PyTorch
continue-on-error: true
shell: powershell
run: |
Add-MpPreference -ExclusionPath $(Get-Location).tostring(),$Env:TEMP -ErrorAction Ignore
# Let's both exclude the path and disable Windows Defender completely just to be sure
# that it doesn't interfere
Set-MpPreference -DisableRealtimeMonitoring $True -ErrorAction Ignore
- name: Checkout PyTorch
uses: actions/checkout@v4
with:
ref: ${{ github.event_name == 'pull_request' && github.event.pull_request.head.sha || github.sha }}
submodules: recursive
path: pytorch
show-progress: false
- name: Clean PyTorch checkout
run: |
# Remove any artifacts from the previous checkouts
git clean -fxd
working-directory: pytorch
# NOTE: These environment variables are put here so that they can be applied on every job equally
# They are also here because setting them at a workflow level doesn't give us access to the
# runner.temp variable, which we need.
- name: Populate binary env
shell: bash
run: |
echo "BINARY_ENV_FILE=${RUNNER_TEMP}/env" >> "${GITHUB_ENV}"
echo "PYTORCH_FINAL_PACKAGE_DIR=${RUNNER_TEMP}/artifacts" >> "${GITHUB_ENV}"
echo "WIN_PACKAGE_WORK_DIR=${RUNNER_TEMP}"
- uses: actions/download-artifact@v4.1.7
name: Download Build Artifacts
with:
name: libtorch-cuda12_9-shared-with-deps-release
path: "${{ env.PYTORCH_FINAL_PACKAGE_DIR }}"
- name: Populate binary env
shell: bash
run: |
"${PYTORCH_ROOT}/.circleci/scripts/binary_populate_env.sh"
- name: Test PyTorch binary
shell: bash
run: |
"${PYTORCH_ROOT}/.circleci/scripts/binary_windows_test.sh"
- name: Wait until all sessions have drained
shell: powershell
working-directory: pytorch
if: always()
timeout-minutes: 120
run: |
.github\scripts\wait_for_ssh_to_drain.ps1
- name: Kill active ssh sessions if still around (Useful if workflow was cancelled)
shell: powershell
working-directory: pytorch
if: always()
run: |
.github\scripts\kill_active_ssh_sessions.ps1
libtorch-cuda12_9-shared-with-deps-release-upload: # Uploading
if: ${{ github.repository_owner == 'pytorch' }}
permissions:
id-token: write
contents: read
needs: libtorch-cuda12_9-shared-with-deps-release-test
with:
PYTORCH_ROOT: ${{ github.workspace }}/pytorch
PACKAGE_TYPE: libtorch
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu129
GPU_ARCH_VERSION: 12.9
GPU_ARCH_TYPE: cuda
LIBTORCH_CONFIG: release
LIBTORCH_VARIANT: shared-with-deps
# This is a dummy value for libtorch to work correctly with our batch scripts
# without this value pip does not get installed for some reason
DESIRED_PYTHON: "3.9"
build_name: libtorch-cuda12_9-shared-with-deps-release
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
uses: ./.github/workflows/_binary-upload.yml

File diff suppressed because it is too large


@ -8,6 +8,8 @@ on:
push:
tags:
- ciflow/h100-distributed/*
schedule:
- cron: 46 8 * * * # about 1:46am PDT
concurrency:
group: ${{ github.workflow }}-${{ github.event.pull_request.number || github.sha }}-${{ github.event_name == 'workflow_dispatch' }}-${{ github.event_name == 'schedule' }}
@ -25,14 +27,14 @@ jobs:
curr_branch: ${{ github.head_ref || github.ref_name }}
curr_ref_type: ${{ github.ref_type }}
linux-jammy-cuda12_8-py3_10-gcc11-sm90-build:
name: linux-jammy-cuda12.8-py3.10-gcc11-sm90
linux-jammy-cuda12_8-py3_10-gcc11-sm90-build-dist:
name: linux-jammy-cuda12.8-py3.10-gcc11-sm90-dist
uses: ./.github/workflows/_linux-build.yml
needs: get-label-type
with:
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
runner: "linux.12xlarge"
build-environment: linux-jammy-cuda12.8-py3.10-gcc11-sm90
build-environment: linux-jammy-cuda12.8-py3.10-gcc11-sm90-dist
docker-image-name: ci-image:pytorch-linux-jammy-cuda12.8-cudnn9-py3-gcc11
cuda-arch-list: '9.0'
test-matrix: |
@ -42,12 +44,12 @@ jobs:
secrets: inherit
linux-jammy-cuda12_8-py3_10-gcc11-sm90-test:
name: linux-jammy-cuda12.8-py3.10-gcc11-sm90
name: linux-jammy-cuda12.8-py3.10-gcc11-sm90-dist
uses: ./.github/workflows/_linux-test.yml
needs:
- linux-jammy-cuda12_8-py3_10-gcc11-sm90-build
- linux-jammy-cuda12_8-py3_10-gcc11-sm90-build-dist
with:
build-environment: linux-jammy-cuda12.8-py3.10-gcc11-sm90
docker-image: ${{ needs.linux-jammy-cuda12_8-py3_10-gcc11-sm90-build.outputs.docker-image }}
test-matrix: ${{ needs.linux-jammy-cuda12_8-py3_10-gcc11-sm90-build.outputs.test-matrix }}
build-environment: linux-jammy-cuda12.8-py3.10-gcc11-sm90-dist
docker-image: ${{ needs.linux-jammy-cuda12_8-py3_10-gcc11-sm90-build-dist.outputs.docker-image }}
test-matrix: ${{ needs.linux-jammy-cuda12_8-py3_10-gcc11-sm90-build-dist.outputs.test-matrix }}
secrets: inherit

.github/workflows/h100-symm-mem.yml (new file, 54 lines)

@ -0,0 +1,54 @@
name: Limited CI for symmetric memory tests on H100
on:
pull_request:
paths:
- .github/workflows/h100-symm-mem.yml
workflow_dispatch:
push:
tags:
- ciflow/h100-symm-mem/*
schedule:
- cron: 22 8 * * * # about 1:22am PDT
concurrency:
group: ${{ github.workflow }}-${{ github.event.pull_request.number || github.sha }}-${{ github.event_name == 'workflow_dispatch' }}-${{ github.event_name == 'schedule' }}
cancel-in-progress: true
jobs:
get-label-type:
if: github.repository_owner == 'pytorch'
name: get-label-type
uses: pytorch/pytorch/.github/workflows/_runner-determinator.yml@main
with:
triggering_actor: ${{ github.triggering_actor }}
issue_owner: ${{ github.event.pull_request.user.login || github.event.issue.user.login }}
curr_branch: ${{ github.head_ref || github.ref_name }}
curr_ref_type: ${{ github.ref_type }}
linux-jammy-cuda12_8-py3_10-gcc11-sm90-build-symm:
name: linux-jammy-cuda12.8-py3.10-gcc11-sm90-symm
uses: ./.github/workflows/_linux-build.yml
needs: get-label-type
with:
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
build-environment: linux-jammy-cuda12.8-py3.10-gcc11-sm90-symm
docker-image-name: ci-image:pytorch-linux-jammy-cuda12.8-cudnn9-py3-gcc11
cuda-arch-list: '9.0'
test-matrix: |
{ include: [
{ config: "h100-symm-mem", shard: 1, num_shards: 1, runner: "linux.aws.h100.4" },
]}
secrets: inherit
linux-jammy-cuda12_8-py3_10-gcc11-sm90-test:
name: linux-jammy-cuda12.8-py3.10-gcc11-sm90-symm
uses: ./.github/workflows/_linux-test.yml
needs:
- linux-jammy-cuda12_8-py3_10-gcc11-sm90-build-symm
with:
build-environment: linux-jammy-cuda12.8-py3.10-gcc11-sm90-symm
docker-image: ${{ needs.linux-jammy-cuda12_8-py3_10-gcc11-sm90-build-symm.outputs.docker-image }}
test-matrix: ${{ needs.linux-jammy-cuda12_8-py3_10-gcc11-sm90-build-symm.outputs.test-matrix }}
secrets: inherit


@ -13,7 +13,9 @@ concurrency:
group: ${{ github.workflow }}-${{ github.event.pull_request.number || github.ref_name }}-${{ github.ref_type == 'branch' && github.sha }}-${{ github.event_name == 'workflow_dispatch' }}-${{ github.event_name == 'schedule' }}
cancel-in-progress: true
permissions: read-all
permissions:
id-token: write
contents: read
jobs:
linux-jammy-cpu-py3_9-gcc11-inductor-build:


@ -13,7 +13,9 @@ concurrency:
group: ${{ github.workflow }}-${{ github.event.pull_request.number || github.ref_name }}-${{ github.ref_type == 'branch' && github.sha }}-${{ github.event_name == 'workflow_dispatch' }}-${{ github.event_name == 'schedule' }}
cancel-in-progress: true
permissions: read-all
permissions:
id-token: write
contents: read
jobs:
get-default-label-prefix:


@ -16,7 +16,9 @@ concurrency:
group: ${{ github.workflow }}-${{ github.event.pull_request.number || github.sha }}-${{ github.event_name == 'workflow_dispatch' }}
cancel-in-progress: true
permissions: read-all
permissions:
id-token: write
contents: read
jobs:
get-default-label-prefix:


@ -10,7 +10,9 @@ concurrency:
group: ${{ github.workflow }}-${{ github.event.pull_request.number || github.ref_name }}-${{ github.ref_type == 'branch' && github.sha }}-${{ github.event_name == 'workflow_dispatch' }}
cancel-in-progress: true
permissions: read-all
permissions:
id-token: write
contents: read
jobs:
get-default-label-prefix:


@ -48,7 +48,9 @@ concurrency:
group: ${{ github.workflow }}-${{ github.event.pull_request.number || github.ref_name }}-${{ github.ref_type == 'branch' && github.sha }}-${{ github.event_name == 'workflow_dispatch' }}-${{ github.event_name == 'schedule' }}
cancel-in-progress: true
permissions: read-all
permissions:
id-token: write
contents: read
jobs:
get-label-type:


@ -2,7 +2,7 @@ name: inductor-perf-nightly-h100
on:
schedule:
- cron: 15 0,8,16 * * 1-6
- cron: 15 0,4,8,12,16,20 * * 1-6
- cron: 0 7 * * 0
# NB: GitHub has an upper limit of 10 inputs here, so before we can sort it
# out, let try to run torchao cudagraphs_low_precision as part of cudagraphs
@ -63,7 +63,9 @@ concurrency:
group: ${{ github.workflow }}-${{ github.event.pull_request.number || github.ref_name }}-${{ github.ref_type == 'branch' && github.sha }}-${{ github.event_name == 'workflow_dispatch' }}-${{ github.event_name == 'schedule' }}
cancel-in-progress: true
permissions: read-all
permissions:
id-token: write
contents: read
jobs:
get-label-type:
@ -94,18 +96,22 @@ jobs:
{ config: "inductor_huggingface_perf_cuda_h100", shard: 3, num_shards: 5, runner: "linux.aws.h100" },
{ config: "inductor_huggingface_perf_cuda_h100", shard: 4, num_shards: 5, runner: "linux.aws.h100" },
{ config: "inductor_huggingface_perf_cuda_h100", shard: 5, num_shards: 5, runner: "linux.aws.h100" },
{ config: "inductor_timm_perf_cuda_h100", shard: 1, num_shards: 6, runner: "linux.aws.h100" },
{ config: "inductor_timm_perf_cuda_h100", shard: 2, num_shards: 6, runner: "linux.aws.h100" },
{ config: "inductor_timm_perf_cuda_h100", shard: 3, num_shards: 6, runner: "linux.aws.h100" },
{ config: "inductor_timm_perf_cuda_h100", shard: 4, num_shards: 6, runner: "linux.aws.h100" },
{ config: "inductor_timm_perf_cuda_h100", shard: 5, num_shards: 6, runner: "linux.aws.h100" },
{ config: "inductor_timm_perf_cuda_h100", shard: 6, num_shards: 6, runner: "linux.aws.h100" },
{ config: "inductor_torchbench_perf_cuda_h100", shard: 1, num_shards: 6, runner: "linux.aws.h100" },
{ config: "inductor_torchbench_perf_cuda_h100", shard: 2, num_shards: 6, runner: "linux.aws.h100" },
{ config: "inductor_torchbench_perf_cuda_h100", shard: 3, num_shards: 6, runner: "linux.aws.h100" },
{ config: "inductor_torchbench_perf_cuda_h100", shard: 4, num_shards: 6, runner: "linux.aws.h100" },
{ config: "inductor_torchbench_perf_cuda_h100", shard: 5, num_shards: 6, runner: "linux.aws.h100" },
{ config: "inductor_torchbench_perf_cuda_h100", shard: 6, num_shards: 6, runner: "linux.aws.h100" },
{ config: "inductor_timm_perf_cuda_h100", shard: 1, num_shards: 7, runner: "linux.aws.h100" },
{ config: "inductor_timm_perf_cuda_h100", shard: 2, num_shards: 7, runner: "linux.aws.h100" },
{ config: "inductor_timm_perf_cuda_h100", shard: 3, num_shards: 7, runner: "linux.aws.h100" },
{ config: "inductor_timm_perf_cuda_h100", shard: 4, num_shards: 7, runner: "linux.aws.h100" },
{ config: "inductor_timm_perf_cuda_h100", shard: 5, num_shards: 7, runner: "linux.aws.h100" },
{ config: "inductor_timm_perf_cuda_h100", shard: 6, num_shards: 7, runner: "linux.aws.h100" },
{ config: "inductor_timm_perf_cuda_h100", shard: 7, num_shards: 7, runner: "linux.aws.h100" },
{ config: "inductor_torchbench_perf_cuda_h100", shard: 1, num_shards: 9, runner: "linux.aws.h100" },
{ config: "inductor_torchbench_perf_cuda_h100", shard: 2, num_shards: 9, runner: "linux.aws.h100" },
{ config: "inductor_torchbench_perf_cuda_h100", shard: 3, num_shards: 9, runner: "linux.aws.h100" },
{ config: "inductor_torchbench_perf_cuda_h100", shard: 4, num_shards: 9, runner: "linux.aws.h100" },
{ config: "inductor_torchbench_perf_cuda_h100", shard: 5, num_shards: 9, runner: "linux.aws.h100" },
{ config: "inductor_torchbench_perf_cuda_h100", shard: 6, num_shards: 9, runner: "linux.aws.h100" },
{ config: "inductor_torchbench_perf_cuda_h100", shard: 7, num_shards: 9, runner: "linux.aws.h100" },
{ config: "inductor_torchbench_perf_cuda_h100", shard: 8, num_shards: 9, runner: "linux.aws.h100" },
{ config: "inductor_torchbench_perf_cuda_h100", shard: 9, num_shards: 9, runner: "linux.aws.h100" },
]}
selected-test-configs: ${{ inputs.benchmark_configs }}
secrets: inherit
@ -114,7 +120,7 @@ jobs:
name: cuda12.8-py3.10-gcc9-sm90
uses: ./.github/workflows/_linux-test.yml
needs: build
if: github.event.schedule == '15 0,8,16 * * 1-6'
if: github.event.schedule == '15 0,4,8,12,16,20 * * 1-6'
with:
build-environment: linux-jammy-cuda12.8-py3.10-gcc9-sm90
dashboard-tag: training-true-inference-true-default-true-dynamic-true-cudagraphs-true-cppwrapper-true-aotinductor-true-freezing_cudagraphs-true-cudagraphs_low_precision-true


@ -5,7 +5,7 @@ on:
tags:
- ciflow/inductor-perf-test-nightly-rocm/*
schedule:
- cron: 0 7 * * 0
- cron: 0 7 * * 0,3
# NB: GitHub has an upper limit of 10 inputs here, so before we can sort it
# out, let try to run torchao cudagraphs_low_precision as part of cudagraphs
workflow_dispatch:
@ -88,18 +88,23 @@ jobs:
docker-image-name: ci-image:pytorch-linux-jammy-rocm-n-py3
test-matrix: |
{ include: [
{ config: "inductor_huggingface_perf_rocm", shard: 1, num_shards: 3, runner: "linux.rocm.gpu.mi300.2" },
{ config: "inductor_huggingface_perf_rocm", shard: 2, num_shards: 3, runner: "linux.rocm.gpu.mi300.2" },
{ config: "inductor_huggingface_perf_rocm", shard: 3, num_shards: 3, runner: "linux.rocm.gpu.mi300.2" },
{ config: "inductor_huggingface_perf_rocm", shard: 1, num_shards: 4, runner: "linux.rocm.gpu.mi300.2" },
{ config: "inductor_huggingface_perf_rocm", shard: 2, num_shards: 4, runner: "linux.rocm.gpu.mi300.2" },
{ config: "inductor_huggingface_perf_rocm", shard: 3, num_shards: 4, runner: "linux.rocm.gpu.mi300.2" },
{ config: "inductor_huggingface_perf_rocm", shard: 4, num_shards: 4, runner: "linux.rocm.gpu.mi300.2" },
{ config: "inductor_timm_perf_rocm", shard: 1, num_shards: 5, runner: "linux.rocm.gpu.mi300.2" },
{ config: "inductor_timm_perf_rocm", shard: 2, num_shards: 5, runner: "linux.rocm.gpu.mi300.2" },
{ config: "inductor_timm_perf_rocm", shard: 3, num_shards: 5, runner: "linux.rocm.gpu.mi300.2" },
{ config: "inductor_timm_perf_rocm", shard: 4, num_shards: 5, runner: "linux.rocm.gpu.mi300.2" },
{ config: "inductor_timm_perf_rocm", shard: 5, num_shards: 5, runner: "linux.rocm.gpu.mi300.2" },
{ config: "inductor_torchbench_perf_rocm", shard: 1, num_shards: 4, runner: "linux.rocm.gpu.mi300.2" },
{ config: "inductor_torchbench_perf_rocm", shard: 2, num_shards: 4, runner: "linux.rocm.gpu.mi300.2" },
{ config: "inductor_torchbench_perf_rocm", shard: 3, num_shards: 4, runner: "linux.rocm.gpu.mi300.2" },
{ config: "inductor_torchbench_perf_rocm", shard: 4, num_shards: 4, runner: "linux.rocm.gpu.mi300.2" },
{ config: "inductor_torchbench_perf_rocm", shard: 1, num_shards: 8, runner: "linux.rocm.gpu.mi300.2" },
{ config: "inductor_torchbench_perf_rocm", shard: 2, num_shards: 8, runner: "linux.rocm.gpu.mi300.2" },
{ config: "inductor_torchbench_perf_rocm", shard: 3, num_shards: 8, runner: "linux.rocm.gpu.mi300.2" },
{ config: "inductor_torchbench_perf_rocm", shard: 4, num_shards: 8, runner: "linux.rocm.gpu.mi300.2" },
{ config: "inductor_torchbench_perf_rocm", shard: 5, num_shards: 8, runner: "linux.rocm.gpu.mi300.2" },
{ config: "inductor_torchbench_perf_rocm", shard: 6, num_shards: 8, runner: "linux.rocm.gpu.mi300.2" },
{ config: "inductor_torchbench_perf_rocm", shard: 7, num_shards: 8, runner: "linux.rocm.gpu.mi300.2" },
{ config: "inductor_torchbench_perf_rocm", shard: 8, num_shards: 8, runner: "linux.rocm.gpu.mi300.2" },
]}
secrets: inherit


@ -0,0 +1,131 @@
name: inductor-perf-nightly-x86-zen
on:
push:
tags:
- ciflow/inductor-perf-test-nightly-x86-zen/*
schedule:
# - cron: 0 7 * * 1-6
# - cron: 0 7 * * 0
# Does not perform max_autotune on CPU, so skip the weekly run setup
- cron: 0 7 * * *
# NB: GitHub has an upper limit of 10 inputs here
workflow_dispatch:
inputs:
training:
# CPU for training is not typical, but leave the option open here
description: Run training (off by default)?
required: false
type: boolean
default: false
inference:
description: Run inference (on by default)?
required: false
type: boolean
default: true
default:
description: Run inductor_default?
required: false
type: boolean
default: true
dynamic:
description: Run inductor_dynamic_shapes?
required: false
type: boolean
default: false
cppwrapper:
description: Run inductor_cpp_wrapper?
required: false
type: boolean
default: false
aotinductor:
description: Run aot_inductor for inference?
required: false
type: boolean
default: false
benchmark_configs:
description: The list of configs used the benchmark
required: false
type: string
default: inductor_huggingface_perf_cpu_x86_zen,inductor_timm_perf_cpu_x86_zen,inductor_torchbench_perf_cpu_x86_zen
concurrency:
group: ${{ github.workflow }}-${{ github.event.pull_request.number || github.ref_name }}-${{ github.ref_type == 'branch' && github.sha }}-${{ github.event_name == 'workflow_dispatch' }}-${{ github.event_name == 'schedule' }}
cancel-in-progress: true
permissions:
id-token: write
contents: read
jobs:
get-label-type:
name: get-label-type
uses: pytorch/pytorch/.github/workflows/_runner-determinator.yml@main
if: ${{ (github.event_name != 'schedule' || github.repository == 'pytorch/pytorch') && github.repository_owner == 'pytorch' }}
with:
triggering_actor: ${{ github.triggering_actor }}
issue_owner: ${{ github.event.pull_request.user.login || github.event.issue.user.login }}
curr_branch: ${{ github.head_ref || github.ref_name }}
curr_ref_type: ${{ github.ref_type }}
opt_out_experiments: lf
linux-jammy-zen-cpu-py3_9-gcc11-inductor-build:
name: linux-jammy-zen-cpu-py3.9-gcc11-inductor
uses: ./.github/workflows/_linux-build.yml
needs: get-label-type
with:
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
build-environment: linux-jammy-py3.9-gcc11-build
docker-image-name: ci-image:pytorch-linux-jammy-py3.9-gcc11-inductor-benchmarks
test-matrix: |
{ include: [
{ config: "inductor_huggingface_perf_cpu_x86_zen", shard: 1, num_shards: 3, runner: "linux.24xlarge.amd" },
{ config: "inductor_huggingface_perf_cpu_x86_zen", shard: 2, num_shards: 3, runner: "linux.24xlarge.amd" },
{ config: "inductor_huggingface_perf_cpu_x86_zen", shard: 3, num_shards: 3, runner: "linux.24xlarge.amd" },
{ config: "inductor_timm_perf_cpu_x86_zen", shard: 1, num_shards: 5, runner: "linux.24xlarge.amd" },
{ config: "inductor_timm_perf_cpu_x86_zen", shard: 2, num_shards: 5, runner: "linux.24xlarge.amd" },
{ config: "inductor_timm_perf_cpu_x86_zen", shard: 3, num_shards: 5, runner: "linux.24xlarge.amd" },
{ config: "inductor_timm_perf_cpu_x86_zen", shard: 4, num_shards: 5, runner: "linux.24xlarge.amd" },
{ config: "inductor_timm_perf_cpu_x86_zen", shard: 5, num_shards: 5, runner: "linux.24xlarge.amd" },
{ config: "inductor_torchbench_perf_cpu_x86_zen", shard: 1, num_shards: 4, runner: "linux.24xlarge.amd" },
{ config: "inductor_torchbench_perf_cpu_x86_zen", shard: 2, num_shards: 4, runner: "linux.24xlarge.amd" },
{ config: "inductor_torchbench_perf_cpu_x86_zen", shard: 3, num_shards: 4, runner: "linux.24xlarge.amd" },
{ config: "inductor_torchbench_perf_cpu_x86_zen", shard: 4, num_shards: 4, runner: "linux.24xlarge.amd" },
]}
selected-test-configs: ${{ inputs.benchmark_configs }}
secrets: inherit
linux-jammy-zen-cpu-py3_9-gcc11-inductor-test-nightly:
name: linux-jammy-zen-cpu-py3.9-gcc11-inductor
uses: ./.github/workflows/_linux-test.yml
needs: linux-jammy-zen-cpu-py3_9-gcc11-inductor-build
if: github.event.schedule == '0 7 * * *'
with:
build-environment: linux-jammy-py3.9-gcc11-build
dashboard-tag: training-false-inference-true-default-true-dynamic-true-cppwrapper-true-aotinductor-true
docker-image: ${{ needs.linux-jammy-zen-cpu-py3_9-gcc11-inductor-build.outputs.docker-image }}
test-matrix: ${{ needs.linux-jammy-zen-cpu-py3_9-gcc11-inductor-build.outputs.test-matrix }}
timeout-minutes: 720
# disable monitor in perf tests
disable-monitor: false
monitor-log-interval: 15
monitor-data-collect-interval: 4
secrets: inherit
linux-jammy-zen-cpu-py3_9-gcc11-inductor-test:
name: linux-jammy-zen-cpu-py3.9-gcc11-inductor
uses: ./.github/workflows/_linux-test.yml
needs: linux-jammy-zen-cpu-py3_9-gcc11-inductor-build
if: github.event_name == 'workflow_dispatch'
with:
build-environment: linux-jammy-py3.9-gcc11-build
dashboard-tag: training-${{ inputs.training }}-inference-${{ inputs.inference }}-default-${{ inputs.default }}-dynamic-${{ inputs.dynamic }}-cppwrapper-${{ inputs.cppwrapper }}-aotinductor-${{ inputs.aotinductor }}
docker-image: ${{ needs.linux-jammy-zen-cpu-py3_9-gcc11-inductor-build.outputs.docker-image }}
test-matrix: ${{ needs.linux-jammy-zen-cpu-py3_9-gcc11-inductor-build.outputs.test-matrix }}
timeout-minutes: 720
# disable monitor in perf tests
disable-monitor: false
monitor-log-interval: 15
monitor-data-collect-interval: 4
secrets: inherit


@ -1,6 +1,9 @@
name: inductor-perf-nightly-x86
on:
pull_request:
paths:
- .github/workflows/inductor-perf-test-nightly-x86.yml
schedule:
# - cron: 0 7 * * 1-6
# - cron: 0 7 * * 0
@ -40,6 +43,11 @@ on:
required: false
type: boolean
default: false
freezing:
description: Run freezing?
required: false
type: boolean
default: true
benchmark_configs:
description: The list of configs used the benchmark
required: false
@ -47,10 +55,12 @@ on:
default: inductor_huggingface_perf_cpu_x86,inductor_timm_perf_cpu_x86,inductor_torchbench_perf_cpu_x86
concurrency:
group: ${{ github.workflow }}-${{ github.event.pull_request.number || github.ref_name }}-${{ github.ref_type == 'branch' && github.sha }}-${{ github.event_name == 'workflow_dispatch' }}-${{ github.event_name == 'schedule' }}
group: ${{ github.workflow }}-${{ github.event.pull_request.number || github.sha }}-${{ github.event_name == 'workflow_dispatch' }}-${{ github.ref_type == 'branch' && github.sha }}-${{ github.event_name == 'schedule' }}
cancel-in-progress: true
permissions: read-all
permissions:
id-token: write
contents: read
jobs:
get-label-type:
@ -90,15 +100,14 @@ jobs:
selected-test-configs: ${{ inputs.benchmark_configs }}
secrets: inherit
linux-jammy-cpu-py3_9-gcc11-inductor-test-nightly:
linux-jammy-cpu-py3_9-gcc11-inductor-test-nightly-freezing:
name: linux-jammy-cpu-py3.9-gcc11-inductor
uses: ./.github/workflows/_linux-test.yml
needs: linux-jammy-cpu-py3_9-gcc11-inductor-build
if: github.event.schedule == '0 7 * * *'
with:
build-environment: linux-jammy-py3.9-gcc11-build
dashboard-tag: training-false-inference-true-default-true-dynamic-true-cppwrapper-true-aotinductor-true
dashboard-tag: training-false-inference-true-default-true-dynamic-true-cppwrapper-true-aotinductor-true-freezing-true
docker-image: ${{ needs.linux-jammy-cpu-py3_9-gcc11-inductor-build.outputs.docker-image }}
test-matrix: ${{ needs.linux-jammy-cpu-py3_9-gcc11-inductor-build.outputs.test-matrix }}
timeout-minutes: 720
@ -108,7 +117,6 @@ jobs:
monitor-data-collect-interval: 4
secrets: inherit
linux-jammy-cpu-py3_9-gcc11-inductor-test:
name: linux-jammy-cpu-py3.9-gcc11-inductor
uses: ./.github/workflows/_linux-test.yml
@ -116,7 +124,7 @@ jobs:
if: github.event_name == 'workflow_dispatch'
with:
build-environment: linux-jammy-py3.9-gcc11-build
dashboard-tag: training-${{ inputs.training }}-inference-${{ inputs.inference }}-default-${{ inputs.default }}-dynamic-${{ inputs.dynamic }}-cppwrapper-${{ inputs.cppwrapper }}-aotinductor-${{ inputs.aotinductor }}
dashboard-tag: training-${{ inputs.training }}-inference-${{ inputs.inference }}-default-${{ inputs.default }}-dynamic-${{ inputs.dynamic }}-cppwrapper-${{ inputs.cppwrapper }}-aotinductor-${{ inputs.aotinductor }}-freezing-${{ inputs.freezing }}
docker-image: ${{ needs.linux-jammy-cpu-py3_9-gcc11-inductor-build.outputs.docker-image }}
test-matrix: ${{ needs.linux-jammy-cpu-py3_9-gcc11-inductor-build.outputs.test-matrix }}
timeout-minutes: 720


@ -63,7 +63,9 @@ concurrency:
group: ${{ github.workflow }}-${{ github.event.pull_request.number || github.ref_name }}-${{ github.ref_type == 'branch' && github.sha }}-${{ github.event_name == 'workflow_dispatch' }}-${{ github.event_name == 'schedule' }}
cancel-in-progress: true
permissions: read-all
permissions:
id-token: write
contents: read
jobs:
get-label-type:


@ -15,7 +15,9 @@ concurrency:
cancel-in-progress: true
permissions: read-all
permissions:
id-token: write
contents: read
jobs:
get-default-label-prefix:


@ -12,7 +12,9 @@ concurrency:
group: ${{ github.workflow }}-${{ github.event.pull_request.number || github.ref_name }}-${{ github.ref_type == 'branch' && github.sha }}-${{ github.event_name == 'workflow_dispatch' }}-unittest
cancel-in-progress: true
permissions: read-all
permissions:
id-token: write
contents: read
jobs:
get-label-type:


@ -22,7 +22,9 @@ concurrency:
group: ${{ github.workflow }}-${{ github.event.pull_request.number || github.ref_name }}-${{ github.ref_type == 'branch' && github.sha }}-${{ github.event_name == 'workflow_dispatch' }}
cancel-in-progress: true
permissions: read-all
permissions:
id-token: write
contents: read
jobs:
unit-test:


@ -19,7 +19,9 @@ concurrency:
group: ${{ github.workflow }}-${{ github.event.pull_request.number || github.sha }}-${{ github.event_name == 'workflow_dispatch' }}
cancel-in-progress: true
permissions: read-all
permissions:
id-token: write
contents: read
jobs:
linux-jammy-cpu-py3_9-gcc11-opbenchmark-build:


@ -20,7 +20,9 @@ concurrency:
group: ${{ github.workflow }}-${{ github.event.pull_request.number || github.ref_name }}-${{ github.ref_type == 'branch' && github.sha }}-${{ github.event_name == 'workflow_dispatch' }}-${{ github.event_name == 'schedule' }}-${{ github.event.schedule }}
cancel-in-progress: true
permissions: read-all
permissions:
id-token: write
contents: read
jobs:
llm-td:


@ -19,7 +19,9 @@ concurrency:
group: ${{ github.workflow }}-${{ github.event.pull_request.number || github.sha }}-${{ github.event_name == 'workflow_dispatch' }}-${{ github.event_name == 'schedule' }}
cancel-in-progress: true
permissions: read-all
permissions:
id-token: write
contents: read
jobs:
llm-td:
@ -201,6 +203,7 @@ jobs:
{ config: "dynamo_wrapped", shard: 1, num_shards: 3, runner: "${{ needs.get-label-type.outputs.label-type }}linux.2xlarge" },
{ config: "dynamo_wrapped", shard: 2, num_shards: 3, runner: "${{ needs.get-label-type.outputs.label-type }}linux.2xlarge" },
{ config: "dynamo_wrapped", shard: 3, num_shards: 3, runner: "${{ needs.get-label-type.outputs.label-type }}linux.2xlarge" },
{ config: "einops", shard: 1, num_shards: 1, runner: "${{ needs.get-label-type.outputs.label-type }}linux.2xlarge" }
]}
secrets: inherit
@ -236,6 +239,7 @@ jobs:
{ config: "dynamo_wrapped", shard: 1, num_shards: 3, runner: "${{ needs.get-label-type.outputs.label-type }}linux.2xlarge" },
{ config: "dynamo_wrapped", shard: 2, num_shards: 3, runner: "${{ needs.get-label-type.outputs.label-type }}linux.2xlarge" },
{ config: "dynamo_wrapped", shard: 3, num_shards: 3, runner: "${{ needs.get-label-type.outputs.label-type }}linux.2xlarge" },
{ config: "einops", shard: 1, num_shards: 1, runner: "${{ needs.get-label-type.outputs.label-type }}linux.2xlarge" }
]}
secrets: inherit


@ -15,7 +15,9 @@ concurrency:
group: ${{ github.workflow }}-${{ github.event.pull_request.number || github.ref_name }}-${{ github.ref_type == 'branch' && github.sha }}-${{ github.event_name == 'workflow_dispatch' }}-${{ github.event_name == 'schedule' }}-${{ github.event.schedule }}
cancel-in-progress: true
permissions: read-all
permissions:
id-token: write
contents: read
jobs:
llm-td:


@ -18,7 +18,9 @@ concurrency:
group: ${{ github.workflow }}-${{ github.event.pull_request.number || github.ref_name }}-${{ github.ref_type == 'branch' && github.sha }}-${{ github.event_name == 'workflow_dispatch' }}-${{ github.event_name == 'schedule' }}-${{ github.event.schedule }}
cancel-in-progress: true
permissions: read-all
permissions:
id-token: write
contents: read
jobs:
llm-td:


@ -46,7 +46,7 @@ jobs:
ECR_DOCKER_IMAGE: ${{ steps.calculate-docker-image.outputs.docker-image }}
shell: bash
run: |
tag=${ECR_DOCKER_IMAGE##*/}
tag=${ECR_DOCKER_IMAGE##*:}
echo "docker pull ghcr.io/pytorch/ci-image:${tag/:/-}"
- name: Pull docker image


@ -16,7 +16,9 @@ concurrency:
group: ${{ github.workflow }}-${{ github.event.pull_request.number || github.ref_name }}-${{ github.ref_type == 'branch' && github.sha }}-${{ github.event_name == 'workflow_dispatch' }}-${{ github.event_name == 'schedule' }}
cancel-in-progress: true
permissions: read-all
permissions:
id-token: write
contents: read
jobs:
llm-td:


@ -7,7 +7,7 @@ on:
concurrency:
group: ${{ github.workflow }}
cancel-in-progress: false
cancel-in-progress: true
jobs:
do_update_viablestrict:


@ -5,6 +5,10 @@ on:
tags:
- ciflow/xpu/*
workflow_dispatch:
schedule:
# Run 3 times on weekdays and less frequently on weekends.
- cron: 45 0,8,16 * * 1-5
- cron: 45 4 * * 0,6
concurrency:
group: ${{ github.workflow }}-${{ github.event.pull_request.number || github.ref_name }}-${{ github.ref_type == 'branch' && github.sha }}-${{ github.event_name == 'workflow_dispatch' }}-${{ github.event_name == 'schedule' }}

.gitmodules

@ -2,10 +2,6 @@
ignore = dirty
path = third_party/pybind11
url = https://github.com/pybind/pybind11.git
[submodule "third_party/eigen"]
ignore = dirty
path = third_party/eigen
url = https://gitlab.com/libeigen/eigen.git
[submodule "third_party/googletest"]
ignore = dirty
path = third_party/googletest
@ -133,3 +129,6 @@
[submodule "third_party/flash-attention"]
path = third_party/flash-attention
url = https://github.com/Dao-AILab/flash-attention.git
[submodule "third_party/aiter"]
path = third_party/aiter
url = https://github.com/ROCm/aiter.git


@ -87,7 +87,7 @@ include_patterns = [
'torch/csrc/**/*.cpp',
'torch/nativert/**/*.h',
'torch/nativert/**/*.cpp',
'torch/standalone/**/*.h',
'torch/headeronly/**/*.h',
'test/cpp/**/*.h',
'test/cpp/**/*.cpp',
]
@ -122,6 +122,7 @@ is_formatter = true
[[linter]]
code = 'MYPY'
include_patterns = [
'setup.py',
'torch/**/*.py',
'torch/**/*.pyi',
'caffe2/**/*.py',
@ -157,18 +158,18 @@ init_command = [
'mypy==1.16.0',
'sympy==1.13.3',
'types-requests==2.27.25',
'types-PyYAML==6.0.7',
'types-pyyaml==6.0.1',
'types-tabulate==0.8.8',
'types-protobuf==5.29.1.20250403',
'types-pkg-resources==0.1.3',
'types-Jinja2==2.11.9',
'types-setuptools==79.0.0.20250422',
'types-jinja2==2.11.9',
'types-colorama==0.4.6',
'filelock==3.13.1',
'junitparser==2.1.1',
'rich==10.9.0',
'pyyaml==6.0.1',
'optree==0.13.0',
'dataclasses_json==0.6.7',
'dataclasses-json==0.6.7',
'pandas==2.2.3',
]
@ -241,7 +242,7 @@ include_patterns = [
'torch/nativert/*.cpp',
'torch/nativert/**/*.h',
'torch/nativert/**/*.cpp',
'torch/standalone/**/*.h',
'torch/headeronly/**/*.h',
]
exclude_patterns = [
# The negative filters below are to exclude files that include onnx_pb.h or
@ -1156,18 +1157,22 @@ exclude_patterns = [
'torch/_vendor/**',
'torch/_inductor/fx_passes/serialized_patterns/**',
'torch/_inductor/autoheuristic/artifacts/**',
'torch/utils/model_dump/preact.mjs',
# These files are all grandfathered in, feel free to remove from this list
# as necessary
# NOTE: remove the patterns in the order they are listed
'aten/**',
'c10/**',
'cmake/**',
'docs/**',
'functorch/**',
'scripts/**',
'aten/src/ATen/native/**',
'aten/src/ATen/native/q*/**',
'aten/src/ATen/native/[a-pA-P]*/**',
'aten/src/ATen/[a-mA-M]*/**',
'test/**',
'tools/**',
'test/test_*',
'test/[a-hA-h]*/**',
'test/distributed/**',
'torch/**',
'torchgen/**',
'torch/_*/**',
'torch/distributed/tensor/**',
]
init_command = [
'python3',


@ -671,14 +671,6 @@ flatbuffer_cc_library(
out_prefix = "torch/csrc/jit/serialization/",
)
cc_library(
name = "torch_standalone_headers",
hdrs = glob([
"torch/standalone/**/*.h"
]),
visibility = ["//visibility:public"],
)
cc_library(
name = "torch_headers",
hdrs = if_cuda(


@ -228,6 +228,8 @@ dependencies as well as the nightly binaries into the repo directory.
details.
* [cuda](aten/src/ATen/native/cuda) - CUDA implementations of
operators.
* [mps](aten/src/ATen/native/mps) - MPS implementations of
operators for Apple's Metal GPU family.
* [sparse](aten/src/ATen/native/sparse) - CPU and CUDA
implementations of COO sparse tensor operations
* [mkl](aten/src/ATen/native/mkl) [mkldnn](aten/src/ATen/native/mkldnn)
@ -343,13 +345,7 @@ command runs tests such as `TestNN.test_BCELoss` and
### Local linting
Install all prerequisites by running
```bash
make setup-lint
```
You can now run the same linting steps that are used in CI locally via `make`:
You can run the same linting steps that are used in CI locally via `make`:
```bash
make lint


@ -1,31 +1,50 @@
include MANIFEST.in
# Reference: https://setuptools.pypa.io/en/latest/userguide/miscellaneous.html
# Include source files in SDist
include CMakeLists.txt
include CITATION.cff
include LICENSE
include NOTICE
include .gitmodules
include build_variables.bzl
include mypy.ini
include requirements.txt
include ufunc_defs.bzl
include *.bzl *.bazel .bazel* BUILD *.BUILD BUILD.* WORKSPACE
include BUCK BUCK.*
include requirements*.txt
include version.txt
recursive-include android *.*
recursive-include aten *.*
recursive-include binaries *.*
recursive-include c10 *.*
recursive-include caffe2 *.*
recursive-include cmake *.*
recursive-include torch *.*
recursive-include tools *.*
recursive-include test *.*
recursive-include docs *.*
recursive-include ios *.*
recursive-include third_party *
recursive-include test *.*
recursive-include benchmarks *.*
recursive-include scripts *.*
recursive-include mypy_plugins *.*
recursive-include modules *.*
recursive-include functorch *.*
include [Mm]akefile *.[Mm]akefile [Mm]akefile.*
include [Dd]ockerfile *.[Dd]ockerfile [Dd]ockerfile.* .dockerignore
graft android
graft aten
graft binaries
graft c10
graft caffe2
graft cmake
graft functorch
graft third_party
graft tools
graft torch
graft torchgen
# FIXME: the torch-xla build during codegen will fail if this file is included in the wheel
exclude torchgen/BUILD.bazel
# Misc files and directories in SDist
include *.md
include CITATION.cff
include LICENSE NOTICE
include mypy*.ini
graft benchmarks
graft docs
graft mypy_plugins
graft scripts
# Misc files needed for custom setuptools command
include .gitignore
include .gitmodules
# Include test suites in SDist
graft test
include pytest.ini
include .coveragerc
# Prune generated/compiled files
prune torchgen/packaged
prune */__pycache__
global-exclude *.o *.so *.dylib *.a .git *.pyc *.swp
global-exclude *.o *.obj *.so *.a *.dylib *.pxd *.dll *.lib *.py[cod]
prune */.git
global-exclude .git *~ *.swp


@ -58,18 +58,32 @@ setup-env-rocm:
$(MAKE) setup-env PYTHON="$(PYTHON)" NIGHTLY_TOOL_OPTS="$(NIGHTLY_TOOL_OPTS) --rocm"
.PHONY: setup-lint
setup-lint:
setup-lint .lintbin/.lintrunner.sha256: requirements.txt pyproject.toml .lintrunner.toml
@echo "Setting up lintrunner..."
$(PIP) install lintrunner
lintrunner init
@echo "Generating .lintrunner.sha256..."
@mkdir -p .lintbin
@sha256sum requirements.txt pyproject.toml .lintrunner.toml > .lintbin/.lintrunner.sha256
.PHONY: lazy-setup-lint
lazy-setup-lint: .lintbin/.lintrunner.sha256
@if [ ! -x "$(shell command -v lintrunner)" ]; then \
$(MAKE) setup-lint; \
fi
.PHONY: lint
lint:
lintrunner
lint: lazy-setup-lint
lintrunner --all-files
.PHONY: quicklint
quicklint:
quicklint: lazy-setup-lint
lintrunner
.PHONY: quickfix
quickfix: lazy-setup-lint
lintrunner --apply-patches
# Deprecated target aliases
.PHONY: setup_env setup_env_cuda setup_env_rocm setup_lint
setup_env: setup-env


@ -200,7 +200,7 @@ If you want to compile with CUDA support, [select a supported version of CUDA fr
- [NVIDIA cuDNN](https://developer.nvidia.com/cudnn) v8.5 or above
- [Compiler](https://gist.github.com/ax3l/9489132) compatible with CUDA
Note: You could refer to the [cuDNN Support Matrix](https://docs.nvidia.com/deeplearning/cudnn/backend/latest/reference/support-matrix.html) for cuDNN versions with the various supported CUDA, CUDA driver and NVIDIA hardware
Note: You could refer to the [cuDNN Support Matrix](https://docs.nvidia.com/deeplearning/cudnn/backend/latest/reference/support-matrix.html) for cuDNN versions with the various supported CUDA, CUDA driver, and NVIDIA hardware.
If you want to disable CUDA support, export the environment variable `USE_CUDA=0`.
Other potentially useful environment variables may be found in `setup.py`. If


@ -50,6 +50,7 @@ Following is the Release Compatibility Matrix for PyTorch releases:
| PyTorch version | Python | C++ | Stable CUDA | Experimental CUDA | Stable ROCm |
| --- | --- | --- | --- | --- | --- |
| 2.8 | >=3.9, <=3.13, (3.13t experimental) | C++17 | CUDA 12.6 (CUDNN 9.10.2.21), CUDA 12.8 (CUDNN 9.10.2.21) | CUDA 12.9 (CUDNN 9.10.2.21) | ROCm 6.4 |
| 2.7 | >=3.9, <=3.13, (3.13t experimental) | C++17 | CUDA 11.8 (CUDNN 9.1.0.70), CUDA 12.6 (CUDNN 9.5.1.17) | CUDA 12.8 (CUDNN 9.7.1.26) | ROCm 6.3 |
| 2.6 | >=3.9, <=3.13, (3.13t experimental) | C++17 | CUDA 11.8, CUDA 12.4 (CUDNN 9.1.0.70) | CUDA 12.6 (CUDNN 9.5.1.17) | ROCm 6.2.4 |
| 2.5 | >=3.9, <=3.12, (3.13 experimental) | C++17 | CUDA 11.8, CUDA 12.1, CUDA 12.4, CUDNN 9.1.0.70 | None | ROCm 6.2 |
@ -73,9 +74,9 @@ Following is the release cadence. All future dates below are tentative. For late
| 2.4 | Jun 2024 | Jul 2024 | Sept 2024 | Not planned |
| 2.5 | Sep 2024 | Oct 2024 | Nov 2024 | Not planned |
| 2.6 | Dec 2024 | Jan 2025 | Not planned | Not planned |
| 2.7 | Mar 2025 | Apr 2025 | (May 2025) | (Jun 2025) |
| 2.7 | Mar 2025 | Apr 2025 | Jun 2025 | Not planned |
| 2.8 | Jun 2025 | Jul 2025 | (Aug 2025) | (Sep 2025) |
| 2.9 | Aug 2025 | Oct 2025 | (Nov 2025) | (Dec 2025) |
| 2.9 | Sept 2025 | Oct 2025 | (Nov 2025) | (Dec 2025) |
| 2.10 | Dec 2025 | Jan 2026 | (Feb 2026) | (Mar 2026) |
| 2.11 | Mar 2026 | Apr 2026 | (Jun 2026) | (Jul 2026) |


@ -169,14 +169,10 @@ file(GLOB native_transformers_hip_hip "native/transformers/hip/*.hip")
file(GLOB native_transformers_hip_cpp "native/transformers/hip/*.cpp")
file(GLOB native_quantized_cudnn_hip_cpp "native/quantized/cudnn/hip/*.cpp")
file(GLOB native_utils_cpp "native/utils/*.cpp")
# flash_attention sources
file(GLOB flash_attention_cuda_kernels_cu ${PROJECT_SOURCE_DIR}/third_party/flash-attention/csrc/flash_attn/src/*.cu)
# Flash attention C++ sources
file(GLOB flash_attention_cuda_cpp
"${PROJECT_SOURCE_DIR}/third_party/flash-attention/csrc/flash_attn/src/*.cpp"
"native/transformers/cuda/flash_attn/flash_api.cpp"
)
file(GLOB flash_attention_cuda_cpp ${PROJECT_SOURCE_DIR}/third_party/flash-attention/csrc/flash_attn/src/*.cpp)
file(GLOB native_flash_attn_api_cpp "native/transformers/cuda/flash_attn/flash_api.cpp")
# flash_attention hip sources
file(GLOB flash_attention_hip_hip "native/transformers/hip/flash_attn/*.hip")
@ -197,6 +193,10 @@ if(USE_FLASH_ATTENTION)
add_subdirectory(native/transformers/hip/flash_attn/ck)
file(GLOB flash_attention_hip_ck_hip "native/transformers/hip/flash_attn/ck/*.hip")
list(APPEND native_transformers_hip_hip ${flash_attention_hip_ck_hip})
# FAv3 Generation
add_subdirectory(native/transformers/hip/flash_attn/ck/fav_v3)
file(GLOB flash_attention_v3_hip "native/transformers/hip/flash_attn/ck/fav_v3/*.hip")
list(APPEND native_transformers_hip_hip ${flash_attention_v3_hip})
endif()
endif()
file(GLOB flash_attention_hip_aot_hip "native/transformers/hip/flash_attn/aot/*.hip")
@ -208,10 +208,29 @@ file(GLOB mem_eff_attention_cuda_cu "native/transformers/cuda/mem_eff_attention/
file(GLOB mem_eff_attention_cuda_kernels_cu "native/transformers/cuda/mem_eff_attention/kernels/*.cu")
file(GLOB mem_eff_attention_cuda_cpp "native/transformers/cuda/mem_eff_attention/*.cpp")
if(USE_CUDA AND (USE_FLASH_ATTENTION OR USE_MEM_EFF_ATTENTION))
add_library(flash_attention OBJECT EXCLUDE_FROM_ALL ${flash_attention_cuda_kernels_cu} ${flash_attention_cuda_cpp})
target_include_directories(flash_attention PUBLIC
${PROJECT_SOURCE_DIR}/third_party/flash-attention/csrc
${PROJECT_SOURCE_DIR}/third_party/flash-attention/include
${PROJECT_SOURCE_DIR}/third_party/cutlass/include
${PROJECT_SOURCE_DIR}/third_party/flash-attention/csrc/flash_attn/src
)
target_compile_definitions(flash_attention PRIVATE
# Copied from https://github.com/pytorch/pytorch/blob/a10024d7dea47c52469059a47efe376eb20adca0/caffe2/CMakeLists.txt#L1431
FLASH_NAMESPACE=pytorch_flash
FLASHATTENTION_DISABLE_ALIBI
FLASHATTENTION_DISABLE_SOFTCAP
UNFUSE_FMA
)
set_target_properties(flash_attention PROPERTIES POSITION_INDEPENDENT_CODE ON)
endif()
if(USE_FLASH_ATTENTION)
list(APPEND native_transformers_cuda_cu ${flash_attention_cuda_cu})
list(APPEND native_transformers_cuda_cu ${flash_attention_cuda_kernels_cu})
list(APPEND native_transformers_cuda_cpp ${flash_attention_cuda_cpp})
list(APPEND native_transformers_cuda_cpp ${native_flash_attn_api_cpp})
list(APPEND FLASH_ATTENTION_CUDA_SOURCES ${flash_attention_cuda_cu} ${flash_attention_cuda_kernels_cu})
list(APPEND ATen_ATTENTION_KERNEL_SRCS ${flash_attention_cuda_kernels_cu})
@ -377,6 +396,7 @@ if(USE_ROCM)
list(APPEND ATen_HIP_INCLUDE ${CMAKE_CURRENT_SOURCE_DIR}/../../../third_party/composable_kernel/include)
list(APPEND ATen_HIP_INCLUDE ${CMAKE_CURRENT_SOURCE_DIR}/../../../third_party/composable_kernel/library/include)
list(APPEND ATen_HIP_INCLUDE ${CMAKE_CURRENT_BINARY_DIR}/composable_kernel)
list(APPEND ATen_HIP_INCLUDE ${CMAKE_CURRENT_SOURCE_DIR}/../../../third_party/aiter/csrc/include)
_pytorch_rocm_generate_ck_conf()
# Next two lines are needed because TunableOp uses third-party/fmt


@ -19,9 +19,69 @@
#if defined(__aarch64__) && !defined(C10_MOBILE)
#include <cpuinfo.h>
#endif
namespace at {
namespace {
/*
  These const variables define the fp32 precisions for the different backends.
  We currently have the "generic", "cuda", and "mkldnn" backends, and the fp32
  precision can be chosen from "ieee", "tf32", "bf16", and "none". "ieee" means
  the IEEE standard floating point format; "tf32" and "bf16" mean we are allowed
  to use tf32 or bf16 as the internal computation data type for fp32
  computations; "none" means the setting is overridable by the parent node:

  generic->mkldnn->matmul
                 ->conv
                 ->rnn
         ->cuda ->matmul
                ->conv
                ->rnn
*/
const std::map<std::string, std::vector<std::string>> _fp32_precisions = {
{"generic", {{"ieee", "tf32", "bf16", "none"}}},
{"mkldnn", {{"ieee", "bf16", "none"}}},
{"cuda", {{"ieee", "tf32", "none"}}}};
// Check whether the backend and op are legal
void check_fp32_prec_backend_and_op(
const std::string& backend,
const std::string& op) {
static std::vector<std::string> backends = {"generic", "mkldnn", "cuda"};
static std::vector<std::string> operators = {"conv", "matmul", "rnn", "all"};
TORCH_CHECK(
std::find(backends.begin(), backends.end(), backend) != backends.end(),
"Invalid backend: ",
backend);
TORCH_CHECK(
std::find(operators.begin(), operators.end(), op) != operators.end(),
"Invalid operator: ",
op);
if (backend == "generic") {
TORCH_CHECK(op == "all", "Invalid operation for generic backend: ", op);
}
}
// Return whether the precision is supported by backends
bool validate_fp32_prec(
const std::string& backend,
const std::string& precision) {
auto iterp = _fp32_precisions.find(backend);
TORCH_CHECK(iterp != _fp32_precisions.end());
auto precisions = iterp->second;
bool valid = std::find(precisions.begin(), precisions.end(), precision) !=
precisions.end();
return valid;
}
C10_ALWAYS_INLINE void warn_deprecated_fp32_precision_api(){
TORCH_WARN_ONCE(
"This API is going to be deprecated, please see "
"https://pytorch.org/docs/main/notes/cuda.html#tensorfloat-32-tf32-on-ampere-and-later-devices"
);
}
} // namespace
Context::Context() = default;
// TODO: This could be bad juju if someone calls globalContext() in the
@ -115,12 +175,29 @@ void Context::setUserEnabledNNPACK(bool e) {
enabled_nnpack = e;
}
bool Context::allowTF32CuDNN() const {
bool Context::allowTF32CuDNN(const std::string& op) const {
if (op.size() == 0){
bool allow_tf32_rnn = float32Precision("cuda", "rnn") == "tf32";
bool allow_tf32_conv = float32Precision("cuda", "conv") == "tf32";
TORCH_CHECK(
allow_tf32_rnn == allow_tf32_conv && allow_tf32_rnn == allow_tf32_cudnn,
"PyTorch is checking whether allow_tf32 is enabled for cuDNN without a specific operator name,",
"but the current flag(s) indicate that cuDNN conv and cuDNN RNN have different TF32 flags.",
"This combination indicates that you have used a mix of the legacy and new APIs to set the TF32 flags. ",
"We suggest only using the new API to set the TF32 flag(s). See also: ",
"https://pytorch.org/docs/main/notes/cuda.html#tensorfloat-32-tf32-on-ampere-and-later-devices");
} else {
return float32Precision("cuda", op) == "tf32";
}
warn_deprecated_fp32_precision_api();
return allow_tf32_cudnn;
}
void Context::setAllowTF32CuDNN(bool b) {
setFloat32Precision("cuda", "rnn", b ? "tf32" : "none");
setFloat32Precision("cuda", "conv", b ? "tf32" : "none");
allow_tf32_cudnn = b;
warn_deprecated_fp32_precision_api();
}
void Context::setSDPPriorityOrder(const std::vector<int64_t>& order) {
@ -141,12 +218,13 @@ bool Context::allowTF32OneDNN() const {
return allow_tf32_onednn;
}
void Context::setAllowTF32OneDNN(bool b){
#ifdef USE_XPU
// NOLINTNEXTLINE(clang-diagnostic-unused-parameter)
void Context::setAllowTF32OneDNN(bool b){
#ifdef USE_XPU
allow_tf32_onednn = b;
#else
#else
TORCH_WARN("TF32 acceleration on top of oneDNN is available for Intel GPUs. The current Torch version does not have Intel GPU Support.");
#endif
#endif
}
bool Context::userEnabledFlashSDP() const {
@ -259,7 +337,16 @@ bool Context::allowTF32CuBLAS() const {
return false;
}
#endif
return float32_matmul_precision != at::Float32MatmulPrecision::HIGHEST;
bool legacy_allow_tf32 = float32_matmul_precision != at::Float32MatmulPrecision::HIGHEST;
bool allow_tf32_new = float32Precision("cuda", "matmul") == "tf32";
TORCH_CHECK(
legacy_allow_tf32 == allow_tf32_new,
"PyTorch is checking whether allow_tf32_new is enabled for cuBlas matmul,",
"Current status indicate that you have used mix of the legacy and new APIs to set the TF32 status for cublas matmul. ",
"We suggest only using the new API to set the TF32 flag. See also: ",
"https://pytorch.org/docs/main/notes/cuda.html#tensorfloat-32-tf32-on-ampere-and-later-devices");
warn_deprecated_fp32_precision_api();
return allow_tf32_new;
}
void Context::setAllowTF32CuBLAS(bool b) {
@ -272,27 +359,54 @@ void Context::setAllowTF32CuBLAS(bool b) {
}
#endif
float32_matmul_precision = b ? at::Float32MatmulPrecision::HIGH : at::Float32MatmulPrecision::HIGHEST;
setFloat32Precision("cuda", "matmul", b ? "tf32" : "ieee");
}
Float32MatmulPrecision Context::float32MatmulPrecision() const {
bool invalid = float32Precision("cuda", "matmul") == "tf32" &&
float32_matmul_precision == at::Float32MatmulPrecision::HIGHEST;
invalid = invalid ||
(float32Precision("mkldnn", "matmul") == "bf16" &&
float32_matmul_precision != at::Float32MatmulPrecision::MEDIUM);
TORCH_CHECK(
!invalid,
"PyTorch is checking the matmul precision without a specific backend name,",
"Current status indicate that you have used mix of the legacy and new APIs to set the matmul precision. ",
"We suggest only using the new API for matmul precision. See also: ",
"https://pytorch.org/docs/main/notes/cuda.html#tensorfloat-32-tf32-on-ampere-and-later-devices");
warn_deprecated_fp32_precision_api();
return float32_matmul_precision;
}
void Context::setFloat32MatmulPrecision(Float32MatmulPrecision p) {
float32_matmul_precision = p;
std::string Context::float32Precision(const std::string& backend, const std::string& op) const {
check_fp32_prec_backend_and_op(backend, op);
auto precision = fp32_precision.find(backend)->second.find(op)->second;
if (precision == "none")
precision = fp32_precision.find(backend)->second.find("all")->second;
if (precision == "none")
precision = fp32_precision.find("generic")->second.find("all")->second;
bool valid_prec = validate_fp32_prec(backend, precision);
return valid_prec ? precision : "none";
}
void Context::setFloat32MatmulPrecision(const std::string &s) {
auto match = [this](const std::string & s_) {
warn_deprecated_fp32_precision_api();
// TODO: consider if CuDNN field needs to also be set for potential future CuDNN ops like multi-headed attention
if (s_ == "highest") {
float32_matmul_precision = at::Float32MatmulPrecision::HIGHEST;
setFloat32Precision("cuda", "matmul", "ieee");
setFloat32Precision("mkldnn", "matmul", "ieee");
return true;
} else if (s_ == "high") {
float32_matmul_precision = at::Float32MatmulPrecision::HIGH;
setFloat32Precision("cuda", "matmul", "tf32");
setFloat32Precision("mkldnn", "matmul", "ieee");
return true;
} else if (s_ == "medium") {
float32_matmul_precision = at::Float32MatmulPrecision::MEDIUM;
setFloat32Precision("cuda", "matmul", "tf32");
setFloat32Precision("mkldnn", "matmul", "bf16");
return true;
}
return false;
@ -306,6 +420,27 @@ void Context::setFloat32MatmulPrecision(const std::string &s) {
"setFloat32MatmulPrecision call has no effect.");
}
void Context::setFloat32Precision(const std::string& backend, const std::string& op, const std::string& p) {
check_fp32_prec_backend_and_op(backend, op);
if (validate_fp32_prec(backend, p)) {
fp32_precision[backend][op] = p;
} else {
std::string msg;
auto iterp = _fp32_precisions.find(backend);
TORCH_CHECK(iterp != _fp32_precisions.end());
for (auto p : iterp->second) {
msg += p;
msg += " ";
}
TORCH_WARN(
"you have set wrong precision for backend:",
backend,
" setFloat32Precision call has no effect.",
"Please choose precision from: ",
msg);
}
}
at::LinalgBackend Context::linalgPreferredBackend() const {
return linalg_preferred_backend;
}
@ -535,13 +670,14 @@ at::QEngine Context::qEngine() const {
#endif
return qengine;
}();
return quantized_engine.value_or(_quantized_engine);
auto qt_engine = quantized_engine.load();
return qt_engine == at::QEngine::NoQEngine ? _quantized_engine : qt_engine;
}
void Context::setQEngine(at::QEngine e) {
const auto& qengines = supportedQEngines();
if (std::find(qengines.begin(), qengines.end(), e) != qengines.end()) {
quantized_engine = e;
quantized_engine.store(e);
return;
}
TORCH_CHECK(false, "quantized engine ", toString(e), " is not supported");
@ -553,17 +689,9 @@ const std::vector<at::QEngine>& Context::supportedQEngines() {
// Engines are listed in priority order: later one wins
// By default we prefer FBGEMM if we're running on server side
// QNNPACK on server side has some issue, so we disable it by default.
#ifdef C10_MOBILE
engines.push_back(at::kNoQEngine);
#ifdef USE_PYTORCH_QNNPACK
engines.push_back(at::kQNNPACK);
#endif
#else // C10_MOBILE
#ifdef USE_PYTORCH_QNNPACK
engines.push_back(at::kQNNPACK);
#endif
engines.push_back(at::kNoQEngine);
#endif // C10_MOBILE
#if AT_MKLDNN_ENABLED()
engines.push_back(at::kONEDNN);
@ -695,6 +823,7 @@ void Context::setAllowFP16ReductionCPU(bool b) {
#if defined(__aarch64__) && !defined(C10_MOBILE)
if (!cpuinfo_initialize() || !cpuinfo_has_arm_fp16_arith())
#else
// NOLINTNEXTLINE(facebook-hte-MissingBraces)
if (true)
#endif
TORCH_CHECK(false, "Float16 arithmetic is not supported by the CPU!");


@ -28,6 +28,7 @@
#include <c10/util/irange.h>
#include <cstdint>
#include <map>
#include <mutex>
namespace at {
@ -336,14 +337,20 @@ class TORCH_API Context {
void alertCuBLASConfigNotDeterministic() const;
void setFloat32MatmulPrecision(const std::string& s);
bool allowTF32CuDNN() const;
void setFloat32Precision(
const std::string& backend,
const std::string& op,
const std::string& s);
bool allowTF32CuDNN(const std::string& op = std::string()) const;
void setAllowTF32CuDNN(bool);
bool allowTF32OneDNN() const;
void setAllowTF32OneDNN(bool);
bool allowTF32CuBLAS() const;
void setAllowTF32CuBLAS(bool);
Float32MatmulPrecision float32MatmulPrecision() const;
void setFloat32MatmulPrecision(Float32MatmulPrecision p);
std::string float32Precision(
const std::string& backend,
const std::string& op) const;
bool allowFP16ReductionCuBLAS() const;
void setAllowFP16ReductionCuBLAS(bool);
bool allowBF16ReductionCuBLAS() const;
@ -465,10 +472,27 @@ class TORCH_API Context {
bool release_original_weights = false;
#endif
bool display_vmap_fallback_warnings_ = false;
std::optional<at::QEngine> quantized_engine = std::nullopt;
std::atomic<at::QEngine> quantized_engine = at::QEngine::NoQEngine;
bool enable_sparse_tensor_invariant_checks = false;
bool allow_fp16_reduction_cpu = false;
std::map<std::string, std::map<std::string, std::string>> fp32_precision = {
{"generic", {{"all", "none"}}},
{"mkldnn",
{{"matmul", "none"},
{"conv", "none"},
{"rnn", "none"},
{"all", "none"}}},
{"cuda",
{{"matmul",
float32_matmul_precision == at::Float32MatmulPrecision::HIGHEST
? "none"
: "tf32"},
{"conv", "tf32"},
{"rnn", "tf32"},
{"all", "none"}}},
};
Allocator* prev_allocator_ptr_{nullptr};
};
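
This header also swaps `std::optional<at::QEngine> quantized_engine` for a `std::atomic<at::QEngine>` initialized to `NoQEngine`, so the getter can read it without locking. A rough sketch of that sentinel-plus-atomic pattern with a plain enum follows; the names are illustrative, not ATen code.

```cpp
#include <atomic>
#include <iostream>

// Stand-in engine enum; NoQEngine doubles as the "not set yet" sentinel.
enum class QEngine { NoQEngine, FBGEMM, QNNPACK, ONEDNN };

std::atomic<QEngine> quantized_engine{QEngine::NoQEngine};

// Mirrors the getter in the diff: fall back to a computed default while the
// atomic still holds the sentinel, otherwise honor the explicit setting.
QEngine current_qengine(QEngine computed_default) {
  QEngine e = quantized_engine.load();
  return e == QEngine::NoQEngine ? computed_default : e;
}

int main() {
  std::cout << static_cast<int>(current_qengine(QEngine::FBGEMM)) << "\n";  // 1 (default)
  quantized_engine.store(QEngine::ONEDNN);
  std::cout << static_cast<int>(current_qengine(QEngine::FBGEMM)) << "\n";  // 3 (explicit)
}
```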


@ -266,19 +266,38 @@ ScalarType toScalarType(const DLDataType& dtype) {
}
namespace {
// The templated classes below are needed for supporting both:
// - DLManagedTensor
// - DLManagedTensorVersioned
template <class T>
struct ATenDLMTensor {
Tensor handle;
DLManagedTensor tensor{};
T tensor{};
};
} // namespace
static void deleter(DLManagedTensor* arg) {
delete static_cast<ATenDLMTensor*>(arg->manager_ctx);
template <class T>
void deleter(T* arg) {
delete static_cast<ATenDLMTensor<T>*>(arg->manager_ctx);
}
// Adds version information for DLManagedTensorVersioned.
// This is a no-op for the other types.
template <class T>
void fillVersion(T* tensor) {}
template <>
void fillVersion<DLManagedTensorVersioned>(
DLManagedTensorVersioned* tensor) {
tensor->flags = 0;
tensor->version.major = DLPACK_MAJOR_VERSION;
tensor->version.minor = DLPACK_MINOR_VERSION;
}
// This function returns a shared_ptr to memory managed DLpack tensor
// constructed out of ATen tensor
DLManagedTensor* toDLPack(const Tensor& src) {
template <class T>
T* toDLPackImpl(const Tensor& src) {
// create a new tensor with possibly normalized strides
// gh-83069
auto shape = src.sizes();
@ -290,10 +309,10 @@ DLManagedTensor* toDLPack(const Tensor& src) {
}
auto view = src.as_strided(shape, strides, src.storage_offset());
ATenDLMTensor* atDLMTensor(new ATenDLMTensor);
ATenDLMTensor<T>* atDLMTensor(new ATenDLMTensor<T>);
atDLMTensor->handle = view;
atDLMTensor->tensor.manager_ctx = atDLMTensor;
atDLMTensor->tensor.deleter = &deleter;
atDLMTensor->tensor.deleter = &deleter<T>;
atDLMTensor->tensor.dl_tensor.data = view.data_ptr();
c10::DeviceIndex device_id = 0;
if (src.is_cuda() || src.is_privateuseone()) {
@ -305,35 +324,68 @@ DLManagedTensor* toDLPack(const Tensor& src) {
atDLMTensor->tensor.dl_tensor.shape = view.sizes().data();
atDLMTensor->tensor.dl_tensor.strides = view.strides().data();
atDLMTensor->tensor.dl_tensor.byte_offset = 0;
fillVersion(&atDLMTensor->tensor);
return &(atDLMTensor->tensor);
}
Tensor fromDLPack(DLManagedTensor* src) {
auto deleter = [src](void* self [[maybe_unused]]) {
if (src->deleter) {
src->deleter(src);
}
};
return fromDLPack(src, std::move(deleter));
}
// Explicitly instantiate the template above for both classes.
template DLManagedTensor* toDLPackImpl<DLManagedTensor>(const Tensor&);
template DLManagedTensorVersioned* toDLPackImpl<DLManagedTensorVersioned>(const Tensor&);
Tensor fromDLPack(DLManagedTensor* src, std::function<void(void*)> deleter) {
Device device = getATenDevice(src->dl_tensor.device, src->dl_tensor.data);
ScalarType stype = toScalarType(src->dl_tensor.dtype);
if (!src->dl_tensor.strides) {
// This function constructs a Tensor from a memory-managed DLPack tensor,
// which may be either a DLManagedTensor or a DLManagedTensorVersioned.
template <class T>
at::Tensor fromDLPackImpl(T* src, std::function<void(void*)> deleter) {
if (!deleter) {
deleter = [src](void* self [[maybe_unused]]) {
if (src->deleter) {
src->deleter(src);
}
};
}
DLTensor& dl_tensor = src->dl_tensor;
Device device = getATenDevice(dl_tensor.device, dl_tensor.data);
ScalarType stype = toScalarType(dl_tensor.dtype);
if (!dl_tensor.strides) {
return at::from_blob(
src->dl_tensor.data,
IntArrayRef(src->dl_tensor.shape, src->dl_tensor.ndim),
dl_tensor.data,
IntArrayRef(dl_tensor.shape, dl_tensor.ndim),
std::move(deleter),
at::device(device).dtype(stype),
{device});
}
return at::from_blob(
src->dl_tensor.data,
IntArrayRef(src->dl_tensor.shape, src->dl_tensor.ndim),
IntArrayRef(src->dl_tensor.strides, src->dl_tensor.ndim),
dl_tensor.data,
IntArrayRef(dl_tensor.shape, dl_tensor.ndim),
IntArrayRef(dl_tensor.strides, dl_tensor.ndim),
deleter,
at::device(device).dtype(stype),
{device});
}
// Explicitly instantiate the template above for both classes.
template at::Tensor fromDLPackImpl<DLManagedTensor>(DLManagedTensor* src, std::function<void(void*)> deleter);
template at::Tensor fromDLPackImpl<DLManagedTensorVersioned>(DLManagedTensorVersioned* src, std::function<void(void*)> deleter);
} // namespace
DLManagedTensor* toDLPack(const Tensor& src) {
return toDLPackImpl<DLManagedTensor>(src);
}
DLManagedTensorVersioned* toDLPackVersioned(const Tensor& src) {
return toDLPackImpl<DLManagedTensorVersioned>(src);
}
Tensor fromDLPack(DLManagedTensor* src, std::function<void(void*)> deleter) {
return fromDLPackImpl<DLManagedTensor>(src, std::move(deleter));
}
Tensor fromDLPackVersioned(DLManagedTensorVersioned* src, std::function<void(void*)> deleter) {
return fromDLPackImpl<DLManagedTensorVersioned>(src, std::move(deleter));
}
} // namespace at
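
A minimal round trip through the versioned entry points added above. This is a sketch assuming a libtorch build where `at::toDLPackVersioned` and `at::fromDLPackVersioned` are reachable through `ATen/DLConvertor.h`; build flags and header layout are not covered by this diff.

```cpp
#include <ATen/ATen.h>
#include <ATen/DLConvertor.h>

int main() {
  at::Tensor t = at::arange(6, at::kFloat).reshape({2, 3});

  // toDLPackVersioned hands out a DLManagedTensorVersioned whose manager_ctx
  // and deleter keep the source tensor alive.
  DLManagedTensorVersioned* dl = at::toDLPackVersioned(t);

  // With the default (empty) deleter argument, fromDLPackVersioned installs a
  // deleter that calls dl->deleter once the wrapping tensor is released.
  at::Tensor roundtrip = at::fromDLPackVersioned(dl);

  TORCH_CHECK(roundtrip.equal(t), "round-trip mismatch");
  return 0;
}
```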


@ -12,10 +12,48 @@ namespace at {
TORCH_API ScalarType toScalarType(const DLDataType& dtype);
TORCH_API DLManagedTensor* toDLPack(const Tensor& src);
TORCH_API Tensor fromDLPack(DLManagedTensor* src);
TORCH_API struct DLManagedTensorVersioned* toDLPackVersioned(const Tensor& src);
TORCH_API Tensor
fromDLPack(DLManagedTensor* src, std::function<void(void*)> deleter);
fromDLPack(DLManagedTensor* src, std::function<void(void*)> deleter = {});
TORCH_API Tensor fromDLPackVersioned(
DLManagedTensorVersioned* src,
std::function<void(void*)> deleter = {});
TORCH_API DLDataType getDLDataType(const Tensor& t);
TORCH_API DLDevice getDLContext(const Tensor& tensor, const int64_t& device_id);
// This trait class is used for retrieving different attributes, such as the
// PyCapsule names and conversion functions for both DLPack tensor classes:
// `DLManagedTensor` and `DLManagedTensorVersioned`.
//
// Each specialization should contain the following traits:
// - `capsule`: actual name of the capsule
// - `used`: name of the capsule after using it
// - `toDLPack`: function for converting a tensor into a DLPack capsule
// - `fromDLPack`: function for creating a tensor from a DLPack capsule
//
// While `toDLPack` is directly exposed to Python, `fromDLPack` is not.
// Although it contains the core implementation, it lacks the required
// bookkeeping logic contained in its caller `tensor_fromDLPack`.
//
// That said, `fromDLPack` is used directly in a few DLPack tests that live
// inside ATen (no Python available).
template <class T>
struct DLPackTraits {};
template <>
struct DLPackTraits<DLManagedTensor> {
inline static const char* capsule = "dltensor";
inline static const char* used = "used_dltensor";
inline static auto toDLPack = at::toDLPack;
inline static auto fromDLPack = at::fromDLPack;
};
template <>
struct DLPackTraits<DLManagedTensorVersioned> {
inline static const char* capsule = "dltensor_versioned";
inline static const char* used = "used_dltensor_versioned";
inline static auto toDLPack = at::toDLPackVersioned;
inline static auto fromDLPack = at::fromDLPackVersioned;
};
} // namespace at
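
The traits above let callers stay generic over the two managed-tensor flavours. A sketch of what that looks like from the consumer side; `export_with_capsule_name` is a hypothetical helper, not code from this diff, and the same libtorch-build assumption as the previous sketch applies.

```cpp
#include <iostream>
#include <ATen/ATen.h>
#include <ATen/DLConvertor.h>

// Hypothetical generic exporter: picks the capsule name and the conversion
// function from DLPackTraits<T> instead of hard-coding either DLPack flavour.
template <class T>
T* export_with_capsule_name(const at::Tensor& t, const char** capsule_name) {
  *capsule_name = at::DLPackTraits<T>::capsule;
  return at::DLPackTraits<T>::toDLPack(t);
}

int main() {
  at::Tensor t = at::ones({2, 2});
  const char* name = nullptr;
  auto* managed = export_with_capsule_name<DLManagedTensorVersioned>(t, &name);
  std::cout << name << "\n";  // dltensor_versioned

  // Hand ownership straight back so nothing leaks in this sketch; the empty
  // std::function makes fromDLPack fall back to calling managed->deleter.
  at::Tensor back = at::DLPackTraits<DLManagedTensorVersioned>::fromDLPack(managed, {});
  std::cout << back.sizes() << "\n";  // [2, 2]
}
```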


@ -30,7 +30,7 @@ TORCH_API bool isAccelerator(c10::DeviceType device_type);
template <
typename... T,
typename = std::enable_if_t<(std::is_same_v<T, c10::DeviceType> && ...)>>
TORCH_API inline bool isAcceleratorExcluded(
inline bool isAcceleratorExcluded(
c10::DeviceType device_type,
c10::DeviceType first_excluded,
T... rest_excluded) {
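
The hunk above is cut off before the function body, but the signature already shows the usual variadic pattern: a fold expression constrains every trailing argument to be a DeviceType, and the body presumably compares the candidate device against each excluded entry. A standalone sketch of that pattern with a plain enum (this is not the actual ATen body):

```cpp
#include <iostream>
#include <type_traits>

enum class DeviceType { CPU, CUDA, HIP, XPU };

// The enable_if mirrors the constraint in the diff: every trailing argument
// must be a DeviceType.
template <
    typename... T,
    typename = std::enable_if_t<(std::is_same_v<T, DeviceType> && ...)>>
bool is_excluded(DeviceType device_type,
                 DeviceType first_excluded,
                 T... rest_excluded) {
  // Fold over ||: true if the candidate matches any excluded device type.
  return device_type == first_excluded || ((device_type == rest_excluded) || ...);
}

int main() {
  std::cout << std::boolalpha
            << is_excluded(DeviceType::CUDA, DeviceType::HIP, DeviceType::CUDA) << "\n"  // true
            << is_excluded(DeviceType::CPU, DeviceType::CUDA) << "\n";                   // false
}
```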

Some files were not shown because too many files have changed in this diff.