Compare commits

...

501 Commits

Author SHA1 Message Date
fa00b8cdba my local copy of pr #137410 2024-10-22 11:09:18 -07:00
575f260229 Extend vectorization with SVE(ARM) with Torch Compile (Inductor) (#134672)
**Motivation**
Enable SVE vectorization with `torch.compile`
Extends PR: #119571

* This PR enables vectorization for codegen part using SVE-256 (vec length)
* The changes can be extended to other SVE vec lengths

I've compared the existing NEON implementation against the SVE-vectorization-enabled route for `torch.compile`.
Test results are for 8 cores on an ARM Neoverse_V1.

<img width="359" alt="Screenshot 2024-08-28 at 16 02 07" src="https://github.com/user-attachments/assets/6961fbea-8285-4ca3-b92e-934a2db50ee2">

It's worth mentioning that for the standalone `SiLU` op there's a `~1.8x` speedup with `torch.compile`.
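
A minimal sketch (not from the PR's test suite) of the kind of pointwise op that benefits from the SVE-vectorized codegen on an SVE-capable aarch64 CPU:

```python
import torch
import torch.nn.functional as F

# Standalone SiLU on CPU; on an SVE-256 machine the Inductor C++ backend
# can now emit SVE-vectorized code for this kernel.
def silu(x):
    return F.silu(x)

compiled_silu = torch.compile(silu)

x = torch.randn(1024, 1024)
out = compiled_silu(x)  # first call triggers compilation with the vectorized codegen
```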

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134672
Approved by: https://github.com/jgong5, https://github.com/malfet
2024-10-10 13:20:40 +00:00
479bd1f300 Hardlock frequent periodic jobs to Meta runners (#137616)
The change in pytorch/pytorch#136785 enabled these jobs to run on LF runners; however, we saw a sudden large spike in cost once that happened last week, which would have caused us to overuse our available AWS credits. This change hardlocks the tests for these jobs to Meta runners. We need this at least until we can figure out how to handle the additional spend caused by these jobs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137616
Approved by: https://github.com/Skylion007, https://github.com/seemethere
2024-10-10 12:32:16 +00:00
f69bf005f7 Revert "In Inductor, be willing to generate deferred runtime asserts when unbacked (#137097)"
This reverts commit 4304c68a4c4d742a3ec5266b81f64a85922509c9.

Reverted https://github.com/pytorch/pytorch/pull/137097 on behalf of https://github.com/huydhn due to Sorry for reverting your change, it seems to increase the compilation time a lot causing some jobs to timeout ([comment](https://github.com/pytorch/pytorch/pull/137097#issuecomment-2404573266))
2024-10-10 09:29:05 +00:00
eea1f79a1d [AMD] use rccl.h instead of rccl/rccl.h (#135472)
Summary: We hipify NCCLUtils.h from nccl.h to rccl/rccl.h. This follows the layout of the ROCm RPM suite (the header is in include/rccl/rccl.h); however, in the source tree the header is just src/rccl.h. Using rccl/rccl.h makes us find the RPM's header but not the source tree's header.

Test Plan:
buck run mode/opt-amd-gpu -c hpc_comms.use_rccl=develop -c fbcode.split-dwarf=True  --config rccl.build_rdma_core=true --config rccl.adhoc_brcm=true  //aps_models/ads/icvr:icvr_launcher -- mode=local_ctr_cvr_cmf_rep_1000x_v1_no_atom   data_loader.dataset.table_ds=[2024-09-04]   data_loader.dataset.batch_size=512  max_ind_range=10

Without this diff, it'll show NCCL version 2.18.

Differential Revision: D62371434

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135472
Approved by: https://github.com/jeffdaily, https://github.com/cenzhaometa
2024-10-10 08:55:57 +00:00
eaab5cf0f9 Fix torch.compile correctness bug on aarch64+sve due to gcc bug (#137606)
Some unit tests relating to argmin_vec/argmax_vec were failing due to a bug in GCC affecting versions <= 12 on aarch64 platforms with SVE:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117001

Fixes #137597

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137606
Approved by: https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2024-10-10 08:44:53 +00:00
365722f606 fix test_constant_output (#137547)
Summary: Fixes a couple of problems: constants didn't have metadata before creating graph signatures, and graph signatures weren't updated when lifting constants.

Test Plan: fixed test

Differential Revision: D64081786

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137547
Approved by: https://github.com/tugsbayasgalan
2024-10-10 07:48:15 +00:00
4e8997744c [inductor] Enable coordinate descent tuning with max-autotune (#136867)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136867
Approved by: https://github.com/Chillee
2024-10-10 07:29:52 +00:00
383eba5229 Add deterministic path for CUDA cumsum (#136224)
Change `cumsum` to call its decomposition when `use_deterministic_algorithms(True)` and input is CUDA.
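
A minimal sketch of the new behavior (assuming a CUDA device is available):

```python
import torch

torch.use_deterministic_algorithms(True)

x = torch.randn(1000, device="cuda")
# With deterministic algorithms enabled, CUDA cumsum now routes through its
# decomposition instead of the non-deterministic native kernel.
y = torch.cumsum(x, dim=0)
```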

Fixes #89492
Fixes #75240

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136224
Approved by: https://github.com/ezyang, https://github.com/justinchuby, https://github.com/eqy
2024-10-10 06:59:08 +00:00
71010bf097 [Inductor][CPP] generalize the wgt tensor delete (#135101)
**Summary**
Previously, we assumed the packed weight for (`MKL/MKLDNN`) linear operations was at `new_input_nodes[1]`. However, this is not the case for `MKL linear`, where `new_input_nodes[1]` contains the original weight instead of the packed weight. To generalize the code, in this PR, we identify nodes that are present in `input_nodes` but not in `new_input_nodes`—indicating they are no longer used by the GEMM template and can be considered candidates for deletion.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135101
Approved by: https://github.com/jgong5, https://github.com/jansel
2024-10-10 06:01:09 +00:00
ea83c78174 [SymmetricMemory] set the storage_offset of tensors returned by get_buffer() to 0 (#137569)
It seems that there's a bug in `TensorMaker` - it treats `storage_offset` as bytes when calculating the storage size, but as an element count when setting the tensor's `storage_offset`. This seems to cause tensors returned by get_buffer() with a non-zero offset to report the wrong storage size.

Will look into the `TensorMaker` issue further. But for `get_buffer()`, it seems more natural to just incorporate the offset into the data pointer.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137569
Approved by: https://github.com/weifengpy
ghstack dependencies: #137567
2024-10-10 05:05:58 +00:00
96bab021c0 ATen | Fix header namespace pollution. (#137619)
Summary: Fixing a warning, so we can enable it globally.

Test Plan: Sandcastle-only, no runtime changes.

Differential Revision: D64122115

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137619
Approved by: https://github.com/Skylion007
2024-10-10 05:04:54 +00:00
1aa130e80c Avoid generating as_strided for alaising views in auto_functionalize_v2 (#137149)
During auto_functionalize_v2, if we encounter a view whose size(), stride(), and storage_offset() match the base,
we regenerate that view by calling aten.alias instead of as_strided for better performance.
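
A small illustrative sketch of the equivalence being exploited (not code from the PR):

```python
import torch

base = torch.randn(4, 4)
view = base.view(4, 4)  # size/stride/storage_offset all match the base

# Regenerating such a view with aten.alias is equivalent to the as_strided call
# but cheaper:
via_alias = torch.ops.aten.alias(base)
via_as_strided = torch.ops.aten.as_strided(
    base, view.size(), view.stride(), view.storage_offset()
)
assert via_alias.data_ptr() == via_as_strided.data_ptr()
assert via_alias.stride() == via_as_strided.stride()
```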

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137149
Approved by: https://github.com/zou3519
2024-10-10 05:00:41 +00:00
b5284a01a4 [CPU] remove keyword static for exp_u20 (#137571)
Remove the keyword `static` for the vec-register constants in the exp_u20 implementation. With the bf16 input shapes of BertLarge, the SDPA kernel improves from 5.1ms to 4.7ms on SPR with 56 threads.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137571
Approved by: https://github.com/jgong5
2024-10-10 04:52:04 +00:00
d170c410f2 Clean up op BC check list (#137634)
Summary: Remove some stale items

Test Plan: CI

Differential Revision: D64133246

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137634
Approved by: https://github.com/hl475
2024-10-10 04:29:21 +00:00
249152475d fix sequence number for group (#134578)
Summary:
Fix sequence number in execution trace dump for matching between collective/p2p op and wait in execution trace replay.

`ProcessGroupNCCL` has two sequence number counters, `seqCollective_` and `seqP2P_`.
b18ba9419e/torch/csrc/distributed/c10d/ProcessGroupNCCL.hpp (L1188-L1191)
However, `WorkNCCL` only has one sequence number member `seq_`. b18ba9419e/torch/csrc/distributed/c10d/ProcessGroupNCCL.hpp (L387)
We need to match collective and p2p with wait separately.
29b5a462dc

Depend on: https://github.com/pytorch/pytorch/pull/135132

Test Plan: buck2 run mode/dev-nosan kineto/libkineto/fb/integration_tests:pytorch_execution_trace_integration_test

Differential Revision:

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134578
Approved by: https://github.com/kwen2501, https://github.com/c-p-i-o
2024-10-10 04:24:06 +00:00
5aa9f2b660 Fixed issue with nn.Transformer().generate_square_subsequent_mask() (#137654)
Fixed issue where nn.Transformer().generate_square_subsequent_mask() didn't respect set_default_device() and set_default_dtype().

Fixes #137186
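
A minimal reproducer sketch for the fixed behavior (assuming a CUDA device is available):

```python
import torch
from torch import nn

torch.set_default_device("cuda")
torch.set_default_dtype(torch.float64)

# After this fix the mask follows the defaults instead of hard-coding CPU/float32.
mask = nn.Transformer.generate_square_subsequent_mask(5)
print(mask.device, mask.dtype)
```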

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137654
Approved by: https://github.com/mikaylagawarecki
2024-10-10 03:10:01 +00:00
b9c9f7f0fa Document ROCm environment variables and improve CMake messaging to user (#137308)
Fixes #115725. Note that the github issue title is misleading. Read the comments to understand what the problem is really about.

The PR improves the documentation and CMake's behavior for ROCM builds.

- Documentation: Two environment variables for ROCm builds are now documented: `ROCM_PATH` and `PYTORCH_ROCM_ARCH`.
- CMake: Improved diagnostic messaging and error handling with respect to `ROCM_PATH`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137308
Approved by: https://github.com/pruthvistony, https://github.com/jithunnair-amd, https://github.com/jeffdaily
2024-10-10 03:08:08 +00:00
f394fb554b Enable failing diffs for regressions on basic_modules_ListOfLinears benchmarks (#137541)
Note that basic_modules_ListOfLinears_inductor_gpu_force_shape_pad is flaky with 8% detected variance,
so I set it up with a 20% threshold (roughly 8% x 2).
The others are stable within ±1.5%.

<img width="611" alt="Screenshot 2024-10-08 at 4 19 03 PM" src="https://github.com/user-attachments/assets/103c4bc7-6be8-41bf-ac31-4b8909fabfcf">

<img width="1581" alt="Screenshot 2024-10-08 at 4 18 56 PM" src="https://github.com/user-attachments/assets/56006f7a-e7de-4966-9a05-9263195adc68">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137541
Approved by: https://github.com/aorenste
2024-10-10 02:47:38 +00:00
f9ed39c989 Autoupdate min_lrs for ReduceLROnPlateau if possible, fixes #104361 (#137637)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137637
Approved by: https://github.com/albanD
2024-10-10 01:23:30 +00:00
d50d5df2fb Add warning for non static grads in optimizer variable (#137554)
Fixes https://github.com/pytorch/pytorch/issues/112548

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137554
Approved by: https://github.com/williamwen42
2024-10-10 01:23:21 +00:00
f301f6544b fix bug for fill_empty_deterministic_ not support complex half (#137488)
Fixes #133157

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137488
Approved by: https://github.com/ezyang
2024-10-10 01:21:32 +00:00
361046718d Generate new expected results file when there is failures in diff time benchmarks (#137551)
The test also adds a signpost log for the benchmarks that pass.
To test, I ran `python check_results.py test_check_result/expected_test.csv test_check_result/result_test.csv out.csv`.
Results:
```
WIN: benchmark ('a', 'instruction count') failed, actual result 90 is -18.18% lower than expected 110 ±1.00% please update the expected results.

REGRESSION: benchmark ('b', 'memory') failed, actual result 200 is 100.00% higher than expected 100 ±+10.00% if this is an expected regression, please update the expected results.

PASS: benchmark ('c', 'something') pass, actual result 107 +7.00% is within expected 100 ±10.00%

MISSING REGRESSION TEST: benchmark ('d', 'missing-test') does not have a regression test enabled for it.

You can use the new reference expected result stored at path: out.csv.

a,instruction count,90,0.01
b,memory,200,0.1
c,something,100,0.1
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137551
Approved by: https://github.com/aorenste
2024-10-10 01:09:15 +00:00
d9f4a7d3f9 Simplify find_localzeros (#133325)
Instead of doing an N^2 connected thing, only do simplifications for binary max/min, and for very simple situations.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Differential Revision: [D64135230](https://our.internmc.facebook.com/intern/diff/D64135230)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133325
Approved by: https://github.com/albanD
2024-10-10 00:52:50 +00:00
4f45c76806 [PGNCCL] Limit access to ncclComm_ (#137573)
When non-blocking mode is enabled, we need to make sure `ncclComm_` is ready before calling NCCL APIs on it.
`NCCLComm::getNcclComm` helps us do that (thanks to a wait function inside), and is thus preferred over directly using `ncclComm_`.

To prevent `ncclComm_` from being used directly outside, e.g. in `ProcessGroupNCCL`, we also make it a private member of the `NCCLComm` class -- the external-facing wrapper.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137573
Approved by: https://github.com/Skylion007, https://github.com/shuqiangzhang, https://github.com/c-p-i-o
ghstack dependencies: #137572
2024-10-10 00:34:05 +00:00
cyy
0739efbd1f Remove reference of gcc7 from CI scripts (#137339)
Because gcc7 can't be used to build PyTorch.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137339
Approved by: https://github.com/ezyang, https://github.com/malfet
2024-10-10 00:29:29 +00:00
47a515d260 [c10d] simplify barrier implementation and further decouple CPU/GPU synchronization (#137516)
Summary:
Barrier is essentially intended to block the CPU thread (instead of GPU
streams). Previously we used two stream synchronizations (1. the current stream
blocked by the nccl stream end event, 2. the CPU thread blocked on the current
stream). This is unnecessary as we already have CPU-thread blocking
logic in wait(). Also, adding a barrier-specific code block to the general
GPU synchronize() API is intrusive and confusing.

This PR cleans this.

Test Plan:
CI

Tags:

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137516
Approved by: https://github.com/fduwjj, https://github.com/kwen2501
2024-10-09 23:55:28 +00:00
51c33c0b72 Increase the runner size of AVX* jobs to 4xlarge (#137633)
The failing test was recently moved back from slow and it requires more RAM than is available on a 2xlarge runner. It looks ok to up the instance size to 4xlarge instead. I missed periodic jobs in https://github.com/pytorch/pytorch/pull/137447

Example periodic failures de4c2a3b4e (test_cpu_repro)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137633
Approved by: https://github.com/seemethere, https://github.com/malfet
2024-10-09 23:43:49 +00:00
4304c68a4c In Inductor, be willing to generate deferred runtime asserts when unbacked (#137097)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137097
Approved by: https://github.com/angelayi
ghstack dependencies: #137091
2024-10-09 23:34:35 +00:00
6908d8d450 Enable python dispatcher for reinplacing pass (#137091)
Arguably this should be put somewhere higher up in the stack?  Not sure.

Xref: https://fb.workplace.com/groups/6829516587176185/permalink/8042762615851570/

There is a repro but I need to fix more bugs before it can be checked in

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137091
Approved by: https://github.com/bdhirsh
2024-10-09 23:34:35 +00:00
31e334ad9e [unwind] replace LONG_LONG_MAX by the portable LLONG_MAX (#125043)
This fixes a compilation error on systems with the musl c library.

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125043
Approved by: https://github.com/aaronenyeshi
2024-10-09 23:34:16 +00:00
aafa02506e [CudaDMAConnectivityDetector] improve the detection robustness (#137530)
- Previously the detection would fail before the user calls APIs such as `torch.cuda.set_device()`. This is because the detection logic requires nvml initialization. In this PR, we added explicit nvml initialization (which is idempotent).
- Previously any nvml issue that occurred in the detection logic would result in a fatal error. Now we issue an informative warning and return a topology assuming no NVLink connectivity.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137530
Approved by: https://github.com/Chillee
ghstack dependencies: #137471, #137472, #137473, #137474, #137475, #137529
2024-10-09 23:30:16 +00:00
fbaf9b62de [SymmetricMemoryOps] use float32 as the accumulator type when accumulating bfloat16 with multimem.ld_reduce (#137529)
This provides better accuracy without additional cost.

Also added documentation to `multimem_one_shot_all_reduce` to note the numerical caveats.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137529
Approved by: https://github.com/Chillee
ghstack dependencies: #137471, #137472, #137473, #137474, #137475
2024-10-09 23:30:16 +00:00
39c5122a4f [IntraNodeComm] replace all-reduce kernels with corresponding symm_mem ops (#137475)
## This Stack

Implement custom all-reduce algos available in `IntraNodeComm` as `symm_mem` ops and replace the existing `IntraNodeComm` kernels with them.

## This PR
- Replaces one-shot all-reduce with `symm_mem::one_shot_all_reduce_out`
- Replaces two-shot all-reduce with `symm_mem::two_shot_all_reduce_`
- Removes HCM all-reduce (at least for now). Due to the nature of its accumulation order, we can't guarantee the numerical consistency across all ranks.
- Removes the `IntraNodeComm` python binding (its original purpose is superseded by `SymmetricMemory`).
- Removes methods that were made for the python binding.
- Replaces nvlink detection logic with `DMAConnectivityDetector`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137475
Approved by: https://github.com/Chillee
ghstack dependencies: #137471, #137472, #137473, #137474
2024-10-09 23:30:16 +00:00
e6edfe3928 [SymmetricMemoryOps] create an out-variant for multimem_one_shot_all_reduce (#137474)
## This Stack

Implement custom all-reduce algos available in `IntraNodeComm` as `symm_mem` ops and replace the existing `IntraNodeComm` kernels with them.

## This PR

Implement `symm_mem::multimem_one_shot_all_reduce_out`. The out-variant is more suitable for `IntraNodeComm` integration.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137474
Approved by: https://github.com/Chillee
ghstack dependencies: #137471, #137472, #137473
2024-10-09 23:30:16 +00:00
b22749712c type _inductor/optimize_indexing.py (#137599)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137599
Approved by: https://github.com/Skylion007, https://github.com/eellison
2024-10-09 23:29:47 +00:00
d67b4f9e5f type _inductor/quantized_lowerings.py (#137598)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137598
Approved by: https://github.com/Skylion007
2024-10-09 23:29:26 +00:00
9b01d17b8d Use MetaProxy more pervasively (#137588)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137588
Approved by: https://github.com/ezyang
ghstack dependencies: #136674
2024-10-09 23:22:03 +00:00
13cf8360d8 [MPS] Fix testing for generator operators (#137601)
Before these changes, tests for operators like `eye` or `triu_indices` were essentially a test that the respective CPU operators are stable, as cpu_sample and mps_sample were the same.

Moved the logic to `transform_opinfo_sample_to_mps`, which in addition to copying tensors also tweaks `kwargs`.

Discovered that:
 - `torch.randn` and `torch.randint` fall into the same undefined category
 - `torch.logspace` is not implemented for MPS
 - Allow 1.0 absolute tolerance for all `torch.linspace` calls over integral input as rounding is wrong on the MPS side
 - `torch.triu_indices` are not implemented (PR is coming, this is how I've discovered this problem)
 - `torch.signal.windows.kaiser` fails because `aten::i0` is not implemented
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137601
Approved by: https://github.com/albanD
2024-10-09 23:17:11 +00:00
48fe0d56d6 Type _inductor/exc.py (#137595)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137595
Approved by: https://github.com/Skylion007
2024-10-09 23:15:06 +00:00
7408742b67 Make ignore_fresh_unbacked_symbols reentrant (#137605)
I have a test but it requires some other feature work that isn't fully baked.  Maybe this will fix an xfail.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137605
Approved by: https://github.com/albanD
2024-10-09 23:08:05 +00:00
5516ac5c21 [ROCm] Tunableop record untuned (#128813)
When TunableOp is enabled, it is easy to hit OOM since the application usually needs a large amount of GPU memory, such as when running an LLM for inference. So we need an offline mode to tune the GEMMs. This PR provides an offline mode for TunableOp:

- record untuned GEMMs to a file.

- a Python API named `tune_gemm_in_file` is added to read the untuned file and tune the GEMMs in it
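
A hedged sketch of the offline flow; the module path and file name below are assumptions, only the `tune_gemm_in_file` name comes from this PR:

```python
# Assumed location of the TunableOp Python bindings.
import torch.cuda.tunable as tunable

# After a run that recorded untuned GEMMs to a file (offline mode above),
# tune them in a separate process that doesn't hold the application's memory:
tunable.tune_gemm_in_file("tunableop_untuned0.csv")  # file name is illustrative
```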

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128813
Approved by: https://github.com/jeffdaily, https://github.com/hongxiayang, https://github.com/naromero77amd

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2024-10-09 21:59:03 +00:00
839d3568b0 [compiled autograd] fix -Wuninitialized (#137539)
https://github.com/pytorch/pytorch/pull/135663#discussion_r1792408353

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137539
Approved by: https://github.com/isuruf, https://github.com/Skylion007
2024-10-09 21:16:26 +00:00
38027b9b47 [SymmetricMemory] fix a bug where numel calculation overflows when the tensor size is large (#137567)
Fixes https://github.com/pytorch/pytorch/issues/137145

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137567
Approved by: https://github.com/Chillee, https://github.com/weifengpy
2024-10-09 20:45:57 +00:00
a93ea617b5 [FSDP2] Required mesh_dim_names for HSDP (#137436)
Two changes:
1. Require `mesh_dim_names` if using HSDP
2. Pass only the shard mesh to `fsdp_pre_all_gather`

Change 1 is technically BC breaking, but it should not be hard to fix on the user side.
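
A hedged sketch of the new requirement (API names follow the FSDP2 prototype; exact details may differ):

```python
import torch
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed._composable.fsdp import fully_shard

# HSDP now requires named mesh dims, e.g. 2-way replication x 4-way sharding.
mesh = init_device_mesh("cuda", (2, 4), mesh_dim_names=("replicate", "shard"))
model = torch.nn.Linear(16, 16)
fully_shard(model, mesh=mesh)
```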

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137436
Approved by: https://github.com/weifengpy, https://github.com/wz337
2024-10-09 20:35:09 +00:00
47af7cc962 Add compiler bisector (#131936)
This is a utility to aid torch.compile debugging. You provide a function that returns True on success or False on failure, or you do something out of process and run bisect_helper `good | bad`.

The bisector will first go through backends - `eager`, `aot_eager`, `aot_eager_decomp_partition`, `inductor` to find the first failing backend. Then, it will go through subsystems within the backend - currently limited but could be expanded - and try to find the first subsystem for which disabling fixes the problem. Once it has found the failing subsystem, it will find the number of times the subsystem is applied, and then bisect through it.

An example of how to hook it up for the aot_eager_decomp_partition backend and the decomposition subsystem:

```
    from torch._inductor.bisect_helper import BisectionManager
    if op in CURRENT_DECOMPOSITION_TABLE:
        if BisectionManager.disable_subsystem("aot_eager_decomp_partition", "decomposition", lambda: repr(op)):
            return NotImplemented
```

Once it has discovered the problematic change, it will print out the associated debug info, and you can set the same limits with `TORCH_BISECT_BACKEND` `TORCH_BISECT_SUBSYSTEM` and `TORCH_BISECT_MAX`.

We could add further options as an automated way of going through a check list for checking divergence - e.g., the mode to emulate amp casts.

Fix for https://github.com/pytorch/pytorch/issues/126546

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131936
Approved by: https://github.com/ezyang
2024-10-09 20:34:11 +00:00
cfe970260a Clarify opt-einsum usage, fix #127109 (#137596)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137596
Approved by: https://github.com/albanD
2024-10-09 20:31:24 +00:00
c73d2634b9 Revert "Log chromium event for automatic dynamic reasons (#137491)"
This reverts commit 3c1ab9367885fdb0ead5fcc14a22d6934070ca92.

Reverted https://github.com/pytorch/pytorch/pull/137491 on behalf of https://github.com/jovianjaison due to breaking internal tests ([comment](https://github.com/pytorch/pytorch/pull/137491#issuecomment-2403360486))
2024-10-09 20:24:12 +00:00
16a2c2cfd4 Revert "Introduce torch.sym_sum (#136429)"
This reverts commit 90bed32b986ab1356dc376df3985497cedbe8a29.

Reverted https://github.com/pytorch/pytorch/pull/136429 on behalf of https://github.com/ezyang due to fails internal stuff ([comment](https://github.com/pytorch/pytorch/pull/136429#issuecomment-2403335147))
2024-10-09 20:08:01 +00:00
572f506f9c [c10d] Improve split_group test (#137572)
Fix 1:
In `backend1 = pg._get_backend`, `pg` should be `ng1`.

Fix 2:
`dist.broadcast` should be called by ranks of subgroup `ng1` only.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137572
Approved by: https://github.com/Skylion007
2024-10-09 19:43:57 +00:00
70288c3c2d Remove dependency on numpy for serialization for XLA/open registration devices without numpy (#137444)
Related: https://github.com/pytorch/xla/issues/7799#issuecomment-2375818263

Follow ups: Do the same for maia and mtia

## Motivation

With the move to `weights_only` by default, we are making an explicit decision not to allowlist GLOBALs required to deserialize `numpy` tensors  by default. The implication is that backends relying on numpy for serialization will fail loudly when `torch.load` flips `weights_only`.

However, we make the observation that this dependency on numpy was legacy and is not actually needed anymore. So we can remove it, which aligns with our weights_only strategy.

## Why is this ok?

The following comment on why numpy is necessary for serialization is legacy

c87c9f0a01/torch/_tensor.py (L303-L312)

We no longer do the following, though it was the case 5 years ago in the PR that added this
> CPU storage is reconstructed with randomly initialized data, moved onto backend device, and then storage is updated to the serialized content

**Instead, what now happens is that CPU storage is constructed with data from the file *and then* moved onto the backend device.**

Old behavior (`legacy_load`): 67adda891a/torch/serialization.py (L620)
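
A small sketch of the user-facing consequence (standard `torch.save`/`torch.load` APIs):

```python
import torch

t = torch.arange(4)  # on an XLA/open-registration device this used to round-trip via numpy
torch.save(t, "t.pt")

# With the numpy dependency removed from serialization, loading works under
# weights_only=True without allowlisting any numpy globals.
loaded = torch.load("t.pt", weights_only=True)
```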

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137444
Approved by: https://github.com/albanD
2024-10-09 19:35:55 +00:00
aa61e251d4 [FSDP2] Added shard_placement_fn arg (#137496)
## Overview
This PR adds a `shard_placement_fn: Optional[Callable[[nn.Parameter], Optional[Shard]]]` arg to `fully_shard` that allows users to specify FSDP sharding on a nonzero tensor dim. If doing so, then the tensor dim size must be divisible by the FSDP shard world size.

```
# Example:
def shard_placement_fn(param: nn.Parameter) -> Optional[Shard]:
    largest_dim = largest_dim_size = -1
    for dim, dim_size in enumerate(param.shape):
        if dim_size > largest_dim_size:
            largest_dim = dim
            largest_dim_size = dim_size
    return Shard(largest_dim)

fully_shard(module, shard_placement_fn=shard_placement_fn)
```

## Follow-Ups
- **Copy kernels:** For all-gather copy-out, we currently copy-out to temporaries and then chunk-dim-0 -> cat-shard-dim, incurring an extra copy for parameters sharded on nonzero tensor dim. Similarly, for reduce-scatter copy-in, we currently chunk-shard-dim -> cat-dim-0, incurring an extra copy for gradients sharded on nonzero tensor dim. @yifuwang  has ideas for adding additional split size args to the copy ops that allows fusing these extra copies into the existing all-gather copy-out and reduce-scatter copy-in.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137496
Approved by: https://github.com/weifengpy
ghstack dependencies: #137593
2024-10-09 19:13:32 +00:00
36133f39db Tensorify compute on Python scalars (#136674)
Signed-off-by: Bob Ren <bobren@fb.com>

Commandeered from https://github.com/pytorch/pytorch/pull/130228 as I'm helping @ezyang w/ shipping dynamic float arguments in PT2. This starts with supporting torch.ops.aten.mul. I'll stack support for other operators on top in subsequent PRs to keep this scoped to the mechanics of the fx pass.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136674
Approved by: https://github.com/ezyang
2024-10-09 18:51:41 +00:00
f15edb291a type _dynamo/trace_wrapped_higher_order_op.py (#137354)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137354
Approved by: https://github.com/Skylion007, https://github.com/jansel
2024-10-09 18:35:28 +00:00
9a957e2842 [NCCL][Profiler] Add functionality to call dump function of NCCL profiler plugin (#137523)
Summary:
NCCL 2.23.4 provides the profiler plugin feature, which traces collective, p2p, proxyOps, and other events.

The diff supports the following feature: when NCCL times out, the flight recorder can also dump traces in the profiler plugin.

Test Plan:
```
        tensor = torch.tensor([dist.get_rank()], dtype=torch.int32, device=dev)
        # Create a list with same number of elements as world size (aka no. of ranks)
        # During allgather this list is going to be populated with tensors from all ranks (aka all gather)
        gathered_tensors = [torch.zeros_like(tensor) for _ in range(WORLD_SIZE)]
        # get collective from all ranks
        if i <= 10 or RANK != 0:
            dist.all_gather(gathered_tensors, tensor)
```
My script triggers the flight recorder.
```
trainer/0 [0]:E0927 12:07:22.643702 1012209 ProcessGroupNCCL.cpp:1356] [PG ID 0 PG GUID 0(default_pg) Rank 0] ProcessGroupNCCL preparing to dump debug info.
trainer/0 [0]:I0927 12:07:22.643784 1012209 ProcessGroupNCCL.cpp:392] NCCL_PROFILER_PLUGIN: /data/users/zhiyongww/fbsource/fbcode/scripts/nbahl/libnccl_profiler_plugin.so
trainer/0 [0]:I0927 12:07:22.643805 1012209 plugin.cpp:559] Profiler start dump
trainer/0 [0]:I0927 12:07:22.645249 1012209 ProcessGroupNCCL.cpp:1363] [PG ID 0 PG GUID 0(default_pg) Rank 0] ProcessGroupNCCL dumping nccl trace to /tmp/nccl_trace_rank_0
trainer/0 [0]:I0927 12:07:22.645418 1012209 NCCLUtils.cpp:348] Finished writing NCCLPG debug info to /tmp/nccl_trace_rank_0
```
Content from /tmp/nccl_trace_rank_0: P1614645283

Differential Revision: D61929401

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137523
Approved by: https://github.com/c-p-i-o
2024-10-09 18:19:33 +00:00
394c143e4e [dynamo] Fix error when inlining certain nested closure returned by another function (#137510)
See `test_inline_closure_returned_by_another_function_and_captures` and #136814 for more context.

In #90286, we introduced an optimization so that for captured cells that are unmodified during a Dynamo trace, `UserFunctionVariable` will represent them as a variable of the cell's actual value, rather than a `NewCellVariable`.

Later on we introduced more mechanisms to model such cells across function calls (#104222), and across function calls where `NestedUserFunctionVariable::bind_args` need to look up further in the parent frames (#106491) to find these cells' values.

This patch removes `InlinedClosureVariable` in favor of a simpler modelling, which is also more consistent with what was introduced in #90286, i.e., just model these cells as their contents, in `symbolic_locals`.

This fixes #136814 because resolution of `InlinedClosureVariable` to the underlying cell content value happens in
`NestedUserFunctionVariable::bind_args`, which requires Dynamo to have the value in scope at the function call site (when Dynamo does inlining), but that's not always the case (as the test case shows). However, if we model the cells in `symbolic_locals`, we never need such resolution, and the values are directly stored into `NestedUserFunctionVariable::closure` upon function creation, at which point Dynamo always has the cell value in `symbolic_locals` to look up.

Fixes #136814.
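
A hedged sketch of the kind of pattern this fixes (names are illustrative, not the actual test):

```python
import torch

def make_adder(x):
    def make_inner():
        def inner(y):
            # `inner` is a nested closure returned by another function; it
            # captures the cell `x`, which is not in scope at the call site.
            return x + y
        return inner
    return make_inner()

@torch.compile(backend="eager", fullgraph=True)
def fn(t):
    add_one = make_adder(1)
    return add_one(t)

fn(torch.ones(3))
```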

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137510
Approved by: https://github.com/williamwen42
2024-10-09 18:13:57 +00:00
018dabff20 [ONNX] Implement patch for jit.isinstance (#137592)
Patch torch.jit.isinstance for users for models to be dynamo exportable. Replaces https://github.com/pytorch/pytorch/pull/137487.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137592
Approved by: https://github.com/titaiwangms, https://github.com/xadupre
2024-10-09 18:06:52 +00:00
ceb2fcc5db [FSDP2] Fixed incorrect tensor meta after .to(dtype) (#137593)
This fixes https://github.com/pytorch/pytorch/issues/137522. After a method that changes the module parameters (like `.to(torch.float64)`), we need to update the `DTensorSpec`, whose `TensorMeta`'s dtype may have changed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137593
Approved by: https://github.com/Skylion007
2024-10-09 17:57:11 +00:00
bae8d5853e [TorchRec][PT2 compile] enable dynamo in _get_user_embeddings (#136798)
Summary:
# context
* enable the `_get_user_embeddings` function
* run failed at P1610151892
```
  torch._dynamo.exc.BackendCompilerFailed: backend='inductor' raised:
  GuardOnDataDependentSymNode: Could not guard on data-dependent expression u22 <= 0 (unhinted: u22 <= 0).  (Size-like symbols: u22)

  ATTENTION: guard_size_oblivious would fix the error, evaluating expression to False.
  Maybe you need to add guard_size_oblivious to framework code, see doc below for more guidance.

  Potential framework code culprit (scroll up for full backtrace):
    File "/data/users/hhy/fbsource/buck-out/v2/gen/fbcode/38472faba4e3e6c1/aps_models/ads/icvr/__icvr_launcher_live__/icvr_launcher_live#link-tree/torch/_decomp/decompositions.py", line 1692, in native_layer_norm_backward
      if M <= 0 or N <= 0:
```
```
    N = prod(inner_dims)  # type: ignore[arg-type]
    M = prod(outer_dims)  # type: ignore[arg-type]
    if M <= 0 or N <= 0:
        return (
            input.new_zeros(input_shape) if output_mask[0] else None,
            input.new_zeros(input_shape[axis:]) if output_mask[1] else None,
            input.new_zeros(input_shape[axis:]) if output_mask[2] else None,
        )
```
# changes
* use guard_size_oblivious since the new_zeros return is a kind of optimization and shouldn't impact the correctness of the follow-up code logic.
* the size `ret[i][j]` could be zero, so the change in V1 isn't valid
* for more details: [post](https://fb.workplace.com/groups/6829516587176185/permalink/8003616173099548/)
```
    from torch.fx.experimental.symbolic_shapes import guard_size_oblivious
    if guard_size_oblivious(M <= 0) or guard_size_oblivious(N <= 0):
```

# past
* found `u22` was introduced at
```
    def _wait_impl(self) -> List[List[int]]:
        # Can not use is_torchdynamo_compiling(), as every such condition should be independent for compilation with graph breaks.
        if isinstance(self._splits_awaitable, dist.Work):
            self._splits_awaitable.wait()

        ret = self._output_tensor.view(self.num_workers, -1).T.tolist()  # <------ u22 introduced here

        if not torch.jit.is_scripting() and is_torchdynamo_compiling():
            for i in range(len(ret)):
                for j in range(len(ret[i])):
                    torch._check_is_size(ret[i][j])   # <----------  my question: why the _check_is_size isn't enough??
                    torch._check(ret[i][j] > 0)   # <------ added by diff V1
```

Test Plan:
# run command
```
TORCH_SHOW_CPP_STACKTRACES=1 TORCHDYNAMO_EXTENDED_DEBUG_CPP=1 TORCH_LOGS="+graph_code,output_code,dynamic,aot,guards,verbose_guards,recompiles,graph_breaks" TORCH_TRACE=/var/tmp/tt buck2 run fbcode//mode/opt fbcode//aps_models/ads/icvr:icvr_launcher_live -- mode=fmc/local_ig_fm_v4_mini training.pipeline_type=pt2 2>&1 | tee -a `tagT`.`tagH`.log
```

# results
* before
**without enabling `_get_user_embeddings`**
[14 Failures and Restarts](https://interncache-all.fbcdn.net/manifold/tlparse_reports/tree/logs/.tmp2eNI7p/failures_and_restarts.html)
log: P1610151892
{F1889387940}
* V1
enable `_get_user_embeddings`
with `torch._check(ret[i][j] > 0)`
[13 Failures and Restarts](https://interncache-all.fbcdn.net/manifold/tlparse_reports/tree/logs/.tmp6J1iY9/failures_and_restarts.html)
{F1889388378}
* V2
enable `_get_user_embeddings`
with `if guard_size_oblivious(M <= 0) or guard_size_oblivious(N <= 0):`
[tlparse](https://interncache-all.fbcdn.net/manifold/tlparse_reports/tree/logs/.tmpFhZZyC/index.html)

Differential Revision: D63424929

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136798
Approved by: https://github.com/ezyang
2024-10-09 17:19:45 +00:00
4d45536e92 Save aot graph code in AOTAutogradCache for logging purposes (#137432)
Save the string graph code from print_readable

Differential Revision: [D63985711](https://our.internmc.facebook.com/intern/diff/D63985711/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137432
Approved by: https://github.com/bdhirsh
ghstack dependencies: #137431
2024-10-09 16:59:08 +00:00
b71d0ac3b1 remove unused variable (#137565)
per title
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137565
Approved by: https://github.com/Skylion007
2024-10-09 16:31:43 +00:00
ae03c0cff3 Add microbenchmark for FxGraphHashDetails.debug_lines (#137506)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137506
Approved by: https://github.com/jamesjwu
2024-10-09 16:15:05 +00:00
e945b6600d Support 3.8 compile again (#137587)
This is not going to be very reliable since we don't have CI though...

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137587
Approved by: https://github.com/Skylion007
2024-10-09 15:54:52 +00:00
1d15dd7891 Fix triton_reshape to properly expand Min keyword in triton codegen (#137357)
Summary: Previously triton_reshape would generate code with the `Min` keyword in it, which is incorrect. This diff updates the triton_reshape function to properly expand the `Min` keyword to `<`.

Test Plan:
```
buck2 run @//mode/{opt,mtia,inplace} //glow/fb/fx/fba/tests:test_fba_inductor -- -r test_Min_keyword_in_block_shape
```

Differential Revision: D63850158

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137357
Approved by: https://github.com/blaine-rister, https://github.com/eellison
2024-10-09 15:53:45 +00:00
de4c2a3b4e Add AsyncCollectiveTensor isinstance check to test_graph_input_is_async (#137253)
This PR doesn't change the logic of `test_graph_input_is_async` - it just adds an additional check to the graph input type to ensure it's always `AsyncCollectiveTensor` as expected. It would potentially make it easier to show to users that we already support `AsyncCollectiveTensor` as graph input.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137253
Approved by: https://github.com/bdhirsh
2024-10-09 08:06:16 +00:00
ac8954d1ca [pattern match][SDPA] remove contiguous in sdpa replacement (#136930)
Fixes a perf issue which was found internally.
In that case, we see query(size=[1, 16, 384, 64], stride=[393216, 64, 1024, 1]) in the model code. However, before entering SDPA it becomes query(size=[1, 16, 384, 64], stride=[393216, 24576, 64, 1]). This is caused by the [SDPA pattern match](https://github.com/pytorch/pytorch/blob/main/torch/_inductor/fx_passes/fuse_attention.py#L130-L132), which applies contiguous to the inputs in the replacement. This is not necessary as the contiguous doesn't exist in the pattern, and furthermore it can sometimes cause perf issues. In any case, we can do the additional contiguous in the kernel implementation if needed.
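
A hedged repro sketch of the stride pattern described above (the replacement no longer forces `.contiguous()` on such inputs):

```python
import torch
import torch.nn.functional as F

# size [1, 16, 384, 64] with stride [393216, 64, 1024, 1], as in the model code
q = torch.randn(1, 384, 16, 64).transpose(1, 2)
k = torch.randn(1, 384, 16, 64).transpose(1, 2)
v = torch.randn(1, 384, 16, 64).transpose(1, 2)
out = F.scaled_dot_product_attention(q, k, v)
```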

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136930
Approved by: https://github.com/Skylion007, https://github.com/drisspg, https://github.com/jgong5
2024-10-09 07:52:38 +00:00
72ad1b8c6c Make Context to be Device-agnostic Step by Step (2/N) (#136526)
- add new methods (getDefaultGenerator, getNewGenerator) to AcceleratorHooksInterface
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136526
Approved by: https://github.com/ezyang, https://github.com/EikanWang
ghstack dependencies: #136519
2024-10-09 07:34:30 +00:00
a02093e824 fix test_export_constraints_error_not_in_range (#137500)
Test Plan: fixed

Differential Revision: D64052011

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137500
Approved by: https://github.com/tugsbayasgalan
2024-10-09 05:48:14 +00:00
abb00efc14 Add torch.squeeze parameter description to declare allowed type (#137485)
Fixes #137422

Add the parameter type definition in the API docs to clarify the allowed value type and prevent users from passing `None` as the `dim` value directly.

```python
>>> import torch
>>> x = torch.randn(3,1,2)
>>> x.squeeze(dim=None)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
RuntimeError: Please look up dimensions by name, got: name = None.
```
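
For reference, a short sketch of the allowed usage (continuing the example above):

```python
>>> x.squeeze().shape        # omit dim to squeeze all size-1 dimensions
torch.Size([3, 2])
>>> x.squeeze(dim=1).shape   # dim must be an int (or tuple of ints), not None
torch.Size([3, 2])
```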

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137485
Approved by: https://github.com/albanD
2024-10-09 05:29:13 +00:00
df114a447e Parametrize test_lstm_packed (#137447)
The test runs all its 512 combinations sequentially, so it takes more than 30 minutes to finish or times out on ASAN after one hour. Parametrizing it will break it up, so individual tests can finish and don't need to be marked as slow anymore.

Also, the test seems to run out of memory on a 2xlarge with a `std::bad_alloc` memory error. Maybe this would also fix the issue (pending CI testing).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137447
Approved by: https://github.com/albanD, https://github.com/malfet
2024-10-09 05:13:53 +00:00
2fff990c16 Revert "[AutoAC] Backward Pass Aware AC - changes to partitioner to acommodate SOLVER as a callable (#137314)"
This reverts commit 932b9945c0bc61a11a7db2f52c974cf283d5a2ed.

Reverted https://github.com/pytorch/pytorch/pull/137314 on behalf of https://github.com/huydhn due to The failure shows up in trunk ([comment](https://github.com/pytorch/pytorch/pull/137314#issuecomment-2401311719))
2024-10-09 04:53:30 +00:00
972822dea1 Minorly reorder optim kwargs in docs, fixes #137391 (#137531)
Closes #137391

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137531
Approved by: https://github.com/albanD
2024-10-09 04:14:45 +00:00
4628fcf41a Fix ir._WaitKernel (#137401)
In ABI-compatible mode, AOTInductor could not compile _WaitKernel due to
an incorrect outputs list.  Add the correct set of outputs, as done in
ir._CollectiveKernel.create_out_of_place.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137401
Approved by: https://github.com/desertfire
ghstack dependencies: #136924
2024-10-09 04:02:30 +00:00
0414aeacd9 AOTInductor: silence linker warnings about executable stacks (#136924)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136924
Approved by: https://github.com/desertfire
2024-10-09 04:02:30 +00:00
ddc7b6d0b4 Removes confusing note, addresses #38006 (#137535)
Fixes #38006

The note was originally added in https://github.com/pytorch/pytorch/pull/30257, which tried to ensure that the gradient wasn't modified in the optimizer. This note creates more confusion than is helpful, so removing it is better than leaving it in, especially because most uses of closure that I know of _do_ modify the grads.
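
As a minimal sketch, a typical closure (e.g. for LBFGS) does modify the grads via `backward()`:

```python
import torch

model = torch.nn.Linear(4, 1)
optimizer = torch.optim.LBFGS(model.parameters())
inputs, targets = torch.randn(8, 4), torch.randn(8, 1)

def closure():
    optimizer.zero_grad()
    loss = torch.nn.functional.mse_loss(model(inputs), targets)
    loss.backward()  # mutates .grad, which the removed note suggested avoiding
    return loss

optimizer.step(closure)
```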

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137535
Approved by: https://github.com/albanD
2024-10-09 04:00:38 +00:00
d3edf4ebf4 [SymmetricMemoryOps] implement two-shot all-reduce (#137473)
## This Stack

Implement custom all-reduce algos available in `IntraNodeComm` as `symm_mem` ops and replace the existing `IntraNodeComm` kernels with them.

## This PR

Implement `symm_mem::two_shot_all_reduce_`. Later we'll replace the two-shot all-reduce in `IntraNodeComm` with these.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137473
Approved by: https://github.com/Chillee
ghstack dependencies: #137471, #137472
2024-10-09 03:49:42 +00:00
82e55b624f [SymmetricMemoryOps] implement one_shot_all_reduce (#137472)
## This Stack

Implement custom all-reduce algos available in `IntraNodeComm` as `symm_mem` ops and replace the existing `IntraNodeComm` kernels with them.

## This PR

Implement `symm_mem::one_shot_all_reduce` and `symm_mem::one_shot_all_reduce_out`. Later we'll replace the one-shot all-reduce in `IntraNodeComm` with these.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137472
Approved by: https://github.com/Chillee, https://github.com/weifengpy
ghstack dependencies: #137471
2024-10-09 03:49:42 +00:00
5d83ee3e32 [SymmetricMemoryOps] refine cross-device barriers (#137471)
## This Stack

Implement custom all-reduce algos available in `IntraNodeComm` as `symm_mem` ops and replace the existing `IntraNodeComm` kernels with them.

## This PR

Refine the cross-device synchronization primitives to make it clearer when to use which synchronization pattern.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137471
Approved by: https://github.com/Chillee, https://github.com/weifengpy
2024-10-09 03:49:42 +00:00
5f1759a025 [Dynamo] add flex attention mode test (#137121)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137121
Approved by: https://github.com/yanboliang, https://github.com/anijain2305
ghstack dependencies: #137114, #137115, #137116, #137117, #137120, #137227, #137119
2024-10-09 02:29:40 +00:00
d5785d4295 [Dynamo] Handle torch function subclass/mode dispatch on generic tensor methods (#137119)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137119
Approved by: https://github.com/williamwen42, https://github.com/anijain2305
ghstack dependencies: #137114, #137115, #137116, #137117, #137120, #137227
2024-10-09 02:29:40 +00:00
0a304d9048 [Dynamo] Handle extracted unbound tensor methods (#137227)
fixes2

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137227
Approved by: https://github.com/williamwen42, https://github.com/anijain2305
ghstack dependencies: #137114, #137115, #137116, #137117, #137120
2024-10-09 02:29:40 +00:00
b3f30c9bc3 [Dynamo] Move flex attention torch function mode to traceable HOP file (#137120)
Moves `TransformGetItemToIndex` to a file where dynamo stores other traceable HOP concepts.  (We don't trace through torch.* modules by default)

Tracing through the mode required fixing a bug in dynamo autograd function, which fixed a graph break, which caused the autograd test failures (skipping for now and will file an issue)

Previously those tests were in essence running in eager, because dynamo would fallback due to an arg mismatch error.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137120
Approved by: https://github.com/yanboliang, https://github.com/malfet
ghstack dependencies: #137114, #137115, #137116, #137117
2024-10-09 02:29:40 +00:00
27dee935af [Dynamo] Ensure torch function modes are dispatched on builtin ops (#137117)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137117
Approved by: https://github.com/yanboliang, https://github.com/williamwen42
ghstack dependencies: #137114, #137115, #137116
2024-10-09 02:29:40 +00:00
38afac2917 [Dynamo] Remove ignored modes from torch function mode stack guard (#135503) (#137116)
Approved by: https://github.com/anijain2305
ghstack dependencies: #134732, #133137, #135443, #135444, #135422, #135502

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137116
Approved by: https://github.com/yanboliang
ghstack dependencies: #137114, #137115
2024-10-09 02:29:40 +00:00
108b469f78 [Dynamo] Remove ignored modes workaround (#135502) (#137115)
Approved by: https://github.com/anijain2305
ghstack dependencies: #134732, #133137, #135443, #135444, #135422

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137115
Approved by: https://github.com/yanboliang
ghstack dependencies: #137114
2024-10-09 02:29:40 +00:00
e41dffbedd [Dynamo] Trace enter/exit of TorchFunctionModes (#135422) (#137114)
This PR implements tracing of `with` contexts for TorchFunction modes which have the default enter/exit behavior (i.e. pushing/popping the mode).

Typically the bytecode for a context manager looks like this during a graph break:
1. graph call
2. enter context
3. unsupported code
4. exit context
5. resume call

resume fn structure:
1. enter context
2. jump
...
3. exit context

The issue with torch function modes is that side effects will replay any mutations to the torch function stack performed during tracing. So, we do not need to enter and exit around the unsupported code in the original function (doing so would result in a duplicate torch function mode entry during execution of the unsupported code), and we don't need to enter again in the resume function (the mode that was pushed from the side effects bytecode would still be on the stack).

So for torch function modes the structure of our output code is this:

1. graph call
2. mutate tf mode stack to replay mutations
3. unsupported code
4. on exception restore stack
5. resume function

Then our resume fn looks like this:

1. no-op enter torch function mode
2. jump
3.  exit tf mode

To implement the no-op enter of the torch function mode, I added a torch function mode polyfill which no-op enters but exits normally. This is needed because we still want to trace the with context in the resume function, and exit properly (the exit instructions will still be in the function, so we need to generate instructions to set up the context).

Separately from the bytecode, dynamo also tracks contexts on the block stack, which is how the SETUP_* instructions are implemented. Naturally at a graph break, we exit these block stacks to properly reset the contexts entirely, so that we can re-enter around the unsupported code soundly. However once again, in the torch function mode case, in the event of a graph break we do not want to perform any exit side effects because we want to preserve the state of the mode stack as-is so that we will properly update the stack with the bytecode mentioned in the first section. If we exited here, dynamo would pop the mode off of the symbolic stack, and not update the true python torch function mode stack with the suffix bytecode. All in all, for torch function modes we enter exactly once, update the global torch function mode stack with side effects bytecode, re-read this stack when compiling the resume function, and exit exactly once in the resume function. This matches the semantics of eager exactly.
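
A hedged illustration of the traced pattern (a `with` block over a TorchFunctionMode spanning a graph break; names are illustrative):

```python
import torch
from torch.overrides import TorchFunctionMode

class NoOpMode(TorchFunctionMode):
    def __torch_function__(self, func, types, args=(), kwargs=None):
        return func(*args, **(kwargs or {}))

@torch.compile(backend="eager")
def fn(x):
    with NoOpMode():
        y = x + 1
        torch._dynamo.graph_break()  # unsupported code inside the mode's context
        return y * 2

fn(torch.ones(3))
```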
Approved by: https://github.com/williamwen42
ghstack dependencies: #134732, #133137, #135443, #135444

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137114
Approved by: https://github.com/yanboliang
2024-10-09 02:29:40 +00:00
0b8048c78a Fix AOTI CPP GEMM Template issue without freezing (#136421)
**Summary**
Fix issue: https://github.com/pytorch/pytorch/issues/135106. For AOTI, there is the Inductor IR of weight
```
ReinterpretView(
  StorageBox(
    ConstantBuffer(name='L__self___mlp_0_weight', layout=FixedLayout('cpu', torch.float32, size=[64, 128], stride=[128, 1]))
  ),
  FixedLayout('cpu', torch.float32, size=[128, 64], stride=[1, 128]),
  origins=OrderedSet([addmm])
)
```
In the post-processing step of the GEMM template, the weight used was the one before permutation, leading to correctness issues. In this PR, we address this by reshaping the weight to the expected size and stride before the weight prepack.

**Test Plan**
```
python -u -m pytest -s -v test/inductor/test_aot_inductor.py -k test_misc_1_max_autotune_True_non_abi_compatible_cpu
python -u -m pytest -s -v test/inductor/test_cpu_select_algorithm.py -k test_aoti_linear
python -u -m pytest -s -v test/inductor/test_cpu_select_algorithm.py -k test_aoti_linear_multi_view_operations
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136421
Approved by: https://github.com/jgong5, https://github.com/desertfire
2024-10-09 02:19:07 +00:00
be0b75256a Make Context to be Device-agnostic Step by Step (1/N) (#136519)
- make init device-agnostic and move it to AcceleratorHooksInterface
- refactor context code related to device initialization

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136519
Approved by: https://github.com/ezyang, https://github.com/EikanWang, https://github.com/guangyey
2024-10-09 02:13:36 +00:00
384ddab294 [c10d] fix sequence numbers for coalesced operations (#135132)
Summary:
We were erroneously incrementing seq_collective for p2p operations.
Fixes issue #134833

Test Plan:
Unit tests.
TODO: add more unit tests

Reviewers:

Subscribers:

Tasks:

Tags:

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135132
Approved by: https://github.com/fduwjj
2024-10-09 01:38:12 +00:00
8cbb58cff6 [inductor] Limit cpu copies in autotuning to CUDA devices (#137509)
Summary: Missed in https://github.com/pytorch/pytorch/pull/136701#discussion_r1792328849: we should perform this optimization only for mutated args on cuda devices

Test Plan: `python benchmarks/dynamo/timm_models.py --performance --inductor --device cuda --inference --bfloat16 --print-compilation-time --print-memory --cold-start-latency --only fbnetc_100`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137509
Approved by: https://github.com/int3, https://github.com/eellison
2024-10-09 01:31:58 +00:00
932b9945c0 [AutoAC] Backward Pass Aware AC - changes to partitioner to acommodate SOLVER as a callable (#137314)
Summary: making it so that `config.activation_memory_budget_solver` can be passed as a callable, which is then invoked to determine the set of saved/recomputed nodes.

Test Plan: tbd

Reviewed By: Chillee, basilwong

Differential Revision: D63714905

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137314
Approved by: https://github.com/eellison, https://github.com/basilwong

Co-authored-by: Parikshit Shah <parikshit@meta.com>
2024-10-09 00:39:29 +00:00
23c531b3e9 Allow parallelize_module to get device_mesh from ambient context (#134247)
This PR is for supporting calling `parallelize_module` from within a model definition, making the model a parallel one.

Calling `parallelize_module` is an alternative to maintaining a set of `ColumnWiseLinear`, `RowWiseLinear`, etc, while still being able to directly author a parallel model.

(The motivation for authoring a parallel model is that there may be other distributed operations, which may not be easily captured by any module, see the forward function below. Alternatively speaking, the purpose is to exploit the expressiveness of DTensor -- we need to first create DTensors before calling ops on them. Having parallelized modules in model is one way of creating DTensors.)

For example:
```
class FeedForward(nn.Module):
    def __init__(self, config: TransformerArgs) -> None:
        super().__init__()
        w1 = nn.Linear(config.dim, config.hidden_dim, bias=False)
        w2 = nn.Linear(config.hidden_dim, config.dim, bias=False)
        w3 = nn.Linear(config.dim, config.hidden_dim, bias=False)
        self.w1 = parallelize_module(w1, Colwise)
        self.w2 = parallelize_module(w2, Rowwise)
        self.w3 = parallelize_module(w3, Colwise)

    def forward(self, x: Tensor) -> Tensor:
        y: DTensor = self.w2(F.silu(self.w1(x)) * self.w3(x))
        # y is a DTensor with Partial placement; we can return it as is.
        return y
        # Or we can convert it to Replicate -- there is modeling flexibility here.
        return y.redistribute(Replicate())

with device_mesh:
    model = FeedForward(config)
    # Now model is a model parallelized onto device_mesh

y = model(x)

```

The `device_mesh` actually used for `parallelize_module` would be retrieved from the ambient context.

Calling `parallelize_module` from within model hierarchy also saves the use of *FQNs* as in the out-of-model annotation case.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134247
Approved by: https://github.com/tianyu-l
2024-10-09 00:19:03 +00:00
de7f32a205 openreg add pin_memory (#135339)
According to `Next steps` in test/cpp_extensions/open_registration_extension/README.md, add pinned memory and a HostAllocator.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135339
Approved by: https://github.com/albanD
2024-10-09 00:07:59 +00:00
8893881867 Invalidate StorageImpl instances when tensor is overwritten with cudagraphs (#125264)
Fixes #104435

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125264
Approved by: https://github.com/ezyang

Co-authored-by: eellison <elias.ellison@gmail.com>
2024-10-09 00:05:52 +00:00
eqy
cba3f4f5e3 [CUDA] Clean up asserts in test_cuda.py (#137034)
Switch some `assertTrue` tests to `assertEqual` etc for debuggability in logs

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137034
Approved by: https://github.com/Skylion007
2024-10-08 23:16:19 +00:00
b16167874d Minor SGD docs clarification fixing #137356, #137352 (#137528)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137528
Approved by: https://github.com/albanD
2024-10-08 23:05:08 +00:00
2a1829d728 Error message for allow_in_graph decorator and arbitrary function combo (#135972)
Fixes #103615

Adds a quick error message for the disallowed combination of the allow_in_graph decorator with an arbitrary function.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135972
Approved by: https://github.com/anijain2305
2024-10-08 22:48:38 +00:00
4aed81c0db Add support for cat memory planning mms with max autotune (#132554)
When we are autotuning matmuls, the aten.mm and the triton template choices take in an externally allocated tensor that can be a view into a pre-planned aten.cat. As long as the output shape and stride of the matmul match the slice of the cat we're planning, we can realize the mm directly into the cat.

Discussion for reviewers:

It feels a little bit odd that in the existing code we set the output of aten.mm as [FlexibleLayout](bcac71517c/torch/_inductor/kernel/mm.py (L156)). While this is correct, it might lead to passing non-performant output strides to cublas. I guess this is better than a copy? Not sure. We could also introduce a layout that denotes a fixed shape and stride whose allocation we control:

```
class AllocatedFixedLayout(FixedLayout)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132554
Approved by: https://github.com/jansel
2024-10-08 22:36:46 +00:00
02013da038 Lift restriction on training IR for unflatten (#137470)
Differential Revision: [D64025578](https://our.internmc.facebook.com/intern/diff/D64025578)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137470
Approved by: https://github.com/avikchaudhuri
2024-10-08 22:30:24 +00:00
81c8a8ada6 [ONNX] Bump onnxscript in CI (#137497)
To 0.1.0.dev20241008
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137497
Approved by: https://github.com/titaiwangms
2024-10-08 21:56:30 +00:00
76ab1ab665 Fix autograd.Function + NJT when an output grad is None (#136875)
For `autograd.Function`, the engine will try to allocate correctly-shaped zeros for `None` grads (i.e. in the case where the output isn't used downstream). It determines the shape of these zeros from the `VariableInfo` entry, which is derived from the forward output shape. For the NJT forward output case, the size info stored will contain a nested int, and calling `zeros()` with this size throws:
```
RuntimeError: .../build/aten/src/ATen/RegisterCPU.cpp:5260: SymIntArrayRef expected to contain only concrete integers
```

This PR fixes this by storing the full tensor in the `VariableInfo` for the nested case and calling `zeros_like()` to allocate correctly-shaped zeros. This is pretty inefficient; ideally we would want to save just the NJT shape and be able to construct zeros from it, but this requires factory function support for nested ints (WIP). So this is a short-term fix until we have that.
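For context, here is a minimal dense-tensor sketch (illustrative only; the class and variable names are made up) of where those engine-allocated zeros come from. The NJT failure above is this same allocation, except that the recorded size contains a nested int:

```python
import torch

class TwoOutputs(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        return x.sin(), x.cos()  # the second output will go unused

    @staticmethod
    def backward(ctx, g_sin, g_cos):
        (x,) = ctx.saved_tensors
        # g_cos is not None here: the engine materialized correctly-shaped
        # zeros for the unused output from the shape recorded in VariableInfo.
        assert torch.equal(g_cos, torch.zeros_like(g_cos))
        return g_sin * x.cos() - g_cos * x.sin()

x = torch.randn(3, requires_grad=True)
used, _unused = TwoOutputs.apply(x)
used.sum().backward()
print(torch.allclose(x.grad, x.cos()))  # True
```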
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136875
Approved by: https://github.com/soulitzer
2024-10-08 21:01:36 +00:00
5e3e1c0151 Revert "[FSDP2] Required mesh_dim_names for HSDP (#137436)"
This reverts commit 5fb30df7d6ecc25cc7c4c17a8a33d14ddaa7c279.

Reverted https://github.com/pytorch/pytorch/pull/137436 on behalf of https://github.com/malfet due to Looks like it broke distributed testing, see https://github.com/pytorch/pytorch/actions/runs/11239761070/job/31249854217 ([comment](https://github.com/pytorch/pytorch/pull/137436#issuecomment-2400794929))
2024-10-08 20:50:49 +00:00
b499083a91 Get rid of quadratic tests to has_same_metadata (#136857)
Fixes https://github.com/pytorch/pytorch/issues/136852

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136857
Approved by: https://github.com/isuruf, https://github.com/bdhirsh
2024-10-08 20:49:23 +00:00
d34b617bb9 Revert "[Dynamo] Trace enter/exit of TorchFunctionModes (#135422) (#137114)"
This reverts commit 51bc839b94829f176e3c1b7f62e3448d6028c480.

Reverted https://github.com/pytorch/pytorch/pull/137114 on behalf of https://github.com/huydhn due to The top of the stack has been reverted but it leaves trunk in a broken state, so I try to revert the rest of the stack ([comment](https://github.com/pytorch/pytorch/pull/137114#issuecomment-2400765603))
2024-10-08 20:33:17 +00:00
8c937445ee Revert "[Dynamo] Remove ignored modes workaround (#135502) (#137115)"
This reverts commit b1fd7708bd81d8d52908bf4459ed024471abd803.

Reverted https://github.com/pytorch/pytorch/pull/137115 on behalf of https://github.com/huydhn due to The top of the stack has been reverted but it leaves trunk in a broken state, so I try to revert the rest of the stack ([comment](https://github.com/pytorch/pytorch/pull/137114#issuecomment-2400765603))
2024-10-08 20:33:17 +00:00
e5f9131327 Revert "[Dynamo] Remove ignored modes from torch function mode stack guard (#135503) (#137116)"
This reverts commit f9d69cde88ad972ee8fc24413dd0740f4e21562d.

Reverted https://github.com/pytorch/pytorch/pull/137116 on behalf of https://github.com/huydhn due to The top of the stack has been reverted but it leaves trunk in a broken state, so I try to revert the rest of the stack ([comment](https://github.com/pytorch/pytorch/pull/137114#issuecomment-2400765603))
2024-10-08 20:33:17 +00:00
2d18c2d5e7 Revert "[Dynamo] Ensure torch function modes are dispatched on builtin ops (#137117)"
This reverts commit 941be418d8ec3290d0e3bae0e16a443be26b3075.

Reverted https://github.com/pytorch/pytorch/pull/137117 on behalf of https://github.com/huydhn due to The top of the stack has been reverted but it leaves trunk in a broken state, so I try to revert the rest of the stack ([comment](https://github.com/pytorch/pytorch/pull/137114#issuecomment-2400765603))
2024-10-08 20:33:17 +00:00
cc75ac084f Add test for https://github.com/pytorch/pytorch/issues/137087 (#137090)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137090
Approved by: https://github.com/Skylion007, https://github.com/albanD
2024-10-08 20:17:03 +00:00
5349ee2934 Revert "Parametrize test_lstm_packed (#137447)"
This reverts commit d5493ed579ba41015ffef981832a3f04f94bb6f8.

Reverted https://github.com/pytorch/pytorch/pull/137447 on behalf of https://github.com/huydhn due to Need to up few more instance to 4xlarge, revert to reland ([comment](https://github.com/pytorch/pytorch/pull/137447#issuecomment-2400737602))
2024-10-08 20:15:24 +00:00
3c1ab93678 Log chromium event for automatic dynamic reasons (#137491)
Log a chromium event so that we can see the reasons for invoking automatic dynamic shapes in aggregate internally.

Run following code:
```
import torch
@torch.compile(backend="eager")
def foo(t, x):
    return t.sin() + x

torch._dynamo.config.automatic_dynamic_shapes = True
torch._dynamo.config.assume_static_by_default = True
# Change size
x = torch.randn([1,2])
foo(x, 2)
x = torch.randn([2,2])
foo(x, 2)
torch._dynamo.reset()
# Change dimensionality
x = torch.randn([1,2])
foo(x, 2)
x = torch.randn([1,2,3])
foo(x, 2)
torch._dynamo.reset()
# Change stride
x = torch.randn([3,3])
foo(x, 2)
x = torch.as_strided(x, [3,3], [2,2])
foo(x, 2)
torch._dynamo.reset()
# Change scalar
x = torch.randn([1,2])
foo(x, 2)
foo(x, 3)
```

Internal link to perfetto:
https://interncache-all.fbcdn.net/manifold/perfetto-artifacts/tree/ui/index.html?url=https%3A%2F%2Finterncache-all.fbcdn.net%2Fmanifold%2Ftlparse_reports%2Ftree%2Flogs%2Fjjwu%2Fcustom%2Fchromium_events.json#!/viewer?url=https%3A%2F%2Finterncache-all.fbcdn.net%2Fmanifold%2Ftlparse_reports%2Ftree%2Flogs%2Fjjwu%2Fcustom%2Fchromium_events.json&local_cache_key

The events look like this:
<img width="639" alt="image" src="https://github.com/user-attachments/assets/23916333-7f24-47c7-934b-201f33aebeab">
<img width="638" alt="image" src="https://github.com/user-attachments/assets/9f927c8d-04bb-4431-8802-685b032df656">
<img width="640" alt="image" src="https://github.com/user-attachments/assets/342e9e11-0dfc-422d-bd0b-01a8574d38ba">
<img width="635" alt="image" src="https://github.com/user-attachments/assets/dc2c97cd-7180-4069-b019-d6e63ee490bc">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137491
Approved by: https://github.com/Skylion007, https://github.com/oulgen
2024-10-08 19:53:12 +00:00
cyy
a2396b2dd8 [2/N] Fix extra warnings brought by clang-tidy-17 (#137459)
Follows #137407

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137459
Approved by: https://github.com/Skylion007
2024-10-08 19:05:02 +00:00
b41fc14072 compile time benchmarks for AOTDispatcher (partitioner) (#136760)
compile time benchmark for the min cut partitioner. I'm hoping that this is a reasonable benchmark because:

(1) it consists of a single input + many weights that are used sequentially
(2) contains a mix of recompute vs non-recomputed ops (matmul + sin)
(3) it is relatively simple

from running locally:
```
collecting compile time instruction count for aotdispatcher_partitioner_cpu
compile time instruction count for iteration 0 is 21764219181
compile time instruction count for iteration 1 is 12475020009
compile time instruction count for iteration 2 is 12463710140
compile time instruction count for iteration 3 is 12455676489
compile time instruction count for iteration 4 is 12451344330
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136760
Approved by: https://github.com/ezyang
ghstack dependencies: #136759
2024-10-08 18:44:13 +00:00
48b8f818b2 compile time benchmarks for AOTDispatcher (inference/training/subclasses) (#136759)
this adds a few compile time benchmarks for some disjoint paths in AOTDispatcher:

(1) inference vs training code paths
(2) "subclasses" vs "no subclasses" codepaths

Also see https://github.com/pytorch/pytorch/pull/136760 for a partitioner benchmark (I'm not sure why ghstack didn't display the stack nicely)

I ran locally, and got these numbers on the 4 paths:
```
collecting compile time instruction count for aotdispatcher_inference_nosubclass_cpu
compile time instruction count for iteration 0 is 11692348671
compile time instruction count for iteration 1 is 3026287204
compile time instruction count for iteration 2 is 3011467318
compile time instruction count for iteration 3 is 3004485935
compile time instruction count for iteration 4 is 3003087410
collecting compile time instruction count for aotdispatcher_training_nosubclass_cpu
compile time instruction count for iteration 0 is 6068003223
compile time instruction count for iteration 1 is 5585418102
compile time instruction count for iteration 2 is 5581856618
compile time instruction count for iteration 3 is 5581651794
compile time instruction count for iteration 4 is 5578742619
collecting compile time instruction count for aotdispatcher_inference_subclass_cpu
compile time instruction count for iteration 0 is 8634984264
compile time instruction count for iteration 1 is 8633467573
compile time instruction count for iteration 2 is 8632182092
compile time instruction count for iteration 3 is 8632056925
compile time instruction count for iteration 4 is 8632543871
collecting compile time instruction count for aotdispatcher_training_subclass_cpu
compile time instruction count for iteration 0 is 14737239311
compile time instruction count for iteration 1 is 14734346427
compile time instruction count for iteration 2 is 14736493730
compile time instruction count for iteration 3 is 14734121272
compile time instruction count for iteration 4 is 14733852882
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136759
Approved by: https://github.com/laithsakka
2024-10-08 18:44:13 +00:00
53af729a66 add meta for _segment_reduce_backward (#137442)
reland of https://github.com/pytorch/pytorch/pull/124988

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137442
Approved by: https://github.com/albanD
2024-10-08 18:40:06 +00:00
1aac1ffce1 Don't generate implicit value ranges for missing symbols. (#136667)
Instead, call back to a missing handler when needed. This greatly speeds things up when the value ranges dict is large. The missing handler is needed because nested ints don't have VRs, but symbolic sizes involving them occasionally show up in compute.

```
TORCHDYNAMO_EXTENDED_DEBUG_CREATE_SYMBOL="s11" TORCH_LOGS=dynamic PYTORCH_TEST_WITH_DYNAMO=1 python test/test_nestedtensor.py TestNestedTensorAutogradCPU.test_dropout_backward_jagged_cpu
```

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136667
Approved by: https://github.com/isuruf
ghstack dependencies: #136429
2024-10-08 18:12:57 +00:00
90bed32b98 Introduce torch.sym_sum (#136429)
Partially addresses https://github.com/pytorch/pytorch/issues/128150

When you have big sums of values, we end up computing long chains of
binary addition in our FX graph representation.  Not only is this ugly,
it also is quadratic, as the sympy.Add constructor is O(N) in number
of arguments.  Instead, ensure that we maintain the summation as a
single FX node so we can do the entire addition all in one go.

update_hint_regression benchmark, before and after:

```
update_hint_regression,compile_time_instruction_count,2648328980
update_hint_regression,compile_time_instruction_count,2563748678
```
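For illustration, a minimal usage sketch, assuming `torch.sym_sum` accepts a sequence of ints/SymInts as described above; with plain Python ints it simply reduces to their sum, while for SymInts inside a traced program the whole reduction stays a single FX node instead of a chain of binary adds:

```python
import torch

# Illustrative only: plain ints here stand in for symbolic sizes.
sizes = [2, 3, 5, 7]
print(torch.sym_sum(sizes))  # 17
```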

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136429
Approved by: https://github.com/isuruf
2024-10-08 18:12:57 +00:00
3bf6594d13 Log compile ids to pt2_remote_cache and pt2_compile_events (#137431)
Log the current compilation id for all relevant samples for these two tables, so we can have a 1:1 analog with dynamo_compile.

Differential Revision: [D63900826](https://our.internmc.facebook.com/intern/diff/D63900826/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137431
Approved by: https://github.com/oulgen
2024-10-08 18:04:48 +00:00
758dbac308 Add type check for ord in torch.linalg.vector_norm() and torch.linalg.matrix_norm() (#137463)
fixes #137424, fixes #137460
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137463
Approved by: https://github.com/lezcano
2024-10-08 17:53:56 +00:00
d87835ac32 [Profiler] Clear Out Dangling AppendOnlyLists (#137450)
Summary: There are two instances of AppendOnlyLists that don't get cleared after we have finished iterating through the forward lists. This can be potentially dangerous since they can last for the entirety of the lifespan of the profiler. We have also seen crashes during the destructor of these variables when the profiler is exiting. This could possibly be related to the fact that the default constructor assumes some valid state of these lists rather than whatever state they are in when the profiler is exiting.

Test Plan: Ran with profile_memory=True to make sure allocations queue gets cleared correctly and trace+workload ran successfully

Differential Revision: D64010911

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137450
Approved by: https://github.com/aaronenyeshi
2024-10-08 17:48:59 +00:00
7e8dace0de Revert "[ROCm] remove caffe2 from hipify (#137157)"
This reverts commit 40d826074546558f6665a4c118335a7725503cac.

Reverted https://github.com/pytorch/pytorch/pull/137157 on behalf of https://github.com/xw285cornell due to this is breaking internal where we still use caffe2 ([comment](https://github.com/pytorch/pytorch/pull/137157#issuecomment-2400466131))
2024-10-08 17:45:45 +00:00
a8047564ff Revert "[FlexAttention] Support training bias for eager (#136910)"
This reverts commit 711dacf9845cbc9ea8b3b0fa257309930106712f.

Reverted https://github.com/pytorch/pytorch/pull/136910 on behalf of https://github.com/malfet due to torch.library.custom_op looks weird here and it breaks some internal workloads ([comment](https://github.com/pytorch/pytorch/pull/136910#issuecomment-2400434833))
2024-10-08 17:29:02 +00:00
0b5ade8a12 Revert "[Dynamo] Move flex attention torch function mode to traceable HOP file (#137120)"
This reverts commit 68151fd2889c9752348c2dfdc7c175ee201c0cd3.

Reverted https://github.com/pytorch/pytorch/pull/137120 on behalf of https://github.com/malfet due to Need to revert to be able to revert https://github.com/pytorch/pytorch/pull/136910 ([comment](https://github.com/pytorch/pytorch/pull/137120#issuecomment-2400429265))
2024-10-08 17:26:19 +00:00
2570d77a26 Revert "type _dynamo/trace_wrapped_higher_order_op.py (#137354)"
This reverts commit a9f7b905de2217eedee6723b0eb83b3ac7406c26.

Reverted https://github.com/pytorch/pytorch/pull/137354 on behalf of https://github.com/malfet due to Need to revert to be able to revert https://github.com/pytorch/pytorch/pull/136910 ([comment](https://github.com/pytorch/pytorch/pull/137354#issuecomment-2400424669))
2024-10-08 17:22:40 +00:00
76c5bdd2cc Revert "[Dynamo] Handle extracted unbound tensor methods (#137227)"
This reverts commit 14eabd69152e31d059444310979625542db2aece.

Reverted https://github.com/pytorch/pytorch/pull/137227 on behalf of https://github.com/malfet due to Need to revert to be able to revert https://github.com/pytorch/pytorch/pull/136910 ([comment](https://github.com/pytorch/pytorch/pull/137227#issuecomment-2400406384))
2024-10-08 17:12:41 +00:00
c88c0e6c65 Revert "[Dynamo] Handle torch function subclass/mode dispatch on generic tensor methods (#137119)"
This reverts commit d255b34c0ac6208633ed5e71d019fa9ae061e1fc.

Reverted https://github.com/pytorch/pytorch/pull/137119 on behalf of https://github.com/malfet due to Need to revert to be able to revert https://github.com/pytorch/pytorch/pull/136910 ([comment](https://github.com/pytorch/pytorch/pull/137119#issuecomment-2400401262))
2024-10-08 17:09:26 +00:00
cc10ef4645 Revert "[Dynamo] add flex attention mode test (#137121)"
This reverts commit 144665d772f7ec014a4a23f460a632a4a4774f4a.

Reverted https://github.com/pytorch/pytorch/pull/137121 on behalf of https://github.com/malfet due to Need to revert to be able to revert https://github.com/pytorch/pytorch/pull/136910 ([comment](https://github.com/pytorch/pytorch/pull/137121#issuecomment-2400389882))
2024-10-08 17:03:34 +00:00
11192ceca4 Revert "[FlexAttention] only calculate grads for buffers that require_grad (#137451)"
This reverts commit 9f9d252971ea1de04d349a0460e39e3bfe824eae.

Reverted https://github.com/pytorch/pytorch/pull/137451 on behalf of https://github.com/malfet due to Need to revert it in order to be able to backout https://github.com/pytorch/pytorch/pull/136910 ([comment](https://github.com/pytorch/pytorch/pull/137451#issuecomment-2400385858))
2024-10-08 17:00:59 +00:00
8184e202d8 Update mutation checking in pattern matcher (#137448)
Fix for https://github.com/pytorch/pytorch/issues/137229

The current mutation checking is complicated because it works on pre-grad IR. Once pre-grad IR has been traced to OpOverloads, checking is much easier. I am also special-casing the auto functional HOP, although I discussed with @zou3519 that it would be nice to have a way of querying HOPs that mimic schemas.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137448
Approved by: https://github.com/zou3519
2024-10-08 16:56:40 +00:00
28493efe6e fix silly mapping issue with torch.Size (#137465)
Test Plan: added test

Differential Revision: D64022949

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137465
Approved by: https://github.com/yushangdi, https://github.com/angelayi
2024-10-08 16:53:15 +00:00
7267363844 [ONNX] Insert contiguous node between transpose and view before calling run_decompositions (#137340)
Works around #136543.

This fix solves the issue only in the context of the ONNX exporter, but the issue happens in other contexts as well.

The bug happens when method `run_decompositions` is called. The failing pattern is assumed to be ``view(transpose(x, ...))``. This pattern is replaced by ``view(flatten(transpose(x, ..)))``. By changing the dimensions, the strides are updated as well and `run_decompositions` does not fail anymore. It would be inefficient on a 1D tensor but then transpose would not be used. The extra node appears in the final onnx graph but is removed after optimization. The final onnx graph should not be impacted and no performance loss should be observed for the onnx model.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137340
Approved by: https://github.com/justinchuby

Co-authored-by: Justin Chu <justinchuby@users.noreply.github.com>
2024-10-08 16:45:59 +00:00
5fb30df7d6 [FSDP2] Required mesh_dim_names for HSDP (#137436)
Two changes:
1. Require `mesh_dim_names` if using HSDP
2. Pass only the shard mesh to `fsdp_pre_all_gather`

Change 1 is technically BC breaking, but it should not be hard to fix on the user side.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137436
Approved by: https://github.com/weifengpy, https://github.com/wz337
2024-10-08 16:31:18 +00:00
0bfedb13e7 Remove aoti_torch_zero_ codegen (#137371)
Summary: aoti_torch_zero_ codegen breaks AOTI FC, see discussion in D63281798.

Test Plan: CI

Differential Revision: D63916320

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137371
Approved by: https://github.com/jingsh
2024-10-08 15:57:41 +00:00
c04b35a5ae [AOTI] Add standalone version of TORCH_CHECK (#136873)
Summary: In the standalone mode, TORCH_CHECK throws std::runtime_error, instead of c10::Error. The goal is to cut dependency on libtorch. Specifically, AOTI generates CPU code which may call ATen vectorization ops and we need to make sure those ops are self-contained.

Differential Revision: [D63911928](https://our.internmc.facebook.com/intern/diff/D63911928)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136873
Approved by: https://github.com/albanD, https://github.com/chenyang78
2024-10-08 15:30:01 +00:00
d5493ed579 Parametrize test_lstm_packed (#137447)
The test runs all its combinations (512) sequentially, so it takes more than 30 minutes to finish, or times out on ASAN after one hour. Parametrizing it will break it up, so individual tests can finish and don't need to be marked as slow anymore.

Also, the test seems to run OOM on a 2xlarge with `std::bad_alloc` memory error.  Maybe, this would also fix the issue (pending CI testing)
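For reference, a minimal sketch of the parametrization pattern being applied (the test name and parameter values are illustrative, not the actual test_lstm_packed combinations):

```python
from torch.testing._internal.common_utils import (
    TestCase, instantiate_parametrized_tests, parametrize, run_tests,
)

class TestExample(TestCase):
    # Instead of one test looping over every combination internally, each
    # (bias, batch_first) pair becomes its own test entry that can pass,
    # fail, or be skipped independently.
    @parametrize("bias", [True, False])
    @parametrize("batch_first", [True, False])
    def test_lstm_like(self, bias, batch_first):
        self.assertIn(bias, (True, False))

instantiate_parametrized_tests(TestExample)

if __name__ == "__main__":
    run_tests()
```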

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137447
Approved by: https://github.com/albanD, https://github.com/malfet
2024-10-08 15:26:27 +00:00
3e2f276a14 Fix to() on non-contiguous NJTs (#137124)
Called out via torchrec integration: `lengths` is not handled properly.

Future work (not related to non-contiguous NJTs): #137275
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137124
Approved by: https://github.com/soulitzer
ghstack dependencies: #137030, #137031
2024-10-08 15:11:05 +00:00
a77bb8527c Make index check in applySelect support deferred runtime assert (#137046)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137046
Approved by: https://github.com/albanD
2024-10-08 14:31:47 +00:00
9b2e453e24 Migrate ARM64 Linux binary jobs to runner determinator (#136666)
Updates ARM64 Linux binary jobs to use the runner determinator.

Issue: pytorch/ci-infra#265
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136666
Approved by: https://github.com/ZainRizvi
2024-10-08 12:14:06 +00:00
76dca1fef3 [c10d] separate the codes for GPU stream synchronization and CPU thread synchronization (#137295)
Summary:
This PR should not change the existing behavior of work.wait(), just
separate the stream synchronization code from the CPU busy wait code.

Also, remove the need of a private synchronization function.

In the longer term, we would like to give users the flexibility of bypassing the watchdog thread and handling collective errors themselves.

Test Plan:
python test/distributed/test_c10d_nccl.py NcclErrorHandlingTest

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137295
Approved by: https://github.com/kwen2501
2024-10-08 08:53:47 +00:00
9f9d252971 [FlexAttention] only calculate grads for buffers that require_grad (#137451)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137451
Approved by: https://github.com/Chillee
2024-10-08 07:36:38 +00:00
59cdd8ddf1 Bump optree version to 0.13.0 to enable Python 3.13 and Python 3.13t support (#137396)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137396
Approved by: https://github.com/albanD
2024-10-08 06:49:04 +00:00
493d0eeef3 Revert "Add support for cat memory planning mms with max autotune (#132554)"
This reverts commit d558ec07300defee24dd4a83ab4b387a39ea2176.

Reverted https://github.com/pytorch/pytorch/pull/132554 on behalf of https://github.com/huydhn due to Sorry for reverting your change but I think it is failing on ROCm ([comment](https://github.com/pytorch/pytorch/pull/132554#issuecomment-2398946854))
2024-10-08 06:21:06 +00:00
8ca15e87f5 Update torchbind expecttest from landrace (#137453)
Update expecttest from torch function mode PR landrace (torch function mode changes output code slightly)

Attempted to revert the stack but there were conflicts
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137453
Approved by: https://github.com/huydhn
2024-10-08 06:01:29 +00:00
bb31e3f57e Add original forward names to schema so that prettify pass works (#136887)
When we run_decomp, we retrace if it is training IR. As a result, we do need to reliably store the original forward names when we run decomp.

Differential Revision: [D63064453](https://our.internmc.facebook.com/intern/diff/D63064453/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136887
Approved by: https://github.com/angelayi
2024-10-08 04:21:02 +00:00
46525abb71 OpenReg: support multiple executors (#136249)
In PR https://github.com/pytorch/pytorch/pull/135646 we split the daemon into a driver and an executor; however, the current executor stands for all devices and allocates all memory together. In order to better simulate device behavior, here we support multiple executors, with each executor standing for one device.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136249
Approved by: https://github.com/FFFrog, https://github.com/albanD
2024-10-08 01:37:08 +00:00
395e098209 type _dynamo/mutation_guard.py (#137350)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137350
Approved by: https://github.com/Skylion007
2024-10-08 00:04:34 +00:00
52ba40c6f6 [ROCm][AOTI] add CK backend (#135641)
Companion to #134379

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135641
Approved by: https://github.com/ColinPeppler, https://github.com/chenyang78

Co-authored-by: Colin Peppler <colinpeppler@meta.com>
2024-10-07 23:53:58 +00:00
2c0b11c79b forward-fix D63916220 breaking test_cutlass_backend in FBCode (#137435)
Summary: It seems like the import path differs between FBCode and OSS. Wondering how to consolidate them.

Test Plan:
```
buck2 test 'fbcode//mode/opt' fbcode//caffe2/test/inductor:cutlass_backend

Tests finished: Pass 2. Fail 0. Fatal 0. Skip 33. Build failure 0
```

Differential Revision: D63991961

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137435
Approved by: https://github.com/jovianjaison
2024-10-07 23:44:04 +00:00
812f286d4a Delete duplicate bindings in torch/csrc/autograd/python_torch_functions_manual.cpp (#136711)
This change deletes the duplicate binding of `torch._functionalize_mark_mutation_hidden_from_autograd()`; the other definition is here:

5c78c6b05a/torch/csrc/autograd/python_torch_functions_manual.cpp (L630-L636)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136711
Approved by: https://github.com/soulitzer
2024-10-07 23:19:06 +00:00
d558ec0730 Add support for cat memory planning mms with max autotune (#132554)
When we are autotuning matmuls, the aten.mm and the triton template choices take in an externally allocated tensor that can be a view into a pre-planned aten.cat. As long as the output shape and stride of the matmul match the slice of the cat we're planning, we can realize the mm directly into the cat.

Discussion for reviewers:

It feels a little bit odd that in the existing code we set the output of aten.mm as [FlexibleLayout](bcac71517c/torch/_inductor/kernel/mm.py (L156)). While this is correct, it might lead to passing non-performant output strides to cublas. I guess this is better than a copy? Not sure. We could also introduce a layout that denotes a fixed shape and stride whose allocation we control:

```
class AllocatedFixedLayout(FixedLayout)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132554
Approved by: https://github.com/jansel
2024-10-07 22:49:29 +00:00
01bf350967 Fix bmm_sparse_cuda illegal memory access (#131977)
This PR fixes a bug in `search_end_matrix_indices_cuda_kernel` causing an illegal memory access when calling `bmm_sparse_cuda` on a sparse matrix with no non-zero values in the first batch dimension. Reproducible example:
```py
import torch

ind = torch.tensor([[1], [0], [0]], device="cuda")
val = torch.tensor([1.], device="cuda")
A = torch.sparse_coo_tensor(ind, val, size=(2, 1, 1))
B = torch.zeros((2, 1, 1), device="cuda")
C = torch.bmm(A, B)
```

## Details

In the previous code, we may for example end up with the following situation:
```
i : indices_1D[i]
------------------------------------------
0 : 1                <- start_idx, mid_idx
1 : 1                <- end_idx
...
```
When `target_mat_num = 0`, the next iteration of the while loop will assign `-1` to `end_idx` and thus `(0 + (-1)) >> 1 = -1` to `mid_idx`, causing an access error on line 703. The updated code maintains the invariant `start_idx <= end_idx` and will not go out of bounds.
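For illustration, a pure-Python sketch of the fixed search strategy (not the actual CUDA kernel code): a lower-bound style loop that maintains `start_idx <= end_idx`, so `mid_idx` can never become negative even when the first batch has no non-zero values:

```python
# Find the first position whose batch index reaches target_mat_num.
def lower_bound(indices_1d, target_mat_num):
    start_idx, end_idx = 0, len(indices_1d)
    while start_idx < end_idx:
        mid_idx = (start_idx + end_idx) // 2
        if indices_1d[mid_idx] < target_mat_num:
            start_idx = mid_idx + 1
        else:
            end_idx = mid_idx
    return start_idx

print(lower_bound([1, 1], 0))  # 0: batch 0 has no non-zero values
```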

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131977
Approved by: https://github.com/amjames, https://github.com/pearu, https://github.com/nikitaved
2024-10-07 22:47:34 +00:00
a6707a7303 [dynamo] log all graph breaks to graph_breaks logging artifact (#137244)
We were previously not logging all graph breaks (e.g. data dependent jumps) to the graph_breaks logging artifact.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137244
Approved by: https://github.com/jansel
2024-10-07 22:34:27 +00:00
a9f7b905de type _dynamo/trace_wrapped_higher_order_op.py (#137354)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137354
Approved by: https://github.com/Skylion007, https://github.com/jansel
2024-10-07 21:57:06 +00:00
796c3c3415 Revert "Disallow FakeTensor.data_ptr access in eager mode (#137221)"
This reverts commit 7e13e7dd7e5fc20c0420605aeecb0f902af5326c.

Reverted https://github.com/pytorch/pytorch/pull/137221 on behalf of https://github.com/jovianjaison due to failing internal tests ([comment](https://github.com/pytorch/pytorch/pull/137221#issuecomment-2397957081))
2024-10-07 21:46:13 +00:00
319eda9dfd [inductor] Add API to make post_grad_custom passes cache-able (#137298)
Summary: See https://github.com/pytorch/pytorch/issues/130772

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137298
Approved by: https://github.com/oulgen, https://github.com/eellison
2024-10-07 21:11:54 +00:00
8aa110cb00 [ROCm] Enable int_mm_error tests for rocm 6.0+ (#124999)
This pull request enables the int_mm_error tests for rocm 6.0+ . since  #122431 landed

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124999
Approved by: https://github.com/jeffdaily, https://github.com/malfet
2024-10-07 21:10:18 +00:00
46abaa3b0f Increase parallelnative shards to 4 (#137433)
The job times out flakily in trunk as its duration is approaching 3.5h https://hud.pytorch.org/hud/pytorch/pytorch/main/1?per_page=50&name_filter=parallelnative

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137433
Approved by: https://github.com/wdvr, https://github.com/malfet
2024-10-07 21:06:34 +00:00
c87c9f0a01 [inductor] Conditionally copy args to cpu to minimize memory overhead of autotuning (#136701)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136701
Approved by: https://github.com/eellison
2024-10-07 19:47:04 +00:00
900f57216f [dynamo] Log a summary of frames Dynamo traced (#137297)
This patch adds logging for all frames Dynamo traced, during each invocation of a Dynamo-optimized function.

Example:
```python
import torch

@torch.compile
def foo():
    x = torch.ones([10])
    def bar():
        y = x + x
        torch._dynamo.graph_break()
        z = y * x
        return z

    return bar(), bar

foo()
foo()
```

Running `TORCH_LOGS="dynamo" python` on the above dumps the following near the very end.
```
......
I1003 12:18:31.058000 177 torch/_dynamo/eval_frame.py:486] starting from foo /Users/ryanguo99/Documents/work/scratch/test.py:4, torchdynamo attempted to trace the following frames: [
I1003 12:18:31.058000 177 torch/_dynamo/eval_frame.py:486]   * foo /Users/ryanguo99/Documents/work/scratch/test.py:4
I1003 12:18:31.058000 177 torch/_dynamo/eval_frame.py:486]   * bar /Users/ryanguo99/Documents/work/scratch/test.py:7
I1003 12:18:31.058000 177 torch/_dynamo/eval_frame.py:486] ]
I1003 12:18:31.064000 177 torch/_dynamo/eval_frame.py:486] starting from foo /Users/ryanguo99/Documents/work/scratch/test.py:4, torchdynamo attempted to trace the following frames: []
......
```

Fixes #118262.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137297
Approved by: https://github.com/williamwen42
2024-10-07 19:44:41 +00:00
f33ffd01f2 [export] fix joint graph metadata (#136011)
Differential Revision: D62652832

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136011
Approved by: https://github.com/tugsbayasgalan
2024-10-07 19:36:44 +00:00
08b84afda9 [inductor] Fix alignment hint for WorkspaceArg (#137429)
Alignment hints refer to the base ptr, not the size.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137429
Approved by: https://github.com/eellison
2024-10-07 19:32:33 +00:00
fe44b6a67f Revert "Add back DistributedDataParallel types that were lost when pyi was removed (#136835)"
This reverts commit 40b09edd87fcbe4e63c4db6399ec758d5c34e1b1.

Reverted https://github.com/pytorch/pytorch/pull/136835 on behalf of https://github.com/jovianjaison due to this pr is causing typecheck errors internally ([comment](https://github.com/pytorch/pytorch/pull/136835#issuecomment-2397661940))
2024-10-07 18:59:41 +00:00
144665d772 [Dynamo] add flex attention mode test (#137121)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137121
Approved by: https://github.com/yanboliang
ghstack dependencies: #137114, #137115, #137116, #137117, #137120, #137227, #137119
2024-10-07 18:55:26 +00:00
d255b34c0a [Dynamo] Handle torch function subclass/mode dispatch on generic tensor methods (#137119)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137119
Approved by: https://github.com/williamwen42
ghstack dependencies: #137114, #137115, #137116, #137117, #137120, #137227
2024-10-07 18:55:26 +00:00
14eabd6915 [Dynamo] Handle extracted unbound tensor methods (#137227)
fixes2

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137227
Approved by: https://github.com/williamwen42
ghstack dependencies: #137114, #137115, #137116, #137117, #137120
2024-10-07 18:55:26 +00:00
68151fd288 [Dynamo] Move flex attention torch function mode to traceable HOP file (#137120)
Moves `TransformGetItemToIndex` to a file where dynamo stores other traceable HOP concepts.  (We don't trace through torch.* modules by default)

Tracing through the mode required fixing a bug in dynamo autograd function, which fixed a graph break, which caused the autograd test failures (skipping for now and will file an issue)

Previously those tests were in essence running in eager, because dynamo would fallback due to an arg mismatch error.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137120
Approved by: https://github.com/yanboliang
ghstack dependencies: #137114, #137115, #137116, #137117
2024-10-07 18:55:26 +00:00
941be418d8 [Dynamo] Ensure torch function modes are dispatched on builtin ops (#137117)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137117
Approved by: https://github.com/yanboliang, https://github.com/williamwen42
ghstack dependencies: #137114, #137115, #137116
2024-10-07 18:55:26 +00:00
f9d69cde88 [Dynamo] Remove ignored modes from torch function mode stack guard (#135503) (#137116)
Approved by: https://github.com/anijain2305
ghstack dependencies: #134732, #133137, #135443, #135444, #135422, #135502

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137116
Approved by: https://github.com/yanboliang
ghstack dependencies: #137114, #137115
2024-10-07 18:55:26 +00:00
b1fd7708bd [Dynamo] Remove ignored modes workaround (#135502) (#137115)
Approved by: https://github.com/anijain2305
ghstack dependencies: #134732, #133137, #135443, #135444, #135422

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137115
Approved by: https://github.com/yanboliang
ghstack dependencies: #137114
2024-10-07 18:55:26 +00:00
51bc839b94 [Dynamo] Trace enter/exit of TorchFunctionModes (#135422) (#137114)
This PR implements tracing of with contexts with TorchFunction modes which have the default enter/exit behavior (ie pushing/popping the mode)

Typically the bytecode for a context manager looks like this during a graph break:
1. graph call
2. enter context
3. unsupported code
4. exit context
5. resume call

resume fn structure:
1. enter context
2. jump
...
3. exit context

The issue with torch function modes is that side effects will replay any mutations to the torch function stack performed during tracing. So, we do not need to enter and exit around the unsupported code in the original function (doing so would result in a duplicate torch function mode entry during execution of the unsupported code), and we don't need to enter again in the resume function (the mode that was pushed from the side effects bytecode would still be on the stack).

So for torch function modes the structure of our output code is this:

1. graph call
2. mutate tf mode stack to replay mutations
4. unsupported code
5. on exception restore stack
6. resume function

Then our resume fn looks like this:

1. no-op enter torch function mode
2. jump
3.  exit tf mode

To implement the no-op enter of the torch function mode I added torch function mode in polyfill which no-op enters, but normally exits. This is needed because we still want to trace the with context in the resume function, and exit properly (the exit instructions will still be in the function, so we need to generate instructions to set up the context).

Separately from the bytecode, dynamo also tracks contexts on the block stack, which is how the SETUP_* instructions are implemented. Naturally, at a graph break we exit these block stacks to properly reset the contexts entirely, so that we can re-enter around the unsupported code soundly. However, once again, in the torch function mode case, in the event of a graph break we do not want to perform any exit side effects, because we want to preserve the state of the mode stack as is so that we will properly update the stack with the bytecode mentioned in the first section. If we exited here, dynamo would pop the mode off of the symbolic stack and not update the true Python torch function mode stack with the suffix bytecode.

All in all, for torch function modes we enter exactly once, update the global torch function mode stack with side-effects bytecode, re-read this stack when compiling the resume function, and exit exactly once in the resume function. This matches the semantics of eager exactly.
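For orientation, a minimal sketch of the kind of program this machinery handles (the mode and function names are made up); whether the with-block is fully traced or handled via the graph-break path described above depends on the support added here:

```python
import torch
from torch.overrides import TorchFunctionMode

class NoopMode(TorchFunctionMode):
    def __torch_function__(self, func, types, args=(), kwargs=None):
        return func(*args, **(kwargs or {}))

@torch.compile(backend="eager")
def f(x):
    with NoopMode():                 # mode entered inside the compiled region
        y = x + 1
        torch._dynamo.graph_break()  # unsupported code splits the frame here
        return y * 2                 # finished in a resume function

print(f(torch.ones(3)))  # tensor([4., 4., 4.])
```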
Approved by: https://github.com/williamwen42
ghstack dependencies: #134732, #133137, #135443, #135444

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137114
Approved by: https://github.com/yanboliang
2024-10-07 18:55:26 +00:00
ff95ff5d38 type _dynamo/profiler.py (#137351)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137351
Approved by: https://github.com/Skylion007
2024-10-07 18:54:33 +00:00
aa145dead8 [FSDP2] Fixed mistargeted backward prefetch (#137348)
If there is an `unshard` (top-half) without a `wait_for_unshard` (bottom-half), then the next iteration's `unshard` will be a no-op. This can unexpectedly not propagate the optimizer update on the sharded parameters to the unsharded parameters, so it is better to clear that `unshard` at the end of backward.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137348
Approved by: https://github.com/weifengpy
2024-10-07 18:10:09 +00:00
01c07e7864 Revert "[BE][Ez]: Update cudnn_frontend submodule to v1.7.0 (#136920)"
This reverts commit 8dddd456794f82db5b4e807e9aed1919d3a832da.

Reverted https://github.com/pytorch/pytorch/pull/136920 on behalf of https://github.com/drisspg due to Breaks sdpa with bias support, will switch to newer patch version when released ([comment](https://github.com/pytorch/pytorch/pull/136920#issuecomment-2397548622))
2024-10-07 17:56:57 +00:00
cyy
0c0d8c8ff0 [1/N] Fix extra warnings brought by clang-tidy-17 (#137407)
Before we can use clang-tidy-17
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137407
Approved by: https://github.com/Skylion007, https://github.com/aaronenyeshi
2024-10-07 17:53:59 +00:00
ceb4ed8450 [AOTI][Tooling][10/n] Add scalar and symbolic type input debug printing support (#137323)
Summary:
- Further added more types for debug value dumping.

- Add a test case for symint inputs for the debug printer. In real prod model use cases, it can be helpful to be able to examine the values of "unbacked symints" (those 'u0', 's0', etc.).

Test Plan:
```
AOT_INDUCTOR_DEBUG_INTERMEDIATE_VALUE_PRINTER=2  TORCHINDUCTOR_FORCE_DISABLE_CACHES=1 TORCHINDUCTOR_ABI_COMPATIBLE=1 TORCH_COMPILE_DEBUG=1 TORCH_LOGS="+graph, inductor, +schedule, output_code" buck2 run -c fbcode.enable_gpu_sections=true -c fbcode.nvcc_arch=h100 @//mode/opt fbcode//caffe2/test/inductor:test_aot_inductor -- -r test_aoti_debug_printer_sym_inputs_abi_compatible_cuda
```

Differential Revision: D63864708

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137323
Approved by: https://github.com/chenyang78
2024-10-07 17:41:40 +00:00
04e48ac562 [inductor] Refactor prefix to make it easy to create subclass of PythonWrapper (#137198)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137198
Approved by: https://github.com/jansel
ghstack dependencies: #137191, #137193
2024-10-07 17:20:58 +00:00
e2b72348d0 [inductor] Reuse the subgraph if accessed via same get_attr node (#137193)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137193
Approved by: https://github.com/jansel
ghstack dependencies: #137191
2024-10-07 17:20:58 +00:00
7a5eaecd92 [inductor] Correctly keep track of the graph_input_names (#137191)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137191
Approved by: https://github.com/jansel
2024-10-07 17:20:53 +00:00
14b4099521 [FSDP2] support torch._foreach_copy_(float8) for fully_shard(Float8Linear) (#135955)
This PR unblocks the unit test with a single Float8Linear module. It fixes the following error:
```
torch._foreach_copy_(foreach_copy_dsts, all_gather_inputs)
[rank0]:E0913 13:44:29.829000 2179476 torch/testing/_internal/common_distributed.py:671] RuntimeError: "foreach_tensor_copy" not implemented for 'Float8_e4m3fn'
```

Differential Revision: [D63961071](https://our.internmc.facebook.com/intern/diff/D63961071)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135955
Approved by: https://github.com/vkuzo, https://github.com/eqy
2024-10-07 16:36:31 +00:00
33461592e2 [TLParse] Include cache hit/miss/bypass in the report name (#137282)
Makes it easier to tell on glance

https://interncache-all.fbcdn.net/manifold/tlparse_reports/tree/logs/.tmp1xoGc1/index.html

<img width="398" alt="image" src="https://github.com/user-attachments/assets/7ed111cb-46d8-4442-a1b2-037d0a8decd8">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137282
Approved by: https://github.com/jamesjwu
2024-10-07 16:00:00 +00:00
4db199f15f Implement Remote AOTAutogradCache (#137278)
Summary: Implement Remote AOTAutogradCache. It uses all the same tech as Remote FXGraphCache, just with its own name.

Test Plan:
Run benchmark:
TORCHINDUCTOR_AUTOGRAD_REMOTE_CACHE=1 TORCHINDUCTOR_FX_GRAPH_REMOTE_CACHE=1 TORCHINDUCTOR_AUTOGRAD_CACHE=0 TORCHINDUCTOR_FX_GRAPH_CACHE=0 TORCH_LOGS=+torch._functorch._aot_autograd.autograd_cache buck run mode/opt benchmarks/dynamo:torchbench -- --training --backend=inductor --only nanogpt --repeat 5 --performance --cold-start-latency

See that it cache hits even with local cache removed.

Results show up in remote cache logs https://fburl.com/scuba/pt2_remote_cache/5893dbaj

New unit tests

Reviewed By: oulgen

Differential Revision: D63323958

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137278
Approved by: https://github.com/oulgen
2024-10-07 15:38:54 +00:00
f80ed0b831 [export] Custom op meta kernel generation (two pass) (#137277)
Summary: Prototyping the custom op meta kernel generation. Rest of the changes are in fbcode/scripts/angelayi

Test Plan: followup diff (D63837739)

Differential Revision: D63837740

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137277
Approved by: https://github.com/zou3519
2024-10-07 15:34:19 +00:00
e20e7a8c38 Fixed developer setup issue in open_registration_extension (#137355)
This PR fixes an issue where, when running `python setup.py develop`, the `open_registration_extension` self-contained example will not build due to the following error:

```
error: 'synchronizeStream' overrides a member function but is not marked 'override'
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137355
Approved by: https://github.com/albanD, https://github.com/spzala
2024-10-07 15:25:37 +00:00
8c3ab21490 multiprocessing.spawn: allow a grace period when shutdown (#131278)
When one process fails, the others are immediately killed. This prevents the other processes from doing necessary cleanups or dumping debug information (in particular, the NCCL flight recorder).

This PR adds a grace period. Default behavior is unchanged.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131278
Approved by: https://github.com/albanD
2024-10-07 12:37:34 +00:00
a063a82c8b [redo] Fp8 support for item() with cuda, index_select, and fill_ cpu (#137341)
Summary:

Redo of https://github.com/pytorch/pytorch/pull/128780, easier to copy-paste.

Test Plan: CI

Reviewers:

Subscribers:

Tasks:

Tags:

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137341
Approved by: https://github.com/eqy
2024-10-07 00:58:51 +00:00
d1b87e26e5 [BE] Delete empty files (#137376)
Discovered by running
```
 % find aten -type f -size 0
aten/src/ATen/native/quantized/cpu/qnnpack/wrappers/dummy.c
aten/src/ATen/native/vulkan/api/StringUtil.cpp
aten/src/ATen/native/LegacyBridge.cpp
aten/src/ATen/function_wrapper.py
aten/src/ATen/cudnn/Exceptions.h
```

Most of them were added by b774ce54f8

Remove reference to LegacyBridge.cpp from `aten_native_source_non_codegen_list`:
f42f63ee86/build_variables.bzl (L1317)

And reference to `native/quantized/cpu/qnnpack/wrappers/dummy.c` from f42f63ee86/aten/src/ATen/native/quantized/cpu/qnnpack/buckbuild.bzl (L440)
Which seems to be a bug from some ancient Android toolchain

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137376
Approved by: https://github.com/kit1980, https://github.com/eqy, https://github.com/seemethere, https://github.com/jianyuh, https://github.com/Skylion007
2024-10-06 18:59:04 +00:00
0eba7e5451 Revert runtime numeric check in inductor due to increased compilation time (#137324)
Summary:
This diff reverts D63438718
This caused a latency regression on multiple models.

Test Plan: NA

Differential Revision: D63872515

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137324
Approved by: https://github.com/xuzhao9
2024-10-06 05:23:24 +00:00
1dc1b85714 [export] Move swap to a different file (#137134)
Refactor so that unflattener doesn't become too messy

Differential Revision: [D63719648](https://our.internmc.facebook.com/intern/diff/D63719648/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137134
Approved by: https://github.com/avikchaudhuri
ghstack dependencies: #136191, #137102
2024-10-06 04:28:18 +00:00
fa9cd46d12 [export] Update swap's forward function (#137102)
Downstream APS code was failing to run the previously swapped module because of some fx.GraphModule forward function weirdness (P1594789677). So to fix this, I just attached a custom forward function which matches the unflattened module's forward function.

Differential Revision: [D63683422](https://our.internmc.facebook.com/intern/diff/D63683422/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137102
Approved by: https://github.com/avikchaudhuri
ghstack dependencies: #136191
2024-10-06 04:25:36 +00:00
52d7704b32 [export] Add optimization passes (#136191)
Added an optimization pass to the swap function which removes extraneous pytrees. Currently it removes the pytree flatten/unflatten calls between modules in very specific scenarios (all the inputs of one module go into the other).

Future work can be to remove the input pytree.flatten if the inputs go directly into an unflatten, and output pytree unflatten if the outputs are directly from a pytree.flatten.

Differential Revision: [D62879820](https://our.internmc.facebook.com/intern/diff/D62879820)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136191
Approved by: https://github.com/avikchaudhuri
2024-10-06 04:22:42 +00:00
ad4e91acfe [fsdp2] based on device, use stream and Event (#136843)
Currently FSDP2 supports only CUDA; for other backends that need to use FSDP2 it won't work, as streams and events are CUDA-based. To support other backends, use `_get_device_handle` by device type to get the class, and use this for streams and events.

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136843
Approved by: https://github.com/awgu
2024-10-06 04:17:47 +00:00
4061910ba2 Have Triton CPU backend respect max_autotune setting (#137276)
We would previously autotune regardless of the setting's value.
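For context, a minimal sketch of how the setting is typically enabled (via the public `torch.compile` mode and inductor config, which are not specific to this PR):

```python
import torch

def f(x, y):
    return x @ y

# Opt in via the compile mode...
compiled = torch.compile(f, mode="max-autotune")

# ...or toggle the underlying inductor config directly.
torch._inductor.config.max_autotune = True
```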

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137276
Approved by: https://github.com/jansel, https://github.com/desertfire
2024-10-06 03:03:33 +00:00
711dacf984 [FlexAttention] Support training bias for eager (#136910)
Add training bias eager implementation, take over the original POC from https://github.com/pytorch/pytorch/pull/136076

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136910
Approved by: https://github.com/Chillee
2024-10-05 19:34:57 +00:00
d073223663 turn CompilationCallbackHandler into dataclass (#137312)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137312
Approved by: https://github.com/Skylion007
ghstack dependencies: #137181
2024-10-05 19:03:28 +00:00
f54e142c58 Remove references to Rockset in trymerge (#137207)
For the migration to ClickHouse

But also Rockset is not used in trymerge anymore
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137207
Approved by: https://github.com/huydhn, https://github.com/ZainRizvi
2024-10-05 12:53:22 +00:00
40d8260745 [ROCm] remove caffe2 from hipify (#137157)
- Remove all "MasqueradingAsCUDA" files and classes.
- Do not rename "CUDA" classes to "HIP".

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137157
Approved by: https://github.com/eqy
2024-10-05 12:48:54 +00:00
ca38f28543 [FlexAttention] Adjust BlockMask if reusing the one created at larger seqlen (#137255)
Fixes #136232

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137255
Approved by: https://github.com/Chillee
2024-10-05 07:31:32 +00:00
4830bd0dd4 [Doc] Clarify that NaNs are not equal to each other (#137386)
Fixes https://github.com/pytorch/pytorch/issues/137337
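For illustration, a minimal sketch of the behavior being documented:

```python
import torch

nan = torch.tensor(float("nan"))
print(nan == nan)        # tensor(False): NaN compares unequal to itself
print(nan != nan)        # tensor(True)
print(torch.isnan(nan))  # tensor(True): the supported way to detect NaN
```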

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137386
Approved by: https://github.com/janeyx99, https://github.com/huydhn, https://github.com/kit1980
2024-10-05 06:19:12 +00:00
17718209ea fix specialization bug in unflatten + preserve_module_call_signature (#137363)
Summary: In unflatten, when we generate module calls when their signature has been preserved, we do not pass the original constant args. This can cause strange effects, e.g., if the module is swapped out with itself, we may suddenly go down a different path than the original, or even crash.

Test Plan: added a test

Reviewed By: angelayi

Differential Revision: D63913750

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137363
Approved by: https://github.com/angelayi
2024-10-05 04:26:02 +00:00
6d0d7b6e37 [CI][BE] Restore cuda memory allocator setting (#137383)
By adding `finally:` clause at the end of the test

Might fix https://github.com/pytorch/pytorch/issues/137098#issuecomment-2389172552

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137383
Approved by: https://github.com/ngimel
2024-10-05 04:16:38 +00:00
0067f586ba [audio hash update] update the pinned audio hash (#136968)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned audio hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136968
Approved by: https://github.com/pytorchbot
2024-10-05 04:08:59 +00:00
4d8b845797 Fix overflow error when torch.bincount() handles a large tensor (#136745)
Fixes #136720

the result in this case says:

```
Traceback (most recent call last):
  File "/Users/shenke/workspace/pytorch/mytest.py", line 9, in <module>
    result = torch.bincount(input)
             ^^^^^^^^^^^^^^^^^^^^^
RuntimeError: maximum value of input overflowed, it should be < 9223372036854775807 but got 9223372036854775807
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136745
Approved by: https://github.com/Skylion007
2024-10-05 04:04:48 +00:00
d6f340f66c Determine autograd engine ready queue based on InputMetadata instead of InputBuffer (#135633)
Thanks @awgu for raising this issue and the small repro

From offline discussion with @albanD, in the case where a forward returns multiple outputs with different devices, we'd want to select the ready queue based on the device of the first one. Even though this is somewhat arbitrary, we prefer this over deciding which ready queue to push based on whichever input buffer's we happen to compute last, which can vary depending on more factors and thus be harder to reason about. This is in theory bc-breaking, but it seems unlikely that someone would depend on this behavior.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135633
Approved by: https://github.com/albanD
2024-10-04 23:59:46 +00:00
79562f3af8 [ROCm] Modify hipify script to work with Windows paths (#135360)
This change modifies the `hipify_python.py` script to properly detect all directories and `include`/`ignore` paths during the hipification process on Windows, by changing the path syntax convention to a UNIX-like one.

Since in many places the script assumes a UNIX-like convention by using paths with forward slashes `/`, I decided to accommodate it by converting Windows paths to UNIX-like ones. Doing so limits the number of changes to the file. Moreover, this early-on unification lets the rest of the code keep its battle-tested Linux-like behaviour.

Another option would be to use the `Path` object from `pathlib` to represent all paths in the script; however, that would impact a broader share of the code and would hence require a more meticulous evaluation in terms of unaltered logic and edge cases.
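For illustration, a minimal sketch of the normalization idea (a hypothetical helper, not the script's actual code):

```python
from pathlib import PureWindowsPath

def to_unix_like(path: str) -> str:
    # Convert a Windows-style path to the forward-slash form that the rest
    # of the script already assumes.
    return PureWindowsPath(path).as_posix()

print(to_unix_like(r"aten\src\ATen\native\cuda"))  # aten/src/ATen/native/cuda
```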

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135360
Approved by: https://github.com/jeffdaily, https://github.com/jithunnair-amd
2024-10-04 23:43:43 +00:00
8b6774d381 Clarify comment for error handling of dict getattr (#137381)
Just a small nit
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137381
Approved by: https://github.com/malfet
2024-10-04 23:40:21 +00:00
f42f63ee86 Add option to disable operator profiling (#136838)
Summary:
X-link: https://github.com/pytorch/executorch/pull/5720

For smaller models the overhead of profiling ops might be prohibitively large (distorting the inference execution time significantly) so we provide users an option to disable op profiling and essentially only profile the important events such as inference execution time.

To disable operator profiling users need to do:
```
etdump_gen.set_event_tracer_profiling_level(executorch::runtime::EventTracerProfilingLevel::kNoOperatorProfiling);
```

Test Plan: Added test case.

Differential Revision: D61883224

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136838
Approved by: https://github.com/dbort
2024-10-04 22:56:00 +00:00
f2d174c051 Update CODEOWNERS (#136278)
Swap @gokulavasan for @divyanshk as codeowner of torch/utils/data/

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136278
Approved by: https://github.com/divyanshk, https://github.com/janeyx99, https://github.com/jansel
2024-10-04 22:42:05 +00:00
88e54de219 More nogil unsafe API fix (#137142)
Cover the PyDict APIs and confirms no update needed for PyModule one.
The rest was already covered in https://github.com/pytorch/pytorch/pull/136899

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137142
Approved by: https://github.com/eqy, https://github.com/Skylion007
2024-10-04 21:56:34 +00:00
e27c0048db Enable additional tests for MPS CI runs (#134356)
As part of the follow up for https://github.com/pytorch/pytorch/issues/133520, adapting existing unused tests for use in MPS CI runs. Focusing on nhwc & other memory formatting tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134356
Approved by: https://github.com/malfet, https://github.com/eqy, https://github.com/huydhn
2024-10-04 21:52:38 +00:00
7c1d93944e Proper handling of arguments passed by in kwargs inside zip_schema (#137311)
if the function is

```func(a, b, c) ```
and is called as
```func(a=1, b=.., c=..)```
before this change we did not iterate over a, b, and c, since they appear in kwargs; this diff fixes that issue.

This function is used in _inductor/ir.py to iterate over custom op arguments; when a custom pass makes changes and passes arguments as kwargs, we previously did not process them.
```
        for info, arg in torch._library.utils.zip_schema(schema, args, kwargs):
            handle_aliasing_and_mutation(info, arg)
```
Fix https://github.com/pytorch/pytorch/issues/137057
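A minimal sketch (my own illustration, not the actual implementation) of the positional-or-keyword matching this fix gives `zip_schema`:

```
# Hedged sketch: for each schema argument, take it from `args` by position
# if supplied there, otherwise fall back to `kwargs` by name, so arguments
# passed as keywords are no longer skipped.
def iter_schema_args(schema, args, kwargs):
    for i, info in enumerate(schema.arguments):
        if i < len(args):
            yield info, args[i]
        elif info.name in kwargs:
            yield info, kwargs[info.name]
```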

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137311
Approved by: https://github.com/zou3519
2024-10-04 21:50:31 +00:00
c0deec120f Fix resurrection logic to trigger early enough (#137267)
Fixes https://github.com/pytorch/pytorch/issues/136358

The bug here is that the Tensor object is actually 2 classes: `Tensor` from `_tensor.py` and `TensorBase` from c++.

Before this PR, they have the following gc methods:
Tensor:
 - tp_clear subtype_clear
 - tp_traverse THPVariable_subclass_traverse
 - tp_dealloc THPVariable_subclass_dealloc

TensorBase:
- tp_clear THPVariable_clear
- tp_traverse THPFunction_traverse (fake function that is just an error)
- tp_dealloc object_dealloc

The problem is that when clear is called on the Tensor, subtype_clear is going to clear the things owned by the "Tensor" type, in particular, its `__dict__` attribute, before delegating to the TensorBase clear where we detect that resurrection needs to happen and skip it.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137267
Approved by: https://github.com/ezyang, https://github.com/kshitij12345
2024-10-04 21:13:54 +00:00
bd48933323 Run docker builds on Meta account for now (#137358)
To fix
```
arn:aws:sts::391835788720:assumed-role/ghci-lf-github-action-runners-runner-role/i-096a3e2616140518b is not authorized to perform: ecr:InitiateLayerUpload on resource: arn:aws:ecr:us-east-1:308535385114:repository/pytorch/pytorch-linux-jammy-py3-clang18-asan because no resource-based policy allows the ecr:InitiateLayerUpload action
```
This seems to do the trick; see https://github.com/pytorch/pytorch/actions/runs/11185419440/job/31098258344
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137358
Approved by: https://github.com/huydhn
2024-10-04 20:39:56 +00:00
7b3378a39a [FSDP2] Relaxed even sharding requirement for all-gather extensions (#137005)
This PR relaxes the even sharding requirement for the all-gather extensions.

The `fsdp_pre_all_gather` now expects signature:
```diff
def fsdp_pre_all_gather(
    self,
    mesh: DeviceMesh,
+    outer_size: torch.Size,
+    outer_stride: Tuple[int, ...],
    module: nn.Module,
    mp_policy: MixedPrecisionPolicy,
) -> Tuple[Tuple[torch.Tensor, ...], Any]:
```
- Since no one is using this new signature yet, we should be safe to change it.
- Currently, the `outer_stride` will always be contiguous strides since FSDP2 only supports contiguous strides for now.
- For the uneven sharding case, the user is responsible to return a padded sharded tensor from `fsdp_pre_all_gather`. This is risky territory because if the user does not do so, then this may manifest as a NCCL timeout, as only the ranks with padding will error out. However, I am not aware of any way around this.
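A minimal sketch of an extension adopting the new signature (the tensor subclass and its `_local_shard` attribute are hypothetical; only the parameter list reflects this PR):

```
import torch
import torch.nn as nn

class MyShardedTensor(torch.Tensor):
    # Hedged sketch: only the signature below comes from the new contract.
    def fsdp_pre_all_gather(
        self,
        mesh,            # DeviceMesh
        outer_size: torch.Size,
        outer_stride,    # Tuple[int, ...]
        module: nn.Module,
        mp_policy,       # MixedPrecisionPolicy
    ):
        # For uneven sharding, the shard returned here must already be
        # padded so the all-gather reconstructs `outer_size` elements.
        return (self._local_shard,), None
```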

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137005
Approved by: https://github.com/weifengpy
2024-10-04 20:34:20 +00:00
f4b415da11 type _dynamo/replay_record.py (#137183)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137183
Approved by: https://github.com/Skylion007
2024-10-04 20:29:24 +00:00
6a6a8b17b8 handle state tensors in training ir path (#137240)
Summary: We had attribute assignment detection and handling of registered buffer assignments when using `aot_autograd`, but not when using just `make_fx`. Fixed.

Test Plan: expanded coverage of `test_state_tensors` to use `export` instead of `torch.export.export`

Differential Revision: D63802576

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137240
Approved by: https://github.com/tugsbayasgalan
2024-10-04 20:23:48 +00:00
f0ef7fddde Add ignored/unmaintained comment for capture_autograd_function flag (#137309)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137309
Approved by: https://github.com/jansel
ghstack dependencies: #136961
2024-10-04 20:02:37 +00:00
0878739b11 [AOTI] Add C shim for MKLDNN _convolution_pointwise (#137269)
Differential Revision: [D63875271](https://our.internmc.facebook.com/intern/diff/D63875271)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137269
Approved by: https://github.com/chenyang78, https://github.com/hl475
2024-10-04 19:42:05 +00:00
a968576777 Add lowering for aten.searchsorted (#135701)
Adds lowering for `aten.searchsorted`. This entails:

1. Adding support for multi-dimensional bucket tensors to `ops.bucketize`.
2. Adding support for striding to `ops.bucketize`.
3. Adding support for sorting tensors to `ops.bucketize`.
4. Adding a lowering for `aten.searchsorted.Tensor`.
5. Adding a basic decomposition for `aten.searchsorted.Scalar` that calls into the lowering for tensors.
6. Updating the meta-function for `aten.searchsorted` to properly check some of the sizing conditions.

Closes #135873
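A small usage sketch of the kind of call that now lowers through Inductor (the shapes and the `torch.compile` wrapper are my own illustration):

```
import torch

@torch.compile
def bucketize_rows(boundaries, values):
    # boundaries: (B, N) sorted along the last dim; values: (B, M)
    return torch.searchsorted(boundaries, values)

boundaries = torch.arange(0, 100, 10, dtype=torch.float32).repeat(4, 1)
values = torch.rand(4, 8) * 100
print(bucketize_rows(boundaries, values))
```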

Differential Revision: [D63766514](https://our.internmc.facebook.com/intern/diff/D63766514)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135701
Approved by: https://github.com/amjames, https://github.com/eellison, https://github.com/davidberard98
2024-10-04 19:26:05 +00:00
58ec6a360c force contiguity for all reduce (#137345)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137345
Approved by: https://github.com/xmfan
2024-10-04 19:16:38 +00:00
c0a930b104 Change to export_for_training in quantize_pt2e tests (#137233)
Summary:
as title

also change it in `prepare_pt2e()` docstring

Test Plan:
```
buck2 run 'fbcode//mode/dev-nosan' fbcode//caffe2/test:quantization_pt2e_qat

buck2 run 'fbcode//mode/dev-nosan' fbcode//caffe2/test/quantization:test_quantization
```

Differential Revision: D63345059

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137233
Approved by: https://github.com/tugsbayasgalan
2024-10-04 18:33:02 +00:00
22e19bd2d7 Add link to torch.compile the missing manual in troubleshooting (#137301)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137301
Approved by: https://github.com/svekars

Co-authored-by: Svetlana Karslioglu <svekars@meta.com>
2024-10-04 18:19:30 +00:00
79195b9453 [inductor] Add kwargs to bypass unexpected keyword argument error (#137329)
Summary:
I tried `TORCHINDUCTOR_PROFILE=1 TORCHINDUCTOR_PROFILE_OUTPUT=~/fbcode/profile.txt`.

TypeError: DebugAutotuner.run() got an unexpected keyword argument 'benchmark_run'

Test Plan: ci

Differential Revision: D63876103

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137329
Approved by: https://github.com/muchulee8
2024-10-04 18:17:56 +00:00
d2d14d14e3 [RELAND] Fix unlift to preserve aliased constants (#137310)
Differential Revision: [D63864743](https://our.internmc.facebook.com/intern/diff/D63864743)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137310
Approved by: https://github.com/avikchaudhuri
2024-10-04 18:15:52 +00:00
8b9cbf22c2 Enable regression test for add loop benchmarks (#136573)
The red dotted line is 1.5

<img width="1607" alt="Screenshot 2024-09-24 at 11 50 41 AM" src="https://github.com/user-attachments/assets/719a9a86-89af-4c58-8723-80a28f9bb517">

The expected value is taken from the average.
<img width="850" alt="Screenshot 2024-09-24 at 2 33 27 PM" src="https://github.com/user-attachments/assets/0f25e855-35ae-4031-86ef-1452ef6598de">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136573
Approved by: https://github.com/ezyang
2024-10-04 18:12:08 +00:00
ad240018f2 [PT2][Inductor][Reliability] Add back unit test for pad_mm with BF16 (#137231)
Summary: We added a unit test for the recently added pad_mm pattern in customized optimus D63040455, which resolves the long-computation-kernel issue for BF16 on A100.

Test Plan:
```
buck2 test mode/opt //caffe2/test/inductor:pad_mm -- test_pad_mm_bf16
```

Buck UI: https://www.internalfb.com/buck2/4dd4c90c-4a2a-4859-923c-a4008f78a1cd
Test UI: https://www.internalfb.com/intern/testinfra/testrun/9851624237127136
Network: Up: 100KiB  Down: 4.3GiB  (reSessionID-87f11454-d920-47af-9af5-39ca0572b7c6)
Jobs completed: 7079. Time elapsed: 3:34.3s.
Cache hits: 99%. Commands: 7061 (cached: 7024, remote: 19, local: 18)
Tests finished: Pass 2. Fail 0. Fatal 0. Skip 0. Build failure 0

Differential Revision: D63794727

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137231
Approved by: https://github.com/henrylhtsang
2024-10-04 17:49:55 +00:00
b2979f4382 Allow autocast in training ir export (#137287)
Summary: hardcode "val" field for autocast (similar to set_grad_enabled), to bypass the verifier check.

Test Plan: CI

Differential Revision: D63345767

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137287
Approved by: https://github.com/angelayi
2024-10-04 17:38:51 +00:00
42adadf2f2 [aotinductor] enable CUTLASS backend (#134379)
### Context
This PR allows CUTLASS kernels usage in AOTI. It does this by:
* For any CUTLASS kernels that win during autotuning, compile them as a .so & .o
* When creating the final model .so, link all the CUTLASS kernels .o files
* Make sure we codegen things correctly (argument dtypes and specify extern "C" linking for the CUTLASS kernel)

### Example
https://gist.github.com/ColinPeppler/e834fa2255c37e9444b6d540bf7bd04d#file-model-cpp-L548-L549

```
TORCH_LOGS="+output_code" python test/inductor/test_cutlass_backend.py -v -k test_max_autotune_cutlass_backend_regular_mm
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134379
Approved by: https://github.com/tenpercent, https://github.com/chenyang78
2024-10-04 17:32:41 +00:00
c7b0d4b148 raw_alloc ignores PYTORCH_NO_CUDA_MEMORY_CACHING (#131114)
raw_alloc is used by cudnn, miopen, thrust, and tunableop.  Without this PR, the env var for disabling the caching allocator will only partially work.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131114
Approved by: https://github.com/eqy, https://github.com/houseroad, https://github.com/albanD

Co-authored-by: Nichols A. Romero <nick.romero@amd.com>
2024-10-04 15:36:29 +00:00
cyy
67908e9111 Enable clang-tidy on torch/csrc/distributed/rpc (#137320)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137320
Approved by: https://github.com/Skylion007
2024-10-04 15:34:05 +00:00
15c3479db7 [AOTI] Fix _scaled_mm ABI-compatible codegen (#137132)
Summary: Similar to https://github.com/pytorch/pytorch/pull/137008, but for supporting _scaled_mm in the ABI-compatible mode.

Differential Revision: [D63757729](https://our.internmc.facebook.com/intern/diff/D63757729)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137132
Approved by: https://github.com/chenyang78
ghstack dependencies: #137008
2024-10-04 14:05:18 +00:00
5d24ea81d3 [AOTI] Fix cpp wrapper codegen for _scaled_mm (#137008)
Summary: Fixes https://github.com/pytorch/pytorch/issues/136209. Because _scaled_mm has an out variant, the generated cpp fallback call should call _scaled_mm_out. The ABI-compatible mode needs more work.

Differential Revision: [D63757728](https://our.internmc.facebook.com/intern/diff/D63757728)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137008
Approved by: https://github.com/hl475
2024-10-04 14:02:46 +00:00
f56f7476d3 Revert "Add meta functions for lerp, addcmul, and addcdiv. (#136909)"
This reverts commit e4b98b11493914769d15ca8b124c0b5fa1fdd364.

Reverted https://github.com/pytorch/pytorch/pull/136909 on behalf of https://github.com/albanD due to breaks trunk jobs ([comment](https://github.com/pytorch/pytorch/pull/136909#issuecomment-2393774694))
2024-10-04 14:01:54 +00:00
cd17b2645c Revert "[Distributed] Fix extra context on device 0 (#135273)"
This reverts commit a93d3873e97973fbc0009245579ee4e4fa7f155a.

Reverted https://github.com/pytorch/pytorch/pull/135273 on behalf of https://github.com/albanD due to Broken trunk distributed ci ([comment](https://github.com/pytorch/pytorch/pull/135273#issuecomment-2393772987))
2024-10-04 13:58:57 +00:00
5509207543 Revert "[PyTorch] Port ExecuTorch bfdot improvement back to ATen BlasKernel (#136331)"
This reverts commit 592e3a3d4069029946ec4c8d103a313806c53a88.

Reverted https://github.com/pytorch/pytorch/pull/136331 on behalf of https://github.com/albanD due to Breaks aarch64 builds, see link below ([comment](https://github.com/pytorch/pytorch/pull/136331#issuecomment-2393760135))
2024-10-04 13:52:37 +00:00
e80f47fb4d Pass special arguments to user-defined Triton kernels if required (#137236)
Summary:

Special autotuning configs like `num_warps` and `num_stages` can be passed to the kernel as parameters. The `config.all_kwargs()` call [here](762a7d197c/python/triton/runtime/autotuner.py (L106)) in the Triton code includes those special configs (names and values) among the potential arguments to the kernel. [Here](762a7d197c/python/triton/runtime/jit.py (L613)) some of those may be included in the actual kernel arguments, given that their names are present among the kernel parameters.

This PR replicates this behavior in user-defined Triton kernel compilation in PT2. Resolves #136550.

Test Plan:

```
$ python test/inductor/test_triton_kernels.py -k test_triton_kernel_special_params
inductor []
inline_call []
stats [('calls_captured', 2), ('unique_graphs', 1)]
aot_autograd [('total', 1), ('ok', 1)]
.inductor []
inline_call []
stats [('calls_captured', 2), ('unique_graphs', 1)]
.inductor [('fxgraph_cache_bypass', 1), ('pattern_matcher_count', 1), ('pattern_matcher_nodes', 1), ('extern_calls', 1), ('possibly_missed_reinplacing_opportunities', 0), ('possibly_missed_reinplacing_bytes', 0)]
inline_call []
stats [('calls_captured', 2), ('unique_graphs', 1)]
aot_autograd [('total', 1), ('ok', 1)]
.inductor []
inline_call []
stats [('calls_captured', 2), ('unique_graphs', 1)]
aot_autograd [('total', 1), ('ok', 1)]
.inductor []
inline_call []
stats [('calls_captured', 2), ('unique_graphs', 1)]
.inductor [('benchmarking.TritonBenchmarker.benchmark_gpu', 2), ('fxgraph_cache_bypass', 1), ('pattern_matcher_count', 1), ('pattern_matcher_nodes', 1), ('extern_calls', 1), ('benchmarking.TritonBenchmarker.triton_do_bench', 1), ('possibly_missed_reinplacing_opportunities', 0), ('possibly_missed_reinplacing_bytes', 0)]
inline_call []
stats [('calls_captured', 2), ('unique_graphs', 1)]
aot_autograd [('total', 1), ('ok', 1)]
.
----------------------------------------------------------------------
Ran 6 tests in 6.283s

OK
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137236
Approved by: https://github.com/zou3519
2024-10-04 07:36:55 +00:00
cyy
6327a71880 [Environment Variable][2/N] Use thread-safe setenv wrapper (#124485)
This follows #119449 to make setenv thread-safe.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124485
Approved by: https://github.com/eqy
2024-10-04 07:30:51 +00:00
6dcd773c57 [export] clean up dynamic markers from tensors (#137230)
Summary:
When we handle dynamic shapes markers like `Dim.AUTO, Dim.DYNAMIC`, we use dynamo decorators, attaching set attributes to the export input tensors, e.g. `x._dynamo_dynamic_indices = set()`.

I thought this was fine, since it's done all the time with torch.compile, but it breaks some PT2Inference tests, specifically because unpickling a set attribute isn't possible with the C++ torch::jit::pickle_load call.

We've agreed that the PT2Inference side will clone sample inputs & pickle the original inputs to be safe, but this still establishes a nice invariant that user-facing decorators are both ignored & cleaned out in the lifecycle of an export call.
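A small sketch of that invariant (the module and shapes are my own; `Dim.AUTO` and the `_dynamo_dynamic_indices` attribute are the ones mentioned above):

```
import torch
from torch.export import Dim, export

class M(torch.nn.Module):
    def forward(self, x):
        return x * 2

x = torch.randn(4, 8)
ep = export(M(), (x,), dynamic_shapes={"x": {0: Dim.AUTO}})
# The marker set on the sample input while handling Dim.AUTO should be
# cleaned up again by the end of the export call.
print(hasattr(x, "_dynamo_dynamic_indices"))  # expected: False
```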

Test Plan: test_export

Differential Revision: D63773534

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137230
Approved by: https://github.com/avikchaudhuri
2024-10-04 06:50:45 +00:00
a408cfcbf1 [torch.compile] torch.vmap supports dynamic shapes + enable flex attention create_block_mask dynamic shapes (#137163)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137163
Approved by: https://github.com/Chillee
2024-10-04 05:16:04 +00:00
40b09edd87 Add back DistributedDataParallel types that were lost when pyi was removed (#136835)
When the stub file `nn/parallel/distributed.pyi` was removed (#88701), some types that existed are no longer available. This pull request adds them back.

Just for reference, these types are used in pytorch-lightning's LightningCLI. Command line interfaces are created automatically, and having type hints make them nicer.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136835
Approved by: https://github.com/kwen2501
2024-10-04 04:44:20 +00:00
97634e4f82 Rollout infra for executorch migration to training IR (#132703)
Title

Differential Revision: [D60432217](https://our.internmc.facebook.com/intern/diff/D60432217/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132703
Approved by: https://github.com/tarun292
2024-10-04 04:33:08 +00:00
f500cb43bb Fix torch.library.register_vmap (#137306)
We didn't support multiple levels of vmap. The main problem is that, during the batching rule, we need to exclude the vmap dispatch key (FuncTorchBatched), like our C++ batching rules do.
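A hedged sketch of the user-facing scenario (the custom op name and the trivial vmap rule are my own illustration):

```
import torch

@torch.library.custom_op("mylib::my_sin", mutates_args=())
def my_sin(x: torch.Tensor) -> torch.Tensor:
    return torch.sin(x)

def vmap_rule(info, in_dims, x):
    # my_sin is elementwise, so the batched input can be passed through
    # directly and the batch dim stays where it was.
    return my_sin(x), in_dims[0]

torch.library.register_vmap("mylib::my_sin", vmap_rule)

x = torch.randn(2, 3, 4)
# Two levels of vmap over the custom op -- the case this PR fixes.
out = torch.vmap(torch.vmap(my_sin))(x)
print(out.shape)
```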

Test Plan:
- new test
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137306
Approved by: https://github.com/Chillee
2024-10-04 03:46:35 +00:00
cfc51c858a type _dynamo/callback.py (#137181)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137181
Approved by: https://github.com/Skylion007
2024-10-04 03:28:52 +00:00
9670e9e5b0 Revert "Mark PyTorch module as no-gil valid and pythoncapi_compat.h (#136899)"
This reverts commit 4f93de895138cc3cb8c4383b480a2d0ecf407e1b.

Reverted https://github.com/pytorch/pytorch/pull/136899 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/136899#issuecomment-2392721534))
2024-10-04 03:28:31 +00:00
e4b98b1149 Add meta functions for lerp, addcmul, and addcdiv. (#136909)
This PR adds new meta functions for `lerp`, `addcmul`, and `addcdiv` (including their
respective inplace versions).

These functions only had refs implementations, which was being the root cause of a
significant overhead ([issue][1]) when running `AdamW` optimizer step on PyTorch/XLA
backend. Running the meta functions resulted in the following improvements:

- `lerp` calls: 1,550ms to 140ms (10x)
- `addcdiv` calls: 640ms to 350ms (1.8x)
- `addcmul` calls: 620ms to 300ms (2.05x)

[1]: https://github.com/pytorch/xla/issues/7923
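A quick sketch of what a meta kernel provides (shape/dtype propagation without real compute), using one of the ops listed above:

```
import torch

# On the "meta" device only metadata is computed, which is what lazy/tracing
# backends such as XLA rely on; the new meta functions make this path cheap.
a = torch.empty(1024, device="meta")
b = torch.empty(1024, device="meta")
c = torch.empty(1024, device="meta")
out = torch.addcmul(a, b, c, value=0.5)
print(out.shape, out.dtype, out.device)
```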

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136909
Approved by: https://github.com/jansel
2024-10-04 02:47:25 +00:00
a1f1f585ab clean up error_on_nested_jit_trace flag (#136961)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136961
Approved by: https://github.com/jansel
2024-10-04 02:07:54 +00:00
d32696249a [IntraNodeComm] fix a race condition in one-shot all-reduce (#137257)
One-shot all-reduce did not have a barrier at the end. It was possible for a rank to write to its p2p buffer for the next collective before another rank finished reading it for the previous collective.

Also removing the fuse-input-copy optimization. The synchronization complexity probably outweighs the saving.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137257
Approved by: https://github.com/Chillee
2024-10-04 01:41:14 +00:00
3d3b394e94 [MTIA](3/n) Implement CPU pins functions for MTIA hooks (#137283)
Summary: Link CPU pins function in MTIA hooks to the host allocator implementation

Test Plan:
signals
unit test in D63709424

Differential Revision: D63352770

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137283
Approved by: https://github.com/egienvalue
2024-10-04 01:26:21 +00:00
15e127bc3b [numpy2.0 compat] Fix test_parse_numpy_int_overflow for NumPy 2.0 (#137135)
NumPy now throws an OverflowError when trying to create np.uint64(-1)
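Illustration of the behavior difference the test now accounts for:

```
import numpy as np

try:
    val = np.uint64(-1)          # NumPy < 2.0: wraps around to 2**64 - 1
    print(val)
except OverflowError as e:       # NumPy >= 2.0: raises OverflowError
    print("OverflowError:", e)
```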

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137135
Approved by: https://github.com/Skylion007
2024-10-04 01:21:12 +00:00
13ec343afe clean up capture_func_transforms flag (#136960)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136960
Approved by: https://github.com/guilhermeleobas, https://github.com/jansel
2024-10-04 01:10:52 +00:00
6b9b2a126e Build clang18 image for ASAN tests (#128763)
Use the latest clang.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128763
Approved by: https://github.com/malfet
2024-10-04 00:53:56 +00:00
a93d3873e9 [Distributed] Fix extra context on device 0 (#135273)
This PR contains multiple fixes for issue https://github.com/pytorch/pytorch/issues/135279:

## First part:
Moves the GPU guard (`cudaSetDevice`) before the `currentStreamCaptureStatusMayInitCtx` call.
As its name suggests, it May Init Ctx.

## Second part:
Even with the above fix, additional contexts are still observed during Work object destruction, e.g.
```
work = dist.all_reduce(tensor, async_op=True)
time.sleep(5)  <-- no additional context yet
del work  <-- additional context shows up
```
### Debug process
Chasing it down to destruction of a `Future` object -- a member variable of `Work`.
Then further down to the following member of `Future`:
```
std::vector<c10::Event> events_;
```
When the `events_` are destroyed, we hit the road down to:
1f3a793790/c10/cuda/impl/CUDAGuardImpl.h (L106-L121)

When there is no "preset" CUDA context (**which is the case for python garbage collector**), line 112: `c10::cuda::GetDevice(&orig_device)` will set `orig_device` to 0. Then, at line 120, `c10::cuda::SetDevice(orig_device)` will "officially" set the context to device 0 --
**that's where rank 1, 2, ... can create extra context on device 0!**
### Solution
This PR adds an explicit destructor to `Future`. In this destructor, destroy each event with a device guard.

## Test
Added test_extra_cuda_context, implemented via
- `pynvml` (if available), or
- memory consumption check.

`python test/distributed/test_c10d_nccl.py -k test_extra_cuda_context`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135273
Approved by: https://github.com/fduwjj, https://github.com/wconstab, https://github.com/eqy
2024-10-04 00:44:02 +00:00
88e338f4dd [AOTI] Add C shim for MKLDNN _linear_pointwise (#136999)
Differential Revision: [D63851216](https://our.internmc.facebook.com/intern/diff/D63851216)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136999
Approved by: https://github.com/leslie-fang-intel, https://github.com/chenyang78, https://github.com/hl475
2024-10-04 00:35:10 +00:00
57c02e5a00 [BE] Use helper functions in mps_extension (#137313)
This should be a no-op change, i.e. it runs the same code, but replaces verbose ObjectiveC invocation with helper function from OperationUtils.h, which this example already depends on
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137313
Approved by: https://github.com/atalman
2024-10-04 00:26:38 +00:00
bc916a5537 [easy] for test_ck_backend enable RE & activate remaining tests for FBCode (#137305)
Differential Revision: D63859208

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137305
Approved by: https://github.com/muchulee8, https://github.com/chenyang78
2024-10-04 00:22:35 +00:00
cyy
60d19cb59e Enable clang-tidy on torch/csrc/distributed/autograd/* (#137180)
Enable clang-tidy on `torch/csrc/distributed/autograd/*` directory.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137180
Approved by: https://github.com/Skylion007
2024-10-03 23:49:23 +00:00
7e13e7dd7e Disallow FakeTensor.data_ptr access in eager mode (#137221)
Previously we raised a deprecation warning (beginning PyTorch 2.4). Now
that we are on 2.6, we're completing the deprecation and disallowing
this behavior.
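A small sketch of the now-disallowed pattern:

```
import torch
from torch._subclasses.fake_tensor import FakeTensorMode

with FakeTensorMode():
    t = torch.empty(3)      # a FakeTensor: metadata only, no storage
    try:
        t.data_ptr()        # eager access now raises instead of warning
    except Exception as e:
        print(type(e).__name__, e)
```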

Test Plan:
- tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137221
Approved by: https://github.com/albanD, https://github.com/eellison
2024-10-03 23:47:55 +00:00
cfcd0e1fe9 [ONNX] Update the faketensor documentation (#137292)
Update the faketensor documentation to reflect current usage.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137292
Approved by: https://github.com/shubhambhokare1, https://github.com/sdpython
2024-10-03 23:27:11 +00:00
4096ed7dc2 Migrate to training ir in quantization_pt2e_qat unittests (#137232)
Summary: Change capture_pre_autograd_graph to export_for_training in unit tests.

Test Plan:
```
buck2 run 'fbcode//mode/dev-nosan' fbcode//caffe2/test:quantization_pt2e_qat
```

Reviewed By: tugsbayasgalan

Differential Revision: D63336660

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137232
Approved by: https://github.com/angelayi
2024-10-03 22:57:04 +00:00
b44f25e1ba [CI] Move s390 binary build to its own workflow (#137304)
It was added by https://github.com/pytorch/pytorch/pull/125399 and takes 3 hours to finish.
Considering the limited number of runners, it often causes queueing; see:
<img width="402" alt="image" src="https://github.com/user-attachments/assets/5c67c1d6-af4c-4453-a089-aa1174513bfa">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137304
Approved by: https://github.com/kit1980, https://github.com/huydhn, https://github.com/atalman
2024-10-03 22:31:36 +00:00
54094c0c26 [inductor][user triton] Check size hints to determine indexing dtype (#137234)
Previously, all integer inputs to user-defined triton kernels were assumed to be int32. This would result in errors if your input was actually an int64.

This PR checks the value to determine which dtype to use for indexing: if it is known to be < int_max, then use int32 (and add guards if relevant); if we can't check (e.g. unbacked symint), then use int64.
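A hedged sketch (my own pseudocode, not the Inductor source) of the selection rule described above:

```
# Pick int32 indexing only when the value is provably below int32 max
# (adding a guard on that bound); otherwise fall back to int64, e.g. for
# unbacked symints whose size hints cannot be checked.
INT32_MAX = 2**31 - 1

def index_dtype_for(max_value):
    if max_value is not None and max_value < INT32_MAX:
        return "tl.int32"
    return "tl.int64"
```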

Differential Revision: [D63797975](https://our.internmc.facebook.com/intern/diff/D63797975)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137234
Approved by: https://github.com/eellison
2024-10-03 22:07:26 +00:00
c83178d894 Change to export_for_training in XNNPACK tests (#137238)
Summary: as title

Test Plan: CI

Differential Revision: D63344674

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137238
Approved by: https://github.com/tugsbayasgalan
2024-10-03 21:28:05 +00:00
ce14f1f0c9 [aoti] Accept constant inputs (#137197)
Fixes https://fb.workplace.com/groups/1028545332188949/posts/1056788036031345/?comment_id=1056790162697799&reply_comment_id=1057501845959964

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137197
Approved by: https://github.com/henrylhtsang, https://github.com/desertfire, https://github.com/hl475
2024-10-03 20:59:33 +00:00
eqy
46f158bfbc [cuDNN] Check shapes during graph capture in cuDNN CTCLoss (#130071)
Found out from #125952 about the existence of `_assert_async`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130071
Approved by: https://github.com/Skylion007

Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com>
2024-10-03 20:10:28 +00:00
592e3a3d40 [PyTorch] Port ExecuTorch bfdot improvement back to ATen BlasKernel (#136331)
ExecuTorch's fork of BlasKernel.cpp grew bfdot support, complete with demonstration that it helps. Port it back to PyTorch. Supersedes https://github.com/pytorch/pytorch/pull/127488 . Includes https://github.com/pytorch/executorch/pull/5444 .

Differential Revision: [D63045939](https://our.internmc.facebook.com/intern/diff/D63045939/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136331
Approved by: https://github.com/malfet, https://github.com/albanD
ghstack dependencies: #136445
2024-10-03 18:18:37 +00:00
c8a7da305b [PyTorch] Add attribute version of C10_ALWAYS_INLINE (#136445)
Sometimes (such as on a lambda), you need `__attribute__((always_inline))` but not `inline`.

Differential Revision: [D63266917](https://our.internmc.facebook.com/intern/diff/D63266917/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136445
Approved by: https://github.com/malfet
2024-10-03 18:18:37 +00:00
525f6715bc Revert "Fix unlift to unblock training IR + run_decomp on aliasing constants (#137162)"
This reverts commit f96020c246aec8514b945d530879635a03294f70.

Reverted https://github.com/pytorch/pytorch/pull/137162 on behalf of https://github.com/jovianjaison due to Sorry for reverting your changes but many jobs are failing with NameError: name _recursive_getattr is not defined + a Lint job fails ([comment](https://github.com/pytorch/pytorch/pull/137162#issuecomment-2392036062))
2024-10-03 18:17:56 +00:00
c7714b8d8d [FR] Fix duplicate output for the case when not all ranks join on collective (#137256)
As title: when testing on an internal case, we found that we produce very similar output for the error when certain ranks do not join one collective. This is because we didn't put all ranks into `candidate_ranks`, so they didn't get wiped out from the entries and were checked again.

Ideally, for the given case, we should report that this is an out-of-order case, because ranks 0 and 1 call all-to-all while all the rest call all-gather-base. But when we select entries to compare, we don't have a global view of the entries.

In this specific case, ranks 0 and 1 have a collective of PG 7 at entry 1130 with seq ID = 1130, while the other ranks have a collective of PG 0 at entry 1130 with seq ID = 2. It's hard to use the entry index to do the match because that assumption collapses once we later consider P2P, so for now we still defer it to users or to further downstream debugging to figure out. To make the message clearer, I also include both the seq ID and the record_id (aka the entry index) in the message. (That does not mean this is impossible to implement in the code; for example, we could subtract the maximum preceding P2P seq ID from every record_id, but users would still easily see the wrong order, so we don't think that logic is necessary for now.)

P1626755348

Differential Revision: [D63815335](https://our.internmc.facebook.com/intern/diff/D63815335/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137256
Approved by: https://github.com/c-p-i-o
2024-10-03 18:06:45 +00:00
adc48a5b52 Python CAPI cleanup (#137266)
This is unrelated to anything else, but as I was going through the code I fixed some bad patterns and a refcount bug (which is unlikely to cause any real issue tbh).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137266
Approved by: https://github.com/Skylion007
2024-10-03 17:55:48 +00:00
8bb8c3997b [inductor] parallel compile: add import of thread_safe_fork for internal (#137155)
Summary: We had a report of crashes in parallel compile subprocesses linked to reading justknobs. See https://fburl.com/workplace/14a4mcbh internally. This is a known issue with justknobs. It looks like we don't have a lot of control over evaluating knobs. Some are read in inductor (`pytorch/remote_cache:autotune_memcache_version`), but many are read by the triton compiler. According to this advice https://fburl.com/workplace/imx9lsx3, we can import thread_safe_fork, which installs some functionality to destroy some singletons before forking and re-enable them after. This approach works for the failing workload.

Test Plan: See D63719673 where the reporting user was kind enough to provide us with a local repro. Without the relevant import, we can reproduce the crash. With the import, the training runs successfully to completion.

Differential Revision: D63736829

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137155
Approved by: https://github.com/xmfan, https://github.com/eellison
2024-10-03 17:37:21 +00:00
f96020c246 Fix unlift to unblock training IR + run_decomp on aliasing constants (#137162)
When we populate the unlifted graph module, we actually only "unlift" constant tensor inputs, which is problematic because export de-duplicates aliasing constants. As a result, we only register one constant instead of two. This PR fixes that by querying the ep.constants table instead of ep.graph_signature.lifted_tensor_constants.

Differential Revision: [D63743111](https://our.internmc.facebook.com/intern/diff/D63743111)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137162
Approved by: https://github.com/pianpwk
2024-10-03 17:28:53 +00:00
4d3c0fc061 [AOTAutogradCache] add config for AOTAutograd remote cache (#137011)
Summary: This just adds a config option and a JK for turning on the remote AOTAutogradCache. It does not implement anything with the new options being passed in; that will come in the next diff.

This PR also changes the environment variable for turning on the local AOTAutogradCache to be more consistent with that of FXGraphCache: TORCHINDUCTOR_AUTOGRAD_CACHE

Test Plan: Existing tests should pass and should build

Reviewed By: oulgen

Differential Revision: D63321965

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137011
Approved by: https://github.com/oulgen
2024-10-03 16:03:47 +00:00
a569a8ac4c type _dynamo/external_utils.py (#137185)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137185
Approved by: https://github.com/Skylion007
2024-10-03 15:18:53 +00:00
b6cb174816 Fix serialization for torch.uint16, torch.uint32, torch.uint64 (#137184)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137184
Approved by: https://github.com/albanD
2024-10-03 14:56:11 +00:00
89b7a5d128 Implement AcceleratorHooksInterface's virtual functions deviceCount() and getCurrentDevice() for CUDA and XPU (#136752)
Fixes #136751

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136752
Approved by: https://github.com/albanD
2024-10-03 14:44:58 +00:00
63bbf712d8 Add py3.13t linux wheel build (#137127)
Builder PR required: https://github.com/pytorch/builder/pull/2001
Test PR: https://github.com/pytorch/pytorch/pull/136490/

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137127
Approved by: https://github.com/albanD
2024-10-03 13:13:48 +00:00
38114ec860 [async-tp] fix a race condition that can cause silent correctness issue (#137199)
Details described in https://github.com/pytorch/pytorch/issues/137171:

![image](https://github.com/user-attachments/assets/8247b4f1-7805-4585-9d72-05e9475f218b)

Fix: we introduce the following invariants in `_pipelined_all_gather_and_consume` and `_pipelined_produce_and_all2all`:
- Before any stream writes to/reads from p2p buffers, perform a barrier on channel 0 on the launch stream.
- After all streams completed writing to/reading from p2p buffers, perform a barrier on channel 0 on the launch stream.

NOTE: This fix only focuses on addressing the race condition. Some barriers are exposed, which can be hidden by computation, and we'll optimize them in subsequent PRs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137199
Approved by: https://github.com/weifengpy
2024-10-03 10:42:37 +00:00
f166d62764 Avoid __ne__ weakref comparison and use identity instead in cache_size.py (#135000)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135000
Approved by: https://github.com/anijain2305
2024-10-03 07:43:58 +00:00
bd9517c1ee cond_batch_rule with boolean pred (#135009)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135009
Approved by: https://github.com/guilhermeleobas, https://github.com/jansel, https://github.com/zou3519
2024-10-03 07:43:30 +00:00
0d1701f310 Revert "raw_alloc ignores PYTORCH_NO_CUDA_MEMORY_CACHING (#131114)"
This reverts commit 70019074806920f95976fedad775d7570294f635.

Reverted https://github.com/pytorch/pytorch/pull/131114 on behalf of https://github.com/PaliC due to failing internal builds ([comment](https://github.com/pytorch/pytorch/pull/131114#issuecomment-2390615007))
2024-10-03 06:22:55 +00:00
87bf2a8428 [compiled autograd] initialize cudagraph tls from context manager (#136735)
FIXES https://github.com/pytorch/pytorch/issues/126934. Cudagraphs TLS is initialized on module import, but compiled autograd codepaths might not import it. This causes problems because autograd/compiled autograd will restore TLS state, and in this case will restore the TLS to an uninitialized state

Should fix flaky cudagraph tests: https://github.com/pytorch/pytorch/issues/131663, https://github.com/pytorch/pytorch/issues/132108

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136735
Approved by: https://github.com/BoyuanFeng, https://github.com/eellison
ghstack dependencies: #136059
2024-10-03 06:22:11 +00:00
b86269fab5 Unify cpp_extension build directory removal (#136059)
Keeps existing default directory clearing logic, even though it fails when TORCH_EXTENSIONS_DIR is set. To properly clear, we'd need to track all the folders we compiled the extensions to.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136059
Approved by: https://github.com/ezyang, https://github.com/albanD
2024-10-03 06:22:11 +00:00
55c343fa3a [DTensor] Register replication strategy for a few upsampling interpolate ops (#137201)
To unblock Llama 3.2 vision's use case to resize positional embeddings for fine-tuning. Context in [workplace post](https://fb.workplace.com/groups/319878845696681/permalink/1271172040567352/).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137201
Approved by: https://github.com/XilunWu
2024-10-03 03:45:39 +00:00
84cac3585d Move _is_static_problem to mm_common (#137150)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137150
Approved by: https://github.com/eellison
2024-10-03 02:55:43 +00:00
5c0ce8d0a6 Skip Flaky Test: for #134602 (#137226)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137226
Approved by: https://github.com/cpuhrsch
2024-10-03 01:53:59 +00:00
b3953ff25e [inductor] Reduce block sizes when using Triton CPU backend (#136612)
This greatly reduces compile time; TorchBench models that were previously 50-100x slower (vs the cpp backend) are now ~20x slower. More work needs to be done on the Triton side, but smaller block sizes will still be helpful.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136612
Approved by: https://github.com/desertfire
ghstack dependencies: #135342
2024-10-03 01:48:32 +00:00
4513fb5c53 [Inductor] Use parametrize to break down some unit tests (#137156)
Summary: To address the issue that some tests are marked as slow, see https://github.com/pytorch/pytorch/issues/136940#issuecomment-2387227598

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137156
Approved by: https://github.com/eellison
2024-10-03 01:43:36 +00:00
7631a04081 [c10d] Fix the device query story of ProcessGroup (#136790)
Function `_get_pg_default_device` is being used outside of `distributed_c10d.py`.

A concern is that people may not be aware of what it actually does, due to bad naming of this function:
`Return the device to use with ``group`` for control flow usage (object collectives, barrier).`

The remediation is as follows:
- Added a deprecation warning to `_get_pg_default_device`;
- Added a private function `_get_object_coll_device` to undertake what it does;
- Added a `_device_capability` function for users who want to query the device support of a PG -- it returns a plain list, no more "default" choice.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136790
Approved by: https://github.com/H-Huang
2024-10-03 01:36:22 +00:00
cd5d1fe015 unflatten with specialized graphs per submodule call (#137013)
Previously we were making a fairly restrictive assumption when unflattening an exported program: for any submodule, we would assert that the graph of every call to that submodule must be the same. This assertion is load-bearing, i.e., if we simply remove the assertion then we can get incorrect results, as shown by the following example.

```
    class N(torch.nn.Module):
        def forward(self, x, b):
            if b:
                return x + 1
            else:
                return x + 2

    class M(torch.nn.Module):
        def __init__(self):
            super().__init__()
            self.n = N()

        def forward(self, x):
            x0 = x + 3
            x1 = self.n(x0, True)
            x2 = x1 + 4
            x3 = self.n(x2, False)
            return x3 + 5

    m = M()
    inp = (torch.ones(1),)
    print(m(*inp))  # tensor([16.])
    ep = torch.export.export(m, inp)
    print(ep.module()(*inp))  # tensor([16.])

    unflattened = torch.export.unflatten(ep)
    print(unflattened(*inp))  # tensor([15.])
```

However, this goes against the spirit of specializing graphs when exporting: we should *expect* that for every call to a submodule we *might* generate a different graph. The goal of this PR is to fix unflattening to handle multiple specialized graphs corresponding to multiple calls to the same submodule.

The idea is simple: for every call to a child module `foo`, we will create potentially different child modules `foo`, `foo@1`, `foo@2`, etc. and use those names as targets in `callmodule` instructions in the parent graph. An immediate consequence of this is that the list of fqns in an unflattened module may not be the same as an exported module. Note that all these variants share the same parameters / buffers, so that multiple calls to the same submodule can share state as expected.

However, as described so far this scheme may end up with needlessly too many submodules. Thus, between calls to the same submodule, if graphs are equal then we optimize away the extra submodules and reuse call names as much as possible. Moreover, when submodules are shared across fqns, we also try to de-duplicate graphs corresponding to their calls as much as possible. Note that no matter what, information about which submodule was called is still preserved, so that if a submodule has to be swapped with another, one can still find all calls to the former submodule and replace them with calls to the latter.

A note on the choice of naming scheme for call names: instead of generating "sibling" modules `foo@1`, `foo@2`, etc. for `foo`, we had considered generating "children" modules `foo._1`, `foo._2`, etc. of `foo`. However this can cause spurious cycles when de-duplicating graphs. E.g., suppose that `foo` is an alias for `bar._1` and `foo._1` is an alias for `bar`, then we must either introduce a cycle or drop the opportunity to optimize. Another idea would be to make `foo` a dummy module that contains `foo._0` corresponding to the first call, but this necessitates too many changes to existing tests and hurts the common case.

Differential Revision: D63642479

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137013
Approved by: https://github.com/pianpwk
2024-10-03 00:55:44 +00:00
6241006c28 Fix dependency on filesystem on Linux (#137209)
Similar to: https://github.com/pytorch/pytorch/pull/134494
We are seeing a comeback of https://github.com/pytorch/pytorch/issues/133437 due to the use of filesystem on Linux.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137209
Approved by: https://github.com/kit1980, https://github.com/malfet
2024-10-03 00:18:28 +00:00
235f7e06f4 [CI] upload_metrics function to upload to s3 instead of dynamo (#136799)
* Upload_metrics function to upload to ossci-raw-job-status bucket instead of dynamo
* Moves all added metrics to a field called "info" so ingesting into database table with a strict schema is easier
* Removes the dynamo_key field since it is no longer needed
* Removes the concept of reserved metrics, since they cannot be overwritten by user added metrics anymore
* Moves s3 resource initialization behind a function so import is faster
---
Tested by emitting a metric during run_test and seeing that documents got added to s3
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136799
Approved by: https://github.com/ZainRizvi
2024-10-02 23:19:28 +00:00
2c9e194e23 Revert "[FSDP2] support torch._foreach_copy_(float8) for fully_shard(Float8Linear) (#135955)"
This reverts commit b50b3b32191e7192a28c54a417891f24df4e4dda.

Reverted https://github.com/pytorch/pytorch/pull/135955 on behalf of https://github.com/PaliC due to breaking internal tests ([comment](https://github.com/pytorch/pytorch/pull/135955#issuecomment-2389810936))
2024-10-02 22:46:31 +00:00
bb03ef7aca [FlexAttention] Fix max-autotune when captured buffers are View nodes (#137204)
## Summary

Originally reported in https://github.com/pytorch-labs/attention-gym/issues/45

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137204
Approved by: https://github.com/Chillee, https://github.com/BoyuanFeng
2024-10-02 22:19:33 +00:00
759cd73adb [Profiler] Update Kineto Submodule (#137137)
Summary: Updating commits from Aug 7, 2024 to Sep 26, 2024

Test Plan: Phabricator + OSS CI

Differential Revision: D63723255

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137137
Approved by: https://github.com/aaronenyeshi
2024-10-02 22:19:28 +00:00
e9e5d767b6 [inductor] Fix build_paths usage in config.py (#137187)
Summary: In https://github.com/pytorch/pytorch/pull/136234 we accidentally used the old version of `build_paths`, but in https://github.com/pytorch/pytorch/pull/136952 the API slightly changed. This diff addresses that issue by updating the API usage.

Test Plan: CI

Reviewed By: ColinPeppler

Differential Revision: D63764809

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137187
Approved by: https://github.com/ColinPeppler
2024-10-02 22:06:02 +00:00
e95b230fd8 Fix NJT serialization (#137031)
Fixes #129366

Since NJT has custom serialization logic, we need an NJT-specific fix to clear out cached sizes / strides PyCapsules. Eventually, we should switch NJT to use the default serialization logic, but this depends on #125622 being addressed.

This PR also makes serialization more complete by explicitly handling `lengths`, `ragged_idx`, and the `metadata_cache`, ensuring correct operation for both contiguous and non-contiguous NJTs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137031
Approved by: https://github.com/soulitzer
ghstack dependencies: #137030
2024-10-02 21:41:35 +00:00
eqy
be423a8480 [SDPA] Bump grad_query fudge factor for Flash Attention (#135711)
Tolerance issue for small GPUs (e.g., A16, A2).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135711
Approved by: https://github.com/Skylion007, https://github.com/drisspg
2024-10-02 21:35:00 +00:00
36fb342ffd Check for fused kernel before inplace update (#137042)
Summary:
Given an op with a pair (output buffer, input buffer) from that op, we consider marking the output buffer as inline. However, if the parent of the input buffer and the current op are going to be fused, then we don't want to mark the output buffer as inline. This change checks that criterion and skips inlining in that case.

Test Plan:
New unit test "layer_norm_should_not_inplace" runs LayerNorm and checks for no "in_out" pointers.

Fixes #120217

Here's a diagram of the issue:
![Inline+Fusion](https://github.com/user-attachments/assets/c03308d8-fdbf-40a0-a46d-964ece5f9e6d)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137042
Approved by: https://github.com/eellison
2024-10-02 21:14:34 +00:00
a3f3773477 Make PT2E work with both IR simultaneously (#135769)
Summary: as title

Test Plan:
```
buck2 run 'fbcode//mode/dev-nosan' fbcode//caffe2/test:quantization_pt2e_qat
```

Differential Revision: D62449830

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135769
Approved by: https://github.com/angelayi
2024-10-02 21:05:22 +00:00
4a9225fa1f improve get_schedule_class() (#137103)
Small change to make `get_schedule_class()` accept case-insensitive schedule names.
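Usage sketch (the import path and the "GPipe" schedule name are my assumptions; only the case-insensitive lookup is what this PR adds):

```
from torch.distributed.pipelining.schedules import get_schedule_class

# Both spellings should now resolve to the same schedule class.
assert get_schedule_class("GPipe") is get_schedule_class("gpipe")
```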

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137103
Approved by: https://github.com/kwen2501
2024-10-02 20:08:25 +00:00
2d465e4d1d [non ghstack] Init threadpool with user defined num_threads before default (#137051)
Very similar to https://github.com/pytorch/pytorch/pull/136793, but adds back `pool->set_thread_count` call as it is still necessary (I am guessing due to the mutex)

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137051
Approved by: https://github.com/albanD
2024-10-02 20:02:30 +00:00
59d7cf7342 Add _dynamo.config inline_inbuilt_nn_modules and specialize_float logging (#137139)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137139
Approved by: https://github.com/ezyang
2024-10-02 19:58:38 +00:00
2b329d3bf1 Fix typo in _normalize ref (#137079)
I think this should basically make no difference numerically, but it does have some ramifications on things like CSE.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137079
Approved by: https://github.com/Skylion007
ghstack dependencies: #136826, #137043, #137049, #137065
2024-10-02 19:06:48 +00:00
6374a19a6e Fix wrapper subclass serialization with custom sizes / strides (#137030)
Fixes #130154

This PR takes the strategy outlined in the above issue and clears out any cached sizes / strides PyCapsules before serialization. This affects the default subclass serialization logic.

The PyCapsule issue also affects `deepcopy`, so that's fixed here as well.

Note: I originally tried utilizing a context manager to remove / restore cached PyCapsules after serialization, but in practice the state returned from `_reduce_ex_internal()` references the actual `tensor.__dict__()`, so the problem persists once the cached values are restored. Instead, we have to be careful to remove the cached values in the right place so they're not re-cached when pulling out size / stride information for serialization.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137030
Approved by: https://github.com/albanD
2024-10-02 18:55:03 +00:00
8962610247 [BE][clang-format] make macro PyObject_HEAD_INIT(type) and PyVarObject_HEAD_INIT(type, size) have its own line (#136949)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136949
Approved by: https://github.com/albanD, https://github.com/eqy
ghstack dependencies: #136945
2024-10-02 18:39:22 +00:00
89c37be6b7 [BE][clang-format] make macro PyObject_HEAD have its own line (#136945)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136945
Approved by: https://github.com/albanD
2024-10-02 18:39:21 +00:00
54f50f19eb [dtensor][experimental] expose DTensor Context Parallel API (#137038)
**Summary**
expose experimental Context Parallel API `torch.distributed.tensor.experimental._attention.context_parallel` to module `torch.distributed.tensor.experimental`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137038
Approved by: https://github.com/wz337, https://github.com/fegin
2024-10-02 18:00:23 +00:00
4559cddaf9 Revert "Add py3.13t linux wheel build (#137127)"
This reverts commit 6b7adc12140d3073c5700cc1c48998556489857e.

Reverted https://github.com/pytorch/pytorch/pull/137127 on behalf of https://github.com/jovianjaison due to Sorry for reverting your changes but 2 jobs are failing ([comment](https://github.com/pytorch/pytorch/pull/137127#issuecomment-2389250791))
2024-10-02 17:44:42 +00:00
b50b3b3219 [FSDP2] support torch._foreach_copy_(float8) for fully_shard(Float8Linear) (#135955)
This PR unblocks a unit test with a single Float8Linear module. It fixes the following error:
```
torch._foreach_copy_(foreach_copy_dsts, all_gather_inputs)
[rank0]:E0913 13:44:29.829000 2179476 torch/testing/_internal/common_distributed.py:671] RuntimeError: "foreach_tensor_copy" not implemented for 'Float8_e4m3fn'
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135955
Approved by: https://github.com/vkuzo, https://github.com/eqy
2024-10-02 17:26:45 +00:00
c318bafe9c [inductor mkldnn test][BE] Use parametrize to shorten test run time (#137153)
Summary:
Tests in test_mkldnn_pattern_matcher.py can take too long to finish. Splitting them into smaller tests, using `parametrize`.

I guess this means this test file has some refactoring opportunities as well. A next step would be to parametrize the add functions.

Differential Revision: D63723925

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137153
Approved by: https://github.com/desertfire
2024-10-02 17:20:27 +00:00
466623fb51 [CI] Support for CI GPU test and benchmark on containers (#137169)
Renames the arc references to container, and adds the changes required so that CI that requires a GPU can run on containers.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137169
Approved by: https://github.com/huydhn
2024-10-02 17:10:59 +00:00
e3fd4d796f [CI] Skip sccache for nvcc builds when building for A100 (#137170)
There is an unknown issue with nvcc builds and sccache; it crashes with:

```
      /opt/cache/bin/sccache /usr/local/cuda-12.1/bin/nvcc -forward-unknown-to-host-compiler -DUSE_C10D_GLOO -DUSE_C10D_MPI -DUSE_C10D_NCCL -DUSE_DISTRIBUTED -DUSE_RPC -DUSE_TENSORPIPE -Dfbgemm_gpu_py_EXPORTS -I/tmp/pip-install-893ub5fd/fbgemm-gpu_f79a3c2737924c478e50ea29fedfa172/fbgemm_gpu -I/tmp/pip-install-893ub5fd/fbgemm-gpu_f79a3c2737924c478e50ea29fedfa172/fbgemm_gpu/include -I/tmp/pip-install-893ub5fd/fbgemm-gpu_f79a3c2737924c478e50ea29fedfa172/fbgemm_gpu/../include -I/tmp/pip-install-893ub5fd/fbgemm-gpu_f79a3c2737924c478e50ea29fedfa172/fbgemm_gpu/../third_party/asmjit/src -I/tmp/pip-install-893ub5fd/fbgemm-gpu_f79a3c2737924c478e50ea29fedfa172/fbgemm_gpu/../third_party/cpuinfo/include -isystem /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/include -isystem /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -isystem /usr/local/cuda-12.1/include -DONNX_NAMESPACE=onnx_c2 -gencode arch=compute_50,code=sm_50 -gencode arch=compute_52,code=sm_52 -Xcudafe --diag_suppress=cc_clobber_ignored,--diag_suppress=field_without_dll_interface,--diag_suppress=base_class_has_different_dll_interface,--diag_suppress=dll_interface_conflict_none_assumed,--diag_suppress=dll_interface_conflict_dllexport_assumed,--diag_suppress=bad_friend_decl --expt-relaxed-constexpr --expt-extended-lambda -O3 -DNDEBUG -std=c++17 -Xcompiler=-fPIC -D_GLIBCXX_USE_CXX11_ABI=1 --expt-relaxed-constexpr -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ -MD -MT CMakeFiles/fbgemm_gpu_py.dir/src/sparse_ops/sparse_index_select.cu.o -MF CMakeFiles/fbgemm_gpu_py.dir/src/sparse_ops/sparse_index_select.cu.o.d -x cu -c /tmp/pip-install-893ub5fd/fbgemm-gpu_f79a3c2737924c478e50ea29fedfa172/fbgemm_gpu/src/sparse_ops/sparse_index_select.cu -o CMakeFiles/fbgemm_gpu_py.dir/src/sparse_ops/sparse_index_select.cu.o
      sccache: error: failed to execute compile
      sccache: caused by: error reading compile response from server
      sccache: caused by: Failed to read response header
      sccache: caused by: failed to fill whole buffer
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137170
Approved by: https://github.com/huydhn
2024-10-02 17:07:24 +00:00
d4cf90d282 [BE] [CI] Skip clean gha workspace if CI is running in a container for checkout-pytorch (#137168)
For the reusable action checkout-pytorch, skip cleaning the workspace when running from a container environment.

The motivation for this change is twofold:
* There is no need for cleanup when running in ephemeral containers, as any changes will be discarded when the docker container is terminated;
* In the specific case of GITHUB_WORKSPACE, to enable sharing it between multiple containers, it needs to be mounted with `-v`. This rules out running `rm -r` to delete this mount path;

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137168
Approved by: https://github.com/huydhn
2024-10-02 17:04:50 +00:00
af3e25fea7 remove capture_autograd_function flag (#136959)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136959
Approved by: https://github.com/jansel
2024-10-02 16:59:19 +00:00
bcaa0f5ee9 [CI] Remove nanogpt from perf smoke test (#137176)
Summary: nanogpt's performance is not stable. Remove it from the perf smoke test. We may want to use another test instead.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137176
Approved by: https://github.com/eellison
2024-10-02 16:35:04 +00:00
7001907480 raw_alloc ignores PYTORCH_NO_CUDA_MEMORY_CACHING (#131114)
raw_alloc is used by cudnn, miopen, thrust, and tunableop.  Without this PR, the env var for disabling the caching allocator will only partially work.
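A minimal sketch of using that env var (assumption: it must be set before CUDA is first touched for it to take effect):

```
import os

# Disable the CUDA caching allocator; after this PR the raw_alloc path used by
# cudnn/miopen/thrust/tunableop is expected to respect the flag as well.
os.environ["PYTORCH_NO_CUDA_MEMORY_CACHING"] = "1"

import torch  # imported after setting the env var, as an assumed precaution

if torch.cuda.is_available():
    x = torch.randn(1024, device="cuda")  # allocation goes straight to the driver
    print(x.sum().item())
```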

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131114
Approved by: https://github.com/eqy, https://github.com/houseroad, https://github.com/albanD

Co-authored-by: Nichols A. Romero <nick.romero@amd.com>
2024-10-02 16:27:15 +00:00
a954a9ea75 [Inductor] External callable registration API for Matmul tuning candidates (#130774)
Fixes #[130769](https://github.com/pytorch/pytorch/issues/130769)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130774
Approved by: https://github.com/jansel

Co-authored-by: Jason Ansel <jansel@meta.com>
2024-10-02 15:38:10 +00:00
af86a6fdba [dynamo][user-defined-class] Fallback when object.__new__ fails (#137044)
Seen in https://github.com/vllm-project/vllm/pull/8949

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137044
Approved by: https://github.com/jansel
2024-10-02 14:15:36 +00:00
d29094888b Use torch.Stream&torch.Event for Dynamo capature (#134850)
# Motivation
This PR aims to solve the multiple inheritance problem.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134850
Approved by: https://github.com/yf225, https://github.com/EikanWang
2024-10-02 14:15:33 +00:00
bf73af4b4e dont let partitioner think it can fuse pointwise ops into user triton kernels (#136878)
Previously if we had a graph like:
```
        triton_kernel_wrapper_functional_proxy = triton_kernel_wrapper_functional(...)
        getitem: "f32[3][1]cuda:0" = triton_kernel_wrapper_functional_proxy['out_ptr']
        getitem_1: "f32[3][1]cuda:0" = triton_kernel_wrapper_functional_proxy['out2_ptr']
        sigmoid: "f32[3][1]cuda:0" = torch.ops.aten.sigmoid.default(getitem_1)
        mul: "f32[3][1]cuda:0" = torch.ops.aten.mul.Tensor(tangents_1, sigmoid)
```

The partitioner would assume that the `sigmoid()` could be fused into either its user (the pointwise mul), or its producer (the user triton kernel). This could lead to a bad partitioning:

(1) If the partitioner thinks we can fuse the sigmoid with its producer triton kernel, we would keep the sigmoid compute in the forward, and have to generate two separate kernels in the forward (user triton kernel, dedicated sigmoid kernel)

(2) If the partitioner puts the sigmoid in the backward instead, we could fuse it with an existing backward kernel (the mul with a tangent)

Reviewed By: embg

Differential Revision: D63551393

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136878
Approved by: https://github.com/zou3519
2024-10-02 13:52:44 +00:00
5c2c3ca10b [Inductor] Fix test_conv2d_unary_cpu_cpp_wrapper failure (#137158)
Summary: test_conv2d_unary_cpu_cpp_wrapper is failing on ciflow/slow because of mis-handling of inf. This PR fixes that.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137158
Approved by: https://github.com/chenyang78
2024-10-02 13:21:35 +00:00
d117ec1d6e [3/3][Inductor] Make CK work in FBCode (#136234)
Summary:
# Context
Goal: Enable CK for Inductor in FBCode

We split this stack into three diffs to help with review & in case we need to revert anything.

# This Diff
* Gets us to have CK kernels as an option for GEMM autotuning in Inductor.

Reviewed By: zjing14

Differential Revision: D62662705

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136234
Approved by: https://github.com/tenpercent, https://github.com/chenyang78
2024-10-02 12:17:38 +00:00
6b7adc1214 Add py3.13t linux wheel build (#137127)
Builder PR required: https://github.com/pytorch/builder/pull/2001
Test PR: https://github.com/pytorch/pytorch/pull/136490/

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137127
Approved by: https://github.com/albanD
2024-10-02 11:59:33 +00:00
8c29a0dd0e [pipelining] Clean up dead code (#136804)
The 'set_requires_grad' dict appears to always be full of "False" values,
and we always set requires_grad based on the value of 'has_backward'.

Setting of the requires_grad field was being done repeatedly during
get_fwd_recv_ops, but it should be done just once, so move it to the
function that creates the recv buffers in the first place.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136804
Approved by: https://github.com/kwen2501
2024-10-02 11:26:31 +00:00
cyy
862029a1ef [Distributed] [15/N] Fix clang-tidy warnings in torch/csrc/distributed/ (#137072)
Follows  #136848

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137072
Approved by: https://github.com/kwen2501
2024-10-02 10:56:15 +00:00
ed02309232 type _dynamo/create_parameter_op.py (#136958)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136958
Approved by: https://github.com/jansel
2024-10-02 10:23:37 +00:00
52d29a2b94 [reland #136389] Skip kernel saving if already existed (#137073)
Summary:
We skip save_gpu_kernel if the kernel has already been saved.
This gives us a more accurate Triton profiling result. The
following trace shows before/after the change when benchmarking a
trivial addmm:

Before:
<img width="1255" alt="Screenshot 2024-09-23 at 10 26 53 AM" src="https://github.com/user-attachments/assets/5aea05ef-6ef0-464c-8da9-17b31c97b43a">

After:
<img width="910" alt="Screenshot 2024-09-23 at 10 27 03 AM" src="https://github.com/user-attachments/assets/488b7d4f-268f-41cf-8553-cb16ceeae118">

We can see that before the change, the benchmarking includes two parts:
   (1) The overhead of our triton_heuristic call, which includes the
   save/get and the (expensive) hash computation.
   (2) The actual computation of the Triton kernel.

   We see that (1) accounts for >50% of the time, which biases kernel
   selection during profiling toward aten kernels over Triton kernels.

Test Plan:
Existing OSS CI
python test/inductor/test_cuda_cpp_wrapper.py

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137073
Approved by: https://github.com/desertfire
2024-10-02 09:27:08 +00:00
e374d6850a [distributed][test] Remove unused variable and fix doc typo (#136943)
Refactor distributed test code:
- Fix TODO: Remove unused variable
- Fix doc typo
- Migrate deprecated method calls `load_state_dict` and `save_state_dict`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136943
Approved by: https://github.com/H-Huang
2024-10-02 08:31:53 +00:00
e9a55b43a1 [inductor] Support lists of tensors in operatorbench (#136911)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136911
Approved by: https://github.com/eellison
2024-10-02 06:41:06 +00:00
a89e3c2490 Add compiled_autograd_kwargs_override Dynamo config (#136967)
For Traceable FSDP2, the most common use case is to have `fullgraph=False` for forward pass (to allow user-level graph breaks), and `fullgraph=True` for compiled autograd backward pass (required for queue_callback support).

With `torch._dynamo.compiled_autograd=True`, we were previously not able to set different `fullgraph` config values for the forward vs. backward pass, since `rebuild_ctx` just reuses the forward compile config as-is. This PR adds the `torch._dynamo.config.compiled_autograd_kwargs_override` config to allow forcing `fullgraph=True` for CA Dynamo tracing.
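A minimal usage sketch, assuming both knobs live on `torch._dynamo.config` and the override takes a dict of Dynamo kwargs:

```
import torch

# Assumed spelling of the knobs described above.
torch._dynamo.config.compiled_autograd = True
torch._dynamo.config.compiled_autograd_kwargs_override = {"fullgraph": True}

@torch.compile(fullgraph=False)  # user-level graph breaks allowed in the forward
def train_step(model, x):
    loss = model(x).sum()
    loss.backward()  # backward traced by compiled autograd, forced to fullgraph=True
    return loss

model = torch.nn.Linear(8, 8)
print(train_step(model, torch.randn(4, 8)).item())
```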

With this PR, we can remove standalone compiled autograd ctx manager usage in Traceable FSDP2 unit tests, and consolidate on using `torch._dynamo.compiled_autograd=True`.

Test commands:
- `pytest -rA test/distributed/_composable/fsdp/test_fully_shard_compile.py::TestFullyShardCompile::test_transformer_backend_inductor_fullgraph_True`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136967
Approved by: https://github.com/xmfan
2024-10-02 06:23:59 +00:00
b51d22b8bb [BE] [NEON] Use vshlq_n_u32 instead of vshlq_u32 (#137122)
As the compiler optimizes it away anyway

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137122
Approved by: https://github.com/kit1980
2024-10-02 06:18:11 +00:00
2854d157de Add type annotations for higher order ops/flex_attention (#137065)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137065
Approved by: https://github.com/drisspg, https://github.com/Skylion007
ghstack dependencies: #136826, #137043, #137049
2024-10-02 04:39:25 +00:00
3b8511dadf Remove python 3.8 from triton builds (#137141)
All jobs have switched to Python 3.9, so these 3.8 builds are no longer necessary

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137141
Approved by: https://github.com/albanD
2024-10-02 03:36:54 +00:00
8e39f2a4a5 [Inductor] Enable a SDPA pattern matching for CUDA (#137085)
Summary: Fixes https://github.com/pytorch/pytorch/issues/122429

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137085
Approved by: https://github.com/eellison
2024-10-02 03:22:11 +00:00
18525e185e Fix rendezvous error due to EtcdStore get method not waiting in some cases (#137056)
Fixes #132950

This fixes an issue in `torch/distributed/elastic/rendezvous/etcd_store.py` where the [get method](https://github.com/pytorch/pytorch/blob/v2.4.0/torch/distributed/elastic/rendezvous/etcd_store.py#L60) does not wait as expected when no keys have been written under the store prefix yet (and therefore the store prefix key does not exist). This was because the `_try_wait_get` method would error out immediately [here](https://github.com/alenawang/pytorch/blob/main/torch/distributed/elastic/rendezvous/etcd_store.py#L179) if the prefix was not found instead of continuing to the etcd watch.

This was causing upstream issues where distributed jobs using etcd-v2 could not get past the initial rendezvous at all (details in issue #132950).

We added a test demonstrating this issue and the fix. Without the fix the test fails with `etcd.EtcdKeyNotFound: Key not found : /torch/elastic/store` instead of waiting for the first key to be written; with the fix the test waits properly.

Co-authored-by: tarat44 <32471142+tarat44@users.noreply.github.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137056
Approved by: https://github.com/fduwjj

Co-authored-by: tarat44 <32471142+tarat44@users.noreply.github.com>
2024-10-02 01:45:00 +00:00
f108f88c40 [logging/debugging] handle None (constant) args in debug log (#137032)
Summary:
# Why

The arguments are filtered out because they are just constants in the compiled graph, but the logger still expects a non-None type

# What

When passing a filtered-out arg (None) to the debug logger, just log that it's a filtered-out argument instead of raising a TypeError
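An illustrative sketch of the behavior change (hypothetical helper name, not the actual logger code):

```
def describe_arg(arg):
    # A filtered-out constant arrives as None; log a placeholder instead of
    # raising a TypeError on it.
    if arg is None:
        return "<filtered constant arg>"
    return f"{type(arg).__name__}(shape={getattr(arg, 'shape', None)})"

print(describe_arg(None))
```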

# Background

https://github.com/pytorch/pytorch/pull/131594

Test Plan: - execute repro from https://github.com/pytorch/pytorch/issues/135584#issue-2516944089 with and without the edits

Differential Revision: D63652564

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137032
Approved by: https://github.com/angelayi
2024-10-02 01:43:22 +00:00
f984b88718 Ensure noncontiguous tensor creation tests offsetting (#136396)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136396
Approved by: https://github.com/amjames, https://github.com/eellison
ghstack dependencies: #136055
2024-10-02 00:40:43 +00:00
c7638da558 Lowerings: remove restriction on TensorBox keyword arguments (#136055)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136055
Approved by: https://github.com/eellison
2024-10-02 00:40:43 +00:00
63d6908da0 fix build error with gcc 12+ (#137092)
Fixes #127920

This commit addresses a build failure occurring with GCC 12 and above due to the -Werror=nonnull flag. The error manifests in the test_api target.

**Issue:**
When building with GCC 12+, the following error occurs:
```
error: argument 1 null where non-null expected [-Werror=nonnull]
  431 |             __builtin_memmove(__result, __first, sizeof(_Tp) * _Num);
      |             ~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
```

This change ensures that:
1. The flag is only added for GCC 12 or higher
2. The flag is only added if it's supported by the compiler
3. The flag is added specifically to the test_api target, not globally

By disabling this specific error, we allow the build to proceed while maintaining other compiler warnings.

**Test Plan:**
- Verified successful build with GCC 12 and above
- Ensured no regression in builds with earlier GCC versions and other compilers

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137092
Approved by: https://github.com/malfet
2024-10-02 00:37:15 +00:00
d725758210 [ts_converter] Fix prim::If buffer names (#136648)
Summary:
We previously incorrectly handled the following graph, specifically for the node `w.3` in `block0`:
```
 graph(%x.1 : Float(3, strides=[1], requires_grad=0, device=cpu),
       %y.1 : int):
   %2 : __torch__.___torch_mangle_1.M = prim::CreateObject()
   %3 : int = prim::Constant[value=20](), scope: M:: # /data/users/angelayi/pytorch/test/export/test_converter.py:747:34
   %4 : int = prim::Constant[value=10](), scope: M:: # /data/users/angelayi/pytorch/test/export/test_converter.py:746:34
   %5 : int = prim::Constant[value=1](), scope: M::
   %w.1 : int = prim::GetAttr[name="w"](%2), scope: M::
   %7 : int = aten::mul(%w.1, %4), scope: M:: # /data/users/angelayi/pytorch/test/export/test_converter.py:746:25
    = prim::SetAttr[name="w"](%2, %7), scope: M::
   %h.1 : int = prim::GetAttr[name="h"](%2), scope: M::
   %9 : int = aten::mul(%h.1, %3), scope: M:: # /data/users/angelayi/pytorch/test/export/test_converter.py:747:25
    = prim::SetAttr[name="h"](%2, %9), scope: M::
   %10 : bool = aten::gt(%y.1, %4), scope: M:: # /data/users/angelayi/pytorch/test/export/test_converter.py:749:19
   %res.37 : Tensor = prim::If(%10), scope: M:: # /data/users/angelayi/pytorch/test/export/test_converter.py:749:16
     block0():
       %w.3 : int = prim::GetAttr[name="w"](%2), scope: M::
       %res.1 : Tensor = aten::add(%x.1, %w.3, %5), scope: M:: # <string>:5:9
       -> (%res.1)
     block1():
       %h.3 : int = prim::GetAttr[name="h"](%2), scope: M::
       %res.3 : Tensor = aten::add(%x.1, %h.3, %5), scope: M:: # <string>:5:9
       -> (%res.3)
   %16 : bool = aten::lt(%y.1, %4), scope: M:: # /data/users/angelayi/pytorch/test/export/test_converter.py:754:19
   %res : Tensor = prim::If(%16), scope: M:: # /data/users/angelayi/pytorch/test/export/test_converter.py:754:16
     block0():
       %w : int = prim::GetAttr[name="w"](%2), scope: M::
       %res.15 : Tensor = aten::add(%res.37, %w, %5), scope: M:: # <string>:5:9
       -> (%res.15)
     block1():
       %h : int = prim::GetAttr[name="h"](%2), scope: M::
       %res.21 : Tensor = aten::add(%res.37, %h, %5), scope: M:: # <string>:5:9
       -> (%res.21)
   return (%res)
```

Test Plan: CI

Differential Revision: D63399064

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136648
Approved by: https://github.com/SherlockNoMad
2024-10-02 00:07:47 +00:00
8765804542 Continue on error for pytorch autolint (#137104)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137104
Approved by: https://github.com/huydhn, https://github.com/atalman
2024-10-01 22:30:36 +00:00
f0fa460c60 [BE] Add script to keept the runner-determinator scripts in sync (#136794)
Whenever we update runner_determinator.py it needs to be copied over into _runner-determinator.yml.

This is a quick utility script to make that process less tedious
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136794
Approved by: https://github.com/zxiiro, https://github.com/jeanschmidt
2024-10-01 22:26:28 +00:00
4f93de8951 Mark PyTorch module as no-gil valid and pythoncapi_compat.h (#136899)
PyList_GetItem calls are audited, but other APIs are not yet (they will be done in a follow-up PR to keep this one small enough).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136899
Approved by: https://github.com/colesbury, https://github.com/atalman
2024-10-01 22:05:35 +00:00
6baee60e3c upload test stats: remove nan/inf when uploading (#136877)
`json.dumps(float("inf"))` returns `Infinity`, which is technically invalid JSON.

This is fine if you `json.load` it back, but ClickHouse cannot handle it.

Solution here: cast inf and nan to strings (which ClickHouse is able to cast back to float).
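A minimal sketch of the problem and the workaround (hypothetical `sanitize` helper):

```
import json
import math

print(json.dumps(float("inf")))  # prints 'Infinity', which strict parsers reject

def sanitize(value):
    # Cast non-finite floats to strings so ClickHouse can cast them back later.
    if isinstance(value, float) and not math.isfinite(value):
        return str(value)  # 'inf', '-inf' or 'nan'
    return value

row = {"duration_s": float("inf"), "peak_mem": float("nan"), "retries": 1}
print(json.dumps({k: sanitize(v) for k, v in row.items()}))
```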
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136877
Approved by: https://github.com/huydhn
2024-10-01 21:47:46 +00:00
0788d016d6 Update incompatible cudagraph ops skip message (#137015)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137015
Approved by: https://github.com/BoyuanFeng
2024-10-01 21:23:36 +00:00
34c18887ad [FlexAttention] Remove restriction on QK headdim > V headdim (#135884)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135884
Approved by: https://github.com/Chillee
2024-10-01 21:17:54 +00:00
99eb47fb6d Add CI for Triton CPU backend (#135342)
Where possible, I have marked failing tests (which we intend to fix or triage) as `@xfail_if_triton_cpu`. This will help us track progress of the Triton CPU backend over time. Tests that I don't think we need to address, or that are flaky, have been marked as skips.

Successful CI run: https://github.com/pytorch/pytorch/actions/runs/10822238062/job/30028284549

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135342
Approved by: https://github.com/jansel, https://github.com/desertfire, https://github.com/malfet
2024-10-01 20:43:10 +00:00
86b715c5f6 Revert "Skip kernel saving if already existed. (#136389)"
This reverts commit 2521cd387482a70d30e4ea922fa4fe3b488c9f6d.

Reverted https://github.com/pytorch/pytorch/pull/136389 on behalf of https://github.com/muchulee8 due to Issue #136940  ([comment](https://github.com/pytorch/pytorch/pull/136389#issuecomment-2386950623))
2024-10-01 20:04:43 +00:00
b53ab8b86a Revert "[dtensor][experimental] expose DTensor Context Parallel API (#137038)"
This reverts commit e23e766cc089b568aa4c0ebf0747ff9b504b8915.

Reverted https://github.com/pytorch/pytorch/pull/137038 on behalf of https://github.com/huydhn due to Sorry for reverting your changes but the doc build failure looks legit ([comment](https://github.com/pytorch/pytorch/pull/137038#issuecomment-2386902253))
2024-10-01 19:49:28 +00:00
a00f0d5db8 [PT2][Inductor] Add runtime numeric check for the post grad pass (#136724)
Summary: Similar to D51838043, we further add a post-grad runtime numeric check since some graph passes are performed at the aten level.

Differential Revision: D63438718

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136724
Approved by: https://github.com/Yuzhen11
2024-10-01 18:56:01 +00:00
d61e45283e Properly interpolate sloc here (#137088)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137088
Approved by: https://github.com/Skylion007
2024-10-01 18:33:03 +00:00
c2dee8ea9c enable lazy init for MTIA (#136902)
Summary: As title.

Test Plan: OSS and Internal CIs

Reviewed By: nautsimon, hanzlfs

Differential Revision: D63434511

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136902
Approved by: https://github.com/nautsimon
2024-10-01 18:30:56 +00:00
1f3a793790 Fix PyTorch builds on MacOS-13 (#137095)
By including the SonomaOps header

Fixes https://github.com/pytorch/pytorch/issues/137094

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137095
Approved by: https://github.com/atalman, https://github.com/ZainRizvi
2024-10-01 17:56:35 +00:00
e23e766cc0 [dtensor][experimental] expose DTensor Context Parallel API (#137038)
**Summary**
expose experimental Context Parallel API `torch.distributed.tensor.experimental._attention.context_parallel` to module `torch.distributed.tensor.experimental`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137038
Approved by: https://github.com/wz337, https://github.com/fegin
2024-10-01 17:41:28 +00:00
73b07df042 Preserve custom ops via run_decomps (#136882)
This is a re-apply of https://github.com/pytorch/pytorch/pull/136773.

Note that this doesn't completely remove the _preserve_ops list from export, mainly because we want to keep this change small to address failing executorch tests. All the complications included in this PR are deleted in the next PR.

Differential Revision: [D63553086](https://our.internmc.facebook.com/intern/diff/D63553086/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136882
Approved by: https://github.com/bdhirsh
2024-10-01 17:38:00 +00:00
b1b6816e05 [testing] reenable kernel_benchmark.py tests (#136876)
Summary:
# Why

We want this to run internally

# What

- fix python path issue on the test
- reenable the test

# Background

(copied from similar issue resolved earlier)

It appears that the parent process does not pass the entire path down to the child process. Namely, if there is some setup that makes the sys.path effectively look different than, say, PYTHONPATH or something like this, the child will not inherit this setup. To avoid needing to keep track of specific setups, we pass the effective `sys.path` from the parent to the child through the PYTHONPATH env variable
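A minimal sketch of that approach (generic example, not the actual test harness):

```
import os
import subprocess
import sys

# Forward the parent's effective sys.path to the child via PYTHONPATH so the
# child sees the same import setup regardless of how the parent was launched.
env = dict(os.environ)
env["PYTHONPATH"] = os.pathsep.join(p for p in sys.path if p)

subprocess.run(
    [sys.executable, "-c", "import sys; print(sys.path)"],
    env=env,
    check=True,
)
```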

Test Plan: buck2 test 'fbcode//mode/opt' fbcode//caffe2/test/inductor:kernel_benchmark

Differential Revision: D63498897

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136876
Approved by: https://github.com/henrylhtsang
2024-10-01 17:16:21 +00:00
3d0cb81594 [MPS] Enable bfloat16 testing (#136987)
By even further reducing the precisions of imprecise FP16 ops, introducing a new BF16_LOW_PRECISION_OPS category, and marking BF16 tests as xfail for `divfloor_rounding`, `floor_divide` and `remainder`.
I guess the nature of the low-precision results is that MPSGraph, unlike the rest of PyTorch, does not accumulate over fp32 for reduction operations.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136987
Approved by: https://github.com/albanD
ghstack dependencies: #137070
2024-10-01 17:10:07 +00:00
cc2a66c55e [export] hook up mark_dynamic to export Dims (#137029)
Adds Dim.DYNAMIC which calls torch._dynamo.mark_dynamic() in the backend. Similar to Dim.AUTO in that it does automatic inference for ranges & relations, but errors out for specializations.
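A minimal usage sketch, assuming the `Dim.DYNAMIC` spelling above together with the standard `dynamic_shapes` argument of `torch.export.export`:

```
import torch
from torch.export import Dim, export

class M(torch.nn.Module):
    def forward(self, x):
        return x * 2

# Dim.DYNAMIC infers ranges/relations like Dim.AUTO, but is expected to error
# out if the marked dimension ends up specialized to a constant.
ep = export(M(), (torch.randn(4, 8),), dynamic_shapes={"x": {0: Dim.DYNAMIC}})
print(ep)
```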
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137029
Approved by: https://github.com/avikchaudhuri
2024-10-01 17:05:09 +00:00
ef6fd3d780 Fix adaptive_max_pool2d fallback (#136367)
Fixes #136332
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136367
Approved by: https://github.com/amjames, https://github.com/eellison
2024-10-01 16:20:34 +00:00
8f4f7bed5d [MPS] Fix bfloat to complex casts (#137070)
For Metal cast ops to compile, one needs to explicitly cast to/from `bfloat`, unlike for other dtypes

Tested in https://github.com/pytorch/pytorch/pull/136987
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137070
Approved by: https://github.com/Skylion007
2024-10-01 15:47:29 +00:00
696d01aef3 Revert "inductor: use previous guards to know if a size is 1 for broadcasting (#136670)"
This reverts commit dfdda2f6a603ae9245f38a3e8f6365c3cb6d49ac.

Reverted https://github.com/pytorch/pytorch/pull/136670 on behalf of https://github.com/ZainRizvi due to Something in this stack seems to be causing tests to fail on trunk. See functorch/test_control_flow.py::TestControlFlow::test_associative_scan_dim_reverse_True_combine_mode_generic_cuda [GH job link](https://github.com/pytorch/pytorch/actions/runs/11107079955/job/30872132411) [HUD commit link](c010c6099b) ([comment](https://github.com/pytorch/pytorch/pull/136670#issuecomment-2386303362))
2024-10-01 15:23:55 +00:00
951107e8c2 Revert "compile time benchmarks for AOTDispatcher (inference/training/subclasses) (#136759)"
This reverts commit b17cd264d38ca3381391c449bdaf9f03381caf35.

Reverted https://github.com/pytorch/pytorch/pull/136759 on behalf of https://github.com/ZainRizvi due to Something in this stack seems to be causing tests to fail on trunk. See functorch/test_control_flow.py::TestControlFlow::test_associative_scan_dim_reverse_True_combine_mode_generic_cuda [GH job link](https://github.com/pytorch/pytorch/actions/runs/11107079955/job/30872132411) [HUD commit link](c010c6099b) ([comment](https://github.com/pytorch/pytorch/pull/136670#issuecomment-2386303362))
2024-10-01 15:23:55 +00:00
923410193b Revert "compile time benchmarks for AOTDispatcher (partitioner) (#136760)"
This reverts commit c010c6099bf304bbb681af534b9f3996c33ce582.

Reverted https://github.com/pytorch/pytorch/pull/136760 on behalf of https://github.com/ZainRizvi due to Something in this stack seems to be causing tests to fail on trunk. See functorch/test_control_flow.py::TestControlFlow::test_associative_scan_dim_reverse_True_combine_mode_generic_cuda [GH job link](https://github.com/pytorch/pytorch/actions/runs/11107079955/job/30872132411) [HUD commit link](c010c6099b) ([comment](https://github.com/pytorch/pytorch/pull/136670#issuecomment-2386303362))
2024-10-01 15:23:55 +00:00
8f5c2b5f17 type _dynamo/test_case.py (#136957)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136957
Approved by: https://github.com/Skylion007
2024-10-01 14:36:22 +00:00
d4cc2aaf1e type _dynamo/logging.py (#136956)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136956
Approved by: https://github.com/Skylion007
2024-10-01 14:35:54 +00:00
7303716005 Revert "Simplify find_localzeros (#133325)"
This reverts commit 99f90c379ed214ab30882a87bdb3924ed6d6c899.

Reverted https://github.com/pytorch/pytorch/pull/133325 on behalf of https://github.com/ezyang due to https://fb.workplace.com/groups/gpuinference/permalink/2921405651341417/ ([comment](https://github.com/pytorch/pytorch/pull/133325#issuecomment-2385832600))
2024-10-01 13:25:03 +00:00
6bd9d37266 Remove allow-untyped-defs from torch.fx.experimental.symbolic_shapes (#137019)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137019
Approved by: https://github.com/Skylion007
ghstack dependencies: #136934, #136935, #136972
2024-10-01 13:22:10 +00:00
cc8f1cddd4 Turn on type-checking in torch.fx.experimental.symbolic_shapes (#136972)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136972
Approved by: https://github.com/Skylion007
ghstack dependencies: #136934, #136935
2024-10-01 13:22:10 +00:00
b85f21fc1d Add decomposition for squeeze_copy (#130941)
* Extracted from #128416

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130941
Approved by: https://github.com/amjames, https://github.com/eellison
ghstack dependencies: #136653
2024-10-01 10:23:22 +00:00
083921852b set FlexAttention devices properly during tracing (#137049)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137049
Approved by: https://github.com/zou3519, https://github.com/drisspg, https://github.com/yanboliang
ghstack dependencies: #136826, #137043
2024-10-01 09:08:08 +00:00
34cef1eaa7 Allow automatic dynamic shapes for closures and set current node properly in flexattention subgraph lowering (#137043)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137043
Approved by: https://github.com/drisspg
ghstack dependencies: #136826
2024-10-01 09:08:08 +00:00
37dd924c2d Fix test/test_linalg.py for NumPy 2 (#136800)
Related to  #107302.

When built and tested with NumPy 2 the following unit tests failed.

```
=========================================================== short test summary info ============================================================
FAILED [0.0026s] test/test_linalg.py::TestLinalgCPU::test_householder_product_cpu_complex128 - TypeError: expected np.ndarray (got Tensor)
FAILED [0.0024s] test/test_linalg.py::TestLinalgCPU::test_householder_product_cpu_complex64 - TypeError: expected np.ndarray (got Tensor)
FAILED [0.0025s] test/test_linalg.py::TestLinalgCPU::test_householder_product_cpu_float32 - TypeError: expected np.ndarray (got Tensor)
FAILED [0.0024s] test/test_linalg.py::TestLinalgCPU::test_householder_product_cpu_float64 - TypeError: expected np.ndarray (got Tensor)
FAILED [0.0016s] test/test_linalg.py::TestLinalgCPU::test_nuclear_norm_axes_small_brute_force_old_cpu - ValueError: Unable to avoid copy while creating an array as requested.
FAILED [0.0054s] test/test_linalg.py::TestLinalgCPU::test_solve_cpu_complex128 - AssertionError: The values for attribute 'shape' do not match: torch.Size([0, 0]) != torch.Size([0, 0, 0]).
FAILED [0.0055s] test/test_linalg.py::TestLinalgCPU::test_solve_cpu_complex64 - AssertionError: The values for attribute 'shape' do not match: torch.Size([0, 0]) != torch.Size([0, 0, 0]).
FAILED [0.0048s] test/test_linalg.py::TestLinalgCPU::test_solve_cpu_float32 - AssertionError: The values for attribute 'shape' do not match: torch.Size([0, 0]) != torch.Size([0, 0, 0]).
FAILED [0.0054s] test/test_linalg.py::TestLinalgCPU::test_solve_cpu_float64 - AssertionError: The values for attribute 'shape' do not match: torch.Size([0, 0]) != torch.Size([0, 0, 0]).
=========================================== 9 failed, 1051 passed, 118 skipped in 152.51s (0:02:32) ============================================
```

This PR fixes them. The test is now compatible with both NumPy 1 & 2.

Some more details:

1. `np.linalg.solve` has changed its behavior, so I added an adapter function in the unit test to keep its behavior the same whether NumPy 1 or NumPy 2 is used.
2. One cause of the failures is that when passing a `torch.Tensor` to `np.linalg.qr`, the return type in NumPy 1 is `(np.ndarray, np.ndarray)`, while it is `(torch.Tensor, torch.Tensor)` in NumPy 2.
3. NumPy 2 does not allow `np.array(obj, copy=False)`; the recommendation is to use `np.asarray(obj)` instead (see the sketch below).
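A minimal sketch of the replacement in point 3:

```
import numpy as np
import torch

t = torch.arange(4.0)

# NumPy 2 raises "Unable to avoid copy while creating an array as requested."
# for np.array(t, copy=False); np.asarray works on both NumPy 1 and NumPy 2.
a = np.asarray(t)
print(type(a), a.dtype, a.shape)
```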

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136800
Approved by: https://github.com/lezcano
2024-10-01 07:53:24 +00:00
df5bbc09d1 Make device-specific event inherits from torch.Event (#134845)
# Motivation
This PR intends to make each device-specific Event inherit from the generic torch.Event. The benefit is providing a generic abstract class `torch.Event` for different devices, like `torch.Stream`. This makes it easier for Dynamo to capture the Event of different devices, like torch.cuda.Event and torch.xpu.Event.
The next PR will remove the now-useless base classes `_StreamBase` and `_EventBase` to avoid multiple inheritance.
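A minimal sketch of the intended class hierarchy (assumed behavior, checked only when CUDA is present):

```
import torch

if torch.cuda.is_available():
    ev = torch.cuda.Event()
    # After this PR, the device-specific event is expected to subclass the
    # generic torch.Event.
    print(isinstance(ev, torch.Event))
```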

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134845
Approved by: https://github.com/albanD, https://github.com/EikanWang
2024-10-01 06:28:41 +00:00
cyy
47a78daf91 [Environment Variable][1/N] Use thread-safe env variable API in c10 (#119449)
This PR is the beginning of attempts to wrap thread-unsafe getenv and set_env functions inside a RW mutex.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119449
Approved by: https://github.com/malfet, https://github.com/albanD, https://github.com/eqy
2024-10-01 06:24:30 +00:00
be169f743b [Dynamo] Mark config.dead_code_elimination as deprecated (#136933)
part of #136862

For reviewers, all call sites are here: https://github.com/search?q=repo%3Apytorch%2Fpytorch+dead_code_elimination+language%3APython&type=code&l=Python

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136933
Approved by: https://github.com/williamwen42, https://github.com/anijain2305
2024-10-01 03:51:59 +00:00
6e10f7d8c1 [compiled autograd] undo view_to_reshape inductor fx pass in node name matching (#136741)
Inductor mutates the AOT backward graph. A solution could be to copy the graph, but since we don't know whether compiled autograd is applied, it would be expensive to always clone it.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136741
Approved by: https://github.com/jansel
ghstack dependencies: #135663
2024-10-01 03:22:49 +00:00
40157db5a7 [compiled autograd] log placeholder origin in verbose (#135663)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135663
Approved by: https://github.com/jansel
2024-10-01 03:22:49 +00:00
6966811da6 [test] skip not omit big gpu tests for cuda_cpp_wrapper (#137055)
Summary: The problem is that when the GPU is not big enough, we omit the test cases in the test class. We expect the tests to be skipped, but due to fbcode CI they can throw an error instead, which makes the test flaky.

Test Plan: ci

Differential Revision: D62037908

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137055
Approved by: https://github.com/masnesral
2024-10-01 03:03:27 +00:00
cyy
17455695d6 [Distributed] [14/N] Fix clang-tidy warnings in torch/csrc/distributed/ (#136848)
Follows  #136713

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136848
Approved by: https://github.com/H-Huang
2024-10-01 02:01:13 +00:00
951af3d3d8 Format torch.fx.experimental.validator (#136935)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136935
Approved by: https://github.com/Skylion007
ghstack dependencies: #136934
2024-10-01 01:47:17 +00:00
33c2d3232f Format torch.fx.experimental.symbolic_shapes with PYFMT (#136934)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136934
Approved by: https://github.com/Skylion007
2024-10-01 01:47:16 +00:00
d9c400bd9f Added some tests to prevent regressions in partitioning and flexattention (#136826)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136826
Approved by: https://github.com/yanboliang, https://github.com/drisspg
2024-10-01 01:08:44 +00:00
3f457ee1f6 Fix AOT Graph capture not propagating non_blocking copy parameter to … (#136513)
…inductor codegen.

Fixes #136260

**Note**: this is my first code contribution to torch so please let me know if there's anything I need to fix/some other convention I should follow.

Regarding the bug, re-running the issue's reproduction code:
```
import torch

def fn(x):
    return x.to(device="cuda", non_blocking=True)

inp = torch.randn(3, 4)

torch.compile(fn)(inp)
```

We now have the non_blocking being passed on to codegen properly:

```
V0922 20:33:25.393000 679839 torch/fx/passes/runtime_assert.py:114] [0/0] [__graph_code] TRACED GRAPH
V0922 20:33:25.393000 679839 torch/fx/passes/runtime_assert.py:114] [0/0] [__graph_code]  ===== pre insert_deferred_runtime_asserts __compiled_fn_1 =====
V0922 20:33:25.393000 679839 torch/fx/passes/runtime_assert.py:114] [0/0] [__graph_code]  <eval_with_key>.0 class GraphModule(torch.nn.Module):
V0922 20:33:25.393000 679839 torch/fx/passes/runtime_assert.py:114] [0/0] [__graph_code]     def forward(self, L_x_: "f32[3, 4]"):
V0922 20:33:25.393000 679839 torch/fx/passes/runtime_assert.py:114] [0/0] [__graph_code]         l_x_ = L_x_
V0922 20:33:25.393000 679839 torch/fx/passes/runtime_assert.py:114] [0/0] [__graph_code]
V0922 20:33:25.393000 679839 torch/fx/passes/runtime_assert.py:114] [0/0] [__graph_code]          # File: /home/niklasz/Desktop/pytorch/temp/reproduction.py:4 in fn, code: return x.to(device="cuda", non_blocking=True)
V0922 20:33:25.393000 679839 torch/fx/passes/runtime_assert.py:114] [0/0] [__graph_code]         to: "f32[3, 4]" = l_x_.to(device = 'cuda', non_blocking = True);  l_x_ = None
V0922 20:33:25.393000 679839 torch/fx/passes/runtime_assert.py:114] [0/0] [__graph_code]         return (to,)
V0922 20:33:25.393000 679839 torch/fx/passes/runtime_assert.py:114] [0/0] [__graph_code]
V0922 20:33:25.393000 679839 torch/fx/passes/runtime_assert.py:114] [0/0] [__graph_code]
V0922 20:33:25.394000 679839 torch/_dynamo/output_graph.py:1340] [0/0] [__graph_code] TRACED GRAPH
V0922 20:33:25.394000 679839 torch/_dynamo/output_graph.py:1340] [0/0] [__graph_code]  ===== __compiled_fn_1 =====
V0922 20:33:25.394000 679839 torch/_dynamo/output_graph.py:1340] [0/0] [__graph_code]  /home/niklasz/Desktop/pytorch/torch/fx/_lazy_graph_module.py class GraphModule(torch.nn.Module):
V0922 20:33:25.394000 679839 torch/_dynamo/output_graph.py:1340] [0/0] [__graph_code]     def forward(self, L_x_: "f32[3, 4][4, 1]cpu"):
V0922 20:33:25.394000 679839 torch/_dynamo/output_graph.py:1340] [0/0] [__graph_code]         l_x_ = L_x_
V0922 20:33:25.394000 679839 torch/_dynamo/output_graph.py:1340] [0/0] [__graph_code]
V0922 20:33:25.394000 679839 torch/_dynamo/output_graph.py:1340] [0/0] [__graph_code]          # File: /home/niklasz/Desktop/pytorch/temp/reproduction.py:4 in fn, code: return x.to(device="cuda", non_blocking=True)
V0922 20:33:25.394000 679839 torch/_dynamo/output_graph.py:1340] [0/0] [__graph_code]         to: "f32[3, 4][4, 1]cuda:0" = l_x_.to(device = 'cuda', non_blocking = True);  l_x_ = None
V0922 20:33:25.394000 679839 torch/_dynamo/output_graph.py:1340] [0/0] [__graph_code]         return (to,)
V0922 20:33:25.394000 679839 torch/_dynamo/output_graph.py:1340] [0/0] [__graph_code]
V0922 20:33:25.394000 679839 torch/_dynamo/output_graph.py:1340] [0/0] [__graph_code]
V0922 20:33:25.404000 679839 torch/_functorch/_aot_autograd/dispatch_and_compile_graph.py:114] [0/0] [__aot_graphs] aot_config id: 0, fw_metadata=ViewAndMutationMeta(input_info=[InputAliasInfo(is_leaf=True, mutates_data=False, mutates_metadata=False, mutations_hidden_from_autograd=True, mutations_under_no_grad_or_inference_mode=False, mutation_inductor_storage_resize=False, mutates_storage_metadata=False, requires_grad=False, keep_input_mutations=True)], output_info=[OutputAliasInfo(output_type=<OutputType.non_alias: 1>, raw_type=<class 'torch._subclasses.functional_tensor.FunctionalTensor'>, base_idx=None, dynamic_dims=set(), requires_grad=False, functional_tensor=None)], num_intermediate_bases=0, keep_input_mutations=True, traced_tangents=[], subclass_inp_meta=[0], subclass_fw_graph_out_meta=[0], subclass_tangent_meta=[], is_train=False, traced_tangent_metas=None, num_symints_saved_for_bw=None, grad_enabled_mutation=None, deterministic=None, static_input_indices=[], tokens={}, indices_of_inputs_that_requires_grad_with_mutations_in_bw=[], bw_donated_idxs=None, num_backward_tokens=0),subclass_metadata=None
I0922 20:33:25.409000 679839 torch/_functorch/_aot_autograd/dispatch_and_compile_graph.py:204] [0/0] [__aot_graphs] TRACED GRAPH
I0922 20:33:25.409000 679839 torch/_functorch/_aot_autograd/dispatch_and_compile_graph.py:204] [0/0] [__aot_graphs]  ===== Forward graph 0 =====
I0922 20:33:25.409000 679839 torch/_functorch/_aot_autograd/dispatch_and_compile_graph.py:204] [0/0] [__aot_graphs]  /home/niklasz/Desktop/pytorch/torch/fx/_lazy_graph_module.py class <lambda>(torch.nn.Module):
I0922 20:33:25.409000 679839 torch/_functorch/_aot_autograd/dispatch_and_compile_graph.py:204] [0/0] [__aot_graphs]     def forward(self, arg0_1: "f32[3, 4][4, 1]cpu"):
I0922 20:33:25.409000 679839 torch/_functorch/_aot_autograd/dispatch_and_compile_graph.py:204] [0/0] [__aot_graphs]          # File: /home/niklasz/Desktop/pytorch/temp/reproduction.py:4 in fn, code: return x.to(device="cuda", non_blocking=True)
I0922 20:33:25.409000 679839 torch/_functorch/_aot_autograd/dispatch_and_compile_graph.py:204] [0/0] [__aot_graphs]         device_put: "f32[3, 4][4, 1]cuda:0" = torch.ops.prims.device_put.default(arg0_1, device(type='cuda', index=0), True);  arg0_1 = None
I0922 20:33:25.409000 679839 torch/_functorch/_aot_autograd/dispatch_and_compile_graph.py:204] [0/0] [__aot_graphs]         convert_element_type: "f32[3, 4][4, 1]cuda:0" = torch.ops.prims.convert_element_type.default(device_put, torch.float32);  device_put = None
I0922 20:33:25.409000 679839 torch/_functorch/_aot_autograd/dispatch_and_compile_graph.py:204] [0/0] [__aot_graphs]         return (convert_element_type,)
I0922 20:33:25.409000 679839 torch/_functorch/_aot_autograd/dispatch_and_compile_graph.py:204] [0/0] [__aot_graphs]
I0922 20:33:25.409000 679839 torch/_functorch/_aot_autograd/dispatch_and_compile_graph.py:204] [0/0] [__aot_graphs]
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1134] [0/0] [__output_code] Output code written to: /tmp/torchinductor_niklasz/ha/chaai264g6ribfw3q2qhl6ayjtaqaavku5wivxtzw4nabgd6htsv.py
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code] Output code:
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code] # AOT ID: ['0_inference']
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code] from ctypes import c_void_p, c_long, c_int
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code] import torch
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code] import math
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code] import random
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code] import os
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code] import tempfile
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code] from math import inf, nan
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code] from torch._inductor.hooks import run_intermediate_hooks
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code] from torch._inductor.utils import maybe_profile
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code] from torch._inductor.codegen.memory_planning import _align as align
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code] from torch import device, empty_strided
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code] from torch._inductor.async_compile import AsyncCompile
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code] from torch._inductor.select_algorithm import extern_kernels
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code] from torch._inductor.codegen.multi_kernel import MultiKernelCall
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code]
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code] aten = torch.ops.aten
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code] inductor_ops = torch.ops.inductor
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code] _quantized = torch.ops._quantized
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code] assert_size_stride = torch._C._dynamo.guards.assert_size_stride
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code] empty_strided_cpu = torch._C._dynamo.guards._empty_strided_cpu
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code] empty_strided_cuda = torch._C._dynamo.guards._empty_strided_cuda
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code] empty_strided_xpu = torch._C._dynamo.guards._empty_strided_xpu
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code] reinterpret_tensor = torch._C._dynamo.guards._reinterpret_tensor
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code] alloc_from_pool = torch.ops.inductor._alloc_from_pool
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code] async_compile = AsyncCompile()
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code]
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code]
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code] async_compile.wait(globals())
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code] del async_compile
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code]
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code] def call(args):
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code]     arg0_1, = args
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code]     args.clear()
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code]     assert_size_stride(arg0_1, (3, 4), (4, 1))
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code]     with torch.cuda._DeviceGuard(0):
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code]         torch.cuda.set_device(0)
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code]         buf0 = empty_strided_cuda((3, 4), (4, 1), torch.float32)
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code]         buf0.copy_(arg0_1, True)
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code]         del arg0_1
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code]     return (buf0, )
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code]
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code]
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code] def benchmark_compiled_module(times=10, repeat=10):
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code]     from torch._dynamo.testing import rand_strided
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code]     from torch._inductor.utils import print_performance
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code]     arg0_1 = rand_strided((3, 4), (4, 1), device='cpu', dtype=torch.float32)
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code]     fn = lambda: call([arg0_1])
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code]     return print_performance(fn, times=times, repeat=repeat)
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code]
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code]
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code] if __name__ == "__main__":
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code]     from torch._inductor.wrapper_benchmark import compiled_module_main
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code]     compiled_module_main('None', benchmark_compiled_module)
V0922 20:33:25.983000 679839 torch/_inductor/codecache.py:1135] [0/0] [__output_code]
```
See above line `buf0.copy_(arg0_1, True)`. Specific log setting used: `export TORCH_LOGS="graph_code,aot_graphs,output_code"`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136513
Approved by: https://github.com/eellison
2024-10-01 00:32:47 +00:00
19a4d68224 Add missing mappings to support torch.uint16 in quantization and export (#136547)
Test Plan: CI.

Differential Revision: D63142844

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136547
Approved by: https://github.com/angelayi
2024-10-01 00:01:01 +00:00
18e707645c Substitute unbacked symints in expressions (#137020)
Differential Revision: [D63647095](https://our.internmc.facebook.com/intern/diff/D63647095)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137020
Approved by: https://github.com/ezyang
2024-09-30 23:07:22 +00:00
af64c44b56 Revert "Don't uselessly recompute axiom dict every static eval call (#135429)"
This reverts commit 1d6e0412f5205b1cd709e034526d7f21d6f2d56f.

Reverted https://github.com/pytorch/pytorch/pull/135429 on behalf of https://github.com/ezyang due to try again ([comment](https://github.com/pytorch/pytorch/pull/135429#issuecomment-2384288879))
2024-09-30 22:29:13 +00:00
c07ebaf430 [triton] Try to use triton.language.extra.libdevice when possible (#136997)
Summary:
X-link: https://github.com/facebookresearch/generative-recommenders/pull/90

In view of https://github.com/triton-lang/triton/pull/3825 we should try to use `triton.language.extra.libdevice` instead of `triton.language.extra.cuda.libdevice`.

Test Plan: CI

Reviewed By: bertmaher, karthik-man

Differential Revision: D63583965

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136997
Approved by: https://github.com/bertmaher
2024-09-30 21:52:44 +00:00
b3972ee19a [triton] Unify build_paths.py for NV & AMD, fix typing (#136952)
Summary: Some build improvements.

Test Plan: CI

Differential Revision: D63583959

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136952
Approved by: https://github.com/bertmaher
2024-09-30 21:51:45 +00:00
66a269afe8 Revert "Format torch.fx.experimental.symbolic_shapes with PYFMT (#136934)"
This reverts commit cf1a7eab250ea37ca8fda0327e8e38693c3c5c1a.

Reverted https://github.com/pytorch/pytorch/pull/136934 on behalf of https://github.com/ezyang due to merge conflict revert ([comment](https://github.com/pytorch/pytorch/pull/136934#issuecomment-2384195881))
2024-09-30 21:44:44 +00:00
c94536ae74 Revert "Format torch.fx.experimental.validator (#136935)"
This reverts commit 377e4bc877a3ac4cd6d073aa513a309159ade991.

Reverted https://github.com/pytorch/pytorch/pull/136935 on behalf of https://github.com/ezyang due to merge conflict revert ([comment](https://github.com/pytorch/pytorch/pull/136934#issuecomment-2384195881))
2024-09-30 21:44:44 +00:00
8982906502 Revert "Turn on type-checking in torch.fx.experimental.symbolic_shapes (#136972)"
This reverts commit 3ff2d93d9f72fd26503ef0cf5c5956edad4c52e6.

Reverted https://github.com/pytorch/pytorch/pull/136972 on behalf of https://github.com/ezyang due to need to back out for merge conflict ([comment](https://github.com/pytorch/pytorch/pull/136972#issuecomment-2384182244))
2024-09-30 21:35:08 +00:00
b825848d85 Fix aarch64 debug build with GCC (#136990)
Fixes #136440

**Issue:**
When building PyTorch in debug mode on aarch64 architecture using GCC, we encounter relocation errors due to the R_AARCH64_CALL26 relocation limit. This occurs because debug builds with -O0 optimization generate larger code sizes, potentially exceeding the range limit for these relocations.

**Fix:**
Apply -Og optimization instead of -O0 for aarch64 GCC debug builds. This slightly reduces code size while maintaining debuggability, bringing function calls back within the range of R_AARCH64_CALL26 relocations.

The fix is implemented by conditionally setting compiler and linker flags in CMakeLists.txt:
- For aarch64 GCC debug builds: use -Og
- For all other debug builds: retain -O0

This change affects only debug builds on aarch64 with GCC, leaving other configurations unchanged.

**Testing:**
Verified that the build succeeds without relocation errors on aarch64 systems with GCC in debug mode. Ensured that debugging information is still available and useful for debugging purposes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136990
Approved by: https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2024-09-30 21:11:50 +00:00
866a64ce9a [FSDP2] Added check for contiguous parameters (#137000)
Since our implementation currently assumes contiguous strides, let us add an explicit check and raise an error at construction time if the parameter is not contiguous.
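An illustrative sketch of the kind of check described above (hypothetical helper; the real check is assumed to live in the `fully_shard` construction path with its own error message):

```
import torch

def assert_contiguous_params(module: torch.nn.Module) -> None:
    for name, param in module.named_parameters():
        if not param.is_contiguous():
            raise ValueError(
                f"FSDP2-style sharding assumes contiguous parameters; "
                f"'{name}' has strides {param.stride()}"
            )

assert_contiguous_params(torch.nn.Linear(8, 8))  # passes: Linear params are contiguous
```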

We can try to support this in the future. Mainly, I want to first learn more about how DTensor support for non-contiguous memory formats works.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137000
Approved by: https://github.com/weifengpy
2024-09-30 21:10:47 +00:00
66e3186a48 Revert "Init threadpool with user defined num_threads before default (#136793)"
This reverts commit adbcaee950afa6697c04962096344bf0962a542f.

Reverted https://github.com/pytorch/pytorch/pull/136793 on behalf of https://github.com/janeyx99 due to Caused internal Oculus crash, and internal force landed a diff without exporting to GH =.= ([comment](https://github.com/pytorch/pytorch/pull/136793#issuecomment-2384148132))
2024-09-30 21:10:12 +00:00
bc6adb9596 [EZ][BE] Delete ISSUE_TEMPALTE.md (#137040)
As it has been superseded by [ISSUES_TEMPLATE](https://github.com/pytorch/pytorch/tree/main/.github/ISSUE_TEMPLATE) folder, per https://docs.github.com/en/communities/using-templates-to-encourage-useful-issues-and-pull-requests/configuring-issue-templates-for-your-repository#creating-issue-forms

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137040
Approved by: https://github.com/ZainRizvi
2024-09-30 21:04:32 +00:00
d46ebcb31b Enable experiments for protected branches (#136785)
This is to allow the protected branches (like `main` and `nightly`) also run on the LF fleet, now that we've migrated over
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136785
Approved by: https://github.com/jeanschmidt
2024-09-30 20:58:28 +00:00
2ef1454189 Revert "Add int1 to int7 dtypes (#136301)"
This reverts commit bfa16a161d5089a9ba008f5e665f29b58dc16526.

Reverted https://github.com/pytorch/pytorch/pull/136301 on behalf of https://github.com/PaliC due to causing internal failures ([comment](https://github.com/pytorch/pytorch/pull/136301#issuecomment-2384119600))
2024-09-30 20:50:49 +00:00
0ccd39a64b Fix prefix store seg fault (#136872)
fixes https://github.com/pytorch/pytorch/issues/136723

Do not allow `None` to be passed into `PrefixStore`
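A minimal sketch of the expected behavior (assumed exception types; the exact error may differ):

```
import torch.distributed as dist

base = dist.HashStore()
ok = dist.PrefixStore("trainer/", base)  # fine

try:
    dist.PrefixStore("trainer/", None)  # previously crashed the process
except (TypeError, ValueError, RuntimeError) as exc:
    print("rejected None store:", exc)
```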

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136872
Approved by: https://github.com/kwen2501
2024-09-30 20:43:08 +00:00
7b96f3c75d Fix six broken tests in test_ops.py (#136653)
## The problem.

[A commit from three weeks ago](82d00acfee) appears to have broken five tests but was not caught by CI.

[A later commit](https://github.com/pytorch/pytorch/commit/e05ea2b1797) which added a decomposition of `transpose_copy` added another broken test, also seemingly not detected, making six total (listed below).

They came to my attention when I updated some pending decomposition pull requests which passed CI, and started getting failures like [this](https://hud.pytorch.org/pr/134319) for a test unrelated to any of these pull requests, `TestCommonCPU.test_out__refs_transpose_copy_cpu_float32`

Running `python test/test_ops.py -k _copy` on `viable/strict` found failures for six `_refs` ops: `copysign`, `expand_copy`, `index_copy`, `t_copy`, `transpose_copy`, `view_copy`

## The solution

The original commit did actually cause breakage by slightly changing user-visible behavior (in a special case involving scalar tensors being copied between different devices).

This pull request fixes that breakage in a reasonable way, but I don't understand why this error didn't appear in CI until I made later changes in the same area.

## To reproduce

To reproduce the six cases in your own client:

```
PYTORCH_OPINFO_SAMPLE_INPUT_INDEX=5 python test/test_ops.py TestCommonCPU.test_out__refs_view_copy_cpu_float32
PYTORCH_OPINFO_SAMPLE_INPUT_INDEX=2 python test/test_ops.py TestCommonCPU.test_out__refs_t_copy_cpu_float32
PYTORCH_OPINFO_SAMPLE_INPUT_INDEX=0 python test/test_ops.py TestCommonCPU.test_out__refs_index_copy_cpu_float32
PYTORCH_OPINFO_SAMPLE_INPUT_INDEX=7 python test/test_ops.py TestCommonCPU.test_out__refs_expand_copy_cpu_float32
PYTORCH_OPINFO_SAMPLE_INPUT_INDEX=0 python test/test_ops.py TestCommonCPU.test_out__refs_copysign_cpu_float32
PYTORCH_OPINFO_SAMPLE_INPUT_INDEX=4 python test/test_ops.py TestCommonCPU.test_out__refs_transpose_copy_cpu_float32
```

@amjames

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136653
Approved by: https://github.com/zou3519
2024-09-30 20:32:55 +00:00
71aac59e93 Add Triton CPU as an Inductor backend (#133408)
The goal is to use Inductor-generated kernels to stress test the new Triton CPU backend.

Differential Revision: [D63298968](https://our.internmc.facebook.com/intern/diff/D63298968)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133408
Approved by: https://github.com/jansel, https://github.com/blaine-rister, https://github.com/malfet
2024-09-30 20:24:52 +00:00
dfe1d45332 Enable tracing through auto_functionalize_v2 in compiled autograd (#136806)
auto_functionalize_v2 is the same as auto_functionalize, except that its args may contain additional constants or symints,
and its tensors are passed in one of the input list args.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136806
Approved by: https://github.com/zou3519
2024-09-30 19:16:13 +00:00
1ef5d4cdde Revert "Allow parallelize_module to get device_mesh from ambient context (#134247)"
This reverts commit 80e7478cc84919a48770ad85d6118294776fca73.

Reverted https://github.com/pytorch/pytorch/pull/134247 on behalf of https://github.com/malfet due to Broke lint, which one can clearly see in PR CI https://github.com/pytorch/pytorch/actions/runs/11112138513/job/30873604386  ([comment](https://github.com/pytorch/pytorch/pull/134247#issuecomment-2383952449))
2024-09-30 19:07:01 +00:00
4af03e54b7 [MPS][BE] Use None as alias for all types (#137004)
Tests like `new_*` and `empty_*` fail with the current implementation; see:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137004
Approved by: https://github.com/Skylion007
ghstack dependencies: #136981, #136982, #136983, #136984, #136985, #136986, #137003
2024-09-30 19:06:13 +00:00
c610aa80dc Testing: Unblock new_* testing on MPS (#137003)
By changing `other_dtype` to `torch.half` rather than `double` in
`sample_inputs_new_fns` if MPS is available
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137003
Approved by: https://github.com/Skylion007
ghstack dependencies: #136981, #136982, #136983, #136984, #136985, #136986
2024-09-30 19:06:12 +00:00
80e7478cc8 Allow parallelize_module to get device_mesh from ambient context (#134247)
This PR is for supporting calling `parallelize_module` from within a model definition, making the model a parallel one.

Calling `parallelize_module` is an alternative to maintaining a set of `ColumnWiseLinear`, `RowWiseLinear`, etc, while still being able to directly author a parallel model.

(The motivation for authoring a parallel model is that there may be other distributed operations, which may not be easily captured by any module, see the forward function below. Alternatively speaking, the purpose is to exploit the expressiveness of DTensor -- we need to first create DTensors before calling ops on them. Having parallelized modules in model is one way of creating DTensors.)

For example:
```
class FeedForward(nn.Module):
    def __init__(self, config: TransformerArgs) -> None:
        super().__init__()
        w1 = nn.Linear(config.dim, config.hidden_dim, bias=False)
        w2 = nn.Linear(config.hidden_dim, config.dim, bias=False)
        w3 = nn.Linear(config.dim, config.hidden_dim, bias=False)
        self.w1 = parallelize_module(w1, Colwise)
        self.w2 = parallelize_module(w2, Rowwise)
        self.w3 = parallelize_module(w3, Colwise)

    def forward(self, x: Tensor) -> Tensor:
        y: DTensor = self.w2(F.silu(self.w1(x)) * self.w3(x))
        # y is a DTensor with Partial placement; we can return it as is.
        return y
        # Or we can convert it to Replicate -- there is modeling flexibility here.
        return y.redistribute(Replicate())

with device_mesh:
    model = FeedForward(config)
    # Now model is a model parallelized onto device_mesh

y = model(x)

```

The `device_mesh` actually used for `parallelize_module` would be retrieved from the ambient context.

Calling `parallelize_module` from within model hierarchy also saves the use of *FQNs* as in the out-of-model annotation case.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134247
Approved by: https://github.com/tianyu-l
2024-09-30 18:42:06 +00:00
40f80a70fa Fix lint (#137023)
By migrating some of the workflows to Python-3.9 as 3.8 has been deprecated by https://github.com/pytorch/pytorch/pull/132138

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137023
Approved by: https://github.com/ZainRizvi, https://github.com/janeyx99, https://github.com/seemethere, https://github.com/kit1980, https://github.com/Skylion007
2024-09-30 18:29:02 +00:00
d33638588e [aoti][inplace] Support skipping model buffers (#136770)
Summary: Some AOTI tensor constants may be model buffers that never need to be updated.

Differential Revision: D62777502

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136770
Approved by: https://github.com/muchulee8
2024-09-30 18:28:42 +00:00
3ff2d93d9f Turn on type-checking in torch.fx.experimental.symbolic_shapes (#136972)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136972
Approved by: https://github.com/Skylion007
ghstack dependencies: #136917, #136934, #136935
2024-09-30 18:04:36 +00:00
475a8a4e0c Update ci-sev.md to make merge blocking not the default 2024-09-30 10:53:31 -07:00
76a57568de Update windows maintainers (#136901)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136901
Approved by: https://github.com/albanD
2024-09-30 16:12:49 +00:00
ae3d5ed589 [MPS] Enable nan_to_num for bfloat16 (#136986)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136986
Approved by: https://github.com/Skylion007
ghstack dependencies: #136981, #136982, #136983, #136984, #136985
2024-09-30 16:09:44 +00:00
d8d3aeae59 [MPS] Enable Renorm for bfloat16 (#136985)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136985
Approved by: https://github.com/Skylion007
ghstack dependencies: #136981, #136982, #136983, #136984
2024-09-30 16:09:44 +00:00
538fcd7579 [MPS] Enable torch.linalg.cross for bfloat16 (#136984)
By adding explicit instantiation. Tested in https://github.com/pytorch/pytorch/pull/136987
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136984
Approved by: https://github.com/Skylion007
ghstack dependencies: #136981, #136982, #136983
2024-09-30 16:09:40 +00:00
c13c7e11c5 Revert "[Inductor] Pick ISA for inductor based on ATEN_CPU_CAPABILITY (#123514)"
This reverts commit 6931c1644afdba53e63ce5671455e4e1b7265dd9.

Reverted https://github.com/pytorch/pytorch/pull/123514 on behalf of https://github.com/huydhn due to Sorry for reverting your change but its test_cpu_repro test is failing in trunk 6931c1644a ([comment](https://github.com/pytorch/pytorch/pull/123514#issuecomment-2383563919))
2024-09-30 15:47:04 +00:00
33d3d6e42a [MPS] Enable bucketization for bfloat16 (#136983)
By simply adding explicit instantiation
Tested in https://github.com/pytorch/pytorch/pull/136987

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136983
Approved by: https://github.com/Skylion007
ghstack dependencies: #136981, #136982
2024-09-30 14:45:57 +00:00
3ed2969889 [MPS] Extend fmin/fmax/copysign and nextafter to bfloat (#136982)
Just adds instantiation of the kernels and sometimes explicit cast.
Tested in https://github.com/pytorch/pytorch/pull/136987
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136982
Approved by: https://github.com/Skylion007
ghstack dependencies: #136981
2024-09-30 14:45:57 +00:00
797092b263 [MPS] Fix Gamma for bfloat16 dtypes (#136981)
Before this change, test failed with unable to compile errors, as `bfloat16` requires explicit cast.
Tested in https://github.com/pytorch/pytorch/pull/136987
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136981
Approved by: https://github.com/Skylion007
2024-09-30 14:45:52 +00:00
a15f3f51bc [AOTI] Update sam_fast from timeout to fail_to_run (#136996)
Summary: sam_fast changes from timeout to fail_to_run after https://github.com/pytorch/pytorch/pull/136591, which "regressed" in a good way. Update the expected result file and continue investigating.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136996
Approved by: https://github.com/ezyang
2024-09-30 14:05:49 +00:00
c010c6099b compile time benchmarks for AOTDispatcher (partitioner) (#136760)
compile time benchmark for the min cut partitioner. I'm hoping that this is a reasonable benchmark because:

(1) it consists of a single input + many weights that are used sequentially
(2) contains a mix of recompute vs non-recomputed ops (matmul + sin)
(3) it is relatively simple

from running locally:
```
collecting compile time instruction count for aotdispatcher_partitioner_cpu
compile time instruction count for iteration 0 is 21764219181
compile time instruction count for iteration 1 is 12475020009
compile time instruction count for iteration 2 is 12463710140
compile time instruction count for iteration 3 is 12455676489
compile time instruction count for iteration 4 is 12451344330
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136760
Approved by: https://github.com/ezyang
ghstack dependencies: #136670, #136759
2024-09-30 13:25:02 +00:00
b17cd264d3 compile time benchmarks for AOTDispatcher (inference/training/subclasses) (#136759)
this adds a few compile time benchmarks for some disjoint paths in AOTDispatcher:

(1) inference vs training code paths
(2) "subclasses" vs "no subclasses" codepaths

Also see https://github.com/pytorch/pytorch/pull/136760 for a partitioner benchmark (I'm not sure why ghstack didn't display the stack nicely)

I ran locally, and got these numbers on the 4 paths:
```
collecting compile time instruction count for aotdispatcher_inference_nosubclass_cpu
compile time instruction count for iteration 0 is 11692348671
compile time instruction count for iteration 1 is 3026287204
compile time instruction count for iteration 2 is 3011467318
compile time instruction count for iteration 3 is 3004485935
compile time instruction count for iteration 4 is 3003087410
collecting compile time instruction count for aotdispatcher_training_nosubclass_cpu
compile time instruction count for iteration 0 is 6068003223
compile time instruction count for iteration 1 is 5585418102
compile time instruction count for iteration 2 is 5581856618
compile time instruction count for iteration 3 is 5581651794
compile time instruction count for iteration 4 is 5578742619
collecting compile time instruction count for aotdispatcher_inference_subclass_cpu
compile time instruction count for iteration 0 is 8634984264
compile time instruction count for iteration 1 is 8633467573
compile time instruction count for iteration 2 is 8632182092
compile time instruction count for iteration 3 is 8632056925
compile time instruction count for iteration 4 is 8632543871
collecting compile time instruction count for aotdispatcher_training_subclass_cpu
compile time instruction count for iteration 0 is 14737239311
compile time instruction count for iteration 1 is 14734346427
compile time instruction count for iteration 2 is 14736493730
compile time instruction count for iteration 3 is 14734121272
compile time instruction count for iteration 4 is 14733852882
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136759
Approved by: https://github.com/laithsakka
ghstack dependencies: #136670
2024-09-30 13:25:02 +00:00
dfdda2f6a6 inductor: use previous guards to know if a size is 1 for broadcasting (#136670)
Fixes https://github.com/pytorch/pytorch/issues/136640

Today, inductor has some logic to figure out when it needs to do broadcasting during lowering, which just checks if any of the input shapes have sizes equal to 1.

In particular: we should already have this information by the time we get to inductor, because our FakeTensor compute will have branched/guarded on whether any ops performed broadcasting, appropriately.

In particular, if we have a tensor with a size value of `(64//((2048//(s3*((s2//s3)))))))`, and it happens to be equal to one (and it is used in an op that requires this dim to be broadcasted), FakeTensorProp will have generated a guard:
```
Eq((64//((2048//(s3*((s2//s3))))))), 1)
```

I chose the simplest possible way to beef up inductor's checks to know when a given size is equal to 1: loop over the existing shape env guards, and if our current size is a sympy expression on the LHS of one of our `Eq(LHS, 1)` guards, then return True.

I'm hoping for feedback on whether or not this approach is reasonable. One better option I could imagine is that our symbolic reasoning should have automatically simplified the size of our tensor down to a constant as part of evaluating that guard. I was originally going to try to do this directly in the shape env, but I ran into a few issues:

(1) I wanted to call some version of `set_replacement(expr, 1)`. But `set_replacement()` only accepts plain symbols on the LHS, not expressions

(2) in theory I could get this to work if I could rework the above expression to move everything that is not a free variable to the RHS, e.g. `Eq(s2, 32)`. It looks like our existing  `try_solve()` logic is... [not quite able](https://github.com/pytorch/pytorch/blob/main/torch/utils/_sympy/solve.py#L27) to do this generally though.

Checking the guards feels pretty simple-and-easy. Are we worried that it is too slow to iterate over all the guards? I could also cache the lookup so we only need to iterate over guards that are of the form `Eq(LHS, 1)`
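
For illustration, a toy sketch of the guard-lookup idea in plain sympy (not inductor's actual code; the names and the guard list are made up):

```python
import sympy

s2, s3 = sympy.symbols("s2 s3", positive=True, integer=True)
# stand-in for a symbolic size like the one above
size = 64 // (2048 // (s3 * (s2 // s3)))
# pretend FakeTensorProp recorded this guard in the shape env
guards = [sympy.Eq(size, 1)]

def statically_one(expr, guards):
    # treat expr as 1 if it appears as the LHS of an Eq(LHS, 1) guard
    return any(isinstance(g, sympy.Eq) and g.rhs == 1 and g.lhs == expr for g in guards)

print(statically_one(size, guards))  # True
```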

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136670
Approved by: https://github.com/ezyang
2024-09-30 13:24:57 +00:00
cyy
05b15dba7e [1/N] Fix clang-tidy warnings in torch/csrc/api/ (#134545)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134545
Approved by: https://github.com/ezyang
2024-09-30 09:06:30 +00:00
d6d9183456 [Inductor] Switch cpp_wrapper tests to ABI-compatible (#136904)
Summary: Switch test_cpu_cpp_wrapper and test_cuda_cpp_wrapper to test the ABI-compatible mode only. Fixed a missing Py_NewRef issue for python 3.9.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136904
Approved by: https://github.com/Yoggie9477, https://github.com/chenyang78
2024-09-30 05:44:52 +00:00
ad8fae2aa9 [AOTI] Support test_open_device_registration in ABI-compatible (#136906)
Summary: Add a device type C shim interface to support test_open_device_registration in the ABI-compatible mode.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136906
Approved by: https://github.com/chenyang78
2024-09-30 05:08:16 +00:00
8dddd45679 [BE][Ez]: Update cudnn_frontend submodule to v1.7.0 (#136920)
Updates cudnn frontend submodule to v1.7.0 which has some bugfixes and a couple new features.

https://github.com/NVIDIA/cudnn-frontend/releases/tag/v1.7.0
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136920
Approved by: https://github.com/ezyang
2024-09-30 02:50:16 +00:00
80393c90b3 docs: clarify alias usage for x parameter in vector_norm function (#136921)
- Added a note in the documentation specifying that the `input` parameter can be used as an alias for `x`.

Fixes #136560
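
For illustration, a tiny sketch of what the note documents (it assumes, as the note states, that `input` is accepted as a keyword alias at runtime):

```python
import torch

t = torch.tensor([3.0, 4.0])
print(torch.linalg.vector_norm(t))        # tensor(5.)
# `input` as a keyword alias for the argument the docs call `x`
print(torch.linalg.vector_norm(input=t))  # tensor(5.)
```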

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136921
Approved by: https://github.com/ezyang

Co-authored-by: Edward Z. Yang <ezyang@meta.com>
2024-09-30 02:50:06 +00:00
377e4bc877 Format torch.fx.experimental.validator (#136935)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136935
Approved by: https://github.com/Skylion007
ghstack dependencies: #136917, #136934
2024-09-30 02:20:40 +00:00
cf1a7eab25 Format torch.fx.experimental.symbolic_shapes with PYFMT (#136934)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136934
Approved by: https://github.com/Skylion007
ghstack dependencies: #136917
2024-09-30 02:20:40 +00:00
0a26851601 [Inductor] Handle device property warp_size is None but used on XPU. (#136834)
Fix #136820

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136834
Approved by: https://github.com/EikanWang, https://github.com/jansel
2024-09-30 02:08:45 +00:00
6931c1644a [Inductor] Pick ISA for inductor based on ATEN_CPU_CAPABILITY (#123514)
It is part of https://github.com/pytorch/pytorch/issues/123224. Pick ISA based on the environment ATEN_CPU_CAPABILITY to control CPU vec ISA level for Inductor like eager.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123514
Approved by: https://github.com/jgong5, https://github.com/peterbell10
2024-09-30 00:53:18 +00:00
9dbc6bacff Propagate detailed location information of shape guards to guards/recompiles output (#136917)
To see the payoff, look at test/dynamo/test_logging.py

The general idea is to refactor produce_guards into produce_guards_verbose, which also returns verbose code parts carrying our annotations.

The rest of the logic is plumbing around SLocs to the places they need to be so we can print them. Guards are easy; value ranges and duck sizing take more care.
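
For illustration, one way to surface the annotated output (a minimal sketch; it assumes the `guards`/`recompiles` logging artifacts exposed by `torch._logging.set_logs`):

```python
import torch

torch._logging.set_logs(guards=True, recompiles=True)

@torch.compile
def f(x):
    return x * 2

f(torch.randn(4))
f(torch.randn(8))  # size change: the emitted guard/recompile logs now carry source locations
```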

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136917
Approved by: https://github.com/anijain2305
2024-09-30 00:43:12 +00:00
e205193e1c Enable failing diffs on regression (#136551)
1. example of failing diff
https://github.com/pytorch/pytorch/pull/136740

2. test this by running
python check_results.py test_check_result/expected_test.csv   test_check_result/result_test.csv

results
```
WIN: benchmark ('a', ' instruction count') failed, actual result 90 is 18.18% lower than expected 110 ±1.00% please update the expected results.
REGRESSION: benchmark ('b', ' memory') failed, actual result 200 is 100.00% higher than expected 100 ±10.00% if this is an expected regression, please update the expected results.
MISSING REGRESSION TEST: benchmark ('d', ' missing-test') does not have a regression test enabled for it
```
MISSING REGRESSION TEST does not fail but its logged.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136551
Approved by: https://github.com/ezyang
ghstack dependencies: #136383
2024-09-29 22:31:26 +00:00
d33a5e2a57 [ROCm] fastSpecializedAtomicAdd for MI300 (#135770)
MI300 adds HW support for packed bfloat16 and fp16. Enable via existing fastSpecializedAtomicAdd.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135770
Approved by: https://github.com/xw285cornell, https://github.com/jianyuh
2024-09-29 21:52:09 +00:00
c9653bf2ca [Elastic][fix] Use the right env variable TORCH_ELASTIC_WORKER_IDENTICAL for unit test (#136916)
As the title says, this is an easy fix for the unit test.

Differential Revision: [D63577774](https://our.internmc.facebook.com/intern/diff/D63577774/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136916
Approved by: https://github.com/wz337
ghstack dependencies: #136865
2024-09-29 03:55:10 +00:00
cyy
156ca01e51 Enable clang-tidy on torch/csrc/lazy (#136851)
Enable clang-tidy on  torch/csrc/lazy
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136851
Approved by: https://github.com/Skylion007
2024-09-28 21:16:40 +00:00
d3c2123ea6 [BE][CUDA][Bugfix]: Enable extended MMA shapes in CUTLASS. (#133686)
* This fixes a major CMake/Bazel configuration bug where we were leaving CUTLASS performance on the table, especially with FlashAttention. This now enables using MMA instructions on SM90+, which should close the gap between SDPA and the external FA2. Note these operations only affect H100 and newer GPUs. Thankfully, this seems to have been updated recently into being a noop on the CUTLASS side; still, it's better to set the CMake variable properly.
* Also enables additional new shape kernels added in the recent CUTLASS 3.5.1+ update. This was the original motivation of the PR, before I realized the basic MMA kernels were accidentally disabled since we didn't go through the submodule's CMake/Bazel configs.
* Adds a bit to compile time and code size, but it's well worth it considering how much it speeds up our internal flash attention on H100s.
* These kernels and settings will be needed for Flash Attention 3 whenever we add that too.

Fixes #133695

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133686
Approved by: https://github.com/ezyang
2024-09-28 21:11:15 +00:00
1d6e0412f5 Don't uselessly recompute axiom dict every static eval call (#135429)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135429
Approved by: https://github.com/isuruf
2024-09-28 20:59:59 +00:00
6ecb73bafd Limit the option value of TORCH_SHOW_DISPATCH_TRACE (#136510)
It's more convenient for users to enable or disable dispatch tracing by setting TORCH_SHOW_DISPATCH_TRACE to 1 or 0, especially when debugging in an IDE.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136510
Approved by: https://github.com/shink, https://github.com/ezyang
2024-09-28 20:59:05 +00:00
28224329ad [Flex Attention] fix block size order (#136657)
`create_block_mask` currently gives wrong BLOCK_SIZE and shape when using non-default block size `(128,128)`.
This PR fixes the issue by using BLOCK_SIZE order `(Q_BLOCK_SIZE, KV_BLOCK_SIZE)`.
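
For reference, a minimal sketch of passing a `(Q_BLOCK_SIZE, KV_BLOCK_SIZE)` tuple (assuming the keyword names and tuple form below match the current `create_block_mask` signature):

```python
from torch.nn.attention.flex_attention import create_block_mask

def causal(b, h, q_idx, kv_idx):
    return q_idx >= kv_idx

# block sizes given in (Q_BLOCK_SIZE, KV_BLOCK_SIZE) order
block_mask = create_block_mask(
    causal, B=1, H=1, Q_LEN=512, KV_LEN=512, device="cpu", BLOCK_SIZE=(64, 128)
)
print(block_mask)
```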

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136657
Approved by: https://github.com/Chillee, https://github.com/drisspg
2024-09-28 19:56:53 +00:00
cf53ab95dc [halide-backend] Fix ops.fma codegen (#136810)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136810
Approved by: https://github.com/eellison
ghstack dependencies: #136808, #136809
2024-09-28 19:26:04 +00:00
8da9c4178c [inductor] Benchmark Halide in operatorbench.py (#136809)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136809
Approved by: https://github.com/eellison
ghstack dependencies: #136808
2024-09-28 19:26:04 +00:00
a54b69279b Bump triton pin to latest 3.1.x release branch (#136874)
Moves pin to latest in release/3.1.x

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136874
Approved by: https://github.com/bertmaher, https://github.com/drisspg, https://github.com/kit1980, https://github.com/malfet
2024-09-28 13:47:07 +00:00
b35f70da05 [ez] fixup the export of D62879819 (#136900)
a line from D62879819 (#136190) went missing somehow
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136900
Approved by: https://github.com/atalman
2024-09-28 13:46:17 +00:00
c4ae45104f [PyTorch Pinned Allocator] Move background thread init from constructor to allocate function (#136879)
Differential Revision: D63553138

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136879
Approved by: https://github.com/zyan0
2024-09-28 07:24:44 +00:00
375921b755 [inductor] Improve operatorbench.py (#136808)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136808
Approved by: https://github.com/eellison
2024-09-28 06:22:02 +00:00
96104db132 [easy] fix typo in debug logs for fx graph cache (#136889)
Summary: Accidentally messed up the debug logging here, fixing typo (scuba + tlparse logging is unaffected)

Test Plan: existing tests

Differential Revision: D63555766

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136889
Approved by: https://github.com/oulgen
2024-09-28 03:56:09 +00:00
9e4f24f8e5 Fix PT2 Source Code Annotations (#136460)
Summary: In D60803317, we added CompileContext (trace_id) information to Kineto traces using caching when a CompileContext exits. As pointed out by some users, this gives inaccurate IDs because we are not getting the context that is being looked up within the eval_frame. For this reason, we decided to revert that change and go with an approach that involves getting the trace_id associated with a given CacheEntry. To do this, we add a trace_id to the GuardedCode so that it can be passed onto a CacheEntry. Then, we change the lookup function to return said trace_id alongside the code so that we can pass both into our eval function. Once we get to a Torch-Compiled Region, we can just append the context information to the name of the annotation, thus bypassing any need for kwargs.

Test Plan: Added more comprehensive unit test. Saw that all the trace_ids appeared within the graph.

Differential Revision: D63138786

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136460
Approved by: https://github.com/ezyang
2024-09-28 03:54:43 +00:00
8df97d78c2 [QAT] Make Fused modules torchscriptable (#136285)
Summary:
Same as title.

Inspired by: https://pytorch.org/tutorials/recipes/script_optimized.html#fix-common-errors-when-using-the-script-method

Test Plan: CI

Differential Revision: D62980019

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136285
Approved by: https://github.com/jerryzh168
2024-09-28 03:46:19 +00:00
93dcb92bae [DeviceMesh][EZ] Add group description to new group (#136558)
Add group description to new_group in device_mesh to help with debuggability.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136558
Approved by: https://github.com/kwen2501, https://github.com/fduwjj
2024-09-28 03:09:41 +00:00
99f90c379e Simplify find_localzeros (#133325)
Instead of doing an N^2 connected thing, only do simplifications for binary max/min, and for very simple situations.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133325
Approved by: https://github.com/albanD
2024-09-28 02:38:31 +00:00
bfa16a161d Add int1 to int7 dtypes (#136301)
Summary:
Similar to https://github.com/pytorch/pytorch/pull/117208, we want to add int1 to int7 for edge use cases
for weight quantization (https://www.internalfb.com/diff/D62464487)

Test Plan:
python test/test_quantization.py -k test_uint4_int4_dtype

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136301
Approved by: https://github.com/ezyang
2024-09-28 02:08:33 +00:00
e4571e7025 Add abi flags to cpp_extension cache folder (#136890)
This is to avoid cache confusion between normal vs pydebug vs nogil builds in cpp extensions which can lead to catastrophic ABI issues.
It is rare today for people to run both the normal and pydebug builds on the same machine, but we expect quite a few people to run the normal and nogil builds together going forward.

This was tested locally by running each version alternately.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136890
Approved by: https://github.com/colesbury
2024-09-28 00:49:56 +00:00
f42e88fea5 [reland][Elastic] Skip store barrier and store get in host assign (#136865)
As title this is to reland https://github.com/pytorch/pytorch/pull/136579 as it broke some OSS CI

Differential Revision: [D63542918](https://our.internmc.facebook.com/intern/diff/D63542918/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136865
Approved by: https://github.com/atalman
2024-09-27 23:40:42 +00:00
ef3142d2a0 [user triton] Make tl.constexpr specialization work for triton_op & capture_triton (#136686)
In #136512, we fixed handling for tl.constexpr and dynamic shapes: if a symint is passed to tl.constexpr, you should specialize on it, because tl.constexpr implies needing to know the concrete value at compile time.

However, when using triton_op, capture_triton, or non-strict export, the regression remains (and #136512 might technically regress some specific export scenarios) - see [Richard's comment](https://github.com/pytorch/pytorch/pull/136512/files#r1775999871).

This fixes these scenarios by implementing the handling differently depending on whether we're expecting a SymNodeVariable or a SymInt(/SymBool/SymFloat).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136686
Approved by: https://github.com/zou3519
2024-09-27 23:02:46 +00:00
9d67c31758 Cast device index to int before logging (#135405)
DeviceIndex is an int8_t, which cout interprets as a char, so it shows up as a control character in logs (e.g. ^A).

Explicitly casting to int to have the numbers printed out correctly.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135405
Approved by: https://github.com/wconstab
2024-09-27 23:01:12 +00:00
fe158cfb47 [aoti] Add warning to ask users to switch to new API (#135893)
Instead of the following:
```
so_path = torch._export.aot_compile(...)
torch._export.aot_load(so_path)
```

The recommended path is to:
```
ep = torch.export.export(...)
pt2_path = torch._inductor.aoti_compile_and_package(ep, ...)
torch._inductor.package.load_package(pt2_path)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135893
Approved by: https://github.com/desertfire
2024-09-27 22:38:11 +00:00
adbcaee950 Init threadpool with user defined num_threads before default (#136793)
Fixes #134714 (or attempts to, idk how to test yet)

For posterity, how one can test:
1. make sure you have USE_PTHREADPOOL=1 or pull a packaged binary
2. run gdb --args python, with `r` to enter, `Ctrl-C` to pause, and `c` to get back into Python
3. import torch
4. torch.set_num_threads(1), make sure this does not trigger any additional threads getting created.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136793
Approved by: https://github.com/albanD
2024-09-27 22:22:37 +00:00
bc21689136 [sparse][semi-structured] Add float8 dtype support to 24 sparsity (#136397)
Summary:

This PR adds `torch.float8e4m3fn` support to cuSPARSELt and `to_sparse_semi_structured`.

This will let users to run fp8 + 2:4 sparse matmuls on Hopper GPUs with
cusparselt >= 0.6.2, via to `scaled_mm` API.

```
A = rand_sparse_semi_structured_mask(256, 128, dtype=torch.float16)
B = torch.rand(dense_input_shape, device=device).to(torch.float16).t()

A_fp8, A_scale = to_float8(A)
B_fp8, B_scale = to_float8(B)

dense_result = torch._scaled_mm(
    A_fp8, B_fp8,
    scale_a=A_scale, scale_b=B_scale,
    out_dtype=out_dtype
)
A_fp8_sparse = to_sparse_semi_structured(A_fp8)
sparse_result = torch._scaled_mm(
    A_fp8_sparse, B_fp8,
    scale_a=A_scale, scale_b=B_scale,
    out_dtype=out_dtype
)
```

Note that to keep this consistent with normal torch behavior, calling
`torch.mm(A_fp8_sparse, B_fp8)` will raise a NotImplementedError.

I also turned on cuSPARSELt by default and added CUSPARSELT_MAX_ID to the
backend to make the tests a bit cleaner

Test Plan:
```
python test/test_sparse_semi_structured -k scaled_mm
python test/test_sparse_semi_structured -k fp8
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136397
Approved by: https://github.com/drisspg
2024-09-27 21:37:34 +00:00
a28b40fa74 Improve is_fbcode functionality (#136871)
Summary: Previously, is_fbcode just checked whether the checkout was git or not. This is extremely error-prone. Let's make it fool-proof.

Test Plan: unit tests

Reviewed By: masnesral

Differential Revision: D63545169

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136871
Approved by: https://github.com/masnesral
2024-09-27 21:19:01 +00:00
283bda01aa [MPS] Error checking/bf16 support for torch.normal (#136863)
Before that attempt to run something like
```
% python -c "import torch;dev,dt='mps',torch.int; print(torch.normal(mean=torch.arange(1., 11., device=dev, dtype=dt), std=torch.arange(10, 0, -1, device=dev, dtype=dt)))"
```
Resulted in hard error
```
(mpsFileLoc): /AppleInternal/Library/BuildRoots/e0873e53-5185-11ef-9a51-9ab6d782fe32/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShadersGraph/mpsgraph/MetalPerformanceShadersGraph/Core/Files/MPSGraphUtilities.mm:233:0: error: 'mps.multiply' op requires the same element type for all operands and results
(mpsFileLoc): /AppleInternal/Library/BuildRoots/e0873e53-5185-11ef-9a51-9ab6d782fe32/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShadersGraph/mpsgraph/MetalPerformanceShadersGraph/Core/Files/MPSGraphUtilities.mm:233:0: note: see current operation: %5 = "mps.multiply"(%2, %arg1) : (tensor<10xf32>, tensor<10xsi32>) -> tensor<*xf32>
(mpsFileLoc): /AppleInternal/Library/BuildRoots/e0873e53-5185-11ef-9a51-9ab6d782fe32/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShadersGraph/mpsgraph/MetalPerformanceShadersGraph/Core/Files/MPSGraphUtilities.mm:233:0: error: 'mps.multiply' op requires the same element type for all operands and results
(mpsFileLoc): /AppleInternal/Library/BuildRoots/e0873e53-5185-11ef-9a51-9ab6d782fe32/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShadersGraph/mpsgraph/MetalPerformanceShadersGraph/Core/Files/MPSGraphUtilities.mm:233:0: note: see current operation: %5 = "mps.multiply"(%2, %arg1) : (tensor<10xf32>, tensor<10xsi32>) -> tensor<*xf32>
/AppleInternal/Library/BuildRoots/e0873e53-5185-11ef-9a51-9ab6d782fe32/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShadersGraph/mpsgraph/MetalPerformanceShadersGraph/Core/Files/MPSGraphExecutable.mm:953: failed assertion `original module failed verification'
```
After the change, it raises a nice type error
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136863
Approved by: https://github.com/Skylion007
ghstack dependencies: #136754, #136755, #136821, #136822
2024-09-27 21:11:59 +00:00
f7ab0e9989 Revert "[Flex Attention] fix block size order (#136657)"
This reverts commit b42f1e3641314c8dc369255b850450acddf3477c.

Reverted https://github.com/pytorch/pytorch/pull/136657 on behalf of https://github.com/ZainRizvi due to Sorry, this seems to break ROCm builds. inductor/test_flex_attention.py::TestFlexAttention::test_builtin_score_mods_seqlen_lt_custom_sparse_block_size_float16_score_mod1 [GH job link](https://github.com/pytorch/pytorch/actions/runs/11069782242/job/30759299713) [HUD commit link](b42f1e3641) ([comment](https://github.com/pytorch/pytorch/pull/136657#issuecomment-2380031525))
2024-09-27 20:47:54 +00:00
6e70ec9aa5 [SymmetricMemory] expose the multicast_ptr (#136840)
This allows writing triton kernels using the `multimem` ptx instructions.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136840
Approved by: https://github.com/Chillee
2024-09-27 20:41:33 +00:00
f21b471978 Revert "Fix numerical instability for norm (#129352)"
This reverts commit 66340e67515cd3592bda6bdd9bfe2ffa22fe7413.

Reverted https://github.com/pytorch/pytorch/pull/129352 on behalf of https://github.com/atalman due to Breaks Internal CI ([comment](https://github.com/pytorch/pytorch/pull/129352#issuecomment-2379989485))
2024-09-27 20:18:47 +00:00
d55eef5c59 [SymmetricMemory] improve multicast initialization/fallback logic (#136577)
Fixes https://github.com/pytorch/pytorch/issues/136494

Currently, CUDASymmetricMemory::rendezvous() initializes a multicast address if multicast support is present. However, if we believe multicast support is present but cuMulticastCreate still fails for some reason, we do not fallback gracefully.

- In addition to CUDART and driver version check, query CU_DEVICE_ATTRIBUTE_MULTICAST_SUPPORTED to determine multicast support for a rank/device.
- Before initializing multicast for a block, ensure all ranks/devices have multicast support.
- This is unlikely, but if cuMulticastCreate still fails on rank 0, print the corresponding driver error message as a warning, and gracefully skip multicast initialization for the block.
- Introduced an environment variable (TORCH_SYMM_MEM_DISABLE_MULTICAST) to allow users to explicitly disable multicast support as a workaround.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136577
Approved by: https://github.com/Chillee, https://github.com/eqy
2024-09-27 20:04:21 +00:00
e512eac410 Companion PR to https://github.com/pytorch/pytorch/pull/134022 (#136818)
Note:[ cusparselt 0.6.0](https://docs.nvidia.com/cuda/cusparselt/release_notes.html#cusparselt-v0-6-0)+ supports SM90 (Hopper). Thanks @xwang233 for catching this bug while testing upstream binaries!

Fixes the issues like:

  ```
  A_compressed = torch._cslt_compress(A)
RuntimeError: CUDA error: architecture mismatch when calling `cusparseLtInit(&handle)`
```

@kit1980 Could we get this cherry-picked to 2.5.0 please?
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136818
Approved by: https://github.com/eqy, https://github.com/jcaip, https://github.com/malfet
2024-09-27 19:57:15 +00:00
e5a57932f0 [Pytorch][AO] Update choose_qparams_per_token op to output correct shape for scales and zp (#136807)
- also makes the scales and zp dtypes reconcile with the meta impl as well as with other quantized ops' representation of scales and zero point
- makes sure quantize_per_token's output_dtype is respected

There are a few places where we need to reconcile on scale and zero point dtype, but that will come later. These fixes are mainly being done to enable the quantized KV cache through the ET stack.

Differential Revision: [D62301840](https://our.internmc.facebook.com/intern/diff/D62301840/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136807
Approved by: https://github.com/jerryzh168
2024-09-27 18:46:17 +00:00
6075f566cc [export] simplify automatic dynamic shapes processing (#136591)
Removing `_transform_shapes_for_default_dynamic` and `assume_static_by_default=False` as added in https://github.com/pytorch/pytorch/pull/133620.

This reverts back to `assume_static_by_default=True`, using dynamo decorators (e.g. `maybe_mark_dynamic`, `mark_static`) for handling Dim.AUTO & Dim.STATIC instead. This is easier to maintain, as it doesn't require reasoning about "inverting" the dynamic_shapes specs, and it also opens up usage of other decorators (`mark_dynamic`, `mark_unbacked`).

On the user side this change has no effect, but internally it means dynamic behavior is determined only by the `dynamic_shapes` specs (ignoring user-side input decorators, following https://github.com/pytorch/pytorch/pull/135536), with this information transferred for _DimHints via decorators so that Dynamo/non-strict can create symbolic_contexts accordingly, e.g. 7c6d543a5b/torch/_dynamo/variables/builder.py (L2646-L2666)

One caveat is we don't raise errors for dynamic decorators on the user side, since we don't know if they're from user markings, or from re-exporting with inputs we've previously marked.

Differential Revision: D63358628

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136591
Approved by: https://github.com/avikchaudhuri
2024-09-27 18:28:51 +00:00
a8b5adcdd5 add types to _dynamo/code_context.py (#136665)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136665
Approved by: https://github.com/williamwen42
2024-09-27 18:27:42 +00:00
287dc36395 Revert "[user triton] Make tl.constexpr specialization work for triton_op & capture_triton (#136686)"
This reverts commit 9f5b97a0065dfc4a7978a0fdf3fac2df8aef9519.

Reverted https://github.com/pytorch/pytorch/pull/136686 on behalf of https://github.com/davidberard98 due to breaks lint on main. Please rebase to see and fix the error ([comment](https://github.com/pytorch/pytorch/pull/136686#issuecomment-2379830921))
2024-09-27 18:25:49 +00:00
2208ff64ba Fix RMSNorm doc per #136597 (#136727)
Fixes #136597 (remove incorrect sqrt around `RMS(x)`)

<img width="857" alt="Screenshot 2024-09-26 at 11 46 32 AM" src="https://github.com/user-attachments/assets/21ea26ad-bd9f-4b9b-8b60-f52a1dc16da6">
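
A tiny numerical check of the corrected formula (assumes `torch.nn.RMSNorm`, available in recent releases):

```python
import torch

x = torch.randn(2, 5)
eps = 1e-6
rms_norm = torch.nn.RMSNorm(5, eps=eps)

# corrected formula: y = x / sqrt(mean(x**2, dim=-1) + eps) * weight  (no extra sqrt around RMS(x))
rms = torch.sqrt(x.pow(2).mean(dim=-1, keepdim=True) + eps)
torch.testing.assert_close(rms_norm(x), x / rms * rms_norm.weight)
print("formula matches")
```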

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136727
Approved by: https://github.com/albanD
2024-09-27 18:21:48 +00:00
2157e396a3 [dynamo] attempt run only mode when dynamo cache limit is hit (#136655)
Implement https://github.com/pytorch/pytorch/issues/135458.

Try run-only mode when dynamo cache limit is hit. If no valid cache entries are found, then skip code recursively.
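
A small sketch of how the limit gets hit (the limit value below is deliberately tiny and just for illustration):

```python
import torch
import torch._dynamo as dynamo

dynamo.config.cache_size_limit = 2  # deliberately tiny

@torch.compile
def f(x):
    return (x * 2).sum()

# each new input rank forces a fresh cache entry, so the limit is hit quickly;
# with this change, dynamo first tries run-only mode before skipping the code
for rank in range(1, 6):
    f(torch.randn(*([2] * rank)))
```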

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136655
Approved by: https://github.com/jansel
2024-09-27 17:15:05 +00:00
36428f91e9 Revert "Add Triton CPU as an Inductor backend (#133408)"
This reverts commit 31c0467594c7c41c8e8ff1828bf01fa31fc4454f.

Reverted https://github.com/pytorch/pytorch/pull/133408 on behalf of https://github.com/int3 due to internal tests failing ([comment](https://github.com/pytorch/pytorch/pull/133408#issuecomment-2379692517))
2024-09-27 16:54:27 +00:00
17f396b0b4 Delete project.default_flavors_mode buckconfig (#136772)
Summary: Buck1 only buckconfig

Test Plan: CI

Reviewed By: JakobDegen

Differential Revision: D63430482

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136772
Approved by: https://github.com/malfet
2024-09-27 16:24:50 +00:00
cyy
cbc182d2e0 Remove problematic constructor (#136708)
Since it calls a pure virtual function and it is not used elsewhere.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136708
Approved by: https://github.com/ezyang
2024-09-27 16:16:58 +00:00
dc8c0aaf4d [AOTAutogradCache] Log time taken_ns (#136529)
Summary:
This diff logs the time_taken_ns for the forward and backward graphs in AOTAutogradCache, saving it into the cache entry.

This information is helpful later when I remotify the cache, and also is just useful to have in tlparse and chromium events.

Test Plan: Run benchmark, see that the times are in the chromium events.

Reviewed By: aorenste

Differential Revision: D62590077

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136529
Approved by: https://github.com/oulgen
2024-09-27 16:14:09 +00:00
9f5b97a006 [user triton] Make tl.constexpr specialization work for triton_op & capture_triton (#136686)
In #136512, we fixed handling for tl.constexpr and dynamic shapes: if a symint is passed to tl.constexpr, you should specialize on it, because tl.constexpr implies needing to know the concrete value at compile time.

However, when using triton_op, capture_triton, or non-strict export, the regression remains (and #136512 might technically regress some specific export scenarios) - see [Richard's comment](https://github.com/pytorch/pytorch/pull/136512/files#r1775999871).

This fixes these scenarios by implementing the handling differently depending on whether we're expecting a SymNodeVariable or a SymInt(/SymBool/SymFloat).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136686
Approved by: https://github.com/zou3519
2024-09-27 16:11:02 +00:00
ad51995468 Add a nightly hotpatch utils for python only PR (#136535)
I think this could help many teams, especially the compile/export teams (/cc @ezyang), by letting end users/bug reporters quickly test a WIP PR when reporting a related bug.

This can be run quickly in an official nightly Docker container or in a nightly venv/conda env.

Let me know what you think.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136535
Approved by: https://github.com/ezyang
2024-09-27 15:58:48 +00:00
9d72f7481b [MPS] Fix AvgPool2d for float16 (#136822)
This was a stupid cast error that caused MPSGraph to crash with the following exception
```
(mpsFileLoc): /AppleInternal/Library/BuildRoots/e0873e53-5185-11ef-9a51-9ab6d782fe32/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShadersGraph/mpsgraph/MetalPerformanceShadersGraph/Core/Files/MPSGraphUtilities.mm:233:0: error: 'mps.multiply' op requires the same element type for all operands and results
(mpsFileLoc): /AppleInternal/Library/BuildRoots/e0873e53-5185-11ef-9a51-9ab6d782fe32/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShadersGraph/mpsgraph/MetalPerformanceShadersGraph/Core/Files/MPSGraphUtilities.mm:233:0: note: see current operation: %3 = "mps.multiply"(%2, %arg1) : (tensor<1x3x9x9xf16>, tensor<1xf32>) -> tensor<*xf32>
(mpsFileLoc): /AppleInternal/Library/BuildRoots/e0873e53-5185-11ef-9a51-9ab6d782fe32/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShadersGraph/mpsgraph/MetalPerformanceShadersGraph/Core/Files/MPSGraphUtilities.mm:233:0: error: 'mps.multiply' op requires the same element type for all operands and results
(mpsFileLoc): /AppleInternal/Library/BuildRoots/e0873e53-5185-11ef-9a51-9ab6d782fe32/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShadersGraph/mpsgraph/MetalPerformanceShadersGraph/Core/Files/MPSGraphUtilities.mm:233:0: note: see current operation: %3 = "mps.multiply"(%2, %arg1) : (tensor<1x3x9x9xf16>, tensor<1xf32>) -> tensor<*xf32>
/AppleInternal/Library/BuildRoots/e0873e53-5185-11ef-9a51-9ab6d782fe32/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShadersGraph/mpsgraph/MetalPerformanceShadersGraph/Core/Files/MPSGraphExecutable.mm:953: failed assertion `original module failed verification'
```
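
A minimal sketch of the kind of call this affects (the exact original repro isn't in the commit; the shape below is taken from the error log, and it needs a machine with MPS):

```python
import torch

if torch.backends.mps.is_available():
    x = torch.randn(1, 3, 9, 9, dtype=torch.float16, device="mps")
    y = torch.nn.functional.avg_pool2d(x, kernel_size=3)
    print(y.shape, y.dtype)  # expected: torch.Size([1, 3, 3, 3]) torch.float16
```
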
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136822
Approved by: https://github.com/Skylion007
ghstack dependencies: #136754, #136755, #136821
2024-09-27 15:32:18 +00:00
2b6f4e9e24 [BE][MPS] Delete MacOS12 low-precision ops (#136821)
`norm` and `masked.normalize` still have to stay in the list
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136821
Approved by: https://github.com/Skylion007
ghstack dependencies: #136754, #136755
2024-09-27 15:32:18 +00:00
45a8b5682e [inductor] Triton codegen: Use scalar when creating f64 constant instead of 1-element tensor (#136858)
This is a retry of https://github.com/pytorch/pytorch/pull/136594, which is having trouble landing.

Summary: We have an internal report of a Triton compiler error `ValueError: Cannot broadcast, rank mismatch: [1], [1, 2048]` coming from a line like this:

`tmp25 = tl.broadcast_to(((tl.full([1], 1.00000000000000, tl.float64)) + ((ks0 // 3278).to(tl.float64))) / (((tl.full([1], 0.500000000000000, tl.float64))*(libdevice.sqrt((1 + ((ks0 // 3278)*(ks0 // 3278)) + ((-2)*(ks0 // 3278))).to(tl.float64).to(tl.float32)))) + ((tl.full([1], 0.500000000000000, tl.float64))*((1 + (ks0 // 3278)).to(tl.float64)))), [XBLOCK, RBLOCK])`

https://github.com/pytorch/pytorch/pull/135260 is the cause, presumably because we turn a constant into a 1-element tensor with: `(tl.full([1], const, tl.float64))`. It looks like changing the syntax to `(tl.full([], const, tl.float64))` gives us what we want?

Differential Revision: [D63540693](https://our.internmc.facebook.com/intern/diff/D63540693)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136858
Approved by: https://github.com/atalman
2024-09-27 15:14:12 +00:00
34d788ffb0 [aotd] Do not force contiguous() for channels_last (#135225)
Original Issue: https://github.com/pytorch/pytorch/issues/134644

We assume trace_tangents to have the same memory_format as inputs, outputs, intermediate during first tracing.

=>
Tracing time:
- Store trace_tangents_memory_formats in metadata
- Coerce tangents to deduced memory_format

Runtime:
- Coerce tangents to tracing memory format from metadata

Subclasses logic:
 - Previously, the tangent-coercion logic did not handle the nested-subclasses case; this is fixed here.

For Subclasses we deduce memory format for subclass_tensor first, then for each element of subclass:
[subclass_tensor_memory_format, subclass_tensor_elem0_memory_format, ... ]

If subclass element (__tensor_flatten__[0] tensors) is also subclass => on its place we will have a nested list of the same structure.

The recursive traversal of the subclass tree is expensive, so we do memory format deduction and coercion at the same time to keep a single traversal. With this approach there is no regression compared with the previous logic, which also did one traversal (see the `coerce_tangent_and_suggest_memory_format` method).

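A rough sketch of the deduce-and-coerce idea in isolation (the helper below is made up for illustration; the real logic lives in `coerce_tangent_and_suggest_memory_format` and also handles subclasses):

```python
import torch

def suggest_memory_format(t: torch.Tensor) -> torch.memory_format:
    # deduce the tangent's memory format at trace time
    if t.dim() == 4 and t.is_contiguous(memory_format=torch.channels_last):
        return torch.channels_last
    return torch.contiguous_format

traced_tangent = torch.randn(2, 3, 8, 8).to(memory_format=torch.channels_last)
fmt = suggest_memory_format(traced_tangent)  # stored in metadata during tracing

# at runtime, coerce the incoming tangent to the traced format instead of forcing .contiguous()
runtime_tangent = torch.randn(2, 3, 8, 8).contiguous(memory_format=fmt)
print(runtime_tangent.is_contiguous(memory_format=torch.channels_last))  # True
```
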
Other small change:
Removed a duplicated, unrelated comment.

Testing

```
python test/functorch/test_aotdispatch.py -k test_channels_last_grads_no_force_contiguous
```

Benchmarking:
After change:
```
└─ $ PYTORCH_AOTD_DEBUG_PROFILE=1 python test/functorch/test_aotdispatch.py -k test_benchmark_grads_no_force_contiguous
Benchmark SUBCLASS avg_bwd_duration:4.059906005859375 ms
Benchmark NO_SUBCLASS avg_bwd_duration:3.1563830375671387 ms
```
Before change:
```
BEFORE_CHANGE SUBCLASS 4.1194
```

No significant change in processing time.

(We do single traverse of subclass tree for collecting memory_formats and coercing during tracing.)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135225
Approved by: https://github.com/bdhirsh
2024-09-27 15:01:20 +00:00
de159f0c8d Revert "Deal with size oblivious before going into worker (#135137)"
This reverts commit 285fa03b5e1540a52b354664f609f8576c5b5431.

Reverted https://github.com/pytorch/pytorch/pull/135137 on behalf of https://github.com/ezyang due to this is the one that actually broke main ([comment](https://github.com/pytorch/pytorch/pull/135137#issuecomment-2379438566))
2024-09-27 14:41:27 +00:00
1be3d62237 [ONNX] Remove unused functions (#136609)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136609
Approved by: https://github.com/Skylion007
2024-09-27 14:34:05 +00:00
e5228a7771 Revert "Don't uselessly recompute axiom dict every static eval call (#135429)"
This reverts commit 507c69e20f645fdb0fbf43b05be0c5117971464e.

Reverted https://github.com/pytorch/pytorch/pull/135429 on behalf of https://github.com/malfet due to It(or it's parent) broke trunk CI, see 507c69e20f ([comment](https://github.com/pytorch/pytorch/pull/135429#issuecomment-2379422971))
2024-09-27 14:33:25 +00:00
a55aa71b04 Limit number of cores to 16 when benchmarking Inductor on ARM (#136424)
Sets OMP_NUM_THREADS to 16

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136424
Approved by: https://github.com/malfet
2024-09-27 14:22:49 +00:00
e9d2765ec8 Revert "Add deterministic path for CUDA cumsum (#136224)"
This reverts commit d1bb8e828f280d1c66fff193c043d5bc36154577.

Reverted https://github.com/pytorch/pytorch/pull/136224 on behalf of https://github.com/atalman due to Break internal CI ([comment](https://github.com/pytorch/pytorch/pull/136224#issuecomment-2379214226))
2024-09-27 12:54:47 +00:00
c2637a7b26 [inductor] [cpp] fix gemm_output_name conflict (#136419)
Fixes the max-autotune failure of `soft_actor_critic` of Torchbench in FP32 single thread dynamic shape case:
```log
  File "/home/user/inductor/pytorch/torch/_inductor/codegen/cpp_micro_gemm.py", line 136, in codegen_call
    C_ptr = f"&({kernel.index(C, [0, 0])})"
  File "/home/user/inductor/pytorch/torch/_inductor/codegen/cpp_template_kernel.py", line 135, in index
    else self.args.input(node.get_name())
  File "/home/user/inductor/pytorch/torch/_inductor/codegen/common.py", line 1251, in input
    assert name not in V.graph.removed_buffers, name
AssertionError: buf_GemmOut
```

The 1st and 2nd linears do not need to use a local buffer, while the 3rd linear does.
The 3rd linear, which uses a local buffer, will add its global buffer (named `buf_GemmOut`) into `V.graph.removed_buffers`.

When scheduling the nodes, the 1st linear (which won't use a local buffer) gets its output buffer (also named `buf_GemmOut`) from the input, finds that it's in `V.graph.removed_buffers`, and raises the AssertionError. The issue is that the output buffers of all these linears are named `buf_GemmOut`, which conflicts.

Rename these buffers by adding the name of the `template_buffer` as the prefix.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136419
Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5
ghstack dependencies: #136418, #136518
2024-09-27 12:23:17 +00:00
b42f1e3641 [Flex Attention] fix block size order (#136657)
`create_block_mask` currently gives wrong BLOCK_SIZE and shape when using non-default block size `(128,128)`.
This PR fixes the issue by using BLOCK_SIZE order `(Q_BLOCK_SIZE, KV_BLOCK_SIZE)`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136657
Approved by: https://github.com/Chillee, https://github.com/drisspg
2024-09-27 11:26:47 +00:00
9581508383 [aotd] Cleanup on subclasses in inductor freezing (#136549)
Cleanup:
1/ We do not need to call unwrap_subclasses() in the freezing wrapper, as it will be wrapped by the AOTD wrappers, which include SubclassesWrapper.
2/ No need to use weak references for the unwrapped list; dynamo optimizers need to clean the unwrapped list along with the original params_flat.
Verified with the fbcode compiled_optimizers tests.

Differential Revision: [D63393651](https://our.internmc.facebook.com/intern/diff/D63393651)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136549
Approved by: https://github.com/bdhirsh
2024-09-27 11:20:03 +00:00
cyy
bbff667e32 [Distributed] [13/N] Fix clang-tidy warnings in torch/csrc/distributed/ (#136713)
Follows #136528

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136713
Approved by: https://github.com/kwen2501
2024-09-27 10:11:53 +00:00
48c18ff850 [dynamo] Added support for tensor's is_inference method (#136450)
Fixes #135439

This PR adds support for the `is_inference` method on torch tensors which successfully compiles the following example fn without graph breaks:
```python
def fn_simple(x):
    if x.is_inference():
        return x.sum()
    else:
        return x.min()
```
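
For context, a minimal usage sketch (assuming the support added here; a recompile is expected when the `is_inference` state changes):

```python
import torch

def fn_simple(x):
    if x.is_inference():
        return x.sum()
    return x.min()

compiled = torch.compile(fn_simple, fullgraph=True)
print(compiled(torch.randn(4)))      # regular tensor -> min()
with torch.inference_mode():
    print(compiled(torch.randn(4)))  # inference tensor -> sum()
```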

I've also tried to add guards on the tensor to guard against  `is_inference`. I wasn't 100% sure where these should go so please don't hesitate to correct me.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136450
Approved by: https://github.com/ezyang
2024-09-27 09:15:07 +00:00
e14b58ffbd Using device-agnostic autocast api (#136613)
- using torch.autocast(device_type="cuda") instead of torch.cuda.amp.autocast()
- using torch.autocast(device_type="cpu") instead of torch.cpu.amp.autocast() (see the sketch below)

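A minimal CPU-only sketch of the device-agnostic form (illustrative only; the PR's actual call sites aren't reproduced here):

```python
import torch

# replaces torch.cpu.amp.autocast(); for CUDA, use device_type="cuda"
with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    a = torch.randn(8, 8)
    b = torch.randn(8, 8)
    c = a @ b  # matmul runs in bfloat16 under autocast
print(c.dtype)  # torch.bfloat16
```
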
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136613
Approved by: https://github.com/shink, https://github.com/cyyever, https://github.com/kwen2501
2024-09-27 07:16:24 +00:00
ad6c70b656 [PP] Remove modifications to autograd nodes in ZB (#136678)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136678
Approved by: https://github.com/wconstab, https://github.com/kwen2501
ghstack dependencies: #136507, #136584
2024-09-27 07:07:58 +00:00
9529d018e9 Refactor offset logic and work for nD (#135861)
Address the TODO tasks in the distributed test files:

- TODO: make this test cleaner and work for nD
- TODO: add comments for create_plan/TestDedupTensor

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135861
Approved by: https://github.com/wz337
2024-09-27 06:13:06 +00:00
69bd13d12e [EZ][BE] Add torch.complex to MPS_DTYPES (#136755)
As the minimal supported OS has been raised to macOS 13, some basic complex operations should be supported, and the rest can be `xfailed`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136755
Approved by: https://github.com/Skylion007
ghstack dependencies: #136754
2024-09-27 05:01:40 +00:00
73f038c5b3 Log total miss inplaced bytes (#136684)
Summary: title.

Test Plan: add tests. run existing tests.

Differential Revision: D63411459

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136684
Approved by: https://github.com/zou3519
2024-09-27 04:57:57 +00:00
0200bea562 Delete grid reduction optimization that is causing specialization (#136783)
Summary:
https://fb.workplace.com/groups/1075192433118967/posts/1510513706253502

Creating a set is causing symexpr to specialize

Test Plan: CI

Reviewed By: ezyang

Differential Revision: D63432357

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136783
Approved by: https://github.com/ezyang, https://github.com/zou3519
2024-09-27 04:39:39 +00:00
a63d7cb54c add typing to _dynamo/current_scope_id.py (#136676)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136676
Approved by: https://github.com/jansel, https://github.com/zou3519, https://github.com/Skylion007
2024-09-27 04:09:15 +00:00
5eb68d565f Revert "[inductor] Triton codegen: Use scalar when creating f64 constant instead of 1-element tensor (#136594)"
This reverts commit 2c5f5e303a8d6fd55b6651f4d965fafaa6a540a7.

Reverted https://github.com/pytorch/pytorch/pull/136594 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/136594#issuecomment-2378358302))
2024-09-27 04:06:05 +00:00
507c69e20f Don't uselessly recompute axiom dict every static eval call (#135429)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135429
Approved by: https://github.com/isuruf
ghstack dependencies: #135137
2024-09-27 04:03:25 +00:00
285fa03b5e Deal with size oblivious before going into worker (#135137)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135137
Approved by: https://github.com/isuruf
2024-09-27 04:03:25 +00:00
86631eccda [Inductor] Remove stride-0 dimensions from more complex block pointers (#135557)
Related issue: #125077

### Feature
Inductor tries to remove dimensions with stride 0 from block pointers. Rather than loading with stride 0, it's more efficient to load a smaller block pointer, then use `tl.broadcast_to` to broadcast it up to the desired size. This already worked for simpler block pointers, but it was disabled for more complex block pointers which used `tl.reshape` to change the dimensionality after loading.

This PR generalizes the approach to work for all block pointers. The idea is to first reshape, adding singleton dimensions, then broadcast those singletons up to something larger, then reshape again to the final output shape. For readability, we emit this code only if it actually does something. Simpler loads will just have `tl.load`.

Here's an example of a complicated kernel that uses the `load` -> `reshape` -> `broadcast` -> `reshape` pattern. (The first reshape is actually the slice `[:, None, None]`.)
```
@triton.jit
def triton_(in_ptr0, in_ptr1, out_ptr0, xnumel, XBLOCK : tl.constexpr):
    xnumel = 64
    xoffset = tl.program_id(0) * XBLOCK
    xindex = xoffset + tl.arange(0, XBLOCK)[:]
    xmask = xindex < xnumel
    x2 = xindex
    x1 = (xindex // 8)
    tmp0 = tl.load(tl.make_block_ptr(in_ptr0, shape=[64], strides=[1], block_shape=[XBLOCK], order=[0], offsets=[xoffset]), boundary_check=[0])
    tmp1 = tl.reshape(tl.broadcast_to(tl.load(tl.make_block_ptr(in_ptr1, shape=[8], strides=[8], block_shape=[((7 + XBLOCK) // 8)], order=[0], offsets=[(xoffset // 8)]), boundary_check=[0], eviction_policy='evict_last')[:, None, None], [((7 + XBLOCK) // 8), ((1) * ((1) <= (((7 + XBLOCK) // 8))) + (((7 + XBLOCK) // 8)) * ((((7 + XBLOCK) // 8)) < (1))), ((8) * ((8) <= (XBLOCK)) + (XBLOCK) * ((XBLOCK) < (8)))]), [XBLOCK])
    tmp2 = tmp0 + tmp1
    tl.store(tl.make_block_ptr(out_ptr0, shape=[64], strides=[1], block_shape=[XBLOCK], order=[0], offsets=[xoffset]), tmp2.to(tl.float32), boundary_check=[0])
''', device_str='cuda')
```

Before this PR, we would have stride-0 dimensions:
```
@triton.jit
def triton_(in_ptr0, in_ptr1, out_ptr0, xnumel, XBLOCK : tl.constexpr):
    xnumel = 64
    xoffset = tl.program_id(0) * XBLOCK
    xindex = xoffset + tl.arange(0, XBLOCK)[:]
    xmask = xindex < xnumel
    x2 = xindex
    x1 = (xindex // 8)
    tmp0 = tl.load(tl.make_block_ptr(in_ptr0, shape=[64], strides=[1], block_shape=[XBLOCK], order=[0], offsets=[xoffset]), boundary_check=[0])
    tmp1 = tl.reshape(tl.load(tl.make_block_ptr(in_ptr1, shape=[8, 1, 8], strides=[8, 0, 0], block_shape=[((7 + XBLOCK) // 8), ((1) * ((1) <= (((7 + XBLOCK) // 8))) + (((7 + XBLOCK) // 8)) * ((((7 + XBLOCK) // 8)) < (1))), ((8) * ((8) <= (XBLOCK)) + (XBLOCK) * ((XBLOCK) < (8)))], order=[2, 1, 0], offsets=[(xoffset // 8), 0, xoffset % 8]), boundary_check=[0], eviction_policy='evict_last'), [XBLOCK])
    tmp2 = tmp0 + tmp1
    tl.store(tl.make_block_ptr(out_ptr0, shape=[64], strides=[1], block_shape=[XBLOCK], order=[0], offsets=[xoffset]), tl.broadcast_to(tmp2, [XBLOCK]).to(tl.float32), boundary_check=[0])
''', device_str='cuda')
```

Here's a simpler example where we use 2D tiling. In this case we don't actually need the broadcast. The broadcast is implied via a slice adding a new singleton dimension. This code is not changed by this PR, but it's important to know that we don't accidentally insert unnecessary broadcasts.
```
@triton.jit
def triton_(in_ptr0, in_ptr1, out_ptr0, ynumel, xnumel, YBLOCK : tl.constexpr, XBLOCK : tl.constexpr):
    ynumel = 8
    xnumel = 8
    yoffset = tl.program_id(1) * YBLOCK
    yindex = yoffset + tl.arange(0, YBLOCK)[None, :]
    ymask = yindex < ynumel
    xoffset = tl.program_id(0) * XBLOCK
    xindex = xoffset + tl.arange(0, XBLOCK)[:, None]
    xmask = xindex < xnumel
    x1 = xindex
    y0 = yindex
    tmp0 = tl.load(tl.make_block_ptr(in_ptr0, shape=[8, 8], strides=[1, 8], block_shape=[XBLOCK, YBLOCK], order=[1, 0], offsets=[xoffset, yoffset]), boundary_check=[0, 1])
    tmp1 = tl.load(tl.make_block_ptr(in_ptr1, shape=[8], strides=[8], block_shape=[YBLOCK], order=[0], offsets=[yoffset]), boundary_check=[0], eviction_policy='evict_last')[None, :]
    tmp2 = tmp0 + tmp1
    tl.store(tl.make_block_ptr(out_ptr0, shape=[8, 8], strides=[1, 8], block_shape=[XBLOCK, YBLOCK], order=[1, 0], offsets=[xoffset, yoffset]), tmp2.to(tl.float32), boundary_check=[0, 1])
''', device_str='cuda')
```
### Test Plan
Added a new expecttest to check the emitted code for broadcast addition. Looking at the test, we can see that stride 0 dimensions are removed. (This test generated the example kernels in the previous section.)

This change also removed a stride-0 dimension in an existing block pointer test. I updated the expected code accordingly.

Bonus: I noticed that the test parametrization for `config.prefer_nd_tiling` wasn't working as intended. It ended up always setting this option to `True`. Fixed it so we get the intended test coverage.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135557
Approved by: https://github.com/shunting314, https://github.com/jansel

Co-authored-by: Yueming Hao <yhao@meta.com>
2024-09-27 04:01:40 +00:00
2c5f5e303a [inductor] Triton codegen: Use scalar when creating f64 constant instead of 1-element tensor (#136594)
Summary: We have an internal report of a Triton compiler error `ValueError: Cannot broadcast, rank mismatch: [1], [1, 2048]` coming from a line like this:

`tmp25 = tl.broadcast_to(((tl.full([1], 1.00000000000000, tl.float64)) + ((ks0 // 3278).to(tl.float64))) / (((tl.full([1], 0.500000000000000, tl.float64))*(libdevice.sqrt((1 + ((ks0 // 3278)*(ks0 // 3278)) + ((-2)*(ks0 // 3278))).to(tl.float64).to(tl.float32)))) + ((tl.full([1], 0.500000000000000, tl.float64))*((1 + (ks0 // 3278)).to(tl.float64)))), [XBLOCK, RBLOCK])
`

https://github.com/pytorch/pytorch/pull/135260 is the cause, presumably because we turn a constant into a 1-element tensor with: `(tl.full([1], const, tl.float64))`. It looks like changing the syntax to `(tl.full([], const, tl.float64))` gives us what we want?
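
For illustration only, a hedged standalone sketch of the scalar form (the kernel, tensor names, and block size are made up, not taken from the generated Inductor code), assuming a CUDA device and a Triton version where `tl.full([], ...)` produces a rank-0 scalar, as the fix relies on:
```
import torch
import triton
import triton.language as tl


@triton.jit
def add_one_f64(x_ptr, out_ptr, n_elements, BLOCK: tl.constexpr):
    offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
    mask = offs < n_elements
    x = tl.load(x_ptr + offs, mask=mask)
    # Rank-0 f64 constant: combines with the [BLOCK]-shaped tile without any
    # explicit rank handling. tl.full([1], 1.0, tl.float64) would instead
    # produce a rank-1 tensor of shape [1].
    one = tl.full([], 1.0, tl.float64)
    y = x.to(tl.float64) + one
    tl.store(out_ptr + offs, y.to(tl.float32), mask=mask)


x = torch.randn(1024, device="cuda")
out = torch.empty_like(x)
add_one_f64[(triton.cdiv(x.numel(), 256),)](x, out, x.numel(), BLOCK=256)
```
The intent, per the summary above, is that a rank-0 constant behaves like a plain scalar and so avoids the rank-mismatch error raised by `tl.broadcast_to`.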

Differential Revision: [D63465169](https://our.internmc.facebook.com/intern/diff/D63465169)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136594
Approved by: https://github.com/mengluy0125, https://github.com/jansel
2024-09-27 04:01:09 +00:00
913 changed files with 24600 additions and 10152 deletions

View File

@ -21,6 +21,3 @@
cxx = /usr/bin/clang++
cxxpp = /usr/bin/clang++
ld = /usr/bin/clang++
[project]
default_flavors_mode=all

View File

@ -355,6 +355,12 @@ case "$image" in
CONDA_CMAKE=yes
VISION=yes
;;
pytorch-linux-jammy-py3-clang18-asan)
ANACONDA_PYTHON_VERSION=3.10
CLANG_VERSION=18
CONDA_CMAKE=yes
VISION=yes
;;
pytorch-linux-jammy-py3.9-gcc11)
ANACONDA_PYTHON_VERSION=3.9
GCC_VERSION=11
@ -381,6 +387,13 @@ case "$image" in
HALIDE=yes
TRITON=yes
;;
pytorch-linux-jammy-py3.12-triton-cpu)
CUDA_VERSION=12.4
ANACONDA_PYTHON_VERSION=3.12
GCC_VERSION=11
CONDA_CMAKE=yes
TRITON_CPU=yes
;;
pytorch-linux-focal-linter)
# TODO: Use 3.9 here because of this issue https://github.com/python/mypy/issues/13627.
# We will need to update mypy version eventually, but that's for another day. The task
@ -510,6 +523,7 @@ docker build \
--build-arg "UCC_COMMIT=${UCC_COMMIT}" \
--build-arg "CONDA_CMAKE=${CONDA_CMAKE}" \
--build-arg "TRITON=${TRITON}" \
--build-arg "TRITON_CPU=${TRITON_CPU}" \
--build-arg "ONNX=${ONNX}" \
--build-arg "DOCS=${DOCS}" \
--build-arg "INDUCTOR_BENCHMARKS=${INDUCTOR_BENCHMARKS}" \

View File

@ -0,0 +1 @@
6a333f1b05671f6fada4ba7bbfae4a02a9d96f4f

View File

@ -1 +1 @@
5fe38ffd73c2ac6ed6323b554205186696631c6f
cf34004b8a67d290a962da166f5aa2fc66751326

View File

@ -13,11 +13,17 @@ if [ -n "$CLANG_VERSION" ]; then
elif [[ $UBUNTU_VERSION == 22.04 ]]; then
# work around ubuntu apt-get conflicts
sudo apt-get -y -f install
wget --no-check-certificate -O - https://apt.llvm.org/llvm-snapshot.gpg.key | sudo apt-key add -
if [[ $CLANG_VERSION == 18 ]]; then
apt-add-repository "deb http://apt.llvm.org/jammy/ llvm-toolchain-jammy-18 main"
fi
fi
sudo apt-get update
apt-get install -y --no-install-recommends clang-"$CLANG_VERSION"
apt-get install -y --no-install-recommends llvm-"$CLANG_VERSION"
apt-get install -y --no-install-recommends clang-"$CLANG_VERSION" llvm-"$CLANG_VERSION"
if [[ $CLANG_VERSION == 18 ]]; then
apt-get install -y --no-install-recommends libomp-18-dev
fi
# Install dev version of LLVM.
if [ -n "$LLVMDEV" ]; then

View File

@ -105,7 +105,7 @@ function install_121 {
}
function install_124 {
echo "Installing CUDA 12.4.1 and cuDNN ${CUDNN_VERSION} and NCCL ${NCCL_VERSION} and cuSparseLt-0.5.2"
echo "Installing CUDA 12.4.1 and cuDNN ${CUDNN_VERSION} and NCCL ${NCCL_VERSION} and cuSparseLt-0.6.2"
rm -rf /usr/local/cuda-12.4 /usr/local/cuda
# install CUDA 12.4.1 in the same container
wget -q https://developer.download.nvidia.com/compute/cuda/12.4.1/local_installers/cuda_12.4.1_550.54.15_linux.run

View File

@ -5,19 +5,19 @@ set -ex
NCCL_VERSION=v2.21.5-1
function install_cusparselt_052 {
function install_cusparselt_062 {
# cuSparseLt license: https://docs.nvidia.com/cuda/cusparselt/license.html
mkdir tmp_cusparselt && pushd tmp_cusparselt
wget -q https://developer.download.nvidia.com/compute/cusparselt/redist/libcusparse_lt/linux-sbsa/libcusparse_lt-linux-sbsa-0.5.2.1-archive.tar.xz
tar xf libcusparse_lt-linux-sbsa-0.5.2.1-archive.tar.xz
cp -a libcusparse_lt-linux-sbsa-0.5.2.1-archive/include/* /usr/local/cuda/include/
cp -a libcusparse_lt-linux-sbsa-0.5.2.1-archive/lib/* /usr/local/cuda/lib64/
wget -q https://developer.download.nvidia.com/compute/cusparselt/redist/libcusparse_lt/linux-sbsa/libcusparse_lt-linux-sbsa-0.6.2.3-archive.tar.xz
tar xf libcusparse_lt-linux-sbsa-0.6.2.3-archive.tar.xz
cp -a libcusparse_lt-linux-sbsa-0.6.2.3-archive/include/* /usr/local/cuda/include/
cp -a libcusparse_lt-linux-sbsa-0.6.2.3-archive/lib/* /usr/local/cuda/lib64/
popd
rm -rf tmp_cusparselt
}
function install_124 {
echo "Installing CUDA 12.4.1 and cuDNN 9.1 and NCCL ${NCCL_VERSION} and cuSparseLt-0.5.2"
echo "Installing CUDA 12.4.1 and cuDNN 9.1 and NCCL ${NCCL_VERSION} and cuSparseLt-0.6.2"
rm -rf /usr/local/cuda-12.4 /usr/local/cuda
# install CUDA 12.4.1 in the same container
wget -q https://developer.download.nvidia.com/compute/cuda/12.4.1/local_installers/cuda_12.4.1_550.54.15_linux_sbsa.run
@ -44,7 +44,7 @@ function install_124 {
cd ..
rm -rf nccl
install_cusparselt_052
install_cusparselt_062
ldconfig
}

View File

@ -32,7 +32,7 @@ pip_install coloredlogs packaging
pip_install onnxruntime==1.18.1
pip_install onnx==1.16.2
pip_install onnxscript==0.1.0.dev20240831 --no-deps
pip_install onnxscript==0.1.0.dev20241008 --no-deps
# required by onnxscript
pip_install ml_dtypes

View File

@ -15,8 +15,11 @@ conda_reinstall() {
if [ -n "${XPU_VERSION}" ]; then
TRITON_REPO="https://github.com/intel/intel-xpu-backend-for-triton"
TRITON_TEXT_FILE="triton-xpu"
elif [ -n "${TRITON_CPU}" ]; then
TRITON_REPO="https://github.com/triton-lang/triton-cpu"
TRITON_TEXT_FILE="triton-cpu"
else
TRITON_REPO="https://github.com/openai/triton"
TRITON_REPO="https://github.com/triton-lang/triton"
TRITON_TEXT_FILE="triton"
fi
@ -44,9 +47,10 @@ chown -R jenkins /var/lib/jenkins/triton
chgrp -R jenkins /var/lib/jenkins/triton
pushd /var/lib/jenkins/
as_jenkins git clone ${TRITON_REPO} triton
as_jenkins git clone --recursive ${TRITON_REPO} triton
cd triton
as_jenkins git checkout ${TRITON_PINNED_COMMIT}
as_jenkins git submodule update --init --recursive
cd python
# TODO: remove patch setup.py once we have a proper fix for https://github.com/triton-lang/triton/issues/4527

View File

@ -139,9 +139,9 @@ opt-einsum==3.3
#Pinned versions: 3.3
#test that import: test_linalg.py
optree==0.12.1
optree==0.13.0
#Description: A library for tree manipulation
#Pinned versions: 0.12.1
#Pinned versions: 0.13.0
#test that import: test_vmap.py, test_aotdispatch.py, test_dynamic_shapes.py,
#test_pytree.py, test_ops.py, test_control_flow.py, test_modules.py,
#common_utils.py, test_eager_transforms.py, test_python_dispatch.py,

View File

@ -147,6 +147,13 @@ COPY ci_commit_pins/triton.txt triton.txt
RUN if [ -n "${TRITON}" ]; then bash ./install_triton.sh; fi
RUN rm install_triton.sh common_utils.sh triton.txt
ARG TRITON_CPU
COPY ./common/install_triton.sh install_triton.sh
COPY ./common/common_utils.sh common_utils.sh
COPY ci_commit_pins/triton-cpu.txt triton-cpu.txt
RUN if [ -n "${TRITON_CPU}" ]; then bash ./install_triton.sh; fi
RUN rm install_triton.sh common_utils.sh triton-cpu.txt
ARG EXECUTORCH
# Build and install executorch
COPY ./common/install_executorch.sh install_executorch.sh

View File

@ -178,7 +178,7 @@ fi
# sccache will fail for CUDA builds if all cores are used for compiling
# gcc 7 with sccache seems to have intermittent OOM issue if all cores are used
if [ -z "$MAX_JOBS" ]; then
if { [[ "$BUILD_ENVIRONMENT" == *cuda* ]] || [[ "$BUILD_ENVIRONMENT" == *gcc7* ]]; } && which sccache > /dev/null; then
if { [[ "$BUILD_ENVIRONMENT" == *cuda* ]]; } && which sccache > /dev/null; then
export MAX_JOBS=$(($(nproc) - 1))
fi
fi
@ -218,10 +218,6 @@ if [[ "${BUILD_ENVIRONMENT}" == *-pch* ]]; then
export USE_PRECOMPILED_HEADERS=1
fi
if [[ "${BUILD_ENVIRONMENT}" == *linux-focal-py3.7-gcc7-build* ]]; then
export USE_GLOO_WITH_OPENSSL=ON
fi
if [[ "${BUILD_ENVIRONMENT}" != *android* && "${BUILD_ENVIRONMENT}" != *cuda* ]]; then
export BUILD_STATIC_RUNTIME_BENCHMARK=ON
fi

View File

@ -191,9 +191,22 @@ function install_torchrec_and_fbgemm() {
pip_uninstall torchrec-nightly
pip_uninstall fbgemm-gpu-nightly
pip_install setuptools-git-versioning scikit-build pyre-extensions
# TODO (huydhn): I still have no clue on why sccache doesn't work with only fbgemm_gpu here, but it
# seems to be an sccache-related issue
if [[ "$IS_A100_RUNNER" == "1" ]]; then
unset CMAKE_CUDA_COMPILER_LAUNCHER
sudo mv /opt/cache/bin /opt/cache/bin-backup
fi
# See https://github.com/pytorch/pytorch/issues/106971
CUDA_PATH=/usr/local/cuda-12.1 pip_install --no-use-pep517 --user "git+https://github.com/pytorch/FBGEMM.git@${fbgemm_commit}#egg=fbgemm-gpu&subdirectory=fbgemm_gpu"
pip_install --no-use-pep517 --user "git+https://github.com/pytorch/torchrec.git@${torchrec_commit}"
if [[ "$IS_A100_RUNNER" == "1" ]]; then
export CMAKE_CUDA_COMPILER_LAUNCHER=/opt/cache/bin/sccache
sudo mv /opt/cache/bin-backup /opt/cache/bin
fi
}
function clone_pytorch_xla() {

View File

@ -376,7 +376,7 @@ test_inductor_cpp_wrapper_abi_compatible() {
echo "Testing Inductor cpp wrapper mode with TORCHINDUCTOR_ABI_COMPATIBLE=1"
PYTORCH_TESTING_DEVICE_ONLY_FOR="" python test/run_test.py --include inductor/test_cpu_cpp_wrapper
python test/run_test.py --include inductor/test_cuda_cpp_wrapper inductor/test_cpu_repro
python test/run_test.py --include inductor/test_cuda_cpp_wrapper inductor/test_cpu_repro inductor/test_extension_backend
TORCHINDUCTOR_CPP_WRAPPER=1 python benchmarks/dynamo/timm_models.py --device cuda --accuracy --amp \
--training --inductor --disable-cudagraphs --only vit_base_patch16_224 \
@ -403,7 +403,7 @@ pr_time_benchmarks() {
PYTHONPATH=$(pwd)/benchmarks/dynamo/pr_time_benchmarks source benchmarks/dynamo/pr_time_benchmarks/benchmark_runner.sh "$TEST_REPORTS_DIR/pr_time_benchmarks_results.csv" "benchmarks/dynamo/pr_time_benchmarks/benchmarks"
echo "benchmark results on current PR: "
cat "$TEST_REPORTS_DIR/pr_time_benchmarks_results.csv"
PYTHONPATH=$(pwd)/benchmarks/dynamo/pr_time_benchmarks python benchmarks/dynamo/pr_time_benchmarks/check_results.py "benchmarks/dynamo/pr_time_benchmarks/expected_results.csv" "$TEST_REPORTS_DIR/pr_time_benchmarks_results.csv" "$TEST_REPORTS_DIR/new_expected_results.csv"
}
if [[ "${TEST_CONFIG}" == *pr_time_benchmarks* ]]; then
@ -606,6 +606,11 @@ test_inductor_halide() {
assert_git_not_dirty
}
test_inductor_triton_cpu() {
python test/run_test.py --include inductor/test_triton_cpu_backend.py --verbose
assert_git_not_dirty
}
test_dynamo_benchmark() {
# Usage: test_dynamo_benchmark huggingface 0
TEST_REPORTS_DIR=$(pwd)/test/test-reports
@ -660,15 +665,6 @@ test_inductor_torchbench_smoketest_perf() {
# The threshold value needs to be actively maintained to make this check useful
python benchmarks/dynamo/check_perf_csv.py -f "$TEST_REPORTS_DIR/inductor_training_smoketest.csv" -t 1.4
TORCHINDUCTOR_ABI_COMPATIBLE=1 python benchmarks/dynamo/torchbench.py --device cuda --performance --bfloat16 --inference \
--export-aot-inductor --only nanogpt --output "$TEST_REPORTS_DIR/inductor_inference_smoketest.csv"
# The threshold value needs to be actively maintained to make this check useful
# The perf number of nanogpt seems not very stable, e.g.
# https://github.com/pytorch/pytorch/actions/runs/7158691360/job/19491437314,
# and thus we lower its threshold to reduce flakiness. If this continues to be a problem,
# we switch to use some other model.
python benchmarks/dynamo/check_perf_csv.py -f "$TEST_REPORTS_DIR/inductor_inference_smoketest.csv" -t 4.9
# Check memory compression ratio for a few models
for test in hf_Albert timm_vision_transformer; do
python benchmarks/dynamo/torchbench.py --device cuda --performance --backend inductor --amp --training \
@ -712,6 +708,10 @@ test_inductor_set_cpu_affinity(){
export KMP_BLOCKTIME=1
fi
cores=$(test_inductor_get_core_number)
# Set number of cores to 16 on Aarch64 for performance runs.
if [[ "${TEST_CONFIG}" == *aarch64* && $cores -gt 16 ]]; then
cores=16
fi
export OMP_NUM_THREADS=$cores
end_core=$((cores-1))
export TASKSET="taskset -c 0-$end_core"
@ -1435,6 +1435,8 @@ elif [[ "${TEST_CONFIG}" == *inductor_distributed* ]]; then
test_inductor_distributed
elif [[ "${TEST_CONFIG}" == *inductor-halide* ]]; then
test_inductor_halide
elif [[ "${TEST_CONFIG}" == *inductor-triton-cpu* ]]; then
test_inductor_triton_cpu
elif [[ "${TEST_CONFIG}" == *inductor-micro-benchmark* ]]; then
test_inductor_micro_benchmark
elif [[ "${TEST_CONFIG}" == *huggingface* ]]; then
@ -1458,7 +1460,7 @@ elif [[ "${TEST_CONFIG}" == *torchbench* ]]; then
# https://github.com/opencv/opencv-python/issues/885
pip_install opencv-python==4.8.0.74
if [[ "${TEST_CONFIG}" == *inductor_torchbench_smoketest_perf* ]]; then
checkout_install_torchbench hf_Bert hf_Albert nanogpt timm_vision_transformer
checkout_install_torchbench hf_Bert hf_Albert timm_vision_transformer
PYTHONPATH=$(pwd)/torchbench test_inductor_torchbench_smoketest_perf
elif [[ "${TEST_CONFIG}" == *inductor_torchbench_cpu_smoketest_perf* ]]; then
checkout_install_torchbench timm_vision_transformer phlippe_densenet basic_gnn_edgecnn \

View File

@ -26,7 +26,7 @@ fi
export SCRIPT_HELPERS_DIR=$SCRIPT_PARENT_DIR/win-test-helpers
set +ex
grep -E -R 'PyLong_(From|As)(Unsigned|)Long\(' --exclude=python_numbers.h --exclude=eval_frame.c torch/
grep -E -R 'PyLong_(From|As)(Unsigned|)Long\(' --exclude=python_numbers.h --exclude=pythoncapi_compat.h --exclude=eval_frame.c torch/
PYLONG_API_CHECK=$?
if [[ $PYLONG_API_CHECK == 0 ]]; then
echo "Usage of PyLong_{From,As}{Unsigned}Long API may lead to overflow errors on Windows"

View File

@ -27,12 +27,11 @@ if [[ "$PACKAGE_TYPE" == conda ]]; then
source activate testenv >/dev/null
elif [[ "$PACKAGE_TYPE" != libtorch ]]; then
python_path="/opt/python/cp\$python_nodot-cp\${python_nodot}"
# Prior to Python 3.8 paths were suffixed with an 'm'
if [[ -d "\${python_path}/bin" ]]; then
export PATH="\${python_path}/bin:\$PATH"
elif [[ -d "\${python_path}m/bin" ]]; then
export PATH="\${python_path}m/bin:\$PATH"
if [[ "\$python_nodot" = *t ]]; then
python_digits="\$(echo $DESIRED_PYTHON | tr -cd [:digit:])"
python_path="/opt/python/cp\$python_digits-cp\${python_digits}t"
fi
export PATH="\${python_path}/bin:\$PATH"
fi
EXTRA_CONDA_FLAGS=""

View File

@ -44,7 +44,9 @@ ContinuationIndentWidth: 4
Cpp11BracedListStyle: true
DerivePointerAlignment: false
DisableFormat: false
ForEachMacros: [ FOR_EACH_RANGE, FOR_EACH, ]
ForEachMacros:
- FOR_EACH_RANGE
- FOR_EACH
IncludeCategories:
- Regex: '^<.*\.h(pp)?>'
Priority: 1
@ -58,6 +60,24 @@ IndentWrappedFunctionNames: false
KeepEmptyLinesAtTheStartOfBlocks: false
MacroBlockBegin: ''
MacroBlockEnd: ''
Macros:
- >-
PyObject_HEAD_INIT(type)={
/* this is not exactly match with PyObject_HEAD_INIT in Python source code
* but it is enough for clang-format */
{ 0xFFFFFFFF },
(type)
},
- >-
PyVarObject_HEAD_INIT(type, size)={
{
/* manually expand PyObject_HEAD_INIT(type) above
* because clang-format do not support recursive expansion */
{ 0xFFFFFFFF },
(type)
},
(size)
},
MaxEmptyLinesToKeep: 1
NamespaceIndentation: None
PenaltyBreakBeforeFirstCallParameter: 1
@ -79,7 +99,11 @@ SpacesInContainerLiterals: true
SpacesInCStyleCastParentheses: false
SpacesInParentheses: false
SpacesInSquareBrackets: false
Standard: Cpp11
Standard: c++17
StatementMacros:
- PyObject_HEAD
- PyObject_VAR_HEAD
- PyException_HEAD
TabWidth: 8
UseTab: Never
---

View File

@ -1,38 +0,0 @@
If you have a question or would like help and support, please ask at our
[forums](https://discuss.pytorch.org/).
If you are submitting a feature request, please preface the title with [feature request].
If you are submitting a bug report, please fill in the following details.
## Issue description
Provide a short description.
## Code example
Please try to provide a minimal example to repro the bug.
Error messages and stack traces are also helpful.
## System Info
Please copy and paste the output from our
[environment collection script](https://raw.githubusercontent.com/pytorch/pytorch/main/torch/utils/collect_env.py)
(or fill out the checklist below manually).
You can get the script and run it with:
```
wget https://raw.githubusercontent.com/pytorch/pytorch/main/torch/utils/collect_env.py
# For security purposes, please check the contents of collect_env.py before running it.
python collect_env.py
```
- PyTorch or Caffe2:
- How you installed PyTorch (conda, pip, source):
- Build command you used (if compiling from source):
- OS:
- PyTorch version:
- Python version:
- CUDA/cuDNN version:
- GPU models and configuration:
- GCC version (if compiling from source):
- CMake version:
- Versions of any other relevant libraries:

View File

@ -5,7 +5,8 @@ about: Tracking incidents for PyTorch's CI infra.
> NOTE: Remember to label this issue with "`ci: sev`"
**MERGE BLOCKING** <!-- remove this line if you don't want this SEV to block merges -->
<!-- uncomment the below line if you don't want this SEV to block merges -->
<!-- **MERGE BLOCKING** -->
## Current Status
*Status could be: preemptive, ongoing, mitigated, closed. Also tell people if they need to take action to fix it (i.e. rebase)*.

View File

@ -18,8 +18,14 @@ inputs:
runs:
using: composite
steps:
- name: Check if in a container runner
shell: bash
id: check_container_runner
run: echo "IN_CONTAINER_RUNNER=$(if [ -f /.inarc ] || [ -f /.incontainer ]; then echo true ; else echo false; fi)" >> "$GITHUB_OUTPUT"
- name: Clean workspace
shell: bash
if: ${{ steps.check_container_runner.outputs.IN_CONTAINER_RUNNER == 'false' }}
env:
NO_SUDO: ${{ inputs.no-sudo }}
run: |

View File

@ -85,15 +85,25 @@ runs:
with:
docker-image: ${{ steps.calculate-docker-image.outputs.docker-image }}
- name: Check if in a ARC runner
- name: Check if in a container runner
shell: bash
id: check_arc_runner
run: echo "IN_ARC_RUNNER=$([ -f /.inarc ] && echo true || echo false)" >> "$GITHUB_OUTPUT"
id: check_container_runner
run: echo "IN_CONTAINER_RUNNER=$(if [ -f /.inarc ] || [ -f /.incontainer ]; then echo true ; else echo false; fi)" >> "$GITHUB_OUTPUT"
- name: Install nvidia driver, nvidia-docker runtime, set GPU_FLAG
id: install-nvidia-driver
uses: pytorch/test-infra/.github/actions/setup-nvidia@main
if: ${{ contains(inputs.build-environment, 'cuda') && !contains(matrix.config, 'nogpu') && steps.check_arc_runner.outputs.IN_ARC_RUNNER == 'false' }}
if: ${{ contains(inputs.build-environment, 'cuda') && !contains(matrix.config, 'nogpu') && steps.check_container_runner.outputs.IN_CONTAINER_RUNNER == 'false' }}
- name: Setup GPU_FLAG for docker run
id: setup-gpu-flag
run: echo "GPU_FLAG=--gpus all -e NVIDIA_DRIVER_CAPABILITIES=all" >> "${GITHUB_ENV}"
if: ${{ contains(inputs.build-environment, 'cuda') && !contains(matrix.config, 'nogpu') && steps.check_container_runner.outputs.IN_CONTAINER_RUNNER == 'true' }}
- name: Setup SCCACHE_SERVER_PORT environment for docker run when on container
id: setup-sscache-port-flag
run: echo "SCCACHE_SERVER_PORT_DOCKER_FLAG=-e SCCACHE_SERVER_PORT=$((RUNNER_UID + 4226))" >> "${GITHUB_ENV}"
if: ${{ steps.check_container_runner.outputs.IN_CONTAINER_RUNNER == 'true' }}
- name: Lock NVIDIA A100 40GB Frequency
shell: bash
@ -101,7 +111,7 @@ runs:
sudo nvidia-smi -pm 1
sudo nvidia-smi -ac 1215,1410
nvidia-smi
if: contains(matrix.runner, 'a100')
if: ${{ contains(matrix.runner, 'a100') && steps.check_container_runner.outputs.IN_CONTAINER_RUNNER == 'false' }}
- name: Start monitoring script
id: monitor-script
@ -172,6 +182,7 @@ runs:
NO_TD: ${{ steps.keep-going.outputs.ci-no-td }}
TD_DISTRIBUTED: ${{ steps.keep-going.outputs.ci-td-distributed }}
SCCACHE_BUCKET: ossci-compiler-cache-circleci-v2
SCCACHE_REGION: us-east-1
SCCACHE_S3_KEY_PREFIX: ${{ github.workflow }}
SHM_SIZE: ${{ contains(inputs.build-environment, 'cuda') && '2g' || '1g' }}
DOCKER_IMAGE: ${{ inputs.docker-image }}
@ -181,6 +192,9 @@ runs:
PYTORCH_TEST_RERUN_DISABLED_TESTS: ${{ matrix.rerun_disabled_tests && '1' || '0' }}
DASHBOARD_TAG: ${{ inputs.dashboard-tag }}
HUGGING_FACE_HUB_TOKEN: ${{ inputs.HUGGING_FACE_HUB_TOKEN }}
SCRIBE_GRAPHQL_ACCESS_TOKEN: ${{ secrets.SCRIBE_GRAPHQL_ACCESS_TOKEN }}
IS_A100_RUNNER: ${{ contains(matrix.runner, 'a100') && '1' || '0' }}
shell: bash
run: |
set -x
@ -199,6 +213,7 @@ runs:
# shellcheck disable=SC2086,SC2090
container_name=$(docker run \
${GPU_FLAG:-} \
${SCCACHE_SERVER_PORT_DOCKER_FLAG:-} \
-e BUILD_ENVIRONMENT \
-e PR_NUMBER \
-e GITHUB_ACTIONS \
@ -227,6 +242,7 @@ runs:
-e PR_LABELS \
-e MAX_JOBS="$(nproc --ignore=2)" \
-e SCCACHE_BUCKET \
-e SCCACHE_REGION \
-e SCCACHE_S3_KEY_PREFIX \
-e XLA_CUDA \
-e XLA_CLANG_CACHE_S3_BUCKET_NAME \
@ -234,7 +250,9 @@ runs:
-e PYTORCH_TEST_RERUN_DISABLED_TESTS \
-e SKIP_SCCACHE_INITIALIZATION=1 \
-e HUGGING_FACE_HUB_TOKEN \
-e SCRIBE_GRAPHQL_ACCESS_TOKEN \
-e DASHBOARD_TAG \
-e IS_A100_RUNNER \
--env-file="/tmp/github_env_${GITHUB_RUN_ID}" \
--security-opt seccomp=unconfined \
--cap-add=SYS_PTRACE \
@ -305,7 +323,7 @@ runs:
- name: Teardown Linux
uses: pytorch/test-infra/.github/actions/teardown-linux@main
if: always()
if: always() && steps.check_container_runner.outputs.IN_CONTAINER_RUNNER == 'false'
# NB: We are currently having an intermittent GPU-related issue on G5 runners with
# A10G GPU. Once this happens, trying to reset the GPU as done in setup-nvidia does

View File

@ -28,14 +28,14 @@ runs:
echo "instance-type: $(get_ec2_metadata instance-type)"
echo "system info $(uname -a)"
- name: Check if in a ARC runner
- name: Check if in a container runner
shell: bash
id: check_arc_runner
run: echo "IN_ARC_RUNNER=$([ -f /.inarc ] && echo true || echo false)" >> $GITHUB_OUTPUT
id: check_container_runner
run: echo "IN_CONTAINER_RUNNER=$(if [ -f /.inarc ] || [ -f /.incontainer ]; then echo true ; else echo false; fi)" >> "$GITHUB_OUTPUT"
- name: Start docker if docker deamon is not running
shell: bash
if: ${{ steps.check_arc_runner.outputs.IN_ARC_RUNNER == 'false' }}
if: ${{ steps.check_container_runner.outputs.IN_CONTAINER_RUNNER == 'false' }}
run: |
if systemctl is-active --quiet docker; then
echo "Docker daemon is running...";
@ -73,7 +73,7 @@ runs:
env | grep '^CI' >> "/tmp/github_env_${GITHUB_RUN_ID}"
- name: Kill any existing containers, clean up images
if: ${{ steps.check_arc_runner.outputs.IN_ARC_RUNNER == 'false' }}
if: ${{ steps.check_container_runner.outputs.IN_CONTAINER_RUNNER == 'false' }}
shell: bash
run: |
# ignore expansion of "docker ps -q" since it could be empty
@ -116,7 +116,7 @@ runs:
- name: Check that the docker daemon is running
shell: bash
continue-on-error: true
if: ${{ steps.check_arc_runner.outputs.IN_ARC_RUNNER == 'true' }}
if: ${{ steps.check_container_runner.outputs.IN_CONTAINER_RUNNER == 'true' }}
run: |
set +x

View File

@ -1 +1 @@
ba696ea3dfec4cbe693bf06a84c75dc196077f5b
3f0569939c4369bec943fc27d1c9d8dfbc828c26

View File

@ -16,6 +16,7 @@ ciflow_push_tags:
- ciflow/nightly
- ciflow/periodic
- ciflow/rocm
- ciflow/s390
- ciflow/slow
- ciflow/trunk
- ciflow/unstable

View File

@ -1,4 +1,4 @@
# iOS simulator requirements
coremltools==5.0b5
protobuf==3.20.2
optree==0.12.1
optree==0.13.0

View File

@ -27,7 +27,7 @@ pytest-cpp==2.3.0
rockset==1.0.3
z3-solver==4.12.2.0
tensorboard==2.13.0
optree==0.12.1
optree==0.13.0
# NB: test_hparams_* from test_tensorboard is failing with protobuf 5.26.0 in
# which the stringify metadata is wrong when escaping double quote
protobuf==3.20.2

View File

@ -333,7 +333,7 @@ def generate_wheels_matrix(
package_type = "manywheel"
if python_versions is None:
python_versions = FULL_PYTHON_VERSIONS + ["3.13"]
python_versions = FULL_PYTHON_VERSIONS + ["3.13", "3.13t"]
if arches is None:
# Define default compute archivectures
@ -369,7 +369,13 @@ def generate_wheels_matrix(
# TODO: Enable python 3.13 on rocm, aarch64, windows
if (
gpu_arch_type == "rocm" or (os != "linux" and os != "linux-s390x")
) and python_version == "3.13":
) and (python_version == "3.13" or python_version == "3.13t"):
continue
# TODO: Enable python 3.13t on xpu and cpu-s390x
if (
gpu_arch_type == "xpu" or gpu_arch_type == "cpu-s390x"
) and python_version == "3.13t":
continue
if use_split_build and (

View File

@ -1,5 +1,9 @@
# flake8: noqa: G004
# Note: Copies of this script in runner_determinator.py and _runner-determinator.yml
# must be kept in sync. You can do it easily by running the following command:
# python .github/scripts/update_runner_determinator.py
"""
This runner determinator is used to determine which set of runners to run a
GitHub job on. It uses the first comment of a GitHub issue (by default
@ -79,6 +83,9 @@ class Experiment(NamedTuple):
rollout_perc: float = (
0 # Percentage of workflows to experiment on when user is not opted-in.
)
all_branches: bool = (
False # If True, the experiment is also enabled on the exception branches
)
# Add more fields as needed
@ -212,7 +219,7 @@ def get_potential_pr_author(
def is_exception_branch(branch: str) -> bool:
"""
Branches that get opted out of all experiments and should always use Meta runners
Branches that get opted out of experiments by default, until they're explicitly enabled.
"""
return branch.split("/")[0] in {"main", "nightly", "release", "landchecks"}
@ -338,7 +345,10 @@ def is_user_opted_in(user: str, user_optins: UserOptins, experiment_name: str) -
def get_runner_prefix(
rollout_state: str, workflow_requestors: Iterable[str], is_canary: bool = False
rollout_state: str,
workflow_requestors: Iterable[str],
branch: str,
is_canary: bool = False,
) -> str:
settings = parse_settings(rollout_state)
user_optins = parse_users(rollout_state)
@ -348,6 +358,12 @@ def get_runner_prefix(
for experiment_name, experiment_settings in settings.experiments.items():
enabled = False
if not experiment_settings.all_branches and is_exception_branch(branch):
log.info(
f"Branch {branch} is an exception branch. Not enabling experiment {experiment_name}."
)
continue
# Is any workflow_requestor opted in to this experiment?
opted_in_users = [
requestor
@ -407,35 +423,34 @@ def get_rollout_state_from_issue(github_token: str, repo: str, issue_num: int) -
def main() -> None:
args = parse_args()
if args.github_ref_type == "branch" and is_exception_branch(args.github_branch):
log.info(
f"Exception branch: '{args.github_branch}', using Meta runners and no experiments."
runner_label_prefix = DEFAULT_LABEL_PREFIX
try:
rollout_state = get_rollout_state_from_issue(
args.github_token, args.github_issue_repo, args.github_issue
)
runner_label_prefix = DEFAULT_LABEL_PREFIX
else:
try:
rollout_state = get_rollout_state_from_issue(
args.github_token, args.github_issue_repo, args.github_issue
)
username = get_potential_pr_author(
args.github_token,
args.github_repo,
args.github_actor,
args.github_ref_type,
args.github_branch,
)
username = get_potential_pr_author(
args.github_token,
args.github_repo,
args.github_actor,
args.github_ref_type,
args.github_branch,
)
is_canary = args.github_repo == "pytorch/pytorch-canary"
is_canary = args.github_repo == "pytorch/pytorch-canary"
runner_label_prefix = get_runner_prefix(
rollout_state, (args.github_issue_owner, username), is_canary
)
runner_label_prefix = get_runner_prefix(
rollout_state,
(args.github_issue_owner, username),
args.github_branch,
is_canary,
)
except Exception as e:
log.error(
f"Failed to get issue. Defaulting to Meta runners and no experiments. Exception: {e}"
)
except Exception as e:
log.error(
f"Failed to get issue. Defaulting to Meta runners and no experiments. Exception: {e}"
)
set_github_output(GH_OUTPUT_KEY_LABEL_TYPE, runner_label_prefix)

View File

@ -4,6 +4,10 @@ from unittest.mock import Mock, patch
import runner_determinator as rd
USER_BRANCH = "somebranch"
EXCEPTION_BRANCH = "main"
class TestRunnerDeterminatorIssueParser(TestCase):
def test_parse_settings(self) -> None:
settings_text = """
@ -66,6 +70,40 @@ class TestRunnerDeterminatorIssueParser(TestCase):
"otherExp settings not parsed correctly",
)
def test_parse_all_branches_setting(self) -> None:
settings_text = """
```
experiments:
lf:
rollout_perc: 25
all_branches: true
otherExp:
all_branches: True
rollout_perc: 0
```
---
Users:
@User1,lf
@User2,lf,otherExp
"""
settings = rd.parse_settings(settings_text)
self.assertTupleEqual(
rd.Experiment(rollout_perc=25, all_branches=True),
settings.experiments["lf"],
"lf settings not parsed correctly",
)
self.assertTrue(settings.experiments["otherExp"].all_branches)
self.assertTupleEqual(
rd.Experiment(rollout_perc=0, all_branches=True),
settings.experiments["otherExp"],
"otherExp settings not parsed correctly",
)
def test_parse_users(self) -> None:
settings_text = """
experiments:
@ -119,7 +157,7 @@ class TestRunnerDeterminatorGetRunnerPrefix(TestCase):
@User2,lf,otherExp
"""
prefix = rd.get_runner_prefix(settings_text, ["User1"])
prefix = rd.get_runner_prefix(settings_text, ["User1"], USER_BRANCH)
self.assertEqual("lf.", prefix, "Runner prefix not correct for User1")
def test_opted_in_user_two_experiments(self) -> None:
@ -136,7 +174,7 @@ class TestRunnerDeterminatorGetRunnerPrefix(TestCase):
@User2,lf,otherExp
"""
prefix = rd.get_runner_prefix(settings_text, ["User2"])
prefix = rd.get_runner_prefix(settings_text, ["User2"], USER_BRANCH)
self.assertEqual("lf.otherExp.", prefix, "Runner prefix not correct for User2")
@patch("random.uniform", return_value=50)
@ -154,7 +192,7 @@ class TestRunnerDeterminatorGetRunnerPrefix(TestCase):
@User2,lf,otherExp
"""
prefix = rd.get_runner_prefix(settings_text, ["User3"])
prefix = rd.get_runner_prefix(settings_text, ["User3"], USER_BRANCH)
self.assertEqual("", prefix, "Runner prefix not correct for user")
@patch("random.uniform", return_value=10)
@ -174,7 +212,7 @@ class TestRunnerDeterminatorGetRunnerPrefix(TestCase):
"""
# User3 is opted out, but is pulled into both experiments by the 10% rollout
prefix = rd.get_runner_prefix(settings_text, ["User3"])
prefix = rd.get_runner_prefix(settings_text, ["User3"], USER_BRANCH)
self.assertEqual("lf.otherExp.", prefix, "Runner prefix not correct for user")
def test_lf_prefix_always_comes_first(self) -> None:
@ -192,7 +230,7 @@ class TestRunnerDeterminatorGetRunnerPrefix(TestCase):
"""
prefix = rd.get_runner_prefix(settings_text, ["User2"])
prefix = rd.get_runner_prefix(settings_text, ["User2"], USER_BRANCH)
self.assertEqual("lf.otherExp.", prefix, "Runner prefix not correct for user")
def test_ignores_commented_users(self) -> None:
@ -210,7 +248,7 @@ class TestRunnerDeterminatorGetRunnerPrefix(TestCase):
"""
prefix = rd.get_runner_prefix(settings_text, ["User1"])
prefix = rd.get_runner_prefix(settings_text, ["User1"], USER_BRANCH)
self.assertEqual("", prefix, "Runner prefix not correct for user")
def test_ignores_extra_experiments(self) -> None:
@ -229,9 +267,44 @@ class TestRunnerDeterminatorGetRunnerPrefix(TestCase):
"""
prefix = rd.get_runner_prefix(settings_text, ["User1"])
prefix = rd.get_runner_prefix(settings_text, ["User1"], USER_BRANCH)
self.assertEqual("lf.otherExp.", prefix, "Runner prefix not correct for user")
def test_disables_experiment_on_exception_branches_when_not_explicitly_opted_in(
self,
) -> None:
settings_text = """
experiments:
lf:
rollout_perc: 100
---
Users:
@User,lf,otherExp
"""
prefix = rd.get_runner_prefix(settings_text, ["User1"], EXCEPTION_BRANCH)
self.assertEqual("", prefix, "Runner prefix not correct for user")
def test_allows_experiment_on_exception_branches_when_explicitly_opted_in(
self,
) -> None:
settings_text = """
experiments:
lf:
rollout_perc: 100
all_branches: true
---
Users:
@User,lf,otherExp
"""
prefix = rd.get_runner_prefix(settings_text, ["User1"], EXCEPTION_BRANCH)
self.assertEqual("lf.", prefix, "Runner prefix not correct for user")
if __name__ == "__main__":
main()

View File

@ -12,7 +12,7 @@ import json
import os
import warnings
from hashlib import sha256
from typing import Any, Dict, List, Optional
from typing import Any, List, Optional
from unittest import main, mock, skip, TestCase
from urllib.error import HTTPError
@ -24,7 +24,6 @@ from trymerge import (
find_matching_merge_rule,
get_classifications,
get_drci_classifications,
get_rockset_results,
gh_get_team_members,
GitHubPR,
JobCheckState,
@ -42,7 +41,6 @@ if "GIT_REMOTE_URL" not in os.environ:
os.environ["GIT_REMOTE_URL"] = "https://github.com/pytorch/pytorch"
GQL_MOCKS = "gql_mocks.json.gz"
ROCKSET_MOCKS = "rockset_mocks.json.gz"
DRCI_MOCKS = "drci_mocks.json.gz"
@ -77,16 +75,11 @@ def mock_query(
if err.code == 401 or err.code == 403:
err_msg = f"If you are seeing this message during workflow run, please make sure to update {file_name}"
err_msg += f" locally, by deleting it and running {os.path.basename(__file__)} with"
err_msg += " GitHub Personal Access Token passed via GITHUB_TOKEN,"
err_msg += " the rockset api key passed via ROCKSET_API_KEY,"
err_msg += " GitHub Personal Access Token passed via GITHUB_TOKEN"
err_msg += " and drci api key passed via DRCI_BOT_KEY environment variables"
if (
os.getenv("GITHUB_TOKEN") is None
or os.getenv("ROCKSET_API_KEY") is None
or os.getenv("DRCI_BOT_KEY") is None
):
if os.getenv("GITHUB_TOKEN") is None or os.getenv("DRCI_BOT_KEY") is None:
err_msg = (
"Failed to update cached queries as GITHUB_TOKEN or ROCKSET_API_KEY or DRCI_BOT_KEY "
"Failed to update cached queries as GITHUB_TOKEN or DRCI_BOT_KEY "
+ "is not defined. "
+ err_msg
)
@ -110,16 +103,6 @@ def mocked_gh_graphql(query: str, **kwargs: Any) -> Any:
return mock_query(gh_graphql_wrapper, GQL_MOCKS, key_function, query, kwargs)
def mocked_rockset_results(head_sha: str, merge_base: str, num_retries: int = 3) -> Any:
return mock_query(
get_rockset_results,
ROCKSET_MOCKS,
lambda x, y: f"{x} {y}",
head_sha,
merge_base,
)
def mocked_drci_classifications(pr_num: int, project: str, num_retries: int = 3) -> Any:
return mock_query(
get_drci_classifications,
@ -273,10 +256,6 @@ def xla_merge_rules(repo: Any, org: str, project: str) -> List[MergeRule]:
]
def empty_rockset_results(head_sha: str, merge_base: str) -> List[Dict[str, Any]]:
return []
class DummyGitRepo(GitRepo):
def __init__(self) -> None:
super().__init__(get_git_repo_dir(), get_git_remote_name())
@ -288,7 +267,6 @@ class DummyGitRepo(GitRepo):
return "super awsome commit message"
@mock.patch("trymerge.get_rockset_results", side_effect=empty_rockset_results)
@mock.patch("trymerge.gh_graphql", side_effect=mocked_gh_graphql)
@mock.patch(
"trymerge.get_drci_classifications", side_effect=mocked_drci_classifications
@ -604,7 +582,6 @@ class TestTryMerge(TestCase):
mocked_gh_fetch_merge_base.assert_called_once()
@mock.patch("trymerge.get_rockset_results", side_effect=mocked_rockset_results)
@mock.patch("trymerge.gh_graphql", side_effect=mocked_gh_graphql)
@mock.patch("trymerge.gh_fetch_merge_base", return_value="")
@mock.patch(
@ -843,7 +820,7 @@ class TestBypassFailures(TestCase):
checks = pr.get_checkrun_conclusions()
# Known flaky failure takes precedence over ignore current (need to set the
# merge base here to get the results from Rockset, and that categorize the
# merge base here to get the results from Dr. CI, and that categorize the
# broken trunk failure too
checks = get_classifications(
pr.pr_num,
@ -929,7 +906,6 @@ class TestBypassFailures(TestCase):
)
@mock.patch("trymerge.get_rockset_results", side_effect=mocked_rockset_results)
@mock.patch("trymerge.gh_graphql", side_effect=mocked_gh_graphql)
@mock.patch("trymerge.gh_fetch_merge_base", return_value="")
@mock.patch("trymerge.get_drci_classifications", return_value={})
@ -1008,7 +984,6 @@ class TestBypassFailuresOnSandCastle(TestCase):
self.assertTrue(len(failed) == 2)
@mock.patch("trymerge.get_rockset_results", side_effect=mocked_rockset_results)
@mock.patch("trymerge.gh_graphql", side_effect=mocked_gh_graphql)
@mock.patch("trymerge.gh_fetch_merge_base", return_value="")
@mock.patch(

View File

@ -452,8 +452,6 @@ RE_DIFF_REV = re.compile(r"^Differential Revision:.+?(D[0-9]+)", re.MULTILINE)
CIFLOW_LABEL = re.compile(r"^ciflow/.+")
CIFLOW_TRUNK_LABEL = re.compile(r"^ciflow/trunk")
MERGE_RULE_PATH = Path(".github") / "merge_rules.yaml"
ROCKSET_MERGES_COLLECTION = "merges"
ROCKSET_MERGES_WORKSPACE = "commons"
REMOTE_MAIN_BRANCH = "origin/main"
DRCI_CHECKRUN_NAME = "Dr.CI"
INTERNAL_CHANGES_CHECKRUN_NAME = "Meta Internal-Only Changes Check"
@ -1180,7 +1178,7 @@ class GitHubPR:
merge_commit_sha = repo.rev_parse(name=self.default_branch())
if comment_id and self.pr_num:
# Finally, upload the record to Rockset. The list of pending and failed
# Finally, upload the record to s3. The list of pending and failed
# checks are at the time of the merge
save_merge_record(
comment_id=comment_id,
@ -1202,7 +1200,7 @@ class GitHubPR:
ignore_current=bool(ignore_current_checks),
)
else:
print("Missing comment ID or PR number, couldn't upload to Rockset")
print("Missing comment ID or PR number, couldn't upload to s3")
# Usually Github will see that the commit has "resolves <pr_num>" in the
# commit message and close the PR, but sometimes it doesn't, leading to
@ -1481,7 +1479,7 @@ def find_matching_merge_rule(
# Categorize all checks when skip_mandatory_checks (force merge) is set. Do it here
# where the list of checks is readily available. These records will be saved into
# Rockset merge records
# s3 merge records
(
pending_mandatory_checks,
failed_mandatory_checks,
@ -1568,7 +1566,7 @@ def save_merge_record(
This saves the merge records as a json, which can later be uploaded to s3
"""
# Prepare the record to be written into Rockset
# Prepare the record to be written into s3
data = [
{
"comment_id": comment_id,
@ -1590,7 +1588,8 @@ def save_merge_record(
"ignore_current": ignore_current,
"error": error,
# This is a unique identifier for the record for deduping purposes
# in rockset. Any unique string would work
# in Rockset. Any unique string would work. This will not be used
# after we migrate off Rockset
"_id": f"{project}-{pr_num}-{comment_id}-{os.environ.get('GITHUB_RUN_ID')}",
}
]
@ -1600,36 +1599,6 @@ def save_merge_record(
json.dump(data, f)
@retries_decorator(rc=[])
def get_rockset_results(head_sha: str, merge_base: str) -> List[Dict[str, Any]]:
query = f"""
SELECT
w.name as workflow_name,
j.id,
j.name,
j.conclusion,
j.completed_at,
j.html_url,
j.head_sha,
j.torchci_classification.captures as failure_captures,
LENGTH(j.steps) as steps,
FROM
commons.workflow_job j join commons.workflow_run w on w.id = j.run_id
where
j.head_sha in ('{head_sha}','{merge_base}')
"""
try:
import rockset # type: ignore[import]
res = rockset.RocksetClient(
host="api.usw2a1.rockset.com", api_key=os.environ["ROCKSET_API_KEY"]
).sql(query)
return cast(List[Dict[str, Any]], res.results)
except ModuleNotFoundError:
print("Could not use RockSet as rocket dependency is missing")
return []
@retries_decorator()
def get_drci_classifications(pr_num: int, project: str = "pytorch") -> Any:
"""
@ -2067,7 +2036,7 @@ def categorize_checks(
pending_checks: List[Tuple[str, Optional[str], Optional[int]]] = []
failed_checks: List[Tuple[str, Optional[str], Optional[int]]] = []
# failed_checks_categorization is used to keep track of all ignorable failures when saving the merge record on Rockset
# failed_checks_categorization is used to keep track of all ignorable failures when saving the merge record on s3
failed_checks_categorization: Dict[str, List[Any]] = defaultdict(list)
# If required_checks is not set or empty, consider all names are relevant
@ -2126,7 +2095,7 @@ def categorize_checks(
):
failed_checks = failed_checks + flaky_or_broken_trunk
# The list of failed_checks_categorization is returned so that it can be saved into the Rockset merge record
# The list of failed_checks_categorization is returned so that it can be saved into the s3 merge record
return (pending_checks, failed_checks, failed_checks_categorization)
@ -2410,7 +2379,7 @@ def main() -> None:
handle_exception(e)
if args.comment_id and args.pr_num:
# Finally, upload the record to Rockset, we don't have access to the
# Finally, upload the record to s3, we don't have access to the
# list of pending and failed checks here, but they are not really
# needed at the moment
save_merge_record(
@ -2433,7 +2402,7 @@ def main() -> None:
error=str(e),
)
else:
print("Missing comment ID or PR number, couldn't upload to Rockset")
print("Missing comment ID or PR number, couldn't upload to s3")
finally:
if not args.check_mergeability:
gh_remove_label(

.github/scripts/update_runner_determinator.py (vendored executable file, 31 lines added)
View File

@ -0,0 +1,31 @@
#!/usr/bin/env python3
import re
# Read the contents of runner_determinator.py
with open(".github/scripts/runner_determinator.py") as script_file:
script_content = script_file.read()
# Indent the script content by 10 spaces to match destination indentation
indented_script_content = "\n".join(
[" " * 10 + line if line else line for line in script_content.splitlines()]
)
# Read the contents of _runner-determinator.yml
with open(".github/workflows/_runner-determinator.yml") as yml_file:
yml_content = yml_file.read()
# Replace the content between the markers
new_yml_content = re.sub(
r"(cat <<EOF > runner_determinator.py\n)(.*?)(\n\s+EOF)",
lambda match: match.group(1) + indented_script_content + match.group(3),
yml_content,
flags=re.DOTALL,
)
# Save the modified content back to _runner-determinator.yml
with open(".github/workflows/_runner-determinator.yml", "w") as yml_file:
yml_file.write(new_yml_content)
print("Updated _runner-determinator.yml with the contents of runner_determinator.py")

View File

@ -68,6 +68,7 @@ jobs:
needs: get-label-type
with:!{{ upload.binary_env_as_input(config) }}
{%- if "aarch64" in build_environment %}
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
runs_on: linux.arm64.m7g.4xlarge.ephemeral
ALPINE_IMAGE: "arm64v8/alpine"
{%- elif "s390x" in build_environment %}
@ -102,6 +103,7 @@ jobs:
build_name: !{{ config["build_name"] }}
build_environment: !{{ build_environment }}
{%- if "aarch64" in build_environment %}
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
runs_on: linux.arm64.2xlarge
ALPINE_IMAGE: "arm64v8/alpine"
{%- elif "s390x" in build_environment %}

View File

@ -91,14 +91,14 @@ jobs:
with:
docker-image: ${{ steps.calculate-docker-image.outputs.docker-image }}
- name: Check if in a ARC runner
- name: Check if in a container runner
shell: bash
id: check_arc_runner
run: echo "IN_ARC_RUNNER=$([ -f /.inarc ] && echo true || echo false)" >> "$GITHUB_OUTPUT"
id: check_container_runner
run: echo "IN_CONTAINER_RUNNER=$(if [ -f /.inarc ] || [ -f /.incontainer ]; then echo true ; else echo false; fi)" >> "$GITHUB_OUTPUT"
- name: Install nvidia driver, nvidia-docker runtime, set GPU_FLAG
uses: pytorch/test-infra/.github/actions/setup-nvidia@main
if: ${{ inputs.cuda-version != 'cpu' && steps.check_arc_runner.outputs.IN_ARC_RUNNER == 'false' }}
if: ${{ inputs.cuda-version != 'cpu' && steps.check_container_runner.outputs.IN_CONTAINER_RUNNER == 'false' }}
- name: Output disk space left
run: |

View File

@ -114,22 +114,32 @@ jobs:
with:
docker-image: ${{ steps.calculate-docker-image.outputs.docker-image }}
- name: Check if in a ARC runner
- name: Check if in a container runner
shell: bash
id: check_arc_runner
run: echo "IN_ARC_RUNNER=$([ -f /.inarc ] && echo true || echo false)" >> "$GITHUB_OUTPUT"
id: check_container_runner
run: echo "IN_CONTAINER_RUNNER=$(if [ -f /.inarc ] || [ -f /.incontainer ]; then echo true ; else echo false; fi)" >> "$GITHUB_OUTPUT"
- name: Install nvidia driver, nvidia-docker runtime, set GPU_FLAG
id: install-nvidia-driver
uses: pytorch/test-infra/.github/actions/setup-nvidia@main
if: ${{ contains(inputs.build-environment, 'cuda') && !contains(matrix.config, 'nogpu') && steps.check_arc_runner.outputs.IN_ARC_RUNNER == 'false' }}
if: ${{ contains(inputs.build-environment, 'cuda') && !contains(matrix.config, 'nogpu') && steps.check_container_runner.outputs.IN_CONTAINER_RUNNER == 'false' }}
- name: Setup GPU_FLAG for docker run
id: setup-gpu-flag
run: echo "GPU_FLAG=--gpus all -e NVIDIA_DRIVER_CAPABILITIES=all" >> "${GITHUB_ENV}"
if: ${{ contains(inputs.build-environment, 'cuda') && !contains(matrix.config, 'nogpu') && steps.check_container_runner.outputs.IN_CONTAINER_RUNNER == 'true' }}
- name: Setup SCCACHE_SERVER_PORT environment for docker run when on container
id: setup-sscache-port-flag
run: echo "SCCACHE_SERVER_PORT_DOCKER_FLAG=-e SCCACHE_SERVER_PORT=$((RUNNER_UID + 4226))" >> "${GITHUB_ENV}"
if: ${{ steps.check_container_runner.outputs.IN_CONTAINER_RUNNER == 'true' }}
- name: Lock NVIDIA A100 40GB Frequency
run: |
sudo nvidia-smi -pm 1
sudo nvidia-smi -ac 1215,1410
nvidia-smi
if: contains(matrix.runner, 'a100')
if: ${{ contains(matrix.runner, 'a100') && steps.check_container_runner.outputs.IN_CONTAINER_RUNNER == 'false' }}
- name: Start monitoring script
id: monitor-script
@ -208,6 +218,7 @@ jobs:
NO_TD: ${{ steps.keep-going.outputs.ci-no-td }}
TD_DISTRIBUTED: ${{ steps.keep-going.outputs.ci-td-distributed }}
SCCACHE_BUCKET: ossci-compiler-cache-circleci-v2
SCCACHE_REGION: us-east-1
SCCACHE_S3_KEY_PREFIX: ${{ github.workflow }}
SHM_SIZE: ${{ contains(inputs.build-environment, 'cuda') && '2g' || '1g' }}
DOCKER_IMAGE: ${{ inputs.docker-image }}
@ -218,6 +229,7 @@ jobs:
DASHBOARD_TAG: ${{ inputs.dashboard-tag }}
HUGGING_FACE_HUB_TOKEN: ${{ secrets.HUGGING_FACE_HUB_TOKEN }}
SCRIBE_GRAPHQL_ACCESS_TOKEN: ${{ secrets.SCRIBE_GRAPHQL_ACCESS_TOKEN }}
IS_A100_RUNNER: ${{ contains(matrix.runner, 'a100') && '1' || '0' }}
run: |
set -x
@ -236,6 +248,7 @@ jobs:
# shellcheck disable=SC2086,SC2090
container_name=$(docker run \
${GPU_FLAG:-} \
${SCCACHE_SERVER_PORT_DOCKER_FLAG:-} \
-e BUILD_ENVIRONMENT \
-e PR_NUMBER \
-e GITHUB_ACTIONS \
@ -265,6 +278,7 @@ jobs:
-e PR_LABELS \
-e MAX_JOBS="$(nproc --ignore=2)" \
-e SCCACHE_BUCKET \
-e SCCACHE_REGION \
-e SCCACHE_S3_KEY_PREFIX \
-e XLA_CUDA \
-e XLA_CLANG_CACHE_S3_BUCKET_NAME \
@ -274,6 +288,7 @@ jobs:
-e HUGGING_FACE_HUB_TOKEN \
-e SCRIBE_GRAPHQL_ACCESS_TOKEN \
-e DASHBOARD_TAG \
-e IS_A100_RUNNER \
--env-file="/tmp/github_env_${GITHUB_RUN_ID}" \
--security-opt seccomp=unconfined \
--cap-add=SYS_PTRACE \
@ -343,7 +358,7 @@ jobs:
- name: Teardown Linux
uses: pytorch/test-infra/.github/actions/teardown-linux@main
if: always()
if: always() && steps.check_container_runner.outputs.IN_CONTAINER_RUNNER == 'false'
# NB: We are currently having an intermittent GPU-related issue on G5 runners with
# A10G GPU. Once this happens, trying to reset the GPU as done in setup-nvidia does

View File

@ -59,6 +59,10 @@ jobs:
cat <<EOF > runner_determinator.py
# flake8: noqa: G004
# Note: Copies of this script in runner_determinator.py and _runner-determinator.yml
# must be kept in sync. You can do it easily by running the following command:
# python .github/scripts/update_runner_determinator.py
"""
This runner determinator is used to determine which set of runners to run a
GitHub job on. It uses the first comment of a GitHub issue (by default
@ -138,6 +142,9 @@ jobs:
rollout_perc: float = (
0 # Percentage of workflows to experiment on when user is not opted-in.
)
all_branches: bool = (
False # If True, the experiment is also enabled on the exception branches
)
# Add more fields as needed
@ -271,7 +278,7 @@ jobs:
def is_exception_branch(branch: str) -> bool:
"""
Branches that get opted out of all experiments and should always use Meta runners
Branches that get opted out of experiments by default, until they're explicitly enabled.
"""
return branch.split("/")[0] in {"main", "nightly", "release", "landchecks"}
@ -397,7 +404,10 @@ jobs:
def get_runner_prefix(
rollout_state: str, workflow_requestors: Iterable[str], is_canary: bool = False
rollout_state: str,
workflow_requestors: Iterable[str],
branch: str,
is_canary: bool = False,
) -> str:
settings = parse_settings(rollout_state)
user_optins = parse_users(rollout_state)
@ -407,6 +417,12 @@ jobs:
for experiment_name, experiment_settings in settings.experiments.items():
enabled = False
if not experiment_settings.all_branches and is_exception_branch(branch):
log.info(
f"Branch {branch} is an exception branch. Not enabling experiment {experiment_name}."
)
continue
# Is any workflow_requestor opted in to this experiment?
opted_in_users = [
requestor
@ -466,35 +482,34 @@ jobs:
def main() -> None:
args = parse_args()
if args.github_ref_type == "branch" and is_exception_branch(args.github_branch):
log.info(
f"Exception branch: '{args.github_branch}', using Meta runners and no experiments."
runner_label_prefix = DEFAULT_LABEL_PREFIX
try:
rollout_state = get_rollout_state_from_issue(
args.github_token, args.github_issue_repo, args.github_issue
)
runner_label_prefix = DEFAULT_LABEL_PREFIX
else:
try:
rollout_state = get_rollout_state_from_issue(
args.github_token, args.github_issue_repo, args.github_issue
)
username = get_potential_pr_author(
args.github_token,
args.github_repo,
args.github_actor,
args.github_ref_type,
args.github_branch,
)
username = get_potential_pr_author(
args.github_token,
args.github_repo,
args.github_actor,
args.github_ref_type,
args.github_branch,
)
is_canary = args.github_repo == "pytorch/pytorch-canary"
is_canary = args.github_repo == "pytorch/pytorch-canary"
runner_label_prefix = get_runner_prefix(
rollout_state, (args.github_issue_owner, username), is_canary
)
runner_label_prefix = get_runner_prefix(
rollout_state,
(args.github_issue_owner, username),
args.github_branch,
is_canary,
)
except Exception as e:
log.error(
f"Failed to get issue. Defaulting to Meta runners and no experiments. Exception: {e}"
)
except Exception as e:
log.error(
f"Failed to get issue. Defaulting to Meta runners and no experiments. Exception: {e}"
)
set_github_output(GH_OUTPUT_KEY_LABEL_TYPE, runner_label_prefix)

View File

@ -189,7 +189,7 @@ jobs:
run: |
pushd "${PYTORCH_FINAL_PACKAGE_DIR}"
# shellcheck disable=SC2046,SC2102
python3 -mpip install $(echo *.whl)[opt-einsum,optree] optree==0.12.1
python3 -mpip install $(echo *.whl)[opt-einsum,optree] optree==0.13.0
popd
.ci/pytorch/win-test.sh

View File

@ -43,7 +43,7 @@ jobs:
strategy:
fail-fast: false
matrix:
py_vers: [ "3.8", "3.9", "3.10", "3.11", "3.12" ]
py_vers: [ "3.9", "3.10", "3.11", "3.12" ]
device: ["cuda", "rocm", "xpu"]
include:
- device: "rocm"
@ -91,9 +91,6 @@ jobs:
# Determine python executable for given version
case $PY_VERS in
3.8)
PYTHON_EXECUTABLE=/opt/python/cp38-cp38/bin/python
;;
3.9)
PYTHON_EXECUTABLE=/opt/python/cp39-cp39/bin/python
;;
@ -214,7 +211,7 @@ jobs:
strategy:
fail-fast: false
matrix:
py_vers: [ "3.8", "3.9", "3.10", "3.11", "3.12" ]
py_vers: [ "3.9", "3.10", "3.11", "3.12" ]
timeout-minutes: 40
env:
DOCKER_IMAGE: pytorch/conda-builder:cpu

View File

@ -67,6 +67,7 @@ jobs:
pytorch-linux-jammy-py3.12-halide,
pytorch-linux-jammy-xpu-2024.0-py3,
pytorch-linux-jammy-py3-clang15-asan,
pytorch-linux-jammy-py3-clang18-asan,
pytorch-linux-focal-py3-clang10-onnx,
pytorch-linux-focal-linter,
pytorch-linux-jammy-cuda11.8-cudnn9-py3.9-linter,
@ -78,7 +79,9 @@ jobs:
- docker-image-name: pytorch-linux-jammy-aarch64-py3.10-gcc11-inductor-benchmarks
runner: linux.arm64.m7g.4xlarge
timeout-minutes: 600
runs-on: "${{ needs.get-label-type.outputs.label-type }}${{ matrix.runner }}"
# Docker uploads fail from LF runners, see https://github.com/pytorch/pytorch/pull/137358
# runs-on: "${{ needs.get-label-type.outputs.label-type }}${{ matrix.runner }}"
runs-on: "${{ matrix.runner }}"
env:
DOCKER_IMAGE_BASE: 308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/${{ matrix.docker-image-name }}
steps:

View File

@ -60,6 +60,7 @@ jobs:
DOCKER_IMAGE: pytorch/manylinuxaarch64-builder:cpu-aarch64-main
use_split_build: False
DESIRED_PYTHON: "3.9"
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
runs_on: linux.arm64.m7g.4xlarge.ephemeral
ALPINE_IMAGE: "arm64v8/alpine"
build_name: manywheel-py3_9-cpu-aarch64
@ -86,6 +87,7 @@ jobs:
DESIRED_PYTHON: "3.9"
build_name: manywheel-py3_9-cpu-aarch64
build_environment: linux-aarch64-binary-manywheel
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
runs_on: linux.arm64.2xlarge
ALPINE_IMAGE: "arm64v8/alpine"
secrets:
@ -130,6 +132,7 @@ jobs:
DESIRED_DEVTOOLSET: cxx11-abi
use_split_build: False
DESIRED_PYTHON: "3.9"
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
runs_on: linux.arm64.m7g.4xlarge.ephemeral
ALPINE_IMAGE: "arm64v8/alpine"
build_name: manywheel-py3_9-cuda-aarch64
@ -177,6 +180,7 @@ jobs:
DOCKER_IMAGE: pytorch/manylinuxaarch64-builder:cpu-aarch64-main
use_split_build: False
DESIRED_PYTHON: "3.10"
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
runs_on: linux.arm64.m7g.4xlarge.ephemeral
ALPINE_IMAGE: "arm64v8/alpine"
build_name: manywheel-py3_10-cpu-aarch64
@ -203,6 +207,7 @@ jobs:
DESIRED_PYTHON: "3.10"
build_name: manywheel-py3_10-cpu-aarch64
build_environment: linux-aarch64-binary-manywheel
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
runs_on: linux.arm64.2xlarge
ALPINE_IMAGE: "arm64v8/alpine"
secrets:
@ -247,6 +252,7 @@ jobs:
DESIRED_DEVTOOLSET: cxx11-abi
use_split_build: False
DESIRED_PYTHON: "3.10"
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
runs_on: linux.arm64.m7g.4xlarge.ephemeral
ALPINE_IMAGE: "arm64v8/alpine"
build_name: manywheel-py3_10-cuda-aarch64
@ -294,6 +300,7 @@ jobs:
DOCKER_IMAGE: pytorch/manylinuxaarch64-builder:cpu-aarch64-main
use_split_build: False
DESIRED_PYTHON: "3.11"
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
runs_on: linux.arm64.m7g.4xlarge.ephemeral
ALPINE_IMAGE: "arm64v8/alpine"
build_name: manywheel-py3_11-cpu-aarch64
@ -320,6 +327,7 @@ jobs:
DESIRED_PYTHON: "3.11"
build_name: manywheel-py3_11-cpu-aarch64
build_environment: linux-aarch64-binary-manywheel
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
runs_on: linux.arm64.2xlarge
ALPINE_IMAGE: "arm64v8/alpine"
secrets:
@ -364,6 +372,7 @@ jobs:
DESIRED_DEVTOOLSET: cxx11-abi
use_split_build: False
DESIRED_PYTHON: "3.11"
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
runs_on: linux.arm64.m7g.4xlarge.ephemeral
ALPINE_IMAGE: "arm64v8/alpine"
build_name: manywheel-py3_11-cuda-aarch64
@ -411,6 +420,7 @@ jobs:
DOCKER_IMAGE: pytorch/manylinuxaarch64-builder:cpu-aarch64-main
use_split_build: False
DESIRED_PYTHON: "3.12"
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
runs_on: linux.arm64.m7g.4xlarge.ephemeral
ALPINE_IMAGE: "arm64v8/alpine"
build_name: manywheel-py3_12-cpu-aarch64
@ -437,6 +447,7 @@ jobs:
DESIRED_PYTHON: "3.12"
build_name: manywheel-py3_12-cpu-aarch64
build_environment: linux-aarch64-binary-manywheel
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
runs_on: linux.arm64.2xlarge
ALPINE_IMAGE: "arm64v8/alpine"
secrets:
@ -481,6 +492,7 @@ jobs:
DESIRED_DEVTOOLSET: cxx11-abi
use_split_build: False
DESIRED_PYTHON: "3.12"
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
runs_on: linux.arm64.m7g.4xlarge.ephemeral
ALPINE_IMAGE: "arm64v8/alpine"
build_name: manywheel-py3_12-cuda-aarch64

View File

@ -3324,3 +3324,353 @@ jobs:
conda-pytorchbot-token: ${{ secrets.CONDA_PYTORCHBOT_TOKEN }}
conda-pytorchbot-token-test: ${{ secrets.CONDA_PYTORCHBOT_TOKEN_TEST }}
uses: ./.github/workflows/_binary-upload.yml
manywheel-py3_13t-cpu-build:
if: ${{ github.repository_owner == 'pytorch' }}
uses: ./.github/workflows/_binary-build-linux.yml
needs: get-label-type
with:
PYTORCH_ROOT: /pytorch
BUILDER_ROOT: /builder
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cpu
GPU_ARCH_TYPE: cpu
DOCKER_IMAGE: pytorch/manylinux-builder:cpu-main
use_split_build: False
DESIRED_PYTHON: "3.13t"
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
build_name: manywheel-py3_13t-cpu
build_environment: linux-binary-manywheel
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_13t-cpu-test: # Testing
if: ${{ github.repository_owner == 'pytorch' }}
needs:
- manywheel-py3_13t-cpu-build
- get-label-type
uses: ./.github/workflows/_binary-test-linux.yml
with:
PYTORCH_ROOT: /pytorch
BUILDER_ROOT: /builder
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cpu
GPU_ARCH_TYPE: cpu
DOCKER_IMAGE: pytorch/manylinux-builder:cpu-main
use_split_build: False
DESIRED_PYTHON: "3.13t"
build_name: manywheel-py3_13t-cpu
build_environment: linux-binary-manywheel
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
runs_on: linux.4xlarge
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_13t-cpu-upload: # Uploading
if: ${{ github.repository_owner == 'pytorch' }}
permissions:
id-token: write
contents: read
needs: manywheel-py3_13t-cpu-test
with:
PYTORCH_ROOT: /pytorch
BUILDER_ROOT: /builder
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cpu
GPU_ARCH_TYPE: cpu
DOCKER_IMAGE: pytorch/manylinux-builder:cpu-main
use_split_build: False
DESIRED_PYTHON: "3.13t"
build_name: manywheel-py3_13t-cpu
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
conda-pytorchbot-token: ${{ secrets.CONDA_PYTORCHBOT_TOKEN }}
conda-pytorchbot-token-test: ${{ secrets.CONDA_PYTORCHBOT_TOKEN_TEST }}
uses: ./.github/workflows/_binary-upload.yml
manywheel-py3_13t-cpu-cxx11-abi-build:
if: ${{ github.repository_owner == 'pytorch' }}
uses: ./.github/workflows/_binary-build-linux.yml
needs: get-label-type
with:
PYTORCH_ROOT: /pytorch
BUILDER_ROOT: /builder
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cpu-cxx11-abi
GPU_ARCH_TYPE: cpu-cxx11-abi
DOCKER_IMAGE: pytorch/manylinuxcxx11-abi-builder:cpu-cxx11-abi-main
DESIRED_DEVTOOLSET: cxx11-abi
use_split_build: False
DESIRED_PYTHON: "3.13t"
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
build_name: manywheel-py3_13t-cpu-cxx11-abi
build_environment: linux-binary-manywheel
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_13t-cpu-cxx11-abi-test: # Testing
if: ${{ github.repository_owner == 'pytorch' }}
needs:
- manywheel-py3_13t-cpu-cxx11-abi-build
- get-label-type
uses: ./.github/workflows/_binary-test-linux.yml
with:
PYTORCH_ROOT: /pytorch
BUILDER_ROOT: /builder
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cpu-cxx11-abi
GPU_ARCH_TYPE: cpu-cxx11-abi
DOCKER_IMAGE: pytorch/manylinuxcxx11-abi-builder:cpu-cxx11-abi-main
DESIRED_DEVTOOLSET: cxx11-abi
use_split_build: False
DESIRED_PYTHON: "3.13t"
build_name: manywheel-py3_13t-cpu-cxx11-abi
build_environment: linux-binary-manywheel
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
runs_on: linux.4xlarge
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_13t-cpu-cxx11-abi-upload: # Uploading
if: ${{ github.repository_owner == 'pytorch' }}
permissions:
id-token: write
contents: read
needs: manywheel-py3_13t-cpu-cxx11-abi-test
with:
PYTORCH_ROOT: /pytorch
BUILDER_ROOT: /builder
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cpu-cxx11-abi
GPU_ARCH_TYPE: cpu-cxx11-abi
DOCKER_IMAGE: pytorch/manylinuxcxx11-abi-builder:cpu-cxx11-abi-main
DESIRED_DEVTOOLSET: cxx11-abi
use_split_build: False
DESIRED_PYTHON: "3.13t"
build_name: manywheel-py3_13t-cpu-cxx11-abi
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
conda-pytorchbot-token: ${{ secrets.CONDA_PYTORCHBOT_TOKEN }}
conda-pytorchbot-token-test: ${{ secrets.CONDA_PYTORCHBOT_TOKEN_TEST }}
uses: ./.github/workflows/_binary-upload.yml
manywheel-py3_13t-cuda11_8-build:
if: ${{ github.repository_owner == 'pytorch' }}
uses: ./.github/workflows/_binary-build-linux.yml
needs: get-label-type
with:
PYTORCH_ROOT: /pytorch
BUILDER_ROOT: /builder
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu118
GPU_ARCH_VERSION: 11.8
GPU_ARCH_TYPE: cuda
DOCKER_IMAGE: pytorch/manylinux-builder:cuda11.8-main
use_split_build: False
DESIRED_PYTHON: "3.13t"
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
build_name: manywheel-py3_13t-cuda11_8
build_environment: linux-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu11==11.8.89; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu11==11.8.89; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu11==11.8.87; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu11==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu11==11.11.3.6; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu11==10.9.0.58; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu11==10.3.0.86; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu11==11.4.1.48; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu11==11.7.5.86; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu11==2.21.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu11==11.8.86; platform_system == 'Linux' and platform_machine == 'x86_64'
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_13t-cuda11_8-test: # Testing
if: ${{ github.repository_owner == 'pytorch' }}
needs:
- manywheel-py3_13t-cuda11_8-build
- get-label-type
uses: ./.github/workflows/_binary-test-linux.yml
with:
PYTORCH_ROOT: /pytorch
BUILDER_ROOT: /builder
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu118
GPU_ARCH_VERSION: 11.8
GPU_ARCH_TYPE: cuda
DOCKER_IMAGE: pytorch/manylinux-builder:cuda11.8-main
use_split_build: False
DESIRED_PYTHON: "3.13t"
build_name: manywheel-py3_13t-cuda11_8
build_environment: linux-binary-manywheel
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
runs_on: linux.4xlarge.nvidia.gpu
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_13t-cuda11_8-upload: # Uploading
if: ${{ github.repository_owner == 'pytorch' }}
permissions:
id-token: write
contents: read
needs: manywheel-py3_13t-cuda11_8-test
with:
PYTORCH_ROOT: /pytorch
BUILDER_ROOT: /builder
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu118
GPU_ARCH_VERSION: 11.8
GPU_ARCH_TYPE: cuda
DOCKER_IMAGE: pytorch/manylinux-builder:cuda11.8-main
use_split_build: False
DESIRED_PYTHON: "3.13t"
build_name: manywheel-py3_13t-cuda11_8
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
conda-pytorchbot-token: ${{ secrets.CONDA_PYTORCHBOT_TOKEN }}
conda-pytorchbot-token-test: ${{ secrets.CONDA_PYTORCHBOT_TOKEN_TEST }}
uses: ./.github/workflows/_binary-upload.yml
manywheel-py3_13t-cuda12_1-build:
if: ${{ github.repository_owner == 'pytorch' }}
uses: ./.github/workflows/_binary-build-linux.yml
needs: get-label-type
with:
PYTORCH_ROOT: /pytorch
BUILDER_ROOT: /builder
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu121
GPU_ARCH_VERSION: 12.1
GPU_ARCH_TYPE: cuda
DOCKER_IMAGE: pytorch/manylinux-builder:cuda12.1-main
use_split_build: False
DESIRED_PYTHON: "3.13t"
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
build_name: manywheel-py3_13t-cuda12_1
build_environment: linux-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.1.3.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.0.2.54; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.2.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.4.5.107; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.1.0.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.21.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_13t-cuda12_1-test: # Testing
if: ${{ github.repository_owner == 'pytorch' }}
needs:
- manywheel-py3_13t-cuda12_1-build
- get-label-type
uses: ./.github/workflows/_binary-test-linux.yml
with:
PYTORCH_ROOT: /pytorch
BUILDER_ROOT: /builder
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu121
GPU_ARCH_VERSION: 12.1
GPU_ARCH_TYPE: cuda
DOCKER_IMAGE: pytorch/manylinux-builder:cuda12.1-main
use_split_build: False
DESIRED_PYTHON: "3.13t"
build_name: manywheel-py3_13t-cuda12_1
build_environment: linux-binary-manywheel
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
runs_on: linux.4xlarge.nvidia.gpu
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_13t-cuda12_1-upload: # Uploading
if: ${{ github.repository_owner == 'pytorch' }}
permissions:
id-token: write
contents: read
needs: manywheel-py3_13t-cuda12_1-test
with:
PYTORCH_ROOT: /pytorch
BUILDER_ROOT: /builder
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu121
GPU_ARCH_VERSION: 12.1
GPU_ARCH_TYPE: cuda
DOCKER_IMAGE: pytorch/manylinux-builder:cuda12.1-main
use_split_build: False
DESIRED_PYTHON: "3.13t"
build_name: manywheel-py3_13t-cuda12_1
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
conda-pytorchbot-token: ${{ secrets.CONDA_PYTORCHBOT_TOKEN }}
conda-pytorchbot-token-test: ${{ secrets.CONDA_PYTORCHBOT_TOKEN_TEST }}
uses: ./.github/workflows/_binary-upload.yml
manywheel-py3_13t-cuda12_4-build:
if: ${{ github.repository_owner == 'pytorch' }}
uses: ./.github/workflows/_binary-build-linux.yml
needs: get-label-type
with:
PYTORCH_ROOT: /pytorch
BUILDER_ROOT: /builder
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu124
GPU_ARCH_VERSION: 12.4
GPU_ARCH_TYPE: cuda
DOCKER_IMAGE: pytorch/manylinux-builder:cuda12.4-main
use_split_build: False
DESIRED_PYTHON: "3.13t"
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
build_name: manywheel-py3_13t-cuda12_4
build_environment: linux-binary-manywheel
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.4.5.8; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.2.1.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.5.147; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.6.1.9; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.3.1.170; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.21.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64'
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_13t-cuda12_4-test: # Testing
if: ${{ github.repository_owner == 'pytorch' }}
needs:
- manywheel-py3_13t-cuda12_4-build
- get-label-type
uses: ./.github/workflows/_binary-test-linux.yml
with:
PYTORCH_ROOT: /pytorch
BUILDER_ROOT: /builder
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu124
GPU_ARCH_VERSION: 12.4
GPU_ARCH_TYPE: cuda
DOCKER_IMAGE: pytorch/manylinux-builder:cuda12.4-main
use_split_build: False
DESIRED_PYTHON: "3.13t"
build_name: manywheel-py3_13t-cuda12_4
build_environment: linux-binary-manywheel
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
runs_on: linux.4xlarge.nvidia.gpu
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_13t-cuda12_4-upload: # Uploading
if: ${{ github.repository_owner == 'pytorch' }}
permissions:
id-token: write
contents: read
needs: manywheel-py3_13t-cuda12_4-test
with:
PYTORCH_ROOT: /pytorch
BUILDER_ROOT: /builder
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu124
GPU_ARCH_VERSION: 12.4
GPU_ARCH_TYPE: cuda
DOCKER_IMAGE: pytorch/manylinux-builder:cuda12.4-main
use_split_build: False
DESIRED_PYTHON: "3.13t"
build_name: manywheel-py3_13t-cuda12_4
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
conda-pytorchbot-token: ${{ secrets.CONDA_PYTORCHBOT_TOKEN }}
conda-pytorchbot-token-test: ${{ secrets.CONDA_PYTORCHBOT_TOKEN_TEST }}
uses: ./.github/workflows/_binary-upload.yml
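Each PYTORCH_EXTRA_INSTALL_REQUIREMENTS value above is a "|"-separated list of PEP 508 requirements, each guarded by an environment marker. A hedged sketch, assuming the third-party packaging library, of how such a string can be split and its markers evaluated against the current platform:

```python
# Split a '|'-separated requirements string and evaluate each PEP 508 marker.
from packaging.requirements import Requirement

extra = (
    "nvidia-nccl-cu12==2.21.5; platform_system == 'Linux' and platform_machine == 'x86_64' | "
    "nvidia-nvtx-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64'"
)

for raw in extra.split(" | "):
    req = Requirement(raw.strip())
    # A requirement with no marker always applies; otherwise evaluate against this machine.
    applies = req.marker is None or req.marker.evaluate()
    print(f"{req.name}{req.specifier}: install={applies}")
```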

View File

@ -1514,3 +1514,283 @@ jobs:
conda-pytorchbot-token: ${{ secrets.CONDA_PYTORCHBOT_TOKEN }}
conda-pytorchbot-token-test: ${{ secrets.CONDA_PYTORCHBOT_TOKEN_TEST }}
uses: ./.github/workflows/_binary-upload.yml
manywheel-py3_13t-cuda11_8-build:
if: ${{ github.repository_owner == 'pytorch' }}
uses: ./.github/workflows/_binary-build-linux.yml
needs: get-label-type
with:
PYTORCH_ROOT: /pytorch
BUILDER_ROOT: /builder
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu118
GPU_ARCH_VERSION: 11.8
GPU_ARCH_TYPE: cuda
DOCKER_IMAGE: pytorch/manylinux-builder:cuda11.8-main
use_split_build: True
DESIRED_PYTHON: "3.13t"
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
build_name: manywheel-py3_13t-cuda11_8
build_environment: linux-binary-manywheel-split
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu11==11.8.89; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu11==11.8.89; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu11==11.8.87; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu11==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu11==11.11.3.6; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu11==10.9.0.58; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu11==10.3.0.86; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu11==11.4.1.48; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu11==11.7.5.86; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu11==2.21.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu11==11.8.86; platform_system == 'Linux' and platform_machine == 'x86_64'
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_13t-cuda11_8-test: # Testing
if: ${{ github.repository_owner == 'pytorch' }}
needs:
- manywheel-py3_13t-cuda11_8-build
- get-label-type
uses: ./.github/workflows/_binary-test-linux.yml
with:
PYTORCH_ROOT: /pytorch
BUILDER_ROOT: /builder
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu118
GPU_ARCH_VERSION: 11.8
GPU_ARCH_TYPE: cuda
DOCKER_IMAGE: pytorch/manylinux-builder:cuda11.8-main
use_split_build: True
DESIRED_PYTHON: "3.13t"
build_name: manywheel-py3_13t-cuda11_8
build_environment: linux-binary-manywheel-split
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
runs_on: linux.4xlarge.nvidia.gpu
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_13t-cuda11_8-upload: # Uploading
if: ${{ github.repository_owner == 'pytorch' }}
permissions:
id-token: write
contents: read
needs: manywheel-py3_13t-cuda11_8-test
with:
PYTORCH_ROOT: /pytorch
BUILDER_ROOT: /builder
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu118
GPU_ARCH_VERSION: 11.8
GPU_ARCH_TYPE: cuda
DOCKER_IMAGE: pytorch/manylinux-builder:cuda11.8-main
use_split_build: True
DESIRED_PYTHON: "3.13t"
build_name: manywheel-py3_13t-cuda11_8
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
conda-pytorchbot-token: ${{ secrets.CONDA_PYTORCHBOT_TOKEN }}
conda-pytorchbot-token-test: ${{ secrets.CONDA_PYTORCHBOT_TOKEN_TEST }}
uses: ./.github/workflows/_binary-upload.yml
manywheel-py3_13t-cuda12_1-build:
if: ${{ github.repository_owner == 'pytorch' }}
uses: ./.github/workflows/_binary-build-linux.yml
needs: get-label-type
with:
PYTORCH_ROOT: /pytorch
BUILDER_ROOT: /builder
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu121
GPU_ARCH_VERSION: 12.1
GPU_ARCH_TYPE: cuda
DOCKER_IMAGE: pytorch/manylinux-builder:cuda12.1-main
use_split_build: True
DESIRED_PYTHON: "3.13t"
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
build_name: manywheel-py3_13t-cuda12_1
build_environment: linux-binary-manywheel-split
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.1.3.1; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.0.2.54; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.2.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.4.5.107; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.1.0.106; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.21.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_13t-cuda12_1-test: # Testing
if: ${{ github.repository_owner == 'pytorch' }}
needs:
- manywheel-py3_13t-cuda12_1-build
- get-label-type
uses: ./.github/workflows/_binary-test-linux.yml
with:
PYTORCH_ROOT: /pytorch
BUILDER_ROOT: /builder
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu121
GPU_ARCH_VERSION: 12.1
GPU_ARCH_TYPE: cuda
DOCKER_IMAGE: pytorch/manylinux-builder:cuda12.1-main
use_split_build: True
DESIRED_PYTHON: "3.13t"
build_name: manywheel-py3_13t-cuda12_1
build_environment: linux-binary-manywheel-split
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
runs_on: linux.4xlarge.nvidia.gpu
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_13t-cuda12_1-upload: # Uploading
if: ${{ github.repository_owner == 'pytorch' }}
permissions:
id-token: write
contents: read
needs: manywheel-py3_13t-cuda12_1-test
with:
PYTORCH_ROOT: /pytorch
BUILDER_ROOT: /builder
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu121
GPU_ARCH_VERSION: 12.1
GPU_ARCH_TYPE: cuda
DOCKER_IMAGE: pytorch/manylinux-builder:cuda12.1-main
use_split_build: True
DESIRED_PYTHON: "3.13t"
build_name: manywheel-py3_13t-cuda12_1
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
conda-pytorchbot-token: ${{ secrets.CONDA_PYTORCHBOT_TOKEN }}
conda-pytorchbot-token-test: ${{ secrets.CONDA_PYTORCHBOT_TOKEN_TEST }}
uses: ./.github/workflows/_binary-upload.yml
manywheel-py3_13t-cuda12_4-build:
if: ${{ github.repository_owner == 'pytorch' }}
uses: ./.github/workflows/_binary-build-linux.yml
needs: get-label-type
with:
PYTORCH_ROOT: /pytorch
BUILDER_ROOT: /builder
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu124
GPU_ARCH_VERSION: 12.4
GPU_ARCH_TYPE: cuda
DOCKER_IMAGE: pytorch/manylinux-builder:cuda12.4-main
use_split_build: True
DESIRED_PYTHON: "3.13t"
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
build_name: manywheel-py3_13t-cuda12_4
build_environment: linux-binary-manywheel-split
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: nvidia-cuda-nvrtc-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-runtime-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cuda-cupti-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cublas-cu12==12.4.5.8; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cufft-cu12==11.2.1.3; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-curand-cu12==10.3.5.147; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusolver-cu12==11.6.1.9; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-cusparse-cu12==12.3.1.170; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nccl-cu12==2.21.5; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvtx-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | nvidia-nvjitlink-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64'
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_13t-cuda12_4-test: # Testing
if: ${{ github.repository_owner == 'pytorch' }}
needs:
- manywheel-py3_13t-cuda12_4-build
- get-label-type
uses: ./.github/workflows/_binary-test-linux.yml
with:
PYTORCH_ROOT: /pytorch
BUILDER_ROOT: /builder
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu124
GPU_ARCH_VERSION: 12.4
GPU_ARCH_TYPE: cuda
DOCKER_IMAGE: pytorch/manylinux-builder:cuda12.4-main
use_split_build: True
DESIRED_PYTHON: "3.13t"
build_name: manywheel-py3_13t-cuda12_4
build_environment: linux-binary-manywheel-split
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
runs_on: linux.4xlarge.nvidia.gpu
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_13t-cuda12_4-upload: # Uploading
if: ${{ github.repository_owner == 'pytorch' }}
permissions:
id-token: write
contents: read
needs: manywheel-py3_13t-cuda12_4-test
with:
PYTORCH_ROOT: /pytorch
BUILDER_ROOT: /builder
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cu124
GPU_ARCH_VERSION: 12.4
GPU_ARCH_TYPE: cuda
DOCKER_IMAGE: pytorch/manylinux-builder:cuda12.4-main
use_split_build: True
DESIRED_PYTHON: "3.13t"
build_name: manywheel-py3_13t-cuda12_4
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
conda-pytorchbot-token: ${{ secrets.CONDA_PYTORCHBOT_TOKEN }}
conda-pytorchbot-token-test: ${{ secrets.CONDA_PYTORCHBOT_TOKEN_TEST }}
uses: ./.github/workflows/_binary-upload.yml
manywheel-py3_13t-cpu-build:
if: ${{ github.repository_owner == 'pytorch' }}
uses: ./.github/workflows/_binary-build-linux.yml
needs: get-label-type
with:
PYTORCH_ROOT: /pytorch
BUILDER_ROOT: /builder
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cpu
GPU_ARCH_TYPE: cpu
DOCKER_IMAGE: pytorch/manylinux-builder:cpu-main
use_split_build: True
DESIRED_PYTHON: "3.13t"
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
build_name: manywheel-py3_13t-cpu
build_environment: linux-binary-manywheel-split
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_13t-cpu-test: # Testing
if: ${{ github.repository_owner == 'pytorch' }}
needs:
- manywheel-py3_13t-cpu-build
- get-label-type
uses: ./.github/workflows/_binary-test-linux.yml
with:
PYTORCH_ROOT: /pytorch
BUILDER_ROOT: /builder
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cpu
GPU_ARCH_TYPE: cpu
DOCKER_IMAGE: pytorch/manylinux-builder:cpu-main
use_split_build: True
DESIRED_PYTHON: "3.13t"
build_name: manywheel-py3_13t-cpu
build_environment: linux-binary-manywheel-split
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
runs_on: linux.4xlarge
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
manywheel-py3_13t-cpu-upload: # Uploading
if: ${{ github.repository_owner == 'pytorch' }}
permissions:
id-token: write
contents: read
needs: manywheel-py3_13t-cpu-test
with:
PYTORCH_ROOT: /pytorch
BUILDER_ROOT: /builder
PACKAGE_TYPE: manywheel
# TODO: This is a legacy variable that we eventually want to get rid of in
# favor of GPU_ARCH_VERSION
DESIRED_CUDA: cpu
GPU_ARCH_TYPE: cpu
DOCKER_IMAGE: pytorch/manylinux-builder:cpu-main
use_split_build: True
DESIRED_PYTHON: "3.13t"
build_name: manywheel-py3_13t-cpu
secrets:
github-token: ${{ secrets.GITHUB_TOKEN }}
conda-pytorchbot-token: ${{ secrets.CONDA_PYTORCHBOT_TOKEN }}
conda-pytorchbot-token-test: ${{ secrets.CONDA_PYTORCHBOT_TOKEN_TEST }}
uses: ./.github/workflows/_binary-upload.yml
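The split-build jobs keep the same build → test → upload chaining through needs:. An illustrative stdlib-only sketch of the ordering those dependencies imply; the job names are taken from the 3.13t CPU jobs above, but the graph literal itself is only an example:

```python
# Topologically order the 3.13t CPU jobs from their `needs:` edges.
from graphlib import TopologicalSorter

needs = {
    "manywheel-py3_13t-cpu-build": {"get-label-type"},
    "manywheel-py3_13t-cpu-test": {"manywheel-py3_13t-cpu-build", "get-label-type"},
    "manywheel-py3_13t-cpu-upload": {"manywheel-py3_13t-cpu-test"},
}

# static_order() yields the jobs in an order that respects every dependency.
print(list(TopologicalSorter(needs).static_order()))
```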

View File

@ -38,25 +38,25 @@ jobs:
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
test-matrix: |
{ include: [
{ config: "inductor", shard: 1, num_shards: 2, runner: "${{ needs.get-label-type.outputs.label-type }}linux.g5.4xlarge.nvidia.gpu" },
{ config: "inductor", shard: 2, num_shards: 2, runner: "${{ needs.get-label-type.outputs.label-type }}linux.g5.4xlarge.nvidia.gpu" },
{ config: "inductor_distributed", shard: 1, num_shards: 1, runner: "${{ needs.get-label-type.outputs.label-type }}linux.g5.12xlarge.nvidia.gpu" },
{ config: "inductor_huggingface", shard: 1, num_shards: 1, runner: "${{ needs.get-label-type.outputs.label-type }}linux.g5.4xlarge.nvidia.gpu" },
{ config: "inductor_timm", shard: 1, num_shards: 2, runner: "${{ needs.get-label-type.outputs.label-type }}linux.g5.4xlarge.nvidia.gpu" },
{ config: "inductor_timm", shard: 2, num_shards: 2, runner: "${{ needs.get-label-type.outputs.label-type }}linux.g5.4xlarge.nvidia.gpu" },
{ config: "inductor_torchbench", shard: 1, num_shards: 2, runner: "${{ needs.get-label-type.outputs.label-type }}linux.g5.4xlarge.nvidia.gpu" },
{ config: "inductor_torchbench", shard: 2, num_shards: 2, runner: "${{ needs.get-label-type.outputs.label-type }}linux.g5.4xlarge.nvidia.gpu" },
{ config: "dynamic_inductor_huggingface", shard: 1, num_shards: 1, runner: "${{ needs.get-label-type.outputs.label-type }}linux.g5.4xlarge.nvidia.gpu" },
{ config: "dynamic_inductor_timm", shard: 1, num_shards: 2, runner: "${{ needs.get-label-type.outputs.label-type }}linux.g5.4xlarge.nvidia.gpu" },
{ config: "dynamic_inductor_timm", shard: 2, num_shards: 2, runner: "${{ needs.get-label-type.outputs.label-type }}linux.g5.4xlarge.nvidia.gpu" },
{ config: "dynamic_inductor_torchbench", shard: 1, num_shards: 2, runner: "${{ needs.get-label-type.outputs.label-type }}linux.g5.4xlarge.nvidia.gpu" },
{ config: "dynamic_inductor_torchbench", shard: 2, num_shards: 2, runner: "${{ needs.get-label-type.outputs.label-type }}linux.g5.4xlarge.nvidia.gpu" },
{ config: "aot_inductor_huggingface", shard: 1, num_shards: 1, runner: "${{ needs.get-label-type.outputs.label-type }}linux.g5.4xlarge.nvidia.gpu" },
{ config: "aot_inductor_timm", shard: 1, num_shards: 2, runner: "${{ needs.get-label-type.outputs.label-type }}linux.g5.4xlarge.nvidia.gpu" },
{ config: "aot_inductor_timm", shard: 2, num_shards: 2, runner: "${{ needs.get-label-type.outputs.label-type }}linux.g5.4xlarge.nvidia.gpu" },
{ config: "aot_inductor_torchbench", shard: 1, num_shards: 2, runner: "${{ needs.get-label-type.outputs.label-type }}linux.g5.4xlarge.nvidia.gpu" },
{ config: "aot_inductor_torchbench", shard: 2, num_shards: 2, runner: "${{ needs.get-label-type.outputs.label-type }}linux.g5.4xlarge.nvidia.gpu" },
{ config: "inductor_cpp_wrapper_abi_compatible", shard: 1, num_shards: 1, runner: "${{ needs.get-label-type.outputs.label-type }}linux.g5.4xlarge.nvidia.gpu" },
{ config: "inductor", shard: 1, num_shards: 2, runner: "linux.g5.4xlarge.nvidia.gpu" },
{ config: "inductor", shard: 2, num_shards: 2, runner: "linux.g5.4xlarge.nvidia.gpu" },
{ config: "inductor_distributed", shard: 1, num_shards: 1, runner: "linux.g5.12xlarge.nvidia.gpu" },
{ config: "inductor_huggingface", shard: 1, num_shards: 1, runner: "linux.g5.4xlarge.nvidia.gpu" },
{ config: "inductor_timm", shard: 1, num_shards: 2, runner: "linux.g5.4xlarge.nvidia.gpu" },
{ config: "inductor_timm", shard: 2, num_shards: 2, runner: "linux.g5.4xlarge.nvidia.gpu" },
{ config: "inductor_torchbench", shard: 1, num_shards: 2, runner: "linux.g5.4xlarge.nvidia.gpu" },
{ config: "inductor_torchbench", shard: 2, num_shards: 2, runner: "linux.g5.4xlarge.nvidia.gpu" },
{ config: "dynamic_inductor_huggingface", shard: 1, num_shards: 1, runner: "linux.g5.4xlarge.nvidia.gpu" },
{ config: "dynamic_inductor_timm", shard: 1, num_shards: 2, runner: "linux.g5.4xlarge.nvidia.gpu" },
{ config: "dynamic_inductor_timm", shard: 2, num_shards: 2, runner: "linux.g5.4xlarge.nvidia.gpu" },
{ config: "dynamic_inductor_torchbench", shard: 1, num_shards: 2, runner: "linux.g5.4xlarge.nvidia.gpu" },
{ config: "dynamic_inductor_torchbench", shard: 2, num_shards: 2, runner: "linux.g5.4xlarge.nvidia.gpu" },
{ config: "aot_inductor_huggingface", shard: 1, num_shards: 1, runner: "linux.g5.4xlarge.nvidia.gpu" },
{ config: "aot_inductor_timm", shard: 1, num_shards: 2, runner: "linux.g5.4xlarge.nvidia.gpu" },
{ config: "aot_inductor_timm", shard: 2, num_shards: 2, runner: "linux.g5.4xlarge.nvidia.gpu" },
{ config: "aot_inductor_torchbench", shard: 1, num_shards: 2, runner: "linux.g5.4xlarge.nvidia.gpu" },
{ config: "aot_inductor_torchbench", shard: 2, num_shards: 2, runner: "linux.g5.4xlarge.nvidia.gpu" },
{ config: "inductor_cpp_wrapper_abi_compatible", shard: 1, num_shards: 1, runner: "linux.g5.4xlarge.nvidia.gpu" },
]}
secrets: inherit
@ -81,8 +81,8 @@ jobs:
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
test-matrix: |
{ include: [
{ config: "inductor", shard: 1, num_shards: 2, runner: "${{ needs.get-label-type.outputs.label-type }}linux.g5.4xlarge.nvidia.gpu" },
{ config: "inductor", shard: 2, num_shards: 2, runner: "${{ needs.get-label-type.outputs.label-type }}linux.g5.4xlarge.nvidia.gpu" },
{ config: "inductor", shard: 1, num_shards: 2, runner: "linux.g5.4xlarge.nvidia.gpu" },
{ config: "inductor", shard: 2, num_shards: 2, runner: "linux.g5.4xlarge.nvidia.gpu" },
]}
secrets: inherit
@ -106,7 +106,7 @@ jobs:
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
test-matrix: |
{ include: [
{ config: "inductor-halide", shard: 1, num_shards: 1, runner: "${{ needs.get-label-type.outputs.label-type }}linux.12xlarge" },
{ config: "inductor-halide", shard: 1, num_shards: 1, runner: "linux.12xlarge" },
]}
secrets: inherit
@ -120,6 +120,28 @@ jobs:
test-matrix: ${{ needs.linux-jammy-cpu-py3_12-inductor-halide-build.outputs.test-matrix }}
secrets: inherit
linux-jammy-cpu-py3_12-inductor-triton-cpu-build:
name: linux-jammy-cpu-py3.12-gcc11-inductor-triton-cpu
uses: ./.github/workflows/_linux-build.yml
needs: get-label-type
with:
build-environment: linux-jammy-py3.12-gcc11
docker-image-name: pytorch-linux-jammy-py3.12-triton-cpu
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
test-matrix: |
{ include: [
{ config: "inductor-triton-cpu", shard: 1, num_shards: 1, runner: "linux.12xlarge" },
]}
linux-jammy-cpu-py3_12-inductor-triton-cpu-test:
name: linux-jammy-cpu-py3.12-gcc11-inductor-triton-cpu
uses: ./.github/workflows/_linux-test.yml
needs: linux-jammy-cpu-py3_12-inductor-triton-cpu-build
with:
build-environment: linux-jammy-py3.12-gcc11
docker-image: ${{ needs.linux-jammy-cpu-py3_12-inductor-triton-cpu-build.outputs.docker-image }}
test-matrix: ${{ needs.linux-jammy-cpu-py3_12-inductor-triton-cpu-build.outputs.test-matrix }}
linux-focal-cuda12_4-py3_10-gcc9-inductor-build:
# Should be synced with the one in inductor-periodic.yml but this only runs inductor_timm
name: cuda12.4-py3.10-gcc9-sm86
@ -133,8 +155,8 @@ jobs:
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
test-matrix: |
{ include: [
{ config: "inductor_timm", shard: 1, num_shards: 2, runner: "${{ needs.get-label-type.outputs.label-type }}linux.g5.4xlarge.nvidia.gpu" },
{ config: "inductor_timm", shard: 2, num_shards: 2, runner: "${{ needs.get-label-type.outputs.label-type }}linux.g5.4xlarge.nvidia.gpu" },
{ config: "inductor_timm", shard: 1, num_shards: 2, runner: "linux.g5.4xlarge.nvidia.gpu" },
{ config: "inductor_timm", shard: 2, num_shards: 2, runner: "linux.g5.4xlarge.nvidia.gpu" },
]}
secrets: inherit
@ -159,47 +181,47 @@ jobs:
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
test-matrix: |
{ include: [
{ config: "inductor_avx512", shard: 1, num_shards: 2, runner: "${{ needs.get-label-type.outputs.label-type }}linux.12xlarge" },
{ config: "inductor_avx512", shard: 2, num_shards: 2, runner: "${{ needs.get-label-type.outputs.label-type }}linux.12xlarge" },
{ config: "cpu_inductor_huggingface", shard: 1, num_shards: 1, runner: "${{ needs.get-label-type.outputs.label-type }}linux.12xlarge" },
{ config: "cpu_inductor_timm", shard: 1, num_shards: 2, runner: "${{ needs.get-label-type.outputs.label-type }}linux.12xlarge" },
{ config: "cpu_inductor_timm", shard: 2, num_shards: 2, runner: "${{ needs.get-label-type.outputs.label-type }}linux.12xlarge" },
{ config: "cpu_inductor_torchbench", shard: 1, num_shards: 2, runner: "${{ needs.get-label-type.outputs.label-type }}linux.12xlarge" },
{ config: "cpu_inductor_torchbench", shard: 2, num_shards: 2, runner: "${{ needs.get-label-type.outputs.label-type }}linux.12xlarge" },
{ config: "cpu_inductor_freezing_huggingface", shard: 1, num_shards: 1, runner: "${{ needs.get-label-type.outputs.label-type }}linux.12xlarge" },
{ config: "cpu_inductor_freezing_timm", shard: 1, num_shards: 2, runner: "${{ needs.get-label-type.outputs.label-type }}linux.12xlarge" },
{ config: "cpu_inductor_freezing_timm", shard: 2, num_shards: 2, runner: "${{ needs.get-label-type.outputs.label-type }}linux.12xlarge" },
{ config: "cpu_inductor_freezing_torchbench", shard: 1, num_shards: 2, runner: "${{ needs.get-label-type.outputs.label-type }}linux.12xlarge" },
{ config: "cpu_inductor_freezing_torchbench", shard: 2, num_shards: 2, runner: "${{ needs.get-label-type.outputs.label-type }}linux.12xlarge" },
{ config: "cpu_inductor_amp_freezing_huggingface", shard: 1, num_shards: 1, runner: "${{ needs.get-label-type.outputs.label-type }}linux.16xlarge.spr" },
{ config: "cpu_inductor_amp_freezing_timm", shard: 1, num_shards: 2, runner: "${{ needs.get-label-type.outputs.label-type }}linux.16xlarge.spr" },
{ config: "cpu_inductor_amp_freezing_timm", shard: 2, num_shards: 2, runner: "${{ needs.get-label-type.outputs.label-type }}linux.16xlarge.spr" },
{ config: "cpu_inductor_amp_freezing_torchbench", shard: 1, num_shards: 2, runner: "${{ needs.get-label-type.outputs.label-type }}linux.16xlarge.spr" },
{ config: "cpu_inductor_amp_freezing_torchbench", shard: 2, num_shards: 2, runner: "${{ needs.get-label-type.outputs.label-type }}linux.16xlarge.spr" },
{ config: "dynamic_cpu_inductor_huggingface", shard: 1, num_shards: 1, runner: "${{ needs.get-label-type.outputs.label-type }}linux.12xlarge" },
{ config: "dynamic_cpu_inductor_timm", shard: 1, num_shards: 2, runner: "${{ needs.get-label-type.outputs.label-type }}linux.12xlarge" },
{ config: "dynamic_cpu_inductor_timm", shard: 2, num_shards: 2, runner: "${{ needs.get-label-type.outputs.label-type }}linux.12xlarge" },
{ config: "dynamic_cpu_inductor_torchbench", shard: 1, num_shards: 2, runner: "${{ needs.get-label-type.outputs.label-type }}linux.12xlarge" },
{ config: "dynamic_cpu_inductor_torchbench", shard: 2, num_shards: 2, runner: "${{ needs.get-label-type.outputs.label-type }}linux.12xlarge" },
{ config: "cpu_aot_inductor_freezing_huggingface", shard: 1, num_shards: 1, runner: "${{ needs.get-label-type.outputs.label-type }}linux.12xlarge" },
{ config: "cpu_aot_inductor_freezing_timm", shard: 1, num_shards: 2, runner: "${{ needs.get-label-type.outputs.label-type }}linux.12xlarge" },
{ config: "cpu_aot_inductor_freezing_timm", shard: 2, num_shards: 2, runner: "${{ needs.get-label-type.outputs.label-type }}linux.12xlarge" },
{ config: "cpu_aot_inductor_freezing_torchbench", shard: 1, num_shards: 2, runner: "${{ needs.get-label-type.outputs.label-type }}linux.12xlarge" },
{ config: "cpu_aot_inductor_freezing_torchbench", shard: 2, num_shards: 2, runner: "${{ needs.get-label-type.outputs.label-type }}linux.12xlarge" },
{ config: "cpu_aot_inductor_amp_freezing_torchbench", shard: 1, num_shards: 2, runner: "${{ needs.get-label-type.outputs.label-type }}linux.12xlarge" },
{ config: "cpu_aot_inductor_amp_freezing_torchbench", shard: 2, num_shards: 2, runner: "${{ needs.get-label-type.outputs.label-type }}linux.12xlarge" },
{ config: "dynamic_cpu_aot_inductor_freezing_torchbench", shard: 1, num_shards: 2, runner: "${{ needs.get-label-type.outputs.label-type }}linux.12xlarge" },
{ config: "dynamic_cpu_aot_inductor_freezing_torchbench", shard: 2, num_shards: 2, runner: "${{ needs.get-label-type.outputs.label-type }}linux.12xlarge" },
{ config: "dynamic_cpu_aot_inductor_amp_freezing_torchbench", shard: 1, num_shards: 2, runner: "${{ needs.get-label-type.outputs.label-type }}linux.12xlarge" },
{ config: "dynamic_cpu_aot_inductor_amp_freezing_torchbench", shard: 2, num_shards: 2, runner: "${{ needs.get-label-type.outputs.label-type }}linux.12xlarge" },
{ config: "inductor_torchbench_cpu_smoketest_perf", shard: 1, num_shards: 1, runner: "${{ needs.get-label-type.outputs.label-type }}linux.24xl.spr-metal" },
{ config: "inductor_avx2", shard: 1, num_shards: 2, runner: "${{ needs.get-label-type.outputs.label-type }}linux.10xlarge.avx2" },
{ config: "inductor_avx2", shard: 2, num_shards: 2, runner: "${{ needs.get-label-type.outputs.label-type }}linux.10xlarge.avx2" },
{ config: "cpu_inductor_freezing_avx2_huggingface", shard: 1, num_shards: 1, runner: "${{ needs.get-label-type.outputs.label-type }}linux.10xlarge.avx2" },
{ config: "cpu_inductor_freezing_avx2_torchbench", shard: 1, num_shards: 2, runner: "${{ needs.get-label-type.outputs.label-type }}linux.10xlarge.avx2" },
{ config: "cpu_inductor_freezing_avx2_torchbench", shard: 2, num_shards: 2, runner: "${{ needs.get-label-type.outputs.label-type }}linux.10xlarge.avx2" },
{ config: "cpu_inductor_freezing_avx2_timm", shard: 1, num_shards: 2, runner: "${{ needs.get-label-type.outputs.label-type }}linux.10xlarge.avx2" },
{ config: "cpu_inductor_freezing_avx2_timm", shard: 2, num_shards: 2, runner: "${{ needs.get-label-type.outputs.label-type }}linux.10xlarge.avx2" },
{ config: "inductor_avx512", shard: 1, num_shards: 2, runner: "linux.12xlarge" },
{ config: "inductor_avx512", shard: 2, num_shards: 2, runner: "linux.12xlarge" },
{ config: "cpu_inductor_huggingface", shard: 1, num_shards: 1, runner: "linux.12xlarge" },
{ config: "cpu_inductor_timm", shard: 1, num_shards: 2, runner: "linux.12xlarge" },
{ config: "cpu_inductor_timm", shard: 2, num_shards: 2, runner: "linux.12xlarge" },
{ config: "cpu_inductor_torchbench", shard: 1, num_shards: 2, runner: "linux.12xlarge" },
{ config: "cpu_inductor_torchbench", shard: 2, num_shards: 2, runner: "linux.12xlarge" },
{ config: "cpu_inductor_freezing_huggingface", shard: 1, num_shards: 1, runner: "linux.12xlarge" },
{ config: "cpu_inductor_freezing_timm", shard: 1, num_shards: 2, runner: "linux.12xlarge" },
{ config: "cpu_inductor_freezing_timm", shard: 2, num_shards: 2, runner: "linux.12xlarge" },
{ config: "cpu_inductor_freezing_torchbench", shard: 1, num_shards: 2, runner: "linux.12xlarge" },
{ config: "cpu_inductor_freezing_torchbench", shard: 2, num_shards: 2, runner: "linux.12xlarge" },
{ config: "cpu_inductor_amp_freezing_huggingface", shard: 1, num_shards: 1, runner: "linux.16xlarge.spr" },
{ config: "cpu_inductor_amp_freezing_timm", shard: 1, num_shards: 2, runner: "linux.16xlarge.spr" },
{ config: "cpu_inductor_amp_freezing_timm", shard: 2, num_shards: 2, runner: "linux.16xlarge.spr" },
{ config: "cpu_inductor_amp_freezing_torchbench", shard: 1, num_shards: 2, runner: "linux.16xlarge.spr" },
{ config: "cpu_inductor_amp_freezing_torchbench", shard: 2, num_shards: 2, runner: "linux.16xlarge.spr" },
{ config: "dynamic_cpu_inductor_huggingface", shard: 1, num_shards: 1, runner: "linux.12xlarge" },
{ config: "dynamic_cpu_inductor_timm", shard: 1, num_shards: 2, runner: "linux.12xlarge" },
{ config: "dynamic_cpu_inductor_timm", shard: 2, num_shards: 2, runner: "linux.12xlarge" },
{ config: "dynamic_cpu_inductor_torchbench", shard: 1, num_shards: 2, runner: "linux.12xlarge" },
{ config: "dynamic_cpu_inductor_torchbench", shard: 2, num_shards: 2, runner: "linux.12xlarge" },
{ config: "cpu_aot_inductor_freezing_huggingface", shard: 1, num_shards: 1, runner: "linux.12xlarge" },
{ config: "cpu_aot_inductor_freezing_timm", shard: 1, num_shards: 2, runner: "linux.12xlarge" },
{ config: "cpu_aot_inductor_freezing_timm", shard: 2, num_shards: 2, runner: "linux.12xlarge" },
{ config: "cpu_aot_inductor_freezing_torchbench", shard: 1, num_shards: 2, runner: "linux.12xlarge" },
{ config: "cpu_aot_inductor_freezing_torchbench", shard: 2, num_shards: 2, runner: "linux.12xlarge" },
{ config: "cpu_aot_inductor_amp_freezing_torchbench", shard: 1, num_shards: 2, runner: "linux.12xlarge" },
{ config: "cpu_aot_inductor_amp_freezing_torchbench", shard: 2, num_shards: 2, runner: "linux.12xlarge" },
{ config: "dynamic_cpu_aot_inductor_freezing_torchbench", shard: 1, num_shards: 2, runner: "linux.12xlarge" },
{ config: "dynamic_cpu_aot_inductor_freezing_torchbench", shard: 2, num_shards: 2, runner: "linux.12xlarge" },
{ config: "dynamic_cpu_aot_inductor_amp_freezing_torchbench", shard: 1, num_shards: 2, runner: "linux.12xlarge" },
{ config: "dynamic_cpu_aot_inductor_amp_freezing_torchbench", shard: 2, num_shards: 2, runner: "linux.12xlarge" },
{ config: "inductor_torchbench_cpu_smoketest_perf", shard: 1, num_shards: 1, runner: "linux.24xl.spr-metal" },
{ config: "inductor_avx2", shard: 1, num_shards: 2, runner: "linux.10xlarge.avx2" },
{ config: "inductor_avx2", shard: 2, num_shards: 2, runner: "linux.10xlarge.avx2" },
{ config: "cpu_inductor_freezing_avx2_huggingface", shard: 1, num_shards: 1, runner: "linux.10xlarge.avx2" },
{ config: "cpu_inductor_freezing_avx2_torchbench", shard: 1, num_shards: 2, runner: "linux.10xlarge.avx2" },
{ config: "cpu_inductor_freezing_avx2_torchbench", shard: 2, num_shards: 2, runner: "linux.10xlarge.avx2" },
{ config: "cpu_inductor_freezing_avx2_timm", shard: 1, num_shards: 2, runner: "linux.10xlarge.avx2" },
{ config: "cpu_inductor_freezing_avx2_timm", shard: 2, num_shards: 2, runner: "linux.10xlarge.avx2" },
]}
secrets: inherit
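The test-matrix values are JSON documents that GitHub interpolates before use; once rendered, they can be inspected directly. A rough sketch of doing so in Python; the literal below is a trimmed example, not the full matrix from this workflow:

```python
# Load a rendered test-matrix and count how many shards each config runs.
import json
from collections import Counter

rendered = """
{ "include": [
  { "config": "inductor_avx512", "shard": 1, "num_shards": 2, "runner": "linux.12xlarge" },
  { "config": "inductor_avx512", "shard": 2, "num_shards": 2, "runner": "linux.12xlarge" },
  { "config": "inductor_avx2",   "shard": 1, "num_shards": 2, "runner": "linux.10xlarge.avx2" }
]}
"""

matrix = json.loads(rendered)
shards_per_config = Counter(entry["config"] for entry in matrix["include"])
for config, shards in shards_per_config.items():
    print(f"{config}: {shards} shard(s)")
```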

View File

@ -11,7 +11,6 @@ jobs:
contents: read
pull-requests: write
runs-on: lf.linux.2xlarge
continue-on-error: true
if: ${{ github.repository_owner == 'pytorch' }}
steps:
- name: Checkout pytorch
@ -31,10 +30,12 @@ jobs:
bash .github/scripts/lintrunner.sh
- name: Check for changes
id: git-check
continue-on-error: true
run: |
git diff --exit-code || echo "changes=true" >> "$GITHUB_OUTPUT"
- name: Suggest changes
if: steps.git-check.outputs.changes == 'true'
continue-on-error: true
uses: parkerbxyz/suggest-changes@v1
with:
comment: "Please commit the suggested changes from pytorch's linter."

View File

@ -215,14 +215,15 @@ jobs:
with:
submodules: false
fetch-depth: 1
- name: Setup Python 3.8
- name: Setup Python 3.9
uses: actions/setup-python@v4
with:
python-version: '3.8'
python-version: '3.9'
architecture: x64
cache: pip
- name: Install dependencies
run: |
python3 -m pip install --upgrade pip
pip install pytest-rerunfailures==11.1.* pytest-flakefinder==1.1.* pytest-xdist==3.3.* expecttest==0.2.* fbscribelogger==0.1.* numpy==1.24.*
pip install torch --pre --index-url https://download.pytorch.org/whl/nightly/cpu/
- name: Run run_test.py (nonretryable)
@ -278,4 +279,4 @@ jobs:
concurrency:
group: ${{ github.workflow }}-${{ github.event.pull_request.number || github.sha }}-${{ github.event_name == 'workflow_dispatch' }}
cancel-in-progress: true
cancel-in-progress: true
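The concurrency block at the end of this workflow derives its group key from the workflow name, the PR number (falling back to the commit SHA), and whether the run was manually dispatched. A hedged sketch of that composition; the function name and sample values are illustrative, not taken from the workflow:

```python
# Compose a concurrency group key the way the expression above does.
from typing import Optional


def concurrency_group(workflow: str, pr_number: Optional[int], sha: str, event_name: str) -> str:
    # `pull_request.number || sha` in the expression syntax: fall back to the SHA on pushes.
    ref_part = pr_number if pr_number is not None else sha
    is_dispatch = str(event_name == "workflow_dispatch").lower()
    return f"{workflow}-{ref_part}-{is_dispatch}"


# Two runs of the same workflow on the same PR share a group, so the newer run
# cancels the older one when cancel-in-progress is true.
print(concurrency_group("example-workflow", 12345, "deadbeef", "pull_request"))
print(concurrency_group("example-workflow", None, "deadbeef", "push"))
```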

View File

@ -57,10 +57,10 @@ jobs:
docker-image-name: pytorch-linux-focal-cuda12.1-cudnn9-py3-gcc9
test-matrix: |
{ include: [
{ config: "nogpu_AVX512", shard: 1, num_shards: 2, runner: "${{ needs.get-label-type.outputs.label-type }}linux.2xlarge" },
{ config: "nogpu_AVX512", shard: 2, num_shards: 2, runner: "${{ needs.get-label-type.outputs.label-type }}linux.2xlarge" },
{ config: "nogpu_NO_AVX2", shard: 1, num_shards: 2, runner: "${{ needs.get-label-type.outputs.label-type }}linux.2xlarge" },
{ config: "nogpu_NO_AVX2", shard: 2, num_shards: 2, runner: "${{ needs.get-label-type.outputs.label-type }}linux.2xlarge" },
{ config: "nogpu_AVX512", shard: 1, num_shards: 2, runner: "${{ needs.get-label-type.outputs.label-type }}linux.4xlarge" },
{ config: "nogpu_AVX512", shard: 2, num_shards: 2, runner: "${{ needs.get-label-type.outputs.label-type }}linux.4xlarge" },
{ config: "nogpu_NO_AVX2", shard: 1, num_shards: 2, runner: "${{ needs.get-label-type.outputs.label-type }}linux.4xlarge" },
{ config: "nogpu_NO_AVX2", shard: 2, num_shards: 2, runner: "${{ needs.get-label-type.outputs.label-type }}linux.4xlarge" },
{ config: "jit_legacy", shard: 1, num_shards: 1, runner: "${{ needs.get-label-type.outputs.label-type }}linux.4xlarge.nvidia.gpu" },
]}
linux-focal-cuda12_1-py3_10-gcc9-test:
@ -89,10 +89,10 @@ jobs:
{ config: "default", shard: 3, num_shards: 5, runner: "${{ needs.get-label-type.outputs.label-type }}linux.4xlarge.nvidia.gpu" },
{ config: "default", shard: 4, num_shards: 5, runner: "${{ needs.get-label-type.outputs.label-type }}linux.4xlarge.nvidia.gpu" },
{ config: "default", shard: 5, num_shards: 5, runner: "${{ needs.get-label-type.outputs.label-type }}linux.4xlarge.nvidia.gpu" },
{ config: "nogpu_AVX512", shard: 1, num_shards: 2, runner: "${{ needs.get-label-type.outputs.label-type }}linux.2xlarge" },
{ config: "nogpu_AVX512", shard: 2, num_shards: 2, runner: "${{ needs.get-label-type.outputs.label-type }}linux.2xlarge" },
{ config: "nogpu_NO_AVX2", shard: 1, num_shards: 2, runner: "${{ needs.get-label-type.outputs.label-type }}linux.2xlarge" },
{ config: "nogpu_NO_AVX2", shard: 2, num_shards: 2, runner: "${{ needs.get-label-type.outputs.label-type }}linux.2xlarge" },
{ config: "nogpu_AVX512", shard: 1, num_shards: 2, runner: "${{ needs.get-label-type.outputs.label-type }}linux.4xlarge" },
{ config: "nogpu_AVX512", shard: 2, num_shards: 2, runner: "${{ needs.get-label-type.outputs.label-type }}linux.4xlarge" },
{ config: "nogpu_NO_AVX2", shard: 1, num_shards: 2, runner: "${{ needs.get-label-type.outputs.label-type }}linux.4xlarge" },
{ config: "nogpu_NO_AVX2", shard: 2, num_shards: 2, runner: "${{ needs.get-label-type.outputs.label-type }}linux.4xlarge" },
{ config: "jit_legacy", shard: 1, num_shards: 1, runner: "${{ needs.get-label-type.outputs.label-type }}linux.4xlarge.nvidia.gpu" },
]}
@ -118,9 +118,10 @@ jobs:
docker-image-name: pytorch-linux-jammy-py3.9-gcc11
test-matrix: |
{ include: [
{ config: "default", shard: 1, num_shards: 3, runner: "${{ needs.get-label-type.outputs.label-type }}linux.2xlarge" },
{ config: "default", shard: 2, num_shards: 3, runner: "${{ needs.get-label-type.outputs.label-type }}linux.2xlarge" },
{ config: "default", shard: 3, num_shards: 3, runner: "${{ needs.get-label-type.outputs.label-type }}linux.2xlarge" },
{ config: "default", shard: 1, num_shards: 4, runner: "${{ needs.get-label-type.outputs.label-type }}linux.2xlarge" },
{ config: "default", shard: 2, num_shards: 4, runner: "${{ needs.get-label-type.outputs.label-type }}linux.2xlarge" },
{ config: "default", shard: 3, num_shards: 4, runner: "${{ needs.get-label-type.outputs.label-type }}linux.2xlarge" },
{ config: "default", shard: 4, num_shards: 4, runner: "${{ needs.get-label-type.outputs.label-type }}linux.2xlarge" },
]}
parallelnative-linux-jammy-py3_9-gcc11-test:
@ -339,10 +340,10 @@ jobs:
docker-image-name: pytorch-linux-focal-cuda12.1-cudnn9-py3-gcc9
test-matrix: |
{ include: [
{ config: "nogpu_AVX512", shard: 1, num_shards: 2, runner: "${{ needs.get-label-type.outputs.label-type }}linux.2xlarge" },
{ config: "nogpu_AVX512", shard: 2, num_shards: 2, runner: "${{ needs.get-label-type.outputs.label-type }}linux.2xlarge" },
{ config: "nogpu_NO_AVX2", shard: 1, num_shards: 2, runner: "${{ needs.get-label-type.outputs.label-type }}linux.2xlarge" },
{ config: "nogpu_NO_AVX2", shard: 2, num_shards: 2, runner: "${{ needs.get-label-type.outputs.label-type }}linux.2xlarge" },
{ config: "nogpu_AVX512", shard: 1, num_shards: 2, runner: "${{ needs.get-label-type.outputs.label-type }}linux.4xlarge" },
{ config: "nogpu_AVX512", shard: 2, num_shards: 2, runner: "${{ needs.get-label-type.outputs.label-type }}linux.4xlarge" },
{ config: "nogpu_NO_AVX2", shard: 1, num_shards: 2, runner: "${{ needs.get-label-type.outputs.label-type }}linux.4xlarge" },
{ config: "nogpu_NO_AVX2", shard: 2, num_shards: 2, runner: "${{ needs.get-label-type.outputs.label-type }}linux.4xlarge" },
{ config: "jit_legacy", shard: 1, num_shards: 1, runner: "${{ needs.get-label-type.outputs.label-type }}linux.4xlarge.nvidia.gpu" },
]}

View File

@ -185,10 +185,10 @@ jobs:
docker-image-name: pytorch-linux-focal-py3.9-clang10
test-matrix: |
{ include: [
{ config: "default", shard: 1, num_shards: 4, runner: "${{ needs.get-label-type.outputs.label-type }}linux.2xlarge" },
{ config: "default", shard: 2, num_shards: 4, runner: "${{ needs.get-label-type.outputs.label-type }}linux.2xlarge" },
{ config: "default", shard: 3, num_shards: 4, runner: "${{ needs.get-label-type.outputs.label-type }}linux.2xlarge" },
{ config: "default", shard: 4, num_shards: 4, runner: "${{ needs.get-label-type.outputs.label-type }}linux.2xlarge" },
{ config: "default", shard: 1, num_shards: 4, runner: "${{ needs.get-label-type.outputs.label-type }}linux.4xlarge" },
{ config: "default", shard: 2, num_shards: 4, runner: "${{ needs.get-label-type.outputs.label-type }}linux.4xlarge" },
{ config: "default", shard: 3, num_shards: 4, runner: "${{ needs.get-label-type.outputs.label-type }}linux.4xlarge" },
{ config: "default", shard: 4, num_shards: 4, runner: "${{ needs.get-label-type.outputs.label-type }}linux.4xlarge" },
{ config: "crossref", shard: 1, num_shards: 2, runner: "${{ needs.get-label-type.outputs.label-type }}linux.2xlarge" },
{ config: "crossref", shard: 2, num_shards: 2, runner: "${{ needs.get-label-type.outputs.label-type }}linux.2xlarge" },
{ config: "dynamo", shard: 1, num_shards: 3, runner: "${{ needs.get-label-type.outputs.label-type }}linux.2xlarge" },
@ -217,10 +217,10 @@ jobs:
docker-image-name: pytorch-linux-focal-py3.11-clang10
test-matrix: |
{ include: [
{ config: "default", shard: 1, num_shards: 4, runner: "${{ needs.get-label-type.outputs.label-type }}linux.2xlarge" },
{ config: "default", shard: 2, num_shards: 4, runner: "${{ needs.get-label-type.outputs.label-type }}linux.2xlarge" },
{ config: "default", shard: 3, num_shards: 4, runner: "${{ needs.get-label-type.outputs.label-type }}linux.2xlarge" },
{ config: "default", shard: 4, num_shards: 4, runner: "${{ needs.get-label-type.outputs.label-type }}linux.2xlarge" },
{ config: "default", shard: 1, num_shards: 4, runner: "${{ needs.get-label-type.outputs.label-type }}linux.4xlarge" },
{ config: "default", shard: 2, num_shards: 4, runner: "${{ needs.get-label-type.outputs.label-type }}linux.4xlarge" },
{ config: "default", shard: 3, num_shards: 4, runner: "${{ needs.get-label-type.outputs.label-type }}linux.4xlarge" },
{ config: "default", shard: 4, num_shards: 4, runner: "${{ needs.get-label-type.outputs.label-type }}linux.4xlarge" },
{ config: "crossref", shard: 1, num_shards: 2, runner: "${{ needs.get-label-type.outputs.label-type }}linux.2xlarge" },
{ config: "crossref", shard: 2, num_shards: 2, runner: "${{ needs.get-label-type.outputs.label-type }}linux.2xlarge" },
{ config: "dynamo", shard: 1, num_shards: 3, runner: "${{ needs.get-label-type.outputs.label-type }}linux.2xlarge" },
@ -251,10 +251,10 @@ jobs:
docker-image-name: pytorch-linux-focal-py3.12-clang10
test-matrix: |
{ include: [
{ config: "default", shard: 1, num_shards: 4, runner: "${{ needs.get-label-type.outputs.label-type }}linux.2xlarge" },
{ config: "default", shard: 2, num_shards: 4, runner: "${{ needs.get-label-type.outputs.label-type }}linux.2xlarge" },
{ config: "default", shard: 3, num_shards: 4, runner: "${{ needs.get-label-type.outputs.label-type }}linux.2xlarge" },
{ config: "default", shard: 4, num_shards: 4, runner: "${{ needs.get-label-type.outputs.label-type }}linux.2xlarge" },
{ config: "default", shard: 1, num_shards: 4, runner: "${{ needs.get-label-type.outputs.label-type }}linux.4xlarge" },
{ config: "default", shard: 2, num_shards: 4, runner: "${{ needs.get-label-type.outputs.label-type }}linux.4xlarge" },
{ config: "default", shard: 3, num_shards: 4, runner: "${{ needs.get-label-type.outputs.label-type }}linux.4xlarge" },
{ config: "default", shard: 4, num_shards: 4, runner: "${{ needs.get-label-type.outputs.label-type }}linux.4xlarge" },
{ config: "dynamo", shard: 1, num_shards: 3, runner: "${{ needs.get-label-type.outputs.label-type }}linux.2xlarge" },
{ config: "dynamo", shard: 2, num_shards: 3, runner: "${{ needs.get-label-type.outputs.label-type }}linux.2xlarge" },
{ config: "dynamo", shard: 3, num_shards: 3, runner: "${{ needs.get-label-type.outputs.label-type }}linux.2xlarge" },
@ -588,9 +588,9 @@ jobs:
docker-image-name: pytorch-linux-focal-py3.12-clang10
test-matrix: |
{ include: [
{ config: "default", shard: 1, num_shards: 3, runner: "linux.2xlarge" },
{ config: "default", shard: 2, num_shards: 3, runner: "linux.2xlarge" },
{ config: "default", shard: 3, num_shards: 3, runner: "linux.2xlarge" },
{ config: "default", shard: 1, num_shards: 3, runner: "linux.4xlarge" },
{ config: "default", shard: 2, num_shards: 3, runner: "linux.4xlarge" },
{ config: "default", shard: 3, num_shards: 3, runner: "linux.4xlarge" },
{ config: "dynamo", shard: 1, num_shards: 3, runner: "linux.2xlarge" },
{ config: "dynamo", shard: 2, num_shards: 3, runner: "linux.2xlarge" },
{ config: "dynamo", shard: 3, num_shards: 3, runner: "linux.2xlarge" },

.github/workflows/s390.yml vendored Normal file (+24 lines)
View File

@ -0,0 +1,24 @@
name: s390
on:
push:
branches:
- main
tags:
- ciflow/s390/*
workflow_dispatch:
concurrency:
group: ${{ github.workflow }}-${{ github.event.pull_request.number || github.ref_name }}-${{ github.ref_type == 'branch' && github.sha }}-${{ github.event_name == 'workflow_dispatch' }}-${{ github.event_name == 'schedule' }}
cancel-in-progress: true
permissions: read-all
jobs:
linux-manylinux-2_28-py3-cpu-s390x-build:
name: linux-manylinux-2_28-py3-cpu-s390x
uses: ./.github/workflows/_linux-build.yml
with:
build-environment: linux-s390x-binary-manywheel
docker-image-name: pytorch/manylinuxs390x-builder:cpu-s390x-main
runner: linux.s390x

View File

@ -56,14 +56,14 @@ jobs:
cuda-arch-list: 8.6
test-matrix: |
{ include: [
{ config: "default", shard: 1, num_shards: 8, runner: "${{ needs.get-label-type.outputs.label-type }}linux.g5.4xlarge.nvidia.gpu" },
{ config: "default", shard: 2, num_shards: 8, runner: "${{ needs.get-label-type.outputs.label-type }}linux.g5.4xlarge.nvidia.gpu" },
{ config: "default", shard: 3, num_shards: 8, runner: "${{ needs.get-label-type.outputs.label-type }}linux.g5.4xlarge.nvidia.gpu" },
{ config: "default", shard: 4, num_shards: 8, runner: "${{ needs.get-label-type.outputs.label-type }}linux.g5.4xlarge.nvidia.gpu" },
{ config: "default", shard: 5, num_shards: 8, runner: "${{ needs.get-label-type.outputs.label-type }}linux.g5.4xlarge.nvidia.gpu" },
{ config: "default", shard: 6, num_shards: 8, runner: "${{ needs.get-label-type.outputs.label-type }}linux.g5.4xlarge.nvidia.gpu" },
{ config: "default", shard: 7, num_shards: 8, runner: "${{ needs.get-label-type.outputs.label-type }}linux.g5.4xlarge.nvidia.gpu" },
{ config: "default", shard: 8, num_shards: 8, runner: "${{ needs.get-label-type.outputs.label-type }}linux.g5.4xlarge.nvidia.gpu" },
{ config: "default", shard: 1, num_shards: 8, runner: "linux.g5.4xlarge.nvidia.gpu" },
{ config: "default", shard: 2, num_shards: 8, runner: "linux.g5.4xlarge.nvidia.gpu" },
{ config: "default", shard: 3, num_shards: 8, runner: "linux.g5.4xlarge.nvidia.gpu" },
{ config: "default", shard: 4, num_shards: 8, runner: "linux.g5.4xlarge.nvidia.gpu" },
{ config: "default", shard: 5, num_shards: 8, runner: "linux.g5.4xlarge.nvidia.gpu" },
{ config: "default", shard: 6, num_shards: 8, runner: "linux.g5.4xlarge.nvidia.gpu" },
{ config: "default", shard: 7, num_shards: 8, runner: "linux.g5.4xlarge.nvidia.gpu" },
{ config: "default", shard: 8, num_shards: 8, runner: "linux.g5.4xlarge.nvidia.gpu" },
]}
linux-focal-cuda12_1-py3-gcc9-slow-gradcheck-test:
@ -89,9 +89,9 @@ jobs:
cuda-arch-list: 8.6
test-matrix: |
{ include: [
{ config: "slow", shard: 1, num_shards: 3, runner: "${{ needs.get-label-type.outputs.label-type }}linux.g5.4xlarge.nvidia.gpu" },
{ config: "slow", shard: 2, num_shards: 3, runner: "${{ needs.get-label-type.outputs.label-type }}linux.g5.4xlarge.nvidia.gpu" },
{ config: "slow", shard: 3, num_shards: 3, runner: "${{ needs.get-label-type.outputs.label-type }}linux.g5.4xlarge.nvidia.gpu" },
{ config: "slow", shard: 1, num_shards: 3, runner: "linux.g5.4xlarge.nvidia.gpu" },
{ config: "slow", shard: 2, num_shards: 3, runner: "linux.g5.4xlarge.nvidia.gpu" },
{ config: "slow", shard: 3, num_shards: 3, runner: "linux.g5.4xlarge.nvidia.gpu" },
]}
linux-focal-cuda12_1-py3_10-gcc9-sm86-test:
@ -115,8 +115,8 @@ jobs:
docker-image-name: pytorch-linux-focal-py3.9-clang10
test-matrix: |
{ include: [
{ config: "slow", shard: 1, num_shards: 2, runner: "${{ needs.get-label-type.outputs.label-type }}linux.2xlarge" },
{ config: "slow", shard: 2, num_shards: 2, runner: "${{ needs.get-label-type.outputs.label-type }}linux.2xlarge" },
{ config: "slow", shard: 1, num_shards: 2, runner: "linux.2xlarge" },
{ config: "slow", shard: 2, num_shards: 2, runner: "linux.2xlarge" },
]}
linux-focal-py3_9-clang10-test:
@ -168,9 +168,9 @@ jobs:
docker-image-name: pytorch-linux-jammy-py3-clang15-asan
test-matrix: |
{ include: [
{ config: "slow", shard: 1, num_shards: 3, runner: "${{ needs.get-label-type.outputs.label-type }}linux.4xlarge" },
{ config: "slow", shard: 2, num_shards: 3, runner: "${{ needs.get-label-type.outputs.label-type }}linux.4xlarge" },
{ config: "slow", shard: 3, num_shards: 3, runner: "${{ needs.get-label-type.outputs.label-type }}linux.4xlarge" },
{ config: "slow", shard: 1, num_shards: 3, runner: "linux.4xlarge" },
{ config: "slow", shard: 2, num_shards: 3, runner: "linux.4xlarge" },
{ config: "slow", shard: 3, num_shards: 3, runner: "linux.4xlarge" },
]}
sync-tag: asan-build
secrets: inherit

View File

@ -266,10 +266,10 @@ jobs:
docker-image-name: pytorch-linux-focal-cuda12.4-cudnn9-py3-gcc9
test-matrix: |
{ include: [
{ config: "nogpu_AVX512", shard: 1, num_shards: 2, runner: "${{ needs.get-label-type.outputs.label-type }}linux.2xlarge" },
{ config: "nogpu_AVX512", shard: 2, num_shards: 2, runner: "${{ needs.get-label-type.outputs.label-type }}linux.2xlarge" },
{ config: "nogpu_NO_AVX2", shard: 1, num_shards: 2, runner: "${{ needs.get-label-type.outputs.label-type }}linux.2xlarge" },
{ config: "nogpu_NO_AVX2", shard: 2, num_shards: 2, runner: "${{ needs.get-label-type.outputs.label-type }}linux.2xlarge" },
{ config: "nogpu_AVX512", shard: 1, num_shards: 2, runner: "${{ needs.get-label-type.outputs.label-type }}linux.4xlarge" },
{ config: "nogpu_AVX512", shard: 2, num_shards: 2, runner: "${{ needs.get-label-type.outputs.label-type }}linux.4xlarge" },
{ config: "nogpu_NO_AVX2", shard: 1, num_shards: 2, runner: "${{ needs.get-label-type.outputs.label-type }}linux.4xlarge" },
{ config: "nogpu_NO_AVX2", shard: 2, num_shards: 2, runner: "${{ needs.get-label-type.outputs.label-type }}linux.4xlarge" },
{ config: "jit_legacy", shard: 1, num_shards: 1, runner: "${{ needs.get-label-type.outputs.label-type }}linux.4xlarge.nvidia.gpu" },
{ config: "default", shard: 1, num_shards: 5, runner: "${{ needs.get-label-type.outputs.label-type }}linux.4xlarge.nvidia.gpu" },
{ config: "default", shard: 2, num_shards: 5, runner: "${{ needs.get-label-type.outputs.label-type }}linux.4xlarge.nvidia.gpu" },
@ -316,11 +316,3 @@ jobs:
build-environment: linux-focal-cuda11.8-py3.10-gcc9-experimental-split-build
docker-image: ${{ needs.linux-focal-cuda11_8-py3_10-gcc9-experimental-split-build.outputs.docker-image }}
test-matrix: ${{ needs.linux-focal-cuda11_8-py3_10-gcc9-experimental-split-build.outputs.test-matrix }}
linux-manylinux-2_28-py3-cpu-s390x-build:
name: linux-manylinux-2_28-py3-cpu-s390x
uses: ./.github/workflows/_linux-build.yml
with:
build-environment: linux-s390x-binary-manywheel
docker-image-name: pytorch/manylinuxs390x-builder:cpu-s390x-main
runner: linux.s390x

View File

@ -28,7 +28,7 @@ jobs:
check-latest: false
cache: pip
architecture: x64
- run: pip install pyyaml==6.0 rockset==1.0.3
- run: pip install pyyaml==6.0
- name: Setup committer id
run: |
@ -43,7 +43,6 @@ jobs:
COMMENT_ID: ${{ github.event.client_payload.comment_id }}
REBASE: ${{ github.event.client_payload.rebase }}
IGNORE_CURRENT: ${{ github.event.client_payload.ignore_current }}
ROCKSET_API_KEY: ${{ secrets.ROCKSET_API_KEY }}
DRCI_BOT_KEY: ${{ secrets.DRCI_BOT_KEY }}
GITHUB_RUN_ID: ${{ github.run_id }}
run: |

View File

@ -153,7 +153,7 @@ init_command = [
'junitparser==2.1.1',
'rich==10.9.0',
'pyyaml==6.0.1',
'optree==0.12.1',
'optree==0.13.0',
]
[[linter]]
@ -216,6 +216,10 @@ include_patterns = [
'torch/csrc/*.cpp',
'torch/csrc/**/*.h',
'torch/csrc/**/*.cpp',
'torch/csrc/distributed/autograd/**/*.cpp',
'torch/csrc/distributed/autograd/**/*.h',
'torch/csrc/distributed/rpc/**/*.cpp',
'torch/csrc/distributed/rpc/**/*.h',
'torch/csrc/jit/serialization/*.h',
'torch/csrc/jit/serialization/*.cpp',
]
@ -246,7 +250,6 @@ exclude_patterns = [
'torch/csrc/inductor/aoti_torch/c/shim.h',
'torch/csrc/jit/**/*',
'torch/csrc/jit/serialization/mobile_bytecode_generated.h',
'torch/csrc/lazy/**/*',
]
init_command = [
'python3',
@ -1255,7 +1258,6 @@ exclude_patterns = [
'torch/fx/experimental/refinement_types.py',
'torch/fx/experimental/rewriter.py',
'torch/fx/experimental/schema_type_annotation.py',
'torch/fx/experimental/symbolic_shapes.py',
'torch/fx/experimental/unification/__init__.py',
'torch/fx/experimental/unification/core.py',
'torch/fx/experimental/unification/dispatch.py',
@ -1271,7 +1273,6 @@ exclude_patterns = [
'torch/fx/experimental/unification/utils.py',
'torch/fx/experimental/unification/variable.py',
'torch/fx/experimental/unify_refinements.py',
'torch/fx/experimental/validator.py',
'torch/fx/graph.py',
'torch/fx/graph_module.py',
'torch/fx/interpreter.py',

View File

@ -1083,8 +1083,16 @@ if(NOT MSVC)
append_cxx_flag_if_supported("-Wno-unused-but-set-variable" CMAKE_CXX_FLAGS)
append_cxx_flag_if_supported("-Wno-maybe-uninitialized" CMAKE_CXX_FLAGS)
append_cxx_flag_if_supported("-fstandalone-debug" CMAKE_CXX_FLAGS_DEBUG)
string(APPEND CMAKE_CXX_FLAGS_DEBUG " -fno-omit-frame-pointer -O0")
string(APPEND CMAKE_LINKER_FLAGS_DEBUG " -fno-omit-frame-pointer -O0")
if(CMAKE_SYSTEM_PROCESSOR MATCHES "aarch64" AND CMAKE_CXX_COMPILER_ID MATCHES "GNU")
if(CMAKE_BUILD_TYPE MATCHES Debug)
message(WARNING "Applying -Og optimization for aarch64 GCC debug builds to work around a compiler ICE")
endif()
string(APPEND CMAKE_CXX_FLAGS_DEBUG " -fno-omit-frame-pointer -Og")
string(APPEND CMAKE_LINKER_FLAGS_DEBUG " -fno-omit-frame-pointer -Og")
else()
string(APPEND CMAKE_CXX_FLAGS_DEBUG " -fno-omit-frame-pointer -O0")
string(APPEND CMAKE_LINKER_FLAGS_DEBUG " -fno-omit-frame-pointer -O0")
endif()
append_cxx_flag_if_supported("-fno-math-errno" CMAKE_CXX_FLAGS)
append_cxx_flag_if_supported("-fno-trapping-math" CMAKE_CXX_FLAGS)
append_cxx_flag_if_supported("-Werror=format" CMAKE_CXX_FLAGS)

View File

@ -121,7 +121,7 @@ torch/profiler/ @aaronenyeshi @sraikund16
test/functorch/test_aotdispatch.py @ezyang @Chillee
# Dataloader
torch/utils/data/ @andrewkho @gokulavasan
torch/utils/data/ @andrewkho @divyanshk
# hipify
torch/utils/hipify/ @jeffdaily @jithunnair-amd

View File

@ -208,6 +208,8 @@ If you want to compile with ROCm support, install
- [AMD ROCm](https://rocm.docs.amd.com/en/latest/deploy/linux/quick_start.html) 4.0 and above installation
- ROCm is currently supported only for Linux systems.
By default, the build system expects ROCm to be installed in `/opt/rocm`. If ROCm is installed in a different directory, set the `ROCM_PATH` environment variable to the ROCm installation directory. The build system automatically detects the AMD GPU architecture; optionally, it can be set explicitly with the `PYTORCH_ROCM_ARCH` environment variable (see the [supported AMD GPU architectures](https://rocm.docs.amd.com/projects/install-on-linux/en/latest/reference/system-requirements.html#supported-gpus)).
If you want to disable ROCm support, export the environment variable `USE_ROCM=0`.
Other potentially useful environment variables may be found in `setup.py`.
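As a rough sketch (the installation path, the `gfx90a` architecture string, and the `setup.py develop` invocation below are illustrative assumptions, not project defaults), these variables can be exported from a small Python driver before building:
```
# Hypothetical example: configure a ROCm build with a non-default install location.
import os
import subprocess

os.environ["ROCM_PATH"] = "/opt/rocm-6.1"    # ROCm installed outside /opt/rocm (example path)
os.environ["PYTORCH_ROCM_ARCH"] = "gfx90a"   # optionally pin the target GPU architecture
# os.environ["USE_ROCM"] = "0"               # uncomment to disable ROCm support entirely

subprocess.run(["python", "setup.py", "develop"], check=True)  # assumed build entry point
```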

View File

@ -467,6 +467,9 @@ if(NOT EMSCRIPTEN AND NOT INTERN_BUILD_MOBILE)
endif()
if(USE_CUDA AND NOT USE_ROCM)
add_definitions(-DCUTLASS_ENABLE_TENSOR_CORE_MMA=1)
add_definitions(-DCUTLASS_ENABLE_SM90_EXTENDED_MMA_SHAPES=1)
add_definitions(-DCUTE_SM90_EXTENDED_MMA_SHAPES_ENABLED)
list(APPEND ATen_CUDA_INCLUDE ${CMAKE_CURRENT_SOURCE_DIR}/../../../third_party/cutlass/include)
list(APPEND ATen_CUDA_INCLUDE ${CMAKE_CURRENT_SOURCE_DIR}/../../../third_party/cutlass/tools/util/include)
if($ENV{ATEN_STATIC_CUDA})

View File

@ -39,25 +39,16 @@ class TORCH_API Context {
const Generator& defaultGenerator(Device device) {
c10::DeviceType device_type = device.type();
initCUDAIfNeeded(device_type);
initHIPIfNeeded(device_type);
lazyInitDevice(device_type);
if (device_type == at::kCPU) {
return at::detail::getDefaultCPUGenerator();
} else if (device_type == at::kCUDA) {
return at::detail::getCUDAHooks().getDefaultCUDAGenerator(device.index());
} else if (device_type == at::kMPS) {
return at::detail::getMPSHooks().getDefaultMPSGenerator();
} else if (device_type == at::kXPU) {
return at::detail::getXPUHooks().getDefaultXPUGenerator(device.index());
} else if (device_type == at::kIPU) {
return at::detail::getIPUHooks().getDefaultIPUGenerator(device.index());
} else if (device_type == at::kPrivateUse1) {
return at::detail::getPrivateUse1Hooks().getDefaultGenerator(
device.index());
} else {
AT_ERROR(c10::DeviceTypeName(device_type), " device type not enabled.");
return getAcceleratorHooksInterface(device_type)
.getDefaultGenerator(device.index());
}
}
const AcceleratorHooksInterface& getAcceleratorHooksInterface(
std::optional<c10::DeviceType> opt_device_type = std::nullopt) {
c10::DeviceType device_type = opt_device_type.has_value()
@ -80,10 +71,10 @@ class TORCH_API Context {
c10::DeviceTypeName(device_type), " device type not an accelerator.");
}
}
Device getDeviceFromPtr(void* data, c10::DeviceType device_type) {
initCUDAIfNeeded(device_type);
initHIPIfNeeded(device_type);
initXPUIfNeeded(device_type);
lazyInitDevice(device_type);
if (device_type == at::kCPU) {
return c10::DeviceType::CPU;
} else if (device_type == at::kCUDA) {
@ -96,6 +87,7 @@ class TORCH_API Context {
AT_ERROR(c10::DeviceTypeName(device_type), " device type not enabled.");
}
}
bool isPinnedPtr(
const void* data,
std::optional<c10::DeviceType> device_type = std::nullopt) {
@ -106,13 +98,22 @@ class TORCH_API Context {
opt_device_type.value())) { // passed device not an accelerator
return false;
}
return getAcceleratorHooksInterface(opt_device_type.value())
.isPinnedPtr(data);
return getAcceleratorHooksInterface(opt_device_type).isPinnedPtr(data);
}
Allocator* getPinnedMemoryAllocator(
std::optional<c10::DeviceType> device_type = std::nullopt) {
return getAcceleratorHooksInterface(device_type).getPinnedMemoryAllocator();
}
void lazyInitDevice(c10::DeviceType device_type) {
if (device_type != at::kCPU) {
c10::call_once(init_[static_cast<int8_t>(device_type)], [&] {
getAcceleratorHooksInterface(device_type).init();
});
}
}
static bool hasOpenMP();
static bool hasMKL();
static bool hasLAPACK();
@ -165,27 +166,6 @@ class TORCH_API Context {
static bool hasMAIA() {
return c10::impl::hasDeviceGuardImpl(c10::DeviceType::MAIA);
}
// defined in header so that getNonVariableType has ability to inline
// call_once check. getNonVariableType is called fairly frequently
void lazyInitCUDA() {
c10::call_once(thc_init, [&] { detail::getCUDAHooks().initCUDA(); });
}
void lazyInitHIP() {
c10::call_once(thh_init, [&] { detail::getHIPHooks().initHIP(); });
}
void lazyInitXPU() {
c10::call_once(thx_init, [&] { detail::getXPUHooks().initXPU(); });
}
void lazyInitMTIA() {
c10::call_once(th_mtia_init, [&] { detail::getMTIAHooks().initMTIA(); });
}
void lazyInitPrivateUse1() {
c10::call_once(thp_init, [&] {
if (isPrivateUse1HooksRegistered()) {
at::detail::getPrivateUse1Hooks().initPrivateUse1();
}
});
}
static const at::cuda::NVRTC& getNVRTC() {
return detail::getCUDAHooks().nvrtc();
}
@ -361,27 +341,8 @@ class TORCH_API Context {
void setAllowFP16ReductionCPU(bool);
private:
void initCUDAIfNeeded(c10::DeviceType p) {
if (p == c10::DeviceType::CUDA) {
lazyInitCUDA();
}
}
void initHIPIfNeeded(c10::DeviceType p) {
if (p == c10::DeviceType::HIP) {
lazyInitHIP();
}
}
void initXPUIfNeeded(c10::DeviceType p) {
if (p == c10::DeviceType::XPU) {
lazyInitXPU();
}
}
static bool checkCuBLASConfigDeterministic();
c10::once_flag thc_init;
c10::once_flag thh_init;
c10::once_flag thx_init;
c10::once_flag th_mtia_init;
c10::once_flag thp_init;
std::array<c10::once_flag, at::COMPILE_TIME_MAX_DEVICE_TYPES> init_;
bool enabled_cudnn = true;
bool deterministic_cudnn = false;
bool deterministic_mkldnn = false;
@ -513,7 +474,7 @@ inline size_t getNumGPUs() {
"to be CUDA (e.g., when you say CUDA, on a HIP build of ATen, this actually "
"means HIP. Rebuild PyTorch with one or the other disabled.");
} else if (hasCUDA()) {
return detail::getCUDAHooks().getNumGPUs();
return detail::getCUDAHooks().deviceCount();
} else if (hasHIP()) {
return detail::getHIPHooks().getNumGPUs();
} else {
@ -550,7 +511,7 @@ inline void manual_seed(uint64_t seed) {
}
// NB: Sometimes we build with CUDA, but we don't have any GPUs
// available. In that case, we must not seed CUDA; it will fail!
const auto cuda_num_gpus = detail::getCUDAHooks().getNumGPUs();
const auto cuda_num_gpus = detail::getCUDAHooks().deviceCount();
if (hasCUDA() && cuda_num_gpus > 0) {
for (const auto i : c10::irange(cuda_num_gpus)) {
auto cuda_gen = globalContext().defaultGenerator(
@ -563,7 +524,7 @@ inline void manual_seed(uint64_t seed) {
}
}
const auto xpu_num_gpus = detail::getXPUHooks().getNumGPUs();
const auto xpu_num_gpus = detail::getXPUHooks().deviceCount();
if (hasXPU() && xpu_num_gpus) {
for (const auto i : c10::irange(xpu_num_gpus)) {
auto xpu_gen = globalContext().defaultGenerator(

View File

@ -18,6 +18,8 @@ c10::Allocator* GetCPUAllocatorMaybePinned(bool pin_memory) {
// To properly support this, see https://github.com/pytorch/pytorch/issues/14560
if (at::globalContext().hasCUDA()) {
return at::detail::getCUDAHooks().getPinnedMemoryAllocator();
} else if (at::globalContext().hasMTIA()) {
return at::detail::getMTIAHooks().getPinnedMemoryAllocator();
} else if (at::globalContext().hasXPU()) {
return at::detail::getXPUHooks().getPinnedMemoryAllocator();
} else if(at::isPrivateUse1HooksRegistered()) {

View File

@ -420,15 +420,15 @@ inline c10::MaybeOwned<Tensor> expand_size(
inline std::vector<Tensor> expand_outplace(TensorList to_expand) {
// expands a list of Tensors; ignores undefined (null) tensors
bool first = true;
DimVector sizes;
SymDimVector sizes;
for (const auto i : c10::irange(to_expand.size())) {
if (!to_expand[i].defined()) {
continue;
} else if (first) {
sizes = to_expand[i].sizes();
sizes = to_expand[i].sym_sizes();
first = false;
} else {
sizes = infer_size_dimvector(sizes, to_expand[i].sizes());
sizes = infer_size_symdimvector(sizes, to_expand[i].sym_sizes());
}
}
@ -436,10 +436,10 @@ inline std::vector<Tensor> expand_outplace(TensorList to_expand) {
for (const auto i : c10::irange(to_expand.size())) {
if (!to_expand[i].defined()) {
continue;
} else if (to_expand[i].sizes().equals(sizes)) {
} else if (to_expand[i].sym_sizes().equals(sizes)) {
result[i] = to_expand[i];
} else {
result[i] = to_expand[i].expand(sizes);
result[i] = to_expand[i].expand_symint(sizes);
}
}
return result;

View File

@ -61,9 +61,8 @@ void set_num_threads(int nthreads) {
#endif
#ifdef USE_PTHREADPOOL
// because PyTorch uses caffe2::pthreadpool() in QNNPACK
caffe2::PThreadPool* const pool = caffe2::pthreadpool();
caffe2::PThreadPool* const pool = caffe2::pthreadpool(nthreads);
TORCH_INTERNAL_ASSERT(pool, "Invalid thread pool!");
pool->set_thread_count(nthreads);
#endif
#if AT_MKLDNN_ENABLED()
at::native::mkldnn::clear_computation_cache();

View File

@ -19,7 +19,7 @@ Tensor& scalar_fill(Tensor& self, const Scalar& value) {
AT_DISPATCH_V2(
self.scalar_type(), "fill_out", AT_WRAP([&]() {
fill_inplace<scalar_t>(self, value);
}), kComplexHalf, kHalf, kBool, kBFloat16, AT_EXPAND(AT_ALL_TYPES_AND_COMPLEX), AT_EXPAND(AT_BAREBONES_UNSIGNED_TYPES));
}), kComplexHalf, kHalf, kBool, kBFloat16, AT_EXPAND(AT_ALL_TYPES_AND_COMPLEX), AT_EXPAND(AT_FLOAT8_TYPES), AT_EXPAND(AT_BAREBONES_UNSIGNED_TYPES));
return self;
}

View File

@ -144,8 +144,8 @@ class CheckSparseTensorInvariants {
bool old_state;
public:
CheckSparseTensorInvariants(bool state) {
old_state = at::globalContext().checkSparseTensorInvariants();
CheckSparseTensorInvariants(bool state)
: old_state(at::globalContext().checkSparseTensorInvariants()) {
at::globalContext().setCheckSparseTensorInvariants(state);
}

View File

@ -255,7 +255,9 @@ inline Tensor applySelect(
// the other hand, indexing wrapping is valid for all negative int64_t
// values, as x[INT64_MIN] is the same as x[INT64_MAX]
TORCH_CHECK_INDEX(
size > -1 - index && size > index,
size.sym_gt(-1 - index)
.sym_and(size.sym_gt(index))
.expect_true(__FILE__, __LINE__),
"index ",
index,
" is out of bounds for dimension ",

View File

@ -82,7 +82,7 @@ class TORCH_API ThreadLocalState {
!defined(BUILD_LITE_INTERPRETER)
// TLS for autocast dtypes
std::array<at::ScalarType, at::COMPILE_TIME_MAX_DEVICE_TYPES>
autocast_dtypes_;
autocast_dtypes_{};
#endif
friend class ThreadLocalStateGuard;

View File

@ -111,17 +111,6 @@ template <
typename E,
typename B = HostBlock<S>>
struct CachingHostAllocatorImpl {
CachingHostAllocatorImpl() {
// Launch the background thread and process events in a loop.
if (pinned_use_background_threads()) {
getBackgroundThreadPool()->run([&]() {
while (true) {
process_events();
std::this_thread::sleep_for(std::chrono::microseconds(100));
}
});
}
}
virtual ~CachingHostAllocatorImpl() = default;
public:
@ -155,6 +144,17 @@ struct CachingHostAllocatorImpl {
if (block) {
return {block->ptr_, reinterpret_cast<void*>(block)};
}
// Launch the background thread and process events in a loop.
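// The thread is started lazily here, on the first allocation, guarded by the call_once below.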
static c10::once_flag background_thread_flag;
c10::call_once(background_thread_flag, [this] {
getBackgroundThreadPool()->run([&]() {
while (true) {
process_events();
std::this_thread::sleep_for(std::chrono::microseconds(100));
}
});
});
}
// Slow path: if we can't allocate from the cached free list, we need

View File

@ -13,8 +13,6 @@
#include <ATen/core/Array.h>
#include <c10/macros/Macros.h>
#include <c10/util/Exception.h>
#include <c10/util/Half.h>
#include <cmath>
#include <cstdint>

View File

@ -17,8 +17,22 @@ TORCH_SDT_DEFINE_SEMAPHORE(operator_end)
#endif
bool show_dispatch_trace() {
static char const* temp = getenv("TORCH_SHOW_DISPATCH_TRACE");
return temp != nullptr;
static auto envar = std::getenv("TORCH_SHOW_DISPATCH_TRACE");
if (envar) {
if (strcmp(envar, "0") == 0) {
return false;
}
if (strcmp(envar, "1") == 0) {
return true;
}
TORCH_WARN(
"ignoring invalid value for TORCH_SHOW_DISPATCH_TRACE: ",
envar,
" valid values are 0 or 1.");
}
return false;
}
static thread_local int64_t dispatch_trace_nesting_value_;

View File

@ -84,6 +84,14 @@ bool init_amx() {
#endif
}
bool is_arm_sve_supported() {
#if !defined(__s390x__) && !defined(__powerpc__)
return cpuinfo_initialize() && cpuinfo_has_arm_sve();
#else
return false;
#endif
}
static uint32_t get_cache_size(int level) {
#if !defined(__s390x__) && !defined(__powerpc__)
if (!cpuinfo_initialize()) {

View File

@ -21,6 +21,9 @@ TORCH_API bool is_amx_tile_supported();
// Enable the system to use AMX instructions.
TORCH_API bool init_amx();
// Detect if CPU supports Arm(R) architecture SVE ISA
TORCH_API bool is_arm_sve_supported();
// Get the L1 cache size per core in Byte
TORCH_API uint32_t L1d_cache_size();

View File

@ -107,6 +107,30 @@ struct VecReduceAllSIMD<float, Op> {
};
#endif // defined(__aarch64__)
#if defined(__aarch64__) && !defined(C10_MOBILE) && !defined(__CUDACC__) && defined(CPU_CAPABILITY_SVE256)
template <typename Op>
struct VecReduceAllSIMD<float, Op> {
static inline float apply(const Op& vec_fun, const Vectorized<float>& acc_vec) {
using Vec = Vectorized<float>;
Vec v = acc_vec;
// 128-bit shuffle
svuint32_t ind = svdupq_n_u32(4, 5, 6, 7);
Vec v1 = svtbl_f32(v, ind);
v = vec_fun(v, v1);
// 64-bit shuffle
ind = svdupq_n_u32(2, 3, 0, 1);
v1 = svtbl_f32(v, ind);
v = vec_fun(v, v1);
// 32-bit shuffle
ind = svdupq_n_u32(1, 0, 2, 3);
v1 = svtbl_f32(v, ind);
v = vec_fun(v, v1);
return svlasta(svpfalse(), v);
}
};
#endif // defined(__aarch64__)
template <typename scalar_t, typename Op>
inline scalar_t vec_reduce_all(const Op& vec_fun, const Vectorized<scalar_t>& acc_vec) {
return VecReduceAllSIMD<scalar_t, Op>::apply(vec_fun, acc_vec);

View File

@ -306,11 +306,10 @@ struct VecConvert<float, 1, BFloat16, 1> {
const VectorizedN<BFloat16, 1>& src) {
VectorizedN<float, 1> result;
uint16x8_t u16_8 = vld1q_u16(reinterpret_cast<const uint16_t*>(&src[0]));
int32x4_t shift = vdupq_n_s32(16);
auto u16_low1 = vget_low_u16(u16_8);
auto u16_high1 = vget_high_u16(u16_8);
float32x4_t f32x4_0 = vreinterpretq_f32_u32(vshlq_u32(vmovl_u16(u16_low1), shift));
float32x4_t f32x4_1 = vreinterpretq_f32_u32(vshlq_u32(vmovl_u16(u16_high1), shift));
float32x4_t f32x4_0 = vreinterpretq_f32_u32(vshlq_n_u32(vmovl_u16(u16_low1), 16));
float32x4_t f32x4_1 = vreinterpretq_f32_u32(vshlq_n_u32(vmovl_u16(u16_high1), 16));
result[0] = {f32x4_0, f32x4_1};
return result;
}

View File

@ -216,27 +216,27 @@ public:
}
Vectorized<float> exp_u20() const {
// A faster version of exp with ULP=20
static __m256 vec_factorial_1 =
const __m256 vec_factorial_1 =
_mm256_set1_ps(0.999999701f); // 1/factorial(1)
static __m256 vec_factorial_2 =
const __m256 vec_factorial_2 =
_mm256_set1_ps(0.499991506f); // 1/factorial(2)
static __m256 vec_factorial_3 =
const __m256 vec_factorial_3 =
_mm256_set1_ps(0.166676521f); // 1/factorial(3)
static __m256 vec_factorial_4 =
const __m256 vec_factorial_4 =
_mm256_set1_ps(0.0418978221f); // 1/factorial(4)
static __m256 vec_factorial_5 =
const __m256 vec_factorial_5 =
_mm256_set1_ps(0.00828929059f); // 1/factorial(5)
static __m256 vec_exp_log2ef =
const __m256 vec_exp_log2ef =
_mm256_castsi256_ps(_mm256_set1_epi32(0x3fb8aa3b)); // log2(e)
static __m256 vec_half = _mm256_set1_ps(0.5f);
static __m256 vec_one = _mm256_set1_ps(1.f);
static __m256 vec_zero = _mm256_set1_ps(0.f);
static __m256 vec_two = _mm256_set1_ps(2.f);
static __m256 vec_ln2f = _mm256_castsi256_ps(_mm256_set1_epi32(0x3f317218)); // ln(2)
static __m256 vec_ln_flt_min = _mm256_castsi256_ps(_mm256_set1_epi32(0xc2aeac50));
static __m256 vec_ln_flt_max = _mm256_castsi256_ps(_mm256_set1_epi32(0x42b17218));
static __m256i vec_127 = _mm256_set1_epi32(0x0000007f);
static int n_mantissa_bits = 23;
const __m256 vec_half = _mm256_set1_ps(0.5f);
const __m256 vec_one = _mm256_set1_ps(1.f);
const __m256 vec_zero = _mm256_set1_ps(0.f);
const __m256 vec_two = _mm256_set1_ps(2.f);
const __m256 vec_ln2f = _mm256_castsi256_ps(_mm256_set1_epi32(0x3f317218)); // ln(2)
const __m256 vec_ln_flt_min = _mm256_castsi256_ps(_mm256_set1_epi32(0xc2aeac50));
const __m256 vec_ln_flt_max = _mm256_castsi256_ps(_mm256_set1_epi32(0x42b17218));
const __m256i vec_127 = _mm256_set1_epi32(0x0000007f);
const int n_mantissa_bits = 23;
// exp(x) =
// = exp(n * ln(2) + r) // divide x by ln(2) and get quot and rem

View File

@ -75,7 +75,7 @@ inline __m256i pack_saturate_and_clamp<int32_t>(
int32_t /*min_val*/,
int32_t /*max_val*/) {
// This function is for linkage only, will not be used
AT_ERROR("pack_saturate_and_clamp<int32_t> is not supported");
TORCH_CHECK(false, "pack_saturate_and_clamp<int32_t> is not supported");
}
template <>

View File

@ -236,27 +236,27 @@ public:
}
Vectorized<float> exp_u20() const {
// A faster version of exp with ULP=20
static __m512 vec_factorial_1 =
const __m512 vec_factorial_1 =
_mm512_set1_ps(0.999999701f); // 1/factorial(1)
static __m512 vec_factorial_2 =
const __m512 vec_factorial_2 =
_mm512_set1_ps(0.499991506f); // 1/factorial(2)
static __m512 vec_factorial_3 =
const __m512 vec_factorial_3 =
_mm512_set1_ps(0.166676521f); // 1/factorial(3)
static __m512 vec_factorial_4 =
const __m512 vec_factorial_4 =
_mm512_set1_ps(0.0418978221f); // 1/factorial(4)
static __m512 vec_factorial_5 =
const __m512 vec_factorial_5 =
_mm512_set1_ps(0.00828929059f); // 1/factorial(5)
static __m512 vec_exp_log2ef =
const __m512 vec_exp_log2ef =
_mm512_castsi512_ps(_mm512_set1_epi32(0x3fb8aa3b)); // log2(e)
static __m512 vec_half = _mm512_set1_ps(0.5f);
static __m512 vec_one = _mm512_set1_ps(1.f);
static __m512 vec_zero = _mm512_set1_ps(0.f);
static __m512 vec_two = _mm512_set1_ps(2.f);
static __m512 vec_ln2f = _mm512_castsi512_ps(_mm512_set1_epi32(0x3f317218)); // ln(2)
static __m512 vec_ln_flt_min = _mm512_castsi512_ps(_mm512_set1_epi32(0xc2aeac50));
static __m512 vec_ln_flt_max = _mm512_castsi512_ps(_mm512_set1_epi32(0x42b17218));
static __m512i vec_127 = _mm512_set1_epi32(0x0000007f);
static int n_mantissa_bits = 23;
const __m512 vec_half = _mm512_set1_ps(0.5f);
const __m512 vec_one = _mm512_set1_ps(1.f);
const __m512 vec_zero = _mm512_set1_ps(0.f);
const __m512 vec_two = _mm512_set1_ps(2.f);
const __m512 vec_ln2f = _mm512_castsi512_ps(_mm512_set1_epi32(0x3f317218)); // ln(2)
const __m512 vec_ln_flt_min = _mm512_castsi512_ps(_mm512_set1_epi32(0xc2aeac50));
const __m512 vec_ln_flt_max = _mm512_castsi512_ps(_mm512_set1_epi32(0x42b17218));
const __m512i vec_127 = _mm512_set1_epi32(0x0000007f);
const int n_mantissa_bits = 23;
// exp(x) =
// = exp(n * ln(2) + r) // divide x by ln(2) and get quot and rem

View File

@ -77,7 +77,7 @@ inline __m512i pack_saturate_and_clamp<int32_t>(
int32_t min_val [[maybe_unused]],
int32_t max_val [[maybe_unused]]) {
// This function is for linkage only, will not be used
AT_ERROR("pack_saturate_and_clamp<int32_t> is not supported");
TORCH_CHECK(false, "pack_saturate_and_clamp<int32_t> is not supported");
return __m512i{};
}

View File

@ -209,8 +209,13 @@ public:
}
return vector;
}
static Vectorized<T> blendv(const Vectorized<T>& a, const Vectorized<T>& b,
const Vectorized<T>& mask) {
// Workaround for https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117001
#if __GNUC__ <= 12 && defined(__ARM_FEATURE_SVE)
static Vectorized<T> __attribute__ ((optimize("-fno-tree-loop-vectorize"))) blendv(const Vectorized<T>& a,
#else
static Vectorized<T> blendv(const Vectorized<T>& a,
#endif
const Vectorized<T>& b, const Vectorized<T>& mask) {
Vectorized vector;
int_same_size_t<T> buffer[size()];
mask.store(buffer);

View File

@ -125,7 +125,7 @@ void CUDAGraph::capture_begin(MempoolId_t pool/*=0*/, cudaStreamCaptureMode capt
// due to the capture status being updated _after_ a capture had already started.
c10::cuda::CUDACachingAllocator::beginAllocateToPool(capture_dev_, mempool_id_, [this](cudaStream_t stream) {
cudaStreamCaptureStatus status;
CaptureId_t stream_capture_id;
CaptureId_t stream_capture_id = 0;
AT_CUDA_CHECK(cudaStreamGetCaptureInfo(stream, &status, &stream_capture_id));
return status == cudaStreamCaptureStatus::cudaStreamCaptureStatusActive && stream_capture_id == capture_id_;
});

View File

@ -10,7 +10,7 @@ TensorBase empty_cuda(
ScalarType dtype,
std::optional<Device> device_opt,
std::optional<c10::MemoryFormat> memory_format_opt) {
at::globalContext().lazyInitCUDA();
at::globalContext().lazyInitDevice(c10::DeviceType::CUDA);
const auto device = device_or_default(device_opt);
TORCH_INTERNAL_ASSERT(device.is_cuda());
const DeviceGuard device_guard(device);
@ -50,7 +50,7 @@ TensorBase empty_strided_cuda(
IntArrayRef stride,
ScalarType dtype,
std::optional<Device> device_opt) {
at::globalContext().lazyInitCUDA();
at::globalContext().lazyInitDevice(c10::DeviceType::CUDA);
const auto device = device_or_default(device_opt);
TORCH_INTERNAL_ASSERT(device.is_cuda());
const DeviceGuard device_guard(device);

View File

@ -34,7 +34,7 @@ void init_p2p_access_cache(int64_t num_devices) {
} // namespace detail
bool get_p2p_access(int dev, int dev_to_access) {
at::globalContext().lazyInitCUDA();
at::globalContext().lazyInitDevice(c10::DeviceType::CUDA);
TORCH_CHECK(dev >= 0 || dev < num_devices_,
dev, " is not a device");

View File

@ -14,6 +14,7 @@
#include <ATen/detail/CUDAHooksInterface.h>
#include <ATen/native/cuda/CuFFTPlanCache.h>
#include <c10/util/Exception.h>
#include <c10/util/env.h>
#include <c10/cuda/CUDACachingAllocator.h>
#include <c10/cuda/CUDAFunctions.h>
#include <c10/util/irange.h>
@ -79,30 +80,19 @@ struct _Initializer {
} initializer;
} // anonymous namespace
// Sets the CUDA_MODULE_LOADING environment variable
// if it's not set by the user.
void maybe_set_cuda_module_loading(const std::string &def_value) {
auto value = std::getenv("CUDA_MODULE_LOADING");
if (!value) {
#ifdef _WIN32
auto env_var = "CUDA_MODULE_LOADING=" + def_value;
_putenv(env_var.c_str());
#else
setenv("CUDA_MODULE_LOADING", def_value.c_str(), 1);
#endif
}
}
// NB: deleter is dynamic, because we need it to live in a separate
// compilation unit (alt is to have another method in hooks, but
// let's not if we don't need to!)
void CUDAHooks::initCUDA() const {
void CUDAHooks::init() const {
C10_LOG_API_USAGE_ONCE("aten.init.cuda");
// Force the update to enable unit testing. This code gets executed before unit tests
// have a chance to enable vitals.
at::vitals::VitalsAPI.setVital("CUDA", "used", "true", /* force = */ true);
maybe_set_cuda_module_loading("LAZY");
// Sets the CUDA_MODULE_LOADING environment variable
// if it's not set by the user.
c10::utils::set_env("CUDA_MODULE_LOADING", "LAZY", false);
const auto num_devices = c10::cuda::device_count_ensure_non_zero();
c10::cuda::CUDACachingAllocator::init(num_devices);
at::cuda::detail::init_p2p_access_cache(num_devices);
@ -113,7 +103,7 @@ void CUDAHooks::initCUDA() const {
#endif
}
const Generator& CUDAHooks::getDefaultCUDAGenerator(DeviceIndex device_index) const {
const Generator& CUDAHooks::getDefaultGenerator(DeviceIndex device_index) const {
return at::cuda::detail::getDefaultCUDAGenerator(device_index);
}
@ -241,6 +231,9 @@ DeviceIndex current_device() {
return -1;
}
/**
* DEPRECATED: use getCurrentDevice() instead
*/
DeviceIndex CUDAHooks::current_device() const {
return at::cuda::detail::current_device();
}
@ -436,10 +429,21 @@ void CUDAHooks::cuFFTClearPlanCache(DeviceIndex device_index) const {
at::native::detail::cufft_clear_plan_cache_impl(device_index);
}
/**
* DEPRECATED: use deviceCount() instead
*/
int CUDAHooks::getNumGPUs() const {
return at::cuda::device_count();
}
DeviceIndex CUDAHooks::deviceCount() const {
return at::cuda::device_count();
}
DeviceIndex CUDAHooks::getCurrentDevice() const {
return at::cuda::detail::current_device();
}
#ifdef USE_ROCM
bool CUDAHooks::isGPUArch(DeviceIndex device_index, const std::vector<std::string>& archs) const {
hipDeviceProp_t* prop = at::cuda::getDeviceProperties(device_index);

View File

@ -19,10 +19,11 @@ TORCH_CUDA_CPP_API void set_magma_init_fn(void (*magma_init_fn)());
// The real implementation of CUDAHooksInterface
struct CUDAHooks : public at::CUDAHooksInterface {
CUDAHooks(at::CUDAHooksArgs) {}
void initCUDA() const override;
void init() const override;
Device getDeviceFromPtr(void* data) const override;
bool isPinnedPtr(const void* data) const override;
const Generator& getDefaultCUDAGenerator(DeviceIndex device_index = -1) const override;
const Generator& getDefaultGenerator(
DeviceIndex device_index = -1) const override;
bool hasCUDA() const override;
bool hasMAGMA() const override;
bool hasCuDNN() const override;
@ -49,6 +50,9 @@ struct CUDAHooks : public at::CUDAHooksInterface {
int64_t cuFFTGetPlanCacheSize(DeviceIndex device_index) const override;
void cuFFTClearPlanCache(DeviceIndex device_index) const override;
int getNumGPUs() const override;
DeviceIndex deviceCount() const override;
DeviceIndex getCurrentDevice() const override;
#ifdef USE_ROCM
bool isGPUArch(DeviceIndex device_index, const std::vector<std::string>& archs) const override;
#endif

View File

@ -77,6 +77,31 @@ default, now called through TunableOp. Any call to at::cuda::blas::gemm() or ::b
when enabled. Calling gemm() for a given set of input arguments (transa, transb, m, n, k) will attempt to use the
fastest available implementation across both rocblas and hipblaslt.
## Offline Tuning
### Motivation
Offline tuning is intended for workloads with high memory utilization, where regular (online) tuning might run out of memory.
### Workflow
There are two steps:
1) Set the environment variables to collect the untuned GEMMs; this generates `tunableop_untuned?.csv` ("?" is a placeholder for the GPU ID), e.g.:
```
PYTORCH_TUNABLEOP_ENABLED=1
PYTORCH_TUNABLEOP_TUNING=0
PYTORCH_TUNABLEOP_RECORD_UNTUNED=1
...
```
2) Run a Python script that reads `tunableop_untuned?.csv` and generates `tunableop_results?.csv`, e.g.:
```
import torch.cuda.tunable as tunable
import os
os.putenv('PYTORCH_TUNABLEOP_ENABLED', '1')
os.putenv('PYTORCH_TUNABLEOP_TUNING', '1')
os.putenv('PYTORCH_TUNABLEOP_RECORD_UNTUNED', '0')
tunable.tune_gemm_in_file("tunableop_untuned?.csv")
```
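A minimal sketch of driving step 1 from the `torch.cuda.tunable` Python API instead of environment variables is shown below; it assumes the `enable`, `tuning_enable`, and `record_untuned_enable` functions listed in the API table further down, plus an available GPU running a GEMM workload:
```
# Sketch: collect untuned GEMMs from Python (assumes the torch.cuda.tunable API and a GPU).
import torch
import torch.cuda.tunable as tunable

tunable.enable(True)                 # turn TunableOp on
tunable.tuning_enable(False)         # do not tune online
tunable.record_untuned_enable(True)  # record encountered GEMMs instead of tuning them

a = torch.randn(1024, 1024, device="cuda", dtype=torch.half)
b = torch.randn(1024, 1024, device="cuda", dtype=torch.half)
c = a @ b  # this GEMM should be appended to tunableop_untuned0.csv on GPU 0
```
Since the environment variables take precedence (see the Tuning Context section below), the sketch assumes no `PYTORCH_TUNABLEOP_*` variables are set in the environment.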
## Tuning Context
The behavior of TunableOp is currently manipulated through environment variables, the C++ interface of
at::cuda::tunable::getTuningContext(), or the `torch.cuda.tunable` python interfaces. The environment variables take
@ -90,6 +115,8 @@ programmatically since the settings become fixed. Use the C++ or Python APIs ins
| -------------------- | ----------- |
| PYTORCH_TUNABLEOP_ENABLED | Default is 0. Set to 1 to enable. |
| PYTORCH_TUNABLEOP_TUNING | Default is 1. Set to 0 to disable. |
| PYTORCH_TUNABLEOP_RECORD_UNTUNED | Default is 0. Set to 1 to enable. |
| PYTORCH_TUNABLEOP_UNTUNED_FILENAME | Default is 'tunableop_untuned.csv'. |
| PYTORCH_TUNABLEOP_VERBOSE | Default is 0. Set to 1 to enable basic logging. 2 for basic tuning status. 3 for full trace. |
| PYTORCH_TUNABLEOP_VERBOSE_FILENAME | Default is "err" for stderr. Set to "out" for stdout or a filename for capturing verbose logging. |
| PYTORCH_TUNABLEOP_FILENAME | Default is 'tunableop_results.csv'. |
@ -112,6 +139,8 @@ All python APIs exist in the `torch.cuda.tunable` module.
| is_enabled() -> bool | |
| tuning_enable(val: bool = True) -> None | Default is True. |
| tuning_is_enabled() -> bool | |
| record_untuned_enable(val: bool = True) -> None | Default is True. |
| record_untuned_is_enabled() -> bool | |
| set_max_tuning_duration(duration: int) -> None | |
| get_max_tuning_duration() -> int | |
| set_max_tuning_iterations(iterations: int) -> None | |
@ -123,6 +152,7 @@ All python APIs exist in the `torch.cuda.tunable` module.
| write_file_on_exit(val: bool) -> None | Default is True. |
| write_file(filename: Optional[str] = None) -> None | If filename not given, it will call get_filename(). |
| read_file(filename: Optional[str] = None) -> None | If filename not given, it will call get_filename(). |
| tune_gemm_in_file(filename: str) -> None | Read an untuned file and tune the GEMMs in it. |
### C++ Interface
Example:

View File

@ -112,6 +112,32 @@ void TuningResultsManager::Add(const std::string& op_signature, const std::strin
AddImpl(op_signature, params_signature, best, it->second);
}
void TuningResultsManager::RecordUntuned( std::ofstream& untuned_file, const std::string& op_signature, const std::string& params_signature) {
std::scoped_lock l{lock_};
if (!untuned_file.good()) {
TORCH_WARN_ONCE("failed to open file for writing; untuned gemm will not be saved");
return;
} else {
bool isNew = false;
auto it = untuned_results_.find(op_signature);
if (it == untuned_results_.end()) {
it = untuned_results_.insert({op_signature, {}}).first;
isNew = true;
}
auto it_kernel_map = it->second.find(params_signature);
if (it_kernel_map == it->second.end()) {
it->second.insert(params_signature);
isNew = true;
}
if (isNew) {
untuned_file << op_signature << "," << params_signature << std::endl;
TUNABLE_LOG3("Untuned,", op_signature, ",", params_signature);
}
}
}
void TuningResultsManager::Delete(const std::string& op_signature, const std::string& params_signature) {
std::scoped_lock l{lock_};
@ -359,6 +385,7 @@ TuningStatus TuningResultsValidator::ValidatePyTorchVersion(const std::string& v
TuningContext::TuningContext() :
enable_{false},
tuning_enable_{true},
record_untuned_enable_{false},
manager_initialized_{false},
write_file_on_exit_{true},
numerics_check_enable_{false},
@ -369,6 +396,7 @@ TuningContext::TuningContext() :
icache_flush_{true},
rotating_buffer_size_{-1},
filename_{},
untuned_file_{},
results_count_from_input_file_{0}
{
}
@ -394,6 +422,10 @@ TuningContext::~TuningContext() {
}
}
}
if (untuned_file_.good()) {
untuned_file_.close();
}
}
void TuningContext::EnableTunableOp(bool value) {
@ -424,6 +456,15 @@ void TuningContext::EnableTuning(bool value) {
}
}
void TuningContext::EnableRecordUntuned(bool value) {
record_untuned_enable_ = value;
if (value) {
TUNABLE_LOG1("Enable Record Untuned for TunableOp");
} else {
TUNABLE_LOG1("Disable Record Untuned for TunableOp");
}
}
bool TuningContext::IsTuningEnabled() const {
static const char *env = std::getenv("PYTORCH_TUNABLEOP_TUNING");
if (env != nullptr && strcmp(env, "0") == 0) {
@ -432,6 +473,33 @@ bool TuningContext::IsTuningEnabled() const {
return tuning_enable_;
}
bool TuningContext::IsRecordUntunedEnabled() const {
static const char *env = std::getenv("PYTORCH_TUNABLEOP_RECORD_UNTUNED");
if (env != nullptr && strcmp(env, "1") == 0) {
return true;
}
return record_untuned_enable_;
}
std::ofstream& TuningContext::GetUntunedFile(){
if (!untuned_file_.is_open()) {
const char *env = std::getenv("PYTORCH_TUNABLEOP_UNTUNED_FILENAME");
std::string filename = (env == nullptr) ? "tunableop_untuned.csv" : env;
std::string device = c10::str(int(c10::cuda::current_device()));
std::size_t found = filename.rfind(".");
if (found != std::string::npos) {
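// e.g. "tunableop_untuned.csv" becomes "tunableop_untuned3.csv" for device 3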
filename.insert(found, device);
} else {
// all else fails, just append
filename.append(device);
}
untuned_file_ = std::ofstream(filename, std::ios::out | std::ios::trunc);
}
return untuned_file_;
}
void TuningContext::WriteFileOnExit(bool value) {
write_file_on_exit_ = value;
}
@ -545,7 +613,7 @@ TuningResultsManager& TuningContext::GetTuningResultsManager() {
SetFilename(filename, true);
}
auto filename = GetFilename();
if (!filename.empty()) {
if (!filename.empty() && !IsRecordUntunedEnabled()) {
ReadFile(filename);
// attempt immediately to open file for writing to catch errors early
std::ofstream file(filename, std::ios::out | std::ios::app);

View File

@ -19,6 +19,7 @@
#include <string>
#include <type_traits>
#include <unordered_map>
#include <unordered_set>
#include <utility>
#include <vector>
@ -87,6 +88,7 @@ class TORCH_CUDA_CPP_API ResultEntry {
typedef std::unordered_map<std::string, ResultEntry> KernelMap;
typedef std::unordered_map<std::string, KernelMap> ResultsMap;
typedef std::unordered_map<std::string, std::unordered_set<std::string>> UntunedMap;
struct TORCH_CUDA_CPP_API TuningResults {
// Validates if these results are compatible with the libraries
@ -129,9 +131,12 @@ class TORCH_CUDA_CPP_API TuningResultsManager {
size_t GetSize();
void RecordUntuned( std::ofstream& untuned_file, const std::string& op_signature, const std::string& params_signature);
private:
std::mutex lock_;
ResultsMap results_;
UntunedMap untuned_results_;
};
class TORCH_CUDA_CPP_API TuningResultsValidator {
@ -173,6 +178,10 @@ class TORCH_CUDA_CPP_API TuningContext {
void EnableTuning(bool value);
bool IsTuningEnabled() const;
void EnableRecordUntuned(bool value);
bool IsRecordUntunedEnabled() const;
std::ofstream& GetUntunedFile();
void EnableNumericsCheck(bool value);
bool IsNumericsCheckEnabled() const;
@ -213,6 +222,7 @@ class TORCH_CUDA_CPP_API TuningContext {
private:
bool enable_;
bool tuning_enable_;
bool record_untuned_enable_;
bool manager_initialized_;
bool write_file_on_exit_;
bool numerics_check_enable_;
@ -226,6 +236,7 @@ class TORCH_CUDA_CPP_API TuningContext {
mutable c10::once_flag manager_init_once_;
TuningResultsValidator validator_;
std::string filename_;
std::ofstream untuned_file_;
size_t results_count_from_input_file_;
};

View File

@ -54,9 +54,15 @@ class TunableOp {
auto params_sig = params->Signature();
result = mgr.Lookup(op_sig, params_sig);
// If no previous tuning result has been found, we do the tuning iff tuning is enabled
if (result == ResultEntry::Null() && ctx->IsTuningEnabled()) {
result = FindFastest(params);
mgr.Add(op_sig, params_sig, result);
if (result == ResultEntry::Null()) {
if (ctx->IsTuningEnabled()) {
result = FindFastest(params);
mgr.Add(op_sig, params_sig, result);
}
else if (ctx->IsRecordUntunedEnabled()) {
// or record the gemm into file
mgr.RecordUntuned(ctx->GetUntunedFile(), op_sig, params_sig);
}
}
}
else {

View File

@ -1,9 +1,13 @@
#pragma once
#include <ATen/core/Generator.h>
#include <c10/core/Allocator.h>
#include <c10/core/Device.h>
#include <c10/core/Stream.h>
#include <c10/core/Allocator.h>
C10_DIAGNOSTIC_PUSH_AND_IGNORED_IF_DEFINED("-Wunused-parameter")
namespace at {
// AcceleratorHooksInterface is a shared interface provided by all
@ -19,6 +23,10 @@ struct TORCH_API AcceleratorHooksInterface {
// Whether the device at device_index is fully initialized or not.
virtual bool hasPrimaryContext(DeviceIndex device_index) const = 0;
virtual void init() const {
TORCH_CHECK(false, "Backend doesn`t support init()");
}
virtual DeviceIndex deviceCount() const {
return 0;
}
@ -50,7 +58,18 @@ struct TORCH_API AcceleratorHooksInterface {
TORCH_CHECK(false, "Backend doesn't support getPinnedMemoryAllocator()");
return nullptr;
}
virtual const Generator& getDefaultGenerator(
C10_UNUSED DeviceIndex device_index = -1) const {
TORCH_CHECK(false, "Backend doesn`t support getDefaultGenerator()");
}
virtual Generator getNewGenerator(
C10_UNUSED DeviceIndex device_index = -1) const {
TORCH_CHECK(false, "Backend doesn`t support getNewGenerator()");
}
};
} // namespace at
C10_DIAGNOSTIC_POP()

View File

@ -6,16 +6,13 @@
#include <ATen/detail/AcceleratorHooksInterface.h>
// Forward-declares at::Generator and at::cuda::NVRTC
// NB: Class must live in `at` due to limitations of Registry.h.
namespace at {
struct Generator;
// Forward-declares at::cuda::NVRTC
namespace cuda {
struct NVRTC;
} // namespace cuda
} // namespace at
// NB: Class must live in `at` due to limitations of Registry.h.
namespace at {
#ifdef _MSC_VER
constexpr const char* CUDA_HELP =
@ -65,12 +62,16 @@ struct TORCH_API CUDAHooksInterface : AcceleratorHooksInterface {
~CUDAHooksInterface() override = default;
// Initialize THCState and, transitively, the CUDA state
virtual void initCUDA() const {
void init() const override {
TORCH_CHECK(false, "Cannot initialize CUDA without ATen_cuda library. ", CUDA_HELP);
}
virtual const Generator& getDefaultCUDAGenerator(C10_UNUSED DeviceIndex device_index = -1) const {
TORCH_CHECK(false, "Cannot get default CUDA generator without ATen_cuda library. ", CUDA_HELP);
const Generator& getDefaultGenerator(
C10_UNUSED DeviceIndex device_index = -1) const override {
TORCH_CHECK(
false,
"Cannot get default CUDA generator without ATen_cuda library. ",
CUDA_HELP);
}
virtual Device getDeviceFromPtr(void* /*data*/) const {

View File

@ -1,19 +1,13 @@
#pragma once
#include <c10/core/Allocator.h>
#include <c10/core/GeneratorImpl.h>
#include <c10/util/Exception.h>
#include <c10/util/Registry.h>
#include <ATen/detail/AcceleratorHooksInterface.h>
#include <memory>
namespace at {
class Context;
}
// NB: Class must live in `at` due to limitations of Registry.h.
namespace at {
@ -26,13 +20,13 @@ struct TORCH_API HIPHooksInterface : AcceleratorHooksInterface {
// squelch -Werror=non-virtual-dtor
~HIPHooksInterface() override = default;
// Initialize the HIP library state
virtual void initHIP() const {
AT_ERROR("Cannot initialize HIP without ATen_hip library.");
void init() const override {
TORCH_CHECK(false, "Cannot initialize HIP without ATen_hip library.");
}
virtual std::unique_ptr<c10::GeneratorImpl> initHIPGenerator(Context*) const {
AT_ERROR("Cannot initialize HIP generator without ATen_hip library.");
const Generator& getDefaultGenerator(
C10_UNUSED DeviceIndex device_index = -1) const override {
TORCH_CHECK(false, "Cannot initialize HIP without ATen_hip library.");
}
virtual bool hasHIP() const {
@ -51,10 +45,6 @@ struct TORCH_API HIPHooksInterface : AcceleratorHooksInterface {
AT_ERROR("Pinned memory requires HIP.");
}
virtual void registerHIPTypes(Context*) const {
AT_ERROR("Cannot registerHIPTypes() without ATen_hip library.");
}
virtual int getNumGPUs() const {
return 0;
}

View File

@ -1,25 +1,33 @@
#pragma once
#include <ATen/core/Generator.h>
#include <ATen/detail/AcceleratorHooksInterface.h>
#include <c10/core/Allocator.h>
#include <c10/util/Exception.h>
#include <c10/util/Registry.h>
namespace at {
struct TORCH_API IPUHooksInterface {
virtual ~IPUHooksInterface() = default;
struct TORCH_API IPUHooksInterface : AcceleratorHooksInterface {
~IPUHooksInterface() override = default;
virtual const Generator& getDefaultIPUGenerator(
DeviceIndex device_index [[maybe_unused]] = -1) const {
AT_ERROR(
"Cannot get the default IPU generator: the IPU backend is not "
"available.");
void init() const override {
TORCH_CHECK(false, "Cannot initialize IPU without ATen_ipu library.");
}
virtual Generator newIPUGenerator(DeviceIndex device_index [[maybe_unused]] = -1) const {
AT_ERROR(
"Cannot create a new IPU generator: the IPU backend is not available.");
bool hasPrimaryContext(DeviceIndex device_index) const override {
TORCH_CHECK(false, "Cannot initialize IPU without ATen_ipu library.");
return false;
}
const Generator& getDefaultGenerator(
C10_UNUSED DeviceIndex device_index = -1) const override {
TORCH_CHECK(false, "Cannot initialize IPU without ATen_ipu library.");
}
Generator getNewGenerator(
DeviceIndex device_index [[maybe_unused]] = -1) const override {
TORCH_CHECK(false, "Cannot initialize IPU without ATen_ipu library.");
}
};

View File

@ -3,13 +3,24 @@
#include <c10/util/Exception.h>
#include <c10/util/Registry.h>
#include <ATen/detail/AcceleratorHooksInterface.h>
// NB: Class must live in `at` due to limitations of Registry.h.
namespace at {
struct TORCH_API MAIAHooksInterface {
struct TORCH_API MAIAHooksInterface : AcceleratorHooksInterface {
// This should never actually be implemented, but it is used to
// squelch -Werror=non-virtual-dtor
virtual ~MAIAHooksInterface() = default;
~MAIAHooksInterface() override = default;
void init() const override {
TORCH_CHECK(false, "Cannot initialize MAIA without ATen_maia library.");
}
bool hasPrimaryContext(DeviceIndex device_index) const override {
TORCH_CHECK(false, "Cannot initialize MAIA without ATen_maia library.");
return false;
}
virtual std::string showConfig() const {
TORCH_CHECK(false, "Cannot query detailed MAIA version information.");

View File

@ -2,9 +2,9 @@
#pragma once
#include <c10/core/Allocator.h>
#include <ATen/core/Generator.h>
#include <ATen/detail/AcceleratorHooksInterface.h>
#include <c10/core/Allocator.h>
#include <c10/util/Exception.h>
#include <c10/util/Registry.h>
@ -22,7 +22,7 @@ struct TORCH_API MPSHooksInterface : AcceleratorHooksInterface {
~MPSHooksInterface() override = default;
// Initialize the MPS library state
virtual void initMPS() const {
void init() const override {
FAIL_MPSHOOKS_FUNC(__func__);
}
virtual bool hasMPS() const {
@ -31,7 +31,8 @@ struct TORCH_API MPSHooksInterface : AcceleratorHooksInterface {
virtual bool isOnMacOSorNewer(unsigned major = 13, unsigned minor = 0) const {
FAIL_MPSHOOKS_FUNC(__func__);
}
virtual const Generator& getDefaultMPSGenerator() const {
const Generator& getDefaultGenerator(
C10_UNUSED DeviceIndex device_index = -1) const override {
FAIL_MPSHOOKS_FUNC(__func__);
}
virtual Allocator* getMPSDeviceAllocator() const {

View File

@ -31,7 +31,7 @@ struct TORCH_API MTIAHooksInterface : AcceleratorHooksInterface {
~MTIAHooksInterface() override = default;
virtual void initMTIA() const {
void init() const override {
// Avoid logging here, since MTIA needs init devices first then it will know
// how many devices are available. Make it as no-op if mtia extension is not
// dynamically loaded.

View File

@ -1,18 +1,20 @@
#pragma once
#include <ATen/core/Generator.h>
#include <ATen/detail/AcceleratorHooksInterface.h>
#include <c10/core/Allocator.h>
#include <c10/core/Device.h>
#include <c10/core/Storage.h>
#include <c10/util/Exception.h>
C10_DIAGNOSTIC_PUSH_AND_IGNORED_IF_DEFINED("-Wunused-parameter")
namespace at {
struct TORCH_API PrivateUse1HooksInterface : AcceleratorHooksInterface {
~PrivateUse1HooksInterface() override = default;
virtual const at::Generator& getDefaultGenerator(
c10::DeviceIndex device_index) const {
const at::Generator& getDefaultGenerator(
c10::DeviceIndex device_index) const override {
TORCH_CHECK_NOT_IMPLEMENTED(
false,
"You should register `PrivateUse1HooksInterface` for PrivateUse1 before call `getDefaultGenerator`.");
@ -24,23 +26,23 @@ struct TORCH_API PrivateUse1HooksInterface : AcceleratorHooksInterface {
"You should register `PrivateUse1HooksInterface` for PrivateUse1 before call `getDeviceFromPtr`.");
}
virtual bool isPinnedPtr(const void* data) const override {
bool isPinnedPtr(const void* data) const override {
return false;
}
virtual Allocator* getPinnedMemoryAllocator() const override {
Allocator* getPinnedMemoryAllocator() const override {
TORCH_CHECK(
false,
"You should register `PrivateUse1HooksInterface` for PrivateUse1 before call `getPinnedMemoryAllocator`.");
}
virtual bool hasPrimaryContext(DeviceIndex device_index) const override {
bool hasPrimaryContext(DeviceIndex device_index) const override {
TORCH_CHECK_NOT_IMPLEMENTED(
false,
"You should register `PrivateUse1HooksInterface` for PrivateUse1 before call `hasPrimaryContext`.");
}
virtual void initPrivateUse1() const {}
void init() const override {}
virtual void resizePrivateUse1Bytes(
const c10::Storage& storage,
size_t newsize) const {

View File

@ -4,7 +4,6 @@
#include <c10/util/Exception.h>
#include <c10/util/Registry.h>
#include <ATen/core/Generator.h>
#include <ATen/detail/AcceleratorHooksInterface.h>
C10_DIAGNOSTIC_PUSH_AND_IGNORED_IF_DEFINED("-Wunused-parameter")
@ -14,10 +13,8 @@ namespace at {
struct TORCH_API XPUHooksInterface : AcceleratorHooksInterface{
~XPUHooksInterface() override = default;
virtual void initXPU() const {
TORCH_CHECK(
false,
"Cannot initialize XPU without ATen_xpu library.");
void init() const override {
TORCH_CHECK(false, "Cannot initialize XPU without ATen_xpu library.");
}
virtual bool hasXPU() const {
@ -34,12 +31,15 @@ struct TORCH_API XPUHooksInterface : AcceleratorHooksInterface{
TORCH_CHECK(false, "Cannot get XPU global device index without ATen_xpu library.");
}
virtual Generator getXPUGenerator(C10_UNUSED DeviceIndex device_index = -1) const {
TORCH_CHECK(false, "Cannot get XPU generator without ATen_xpu library.");
const Generator& getDefaultGenerator(
C10_UNUSED DeviceIndex device_index = -1) const override {
TORCH_CHECK(
false, "Cannot get default XPU generator without ATen_xpu library.");
}
virtual const Generator& getDefaultXPUGenerator(C10_UNUSED DeviceIndex device_index = -1) const {
TORCH_CHECK(false, "Cannot get default XPU generator without ATen_xpu library.");
Generator getNewGenerator(
C10_UNUSED DeviceIndex device_index = -1) const override {
TORCH_CHECK(false, "Cannot get XPU generator without ATen_xpu library.");
}
virtual DeviceIndex getNumGPUs() const {
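
The XPU hunk renames the backend-specific accessors to the generic pair declared on the base interface: getDefaultXPUGenerator becomes getDefaultGenerator (a reference to the long-lived default RNG) and getXPUGenerator becomes getNewGenerator (a fresh Generator returned by value). A hedged caller-side sketch follows, assuming only the two base-class signatures visible in this hunk plus the public at::Generator seed accessors; seed_through_hooks is an invented helper name.

#include <ATen/core/Generator.h>
#include <ATen/detail/AcceleratorHooksInterface.h>

// Illustrative only: device-generic code can reach either generator through the
// common hooks interface instead of per-backend getXXXGenerator() calls.
static uint64_t seed_through_hooks(const at::AcceleratorHooksInterface& hooks,
                                   c10::DeviceIndex device_index,
                                   uint64_t seed) {
  at::Generator def = hooks.getDefaultGenerator(device_index);  // shared default RNG
  def.set_current_seed(seed);                                   // reseeds the shared state
  at::Generator scratch = hooks.getNewGenerator(device_index);  // independent RNG state
  return scratch.current_seed();                                // unaffected by the reseed
}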


@@ -362,6 +362,7 @@ static std::tuple<Tensor,Tensor,Tensor> convolution_backward_plumbing(
const Tensor& grad_output_, const Tensor& input_, const Tensor& weight_,
const c10::OptionalArrayRef<SymInt> bias_sizes_opt,
c10::SymIntArrayRef stride, c10::SymIntArrayRef padding, c10::SymIntArrayRef dilation, bool transposed,
// NOLINTNEXTLINE(performance-unnecessary-value-param)
c10::SymIntArrayRef output_padding, c10::SymInt groups, std::array<bool, 3> output_mask) {
const auto maybe_layer = maybeCurrentDynamicLayer();
vmap_check_escaped(maybe_layer, "convolution_backward_plumbing");
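
The only change to convolution_backward_plumbing here is a clang-tidy suppression: NOLINTNEXTLINE exempts exactly the next source line from the named check, which is useful when a parameter such as groups is deliberately taken by value. A tiny, purely illustrative reminder of the mechanism; the function below is made up.

#include <c10/core/SymInt.h>

// Hypothetical definition: clang-tidy's performance-unnecessary-value-param would
// flag `groups` (taken by value, only read); the comment silences it for one line.
// NOLINTNEXTLINE(performance-unnecessary-value-param)
static int64_t materialize_groups(c10::SymInt groups) {
  return groups.guard_int(__FILE__, __LINE__);
}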


@@ -458,6 +458,16 @@ inline int64_t get_bdim_size2(
TORCH_INTERNAL_ASSERT(false);
}
inline c10::SymInt get_bdim_size2_symint(
const Tensor& a_value, std::optional<int64_t> a_bdim,
const Tensor& b_value, std::optional<int64_t> b_bdim) {
if (a_bdim)
return a_value.sym_size(*a_bdim);
if (b_bdim)
return b_value.sym_size(*b_bdim);
TORCH_INTERNAL_ASSERT(false);
}
// [start, start + 1, ..., stop - 1]
inline VmapDimVector range(int64_t start, int64_t stop) {
TORCH_INTERNAL_ASSERT(stop >= start);
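
The new get_bdim_size2_symint mirrors get_bdim_size2 but keeps the batch size as a c10::SymInt, so batching rules can build output shapes without specializing dynamic dimensions. A hedged sketch of that kind of use follows; batched_output_shape and its arguments are illustrative names, not existing helpers.

#include <optional>
#include <vector>
#include <ATen/functorch/BatchRulesHelper.h>

namespace at::functorch {

// Illustrative only: prepend the (possibly symbolic) batch size taken from
// whichever operand carries a batch dimension to a per-example output shape.
static std::vector<c10::SymInt> batched_output_shape(
    const Tensor& a, std::optional<int64_t> a_bdim,
    const Tensor& b, std::optional<int64_t> b_bdim,
    c10::SymIntArrayRef per_example_shape) {
  std::vector<c10::SymInt> shape;
  shape.reserve(per_example_shape.size() + 1);
  shape.push_back(get_bdim_size2_symint(a, a_bdim, b, b_bdim));  // leading batch dim
  shape.insert(shape.end(), per_example_shape.begin(), per_example_shape.end());
  return shape;
}

} // namespace at::functorch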


@@ -8,7 +8,7 @@
#include <ATen/core/dispatch/Dispatcher.h>
#include <ATen/functorch/BatchRulesHelper.h>
namespace at { namespace functorch {
namespace at::functorch {
#define OP_DECOMPOSE(op) m.impl(#op, static_cast<decltype(&ATEN_FN(op))>(native::op));
#define OP_DECOMPOSE2(op, overload) m.impl(#op"."#overload, static_cast<decltype(&ATEN_FN2(op, overload))>(native::op));
@@ -20,4 +20,4 @@ TORCH_LIBRARY_IMPL(aten, FuncTorchBatched, m) {
OP_DECOMPOSE(_unsafe_masked_index_put_accumulate);
}
}}
}
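
The change to this file is purely syntactic: the pre-C++17 nested namespace blocks (and their matching double closing braces) are replaced by a C++17 nested namespace definition. The two spellings below are equivalent.

// Pre-C++17 spelling: two nested namespace blocks, closed with "}}".
namespace at { namespace functorch {
  // ... batching-rule registrations ...
}} // namespace at::functorch

// C++17 spelling used after this change: one declaration, one closing brace.
namespace at::functorch {
  // ... batching-rule registrations ...
} // namespace at::functorch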


@@ -226,7 +226,7 @@ static Tensor one_hot_decomposition_hack(const Tensor &self, int64_t num_classes
if (num_classes <= 0) {
AT_ERROR("Can not infer total number of classes from empty tensor.");
} else {
shape.push_back(num_classes);
shape.emplace_back(num_classes);
return at::empty_symint(shape, self.options());
}
}
@@ -246,7 +246,7 @@ static Tensor one_hot_decomposition_hack(const Tensor &self, int64_t num_classes
// TORCH_CHECK(num_classes > self.max().item().toLong(), "Class values must be smaller than num_classes.");
// }
shape.push_back(num_classes);
shape.emplace_back(num_classes);
Tensor ret = at::zeros_symint(shape, self.options());
return ret.scatter(-1, self.unsqueeze(-1), 1);
}
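
In one_hot_decomposition_hack the shape vector holds c10::SymInt while num_classes is a plain int64_t, so emplace_back constructs the SymInt element in place instead of building a temporary and moving it, which is what push_back would do. A small hedged illustration of the same idiom; append_num_classes is an invented name and the container type is assumed.

#include <cstdint>
#include <vector>
#include <c10/core/SymInt.h>

// Illustrative only: same element and value types as the code above is assumed to use.
static void append_num_classes(std::vector<c10::SymInt>& shape, int64_t num_classes) {
  // push_back(num_classes) would first materialize a temporary c10::SymInt and move it;
  // emplace_back forwards the int64_t straight to SymInt's constructor.
  shape.emplace_back(num_classes);
}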


@@ -213,7 +213,7 @@ static std::tuple<Tensor,Tensor> native_dropout_batching_rule(const Tensor& tens
return std::make_tuple(output, mask);
}
static Tensor multinomial_batching_rule(const Tensor& self, const int64_t num_samples, const bool replacement, const std::optional<Generator> generator) {
static Tensor multinomial_batching_rule(const Tensor& self, const int64_t num_samples, const bool replacement, std::optional<Generator> generator) {
c10::impl::ExcludeDispatchKeyGuard guard(DispatchKey::FuncTorchVmapMode);
auto maybe_layer = maybeCurrentDynamicLayer();
const auto cur_level = maybe_layer->layerId();
@@ -237,7 +237,7 @@ static Tensor multinomial_batching_rule(const Tensor& self, const int64_t num_sa
if (is_2D_case) {
self_value = reshape_dim_into(0, 0, self_value);
}
auto out = multinomial(self_value, num_samples, replacement, generator);
auto out = multinomial(self_value, num_samples, replacement, std::move(generator));
if (is_2D_case) {
out = reshape_dim_outof_symint(0, maybe_layer->batchSize(), out);
}
@@ -249,7 +249,7 @@ static Tensor multinomial_batching_rule(const Tensor& self, const int64_t num_sa
// Must be same randomness with unbatched input
// 1D case: S -> multinomial(S) -> S
// 2D case: MS -> multinomial(MS) -> MS
return multinomial(self_value, num_samples, replacement, generator);
return multinomial(self_value, num_samples, replacement, std::move(generator));
}
template <typename A, A a, typename C>
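
multinomial_batching_rule now accepts its std::optional<Generator> by value and std::move()s it into the final multinomial call on both paths, the usual fix for clang-tidy's performance-unnecessary-value-param: the optional is copied at most once at the call boundary and never again inside. A minimal hedged sketch of the same idiom against the public at::multinomial overload; sample_one is an invented wrapper.

#include <optional>
#include <utility>
#include <ATen/ATen.h>

// Illustrative only: take the generator by value, then hand it off with std::move
// at its last use so no second copy of the std::optional<at::Generator> is made.
static at::Tensor sample_one(const at::Tensor& probs,
                             std::optional<at::Generator> generator) {
  return at::multinomial(probs, /*num_samples=*/1, /*replacement=*/false,
                         std::move(generator));
}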

Some files were not shown because too many files have changed in this diff.