Compare commits

..

2160 Commits

Author SHA1 Message Date
507c69e20f Don't uselessly recompute axiom dict every static eval call (#135429)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135429
Approved by: https://github.com/isuruf
ghstack dependencies: #135137
2024-09-27 04:03:25 +00:00
285fa03b5e Deal with size oblivious before going into worker (#135137)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135137
Approved by: https://github.com/isuruf
2024-09-27 04:03:25 +00:00
86631eccda [Inductor] Remove stride-0 dimensions from more complex block pointers (#135557)
Related issue: #125077

### Feature
Inductor tries to remove dimensions with stride 0 from block pointers. Rather than loading with stride 0, it's more efficient to load a smaller block pointer, then use `tl.broadcast_to` to broadcast it up to the desired size. This already worked for simpler block pointers, but it was disabled for more complex block pointers which used `tl.reshape` to change the dimensionality after loading.

This PR generalizes the approach to work for all block pointers. The idea is to first reshape, adding singleton dimensions, then broadcast those singletons up to something larger, then reshape again to the final output shape. For readability, we emit this code only if it actually does something. Simpler loads will just have `tl.load`.

Here's an example of a complicated kernel that uses `reshape` -> `load` -> `reshape`. (The first reshape is actually the slice `[None,None,:]`).
```
@triton.jit
def triton_(in_ptr0, in_ptr1, out_ptr0, xnumel, XBLOCK : tl.constexpr):
    xnumel = 64
    xoffset = tl.program_id(0) * XBLOCK
    xindex = xoffset + tl.arange(0, XBLOCK)[:]
    xmask = xindex < xnumel
    x2 = xindex
    x1 = (xindex // 8)
    tmp0 = tl.load(tl.make_block_ptr(in_ptr0, shape=[64], strides=[1], block_shape=[XBLOCK], order=[0], offsets=[xoffset]), boundary_check=[0])
    tmp1 = tl.reshape(tl.broadcast_to(tl.load(tl.make_block_ptr(in_ptr1, shape=[8], strides=[8], block_shape=[((7 + XBLOCK) // 8)], order=[0], offsets=[(xoffset // 8)]), boundary_check=[0], eviction_policy='evict_last')[:, None, None], [((7 + XBLOCK) // 8), ((1) * ((1) <= (((7 + XBLOCK) // 8))) + (((7 + XBLOCK) // 8)) * ((((7 + XBLOCK) // 8)) < (1))), ((8) * ((8) <= (XBLOCK)) + (XBLOCK) * ((XBLOCK) < (8)))]), [XBLOCK])
    tmp2 = tmp0 + tmp1
    tl.store(tl.make_block_ptr(out_ptr0, shape=[64], strides=[1], block_shape=[XBLOCK], order=[0], offsets=[xoffset]), tmp2.to(tl.float32), boundary_check=[0])
''', device_str='cuda')
```

Before this PR, we would have stride-0 dimensions:
```
@triton.jit
def triton_(in_ptr0, in_ptr1, out_ptr0, xnumel, XBLOCK : tl.constexpr):
    xnumel = 64
    xoffset = tl.program_id(0) * XBLOCK
    xindex = xoffset + tl.arange(0, XBLOCK)[:]
    xmask = xindex < xnumel
    x2 = xindex
    x1 = (xindex // 8)
    tmp0 = tl.load(tl.make_block_ptr(in_ptr0, shape=[64], strides=[1], block_shape=[XBLOCK], order=[0], offsets=[xoffset]), boundary_check=[0])
    tmp1 = tl.reshape(tl.load(tl.make_block_ptr(in_ptr1, shape=[8, 1, 8], strides=[8, 0, 0], block_shape=[((7 + XBLOCK) // 8), ((1) * ((1) <= (((7 + XBLOCK) // 8))) + (((7 + XBLOCK) // 8)) * ((((7 + XBLOCK) // 8)) < (1))), ((8) * ((8) <= (XBLOCK)) + (XBLOCK) * ((XBLOCK) < (8)))], order=[2, 1, 0], offsets=[(xoffset // 8), 0, xoffset % 8]), boundary_check=[0], eviction_policy='evict_last'), [XBLOCK])
    tmp2 = tmp0 + tmp1
    tl.store(tl.make_block_ptr(out_ptr0, shape=[64], strides=[1], block_shape=[XBLOCK], order=[0], offsets=[xoffset]), tl.broadcast_to(tmp2, [XBLOCK]).to(tl.float32), boundary_check=[0])
''', device_str='cuda')
```

Here's a simpler example where we use 2D tiling. In this case we don't actually need the broadcast. The broadcast is implied via a slice adding a new singleton dimension. This code is not changed by this PR, but it's important to know that we don't accidentally insert unnecessary broadcasts.
```
@triton.jit
def triton_(in_ptr0, in_ptr1, out_ptr0, ynumel, xnumel, YBLOCK : tl.constexpr, XBLOCK : tl.constexpr):
    ynumel = 8
    xnumel = 8
    yoffset = tl.program_id(1) * YBLOCK
    yindex = yoffset + tl.arange(0, YBLOCK)[None, :]
    ymask = yindex < ynumel
    xoffset = tl.program_id(0) * XBLOCK
    xindex = xoffset + tl.arange(0, XBLOCK)[:, None]
    xmask = xindex < xnumel
    x1 = xindex
    y0 = yindex
    tmp0 = tl.load(tl.make_block_ptr(in_ptr0, shape=[8, 8], strides=[1, 8], block_shape=[XBLOCK, YBLOCK], order=[1, 0], offsets=[xoffset, yoffset]), boundary_check=[0, 1])
    tmp1 = tl.load(tl.make_block_ptr(in_ptr1, shape=[8], strides=[8], block_shape=[YBLOCK], order=[0], offsets=[yoffset]), boundary_check=[0], eviction_policy='evict_last')[None, :]
    tmp2 = tmp0 + tmp1
    tl.store(tl.make_block_ptr(out_ptr0, shape=[8, 8], strides=[1, 8], block_shape=[XBLOCK, YBLOCK], order=[1, 0], offsets=[xoffset, yoffset]), tmp2.to(tl.float32), boundary_check=[0, 1])
''', device_str='cuda')
```
### Test Plan
Added a new expecttest to check the emitted code for broadcast addition. Looking at the test, we can see that stride 0 dimensions are removed. (This test generated the example kernels in the previous section.)

This change also removed a stride-0 dimension in an existing block pointer test. I updated the expected code accordingly.

Bonus: I noticed that the test parametrization for `config.prefer_nd_tiling` wasn't working as intended. It ended up always setting this option to `True`. Fixed it so we get the intended test coverage.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135557
Approved by: https://github.com/shunting314, https://github.com/jansel

Co-authored-by: Yueming Hao <yhao@meta.com>
2024-09-27 04:01:40 +00:00
2c5f5e303a [inductor] Triton codegen: Use scalar when creating f64 constant instead of 1-element tensor (#136594)
Summary: We have an internal report of a Triton compiler error `ValueError: Cannot broadcast, rank mismatch: [1], [1, 2048]` coming from a line like this:

`tmp25 = tl.broadcast_to(((tl.full([1], 1.00000000000000, tl.float64)) + ((ks0 // 3278).to(tl.float64))) / (((tl.full([1], 0.500000000000000, tl.float64))*(libdevice.sqrt((1 + ((ks0 // 3278)*(ks0 // 3278)) + ((-2)*(ks0 // 3278))).to(tl.float64).to(tl.float32)))) + ((tl.full([1], 0.500000000000000, tl.float64))*((1 + (ks0 // 3278)).to(tl.float64)))), [XBLOCK, RBLOCK])
`

https://github.com/pytorch/pytorch/pull/135260 is the cause, presumably because we turn a constant into a 1-element tensor with: `(tl.full([1], const, tl.float64))`. It looks like changing the syntax to `(tl.full([], const, tl.float64))` gives us what we want?

Differential Revision: [D63465169](https://our.internmc.facebook.com/intern/diff/D63465169)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136594
Approved by: https://github.com/mengluy0125, https://github.com/jansel
2024-09-27 04:01:09 +00:00
a2d2a30311 Add torch._dynamo.config.fail_on_cache_limit_hit (#136767)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136767
Approved by: https://github.com/albanD, https://github.com/jansel
ghstack dependencies: #136533
2024-09-27 03:58:00 +00:00
2521cd3874 Skip kernel saving if already existed. (#136389)
Summary:
We skip the save_gpu_kernel if kernel is being saved already.
This would give us a more accurate Triton profiling result. The following trace shows before/after the change for a benchmarking of a trivial addmm:

Before:
<img width="1255" alt="Screenshot 2024-09-23 at 10 26 53 AM" src="https://github.com/user-attachments/assets/5aea05ef-6ef0-464c-8da9-17b31c97b43a">

After:
<img width="910" alt="Screenshot 2024-09-23 at 10 27 03 AM" src="https://github.com/user-attachments/assets/488b7d4f-268f-41cf-8553-cb16ceeae118">

We can see that before the change, the benchmarking includes two parts,
(1) The overhead of our triton_heuristic call, which includes the save/get, and the (expensive) hash computation.
(2) The exact computation of Triton kernel.

We see that (1) accounts >50% of time, which makes kernel selection for profiling often choose aten kernels over Triton kernels.

Test Plan:
Existing OSS CI
[Redacted, Some internal model results in D63441430]

Reviewers:

Subscribers:

Tasks:

Tags:

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136389
Approved by: https://github.com/desertfire
2024-09-27 03:03:28 +00:00
d1382aaf3d skip test_out_of_memory for jetson (#133270)
Skip test_out_of_memory in test/test_cuda.py on Jetson as OOM reporting in Jetson has issues due to partially missing NVML support. cc @eqy
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133270
Approved by: https://github.com/eqy, https://github.com/albanD, https://github.com/seemethere
2024-09-27 02:36:48 +00:00
26869d38e1 [Inductor] Further solve missing aoti_torch_check symbole issue (#136775)
Summary: https://github.com/pytorch/pytorch/pull/136669 didn't resolve all the internal test failures. Add more tests to OSS CI to catch the remaining issues, and fix some internal TARGETS dependency.

Differential Revision: D63473744

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136775
Approved by: https://github.com/henrylhtsang
2024-09-27 02:26:49 +00:00
66340e6751 Fix numerical instability for norm (#129352)
Fixes #123645
When the reduce size is large, reducing directly may exceed the range that FP32 can represent, resulting in incorrect results.
Reducing in group and using double as the intermediate cumulative type can avoid exceeding the representation range.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129352
Approved by: https://github.com/jgong5, https://github.com/peterbell10
2024-09-27 00:51:31 +00:00
adc77a9b7f [lintrunner] auto apply formatting changes as suggestions (#136239)
(Remove spurious cc)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136239
Approved by: https://github.com/huydhn, https://github.com/eqy

Co-authored-by: Huy Do <huydhn@gmail.com>
2024-09-27 00:51:21 +00:00
faedee12fa [test] enable test_triton_wrapper again (#136721)
Summary:
Reenable the `test_triton_wrapper.py` test again

# Why

We want this to run internally

# What

- fix python path issue on the test
- reenable the test

# Background

It appears that the parent process does not pass the entire path down to the child process. Namely, if there is some setup that makes the sys.path effectively look different than, say, PYTHONPATH or something like this, the child will not inherit this setup. To avoid needing to keep track of specific setups, we pass the effective `sys.path` from the parent to the child through the PYTHONPATH env variable

Test Plan: buck2 test 'fbcode//mode/opt' fbcode//caffe2/test/inductor:triton_wrapper

Differential Revision: D63438186

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136721
Approved by: https://github.com/henrylhtsang
2024-09-27 00:44:40 +00:00
22a4129a76 Generalization of FSDP common for non-cuda execution (#133209)
## Motivation
The FSDP common code for FSDP UT execution is mostly written with cuda device in mind. However other devices such the intel Gaudi supports most of the functionality. We are generalizing the base content so that the UT content can be used for non-cuda device execution.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133209
Approved by: https://github.com/kwen2501
2024-09-27 00:38:10 +00:00
a619ced5ed Revert "Update run_test.py"
This reverts commit 193073b4914a7f80758541d391eacbe21194ecdf.
2024-09-26 17:34:52 -07:00
193073b491 Update run_test.py 2024-09-26 16:56:29 -07:00
aa56f80ec1 Dont pairwise check unfusable nodes in scheduler (#136682)
Gives 8% wall time speedup on n=1000 benchmark in https://github.com/pytorch/pytorch/pull/136429

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136682
Approved by: https://github.com/ezyang, https://github.com/jansel, https://github.com/shunting314
2024-09-26 23:46:52 +00:00
0b62ebfeaa [CI] Populate JOB_ID for MPS tests (#136791)
Move `get-job-id` steps before running the tests and copy-n-paste environment variables from `_mac-test.yml` added in https://github.com/pytorch/pytorch/pull/113099

Should fix the following warning during MPS test run:
```
/Users/ec2-user/runner/_work/pytorch/pytorch/tools/stats/upload_metrics.py:147: UserWarning: Not emitting metrics for td_test_failure_stats_v2. Missing job_id. Please set the JOB_ID environment variable to pass in this value.
  warn(f"Not emitting metrics for {metric_name}. {e}")
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136791
Approved by: https://github.com/albanD, https://github.com/izaitsevfb
2024-09-26 23:00:52 +00:00
da5c7b6f4e [AOTI] Set CUDA device for torch._export.aot_load (#136715)
Summary: Fixes https://github.com/pytorch/pytorch/issues/136369. When a CUDA device with index is specified when calling torch._export.aot_load, we need to specify the CUDA device when running model.so.

Differential Revision: [D63438335](https://our.internmc.facebook.com/intern/diff/D63438335)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136715
Approved by: https://github.com/angelayi
2024-09-26 22:20:12 +00:00
991f8f8ec3 Bias gradient calculation for NJT linear backward (#136660)
Previously NYI - @mikaylagawarecki needs it for Transformers.

Fixes #136652
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136660
Approved by: https://github.com/soulitzer
2024-09-26 21:38:10 +00:00
eqy
c0e98a485b [FP8][CUDA] Fix stale expected error message (#136581)
CC @nWEIdia as I think we have seen internal failures on this

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136581
Approved by: https://github.com/mikaylagawarecki
2024-09-26 20:57:38 +00:00
5789f8d5dc [MPS] Add regression test for large inputs to F.linear (#136084)
This PR adds a regression test for the issue reported in #122045. I was not able to reproduce on macOS > 13.

~Expect the first iteration of the tests to fail for macOS 13, but pass for 14 and 15.~
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136084
Approved by: https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2024-09-26 20:46:14 +00:00
9656a603b2 Fix lint (#136781)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136781
Approved by: https://github.com/clee2000, https://github.com/ZainRizvi, https://github.com/malfet
2024-09-26 19:13:56 +00:00
c878ea2c4e Add info about "release tracker" label for cherry-picking bot (#136777)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136777
Approved by: https://github.com/seemethere, https://github.com/atalman
2024-09-26 18:45:45 +00:00
851b9732aa Download pre-compiled AOTriton from GitHub unless AOTRITON_INSTALL_FROM_SOURCE=1 is set (#136603)
PyTorch community members have reported issues with building PyTorch from source for ROCm in an environment that doesn't have aotriton pre-installed, because aotriton is only installed in the [CI](a8ed873ba2/.ci/docker/manywheel/Dockerfile (L197)) docker images. Building aotriton from source can take ~45 minutes.

This PR fixes the issue by downloading the aotriton tarball in such scenarios, *unless the user explicitly wants to build aotriton from source using the AOTRITON_INSTALL_FROM_SOURCE=1 env var*

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136603
Approved by: https://github.com/atalman

Co-authored-by: Xinya Zhang <Xinya.Zhang@amd.com>
2024-09-26 18:05:51 +00:00
f0a92541fe [export] fix lifted constants order for 0-input graphs (#136658)
Summary:
With empty graphs, the `graph.inserting_before(first_user_input = None)` call turns into a `graph.inserting_after(root)` call, inverting the order of constant input nodes being inserted.

This fixes the issue by initializing to the first node in the graph (still valid if not a user input - only used for insertion).

Test Plan: test_export

Differential Revision: D63403514

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136658
Approved by: https://github.com/avikchaudhuri
2024-09-26 17:44:24 +00:00
40c825d773 [reland] [torchelastic][c10d] Fix store prefix race in rendezvous (#136768)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136768
Approved by: https://github.com/kwen2501, https://github.com/atalman
2024-09-26 17:37:07 +00:00
da09984c0d [AOTI][Tooling][9/n] Add debug printer support for cpp kernel type (#136465)
Summary:

As title.

Cpp kernel has a different codegen path: https://www.internalfb.com/code/fbsource/[6df946858879dd9bcefa18710dd79095a957f0dd]/fbcode/caffe2/torch/_inductor/codegen/cpp.py?lines=4643
Previously it is not wrapped/supported by the debug printer manager. This diff adds this support.
It can be useful for cpu models. See this for a use case: https://www.internalfb.com/phabricator/paste/view/P1598561051?lines=927

Test Plan:
```
AOT_INDUCTOR_DEBUG_INTERMEDIATE_VALUE_PRINTER=2 TORCHINDUCTOR_FORCE_DISABLE_CACHES=1  TORCHINDUCTOR_ABI_COMPATIBLE=1 TORCH_COMPILE_DEBUG=1 TORCH_LOGS="+graph, inductor, +schedule, output_code" buck2 run 'fbcode//mode/opt' fbcode//accelerators/workloads/models/slimdsnn:slimdsnn -- aot --batch-size 1
```

Differential Revision: D63053101

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136465
Approved by: https://github.com/hl475
2024-09-26 17:30:43 +00:00
e4e83a4ac4 Remove aten.item hack (#136663)
Summary: Title

Test Plan: CI

Differential Revision: D63404353

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136663
Approved by: https://github.com/bdhirsh
2024-09-26 17:14:48 +00:00
2421344d8f Update current maintainers (#136672)
This file didn't had an overall in a few years so long overdue. Most of the credit goes to @orionr for gathering all of this info.

The main rules we followed:
- No code contributor is removed, they're all placed as emeritus
- Breakdown too big categories to make this document useful to know who to ping
- No category where the code is still in the codebase is removed
- We did not rework the categories (for example to be closer to module: labels) and leave that for later
- All non-emeritus names are ordered by their number of comments on issues related to their topic
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136672
Approved by: https://github.com/eqy, https://github.com/ezyang, https://github.com/seemethere, https://github.com/malfet
2024-09-26 17:13:16 +00:00
beb46de342 Correctly convert Python float to float64 when passing argument as Tensor (#136413)
I can't actually test the Dynamo codegen fix as it is impossible to
directly use the Tensor at the moment.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136413
Approved by: https://github.com/bobrenjc93
ghstack dependencies: #136599
2024-09-26 16:50:13 +00:00
11fd55827d Make CLOSURE_VARS construction lazy (#136599)
This makes us less likely to hit import cycle problems with torch

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136599
Approved by: https://github.com/anijain2305
2024-09-26 16:50:13 +00:00
ff2360c733 [FlexAttention] Reduce expensive test time by 10x (#136677)
Now that we support non 128 divisble sequence lengths; drops expensive tests by like 10x
Before
```Shell
46.32s call     test/inductor/test_flex_attention.py::TestFlexAttention::test_aot_eager_gradcheck_score_mod1
45.61s call     test/inductor/test_flex_attention.py::TestFlexAttention::test_aot_eager_gradcheck_score_mod2
44.45s call     test/inductor/test_flex_attention.py::TestFlexAttention::test_aot_eager_gradcheck_score_mod3
43.61s call     test/inductor/test_flex_attention.py::TestFlexAttention::test_aot_eager_gradcheck_score_mod0
```

After:
```Shell
4.25s call     test/inductor/test_flex_attention.py::TestFlexAttention::test_aot_eager_gradcheck_score_mod5
4.20s call     test/inductor/test_flex_attention.py::TestFlexAttention::test_aot_eager_gradcheck_score_mod4
4.19s call     test/inductor/test_flex_attention.py::TestFlexAttention::test_aot_eager_gradcheck_score_mod1
4.04s call     test/inductor/test_flex_attention.py::TestFlexAttention::test_aot_eager_gradcheck_score_mod2
3.99s call     test/inductor/test_flex_attention.py::TestFlexAttention::test_aot_eager_gradcheck_score_mod0
3.98s call     test/inductor/test_flex_attention.py::TestFlexAttention::test_aot_eager_gradcheck_score_mod3
````

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136677
Approved by: https://github.com/Chillee
ghstack dependencies: #136673
2024-09-26 16:40:21 +00:00
840c6b7a68 [FlexAttention] Add Better error message for cpu tensors (#136673)
Partially address: #136525

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136673
Approved by: https://github.com/Chillee
2024-09-26 16:40:21 +00:00
ddab704b28 Use wildcard for portion of AMI version number (#136764)
Rather than specifying a specific version number for the AMIs, use wildcards for the date section.

Issue: pytorch/pytorch#136762

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136764
Approved by: https://github.com/ZainRizvi
2024-09-26 16:39:25 +00:00
cyy
59e8f8228f [3/N] Fix clang-tidy warnings in torch/csrc/lazy (#136705)
Follows #136634
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136705
Approved by: https://github.com/Skylion007
2024-09-26 16:29:43 +00:00
31c0467594 Add Triton CPU as an Inductor backend (#133408)
The goal is to use Inductor-generated kernels to stress test the new Triton CPU backend.

Differential Revision: [D63298968](https://our.internmc.facebook.com/intern/diff/D63298968)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133408
Approved by: https://github.com/jansel, https://github.com/blaine-rister, https://github.com/malfet
2024-09-26 15:35:26 +00:00
68579ef665 [EZ][MPS] Extend arange to bfloat16 (#136754)
RangeFactories class is the only one that uses `AT_DISPATCH_MPS_TYPES`

Fixes https://github.com/pytorch/pytorch/issues/136624
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136754
Approved by: https://github.com/Skylion007
2024-09-26 15:33:45 +00:00
73ec76ed50 [MPS] Implement isposinf and isneginf (#136689)
Not sure, why `isinf` is a composite op, but those needs to be implemented by hand.

Implementation is a trivial call to
```objc
[mpsGraph equalWithPrimaryTensor:input
                 secondaryTensor:[mpsGraph constantWithScalar:std::numeric_limits<T>::infinity()
                                                     dataType:input.dataType]]
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136689
Approved by: https://github.com/Skylion007
2024-09-26 15:33:20 +00:00
d05645841e Update get_device_properties to take in optional device (#136683)
Aligns behavior with the rest of cuda's device info query methods

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136683
Approved by: https://github.com/eqy
2024-09-26 15:07:31 +00:00
d5e4a20c17 Revert "Introduce _ArglessActivation base class for parameterless activation functions (#136296)"
This reverts commit dda0e4de32b29098f25f9b2889423c9446680cc1.

Reverted https://github.com/pytorch/pytorch/pull/136296 on behalf of https://github.com/atalman due to Breaks Internal CI. Error: Too many arguments [19]: Call `nn.modules.activation._ArglessActivation.__init__` expects 0 positional arguments, 1 was provided. ([comment](https://github.com/pytorch/pytorch/pull/136296#issuecomment-2377091280))
2024-09-26 14:12:12 +00:00
4150ab44a4 Fix composite op redispatch for NJT in inference mode (#134683)
Prior to this PR, calling `reshape()` under `inference_mode()` would throw a `NotImplementedError`. This is because `inference_mode()` disables autograd key dispatch, incidentally preventing the decomposition of reshape for NJT.

This PR fixes this by redispatching on the `CompositeImplicitAutogradNestedTensor` key whenever a composite implicit op is encountered in `NJT.__torch_dispatch__()`. This fixes reshape and any other composite implicit ops underneath `inference_mode()`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134683
Approved by: https://github.com/soulitzer, https://github.com/albanD
ghstack dependencies: #136566
2024-09-26 14:10:53 +00:00
f8debd5d83 Fix wrapper subclass reentrant dispatch + TorchDispatchMode (#136566)
Fixes #136565

This PR makes the python fallback robust to the case where there are no active modes & no tensors with the Python key. In this case, simply redispatch with the Python key disabled.

This was found when trying to use reentrant dispatch for NJT to get decompositions under `inference_mode()` when the autograd key is disabled.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136566
Approved by: https://github.com/bdhirsh
2024-09-26 14:06:51 +00:00
963e793e1b [Inductor][CPP] Optimize WOQ INT8 wgt dequant in AMX GEMM template (#136630)
**Summary**
Optimize the WOQ int8 AMX performance by changing the int8 -> bf16 conversion.
Earlier, 16 int8 elements were being loaded at a time & converted to 16 BF16 elements.
With this change, 32 int8 elements will be loaded at a time, and converted to a cache-line of 32 BF16 elements more efficiently.

Performance before
```
AUTOTUNE _weight_int8pack_mm(4096x4096, 4096x4096, 4096)
  cpp_packed_gemm_0 38.0439 ms 100.0%
  _weight_int8pack_mm 50.2524 ms 75.7%
SingleProcess AUTOTUNE benchmarking takes 1.1087 seconds and 1.9791 seconds precompiling
AUTOTUNE _weight_int8pack_mm(4096x4096, 11008x4096, 11008)
  cpp_packed_gemm_4 78.2038 ms 100.0%
  _weight_int8pack_mm 119.1962 ms 65.6%
SingleProcess AUTOTUNE benchmarking takes 1.9274 seconds and 1.9949 seconds precompiling
AUTOTUNE _weight_int8pack_mm(4096x11008, 4096x11008, 4096)
  cpp_packed_gemm_6 79.2368 ms 100.0%
  _weight_int8pack_mm 118.3212 ms 67.0%
SingleProcess AUTOTUNE benchmarking takes 1.9200 seconds and 2.0015 seconds precompiling
AUTOTUNE _weight_int8pack_mm(4096x4096, 32000x4096, 32000)
  cpp_packed_gemm_224 225.7201 ms 100.0%
  _weight_int8pack_mm 388.5588 ms 58.1%
```

Performance after this PR
```
AUTOTUNE _weight_int8pack_mm(4096x4096, 4096x4096, 4096)
  cpp_packed_gemm_0 11.0086 ms 100.0%
  _weight_int8pack_mm 50.2918 ms 21.9%
SingleProcess AUTOTUNE benchmarking takes 1.0837 seconds and 2.0301 seconds precompiling
AUTOTUNE _weight_int8pack_mm(4096x4096, 11008x4096, 11008)
  cpp_packed_gemm_4 24.3528 ms 100.0%
  _weight_int8pack_mm 119.8492 ms 20.3%
SingleProcess AUTOTUNE benchmarking takes 1.8303 seconds and 1.8195 seconds precompiling
AUTOTUNE _weight_int8pack_mm(4096x11008, 4096x11008, 4096)
  cpp_packed_gemm_6 24.6148 ms 100.0%
  _weight_int8pack_mm 119.1908 ms 20.7%
SingleProcess AUTOTUNE benchmarking takes 1.8315 seconds and 1.8352 seconds precompiling
AUTOTUNE _weight_int8pack_mm(4096x4096, 32000x4096, 32000)
  cpp_packed_gemm_224 78.1369 ms 100.0%
  _weight_int8pack_mm 387.6289 ms 20.2%
SingleProcess AUTOTUNE benchmarking takes 4.5059 seconds and 1.8010 seconds precompiling
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136630
Approved by: https://github.com/jgong5
ghstack dependencies: #136353
2024-09-26 08:41:58 +00:00
77fba0c407 [PT2][Optimus] Fix a group batch fusion corner case (#136650)
Summary:
We have a user report on BA model that it raised "AttributeError: 'SymFloat' object has no attribute 'shape'", thus we add type check for the meta node.

See more context in the post
https://fb.workplace.com/groups/1075192433118967/permalink/1510477489590457/

Test Plan:
# local reproduce

```
CUDA_VISIBLE_DEVICES=3 OC_CAUSE=1 buck2 run mode/opt //scripts/jackiexu0313/pt2:local_model_with_pt2 -- --test_mode split-batch-decompose --flow_id 646303196
```

P1609807876

# E2E

before fix

f646303196

after fix

Differential Revision: D63399959

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136650
Approved by: https://github.com/ezyang
2024-09-26 06:35:11 +00:00
d1bb8e828f Add deterministic path for CUDA cumsum (#136224)
Change `cumsum` to call its decomposition when `use_deterministic_algorithms(True)` and input is CUDA.

Fixes #89492

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136224
Approved by: https://github.com/ezyang, https://github.com/justinchuby
2024-09-26 04:52:05 +00:00
b408591b53 Revert "[Flex Attention] fix block size order (#136657)"
This reverts commit 529b6ab0bb9f8800ed795ec8e4fa1f0e8042bb0a.

Reverted https://github.com/pytorch/pytorch/pull/136657 on behalf of https://github.com/huydhn due to Sorry for reverting your change but some test_flex_attention is failing in trunk after this change 529b6ab0bb ([comment](https://github.com/pytorch/pytorch/pull/136657#issuecomment-2375824802))
2024-09-26 04:06:41 +00:00
cyy
3c542ce831 [Reland] Check function declarations of COREML code (#136070)
Reland of #135467 by fixing periodic workflows.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136070
Approved by: https://github.com/ezyang
2024-09-26 03:52:06 +00:00
042af7ec53 [BE] [MPS] Use validation helper for input tensors (#134609)
Small refactor to use already existing helper with equivalent behavior.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134609
Approved by: https://github.com/malfet
2024-09-26 03:47:30 +00:00
e4d32d2194 Improve data-dependent-output meta kernel error message (#136671)
Test Plan:
- code reading
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136671
Approved by: https://github.com/williamwen42
2024-09-26 03:46:04 +00:00
190e09d8b6 [Inductor UT] Generalize device-bias code introduced from #134874 and (#136596)
[Inductor UT] Generalize device-bias code introduced from #134874 and fix unexpected success test cases.
Fix #136595

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136596
Approved by: https://github.com/EikanWang, https://github.com/jansel

Co-authored-by: Yu, Guangye <guangye.yu@intel.com>
2024-09-26 02:56:59 +00:00
dda0e4de32 Introduce _ArglessActivation base class for parameterless activation functions (#136296)
Fixes #133683
Fixes #133684
Fixes #133688

This PR introduces a new base class `_ArglessActivation` and refactors five existing activation functions to inherit from it. This change aims to improve documentation consistency and also API consistency with other activation functions that do have parameters and explicitly call `super().__init__()`

Key changes and considerations:
1. Added new class `_ArglessActivation`:
2. Refactored the following classes to inherit from `_ArglessActivation`:
   - Sigmoid
   - Tanh
   - Softsign
   - Tanhshrink
   - Softmax2d
3. Performance consideration:
   - This change introduces a slight overhead for creating a new stack frame and handling an additional function call on every instance creation
   - The impact is expected to be minimal in most use cases

Docs view before:
<img width="425" alt="Screen Shot 2024-09-18 at 3 00 22 PM" src="https://github.com/user-attachments/assets/ca0d1000-44c5-4c52-b344-68f7e170bafe">

Docs view after:
<img width="431" alt="Screen Shot 2024-09-18 at 3 00 52 PM" src="https://github.com/user-attachments/assets/f7ceb8f3-a2a2-4fd6-a2b8-39105a02bcbd">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136296
Approved by: https://github.com/mikaylagawarecki
2024-09-26 02:45:05 +00:00
d0456b4274 noop on torch.library APIs under torch::deploy (multipy) (#136645)
Fixes https://github.com/pytorch/pytorch/issues/136177

The motivation is that torch::deploy doesn't handle this well. The
workaround for users is to use C++ custom ops.

All torch.library APIs ultimately go through the torch.library.Library
object, so we add checks to noop for torch::deploy there.

Test Plan:
- new test
- going to test this internally and hope nothing breaks.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136645
Approved by: https://github.com/ezyang
2024-09-26 02:34:34 +00:00
5c78c6b05a [CI] Switch aarch64 dashboard run back to nightly (#136643)
Summary: Reduce the frequency of the aarch64 dashboard CI run since we don't need to monitor its instability anymore.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136643
Approved by: https://github.com/huydhn
2024-09-26 01:26:05 +00:00
141cae2eb8 [pipelining] Fix more leaks and check leaks in tests (#136584)
Fix two more leaks of the same variety as #136507 (see that PR desc and attached gdoc for debug details).

This time, also add a test-time check that helped to discover new leaks and ensure we won't accidently regress.

Adds `check_tensor_leak` util which internally asserts no tensors are being kept alive by other objects involved in py ref cycles.

Uses objgraph for a nice debug utility when a leak is found.

Credit to @H-Huang for pointing out objdump and helping debug the 'param_group["intermediates"]` leak.

I manually confirmed that all 3 of the leaks identified/fixed so far are caught by the unit test and checker.

Sample output, if I re-introduce a leak by commenting out `del param_group["intermediates"]` in _backward.py,
and run `python test/distributed/pipelining/test_schedule_multiproc.py -k test_schedule_with_native_zero_bubble`:

```
warnings.warn(
/data/users/whc/pytorch/torch/testing/_internal/common_utils.py:5341: UserWarning: 34 tensors were found in the garbage. Did you introduce a reference cycle?
warnings.warn(
/data/users/whc/pytorch/torch/testing/_internal/common_utils.py:5347: UserWarning: Dumping first 1 objgraphs of leaked tensors rendered to png
Graph written to /tmp/objgraph-ztz642h3.dot (19 nodes)
Graph viewer (xdot) not found, generating a png instead
Image generated as /tmp/objgraph-ztz642h3.png
```

rendering of ` /tmp/objgraph-ztz642h3.png`:
<img width="1671" alt="image" src="https://github.com/user-attachments/assets/9098ff29-224c-4533-935b-83c210ac2e22">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136584
Approved by: https://github.com/kwen2501, https://github.com/H-Huang
ghstack dependencies: #136507

Co-authored-by: Howard Huang <howardhuang@fb.com>
2024-09-26 01:10:40 +00:00
e8f1dd6ba0 Fix hardcoded ROCm paths in Caffe2Targets.cmake (#136283)
Fixes #131701

Use CMake imported targets more consistently to eliminate hardcode paths.

Here is the new relevant sections of Caffe2Targets.cmake:
```
set_target_properties(c10_hip PROPERTIES
  INTERFACE_INCLUDE_DIRECTORIES "${_IMPORT_PREFIX}/include"
  INTERFACE_LINK_LIBRARIES "c10;hip::amdhip64"
)
```

```
set_target_properties(torch_hip PROPERTIES
  INTERFACE_COMPILE_DEFINITIONS "USE_C10D_NCCL"
  INTERFACE_COMPILE_OPTIONS "-fPIC;-D__HIP_PLATFORM_AMD__=1;-DCUDA_HAS_FP16=1;-DUSE_ROCM;-D__HIP_NO_HALF_OPERATORS__=1;-D__HIP_NO_HALF_CONVERSIONS__=1;-DTORCH_HIP_VERSION=602;-Wno-shift-count-negative;-Wno-shift-count-overflow;-Wno-duplicate-decl-specifier;-DCAFFE2_USE_MIOPEN;-DTHRUST_DEVICE_SYSTEM=THRUST_DEVICE_SYSTEM_HIP;-std=c++17;-DHIPBLAS_V2;-DHIP_NEW_TYPE_ENUMS"
  INTERFACE_INCLUDE_DIRECTORIES "${_IMPORT_PREFIX}/include"
  INTERFACE_LINK_LIBRARIES "c10_hip;torch_cpu_library;hip::amdhip64;MIOpen;hiprtc::hiprtc;roc::hipblaslt;roc::hipblas;hip::hipfft;hip::hiprand;roc::hipsparse;roc::hipsolver"
)
```

HIPCUB dependency was not actually used; which is why it is removed here as the imported target had undesirable side effects.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136283
Approved by: https://github.com/jeffdaily, https://github.com/Skylion007, https://github.com/jithunnair-amd, https://github.com/atalman
2024-09-26 00:34:43 +00:00
f3dd1721f4 [Update] Update note for Getting Started with PyTorch on Intel GPUs (#129946)
remove the hardware and software prerequisites and set up env part.
keep the prerequisites section and link to pytorch prerequistes for intel gpus for driver install, intel support package install and env set up
https://www.intel.com/content/www/us/en/developer/articles/tool/pytorch-prerequisites-for-intel-gpus.html
Update the support for Intel Client GPU MTL-H
Update inference & training examples

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129946
Approved by: https://github.com/seemethere
2024-09-26 00:22:05 +00:00
9223c16208 Revert "Fix constant propagation in builtins and UserClasses (#131354)"
This reverts commit dd4a51b39aa02cba23b3a387b41c5026770d9220.

Reverted https://github.com/pytorch/pytorch/pull/131354 on behalf of https://github.com/atalman due to Breaks torchrec tests ([comment](https://github.com/pytorch/pytorch/pull/131354#issuecomment-2375417145))
2024-09-25 23:01:03 +00:00
ecc15c4f89 [AOTI] Fix a missing aoti_torch_check symbol issue (#136669)
Summary: When Inductor generates cpp kernels, they should be pure cpp loops which are independent to libtorch as much as possible.

Differential Revision: D63403473

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136669
Approved by: https://github.com/henrylhtsang
2024-09-25 22:56:10 +00:00
b7a5c7d331 Do not XFAIL test_segfault in fbcode (#136661)
https://github.com/pytorch/pytorch/pull/136252 silence the failure on OSS, but the test actually passed on fbcode [T202241133](https://www.internalfb.com/intern/tasks/?t=202241133)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136661
Approved by: https://github.com/malfet
2024-09-25 22:26:24 +00:00
8d65d9f11b Constraint setuptools to 72.1.0 or older in requirements.txt (#136489)
FIXES: https://github.com/pytorch/pytorch/issues/136541

Setuptools>=74.0.0 has deprecated support for some functions in distutils, and so the builds run into error such as ```AttributeError: module 'distutils' has no attribute '_msvccompiler'```. Also, the pytorch builds have setuptools pin to 72.1.0 according to these PRs: https://github.com/pytorch/builder/pull/1995 and 89d9a8cf6f. So, until there is a fix to change the function usage in accordance with latest setuptools, the 72.1.0 version works fine.

Also observed in CI jobs: https://github.com/pytorch/pytorch/actions/runs/10979326524
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136489
Approved by: https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2024-09-25 22:06:05 +00:00
c9d12f6360 [inductor][memory] add signpost event for memory pass (#136538)
Add logging to scuba table for internal models.

For verification, I triggered a sample workflow internally and checked the scuba table logging to make sure the `Paramaters` column has the expected loggings, see [here](https://fburl.com/scuba/workflow_signpost/39h7qo9s).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136538
Approved by: https://github.com/yf225
2024-09-25 21:47:46 +00:00
b5c2a657ae Add zou3519 to CODEOWNERS for HOPs (#136679)
There are some tricky things that I want to guard against
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136679
Approved by: https://github.com/Chillee
2024-09-25 21:29:48 +00:00
289df45cee Revert "[Dynamo] Trace enter/exit of TorchFunctionModes (#135422)" (#136590)
This reverts commit 7743149b2be4a9eba7e0997ccdc6abe552bec266.

Reverts
* https://github.com/pytorch/pytorch/pull/135503
* https://github.com/pytorch/pytorch/pull/135502
* https://github.com/pytorch/pytorch/pull/135422

This passes this test. Earlier, the getitem would stay like a getitem in the Fx graph. But now the fake tensor propagations fails saying that .item is called. It seems that torch function is not getting triggered while fake tensor propagation.

```
import torch
from torch.nn.attention.flex_attention import BlockMask, _mask_mod_signature, _score_mod_signature, flex_attention
from torch._inductor.lowering import make_pointwise, register_lowering
from torch._inductor.virtualized import ops
from torch.nn.attention.flex_attention import create_block_mask

torch.set_default_device('cuda')

flex_attention = torch.compile(flex_attention, dynamic=False)

prefix_lengths = torch.arange(8)
def prefix_lm(b, h, q, kv):
    return prefix_lengths[b] >= kv

mask = create_block_mask(prefix_lm, 8, None, 512, 512, _compile=True)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136590
Approved by: https://github.com/Chillee
2024-09-25 21:10:43 +00:00
529b6ab0bb [Flex Attention] fix block size order (#136657)
`create_block_mask` currently gives wrong BLOCK_SIZE and shape when using non-default block size `(128,128)`.
This PR fixes the issue by using BLOCK_SIZE order `(Q_BLOCK_SIZE, KV_BLOCK_SIZE)`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136657
Approved by: https://github.com/Chillee, https://github.com/drisspg
2024-09-25 21:08:40 +00:00
76b044d7cb Don't actually import module when checking if its valid (#136548)
Summary: If you actually import the module, you might end up with some import cycle situation where a module is imported too early and accesses things that are not initialized yet.

Test Plan:
sandcastle and ossci

```
TORCH_LOGS=+torch._inductor.codecache buck run mode/opt caffe2/benchmarks/dynamo:torchbench
```

Differential Revision: D63330224

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136548
Approved by: https://github.com/Skylion007
2024-09-25 20:47:32 +00:00
11c5f9ac3b Use amazon linux 2023 runners for Docker builds (#136544)
Migrate these builds to linux 2023. We want to build and test the Docker images in CD.

Looks like we are hitting this issue: https://github.com/docker/buildx/issues/379 when trying to build Docker on Amazon Linux 2023.

Conda Docker build is timing out. While Manywheel is executing but failing because BUILDKIT is turned off: https://github.com/pytorch/pytorch/actions/runs/11036043157/job/30653543264?pr=136544

Proposed Solution is to fix it in user_data . Please see: https://github.com/pytorch/test-infra/issues/5712

I see docker builds are executed successfully here: https://github.com/pytorch/pytorch/actions/runs/11040149229/job/30667448668?pr=136544

Workaround timeout problem (reported in https://bugzilla.redhat.com/show_bug.cgi?id=1537564 ) by configuring number of open files per container to 1048576
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136544
Approved by: https://github.com/ZainRizvi

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2024-09-25 20:39:56 +00:00
13b0baf2a1 [FX] Update _inline_module util function to work with both args and kwargs (#136631)
Summary: Previously `_inline_module ` helper function only works with submodules that have args specified. This diff updates the util function to look for input arguments from submodule kwargs first using placeholder node names, then fallback to list of args if node name not found.

Test Plan:
```
buck2 run @//mode/{opt,mtia,inplace} //glow/fb/fx/fba/tests:test_fba_inductor -- -r test_connected_fusions
```

Differential Revision: D63347675

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136631
Approved by: https://github.com/jfix71
2024-09-25 20:20:57 +00:00
a8ed873ba2 Add missing input "eps" to adam docs (#135191)
Minor fix for missing input argument in the Adam optimizer docs page.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135191
Approved by: https://github.com/janeyx99
2024-09-25 20:17:23 +00:00
cyy
6aa6bd4ca5 [Distributed] [12/N] Fix clang-tidy warnings in torch/csrc/distributed/ (#136528)
Follows #136439. A dangling reference to qualifiedName was found and fixed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136528
Approved by: https://github.com/kwen2501
2024-09-25 20:12:08 +00:00
5a29a06aa3 [AMD][inductor] do not use float64 on AMD internally (#136441)
Summary:
Internal AMD triton seems to have issue with float64 constant:

```
### Most recent error lines found on the logs:
E0920 13:23:56.391000 2026 torch/_inductor/runtime/triton_heuristics.py:446] [2/2]                ^
E0920 13:23:56.391000 2026 torch/_inductor/runtime/triton_heuristics.py:446] [2/2]         tmp8 = tl.broadcast_to((libdevice.llrint((tl.full([1], 1.00000000000000, tl.float64))*(ks3.to(tl.float64)))) / ks1, [XBLOCK, RBLOCK])
E0920 13:23:56.391000 2026 torch/_inductor/runtime/triton_heuristics.py:446] [2/2]         tmp7 = tmp5 + tmp6
E0920 13:23:56.391000 2026 torch/_inductor/runtime/triton_heuristics.py:446] [2/2]         tmp6 = 0.5
E0920 13:23:56.391000 2026 torch/_inductor/runtime/triton_heuristics.py:446] [2/2]         tmp5 = tmp4.to(tl.float32)
E0920 13:23:56.391000 2026 torch/_inductor/runtime/triton_heuristics.py:446] [2/2]         tmp4 = (((r3 + (x0*((17 + (16*ks0*ks1)) // 18))) % ks2) // ks0) % ks1
E0920 13:23:56.391000 2026 torch/_inductor/runtime/triton_heuristics.py:446] [2/2]         tmp3 = tmp2.to(tl.int1)
E0920 13:23:56.391000 2026 torch/_inductor/runtime/triton_heuristics.py:446] [2/2]         tmp2 = tmp0 < tmp1
E0920 13:23:56.391000 2026 torch/_inductor/runtime/triton_heuristics.py:446] [2/2]         tmp1 = 16*ks0*ks1
E0920 13:23:56.391000 2026 torch/_inductor/runtime/triton_heuristics.py:446] [2/2]         tmp0 = r3 + (x0*((17 + (16*ks0*ks1)) // 18))
E0920 13:23:56.391000 2026 torch/_inductor/runtime/triton_heuristics.py:446] [2/2]         r3 = rindex
E0920 13:23:56.391000 2026 torch/_inductor/runtime/triton_heuristics.py:446] [2/2]         rmask = rindex < rnumel
E0920 13:23:56.391000 2026 torch/_inductor/runtime/triton_heuristics.py:446] [2/2]         rindex = roffset + rbase
E0920 13:23:56.391000 2026 torch/_inductor/runtime/triton_heuristics.py:446] [2/2] triton.compiler.errors.CompilationError: at 26:15:
E0920 13:23:56.391000 2026 torch/_inductor/runtime/triton_heuristics.py:446] [2/2]     return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns)
```

Bisecting showing this error introduced by D62465575

This diff tries to not convert constant to float64 on AMD, and emu1.4 predictor now can run on AMD with rocm6.0.

Test Plan:
rocm6.0 can work
```
TORCHINDUCTOR_AUTOTUNE_REMOTE_CACHE=1 HIP_FORCE_DEV_KERNARG=1 HIP_GRAPH=--use-cuda-graph PYTORCH_MIOPEN_SUGGEST_NHWC=1 TORCHINDUCTOR_LAYOUT_OPTIMIZATION=1 CUDA_VISIBLE_DEVICES="2" TORCH_LOGS="recompiles,cudagraphs" buck2 run @//mode/opt-amd-gpu -c fbcode.rocm_ck_rtz=true -m rocm60 fblearner/predictor/py/applications/photogen:ip_python_predictor_photogen_cm -- --model=photogen_v1p4_9b --thrift_server_port=15008 --max_predict_calls=1 --enable_tunable_op --load_from_torch_package=genai:937233660_1
```

emu1.4 predictor on AMD fails with rocm6.2 with some other triton errors (https://www.internalfb.com/phabricator/paste/view/P1603842354)

Differential Revision: D63263806

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136441
Approved by: https://github.com/houseroad
2024-09-25 19:13:17 +00:00
37f340c1e5 [EZ] Remove remaining amz2023 runner variant references (#136540)
Validated no jobs use the amz2023 runner variant anymore ([proof](https://github.com/search?type=code&q=org%3Apytorch+%2F%5Cbamz2023%5Cb%2F+&p=1)) so removing all references to it

Explicit references to the amz2023 runner type variants were removed in the following PRs:
- https://github.com/pytorch/ignite/pull/3285
- https://github.com/pytorch/ao/pull/887
- https://github.com/pytorch/fbscribelogger/pull/1
- https://github.com/pytorch/pytorch/pull/134355

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136540
Approved by: https://github.com/huydhn, https://github.com/malfet
2024-09-25 19:01:00 +00:00
9c2c61d2dd [inductor] ELEMENTS_PER_WARP_32 -> ONE_ELEMENT_PER_THREAD (#136472)
AMD devices have 64 elements per thread; this PR makes the handling of the "ELEMENTS_PER_WARP_32" generic and uses DeviceProperties.warp_size to determine the warp size instead of hard-coding the warp size as 32. It also renames the enum value. Added a unit test for this.

Note: I left the old enum option (ELEMENTS_PER_WARP_32) as is instead of renaming it. I'm not sure whether we expect should caches to get invalidated here; if this concern is valid, then there's a risk that this would get updated, but some model could use the cached inductor code, which would reference "ELEMENTS_PER_WARP_32", which would no longer exist.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136472
Approved by: https://github.com/jansel
2024-09-25 18:21:09 +00:00
cyy
a259fbf72c [2/N] Fix clang-tidy warnings in torch/csrc/lazy (#136634)
Follows #134655
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136634
Approved by: https://github.com/Skylion007
2024-09-25 18:08:29 +00:00
0b38fa154a Fix meta registry in export (#136492)
Summary: Title

Test Plan: CI

This fixes some breaking tests in executorch. I think the root cause is when we have aten::matmul which we are not preserving, we register meta implementation from C++ side. It seems like the C++ kernel doesn't work well with mix of FakeTensor and real tensor. This PR sidesteps this problem by always preferring python CIA decomp over C++ Cia decomp

Differential Revision: D63297050

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136492
Approved by: https://github.com/bdhirsh
2024-09-25 17:53:02 +00:00
8582835499 [ONNX] Remove the operators test (#136335)
The tests are obsolete and hard to maintain.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136335
Approved by: https://github.com/xadupre, https://github.com/cyyever

Co-authored-by: Edward Z. Yang <ezyang@meta.com>
2024-09-25 17:44:18 +00:00
7cb6d31567 Dump partially traced make_fx graph in event of error to tlparse (#136508)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136508
Approved by: https://github.com/zou3519, https://github.com/bdhirsh, https://github.com/malfet
ghstack dependencies: #136533
2024-09-25 17:44:15 +00:00
9409274bc1 Fix bug in functional tensor decomp (#136600)
Summary: Previously we had a very bad bug where we don't allow any decomp on CIA. This never mattered before because we never had to actually push CIA decomp to Python key level in export.

Test Plan: CI

Differential Revision: D63363749

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136600
Approved by: https://github.com/bdhirsh
2024-09-25 17:37:50 +00:00
5d7ed02f52 [user-written triton kernels] specialize exprs if they are expected to be tl.constexpr (#136512)
Fixes #136504

If you have a tl.constexpr parameter to a triton kernel, and you pass in a SymNode, then, right now, you run into failures (see under 'constants'):

```
  File "/tmp/torchinductor_dberard/na/cnax67r5zmslz7bvdfizteaepj7fajpjallb3bu2gyetjcdqtbzj.py", line 14, in <module>
    triton_meta={'signature': {0: '*fp32', 1: '*fp32'}, 'device': DeviceProperties(type='cuda', index=0, cc=90, major=9, regs_per_multiprocessor=65536, max_threads_per_multi_processor=2048, multi_processor_count=132, warp_size=32), 'constants': {2: s0, 3: 256}, 'configs': [AttrsDescriptor(divisible_by_16=(0, 1), equal_to_1=())]},
torch._dynamo.exc.BackendCompilerFailed: backend='inductor' raised:
NameError: name 's0' is not defined
```

To fix this, we specialize on the value during dynamo tracing, so that we have a real integer when we do codegen.

Alternatives: specialize somewhere else (e.g. inductor); or figure out how to actually pass the value dynamically into the user-written kernel. However, if we try to pass a dynamic value, then we wouldn't be able to precompile the triton kernels in inductor or use AOTI.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136512
Approved by: https://github.com/oulgen, https://github.com/jansel, https://github.com/eellison
2024-09-25 17:12:11 +00:00
7c6d543a5b [export] fix _get_non_persistent_buffers for duplicates (#136552)
Summary: Export's method _get_non_persistent_buffers doesn't check duplicate submodules, so we run into state_dict related issues if non-persistent buffers exist on shared submodules.

Test Plan: test_export

Differential Revision: D63332976

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136552
Approved by: https://github.com/avikchaudhuri, https://github.com/tugsbayasgalan
2024-09-25 16:46:31 +00:00
aa80b82cea [hygiene] Delete dead alerting code (#136583)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136583
Approved by: https://github.com/clee2000
2024-09-25 15:44:46 +00:00
0232278b33 Fix comment posting permissions for check-labels.yml (#136610)
Currently it fails with

Error fetching https://api.github.com/repos/pytorch/pytorch/issues/136607/comments HTTP Error 403: Forbidden

(see https://github.com/pytorch/pytorch/actions/runs/11026434368/job/30622960113?pr=136607)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136610
Approved by: https://github.com/malfet
2024-09-25 15:43:19 +00:00
34711fe8c9 Fix test_skip_data_serialization pickle exception match (#136617)
The test is failing in trunk atm with the following error:

```
test_serialization.py::TestSerialization::test_skip_data_serialization_materialize_fake_False - AssertionError: "Can't pickle local object 'WeakValueDictionary.__init__.<locals>.remove'" does not match "Can't get local object 'WeakValueDictionary.__init__.<locals>.remove'"
```

for example, 36f0e61166

This comes from this cpython commit a3076c734d, and manifests in python 3.12.5 currently used in CI.  The failure doesn't happen when I try it out with 3.12.3 and 3.12.4.  Looking at the commit logs of https://github.com/python/cpython/commits/main/Lib/pickle.py, it looks like the exception message is changing back and forth, so I guess a regex match would capture both.
2024-09-25 08:35:46 -07:00
deb820602a viable/strict update: log push to s3 (#136470)
As stated in https://github.com/pytorch/test-infra/pull/5686, I cannot figure out a way to determine the push time from webhooks (other than when the webhook was sent, but that isn't super accurate either).  Instead, manually save a json file to s3 that contains information for the sha and date so that we can still get this information.

Relies on https://github.com/pytorch/test-infra/pull/5690

tested in https://github.com/pytorch/pytorch/pull/136387 (but I squashed so it's kinda hard to find now)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136470
Approved by: https://github.com/huydhn
2024-09-25 15:28:53 +00:00
e3b89ca124 Revert "Add deterministic path for CUDA cumsum (#136224)"
This reverts commit b1a02bf70824a4802411ddd5be1d3610e7a2e269.

Reverted https://github.com/pytorch/pytorch/pull/136224 on behalf of https://github.com/ezyang due to Failing internall CI ([comment](https://github.com/pytorch/pytorch/pull/136224#issuecomment-2374201626))
2024-09-25 14:11:01 +00:00
20a855bf01 [AOTI] Move stack_allocation logic from PythonWrapperCodegen (#136463)
Summary: Move stack_allocation logic from PythonWrapperCodegen to CppWrapperCpuArrayRef

Differential Revision: [D63319970](https://our.internmc.facebook.com/intern/diff/D63319970)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136463
Approved by: https://github.com/chenyang78
ghstack dependencies: #136062, #136461, #136462
2024-09-25 14:06:33 +00:00
5171b0e3c6 Revert "[ONNX] Remove the operators test (#136335)"
This reverts commit 9629835b1ccce8e72fc93bf95be13e3d53cb4871.

Reverted https://github.com/pytorch/pytorch/pull/136335 on behalf of https://github.com/ezyang due to I'll reland this, bear with me ([comment](https://github.com/pytorch/pytorch/pull/136335#issuecomment-2374183435))
2024-09-25 14:06:03 +00:00
070952aca5 [AOTI] Move stack_allocation logic from CppWrapperCpu (#136462)
Summary: Move stack_allocation logic from CppWrapperCpu to CppWrapperCpuArrayRef

Differential Revision: [D63300359](https://our.internmc.facebook.com/intern/diff/D63300359)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136462
Approved by: https://github.com/chenyang78
ghstack dependencies: #136062, #136461
2024-09-25 14:03:03 +00:00
5ad5f40283 [AOTI][reland] Create another wrapper class to handle ArrayRef (#136461)
Summary: Create another wrapper codegen class to handle ArrayRef for CPU. The goal is to simplify the regular cpp wrapper codegen logic and the generated cpp code.

Test Plan: CI

Differential Revision: [D63300361](https://our.internmc.facebook.com/intern/diff/D63300361)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136461
Approved by: https://github.com/angelayi, https://github.com/chenyang78
ghstack dependencies: #136062
2024-09-25 14:00:09 +00:00
25ab87c09b Add lint rule META_NO_CREATE_UNBACKED (#135870)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135870
Approved by: https://github.com/albanD
2024-09-25 13:33:56 +00:00
dd4a51b39a Fix constant propagation in builtins and UserClasses (#131354)
* Fixes https://github.com/pytorch/pytorch/issues/118675
* Replaces https://github.com/pytorch/pytorch/pull/118994

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131354
Approved by: https://github.com/jansel, https://github.com/anijain2305
2024-09-25 13:03:40 +00:00
a0c76ea853 Make test_skip_data_serialization regex more flexible (#136580)
Some CI machines seem to throw "Can't get local object" rather than
"Can't pickle local object".
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136580
Approved by: https://github.com/mikaylagawarecki
2024-09-25 11:27:23 +00:00
370c1c4297 [aotd] Fix rrelu compilation (#136008)
Issues:
https://github.com/pytorch/pytorch/issues/135083
https://github.com/pytorch/pytorch/issues/120292

rrelu decomposition contains mutation, copy_. Decompositions are executed below Functionalization, as a result AOT produces non-functional graph.

Also that decomposition is registered as python_dispatch kernel for AutogradCUDA.
Autograd dispatch happens above Functionalization, so registering it for Autograd to handle all backends makes functionalization running after this.

Testing:
```
python test/functorch/test_aotdispatch.py -k test_rrelu
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136008
Approved by: https://github.com/bdhirsh
2024-09-25 11:26:19 +00:00
c3fdf587b5 [inductor] [cpp] fix the check of template_buffer_has_other_users if no epilogue_nodes (#136518)
The `template_buffer_has_other_users` function checks the case where there're epilogue nodes and the template output has users other than these epilogue nodes.  When there's no epilogue nodes, the function could return `False` directly.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136518
Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5
ghstack dependencies: #136418
2024-09-25 10:25:07 +00:00
cabfbef6cf [pytorch][PR] [inductor] More fixes on the keys of constants and signature dictionaries (#136514)
Summary: Previous PR forgets to change two other places that also create `constants` and `signature`.

Test Plan:
Imported from GitHub, without a `Test Plan:` line.
 {F1884584338}

Differential Revision: D63027728

Pulled By: Myrthan

Co-authored-by: Jokeren <robinho364@gmail.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136514
Approved by: https://github.com/jansel

Co-authored-by: Jokeren <robinho364@gmail.com>
2024-09-25 09:34:14 +00:00
2e30c160ef [inductor] [cpp] fix max-autotune for single-thread dynamic shapes (#136418)
Fixes the compilation error of max-autotune for `maml_omniglot` (AMP and FP32) and `soft_actor_critic` (AMP) in Torchbench for single-thread dynamic shapes case:

```
/tmp/torchinductor_user/uv/cuvq6wenwp7us423onuvntkfx4cspmagha5beiknob7tiebzhupa.cpp: In function ‘void kernel(const bfloat16*, const bfloat16*, const bfloat16*, bfloat16*, int64_t)’:
/tmp/torchinductor_user/uv/cuvq6wenwp7us423onuvntkfx4cspmagha5beiknob7tiebzhupa.cpp:279:41: error: the value of ‘Mr_blocks’ is not usable in a constant expression
  279 |         constexpr int64_t m_block_end = Mr_blocks;
      |                                         ^~~~~~~~~
/tmp/torchinductor_user/uv/cuvq6wenwp7us423onuvntkfx4cspmagha5beiknob7tiebzhupa.cpp:237:19: note: ‘Mr_blocks’ was not initialized with a constant expression
  237 |     const int64_t Mr_blocks = (M + Mr - 1) / Mr;
      |                   ^~~~~~~~~
```

The PR also updates the UT to add a test for `BS`=512 in single thread.
The previous case has `BS`=1024 equal to the `K` and `N` value. The generated code does not have symbolic shapes thus fails to capture the above issue.
By adding a case of `BS`=512, the generated code will have symbolic shape for the M dim and is able to reproduce the issue that this PR is addressing.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136418
Approved by: https://github.com/jgong5
2024-09-25 09:24:05 +00:00
a0a1873148 [Inductor] Fix Triton tests after updating pybind11 to 2.13.6 (#136280)
https://github.com/pytorch/pytorch/pull/136087 update pybind11 to 2.13.6 and that new release has the feature which is expressed by [a new function](https://pybind11.readthedocs.io/en/latest/changelog.html#version-2-13-6-september-13-2024) `_pybind11_conduit_v1_`. The presence of this function breaks the serialization mechanisms used by Titon and in PyTorch itself.

Possible errors that have been noticed due to this change:

<details>
<summary> the first error </summary>

```bash
_________ KernelTests.test_layout_constraint_needs_fixed_stride_order __________
Traceback (most recent call last):
  File "/runner/_work/intel-xpu-backend-for-triton/intel-xpu-backend-for-triton/pytorch/test/inductor/test_triton_kernels.py", line 1072, in test_layout_constraint_needs_fixed_stride_order
    eager_out = f(x)
  File "/runner/_work/intel-xpu-backend-for-triton/intel-xpu-backend-for-triton/pytorch/test/inductor/test_triton_kernels.py", line 1068, in f
    arange_out(x, y)
  File "/runner/_work/intel-xpu-backend-for-triton/intel-xpu-backend-for-triton/pytorch/test/inductor/test_triton_kernels.py", line 1059, in arange_out
    kernel[grid](x, out, n_elements, BLOCK_SIZE=4)
  File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/triton/runtime/jit.py", line 330, in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
  File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/triton/runtime/jit.py", line 657, in run
    kernel = self.compile(
  File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/triton/compiler/compiler.py", line 315, in compile
    metadata_group[metadata_filename] = fn_cache_manager.put(json.dumps(metadata, default=vars), metadata_filename,
  File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/json/__init__.py", line 234, in dumps
    return cls(
  File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/json/encoder.py", line 199, in encode
    chunks = self.iterencode(o, _one_shot=True)
  File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/json/encoder.py", line 257, in iterencode
    return _iterencode(o, 0)
TypeError: vars() argument must have __dict__ attribute
```
</details>

<details>
<summary> the second error </summary>

```bash
________________ TestTritonWrapper.test_wrapper_using_gpu_seed _________________
Traceback (most recent call last):
  File "/cache/pytorch-c5e9d03a2da4b93481737594cbe2f5931fa569aa833f206a638189cad2c36d3c-11/test/inductor/test_triton_wrapper.py", line 40, in test_wrapper_using_gpu_seed
    out = f(x, y)
  File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py", line 465, in _fn
    return fn(*args, **kwargs)
  File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_dynamo/convert_frame.py", line 1292, in __call__
    return self._torchdynamo_orig_callable(
  File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_dynamo/convert_frame.py", line 1087, in __call__
    result = self._inner_convert(
  File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_dynamo/convert_frame.py", line 530, in __call__
    return _compile(
  File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_dynamo/convert_frame.py", line 933, in _compile
    guarded_code = compile_inner(code, one_graph, hooks, transform)
  File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_dynamo/convert_frame.py", line 675, in compile_inner
    return _compile_inner(code, one_graph, hooks, transform)
  File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_utils_internal.py", line 87, in wrapper_function
    return function(*args, **kwargs)
  File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_dynamo/convert_frame.py", line 708, in _compile_inner
    out_code = transform_code_object(code, transform)
  File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_dynamo/bytecode_transformation.py", line 1322, in transform_code_object
    transformations(instructions, code_options)
  File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_dynamo/convert_frame.py", line 220, in _fn
    return fn(*args, **kwargs)
  File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_dynamo/convert_frame.py", line 643, in transform
    tracer.run()
  File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_dynamo/symbolic_convert.py", line 2776, in run
    super().run()
  File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_dynamo/symbolic_convert.py", line 979, in run
    while self.step():
  File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_dynamo/symbolic_convert.py", line 891, in step
    self.dispatch_table[inst.opcode](self, inst)
  File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_dynamo/symbolic_convert.py", line 2967, in RETURN_VALUE
    self._return(inst)
  File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_dynamo/symbolic_convert.py", line 2952, in _return
    self.output.compile_subgraph(
  File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_dynamo/output_graph.py", line 1117, in compile_subgraph
    self.compile_and_call_fx_graph(tx, list(reversed(stack_values)), root)
  File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_dynamo/output_graph.py", line 1369, in compile_and_call_fx_graph
    compiled_fn = self.call_user_compiler(gm)
  File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_dynamo/output_graph.py", line 1416, in call_user_compiler
    return self._call_user_compiler(gm)
  File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_dynamo/output_graph.py", line 1465, in _call_user_compiler
    raise BackendCompilerFailed(self.compiler_fn, e).with_traceback(
  File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_dynamo/output_graph.py", line 1446, in _call_user_compiler
    compiled_fn = compiler_fn(gm, self.example_inputs())
  File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_dynamo/repro/after_dynamo.py", line 130, in __call__
    compiled_gm = compiler_fn(gm, example_inputs)
  File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/__init__.py", line 2235, in __call__
    return compile_fx(model_, inputs_, config_patches=self.config)
  File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_inductor/compile_fx.py", line 1528, in compile_fx
    return aot_autograd(
  File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_dynamo/backends/common.py", line 72, in __call__
    cg = aot_module_simplified(gm, example_inputs, **self.kwargs)
  File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_functorch/aot_autograd.py", line 1071, in aot_module_simplified
    compiled_fn = dispatch_and_compile()
  File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_functorch/aot_autograd.py", line 1056, in dispatch_and_compile
    compiled_fn, _ = create_aot_dispatcher_function(
  File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_functorch/aot_autograd.py", line 522, in create_aot_dispatcher_function
    return _create_aot_dispatcher_function(
  File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_functorch/aot_autograd.py", line 759, in _create_aot_dispatcher_function
    compiled_fn, fw_metadata = compiler_fn(
  File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_functorch/_aot_autograd/jit_compile_runtime_wrappers.py", line 179, in aot_dispatch_base
    compiled_fw = compiler(fw_module, updated_flat_args)
  File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_inductor/compile_fx.py", line 1357, in fw_compiler_base
    return _fw_compiler_base(model, example_inputs, is_inference)
  File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_inductor/compile_fx.py", line 1428, in _fw_compiler_base
    return inner_compile(
  File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_inductor/compile_fx.py", line 479, in compile_fx_inner
    return wrap_compiler_debug(_compile_fx_inner, compiler_name="inductor")(
  File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_dynamo/repro/after_aot.py", line 85, in debug_wrapper
    inner_compiled_fn = compiler_fn(gm, example_inputs)
  File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_inductor/compile_fx.py", line 665, in _compile_fx_inner
    compiled_graph = FxGraphCache.load(
  File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_inductor/codecache.py", line 1341, in load
    compiled_graph = compile_fx_fn(
  File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_inductor/compile_fx.py", line 574, in codegen_and_compile
    compiled_graph = fx_codegen_and_compile(gm, example_inputs, **fx_kwargs)
  File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_inductor/compile_fx.py", line 882, in fx_codegen_and_compile
    compiled_fn = graph.compile_to_fn()
  File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_inductor/graph.py", line 1952, in compile_to_fn
    return self.compile_to_module().call
  File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_inductor/graph.py", line 1878, in compile_to_module
    return self._compile_to_module()
  File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_inductor/graph.py", line 1906, in _compile_to_module
    mod = PyCodeCache.load_by_key_path(
  File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_inductor/codecache.py", line 2866, in load_by_key_path
    mod = _reload_python_module(key, path)
  File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_inductor/runtime/compile_tasks.py", line 45, in _reload_python_module
    exec(code, mod.__dict__, mod.__dict__)
  File "/tmp/tmps59zkbew/kg/ckgkb4gt5fs5pll4o7fqawppsmdezu5h52cq6nmrvi3yy6j7ddq4.py", line 45, in <module>
  File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_inductor/async_compile.py", line 198, in triton
    kernel = TritonCodeCache.load(kernel_name, source_code)
  File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_inductor/codecache.py", line 2916, in load
    return _module_to_triton_kernel(PyCodeCache.load(source_code), kernel_name)
  File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_inductor/codecache.py", line 2853, in load
    return cls.load_by_key_path(key, path, linemap, attrs)
  File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_inductor/codecache.py", line 2866, in load_by_key_path
    mod = _reload_python_module(key, path)
  File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_inductor/runtime/compile_tasks.py", line 39, in _reload_python_module
    raise RuntimeError(
torch._dynamo.exc.BackendCompilerFailed: backend='inductor' raised:
RuntimeError: Failed to import /tmp/tmps59zkbew/g3/cg3zgxsidsjhdlz2lzvajvubdq6kg2x2hzd2kznfj43qwvlv33du.py
SyntaxError: invalid syntax (cg3zgxsidsjhdlz2lzvajvubdq6kg2x2hzd2kznfj43qwvlv33du.py, line 14)
```
</details>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136280
Approved by: https://github.com/etaf, https://github.com/jansel, https://github.com/EikanWang

Co-authored-by: Henry Schreiner <HenrySchreinerIII@gmail.com>
2024-09-25 08:09:46 +00:00
1cb265fafa [AILab][attempt2] Add TryExcept when decoding healthcheck port (#136574)
Summary:
## Context
The first attempt has lint error in OSS https://hud.pytorch.org/pr/pytorch/pytorch/136438#30553902641
{F1886895223}
## This Diff
Fix error message with try catch
Error Message:
```
  File "/packages/aps_models.examples.dlrm.lite/dlrm_train_app-inplace#link-tree/torch/distributed/elastic/agent/server/local_elastic_agent.py", line 224, in _setup_healthcheck
    port=int(healthcheck_port),
ValueError: invalid literal for int() with base 10: \'%port.thrift%\'
```

Test Plan:
```
arc lint
```

Reviewed By: felixsu2006

Differential Revision: D63343041

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136574
Approved by: https://github.com/atalman
2024-09-25 04:43:51 +00:00
561cd5a0a6 [BE] Use C++17 convetion methods in CUDA kernels (#136575)
- `std::is_same<X, Y>::value` -> `std::is_same_v<X, Y>`
- `std::enable_if<C, T>::type` -> `std::enable_if_t<C, T>` And so on

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136575
Approved by: https://github.com/Skylion007, https://github.com/eqy
2024-09-25 04:30:01 +00:00
5340feb8aa Disable iOS workflow (#136571)
See https://github.com/pytorch/pytorch/issues/136284
It's been broken for more than a week and it does not seem like anyone cares about fixing it.
Once it's landed I'll reassigned the issue on `oncall: mobile`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136571
Approved by: https://github.com/huydhn, https://github.com/kit1980
2024-09-25 04:29:34 +00:00
1c9a1a2a19 [AOTI] Support MKL linear ops in cpp wrapper (#134974)
Summary: Similar to https://github.com/pytorch/pytorch/pull/134475, support mkl linear in the ABI-compatible mode for cpp-wrapper Inductor.

Differential Revision: [D63322202](https://our.internmc.facebook.com/intern/diff/D63322202)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134974
Approved by: https://github.com/chenyang78, https://github.com/leslie-fang-intel

Co-authored-by: leslie-fang-intel <leslie.fang@intel.com>
2024-09-25 03:53:11 +00:00
0200ad3457 Turn on unique kernel names (#136503)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136503
Approved by: https://github.com/ezyang, https://github.com/eellison
ghstack dependencies: #136509
2024-09-25 03:39:45 +00:00
482fe186b9 Add ROCm documentation to libtorch (C++) reST. (#136378)
Fixes #126640

Added ROCm support section to libtorch (C++) reST.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136378
Approved by: https://github.com/ezyang
2024-09-25 02:30:56 +00:00
3c7edf1ec0 [Inductor][CPP] Fix int8 cvt half (#136353)
Fix the correctness issue of https://github.com/pytorch/ao/pull/884/. The current implementation for converting between `Half/BFloat16` and `int8/uint8` incorrectly assumes that 1/4 of the int8/uint8 vector lane maps to 1/2 of the Half/BFloat16 vector lane. This assumption leads to accuracy issues after the full bit-width vectorization of the Half data type was introduced. When converting between int8 weights and the half data type, the generated code is as the following:
```
#include "/tmp/torchinductor_leslie/xw/cxww3s7wxrujoyxna7mlcjktid2uu6nntixqwm542xfkd756gl3x.h"
extern "C"  void kernel(const int8_t* in_ptr0,
                       half* out_ptr0)
{
    {
        for(int64_t x0=static_cast<int64_t>(0L); x0<static_cast<int64_t>(2048L); x0+=static_cast<int64_t>(32L))
        {
            auto tmp0 = at::vec::Vectorized<int8_t>::loadu(in_ptr0 + static_cast<int64_t>(x0), static_cast<int64_t>(32));
            auto tmp1 = at::vec::convert<half>(tmp0);
            tmp1.store(out_ptr0 + static_cast<int64_t>(x0), static_cast<int64_t>(32));
        }
    }
}
```

In this PR, we address the issue by changing the implementation to convert 1/2 of the int8/uint8 vector lane into a full vector lane of Half/BFloat16.

**TestPlan**
* AO: `python test/integration/test_integration.py -k test_int8_weight_only_quant_subclass_api`
* `python -u -m pytest -s -v test/inductor/test_cpu_repro.py -k test_convert_int8_to_half_vec`
* Due to the CPP backend legalization pass, we are unable to create a unit test to simulate the conversion from `Half` to `int8`. Instead, we rely on a C++ test case.
  * `./build/bin/vec_test_all_types_AVX512 --gtest_filter="VecConvertTestsReducedFloat/*.ConvertReduced"`
  * `./build/bin/vec_test_all_types_AVX2 --gtest_filter="VecConvertTestsReducedFloat/*.ConvertReduced"`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136353
Approved by: https://github.com/jgong5, https://github.com/jerryzh168
2024-09-25 02:23:43 +00:00
eqy
8225e7706e [CUDA][Expandable Segments] Account for non-gc'able memory in expandable segments tests (#136496)
Seems like some other tests are holding onto memory that is not gc'able (e.g., cuBLAS workspaces), so these tests while working in isolation fail when run as e.g., `python test/test_cuda.py -k able`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136496
Approved by: https://github.com/ezyang
2024-09-25 01:14:45 +00:00
5233b5a448 Update PyTorch/XLA CI image to Python 3.10 (#135278)
The old image used Python 3.8. Corresponding XLA PR: https://github.com/pytorch/xla/pull/7953

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135278
Approved by: https://github.com/JackCaoG, https://github.com/atalman
2024-09-25 00:53:39 +00:00
eqy
670d64a802 [SDPA][Nested Tensor] Bump grad_query fudge factor for small GPUs (#135715)
Similar to #135711, here we see a ~1/1000 mismatch with absolute value ~0.0016 when 0.001 is allowed

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135715
Approved by: https://github.com/drisspg
2024-09-25 00:36:10 +00:00
8f2a4cc4b1 Tune bsr_dense_addmm for int8 inputs on A100 (#136088)
As in the title. The tuning is done for dimensions 1280 and 5120 that are used in Vit-H.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136088
Approved by: https://github.com/cpuhrsch
2024-09-25 00:24:12 +00:00
9629835b1c [ONNX] Remove the operators test (#136335)
The tests are obsolete and hard to maintain.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136335
Approved by: https://github.com/xadupre
2024-09-24 23:08:48 +00:00
b57d67e263 Add isuruf to core reviewers (#136554)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136554
Approved by: https://github.com/Skylion007, https://github.com/malfet
2024-09-24 23:06:46 +00:00
210b136c07 [export] Add experimental swap API (#136190)
Prototyped the following API which takes in an ExportedProgram, a dictionary of fqn to modules to swap, and returns a (unlifted) GraphModule
```
_swap_modules(
    ep: ExportedProgram, modules_to_swap: Dict[str, torch.nn.Module]
) -> torch.fx.GraphModule:
```

Differential Revision: [D62879819](https://our.internmc.facebook.com/intern/diff/D62879819)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136190
Approved by: https://github.com/avikchaudhuri
2024-09-24 22:50:44 +00:00
706eda5cd8 Revert "[RFC][torchelastic][c10d] Fix store prefix race in rendezvous (#135957)"
This reverts commit 5033a1ca0dd22dae34a8939add33dbebfe0fd31d.

Reverted https://github.com/pytorch/pytorch/pull/135957 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/135957#issuecomment-2372493186))
2024-09-24 22:24:26 +00:00
ae80bce496 [dynamo] refactor resume_execution.py to use bytecode templates (#136483)
Use bytecode from template instead of hardcoding bytecode in resume_execution.py. Gets rid of a lot of Python-version dependent bytecode generation. Also makes resume_execution.py easier to support in future Python version updates.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136483
Approved by: https://github.com/jansel, https://github.com/anijain2305
2024-09-24 22:20:26 +00:00
36f0e61166 [BE] Use nested namespace in ATen/native/cuda (#136570)
It's a nice C++17 feature
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136570
Approved by: https://github.com/Skylion007
2024-09-24 22:19:10 +00:00
1d3af68202 [ROCm] install_miopen.sh exit for ROCm >= 6.3 (#136436)
Follow up to #132555.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136436
Approved by: https://github.com/jithunnair-amd, https://github.com/pruthvistony, https://github.com/atalman
2024-09-24 22:15:26 +00:00
780f4debdb [ONNX] Remove _optimize_graph from public init (#136279)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136279
Approved by: https://github.com/xadupre
ghstack dependencies: #136281
2024-09-24 22:00:55 +00:00
00bc17555a Don't try to evaluate sympy.Eq in replacement; we knew this wouldn't simplify since we are here (#136533)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136533
Approved by: https://github.com/isuruf, https://github.com/pianpwk
2024-09-24 21:52:25 +00:00
b1a02bf708 Add deterministic path for CUDA cumsum (#136224)
Change `cumsum` to call its decomposition when `use_deterministic_algorithms(True)` and input is CUDA.

Fixes #89492

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136224
Approved by: https://github.com/ezyang, https://github.com/justinchuby
2024-09-24 21:34:43 +00:00
0133fbcfe7 Revert "Correctly convert Python float to float64 when passing argument as Tensor (#136413)"
This reverts commit f0f79dd8f1df6cf6342c9c23ae3a9be0f74eb9f5.

Reverted https://github.com/pytorch/pytorch/pull/136413 on behalf of https://github.com/ezyang due to forward fix is stuck, revert this ([comment](https://github.com/pytorch/pytorch/pull/136413#issuecomment-2372404873))
2024-09-24 21:20:37 +00:00
95c0f7493f [Inductor] Rename WrapperCodeGen to PythonWrapperCodegen (#136062)
Summary: Rename WrapperCodeGen to PythonWrapperCodegen to make its meaning more explicit.

Differential Revision: [D63300358](https://our.internmc.facebook.com/intern/diff/D63300358)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136062
Approved by: https://github.com/angelayi, https://github.com/chenyang78
2024-09-24 21:02:51 +00:00
da1560c49f [SymmetricMemory] add support for cuStreamWriteValue32 (#136488)
cuStreamWriteValue efficiently combines the issuing of a system-level fence with the update of a single memory location. It is highly suitable for inter-stream progress sharing (e.g., all_gather_with_progress).

Exposing it via SymmetricMemory allows users to more easily implement efficient progress-aware matmuls in triton ([xformers example](https://github.com/facebookresearch/xformers/blob/main/xformers/ops/_triton/sequence_parallel_fused_kernels.py)).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136488
Approved by: https://github.com/eqy, https://github.com/Chillee
2024-09-24 20:56:29 +00:00
7c777dd587 [ONNX] Unify ONNXProgram and remove the old one (#136281)
## Note

`test_fx_to_onnx_with_onnxruntime.py` is removed for now (it has a lot of xfails anyways). A better version will be added back.

Fixes https://github.com/pytorch/pytorch/issues/136274

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136281
Approved by: https://github.com/xadupre, https://github.com/albanD
2024-09-24 20:52:19 +00:00
dbc3356655 [pipelining] fix py ref cycle in stage_backward (#136507)
TLDR; found forward activation tensors were being kept alive "forever"
(or until GC ran), and tracked it down to a cycle involving
`stage_backward.<locals>.extract_tensors_with_grads`.

The reference cycle in question is below.  (constructed using gc.get_referrers after doing a gc.collect in gc debug mode)

tensor is kept alive by
`[(<class 'cell'>, '0x7f7360234400')]`

tuple of cell objects
`(<cell at 0x7f73602343d0: function object at 0x7f734fff0ee0>, <cell at 0x7f7360234400: list object at 0x7f734e4d9a80>, <cell at 0x7f73602a4190: list object at 0x7f734eff8b00>)`
is kept alive by
`[(<class 'function'>, '0x7f734fff0ee0')]`

`<function stage_backward.<locals>.extract_tensors_with_grads at 0x7f734fff0ee0>`
is kept alive by
`[(<class 'cell'>, '0x7f73602343d0')]`

Put into more plain terms,

```

def stage_backward(...):
    ...
    stage_output_tensors = []

    # a cell object will exist that contains the variables defined in stage_backward and used by
    # both stage_backward and nested functions
    # in this case, the cell object contains 'stage_output_tensors' but

    # this function object will hold a reference to a 'cell' that contains any vars from
    # the parent scope not explicitly passed into the function as args.
    def extract_tensors_with_grads(...):
        ...
            # extract_tensors_with_grads refers to stage_output_tensors, so stage_output_tensors
            # is in the cell
            stage_output_tensors.append(output_val)
        ...
            # but extract_tensors_with_grads ALSO refers to itself (extract_tensors_with_grads),
            # so `extract_tensors_with_grads` will be in the cell
            extract_tensors_with_grads(...)
```

More debug details:
https://docs.google.com/document/d/1QPH1Lz0tnieIFPM2tyHrjVB-bjlnHuDgjx1p2am3cmE/edit?usp=sharing

In pdb:
```
gc.collect()
g = gc.garbage
g[-1]
[rank0]:(Pdb) [rank0]:<function
stage_backward.<locals>.extract_tensors_with_grads at 0x7fee5c3392d0>
g[-2]
[rank0]:(Pdb) [rank0]:(<cell at 0x7fee7abbcf40: function object at
0x7fee5c3392d0>, <cell at 0x7fee7abbcf70: list object at
0x7fee7ab68940>, <cell at 0x7fee5c3210c0: list object at 0x7fee5e1
d6340>)
g[-3]
[rank0]:(Pdb) [rank0]:[tensor([[[-4.1127e-06, -3.3826e-06,  2.6226e-06,
...,  6.4969e-06,
[rank0]:          -4.4405e-06, -4.7684e-06],
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136507
Approved by: https://github.com/awgu, https://github.com/kwen2501
2024-09-24 20:46:37 +00:00
7ff8e66140 Fix flexattention sympy expr printer issue (#136509)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136509
Approved by: https://github.com/yanboliang
2024-09-24 20:10:29 +00:00
02ef5dd327 [inductor][test] Check if mkl dnn bf16 is supported when using bf16 (#136290)
Sometimes the test is run with older cpu, e.g. Intel(R) Xeon(R) CPU E5-2680 v4. If we inspect its `lscpu`, in the flags, we don't see a `avx512_bf16`. So that probably means bf16 is not supported for those hardwares, and hence the unit test can fail. So we add the check in the code.

Context: https://github.com/pytorch/pytorch/pull/135038

Differential Revision: D62984129

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136290
Approved by: https://github.com/XuehaiPan, https://github.com/chenyang78
2024-09-24 19:32:48 +00:00
888744bd36 NJT binary pointwise broadcasting support via jagged <-> padded dense conversion (#133021)
Related: #132695

This PR uses padded dense <-> jagged conversions to handle binary pointwise broadcasting of (NT, T) and (T, NT). This includes:
* `(B, j0, D) + (1, 1, 1)`
* `(B, j0, D) + (B, 1, 1)`
* `(B, j0, D) + (B, 1, D)`
* etc.

This PR also adds (hacky) support for bool inputs to the jagged <-> padded dense conversions. The underlying CUDA kernels do not support integer / bool inputs; so the following workaround is employed: `convert input -> half, run conversion kernel, convert output -> bool`. Note that this bool support is needed specifically for the backward formula of `fmax`, and likely others.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133021
Approved by: https://github.com/cpuhrsch
2024-09-24 19:11:49 +00:00
8ecc5f1a8f [TorchScript][tensorexpr] imbue locale for IRPrinter (#136458)
We had an internal report where the NNC-generated CUDA code had thousands separators in integer literals. Although I wasn't able to cleanly repro, I did come up with a hacky repro and verified that this fix works (see #136459).

Differential Revision: [D63278771](https://our.internmc.facebook.com/intern/diff/D63278771)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136458
Approved by: https://github.com/eellison
2024-09-24 19:00:57 +00:00
c6192f32f1 [MPS] Add upsample_bicubic2d as Metal op (#136123)
More or less literal copy-n-paste of c33b0580e6/aten/src/ATen/native/cuda/UpSampleBicubic2d.cu (L24)
and
c33b0580e6/aten/src/ATen/native/cuda/UpSampleBicubic2d.cu (L99)
Missing `uint8` implementation mimics CUDA behavior
Initial version coded live in https://www.youtube.com/watch?v=shi6Kb5xxvk
Later refinements:
 - Switch from 2D dispatch to 1D one (to match CUDA behavior)
 - Added batch + channel loops
 - Fixed scale computation to match align corners behavior
 - Added backward implementation

Backward implementation again, mimics CUDA, so it has issues precision issue for `torch.half` as well as a somewhat slow simulation of atomic adds using atomic compare and exchange of the pair of adjacent values, i.e.
```metal
emplate <typename T>
static inline void atomic_add_helper(
    device atomic<int>* data,
    long offset,
    float value) {
  auto ptr = data + (offset >> 1);
  auto old = atomic_load_explicit(ptr, memory_order_relaxed);
  union {
    int i;
    T t[2];
  } val;
  do {
    val.i = old;
    val.t[offset & 1] += static_cast<T>(value);
  } while (!atomic_compare_exchange_weak_explicit(
      ptr, &old, val.i, memory_order_relaxed, memory_order_relaxed));
}
```
Bump basic Metal language version to 3.0, as it's supported on MacOS13 and that's the first version that has `atomic_float`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136123
Approved by: https://github.com/albanD
2024-09-24 18:58:11 +00:00
dacf0c4884 [dynamo] Do not treat user defined nn module attributes static for dynamic shape infra (#136516)
Fixes https://github.com/pytorch/pytorch/issues/136254

Th regression was introduced in https://github.com/pytorch/pytorch/pull/132736 where originally we were trying to fix another regression. This PR and the offending PR together say - "treat user defined nn module attributes as automatic dynamic, but for cudagraphs they will be considered static". This avoid recompilations. This can lead to a cudagraph recording, which is ok. This also maintains the state before inline_inbuilt_nn_modules flag was introduced.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136516
Approved by: https://github.com/williamwen42
2024-09-24 18:26:12 +00:00
1028cedf71 [inductor] Enable parallel compile by default in fbcode (#136246)
Summary: Now that we have subprocess parallel compile on by default, we can change the internal compile_threads default to > 1 with a killswitch. Some jankiness so we can avoid evaluating the justknob at import.

Test Plan: Ran codecache tests with JK on, then canaried locally with JK off

Differential Revision: D62913998

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136246
Approved by: https://github.com/eellison
2024-09-24 18:10:01 +00:00
9abdc62065 Allow fx graph caching higher order operators (opt-in) (#135877)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135877
Approved by: https://github.com/zou3519
2024-09-24 17:23:09 +00:00
efed357ef5 Add dtypes support in opinfo for Intel Gaudi (#132840)
## Motivation
This is following up on changes introduced in https://github.com/pytorch/pytorch/pull/128584
we are adding the dtype information to be picked up while executing the UTs for Intel Gaudi/HPU

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132840
Approved by: https://github.com/albanD
2024-09-24 17:17:15 +00:00
064093a4d6 Revert "Increase update_hint_regression problem size to 1000 (#136434)"
This reverts commit 3116fbda0fcf9af0c3dfe1280fb7e05e30e6ad5f.

Reverted https://github.com/pytorch/pytorch/pull/136434 on behalf of https://github.com/ezyang due to whoops, this is too slow ([comment](https://github.com/pytorch/pytorch/pull/136434#issuecomment-2371847842))
2024-09-24 17:05:20 +00:00
ebfcbe0822 Move print_export_warning so lru_cache works (#136491)
Summary:
as title

move print_export_warning() out of the function so `lru_cache` actually works

Test Plan: CI

Differential Revision: D63297083

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136491
Approved by: https://github.com/pianpwk
2024-09-24 16:52:22 +00:00
44ec706789 add tolerance changes for test_sdpa_autocast in test_nestedtensor.py (#136485)
Upstreaming minor unit test fix from nvidia internal CI

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136485
Approved by: https://github.com/soulitzer
2024-09-24 16:31:32 +00:00
eac04fe72a Increase bf32 tolerances for some cdist tests in test_torch (#136315)
- Set the new tolerances ~= N * eps(bfloat16) which should be a comfortable upper bound for tolerances. Where N is the inner dimension of the matmal.

Logic behind choice of tolerance:

The maximum error of the summation of a series of N numbers in bfloat16 should be `N * epsilon(bfloat16)` , I confirmed by sampling different random seeds that the maximum observed error doesn't exceed this value and is usually much less.

Fixes test failures on Arm® Neoverse™ V1 ( not raised as an issue as this hardware type is not currently covered by linux-aarch64 workflow )

```
Traceback (most recent call last):
  File "/var/lib/jenkins/workspace/test/test_torch.py", line 2478, in test_cdist_large
    self.assertEqual(expected, actual)
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 3885, in assertEqual
    raise error_metas.pop()[0].to_error(
AssertionError: Tensor-likes are not close!

Mismatched elements: 134118 / 1000000 (13.4%)
Greatest absolute difference: 0.03829193115234375 at index (291, 726) (up to 0.005 allowed)
Greatest relative difference: 0.03519868478178978 at index (291, 726) (up to 1.3e-06 allowed)
```

@malfet @jondea

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136315
Approved by: https://github.com/albanD
2024-09-24 16:10:11 +00:00
0b667c073e Disable compiled autograd for re-entrant autograd (#135795)
Fixes #135298

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135795
Approved by: https://github.com/xmfan
2024-09-24 15:09:16 +00:00
33e10803c8 Fix ut in internal distributed_test.py (#136251)
I have failed with test case of **test_new_subgroups_by_enumeration_input_rank_exceeds_world_size**, and passed with this small change. The expected exception is supposed to be "ValueError" rather than "RuntimeError" according to [code](https://github.com/pytorch/pytorch/blob/v2.4.1/torch/distributed/distributed_c10d.py#L4190).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136251
Approved by: https://github.com/kwen2501
2024-09-24 15:06:20 +00:00
58274e4655 Remove onnx imports in dynamo (#136334)
Remove imports of the ``torch.onnx.operators`` module in dynamo. Since ONNX depends on dynamo, this import line causes a circular dependency. Judging from the source they are not actually needed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136334
Approved by: https://github.com/xadupre, https://github.com/jansel, https://github.com/titaiwangms
2024-09-24 14:54:23 +00:00
2a178a6982 Avoid changing FTZ/DAZ flags in CPP builder (#136466)
Fixes https://github.com/pytorch/pytorch/issues/136273
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136466
Approved by: https://github.com/ezyang
2024-09-24 14:39:17 +00:00
6300eb1dc7 tf32 off for test_noncontiguous_samples in test_ops.py (#136484)
Upstreaming minor unit test fix from nvidia internal CI

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136484
Approved by: https://github.com/soulitzer
2024-09-24 14:26:47 +00:00
47ebb5856e Make avoid_device_init() aware of hpu device (#136194)
Added hpu to devices handled by avoid_device_init() in FakeTensorMode.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136194
Approved by: https://github.com/eellison
2024-09-24 14:13:45 +00:00
54fc4f56ff [Docs fix] fix syntax error in docs :torch.blackman_window (#136354)
Fixes #ISSUE_NUMBER
https://pytorch.org/docs/stable/generated/torch.blackman_window.html

error at : equal to torch.blackman_window(L + 1, periodic=False)[:-1]).
should delete the last ).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136354
Approved by: https://github.com/soulitzer
2024-09-24 14:00:26 +00:00
9fc721d22b Add cache logs + other minor caching cleanup (#136456)
Summary:
- Added TORCH_LOGS=cache to dump cache stats on exit - supported by RemoteCache.
- Split REMOTE_CACHE_VERSION - it was used for both JKs fx_graph_memcache_version and autotune_memcache_version but they really should be separate (just in case we need to change one but not the other)
- Prepare `_ManifoldCache` for use with other subpath keys
- Move create_cache to be more public and use it in codecache
- Add _InductorMetaTy alias (still just a dict)
- Cleaned up some common cached_autotune calls in triton_heuristics

Test Plan: unit tests

Reviewed By: oulgen

Differential Revision: D62648249

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136456
Approved by: https://github.com/oulgen
2024-09-24 14:00:23 +00:00
342c031f0e [aotd] Fix freezing API for subclasses (#136265)
Original issue:
https://github.com/pytorch/ao/issues/890

The problem:

TracingContext.flat_params contain original params, with not desugared Subclasses.
While inductor.freezing API works on aot graphs, which already desugared Subclasses.

flat_params are used only for this logic and storing in them desguared subclasses fixes the issue.

Testing:
```
python test/functorch/test_aotdispatch.py -k test_inductor_freezing_with_subclasses
```
Torch AO original failure:
```
python test/integration/test_integration.py -k test_int8_weight_only_quant_with_freeze
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136265
Approved by: https://github.com/bdhirsh
2024-09-24 13:15:01 +00:00
cyy
f048569c24 [Distributed] [11/N] Fix clang-tidy warnings in torch/csrc/distributed/ (#136439)
Follows #131671

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136439
Approved by: https://github.com/kwen2501
2024-09-24 13:05:15 +00:00
538ee7bf60 Revert "Fix tensor.data_ptr() representation overflow (#135567)"
This reverts commit 2e8d431a8fbfdbdb07448195f16afa9e101188ac.

Reverted https://github.com/pytorch/pytorch/pull/135567 on behalf of https://github.com/etaf due to Block XPU, let's re-land with triton update. ([comment](https://github.com/pytorch/pytorch/pull/135567#issuecomment-2371200549))
2024-09-24 12:59:14 +00:00
32727b9859 Add types to _dynamo/testing.py (#136402)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136402
Approved by: https://github.com/jansel
2024-09-24 10:23:54 +00:00
73c10a04f6 [dynamo][easy] support sys.intern (#136081)
Closes #134023

- #134023

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136081
Approved by: https://github.com/anijain2305
2024-09-24 09:12:34 +00:00
1266be21f4 deprecated datetime.utcnow() fix and _RendezvousJoinOp module initiation bug fix (#136141)
Fix to #136140

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136141
Approved by: https://github.com/kwen2501
2024-09-24 07:26:10 +00:00
0a35986cdb Add option to configure reduced precision math backend for SDPA (#135964)
Summary: Address https://github.com/pytorch/pytorch/issues/135778 by adding a global flag to configure whether using high precision or low precision for math backend of SDPA.

Test Plan: buck2 run mode/opt //scripts/feikou/llm:run_attn_kernels

Differential Revision: D62625515

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135964
Approved by: https://github.com/jbschlosser
2024-09-24 07:11:38 +00:00
44c871c34b [inductor] [cpp] add index check when fusing epilogue with GEMM template (#135661)
## Description
Fixes the accuracy failure of FP32 `jx_nest_base` of max-autotune.

The current epilogue fusion implementation in GEMM template assumes that the read of template buffer and the write of epilogue output in the epilogue node have the same index (the layout could be different but the index should be the same).

If the condition is not satisfied, the computation is wrong, leading to correctness issue for FP32 `jx_nest_base`.

This PR disabled the epilogue fusion with GEMM template when the above condition is not satisfied.

### Unsupported epilogue:
`buf1` is the template buffer and `buf2` is the epilogue output buffer.
The store of `buf2`:
401408 * d0 + 100352 * d1 + **7168 * d2** + **1792 * d3** + 128 * d4 + d5

The load of `buf1` in the epilogue node:
401408 * d0 + 100352 * d1 + **1792 * d2** + **25088 * d3** + 128 * d4 + d5

The above two indexes are different.

```
CppTemplateBuffer(name='buf1', layout=FixedLayout('cpu', torch.float32, size=[25088, 128], stride=[128, 1]))
ComputedBuffer(name='buf2', layout=FixedLayout('cpu', torch.float32, size=[8, 4, 14, 4, 14, 128], stride=[401408, 100352, 7168, 1792, 128, 1]), data=Pointwise(
  'cpu',
  torch.float32,
  def inner_fn(index):
      i0, i1, i2, i3, i4, i5 = index
      tmp0 = ops.load(arg5_1, i5 + 128 * i4 + 1792 * i2 + 25088 * i3 + 100352 * i1 + 401408 * i0)
      tmp1 = ops.load(buf0, i5 + 128 * i4 + 1792 * i2 + 25088 * i3 + 100352 * i1 + 401408 * i0)
      tmp2 = tmp0 + tmp1
      tmp3 = ops.load(buf1, i5 + 128 * i4 + 1792 * i2 + 25088 * i3 + 100352 * i1 + 401408 * i0)
      tmp4 = tmp2 + tmp3
      return tmp4
  ,
  ranges=[8, 4, 14, 4, 14, 128],
  origin_node=clone,
  origins=OrderedSet([clone])
))
```

### Supported epilogue:
`buf1` is the template buffer and `buf2` is the epilogue output buffer.
The store of `buf2`:
d0 + 576 * d1 + 32 * d2

The load of `buf1` in the epilogue node:
d0 + 576 * d1 + 32 * d2

The above two indexes are the same.

The layout of `buf2` and `buf1` are different though which is handled by the reindexer:
`buf1`: `size=[324, 32], stride=[32, 1]`
`buf2`: `size=[1, 32, 18, 18], stride=[10368, 1, 576, 32]`

```
CppTemplateBuffer(name='buf1', layout=FixedLayout('cpu', torch.bfloat16, size=[324, 32], stride=[32, 1]))
ComputedBuffer(name='buf2', layout=FixedLayout('cpu', torch.bfloat16, size=[1, 32, 18, 18], stride=[10368, 1, 576, 32]), data=Pointwise(
  'cpu',
  torch.bfloat16,
  def inner_fn(index):
      _, i1, i2, i3 = index
      tmp0 = ops.load(buf1, i1 + 32 * i3 + 576 * i2)
      tmp1 = ops.to_dtype(tmp0, torch.float32, src_dtype=torch.bfloat16)
      tmp2 = ops.load(_frozen_param4, i1)
      tmp3 = tmp1 * tmp2
      tmp4 = ops.load(arg7_1, i1 + 32 * i3 + 576 * i2)
      tmp5 = tmp3 + tmp4
      tmp6 = ops.to_dtype(tmp5, torch.bfloat16, src_dtype=torch.float32)
      return tmp6
  ,
  ranges=[1, 32, 18, 18],
  origin_node=convert_element_type_4,
  origins=OrderedSet([add, mul, convert_element_type_4])
))
```

## TODO
Add the support for fusions when the indexes are different in a follow-up PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135661
Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5
2024-09-24 05:25:28 +00:00
7283530db2 [ROCm][Inductor][CK] FP8 gemm (#136337)
At the moment, lowering torch._scaled_mm with tensorwise scaling and rowwise scaling for both A and B

We probably also want to support either combination of tensorwise and rowwise for A and B, as well as bias support

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136337
Approved by: https://github.com/chenyang78
2024-09-24 05:19:45 +00:00
7f98781f84 Fix autodeps from D62049222 that pyfmt broke (#136455)
Summary: `arc lint` changed the formatting which then caused autodeps to be confused.

Test Plan:
this passes:
```
arc lint --skip AUTODEPS
fbpython fbcode/tools/build/buck/linters/lint_autoformat.py --linter=autodeps --default-exec-timeout=1800 -- fbcode/caffe2/test/inductor/test_memory_planning.py
```

Differential Revision: D63277059

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136455
Approved by: https://github.com/bobrenjc93, https://github.com/oulgen
2024-09-24 05:06:12 +00:00
797c7e2802 [Quant][PT2E]change flatten recipe for X86InductorQuantizer (#136298)
This PR modifies the flatten recipe: if none of the users of the flatten node are quantizable ops, int8 flatten will be disabled to avoid unnecessary dtype conversions.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136298
Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5
2024-09-24 04:30:12 +00:00
3be150653c [torch][ao] Add customizable loss function to NodeAccuracySummary (#136282)
Summary:
Add a customizable loss function callback to NodeAccuracySummary to
allow users to pass in their own loss function.

Also, fix some type errors and propagate better exception messages when
unexpected tensor comparisons occur. Finally, enhance the robustness of
`generate_numeric_debug_handle` in the case where it is called multiple
times on the same model, by avoiding reuse of the same IDs.

Test Plan: Added a test for this case in `test_numeric_debugger`.

Reviewed By: jerryzh168

Differential Revision: D62898297

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136282
Approved by: https://github.com/jerryzh168
2024-09-24 03:28:12 +00:00
e09c5b6046 Remove vt argument in raise_observed_exception (#136037)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136037
Approved by: https://github.com/zou3519
2024-09-24 02:36:57 +00:00
9372692c7b [FR] Make OSS fr_trace function available for internal script and improve pg filtering (#136473)
Differential Revision: [D63287384](https://our.internmc.facebook.com/intern/diff/D63287384/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136473
Approved by: https://github.com/c-p-i-o
2024-09-24 02:34:43 +00:00
4fd16dd8aa Clarify that libtorch API is C++17 compatible (#136471)
As it relies on some common C++17 primitives, such as `std::optional`
Replace all docs references from C++14 to C++17

Fixes https://github.com/pytorch/pytorch/issues/133205

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136471
Approved by: https://github.com/kit1980, https://github.com/atalman
2024-09-24 02:03:33 +00:00
e4d294221b [inductor] Log precompilation time (#136395)
This has been useful for diagnosing the long compile time issues I've seen in the Triton CPU backend.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136395
Approved by: https://github.com/eellison
2024-09-24 01:47:54 +00:00
802ba79121 Inherit all secrets to inductor workflow (#135354)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135354
Approved by: https://github.com/desertfire, https://github.com/atalman, https://github.com/malfet
2024-09-24 01:30:40 +00:00
06909803cc Existing mypy issues (#136236)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136236
Approved by: https://github.com/bobrenjc93, https://github.com/Skylion007
2024-09-24 01:02:07 +00:00
a14f57b126 fix the inductor tests (#136474)
Fixes https://github.com/pytorch/pytorch/issues/136464 introduced in https://github.com/pytorch/pytorch/pull/134874

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136474
Approved by: https://github.com/malfet
2024-09-24 00:59:22 +00:00
9d9bc65b5e Make FlashAttentionKernel.cpp compilable for SVE with GCC-11 (#136477)
Extends https://github.com/pytorch/pytorch/pull/132434 to all minor revisions of GCC-11, as they all likely affected by https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95528

Hattip to @abhishek-iitmadras  for the investigation

Fixes https://github.com/pytorch/pytorch/issues/136432

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136477
Approved by: https://github.com/atalman, https://github.com/kit1980
2024-09-24 00:54:26 +00:00
e0f84f40f7 [Pipelining] Allow non-0 stages to accept kwargs (#136416)
For supporting usage case in torchchat:
all non-0 stages requires `input_pos` and `cache_lane`.
```
kwargs = {"input_pos": input_pos, "cache_lane": lane}

if pp_rank == first_pp_rank:
    output = decorder.step(new_token, **kwargs)
elif pp_rank == last_pp_rank:
    output = decorder.step(**kwargs)
else:  # middle pp ranks
    decorder.step(**kwargs)
```

The `forward_one_chunk` code today hard sets `{}` as kwarg for non-0 stages, hence cannot support the above use case.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136416
Approved by: https://github.com/wconstab
2024-09-23 23:50:59 +00:00
52c917b0ba Optimize dict reconstruct to not codegen untouched values (#134876)
PR changes how `reconstruct` is done for a ConstDict. As of today, it works as follow:
(1) codegen(...) each pair of key/value
(2) create a new dictionary to hold the new items
(3) clear the original dictionary
(4) update the original dict with the one created in (2)

We do a micro optimization in the generated bytecode to:
- Only codegen the items that changed.
- Only clear the original dictionary if a key was removed.

Fixes: #133487

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134876
Approved by: https://github.com/zou3519
2024-09-23 21:45:44 +00:00
5033a1ca0d [RFC][torchelastic][c10d] Fix store prefix race in rendezvous (#135957)
1. We want to take option 3 as discussed in https://github.com/pytorch/pytorch/issues/135712, so every time when we retry, we create a new TCPStore server first so that we don't need to append attempt count as prefix and avoid eventually TCPStore sync failure. (This is only for the TCPStore sharing enabled case)
2. We start a new server bound to an ephemeral port (i.e. 0) so it gets assigned to a free port. We then pass that downstream (trainer or c10d). By doing so, TCPStore is managed by the elastic agent rather than having a race condition on binding to a specific port in the trainer.
3. Then the port be broadcasted for dynamic_rendezvous.

Only one more question, what do we do about the store created from (_create_tcp_store) torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py, are we ok with creating a duplicate TCPStore server?

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135957
Approved by: https://github.com/d4l3k, https://github.com/c-p-i-o
2024-09-23 20:32:24 +00:00
fd182b90a7 Revert "Add deterministic path for CUDA cumsum (#136224)"
This reverts commit d45b0151e5d9a9358368b9fbd7fa454edd5d9709.

Reverted https://github.com/pytorch/pytorch/pull/136224 on behalf of https://github.com/atalman due to Failing internall CI ([comment](https://github.com/pytorch/pytorch/pull/136224#issuecomment-2369244135))
2024-09-23 19:57:13 +00:00
08dba25775 [BE] Do not use deprecated APIs in SparseCsrTensorMath.cu (#136449)
- `Tensor::type()` -> `Tensor::scalar_type()`
- `Tensor::data<T>()` -> `Tensor::data_ptr<T>()`

Should fix following warnings during the compilation:
```
caffe2/CMakeFiles/torch_cuda.dir/__/aten/src/ATen/native/transformers/cuda/mem_eff_attention/kernels/cutlassB_f32_notaligned_k128_dropout.cu.o
 /var/lib/jenkins/workspace/aten/src/ATen/native/sparse/cuda/SparseCsrTensorMath.cu: In function ‘void at::native::_GLOBAL__N__496f0b0c_22_SparseCsrTensorMath_cu_868dd545::_apply_sparse_csr_linear_solve(const at::Tensor&, const at::Tensor&, bool, const at::Tensor&)’:
 /var/lib/jenkins/workspace/aten/src/ATen/native/sparse/cuda/SparseCsrTensorMath.cu:739:36: error: ‘T* at::Tensor::data() const [with T = int]’ is deprecated: Tensor.data<T>() is deprecated. Please use Tensor.data_ptr<T>() instead. [-Werror=deprecated-declarations]
   739 |   int* rowOffsets = crow.data<int>();
       |                                    ^
 /var/lib/jenkins/workspace/build/aten/src/ATen/core/TensorBody.h:247:1: note: declared here
   247 |   T * data() const {
       | ^ ~~
 /var/lib/jenkins/workspace/aten/src/ATen/native/sparse/cuda/SparseCsrTensorMath.cu:740:35: error: ‘T* at::Tensor::data() const [with T = int]’ is deprecated: Tensor.data<T>() is deprecated. Please use Tensor.data_ptr<T>() instead. [-Werror=deprecated-declarations]
   740 |   int* colIndices = col.data<int>();
       |                                   ^
 /var/lib/jenkins/workspace/build/aten/src/ATen/core/TensorBody.h:247:1: note: declared here
   247 |   T * data() const {
       | ^ ~~
 /var/lib/jenkins/workspace/aten/src/ATen/native/sparse/cuda/SparseCsrTensorMath.cu: In lambda function:
 /var/lib/jenkins/workspace/aten/src/ATen/native/sparse/cuda/SparseCsrTensorMath.cu:753:44: error: ‘at::DeprecatedTypeProperties& at::Tensor::type() const’ is deprecated: Tensor.type() is deprecated. Instead use Tensor.options(), which in many cases (e.g. in a constructor) is a drop-in replacement. If you were using data from type(), that is now available from Tensor itself, so instead of tensor.type().scalar_type(), use tensor.scalar_type() instead and instead of tensor.type().backend() use tensor.device(). [-Werror=deprecated-declarations]
   753 |   AT_DISPATCH_FLOATING_TYPES(values.type(), "create_matrix", ([&] {
       |                                            ^
 /var/lib/jenkins/workspace/build/aten/src/ATen/core/TensorBody.h:225:1: note: declared here
   225 |   DeprecatedTypeProperties & type() const {
       | ^ ~~
 /var/lib/jenkins/workspace/aten/src/ATen/native/sparse/cuda/SparseCsrTensorMath.cu:753:159: error: ‘c10::ScalarType detail::scalar_type(const at::DeprecatedTypeProperties&)’ is deprecated: passing at::DeprecatedTypeProperties to an AT_DISPATCH macro is deprecated, pass an at::ScalarType instead [-Werror=deprecated-declarations]
   753 |   AT_DISPATCH_FLOATING_TYPES(values.type(), "create_matrix", ([&] {
       |                                                                                                                                                               ^
 /var/lib/jenkins/workspace/aten/src/ATen/Dispatch.h:109:1: note: declared here
   109 | inline at::ScalarType scalar_type(const at::DeprecatedTypeProperties& t) {
       | ^~~~~~~~~~~
 /var/lib/jenkins/workspace/aten/src/ATen/native/sparse/cuda/SparseCsrTensorMath.cu:753:159: error: ‘c10::ScalarType detail::scalar_type(const at::DeprecatedTypeProperties&)’ is deprecated: passing at::DeprecatedTypeProperties to an AT_DISPATCH macro is deprecated, pass an at::ScalarType instead [-Werror=deprecated-declarations]
   753 |   AT_DISPATCH_FLOATING_TYPES(values.type(), "create_matrix", ([&] {
       |                                                                                                                                                               ^
 /var/lib/jenkins/workspace/aten/src/ATen/Dispatch.h:109:1: note: declared here
   109 | inline at::ScalarType scalar_type(const at::DeprecatedTypeProperties& t) {
       | ^~~~~~~~~~~
 /var/lib/jenkins/workspace/aten/src/ATen/native/sparse/cuda/SparseCsrTensorMath.cu: In lambda function:
 /var/lib/jenkins/workspace/aten/src/ATen/native/sparse/cuda/SparseCsrTensorMath.cu:753:1014: error: ‘T* at::Tensor::data() const [with T = double]’ is deprecated: Tensor.data<T>() is deprecated. Please use Tensor.data_ptr<T>() instead. [-Werror=deprecated-declarations]
   753 |   AT_DISPATCH_FLOATING_TYPES(values.type(), "create_matrix", ([&] {
       |                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      ^
 /var/lib/jenkins/workspace/build/aten/src/ATen/core/TensorBody.h:247:1: note: declared here
   247 |   T * data() const {
       | ^ ~~
 /var/lib/jenkins/workspace/aten/src/ATen/native/sparse/cuda/SparseCsrTensorMath.cu:753:1054: error: ‘T* at::Tensor::data() const [with T = double]’ is deprecated: Tensor.data<T>() is deprecated. Please use Tensor.data_ptr<T>() instead. [-Werror=deprecated-declarations]
   753 |   AT_DISPATCH_FLOATING_TYPES(values.type(), "create_matrix", ([&] {
       |                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              ^
 /var/lib/jenkins/workspace/build/aten/src/ATen/core/TensorBody.h:247:1: note: declared here
   247 |   T * data() const {
       | ^ ~~
 /var/lib/jenkins/workspace/aten/src/ATen/native/sparse/cuda/SparseCsrTensorMath.cu:753:1094: error: ‘T* at::Tensor::data() const [with T = double]’ is deprecated: Tensor.data<T>() is deprecated. Please use Tensor.data_ptr<T>() instead. [-Werror=deprecated-declarations]
   753 |   AT_DISPATCH_FLOATING_TYPES(values.type(), "create_matrix", ([&] {
       |                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      ^
 /var/lib/jenkins/workspace/build/aten/src/ATen/core/TensorBody.h:247:1: note: declared here
   247 |   T * data() const {
       | ^ ~~
 /var/lib/jenkins/workspace/aten/src/ATen/native/sparse/cuda/SparseCsrTensorMath.cu: In lambda function:
 /var/lib/jenkins/workspace/aten/src/ATen/native/sparse/cuda/SparseCsrTensorMath.cu:753: error: ‘T* at::Tensor::data() const [with T = float]’ is deprecated: Tensor.data<T>() is deprecated. Please use Tensor.data_ptr<T>() instead. [-Werror=deprecated-declarations]
   753 |   AT_DISPATCH_FLOATING_TYPES(values.type(), "create_matrix", ([&] {
       |
 /var/lib/jenkins/workspace/build/aten/src/ATen/core/TensorBody.h:247:1: note: declared here
   247 |   T * data() const {
       | ^ ~~
 /var/lib/jenkins/workspace/aten/src/ATen/native/sparse/cuda/SparseCsrTensorMath.cu:753: error: ‘T* at::Tensor::data() const [with T = float]’ is deprecated: Tensor.data<T>() is deprecated. Please use Tensor.data_ptr<T>() instead. [-Werror=deprecated-declarations]
   753 |   AT_DISPATCH_FLOATING_TYPES(values.type(), "create_matrix", ([&] {
       |
 /var/lib/jenkins/workspace/build/aten/src/ATen/core/TensorBody.h:247:1: note: declared here
   247 |   T * data() const {
       | ^ ~~
 /var/lib/jenkins/workspace/aten/src/ATen/native/sparse/cuda/SparseCsrTensorMath.cu:753: error: ‘T* at::Tensor::data() const [with T = float]’ is deprecated: Tensor.data<T>() is deprecated. Please use Tensor.data_ptr<T>() instead. [-Werror=deprecated-declarations]
   753 |   AT_DISPATCH_FLOATING_TYPES(values.type(), "create_matrix", ([&] {
       |
 /var/lib/jenkins/workspace/build/aten/src/ATen/core/TensorBody.h:247:1: note: declared here
   247 |   T * data() const {
       | ^ ~~
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136449
Approved by: https://github.com/huydhn
2024-09-23 19:20:34 +00:00
9a1dc41de7 [AMD] Skipping 0 byte send/recv for AMD GPU (#136362)
Summary: We found jobs getting stuck by send/recv zero bytes with RDMA on AMD GPUs. So just skipping them.

Reviewed By: danzimm

Differential Revision: D63075000

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136362
Approved by: https://github.com/malfet, https://github.com/houseroad
2024-09-23 19:14:12 +00:00
3116fbda0f Increase update_hint_regression problem size to 1000 (#136434)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136434
Approved by: https://github.com/laithsakka
2024-09-23 18:51:44 +00:00
274883083d Revert "[AOTI] Create another wrapper class to handle ArrayRef (#136318)"
This reverts commit d21841d077b00350d5e621e7b74dace71849c701.

Reverted https://github.com/pytorch/pytorch/pull/136318 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/136318#issuecomment-2368957264))
2024-09-23 17:47:49 +00:00
d859fcbc61 s390x: build s390x binaries on each pull request (#125399)
Ensure that s390x keeps building for each PR
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125399
Approved by: https://github.com/huydhn
2024-09-23 17:39:48 +00:00
83a3ee0699 Support embedding_bag() with NJT input (#135888)
Fixes #93843

`EmbeddingBag()` / `embedding_bag()` support 1D inputs with offsets to handle raggedness. NJT is a natural fit here as it already maintains offsets of the same form. This PR updates the python-side to support NJT and adds corresponding OpInfo-based NJT tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135888
Approved by: https://github.com/cpuhrsch
2024-09-23 17:35:19 +00:00
4649aeaebf Make AOTAutogradCache support remote FXGraphCache (#136173)
Summary:
After the previous refactor, we can now call load_with_key directly from AOTAutogradCache to use the remote FXGraphCache.

This does *not* implement a remote AOTAutogradCache. It just allows AOTAutogradCache to work with remote FXGraphCache.

Test Plan: (Meta only tests)

Reviewed By: aorenste

Differential Revision: D62384944

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136173
Approved by: https://github.com/oulgen
2024-09-23 17:24:27 +00:00
c3e678382b Fix addmm silent correctness on aarch64 (#136371)
Do not dispatch to fast gemmv functions when alpha is not equal to 1

Add regression test to address the problem

Fixes https://github.com/pytorch/pytorch/issues/136299

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136371
Approved by: https://github.com/swolchok
2024-09-23 17:10:34 +00:00
f0f79dd8f1 Correctly convert Python float to float64 when passing argument as Tensor (#136413)
I can't actually test the Dynamo codegen fix as it is impossible to
directly use the Tensor at the moment.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136413
Approved by: https://github.com/bobrenjc93
2024-09-23 16:48:08 +00:00
637d5c4b7e [DSD] Fix loading uneven full tensor into sharded state dict (#136365)
Fix #136228.

This is a follow up on https://github.com/pytorch/pytorch/pull/135725. We need to pass shape and stride from the original dtensor, since for uneven case, `from_local` would calculate shape and stride assuming the tensor is evenly-sharded based on the local tensor.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136365
Approved by: https://github.com/fegin
2024-09-23 16:35:58 +00:00
da51fe1c42 [FR] Fix errors in all2all check, improve some log output (#136399)
We found that we show the hashed pg name in our script output, which is not UX friendly.
Also we found a bug in our all2all check and we made a bunch of changes to error messages to make it better readable.

Differential Revision: [D63206469](https://our.internmc.facebook.com/intern/diff/D63206469)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136399
Approved by: https://github.com/c-p-i-o
2024-09-23 16:31:31 +00:00
df6a8fa1eb Revert "[aotd] Fix freezing API for subclasses (#136265)"
This reverts commit cdef760560049ebda5fb7e30b1703f345fe05cfa.

Reverted https://github.com/pytorch/pytorch/pull/136265 on behalf of https://github.com/atalman due to Breaks internal CI sorry, need to revert ([comment](https://github.com/pytorch/pytorch/pull/136265#issuecomment-2368772574))
2024-09-23 16:25:05 +00:00
9992084f38 [FSDP2] Fixed test_all_gather_extensions_monkey_patch (#136130)
I messed up the test before. The extensions were not running :/

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136130
Approved by: https://github.com/weifengpy
ghstack dependencies: #136129
2024-09-23 15:12:44 +00:00
b9f53c0dce [FSDP2] Added module, mp policy to fsdp_pre_all_gather (#136129)
- Sometimes having access to the `MixedPrecisionPolicy` in the `fsdp_pre_all_gather` is useful. See [here](https://github.com/pytorch/ao/pull/748/files#r1760375325) in the torchao INT8 mixed precision training PR.
- Sometimes having access to the owning `nn.Module` allows for using it for saving state. See [here](https://github.com/pytorch/pytorch/issues/114299#issuecomment-2298692762) for an example.

The major paint point here is how to deal with backward compatibility. For now, we use `signature.inspect` to check if the user subclass follows the old vs. new signature. However, for the new signature, the `param_dtype` in the post-all-gather is redundant, as if the user needed it, the user could save it from the `mp_policy` passed in the pre-all-gather now.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136129
Approved by: https://github.com/weifengpy
2024-09-23 15:12:36 +00:00
d21841d077 [AOTI] Create another wrapper class to handle ArrayRef (#136318)
Summary: Create another wrapper codegen class to handle ArrayRef for CPU. The goal is to simplify the regular cpp wrapper codegen logic and the generated cpp code.

Test Plan: CI

Differential Revision: D62961885

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136318
Approved by: https://github.com/frank-wei
2024-09-23 15:10:27 +00:00
0e19522122 Revert "Adds support for accelerated sorting with x86-simd-sort (#127936)"
This reverts commit 239a9ad65eebf93dcf9bb108a5129d4160b12c86.

Reverted https://github.com/pytorch/pytorch/pull/127936 on behalf of https://github.com/atalman due to test/test_sort_and_select.py::TestSortAndSelectCPU::test_sort_discontiguous_slow_cpu_float32 [GH job link](https://github.com/pytorch/pytorch/actions/runs/10994904767/job/30525578456) [HUD commit link](239a9ad65e) ([comment](https://github.com/pytorch/pytorch/pull/127936#issuecomment-2368522316))
2024-09-23 14:52:23 +00:00
bae427e4b1 Refactor maybe_evaluate_static into a worker function off of ShapeEnv (#135107)
By refactoring this way, I can put a non-expiring LRU cache here.
Splitting also will make it easier for me to tell who is using up all
the time.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135107
Approved by: https://github.com/aorenste
2024-09-23 14:39:20 +00:00
e9bfbf78d5 Revert "Allow fx graph caching higher order operators (opt-in) (#135877)"
This reverts commit 66d5eb64e0be91680a8531ccb24f098554610d46.

Reverted https://github.com/pytorch/pytorch/pull/135877 on behalf of https://github.com/jeanschmidt due to seems to have introduced regressions on rocm signals ([comment](https://github.com/pytorch/pytorch/pull/135877#issuecomment-2367616653))
2024-09-23 09:04:24 +00:00
cyy
75f141be62 Avoid unnecessary CMake warnings on Windows (#136393)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136393
Approved by: https://github.com/ezyang
2024-09-23 06:42:59 +00:00
663e760065 add unittest for OOM message (#129671)
Add unittest for the bug in #123984
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129671
Approved by: https://github.com/eqy
2024-09-23 04:48:01 +00:00
068fdd602f [export] enable custom tag metadata re-export test (#136048)
Improves and enables a commented out test originally introduced in #131912

In `test_custom_tag_metadata_re_export()`, we check the added "custom" metadata to given nodes is preserved and not copied to other nodes after re-exporting
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136048
Approved by: https://github.com/zhxchen17
2024-09-23 04:37:58 +00:00
66d5eb64e0 Allow fx graph caching higher order operators (opt-in) (#135877)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135877
Approved by: https://github.com/zou3519
2024-09-23 04:33:27 +00:00
cyy
a38e4c5e1e Enable clang-tidy warnings on aten/src/ATen/cuda/*.cpp (#134547)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134547
Approved by: https://github.com/ezyang
2024-09-23 03:44:55 +00:00
f276da7f98 Remove prims.slice_in_dim and prims.slice (#136150)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136150
Approved by: https://github.com/ezyang
2024-09-23 01:27:22 +00:00
3406ac24d9 [BE] fix circular import in torch/distributed/utils.py (#136286)
**Summary**
Fix circular import in `torch/distributed/utils.py` found when running internal test, see D62901023. Curious why this wasn't causing any issue. Is this relevant code deprecated and no longer used?

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136286
Approved by: https://github.com/Skylion007
2024-09-22 20:54:12 +00:00
3bc073d728 [aoti] Fix workspace generation for triton (#135552)
Fixes #131337

- add `arg_type` for workspace_arg, the type is consistent with the type in `generate_workspace_allocation()`.
- do not generate example tensors for `workspace`, and use `generate_workspace_allocation()` instead.
- add workspace allocation generation code to `kernel_autotune_calls`. e.g.
```python
    workspace = empty_strided_cuda((1280, ), (1, ), torch.uint8)
    workspace.zero_()
    .....
    triton_spl_fused_add_cumprod_0.run(buf2, arg0_1, arg1_1, workspace, 1, 10000, grid=split_scan_grid(1, 10000), stream=stream0)
    del buf2, arg0_1, arg1_1, workspace
```
-  add `empty_strided_cuda = torch._C._dynamo.guards._empty_strided_cuda` to the header of triton autotune code.

The generated cpp has lines like below, so we also implement a `zero_()` for ` AtenTensorHandle `.

```cpp
    static constexpr int64_t int_array_0[] = {1280L, };
    static constexpr int64_t int_array_1[] = {1L, };
    AtenTensorHandle workspace_handle;
    AOTI_TORCH_ERROR_CODE_CHECK(aoti_torch_empty_strided(1, int_array_0, int_array_1, cached_torch_dtype_uint8, cached_torch_device_type_cuda,  0, &workspace_handle));

        RAIIAtenTensorHandle workspace(workspace_handle);
        workspace.zero_();
```

- Fix handle grid_fn  for grid computation. Pass in "RBLOCK" to `split_scan_grid`
-  Fix dynamic shapes:
Without the fix we generate code that looks like this `workspace = empty_strided_cuda((32*((255 + s0) // 256), ), (1, ), torch.uint8)` when doing triton autotune and `s0` is not defined.

The solution approach is to use `V.graph.sizevars.size_hint(nbytes)` to realize the workspace size for triton autotune. Note that we only realize it for triton autotune code, but not for the cpp cuda code.

- We also generate slightly different cpp code depending on if `abi_compatible` is turned on.
```cpp
RAIIAtenTensorHandle workspace(workspace_handle);
AOTI_TORCH_ERROR_CODE_CHECK(aoti_torch_zero_(workspace.get()));
```
vs

```cpp
    at::Tensor workspace = at::detail::empty_strided_cuda({8L*(c10::div_floor_integer(static_cast<int64_t>((255L + s0)), static_cast<int64_t>(256L))), }, {1L, }, at::kByte, c10::DeviceType::CUDA);
    workspace.zero_();
```

Test Plan:

```
TORCHINDUCTOR_ABI_COMPATIBLE=1 TORCHINDUCTOR_CPP_WRAPPER=1  python test/inductor/test_torchinductor.py -k GPUTests.test_consecutive_split_cumprod_cuda
python test/inductor/test_cuda_cpp_wrapper.py TestCudaWrapper.test_consecutive_split_cumprod_cuda_cuda_wrapper
python test/inductor/test_cuda_cpp_wrapper.py DynamicShapesCudaWrapperCudaTests.test_consecutive_split_cumprod_cuda_dynamic_shapes_cuda_wrapper
TORCHINDUCTOR_ABI_COMPATIBLE=1 python test/inductor/test_cuda_cpp_wrapper.py TestCudaWrapper.test_consecutive_split_cumprod_cuda_cuda_wrapper
TORCHINDUCTOR_CPP_WRAPPER=1  python test/inductor/test_torchinductor.py -k GPUTests.test_consecutive_split_cumprod_cuda
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135552
Approved by: https://github.com/desertfire
2024-09-22 04:51:37 +00:00
35532fc477 [Partitioner] Reuse partition to check whether nodes exist (#135317)
The time complexity of find node whether in NodeList is O(n). Reuse partition to speed up due to partition.nodes is hash table and has same elements.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135317
Approved by: https://github.com/ezyang
2024-09-21 23:52:02 +00:00
cyy
e4cdc31227 [14/N] Fix clang-tidy warnings in aten/src/ATen (#133988)
Follows  #133807
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133988
Approved by: https://github.com/ezyang
2024-09-21 22:41:40 +00:00
9731ccb9e0 Type _dynamo/variables/lazy.py (#136376)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136376
Approved by: https://github.com/Skylion007
2024-09-21 22:18:02 +00:00
09715638ab Add _dynamo.config.suppress_errors logging (#136379)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136379
Approved by: https://github.com/ezyang
2024-09-21 21:00:26 +00:00
3176966732 update cache tests (#136215)
Summary:
- Clean up cache test code a bit.
- Removed patch_fbcode() - it turned out to cause flaky issues (image if it set fbcode=False and then loaded a module for the first time which had a top-level fbcode check).

Test Plan: unit tests

Reviewed By: oulgen

Differential Revision: D62648248

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136215
Approved by: https://github.com/bobrenjc93
2024-09-21 20:36:22 +00:00
be4b7e8131 Param fixes in docstring (#136097)
Fixes wrong param names in docstrings. cc: @kit1980

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136097
Approved by: https://github.com/ezyang
2024-09-21 18:56:34 +00:00
b6ffa381e1 [BE]: Add half CUDA support nextafter (#136373)
Making CUDA support match CPU support for nextafter
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136373
Approved by: https://github.com/ezyang
2024-09-21 17:13:45 +00:00
cc17d58809 Revert "S390x update builder image (#132983)"
This reverts commit 080a249fc2290602402e01bf5864d9d9a416e5b6.

Reverted https://github.com/pytorch/pytorch/pull/132983 on behalf of https://github.com/atalman due to Authenticate With PUSH is failing. Error: no registries found in registries.conf, a registry must be provided. Error: Process completed with exit code 125. ([comment](https://github.com/pytorch/pytorch/pull/132983#issuecomment-2365249249))
2024-09-21 16:46:54 +00:00
03957efa5d [inductor][scheduler] reorder scheduler nodes after fusion to reduce peak memory (#134874)
**Motivations**:
A topological order of the scheduler nodes that optimize the liveness of buffers can reduce the peak memory utilization. This has been observed and studied e.g., [here](https://arxiv.org/pdf/1910.02653) and [here](https://proceedings.mlr.press/v202/steiner23a/steiner23a.pdf).

**Solutions**:
1. implement a peak memory estimator via liveness analysis
2. implement a few memory aware topological sorting algorithms and pick the one with the lowest peak memory

**Results**:
On some models we can reduce the peak memory significantly:
|             model             | batch size | peak_memory baseline | peak_memory new | ratio |
|:-----------------------------:|:----------:|:--------------------:|:---------------:|:-----:|
| alexnet                       | 128        |         1.17         |       0.99      | 1.19  |
| vgg16                         | 64         |         4.10         |       3.57      | 1.15  |
| DebertaV2ForQuestionAnswering | 1          |         11.60        |      10.56      | 1.10  |

In the presence of compiler based AC, peak memory can be further reduced:
|              model             | batch size | peak_memory baseline | peak_memory new | ratio |
|:------------------------------:|:----------:|:--------------------:|:---------------:|:-----:|
| AlbertForMaskedLM              | 4          |         6.87         |       6.43      | 1.07  |
| AlbertForQuestionAnswering     | 4          |         8.69         |       7.76      | 1.12  |
| MobileBertForQuestionAnswering | 128        |         4.67         |       3.90      | 1.20  |

[Here](https://fb.workplace.com/groups/1075192433118967/posts/1499920537312819/?comment_id=1499938843977655&reply_comment_id=1499951630643043) is an internal use case.

**Other infos:**
* neutral model runtime, because the the reordering happens after fusion. So memory saving is _for free_.
* minimal compile time overhead as the algorithm is linear in the number of edges of the inductor graph. For all hugglingface benchmark models, the additional compile time is less than 1 second.
* no peak memory regression since we only adopt a new order if the peak memory is reduced based on the estimator. However, the model is unaware of operators' working memories, but for large models, the working memory should be negligible. We haven't observed any significant regressions on all of our tests.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134874
Approved by: https://github.com/yf225
2024-09-21 16:28:38 +00:00
fb4670a1f9 fix mean_out: op does not update parameter out for BF16/FP16 dtype on CPU (#135174)
Fixes #134848

For BF16/FP16, when a tensor is specified in `out` parameter of mean, the mean kernel should use its storage for output, but that doesn't happen, since an `at::to` in the current code causes storage to be allocated again, but the `out` parameter tensor's storage doesn't get updated, resulting in it not holding the mean output.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135174
Approved by: https://github.com/soulitzer
2024-09-21 14:21:42 +00:00
ea737e4e5d [Pipelining] Make PipelineStage support meta initialization (#136243)
Avoid allocating memory or dry-running the submodule during stage init.

Save user-provided input/output metadata during stage init, to allow
lazily initializing the buffers before the first step call.

Later, we plan to build on top of this to add lazy shape inference
(#130856) so that no input/output shapes are required at stage init.

For now, we require input/output tensors for stage init, but these
should be on meta device and stage should not allocate any real memory.

Note: this needs more thorough testing and review, but it worked on the
torchtitan 3d test.

TODO:
- delete 'device' arg from PipelineStage ctor? (move it to inferred from
  args tensors passed to first step call? separate PR.
- delete 'output_args' from PipelineStage ctor? we don't actually need
  it, but we use it to do shape validation, which is why I didn't remove
  it in this PR.  Proposal: leave it until we add lazy shape inference?

Fixes #136225, #136226

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136243
Approved by: https://github.com/H-Huang, https://github.com/kwen2501
2024-09-21 09:47:22 +00:00
cyy
c459430558 Pass Werror to CUDA host compiler (#130213)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130213
Approved by: https://github.com/ezyang
2024-09-21 08:01:06 +00:00
e18439113e [PT2][Inductor][Optmus] fix test_pad_mm_bf16 and reland to fix long computation kernel (#136349)
Summary: see D62220158

Test Plan:
```
buck2 test 'fbcode//mode/opt' fbcode//caffe2/test/inductor:pad_mm -- --exact 'caffe2/test/inductor:pad_mm - test_pad_mm_bf16 (caffe2.test.inductor.test_pad_mm.PadMMTest)' --run-disabled
```

### H100

Buck UI: https://www.internalfb.com/buck2/e5d85802-cab7-41a5-aacc-95f541796a99
Test UI: https://www.internalfb.com/intern/testinfra/testrun/9570149258587374
Network: Up: 9.1KiB  Down: 0B  (reSessionID-b339b51b-6a0e-4347-9414-1ba38f26a5d0)
Jobs completed: 9. Time elapsed: 1:15.7s.
Cache hits: 0%. Commands: 3 (cached: 0, remote: 0, local: 3)
Tests finished: Pass 1. Fail 0. Fatal 0. Skip 1. Build failure 0

### A100

Buck UI: https://www.internalfb.com/buck2/1082ad6e-56b0-4eb5-8092-ce507ca9a70e
Test UI: https://www.internalfb.com/intern/testinfra/testrun/8444249533824784
Network: Up: 9.2KiB  Down: 0B  (reSessionID-2b3056ac-f29e-4de4-b6f5-9d994acf566b)
Jobs completed: 9. Time elapsed: 1:36.9s.
Cache hits: 0%. Commands: 3 (cached: 0, remote: 0, local: 3)
Tests finished: Pass 2. Fail 0. Fatal 0. Skip 0. Build failure 0

# E2E

see D62220158

Differential Revision: D63040455

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136349
Approved by: https://github.com/dshi7
2024-09-21 06:35:50 +00:00
cyy
02871461f7 Fix clang-tidy warnings in torch/csrc/lazy (#134655)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134655
Approved by: https://github.com/ezyang
2024-09-21 02:59:35 +00:00
0b91e7e2dc Remove duplicate line (#136383)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136383
Approved by: https://github.com/kit1980, https://github.com/malfet
2024-09-21 01:35:13 +00:00
eqy
29f7b8d483 [TF32] Account for TF32 in test_conv_double_backward (#135716)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135716
Approved by: https://github.com/Skylion007
2024-09-21 01:06:22 +00:00
7936584a88 Fix Vectorized<double>::next_after SVE compilation (#136388)
Should have called [`Sleef_nextafterdx_sve`](https://sleef.org/2-references/libm/aarch64#vectorized-double-precision-function-for-obtaining-the-next-representable-fp-value) rather than [`Sleef_nextafterfx_sve`](https://sleef.org/2-references/libm/aarch64#vectorized-single-precision-function-for-obtaining-the-next-representable-fp-value) to get vectorized `nextafter` for double precision rather than single precision values

This fixes a compilation issue introduced by https://github.com/pytorch/pytorch/pull/119571 and exposed by https://github.com/pytorch/pytorch/pull/133339

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136388
Approved by: https://github.com/kit1980
2024-09-20 23:54:17 +00:00
067d203b22 Upgrade pybind11 API calls for 3.13t (#136370)
This is a modified version of https://github.com/pytorch/pytorch/pull/130341 that preserve support for older pybind version.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136370
Approved by: https://github.com/Skylion007, https://github.com/malfet
2024-09-20 23:09:55 +00:00
1a10751731 [AOTI][Tooling] Filter out kernels based off lowercase names (#135395)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135395
Approved by: https://github.com/YUNQIUGUO
2024-09-20 21:56:08 +00:00
0c936c3ecb Add decomps for max_unpool (#133146)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133146
Approved by: https://github.com/amjames, https://github.com/eellison
2024-09-20 21:35:25 +00:00
293fccf86d add TORCH_CUDA_CPP_API for AutoNcclGroup (#130012)
`torch::cuda::nccl` is an option for developers to depend only on torch but not nccl. But to use `torch::cuda::nccl::send`/`torch::cuda::nccl::recv`, `ncclGroupStart()`/`ncclGroupEnd()` is needed,  `torch::cuda::nccl::AutoNcclGroup` can be used.  but `torch::cuda::nccl::AutoNcclGroup` is not exported and is LOCAL symbol, which can't be used from outside of libtorch.

<img width="1618" alt="image" src="https://github.com/pytorch/pytorch/assets/1913192/25b0bd54-2da6-480f-876d-b05acfecfe62">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130012
Approved by: https://github.com/kwen2501, https://github.com/eqy
2024-09-20 21:20:25 +00:00
239a9ad65e Adds support for accelerated sorting with x86-simd-sort (#127936)
Adds x86-simd-sort as a submodule to accelerate sorting for 32-bit and 64-bit datatypes when AVX2 or AVX512 are available.

For contiguous data, this can be over a 10x speedup for large arrays. For discontiguous data, it can give over a 4x speedup with larger arrays. These benchmarks were gathered on a Skylake system (7900x), limited to 8 threads.

<details>
<summary><b>Contiguous Benchmarks</b></summary>

```
float32, normally distributed (in microseconds)
size           Default        AVX2           AVX512         Default/AVX2   Default/AVX512
16             7.150844336    6.886271477    7.132277489    1.038420335    1.002603214
128            9.208030939    8.478154898    7.846915245    1.086089019    1.173458697
1024           37.79037627    23.60707456    16.44122627    1.600807257    2.298513241
10000          714.7355628    203.9921844    105.5683001    3.503739934    6.770361577
100000         8383.074408    721.6333354    465.3709247    11.61680593    18.01374766
1000000        97124.31945    5632.054572    3920.148401    17.24491803    24.77567416
10000000       1161974.907    86070.48988    71533.82301    13.50027063    16.24371323

int32_t, uniformly distributed (in microseconds)
size           Default        AVX2           AVX512         Default/AVX2   Default/AVX512
16             7.203208685    6.92212224     7.014458179    1.040606975    1.026908779
128            8.972388983    8.195516348    7.592543125    1.094792396    1.18173698
1024           32.77489477    23.6874548     15.36617105    1.383639359    2.132925285
10000          607.8824128    193.3402024    99.25090471    3.144107667    6.124703997
100000         523.9384684    608.1836536    442.3166784    0.861480682    1.184532472
1000000        5211.348627    5271.598405    3518.861883    0.988570871    1.480975611
10000000       133853.6263    81463.05084    67852.97394    1.643120714    1.972700952
```

</details>

Note that the int32_t sort is accelerated by FBGEMM's radix sort for larger arrays, but this only handles contiguous data and in one sorting direction.

<details>
<summary><b>Discontiguous Benchmarks</b></summary>

```
float, normal distributed, discontiguous in sorted dimension (in microseconds)
size           Default        AVX2           AVX512         Default/AVX2   Default/AVX512
16             3.836543679    4.011214256    3.84376061     0.956454439    0.99812243
128            5.755310194    5.755723127    4.820394962    0.999928257    1.193949923
1024           49.46946019    24.78790785    15.47874362    1.995709379    3.195960952
10000          665.2505291    236.6165959    143.9490662    2.811512551    4.621429974
100000         4328.002203    1329.001212    818.3516414    3.256582586    5.288682743
1000000        47651.5018     16693.72045    11827.39551    2.854456677    4.028909133
10000000       556655.1288    236252.6258    184215.9828    2.356185998    3.021752621

int32_t, uniformly distributed, discontiguous in sorted dimension  (in microseconds)
size           Default        AVX2           AVX512         Default/AVX2   Default/AVX512
16             3.817994356    3.878117442    3.770039797    0.984496837    1.012719908
128            5.578731397    5.577152082    4.716770534    1.000283176    1.182743862
1024           43.3412619     23.61275801    14.55446819    1.835501887    2.977866408
10000          634.3997478    224.4322851    133.9518324    2.826686667    4.736028889
100000         4084.358152    1292.363303    781.7867576    3.16037924     5.22438902
1000000        46262.20465    16608.35284    11367.51817    2.785478192    4.06968381
10000000       541231.9104    235185.1861    180249.9294    2.301301028    3.002674742
```

</details>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127936
Approved by: https://github.com/jgong5, https://github.com/peterbell10
2024-09-20 21:19:33 +00:00
cyy
d2455b99fb Use cpython declaration of _PyWeakref_ClearRef (#136300)
To avoid the DLL inconsistency warning by MSVC:
```
torch/csrc/utils/python_compat.h(38): warning C4273: '_PyWeakref_ClearRef': inconsistent dll linkage
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136300
Approved by: https://github.com/Skylion007
2024-09-20 18:58:58 +00:00
7f9c06462f fix mypi in utils/_sympy/functions.py (#136339)
Signed-off-by: Bob Ren <bobren@fb.com>

Turns out older versions of python, in particular 3.8 shows errors that 3.12 doesn't. For posterity these are the steps I took to reproduce:

```
conda create -n py38 python=3.8
conda activate py38
pip install -r requirements.txt
lintrunner init
dmypy restart && lintrunner --all-files --take MYPY
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136339
Approved by: https://github.com/Skylion007
ghstack dependencies: #136205
2024-09-20 18:39:16 +00:00
f53a0f9cc1 [Inductor] Fix test_profiler_mark_wrapper_call_cuda_cuda_wrapper (#136356)
Summary: Internal profiler behaves differently after turning on triton.autotune_at_compile_time. Needs more investigation but turning it off for this test for now.

Reviewed By: henrylhtsang

Differential Revision: D63035855

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136356
Approved by: https://github.com/henrylhtsang
2024-09-20 18:35:09 +00:00
5997354151 Add more distributed examples (#130427)
1. Add `gather` example
2. Add device to `scatter` example
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130427
Approved by: https://github.com/kwen2501
2024-09-20 18:27:27 +00:00
df1eef9779 Revert "[torch][ao] Add customizable loss function to NodeAccuracySummary (#136282)"
This reverts commit f3c54ccf8f6139807f4623037c0174964a286652.

Reverted https://github.com/pytorch/pytorch/pull/136282 on behalf of https://github.com/huydhn due to This breaks OSS, let revert it and land the revert internally then ([comment](https://github.com/pytorch/pytorch/pull/136282#issuecomment-2364219252))
2024-09-20 17:49:06 +00:00
15dba021bb [ROCm][CI] upgrade CI to ROCm 6.2 (#132555)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132555
Approved by: https://github.com/pruthvistony, https://github.com/malfet
2024-09-20 17:39:31 +00:00
29affa6b95 return instead of using skipTest (#136244)
Summary:
Return from functions instead of using `skipTest`.
This is mostly to make our test report happier.
Skipped tests still show up in our  Broken test report.

```
OK (skipped=1)
I0917 16:14:24.749060 1018907 StorageDemandControl.cpp:572] Flushing Demand Control ODS counters

Skipped: Store doesn't support extended APIs
```

Test Plan:
Tested locally.
Test shows up as passed instead of skipped.

```
Cache hits: 99%. Commands: 125048 (cached: 124961, remote: 10, local: 77)
Tests finished: Pass 1. Fail 0. Fatal 0. Skip 0. Build failure 0
```

Differential Revision: D62912065

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136244
Approved by: https://github.com/XilunWu
2024-09-20 17:36:28 +00:00
d7a6980078 [inductor] Make DtypeView work with cpp_wrapper without abi_compatible (#136233)
Fixes #136159

Prior to this PR, using cpp_wrapper without abi_compatible could result in incorrect dtypes.

The following block of code implements cpp_wrapper codegen for reinterpret_view for abi_compatible mode, but not for non-abi_compatible mode.

f6f1504d39/torch/_inductor/codegen/cpp_wrapper_cpu.py (L1678-L1814)

Added a test that verifies that we keep the view behavior, but returned tensors also have correct dtypes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136233
Approved by: https://github.com/FindHao, https://github.com/eellison, https://github.com/jansel
2024-09-20 17:30:35 +00:00
080a249fc2 S390x update builder image (#132983)
S390x update builder image
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132983
Approved by: https://github.com/huydhn, https://github.com/malfet
2024-09-20 17:26:26 +00:00
783c5ba80a Revert "[PT2/Profiler] Add Context Info to Torch-Compiled Regions (#132765)"
This reverts commit 0b81f700aa7eb20d4b9f20e9627dd1208e50ea58.

Reverted https://github.com/pytorch/pytorch/pull/132765 on behalf of https://github.com/ezyang due to implementation is not correct, needs full rewrite ([comment](https://github.com/pytorch/pytorch/pull/132765#issuecomment-2364160452))
2024-09-20 17:10:27 +00:00
cdef760560 [aotd] Fix freezing API for subclasses (#136265)
Original issue:
https://github.com/pytorch/ao/issues/890

The problem:

TracingContext.flat_params contain original params, with not desugared Subclasses.
While inductor.freezing API works on aot graphs, which already desugared Subclasses.

flat_params are used only for this logic and storing in them desguared subclasses fixes the issue.

Testing:
```
python test/functorch/test_aotdispatch.py -k test_inductor_freezing_with_subclasses
```
Torch AO original failure:
```
python test/integration/test_integration.py -k test_int8_weight_only_quant_with_freeze
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136265
Approved by: https://github.com/bdhirsh
2024-09-20 16:32:49 +00:00
4842f0fac6 Enable torch build with SLEEF on ARM by default (#133339)
**Scope:** Enable PyTorch build with SLEEF on Arm by default. Enable codegen kernels compilation with SLEEF on ARM platform.

Enabling the build with SLEEF by default and setting `AT_BUILD_ARM_VEC256_WITH_SLEEF` as the default for Arm  improves performance for some models. I have benchmarked several networks on `Neoverse-V1` using `torch.compile` with the `inductor` backend.
On models  like `hf_Bert_Large` , `hf_GPT_fast`, we're seeing a **~1.2x speedup** (with 16 threads).

The below results are run with `Batch_Size=1` and `Cores=8, 16`

![Screenshot 2024-08-27 at 17 04 23](https://github.com/user-attachments/assets/319c7ef7-1202-4145-a51a-7a80dfd5f1f6)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133339
Approved by: https://github.com/malfet, https://github.com/kimishpatel

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2024-09-20 16:02:32 +00:00
f3c54ccf8f [torch][ao] Add customizable loss function to NodeAccuracySummary (#136282)
Summary:
Add a customizable loss function callback to NodeAccuracySummary to
allow users to pass in their own loss function.

Also, fix some type errors and propagate better exception messages when
unexpected tensor comparisons occur. Finally, enhance the robustness of
`generate_numeric_debug_handle` in the case where it is called multiple
times on the same model, by avoiding reuse of the same IDs.

Test Plan: Added a test for this case in `test_numeric_debugger`.

Reviewed By: jerryzh168

Differential Revision: D62898297

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136282
Approved by: https://github.com/jerryzh168
2024-09-20 07:34:52 +00:00
687e5cf8c5 [inductor] Relax the conditions for loop split (#135335)
Summary
This PR Relaxes the conditions for loop split to support dynamic shape cases.
Now the conditions that need to be met to apply loop split optimization are as follows:

1. No reduction and no mudular index for all nodes.
2. The indexing_exprs of all nodes contain only one (or more, but all the same) division, where the divisor is an integer, the dividend is one of the iter_vars, and this var, i.e. the dimension that needs to be split, is contiguous in all other indexing_exprs.

Example:
```
import torch
import torch.nn as nn

class GN(torch.nn.Module):
    def __init__(self, num_groups, num_channels):
        super(GN, self).__init__()
        self.gn = nn.GroupNorm(num_groups, num_channels)

    def forward(self, x):
        return self.gn(x)

input = torch.randn(2, 960, 96, 96).to(memory_format=torch.channels_last)
m = GN(32, 960).eval()
compiled_m = torch.compile(m, dynamic=True)

with torch.no_grad():
    compiled_m(input)
```

Before loop split, the node's var_ranges: `{z0: s0, z1: s2, z2: s2, z3: 960}` and indexing_exprs: `{'index0': 960*s2**2*z0 + 960*s2*z1 + 960*z2 + z3, 'index1': 32*z0 + (z3//30), 'index2': 30*s2**2, 'index3': z3, 'index4': 960*s2*z0*((s2**2//s2)) + 960*z1*((s2**2//s2)) + 960*z2 + z3}`. After loop split `z3` will split to `30*z3 + z4`, then the node's var_ranges will be changed to `{z0: s0, z1: s2, z2: s2, z3: 32, z4: 30}` and indexing_exprs will be changed to `{'index0': 960*s2**2*z0 + 960*s2*z1 + 960*z2 + 30*z3 + z4, 'index1': 32*z0 + z3, 'index2': 30*s2**2, 'index3': 30*z3 + z4, 'index4': 960*s2*z0*((s2**2//s2)) + 960*z1*((s2**2//s2)) + 960*z2 + 30*z3 + z4}`

Generated code:

- Before:
```
cpp_fused_native_group_norm_0 = async_compile.cpp_pybinding(['const float*', 'const float*', 'const float*', 'float*', 'float*', 'float*', 'const int64_t', 'const int64_t'], '''
#include "/tmp/torchinductor_jiayisun/32/c32dcqa3qidvmunis4lucp3dhoicleq5qjfjfgvpiadbbzfp6ofy.h"
extern "C"  void kernel(const float* in_ptr0,
                       const float* in_ptr1,
                       const float* in_ptr2,
                       float* out_ptr0,
                       float* out_ptr1,
                       float* out_ptr2,
                       const int64_t ks0,
                       const int64_t ks1)
{
    #pragma omp parallel num_threads(112)
    {
        int tid = omp_get_thread_num();
        {
            #pragma omp for collapse(2)
            for(int64_t x0=static_cast<int64_t>(0L); x0<static_cast<int64_t>(ks0); x0+=static_cast<int64_t>(1L))
            {
                for(int64_t x1=static_cast<int64_t>(0L); x1<static_cast<int64_t>(32L); x1+=static_cast<int64_t>(1L))
                {
                    {
                        Welford<float> tmp_acc0 = Welford<float>();
                        Welford<at::vec::Vectorized<float>> tmp_acc0_vec = Welford<at::vec::Vectorized<float>>();
                        Welford<at::vec::Vectorized<float>> masked_tmp_acc0_vec = Welford<at::vec::Vectorized<float>>();
                        static WeightRecp<at::vec::Vectorized<float>> wrecps0(static_cast<int64_t>(c10::div_floor_integer(static_cast<int64_t>((15L*(static_cast<int64_t>(ks1*ks1)))), static_cast<int64_t>(8L))));
                        for(int64_t x2=static_cast<int64_t>(0L); x2<static_cast<int64_t>(static_cast<int64_t>(ks1*ks1)); x2+=static_cast<int64_t>(1L))
                        {
                            for(int64_t x3=static_cast<int64_t>(0L); x3<static_cast<int64_t>(16L); x3+=static_cast<int64_t>(16L))
                            {
                                auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + static_cast<int64_t>(x3 + (30L*x1) + (960L*x2) + (960L*x0*(static_cast<int64_t>(ks1*ks1)))), static_cast<int64_t>(16));
                                tmp_acc0_vec = welford_combine(tmp_acc0_vec, tmp0, &wrecps0);
                            }
                            for(int64_t x3=static_cast<int64_t>(16L); x3<static_cast<int64_t>(30L); x3+=static_cast<int64_t>(14L))
                            {
                                auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + static_cast<int64_t>(x3 + (30L*x1) + (960L*x2) + (960L*x0*(static_cast<int64_t>(ks1*ks1)))), static_cast<int64_t>(14L));
                                masked_tmp_acc0_vec = welford_combine(masked_tmp_acc0_vec, tmp0, static_cast<int64_t>(14L), &wrecps0);
                            }
                        }
                        tmp_acc0 = welford_combine(tmp_acc0, welford_vec_reduce_all(masked_tmp_acc0_vec));
                        tmp_acc0 = welford_combine(tmp_acc0, welford_vec_reduce_all(tmp_acc0_vec));
                        out_ptr0[static_cast<int64_t>(x1 + (32L*x0))] = static_cast<float>(tmp_acc0.mean);
                        out_ptr1[static_cast<int64_t>(x1 + (32L*x0))] = static_cast<float>(tmp_acc0.m2);
                    }
                }
            }
        }
        {
            #pragma omp for collapse(2)
            for(int64_t x0=static_cast<int64_t>(0L); x0<static_cast<int64_t>(ks0); x0+=static_cast<int64_t>(1L))
            {
                for(int64_t x1=static_cast<int64_t>(0L); x1<static_cast<int64_t>(ks1); x1+=static_cast<int64_t>(1L))
                {
                    #pragma GCC ivdep
                    for(int64_t x2=static_cast<int64_t>(0L); x2<static_cast<int64_t>(ks1); x2+=static_cast<int64_t>(1L))
                    {
                        #pragma GCC ivdep
                        for(int64_t x3=static_cast<int64_t>(0L); x3<static_cast<int64_t>(960L); x3+=static_cast<int64_t>(1L))
                        {
                            auto tmp0 = in_ptr0[static_cast<int64_t>(x3 + (960L*x2) + (960L*ks1*x1) + (960L*x0*(static_cast<int64_t>(ks1*ks1))))];
                            auto tmp1 = out_ptr0[static_cast<int64_t>((32L*x0) + (c10::div_floor_integer(static_cast<int64_t>(x3), static_cast<int64_t>(30L))))];
                            auto tmp3 = out_ptr1[static_cast<int64_t>((32L*x0) + (c10::div_floor_integer(static_cast<int64_t>(x3), static_cast<int64_t>(30L))))];
                            auto tmp11 = in_ptr1[static_cast<int64_t>(x3)];
                            auto tmp13 = in_ptr2[static_cast<int64_t>(x3)];
                            auto tmp2 = decltype(tmp0)(tmp0 - tmp1);
                            auto tmp4 = 30L*(static_cast<int64_t>(ks1*ks1));
                            auto tmp5 = c10::convert<float>(tmp4);
                            auto tmp6 = tmp3 / tmp5;
                            auto tmp7 = static_cast<float>(1e-05);
                            auto tmp8 = decltype(tmp6)(tmp6 + tmp7);
                            auto tmp9 = 1 / std::sqrt(tmp8);
                            auto tmp10 = decltype(tmp2)(tmp2 * tmp9);
                            auto tmp12 = decltype(tmp10)(tmp10 * tmp11);
                            auto tmp14 = decltype(tmp12)(tmp12 + tmp13);
                            out_ptr2[static_cast<int64_t>(x3 + (960L*x2) + (960L*x1*(c10::div_floor_integer(static_cast<int64_t>((static_cast<int64_t>(ks1*ks1))), static_cast<int64_t>(ks1)))) + (960L*ks1*x0*(c10::div_floor_integer(static_cast<int64_t>((static_cast<int64_t>(ks1*ks1))), static_cast<int64_t>(ks1)))))] = tmp14;
                        }
                    }
                }
            }
        }
    }
}
''')

async_compile.wait(globals())
del async_compile

def call(args):
    arg0_1, arg1_1, arg2_1, arg3_1, arg4_1 = args
    args.clear()
    s0 = arg2_1
    s2 = arg3_1
    assert_size_stride(arg0_1, (960, ), (1, ))
    assert_size_stride(arg1_1, (960, ), (1, ))
    assert_size_stride(arg4_1, (s0, 960, s2, s2), (960*(s2*s2), 1, 960*s2, 960))
    buf0 = empty_strided_cpu((s0, 32, 1, 1), (32, 1, 32*s0, 32*s0), torch.float32)
    buf1 = empty_strided_cpu((s0, 32, 1, 1), (32, 1, 32*s0, 32*s0), torch.float32)
    buf3 = empty_strided_cpu((s0, 960, s2, s2), (960*s2*((s2*s2) // s2), 1, 960*((s2*s2) // s2), 960), torch.float32)
    cpp_fused_native_group_norm_0(arg4_1, arg0_1, arg1_1, buf0, buf1, buf3, s0, s2)
    del arg0_1
    del arg1_1
    del arg4_1
    return (buf3, )
```

After:
```
cpp_fused_native_group_norm_0 = async_compile.cpp_pybinding(['const float*', 'const float*', 'const float*', 'float*', 'float*', 'float*', 'const int64_t', 'const int64_t'], '''
#include "/tmp/torchinductor_jiayisun/32/c32dcqa3qidvmunis4lucp3dhoicleq5qjfjfgvpiadbbzfp6ofy.h"
extern "C"  void kernel(const float* in_ptr0,
                       const float* in_ptr1,
                       const float* in_ptr2,
                       float* out_ptr0,
                       float* out_ptr1,
                       float* out_ptr2,
                       const int64_t ks0,
                       const int64_t ks1)
{
    #pragma omp parallel num_threads(112)
    {
        int tid = omp_get_thread_num();
        {
            #pragma omp for collapse(2)
            for(int64_t x0=static_cast<int64_t>(0L); x0<static_cast<int64_t>(ks0); x0+=static_cast<int64_t>(1L))
            {
                for(int64_t x1=static_cast<int64_t>(0L); x1<static_cast<int64_t>(32L); x1+=static_cast<int64_t>(1L))
                {
                    {
                        Welford<float> tmp_acc0 = Welford<float>();
                        Welford<at::vec::Vectorized<float>> tmp_acc0_vec = Welford<at::vec::Vectorized<float>>();
                        Welford<at::vec::Vectorized<float>> masked_tmp_acc0_vec = Welford<at::vec::Vectorized<float>>();
                        static WeightRecp<at::vec::Vectorized<float>> wrecps0(static_cast<int64_t>(c10::div_floor_integer(static_cast<int64_t>((15L*(static_cast<int64_t>(ks1*ks1)))), static_cast<int64_t>(8L))));
                        for(int64_t x2=static_cast<int64_t>(0L); x2<static_cast<int64_t>(static_cast<int64_t>(ks1*ks1)); x2+=static_cast<int64_t>(1L))
                        {
                            for(int64_t x3=static_cast<int64_t>(0L); x3<static_cast<int64_t>(16L); x3+=static_cast<int64_t>(16L))
                            {
                                auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + static_cast<int64_t>(x3 + (30L*x1) + (960L*x2) + (960L*x0*(static_cast<int64_t>(ks1*ks1)))), static_cast<int64_t>(16));
                                tmp_acc0_vec = welford_combine(tmp_acc0_vec, tmp0, &wrecps0);
                            }
                            for(int64_t x3=static_cast<int64_t>(16L); x3<static_cast<int64_t>(30L); x3+=static_cast<int64_t>(14L))
                            {
                                auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + static_cast<int64_t>(x3 + (30L*x1) + (960L*x2) + (960L*x0*(static_cast<int64_t>(ks1*ks1)))), static_cast<int64_t>(14L));
                                masked_tmp_acc0_vec = welford_combine(masked_tmp_acc0_vec, tmp0, static_cast<int64_t>(14L), &wrecps0);
                            }
                        }
                        tmp_acc0 = welford_combine(tmp_acc0, welford_vec_reduce_all(masked_tmp_acc0_vec));
                        tmp_acc0 = welford_combine(tmp_acc0, welford_vec_reduce_all(tmp_acc0_vec));
                        out_ptr0[static_cast<int64_t>(x1 + (32L*x0))] = static_cast<float>(tmp_acc0.mean);
                        out_ptr1[static_cast<int64_t>(x1 + (32L*x0))] = static_cast<float>(tmp_acc0.m2);
                    }
                }
            }
        }
        {
            #pragma omp for collapse(2)
            for(int64_t x0=static_cast<int64_t>(0L); x0<static_cast<int64_t>(ks0); x0+=static_cast<int64_t>(1L))
            {
                for(int64_t x1=static_cast<int64_t>(0L); x1<static_cast<int64_t>(ks1); x1+=static_cast<int64_t>(1L))
                {
                    #pragma GCC ivdep
                    for(int64_t x2=static_cast<int64_t>(0L); x2<static_cast<int64_t>(ks1); x2+=static_cast<int64_t>(1L))
                    {
                        #pragma GCC ivdep
                        for(int64_t x3=static_cast<int64_t>(0L); x3<static_cast<int64_t>(32L); x3+=static_cast<int64_t>(1L))
                        {
                            for(int64_t x4=static_cast<int64_t>(0L); x4<static_cast<int64_t>(16L); x4+=static_cast<int64_t>(16L))
                            {
                                auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + static_cast<int64_t>(x4 + (30L*x3) + (960L*x2) + (960L*ks1*x1) + (960L*x0*(static_cast<int64_t>(ks1*ks1)))), static_cast<int64_t>(16));
                                auto tmp1 = out_ptr0[static_cast<int64_t>(x3 + (32L*x0))];
                                auto tmp4 = out_ptr1[static_cast<int64_t>(x3 + (32L*x0))];
                                auto tmp13 = at::vec::Vectorized<float>::loadu(in_ptr1 + static_cast<int64_t>(x4 + (30L*x3)), static_cast<int64_t>(16));
                                auto tmp15 = at::vec::Vectorized<float>::loadu(in_ptr2 + static_cast<int64_t>(x4 + (30L*x3)), static_cast<int64_t>(16));
                                auto tmp2 = at::vec::Vectorized<float>(tmp1);
                                auto tmp3 = tmp0 - tmp2;
                                auto tmp5 = 30L*(static_cast<int64_t>(ks1*ks1));
                                auto tmp6 = c10::convert<float>(tmp5);
                                auto tmp7 = tmp4 / tmp6;
                                auto tmp8 = static_cast<float>(1e-05);
                                auto tmp9 = decltype(tmp7)(tmp7 + tmp8);
                                auto tmp10 = 1 / std::sqrt(tmp9);
                                auto tmp11 = at::vec::Vectorized<float>(tmp10);
                                auto tmp12 = tmp3 * tmp11;
                                auto tmp14 = tmp12 * tmp13;
                                auto tmp16 = tmp14 + tmp15;
                                tmp16.store(out_ptr2 + static_cast<int64_t>(x4 + (30L*x3) + (960L*x2) + (960L*x1*(c10::div_floor_integer(static_cast<int64_t>((static_cast<int64_t>(ks1*ks1))), static_cast<int64_t>(ks1)))) + (960L*ks1*x0*(c10::div_floor_integer(static_cast<int64_t>((static_cast<int64_t>(ks1*ks1))), static_cast<int64_t>(ks1))))));
                            }
                            for(int64_t x4=static_cast<int64_t>(16L); x4<static_cast<int64_t>(30L); x4+=static_cast<int64_t>(14L))
                            {
                                auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + static_cast<int64_t>(x4 + (30L*x3) + (960L*x2) + (960L*ks1*x1) + (960L*x0*(static_cast<int64_t>(ks1*ks1)))), static_cast<int64_t>(14L));
                                auto tmp1 = out_ptr0[static_cast<int64_t>(x3 + (32L*x0))];
                                auto tmp4 = out_ptr1[static_cast<int64_t>(x3 + (32L*x0))];
                                auto tmp13 = at::vec::Vectorized<float>::loadu(in_ptr1 + static_cast<int64_t>(x4 + (30L*x3)), static_cast<int64_t>(14L));
                                auto tmp15 = at::vec::Vectorized<float>::loadu(in_ptr2 + static_cast<int64_t>(x4 + (30L*x3)), static_cast<int64_t>(14L));
                                auto tmp2 = at::vec::Vectorized<float>(tmp1);
                                auto tmp3 = tmp0 - tmp2;
                                auto tmp5 = 30L*(static_cast<int64_t>(ks1*ks1));
                                auto tmp6 = c10::convert<float>(tmp5);
                                auto tmp7 = tmp4 / tmp6;
                                auto tmp8 = static_cast<float>(1e-05);
                                auto tmp9 = decltype(tmp7)(tmp7 + tmp8);
                                auto tmp10 = 1 / std::sqrt(tmp9);
                                auto tmp11 = at::vec::Vectorized<float>(tmp10);
                                auto tmp12 = tmp3 * tmp11;
                                auto tmp14 = tmp12 * tmp13;
                                auto tmp16 = tmp14 + tmp15;
                                tmp16.store(out_ptr2 + static_cast<int64_t>(x4 + (30L*x3) + (960L*x2) + (960L*x1*(c10::div_floor_integer(static_cast<int64_t>((static_cast<int64_t>(ks1*ks1))), static_cast<int64_t>(ks1)))) + (960L*ks1*x0*(c10::div_floor_integer(static_cast<int64_t>((static_cast<int64_t>(ks1*ks1))), static_cast<int64_t>(ks1))))), static_cast<int64_t>(14L));
                            }
                        }
                    }
                }
            }
        }
    }
}
''')

async_compile.wait(globals())
del async_compile

def call(args):
    arg0_1, arg1_1, arg2_1, arg3_1, arg4_1 = args
    args.clear()
    s0 = arg2_1
    s2 = arg3_1
    assert_size_stride(arg0_1, (960, ), (1, ))
    assert_size_stride(arg1_1, (960, ), (1, ))
    assert_size_stride(arg4_1, (s0, 960, s2, s2), (960*(s2*s2), 1, 960*s2, 960))
    buf0 = empty_strided_cpu((s0, 32, 1, 1), (32, 1, 32*s0, 32*s0), torch.float32)
    buf1 = empty_strided_cpu((s0, 32, 1, 1), (32, 1, 32*s0, 32*s0), torch.float32)
    buf3 = empty_strided_cpu((s0, 960, s2, s2), (960*s2*((s2*s2) // s2), 1, 960*((s2*s2) // s2), 960), torch.float32)
    cpp_fused_native_group_norm_0(arg4_1, arg0_1, arg1_1, buf0, buf1, buf3, s0, s2)
    del arg0_1
    del arg1_1
    del arg4_1
    return (buf3, )
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135335
Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5, https://github.com/jansel
2024-09-20 05:42:52 +00:00
cf31724db7 Fix and improvements to toward 3.13t (#136319)
Small part of https://github.com/pytorch/pytorch/pull/130689
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136319
Approved by: https://github.com/malfet, https://github.com/Skylion007
2024-09-20 04:22:18 +00:00
e3ea5429f2 Implement GetAttrVariable.as_python_constant() (#134216)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134216
Approved by: https://github.com/amjames, https://github.com/williamwen42
2024-09-20 03:44:43 +00:00
d9aca9914b Remove duplicated words in library.rst (#136340)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136340
Approved by: https://github.com/svekars
2024-09-20 03:30:54 +00:00
fe0e9fb385 Fix flaky SIGSEGV crash in test_profile_memory (#136304)
Fixes https://github.com/pytorch/pytorch/issues/132331

We need another barrier here to ensure that the main thread doesn't stop the profiler while other threads are still using it (and crash).  I can reliably reproduce the issue with `pytest -v test/profiler/test_cpp_thread.py -k test_profile_memory --flake-finder`.

### Testing

`pytest -v test/profiler/test_cpp_thread.py --flake-finder` all passes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136304
Approved by: https://github.com/briancoutinho
2024-09-20 02:56:49 +00:00
d45b0151e5 Add deterministic path for CUDA cumsum (#136224)
Change `cumsum` to call its decomposition when `use_deterministic_algorithms(True)` and input is CUDA.

Fixes #89492

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136224
Approved by: https://github.com/ezyang, https://github.com/justinchuby
2024-09-20 02:41:56 +00:00
1dfa07e885 passing FileTimerRequests.to_json() to log_debug_info_for_expired_timers for a better debugging experience (#135913)
Summary: The change involves passing the expired timers to the log_debug_info_for_expired_timers function after to_json() has been applied . This change is made to provide a better debugging experience for the user.

Test Plan: unit tests

Reviewed By: gag1jain

Differential Revision: D62408767

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135913
Approved by: https://github.com/gag1jain
2024-09-20 00:54:02 +00:00
bebf5302ba TCPStoreLibUvBackend: trace operations (#136320)
Summary:
This logs all operations when tracing log level is enabled for the `TCPStoreLibUvBackend`. This is very useful for debugging collective operations when issues occur as it logs all hosts and the keys that they're modifying. To minimize total data we only log the keys and not the values

This changes the C10D_* macros to be much more efficient -- previously we would always format the log string even if they would never be printed which is very wasteful for detailed tracing. This now gates them with an if statement to achieve the same behavior with no overhead

Test Plan:
```
TORCH_DISTRIBUTED_DEBUG=DETAIL torchrun --nnodes 1 --nproc_per_node 1 --no-python /bin/bash -c "echo foo"
```

```
I0919 09:26:52.352013 34271 TCPStore.cpp:285] [c10d - debug] The server has started on port = 29500.
I0919 09:26:52.352246 34271 socket.cpp:783] [c10d - debug] The client socket will attempt to connect to an IPv6 address of (127.0.0.1, 29500).
I0919 09:26:52.352241 36903 TCPStoreLibUvBackend.cpp:1173] [c10d - debug] Uv main loop running
I0919 09:26:52.352308 34271 socket.cpp:854] [c10d - trace] The client socket is attempting to connect to [localhost]:29500.
I0919 09:26:52.353633 34271 socket.cpp:945] [c10d] The client socket has connected to [localhost]:29500 on SocketImpl(fd=41, addr=[localhost]:45646, remote=[localhost]:29500).
I0919 09:26:52.354422 34271 TCPStore.cpp:321] [c10d - debug] TCP client connected to host 127.0.0.1:29500
I0919 09:26:52.354558 36903 TCPStoreLibUvBackend.cpp:774] [c10d - trace] validate magic:1015412686 address:[localhost]:45646
I0919 09:26:52.354638 36903 TCPStoreLibUvBackend.cpp:789] [c10d - trace] ping nonce:34271 address:[localhost]:45646
I0919 09:26:52.356122 36903 TCPStoreLibUvBackend.cpp:866] [c10d - trace] add key:init/ val:1 address:[localhost]:45646
I0919 09:26:52.356308 36903 TCPStoreLibUvBackend.cpp:930] [c10d - trace] wait key_count:1 address:[localhost]:45646
I0919 09:26:52.356410 36903 TCPStoreLibUvBackend.cpp:846] [c10d - trace] get key:init/ address:[localhost]:45646
I0919 09:26:52.358688 36903 TCPStoreLibUvBackend.cpp:808] [c10d - trace] set key:/none/torchelastic/role_info/0 address:[localhost]:45646
I0919 09:26:52.360177 36903 TCPStoreLibUvBackend.cpp:930] [c10d - trace] wait key_count:1 address:[localhost]:45646
I0919 09:26:52.360296 36903 TCPStoreLibUvBackend.cpp:1004] [c10d - trace] multi_get key_count:1 address:[localhost]:45646
I0919 09:26:52.362076 36903 TCPStoreLibUvBackend.cpp:1036] [c10d - trace] multi_set key_count:1 address:[localhost]:45646
I0919 09:26:52.364001 36903 TCPStoreLibUvBackend.cpp:930] [c10d - trace] wait key_count:1 address:[localhost]:45646
I0919 09:26:52.364091 36903 TCPStoreLibUvBackend.cpp:846] [c10d - trace] get key:/none/torchelastic/assigned_ranks/0 address:[localhost]:45646
```

Differential Revision: D62924454

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136320
Approved by: https://github.com/c-p-i-o, https://github.com/XilunWu
2024-09-20 00:53:21 +00:00
9b424aac1d [CI][CUSPARSELT] Extend cusparselt installation script to support cuda 12.6 (#136321)
To prepare for future cuda updates.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136321
Approved by: https://github.com/Skylion007, https://github.com/eqy
2024-09-19 23:45:57 +00:00
172ecf78b7 DTensor: dont hash symint tensor input in propagate_tensor_meta (#136266)
This fixes a subset of issues for dynamic shapes + DTensor.

It's pretty easy to run into other issues - it's likely that we need https://github.com/pytorch/pytorch/pull/125941 to land for DTensor + dynamic shapes to work more generally. I ended up writing a test that had dynamic shape inputs but not dynamic shape outputs in order to properly test this fix

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136266
Approved by: https://github.com/ezyang, https://github.com/yf225
2024-09-19 20:39:36 +00:00
cyy
7bbdf87517 [22/N] Fix clang-tidy warnings in jit (#134829)
Follows  #134537

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134829
Approved by: https://github.com/ezyang
2024-09-19 19:24:42 +00:00
b71802fa79 add basic_modules_ListOfLinears_inductor_gpu_force_shape_pad (#136175)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136175
Approved by: https://github.com/ezyang
2024-09-19 19:15:50 +00:00
8cba0ec958 [AOTI][Tooling][8/n] Add option to pinpoint kernel names in debug printer (#136182)
Summary:
Add a third mode where we only print kernel names without dumping any intermediate actual tensor value info.

It can be helpful in quickly identifying the troublesome kernels in CUDA IMA issues.

thanks ColinPeppler and henrylhtsang for this "feature request".

Test Plan:
The output can look like this if set the `AOT_INDUCTOR_DEBUG_INTERMEDIATE_VALUE_PRINTER=3`:

{F1871629091}

Differential Revision: D62791371

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136182
Approved by: https://github.com/henrylhtsang
2024-09-19 18:51:57 +00:00
49723a8ff3 fix stride compare failed when size value equal to one in ForeachUtils.h (#134546)
When size value equal to one, tensor strides value need be skipped to compare.
@ezyang
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134546
Approved by: https://github.com/janeyx99
2024-09-19 18:43:41 +00:00
ccca3de0cd [ROCm] Enable Flex attention tests on AMD gpus (#136245)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136245
Approved by: https://github.com/malfet
2024-09-19 18:02:41 +00:00
8d9c42735a Type _sympy/functions.py [1/n] (#136205)
Signed-off-by: Bob Ren <bobren@fb.com>

I was chatting with @jamesjwu about strategies to learn the code and he suggested adding types to some files. This stack of PRs adds types to _sympy/functions.py

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136205
Approved by: https://github.com/Skylion007, https://github.com/jamesjwu
2024-09-19 17:15:53 +00:00
803ce507f1 Log structured logging overhead to dynamo compile (kinda) (#136142)
Summary:
X-link: https://github.com/pytorch/benchmark/pull/2454

This adds structured logging overhead at a per compile basis to compilation metrics.

To do so, we track the frame_id_frame_compile_id that trace_structured uses to categorize compiles, and use that as the key in our timing table.

Implementation notes:
- If there's times we call trace_structured without a compile id, the time won't be measured. Not really a good way around that today given the compile id framework of compilation metrics. Strobelight is still the best way to measure on a per job basis.
- We don't actually measure the time it takes to log the compilation metrics itself. Fundamentally, it's not possible to log this properly if we're storing the logging number *in* compilation metrics, since there's no way to measure it before we do it(unless we want discrepancies between dynamo_compile and tlparse, which seems suboptimal). Hopefully for a large job, the cost of structured_logging compilation metrics itself is small.
- I wanted to use frame_phase_timing here, but there's a bunch of ids to iron out, and I don't really want to deal with that headache. compilation_time_metrics is sort of what I want, but that isn't by frame/compile id, so it's also a bit off. Putting it into torch.logging as a separate thing so logging tracks its own overhead seems fine, though.

Test Plan:
Run benchmarks/nanogpt and staging logger. See that the new compilation metric is logged to the staged dynamo_compile table:

https://fburl.com/scuba/logger_staging_jjwu_30582a48f1ff9cf5f4ac50a4c40af/xazjg5xq

Note that the sum(structured_logging_overhead_s) / sum(entire_frame_compile_time) = 8.387 / 124.278  = 6%, which seems reasonable as the overhead for a small compilation like this.

You can also look at samples for a more detailed log of this.

Reviewed By: oulgen

Differential Revision: D62643611

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136142
Approved by: https://github.com/bobrenjc93
2024-09-19 16:11:38 +00:00
65df26f615 [FSDP2] Fixed 2D mismatched grad placements (#136237)
```
CUDA_VISIBLE_DEVICES=2,3,6,7 pytest test/distributed/_composable/test_composability/test_2d_composability.py -k test_train_parity_2d_transformer
```

Differential Revision: [D62964658](https://our.internmc.facebook.com/intern/diff/D62964658)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136237
Approved by: https://github.com/weifengpy
2024-09-19 14:35:15 +00:00
4ea741d24f Revert "Reland D62220158 (#136213)"
This reverts commit 083c9149b75cd918b6fb2795050d7173923a3629.

Reverted https://github.com/pytorch/pytorch/pull/136213 on behalf of https://github.com/jeanschmidt due to Seems to have introduced regressions in rocm signals ([comment](https://github.com/pytorch/pytorch/pull/136213#issuecomment-2360885064))
2024-09-19 12:44:54 +00:00
bce52d0b60 [CODEMOD][caffe2] use npt.NDArray instead of np.ndarray in type annotations (#136288)
Summary:
To facilitate PSS-2 upgrade, this uses `ndt.NDArray` instead of `nd.ndarray` in type annotations. In Numpy-1.19 (PSS-1) it's an alias to `nd.ndarray` -- a noop.
In Numpy-1.24, `ndt.NDArray` a proper generic type, and without this change uses of `nd.ndarray` generate this Pyre type error:
```counterexample
 Invalid type parameters [24]: Generic type `np.ndarray` expects 2 type parameters.
```

Test Plan: Sandcastle plus visual inspection

Differential Revision: D62977370

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136288
Approved by: https://github.com/kit1980
2024-09-19 12:40:36 +00:00
908a5689eb Return unsafe_view instead of view from matmul when folding occurs (#134568)
When tensor folding occurs during matmul operation returned tensor is a view. This can cause issues when matmul is used inside a custom function and such view is then returned as output. Then it cannot be modified inplace and causes errors.
It can be especially problematic when after such function inplace allreduce is performed.
Issue is resolved when unsafe_view is returned from matmul instead. This solution aligns matmul decomposition with eager implementation in such a way that a non view tensor is returned.

Test included in this PR reproduces the issue.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134568
Approved by: https://github.com/zou3519
2024-09-19 11:52:16 +00:00
db80b98ec4 XFAIL test_segfault (#136252)
Fixes https://github.com/pytorch/pytorch/issues/128551

As this has been failing in trunk for a while and there is no owner yet to fix it properly.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136252
Approved by: https://github.com/andrewkho
2024-09-19 04:17:06 +00:00
775517693a Add type checks for Tensor.add_ (#135864)
Fixes  #127049

There's already a meta func in `meta_registrations.py` for `add_` and `sub_` methods. I added a second meta function for error checking, i.e `int.add/sub_(float)` and `bool.add/sub_(other types)` .

Also the corresponding test with Dynamo passes, removed `@xfailIfTorchDynamo`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135864
Approved by: https://github.com/williamwen42
2024-09-19 03:09:36 +00:00
e037bb326f [dynamo] fix crash in InspectSignatureVariable (#136010)
Fix crash that was happening in https://github.com/pytorch/pytorch/issues/128095, because we were trying to extract a constant incorrectly.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136010
Approved by: https://github.com/yanboliang, https://github.com/anijain2305, https://github.com/jansel
2024-09-19 00:23:00 +00:00
f2b0fc89f2 Add uint16 support for observer (#136238)
Summary:
att

Test Plan:
python test/test_quantization.py -k TestObserver

Reviewers:

Subscribers:

Tasks:

Tags:

Differential Revision: [D62909821](https://our.internmc.facebook.com/intern/diff/D62909821)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136238
Approved by: https://github.com/tarun292
2024-09-18 23:52:18 +00:00
068c80e6b6 [BE][MPS] Fix deprecation warnings on MacOS 15.0 (#136292)
[reverseSquareRootWithTensor:](https://developer.apple.com/documentation/metalperformanceshadersgraph/mpsgraph/reversesquareroot(with:name:)?changes=__8&language=objc) were deprecated in favor of [reciprocalSquareRootWithTensor:](https://developer.apple.com/documentation/metalperformanceshadersgraph/mpsgraph/reciprocalsquareroot(_:name:)?changes=__8&language=objc)

Without it, following warnings are generated if compiled on recently released MacOS Sequoia:
```
/Users/malfet/git/pytorch/pytorch/aten/src/ATen/native/mps/operations/Normalization.mm:720:35: warning: 'reverseSquareRootWithTensor:name:' is deprecated: first deprecated in macOS 15.0 [-Wdeprecated-declarations]
  720 |           rsqrtTensor = [mpsGraph reverseSquareRootWithTensor:varianceEpsTensor name:nil];
      |                                   ^~~~~~~~~~~~~~~~~~~~~~~~~~~
      |                                   reciprocalSquareRootWithTensor
/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX15.0.sdk/usr/include/c++/v1/__type_traits/invoke.h:341:10: note: in instantiation of function template specialization 'at::native::batch_norm_backward_mps(const Tensor &, const Tensor &, const std::optional<Tensor> &, const std::optional<Tensor> &, const std::optional<Tensor> &, const std::optional<Tensor> &, const std::optional<Tensor> &, bool, double, std::array<bool, 3>)::(anonymous class)::operator()<MPSGraph *, CachedGraph *>' requested here
  341 | decltype(std::declval<_Fp>()(std::declval<_Args>()...))
      |          ^
/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX15.0.sdk/usr/include/c++/v1/__type_traits/invoke.h:351:19: note: while substituting deduced template arguments into function template '__invoke' [with _Fp = (lambda at /Users/malfet/git/pytorch/pytorch/aten/src/ATen/native/mps/operations/Normalization.mm:623:68) &, _Args = <MPSGraph *, CachedGraph *>]
  351 |   static decltype(std::__invoke(std::declval<_XFp>(), std::declval<_XArgs>()...)) __try_call(int);
      |                   ^
/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX15.0.sdk/usr/include/c++/v1/__type_traits/invoke.h:357:28: note: while substituting deduced template arguments into function template '__try_call' [with _XFp = (lambda at /Users/malfet/git/pytorch/pytorch/aten/src/ATen/native/mps/operations/Normalization.mm:623:68) &, _XArgs = (no value)]
  357 |   using _Result = decltype(__try_call<_Fp, _Args...>(0));
      |                            ^
/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX15.0.sdk/usr/include/c++/v1/__type_traits/conjunction.h:27:32: note: in instantiation of template class 'std::__invokable_r<void, (lambda at /Users/malfet/git/pytorch/pytorch/aten/src/ATen/native/mps/operations/Normalization.mm:623:68) &, MPSGraph *, CachedGraph *>' requested here
   27 | __expand_to_true<__enable_if_t<_Pred::value>...> __and_helper(int);
      |                                ^
/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX15.0.sdk/usr/include/c++/v1/__type_traits/conjunction.h:38:39: note: while substituting explicitly-specified template arguments into function template '__and_helper'
   38 | using _And _LIBCPP_NODEBUG = decltype(std::__and_helper<_Pred...>(0));
      |                                       ^
/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX15.0.sdk/usr/include/c++/v1/__functional/function.h:828:20: note: (skipping 1 context in backtrace; use -ftemplate-backtrace-limit=0 to see all)
  828 |             bool = _And< _IsNotSame<__remove_cvref_t<_Fp>, function>, __invokable<_Fp, _ArgTypes...> >::value>
      |                    ^
/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX15.0.sdk/usr/include/c++/v1/__functional/function.h:841:49: note: in instantiation of default argument for '__callable<(lambda at /Users/malfet/git/pytorch/pytorch/aten/src/ATen/native/mps/operations/Normalization.mm:623:68) &>' required here
  841 |   using _EnableIfLValueCallable = __enable_if_t<__callable<_Fp&>::value>;
      |                                                 ^~~~~~~~~~~~~~~~
/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX15.0.sdk/usr/include/c++/v1/__functional/function.h:851:32: note: in instantiation of template type alias '_EnableIfLValueCallable' requested here
  851 |   template <class _Fp, class = _EnableIfLValueCallable<_Fp>>
      |                                ^
/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX15.0.sdk/usr/include/c++/v1/__functional/function.h:852:25: note: in instantiation of default argument for 'function<(lambda at /Users/malfet/git/pytorch/pytorch/aten/src/ATen/native/mps/operations/Normalization.mm:623:68)>' required here
  852 |   _LIBCPP_HIDE_FROM_ABI function(_Fp);
      |                         ^~~~~~~~~~~~~
/Users/malfet/git/pytorch/pytorch/aten/src/ATen/native/mps/operations/Normalization.mm:623:68: note: while substituting deduced template arguments into function template 'function' [with _Fp = (lambda at /Users/malfet/git/pytorch/pytorch/aten/src/ATen/native/mps/operations/Normalization.mm:623:68), $1 = (no value)]
  623 |     auto cachedGraph = LookUpOrCreateCachedGraph<CachedGraph>(key, [&](auto mpsGraph, auto newCachedGraph) {
      |                                                                    ^
/Users/malfet/git/pytorch/pytorch/aten/src/ATen/native/mps/operations/Normalization.mm:623:24: note: while substituting deduced template arguments into function template 'LookUpOrCreateCachedGraph' [with T = CachedGraph]
  623 |     auto cachedGraph = LookUpOrCreateCachedGraph<CachedGraph>(key, [&](auto mpsGraph, auto newCachedGraph) {
      |                        ^
/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX15.0.sdk/System/Library/Frameworks/MetalPerformanceShadersGraph.framework/Headers/MPSGraphArithmeticOps.h:123:1: note: 'reverseSquareRootWithTensor:name:' has been explicitly marked deprecated here
  123 | -(MPSGraphTensor *) reverseSquareRootWithTensor:(MPSGraphTensor *) tensor
      | ^
/Users/malfet/git/pytorch/pytorch/aten/src/ATen/native/mps/operations/Normalization.mm:745:37: warning: 'reverseSquareRootWithTensor:name:' is deprecated: first deprecated in macOS 15.0 [-Wdeprecated-declarations]
  745 |             rsqrtTensor = [mpsGraph reverseSquareRootWithTensor:varianceEpsTensor name:nil];
      |                                     ^~~~~~~~~~~~~~~~~~~~~~~~~~~
      |                                     reciprocalSquareRootWithTensor
/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX15.0.sdk/System/Library/Frameworks/MetalPerformanceShadersGraph.framework/Headers/MPSGraphArithmeticOps.h:123:1: note: 'reverseSquareRootWithTensor:name:' has been explicitly marked deprecated here
  123 | -(MPSGraphTensor *) reverseSquareRootWithTensor:(MPSGraphTensor *) tensor
      | ^
2 warnings generated.
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136292
Approved by: https://github.com/kit1980
2024-09-18 23:38:31 +00:00
b9a197df77 [BE][MPS] Delete duplicated code in View.mm (#136295)
After https://github.com/pytorch/pytorch/pull/135706 `getGatherScatterScalarType` returns exactly the same results as `scalarToMetalTypeString` , so delete the function and call `scalarToMetalTypeString`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136295
Approved by: https://github.com/kit1980
2024-09-18 22:44:43 +00:00
f1ad680818 [dynamo]Remove stream hardcoding in dynamo VariableBuilder (#131763)
Fixes #ISSUE_NUMBER

Recent change from PR#123487 used torch.cuda.Stream directly and this causes failure for other backends. This PR will generalize the stream handling for all backends like cuda/hpu/xpu

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131763
Approved by: https://github.com/yanboliang, https://github.com/yf225
2024-09-18 22:32:34 +00:00
bc9597b7d8 [Traceable FSDP2] Minor refactor to traceable FSDP2 unit tests (#136219)
Changes in this PR:
- Monkey-patching `F.scaled_dot_product_attention` with a lambda seems to not work in some cases. This PR avoids using a lambda.
- Running `fullgraph=True` and `fullgraph=False` in the same unit test seems to cause the two cases to interfere with each other and causes error. This PR splits them into two separate unit tests.
- The checks in the unit tests might not work with compile cache. This PR turns off the cache in order to have a more predictable compile behavior to do unit test on.

Test commands:
- `pytest -rA test/distributed/_composable/fsdp/test_fully_shard_compile.py::TestFullyShardCompile::test_nested_fully_shard_backend_inductor_fullgraph_True`
- `pytest -rA test/distributed/_composable/fsdp/test_fully_shard_compile.py::TestFullyShardCompile::test_nested_fully_shard_backend_inductor_fullgraph_False`
- `pytest -rA test/distributed/_composable/fsdp/test_fully_shard_compile.py::TestFullyShardCompile::test_transformer_backend_inductor_fullgraph_True`
- `pytest -rA test/distributed/_composable/fsdp/test_fully_shard_compile.py::TestFullyShardCompile::test_transformer_backend_inductor_fullgraph_False`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136219
Approved by: https://github.com/yifuwang
2024-09-18 22:30:23 +00:00
1a86d8aa29 Fix calling Add._from_args and Mul._from_args (#136143)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136143
Approved by: https://github.com/ezyang
2024-09-18 20:51:04 +00:00
aae68e2976 Add wait counter for nccl abort (#136067)
Summary:
Quite a few times, we see the NCCL PG abort taking too long. There's no easy way to measure this, so let's add a counter to measure this across the stack.

This will help us measure how much time we take the NCCL abort.
Test Plan:
Unit tests

Reviewed By: c-p-i-o

Differential Revision: D62675010

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136067
Approved by: https://github.com/fduwjj
2024-09-18 20:14:10 +00:00
eqy
68a7246f13 [cuDNN][conv][A100] Bump tolerances for vmap_autograd_grad conv2d on A100 (#136178)
Likely due to  a cuDNN heuristics update

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136178
Approved by: https://github.com/Skylion007
2024-09-18 19:42:13 +00:00
5a6ddbcc3b Extending the Pytorch vec backend for SVE (ARM) (#119571)
**Motivation:**
In Pytorch, Aten vectorization supports multiple platforms, including x86 and Arm, as well as multiple data types. It provides a generic implementation of Vector (Vec) type that allows the programmer to write code packing various primitives (such as floats) within 256bit & 512bits registers. It can be extended to support other ISAs easily by adding more VecISA sub-classes.

**Reference Link:** https://github.com/pytorch/pytorch/tree/main/aten/src/ATen/cpu/vec

**This PR:**

* Our goal with this contribution is to add support for SVE backend for Vec in the Aten vectorization for CPU backend which can be benefitted by any ARM architecture supported CPU's that supports SVE.

* More about SVE ISA for ARM: [https://developer.arm.com/Architectures/Scalable Vector Extensions](https://developer.arm.com/Architectures/Scalable%20Vector%20Extensions)

* We are using the ARM C Language Extensions for SVE (https://developer.arm.com/documentation/102699/0100/Optimizing-with-intrinsics ) to accelerate performance for various operators in the SVE backend for Vec.

* Currently we are adding support only for SVE ISA with the vector length of 256 bits (SVE 256). In future, we plan to extend this SVE support for other vector lengths as well.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119571
Approved by: https://github.com/malfet, https://github.com/snadampal

Co-authored-by: Divya Kotadiya <divya.kotadiya@fujitsu.com>
2024-09-18 18:59:10 +00:00
bad69044d8 [ROCm] upgrade ROCm CI builds to py3.10 (#134108)
Upgrade ROCm CI builds to py3.10

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134108
Approved by: https://github.com/jeffdaily, https://github.com/jithunnair-amd, https://github.com/atalman
2024-09-18 17:39:34 +00:00
3efaa016b1 [c10d] Make test compatible for new pytest (#136158)
Temporary fix to the issue in https://github.com/pytorch/pytorch/issues/127517.

Short-term fix following CPython: 51aefc5bf9/Lib/unittest/case.py (L419-L426)

Differential Revision: [D62878083](https://our.internmc.facebook.com/intern/diff/D62878083)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136158
Approved by: https://github.com/fegin
2024-09-18 17:10:55 +00:00
605f2d802a [PyTorch] Remove unnecessary include of c10/util/Exception.h in irange.h (#136202)
Manually audited and can't figure out why this would be needed.

Differential Revision: [D62879500](https://our.internmc.facebook.com/intern/diff/D62879500/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136202
Approved by: https://github.com/malfet
2024-09-18 16:57:15 +00:00
6a6f5b20c5 Add _addmm_activation to lower precision cast policy on AutocastCPU (#135936)
Fixes #132613.
Add `_addmm_activation` to lower precision cast policy on AutocastCPU.
`_addmm_activation`  https://github.com/pytorch/pytorch/blob/main/aten/src/ATen/native/transformers/transformer.cpp#L39 of `transformer_encoder_layer_forward` may throw `RuntimeError: mat1 and mat2 must have the same dtype, but got BFloat16 and Float` when autocast is enabled, as `_native_multi_head_attention` is put in lower data type cast policy https://github.com/pytorch/pytorch/pull/107674 and `_addmm_activation` may encounter mixed data types.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135936
Approved by: https://github.com/jgong5, https://github.com/ezyang
2024-09-18 16:31:27 +00:00
c8d152cb0e Fix fast_expand recursion error (#136163)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136163
Approved by: https://github.com/ezyang
2024-09-18 13:58:45 +00:00
701ba5203f [Inductor] Increase multiplier to 3 for Inductor AMP FP16 benchmark correctness check (#135932)
Fix https://github.com/pytorch/pytorch/issues/135657.
Aligned with AMP BF16, using multiplier 3 for Inductor AMP FP16 benchmark correctness check

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135932
Approved by: https://github.com/CaoE, https://github.com/jgong5, https://github.com/jansel
2024-09-18 13:03:45 +00:00
b5be4d8c05 Fix ROCm skip decorator for test_ddp_tp and multiprocess UTs (#136161)
skip_if_rocm is used only in multiprocess case (when UT test class is a child of MultiProcessTestCase). Each individual process can exit with a skip code. If used for single process UT, it will cause the UT to fail as the process returns a non-zero exit code. Use skipIfRocm in single process UTs.

To avoid the above confusion, this PR renamed skip_if_rocm to skip_if_rocm_multiprocess.

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136161
Approved by: https://github.com/jithunnair-amd, https://github.com/kwen2501, https://github.com/fegin
2024-09-18 11:01:23 +00:00
083c9149b7 Reland D62220158 (#136213)
Summary: We fix the unit test test_pad_mm and reland the diff

Test Plan: See in D62220158

Differential Revision: D62891584

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136213
Approved by: https://github.com/dshi7
2024-09-18 07:33:41 +00:00
a0207c8471 [dynamo] Fix support for classmethod(property(...)) (#134968)
Fixes #134451

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134968
Approved by: https://github.com/yanboliang
2024-09-18 04:47:51 +00:00
9aa22eabe7 [CI] Make linux-aarch64 shards actually running different tests (#136208)
Non-functional sharding was introduced in https://github.com/pytorch/pytorch/pull/125255 but each shard in that case were running the same tests...
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136208
Approved by: https://github.com/seemethere, https://github.com/ZainRizvi, https://github.com/atalman
2024-09-18 03:10:21 +00:00
8895f69d12 [torch/numpy][numpy2.0 compat] Additional changes for tests to run under numpy-2.0 (#136152)
Continuation of https://github.com/pytorch/pytorch/pull/131909. This PR makes numpy tests compatible with numpy>=2.0.0. Specifically it deals with APIs that have been removed from numpy-2.0.

Changes in this PR:
1. Use `numpy.exceptions.ComplexWarning` if `numpy.exceptions` namespace is present. In numpy-2.0 `numpy.ComplexWarning` has been removed in favor of using `numpy.exceptions.ComplexWarning` (see [numpy-2.0 migration guide](https://numpy.org/devdocs/numpy_2_0_migration_guide.html#changes-to-namespaces)). Note that `numpy.exceptions` was introduced in numpy-1.25.0 hence does not exist in numpy<=1.24.x.
2. Do the same for `numpy.exceptions.VisibleDeprecationWarning`
3. Use `np.sort(...,axis=0)` over `np.msort()`(`np.msort()` removed in numpy-2.0)
4. Use `np.pad()` over `np.lib.pad()` (`np.lib` removed in numpy-2.0)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136152
Approved by: https://github.com/atalman
2024-09-18 02:11:22 +00:00
6682327c75 [BE] Make NestedTensorTransformerFunctions.cu compilable without warnings (#136222)
Before the change compilation produced following warnings:
```
/home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/nested/cuda/NestedTensorTransformerFunctions.cu: In function ‘std::tuple<dim3, dim3, at::native::StackArray<long int> > at::native::check_shape_and_partition_(const at::Tensor&, const std::vector<at::Tensor>&, const at::Tensor&)’:
/home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/nested/cuda/NestedTensorTransformerFunctions.cu:584:22: warning: comparison of integer expressions of different signedness: ‘const int’ and ‘const size_t’ {aka ‘const long unsigned int’} [-Wsign-compare]
  584 |   TORCH_CHECK(num_jagged_dim <= kStackArrayMaxDims);
      |       ~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~
/home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/nested/cuda/NestedTensorTransformerFunctions.cu: In lambda function:
/home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/nested/cuda/NestedTensorTransformerFunctions.cu:1224:1061: warning: comparison of integer expressions of different signedness: ‘long unsigned int’ and ‘int’ [-Wsign-compare]
 1224 |   AT_DISPATCH_INDEX_TYPES(
      |                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     ^
/home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/nested/cuda/NestedTensorTransformerFunctions.cu: In lambda function:
/home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/nested/cuda/NestedTensorTransformerFunctions.cu:1224:1985: warning: comparison of integer expressions of different signedness: ‘long unsigned int’ and ‘int’ [-Wsign-compare]
 1224 |   AT_DISPATCH_INDEX_TYPES(
      |                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 ^
/home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/nested/cuda/NestedTensorTransformerFunctions.cu: In instantiation of ‘void at::native::jagged_dense_elementwise_jagged_output_opt_(const at::Tensor&, const std::vector<at::Tensor>&, const at::Tensor&, const at::Tensor&, F) [with scalar_t = c10::Half; F = __nv_dl_wrapper_t<__nv_dl_trailing_return_tag<at::Tensor (*)(const at::Tensor&, c10::ArrayRef<at::Tensor>, std::optional<c10::SymInt>), at::native::_fbgemm_dense_to_jagged_forward_symint, c10::Half, 1> >]’:
/home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/nested/cuda/NestedTensorTransformerFunctions.cu:1515:1:   required from here
/home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/nested/cuda/NestedTensorTransformerFunctions.cu:1336:2006: warning: comparison of integer expressions of different signedness: ‘size_t’ {aka ‘long unsigned int’} and ‘int’ [-Wsign-compare]
 1336 |     AT_DISPATCH_INDEX_TYPES(
      |                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      ^
/home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/nested/cuda/NestedTensorTransformerFunctions.cu:1336:2113: warning: comparison of integer expressions of different signedness: ‘size_t’ {aka ‘long unsigned int’} and ‘int’ [-Wsign-compare]
 1336 |     AT_DISPATCH_INDEX_TYPES(
      |                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 ^
```
after it compiled without a warning

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136222
Approved by: https://github.com/PaliC, https://github.com/kit1980
2024-09-18 01:24:05 +00:00
b18ba9419e [AO][Inductor] Enable WOQ fusion pattern with permute (#135928)
**Summary**
Fix https://github.com/pytorch/pytorch/issues/135831 and https://github.com/pytorch/ao/issues/890. The root cause of the numerical failure was that the customized woq-int8 kernel was not triggered due to changes in the pattern. After re-adding the fusion pattern, the accuracy check now passes. I will open a separate TorchAO PR to enable these unit tests in TorchAO.

**Test Plan**
```
python test/inductor/test_mkldnn_pattern_matcher.py -k test_woq_int8
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135928
Approved by: https://github.com/jgong5, https://github.com/eellison
2024-09-18 00:56:16 +00:00
cccf500193 [c10d] remove sleep from watchdogHandler (#135760)
Summary:
Remove sleep from the `watchdogHandler` function. This sleep unnecessary slows things down during a NCCL timeout.
Flight recorder is configured to take a minute, at most, to dump out it's buffer.
This sleep ends up waiting for `8` minutes before destroy is called.

Test Plan: Unit tests.

Differential Revision: D62529875

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135760
Approved by: https://github.com/fduwjj, https://github.com/shuqiangzhang
2024-09-18 00:55:01 +00:00
f6f1504d39 [MPS] Fix 5D+ reductions over negative dimentions (#136198)
This fixes bug introduced by https://github.com/pytorch/pytorch/pull/99856 that attempts to speed-up reduction for 5D+ tensor if trailing dimensions are all ones, but introduces crashes/off-by-one errors for wrapped dimensions

Added regresion test case to `TestMPS.test_sum`

Fixes https://github.com/pytorch/pytorch/issues/136132

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136198
Approved by: https://github.com/albanD
2024-09-17 21:53:31 +00:00
a575ce0dc6 [PyTorch Pinned Allocator] Add support of background thread to process events (#135524)
Summary: Currently we process events in the regular allocation path and we call cudaEventQuery to check on the events and this path can take some locks in libcuda driver. Its not entirely needed to do process events in the allocation path, we could move this to a background thread and keep processing events regularly and put the freed block to the free list.

Differential Revision: D62396585

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135524
Approved by: https://github.com/zyan0
2024-09-17 21:08:10 +00:00
48d18fbd4c [PyTorch CUDA Allocator] Allow reuse of non-split blocks with better rounding (#136174)
Summary:
This diff adds an option to round the non-split blocks in caching allocator so that they can be reused without causing lots of fragmentation for large memory segments.

For example, if we specify max_split memory size as 400MB, then all allocations more than 400MB will not be split. Lets say, we allocated some 1024MB blocks and these are cached in the allocator blocks. If we request a new 500MB block, we round it to nearest power-2-division, thats 512MB, we add default kLargeBuffer of 20MB, that will be 532MB and since 532MB is less than existing 1024MB block, the 1024MB will not be used for this allocation, instead a new 512MB block will be created. In this diff, we provide an option to cofigure the kLargeBuffer for rounding and expose as a configurable option, so 512MB + max_non_split_rounding_size and if thats greater than 1024MB, we will use te 1024MB and we wont create a new 512MB block using cudaMalloc. This option is added so that we can pre-allocate some large blocks so that we can reuse them as much as possible and we dont stall on calling cudaMalloc.

Differential Revision: D62758758

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136174
Approved by: https://github.com/zyan0
2024-09-17 19:08:44 +00:00
eqy
e3aa5e2f64 [NCCL] Don't override waitUntilInitialized's setting of comm->initialized_ (#136155)
#133630 sets `initialized_` to `true` which causes previous wait codepaths to skip necessary waits, see also #https://github.com/pytorch/pytorch/issues/136151

CC @shuqiangzhang @wconstab

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136155
Approved by: https://github.com/fduwjj, https://github.com/kwen2501, https://github.com/c-p-i-o, https://github.com/shuqiangzhang
2024-09-17 18:50:12 +00:00
a4e9a1c90b [TorchRec][PT2 IR][APF] short circuit the flatten/unflatten between EBC and KTRegroupAsDict modules (#136045)
Summary:
# context
* for the root cause and background please refer to this [post](https://fb.workplace.com/groups/1028545332188949/permalink/1042204770823005/)
* basica idea of this diff is to **short circuit the pytree flatten-unflatten function pairs** between two preserved modules, i.e., EBC/fpEBC and KTRegroupAsDict.
NOTE: There could be multiple EBCs and one single KTRegroupAsDict as shown in the [pic](https://fburl.com/gslide/lcyt8eh3) {F1864810545}
* short-circuiting the EBC-KTRegroupAsDict pairs are very special and a must in most of the cases due to the EBC key-order issue with distributed table lookup.
* hide all the operations behind a control flag `short_circuit_pytree_ebc_regroup` to the torchrec main api call `decapsulate_ir_modules`, which should only be visible to the infra layer, not to the users.

# details
* The `_short_circuit_pytree_ebc_regroup` function finds all the EBCs/fpEBC and KTRegroupAsDict modules in an unflattened module.  Retrieve their fqns and sort to in_fqns (regroup_fqns) and out_fqns (ebc_fqns). Because currently the fpEBC is swapped as a whole, so we do some extra fqn logic to filter out the EBC that belongs to an up-level fpEBC.
* a util function `prune_pytree_flatten_unflatten` removes the in-coming and out-going pytree flatten/unflatten function calls in the graph module, based on the given fqns.

WARNING: The flag `short_circuit_pytree_ebc_regroup` should be turned on if EBCs are used and EBC sharding is needed. Assertions are also added if can't find a `KTRegroupAsDict` module, or `finalize_interpreter_modules` is not `True`.

# additional changes
* absorb the `finalize_interpreter_modules` process inside the torchrec main api `decapsulate_ir_modules`.
* set `graph.owning_module` in export.unflatten as required by the graph modification
* add one more layer of `sparse_module` for closely mimicing the APF model structure.

Test Plan:
# run test
* serializer
```
buck2 run fbcode//mode/opt fbcode//torchrec/ir/tests:test_serializer
```
* apf
```
buck2 run fbcode//mode/opt fbcode//aps_models/ads/gmp/tests/ne/e2e_deterministic_tests:gmp_e2e_ne_tests -- --filter-text 'test_mtml_instagram_model_562438350_single_gpu_with_ir'
```
* local mp run
```
==== Finished E2E deterministic test for mtml_instagram_model_gmp_474023725_non_kjt_unary ====
finished
  test_mtml_instagram_model_562438350_single_gpu_with_ir
Imports took: 6.0s! Profile with --import-profiler.            --_ |""---__
Executed 1 example in 203.1s:                               |'.|  ||  .    """|
  Successful: 1                                             | ||  || /|\""-.  |
  Failed: 0                                                 | ||  ||  |    |  |
  Skipped: 0                                                | ||  ||  |   \|/ |
  Not executed: 8                                           |."|  ||  --"" '__|
https://testslide.readthedocs.io/                              --" |__---"""
```

Differential Revision: D62606738

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136045
Approved by: https://github.com/angelayi
2024-09-17 18:42:56 +00:00
ea10c072f3 [export] Deserialize args with python keyword names (#136036)
Currently when we deserialize inputs to nodes, we deserialize arguments with default values as kwargs. So deserializing `aten.uniform`, which has the signature `uniform(Tensor(a!) self, float from=0, float to=1, *, Generator? generator=None) -> Tensor(a!)`, will get become `uniform(x, from=0, to=1)`. However, this fails when running in python because `from` is a python keyword. So the solution here is to not deserialize it as a kwarg.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136036
Approved by: https://github.com/zhxchen17
2024-09-17 18:13:14 +00:00
a8382847f4 Support rms_norm() for NJT (#135872)
`rms_norm()` is a nice-to-have for ViT :)

This PR:
* SymInt-ifies `rms_norm()`, allowing NJT to use the same decomp.
* Adds torch_function-based input validation logic for nested-specific stuff (no normalization supported over the ragged dim for now) on the python NJT side.
* Adds multi-dim support (on non-ragged, non-batch dims) to `mean()` for NJT.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135872
Approved by: https://github.com/mikaylagawarecki
ghstack dependencies: #125947
2024-09-17 18:09:20 +00:00
785e98783b Delete links to non-existing run_plan_mpi.cc (#136204)
That were deleted by https://github.com/pytorch/pytorch/pull/125092

Fixes https://github.com/pytorch/pytorch/issues/136199

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136204
Approved by: https://github.com/albanD, https://github.com/seemethere
2024-09-17 17:51:56 +00:00
cc365fdd7b [MTIA] Support torch.cuda.get_device_capability equivalent API on MTIA (#135889)
Summary:
Mirror `get_device_capability` on MTIA per https://fburl.com/gdoc/p4lo5avn

At the moment, both the major and minor version are just 0

Test Plan:
Unit test: `buck2 test //mtia/host_runtime/torch_mtia/tests:test_torch_mtia_api`

https://www.internalfb.com/intern/testinfra/testconsole/testrun/1688850109958190/

Differential Revision: D62595296

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135889
Approved by: https://github.com/egienvalue
2024-09-17 17:42:56 +00:00
8e5bb356e0 [PT2] Port merge_concats_pass to PT2 pre_grad passes (#135527)
Summary: as title

Test Plan: new UT

Differential Revision: D62398390

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135527
Approved by: https://github.com/frank-wei
2024-09-17 17:26:53 +00:00
63dc5dff10 [Fix]: Update CPUINFO submodule to fix support for NON-SVE ARM Hardware (#135857)
Regression PR : https://github.com/pytorch/cpuinfo/pull/255

Change-Id: I56cec061072be11ec33ccb661114360b979fc7aa

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135857
Approved by: https://github.com/digantdesai, https://github.com/malfet
2024-09-17 16:50:17 +00:00
67b14ce8bd [ONNX] Fix numpy method to return the correct type (#136162)
Previous implementation of the `numpy()` method returns `fp64` when the tensor is `fp32`. This is unexpected but seems to be caused by calling `__array__(dtype=None)` on the numpy array. I updated the implementation to implement the `numpy()` method explicitly and added tests to guard the behavior.

This needs to be cherry-picked into torch 2.5
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136162
Approved by: https://github.com/gramalingam, https://github.com/xadupre
2024-09-17 15:51:00 +00:00
ece8267d2c Add back optim type hints that were lost when *.pyi files were removed (#136185)
When stub files (`*.pyi`) were removed from `optim` (#125556, #125452), some types that existed are no longer available. This pull request adds them back.

Just for reference, these types are used in `pytorch-lightning`'s `LightningCLI`. Command line interfaces are created automatically, and having type hints make them nicer.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136185
Approved by: https://github.com/janeyx99
2024-09-17 15:45:15 +00:00
913f97e878 Don't run reshape pattern match on dynamic shape size tensor (#136100)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136100
Approved by: https://github.com/mengluy0125
2024-09-17 15:08:55 +00:00
462b727d1e Revert "Add decomposition for permute_copy (#130944)"
This reverts commit ab9a7eadd34aee59fc67e29237610b7562cc4ff0.

Reverted https://github.com/pytorch/pytorch/pull/130944 on behalf of https://github.com/jeanschmidt due to Broke internal signal executorch.backends.xnnpack.test.ops.permute.TestPermute, more details on D62737086. @eellison could you please help get this PR merged to main? ([comment](https://github.com/pytorch/pytorch/pull/130944#issuecomment-2355846394))
2024-09-17 13:42:55 +00:00
2c4ae81494 Revert "Add decomposition for squeeze_copy (#130941)"
This reverts commit c33b0580e6a702be0cd5be691b3b465da012aa34.

Reverted https://github.com/pytorch/pytorch/pull/130941 on behalf of https://github.com/jeanschmidt due to Need to revert in order to be able to revert https://github.com/pytorch/pytorch/pull/130944, after fixing any merge conflicts, feel free to merge it back ([comment](https://github.com/pytorch/pytorch/pull/130941#issuecomment-2355831480))
2024-09-17 13:39:07 +00:00
3b5e2689a1 Revert "Optimize dict reconstruct to not codegen untouched values (#134876)"
This reverts commit a1a57a424dc992f4dc2d44bdc1e4e7e500881a9c.

Reverted https://github.com/pytorch/pytorch/pull/134876 on behalf of https://github.com/jeanschmidt due to new introduced test test_reconstruct.py::ReconstructTest::test_functional_call_reconstruct is breaking internally. @zou3519 may you help get those changes merged back to main? ([comment](https://github.com/pytorch/pytorch/pull/134876#issuecomment-2355697685))
2024-09-17 13:00:01 +00:00
e248c1d7eb Update real device in FSDP state_dict_utils (#134994)
## Motivation
The default device for tensor.device both for sharded as well as non sharded is set to cuda by default. Hence while checking the FSDP UTs we see the following errors. This change updates the actual device type based on the created tensor.

```
[rank3]   File "/root/repos/pytorch-training-tests/tests/pytorch/v2.4.0/distributed_hpu/fsdp/test_fsdp_dtensor_state_dict.py", line 143, in test_dtensor_sharded_tensor_state_dict_identical
[rank3]     sharded_tensor_sd = ref_model.state_dict()
[rank3]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1944, in state_dict
[rank3]     hook_result = hook(self, destination, prefix, local_metadata)
[rank3]   File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
[rank3]     return func(*args, **kwargs)
[rank3]   File "/usr/local/lib/python3.10/dist-packages/torch/distributed/fsdp/_state_dict_utils.py", line 752, in _post_state_dict_hook
[rank3]     tensor.device,
[rank3]   File "/usr/local/lib/python3.10/dist-packages/typing_extensions.py", line 2853, in wrapper
[rank3]     return arg(*args, **kwargs)
[rank3]   File "/usr/local/lib/python3.10/dist-packages/torch/distributed/_shard/sharded_tensor/api.py", line 1152, in __torch_function__
[rank3]     return dispatch(st_instance, func)
[rank3]   File "/usr/local/lib/python3.10/dist-packages/torch/distributed/_shard/sharded_tensor/api.py", line 1134, in dispatch
[rank3]     return _SHARDED_OPS[func](types, args, kwargs, st._process_group)
[rank3]   File "/usr/local/lib/python3.10/dist-packages/torch/distributed/_shard/op_registry_utils.py", line 33, in wrapper
[rank3]     return wrapped_func(types, args, kwargs, process_group)
[rank3]   File "/usr/local/lib/python3.10/dist-packages/torch/distributed/_shard/sharded_tensor/_ops/tensor_ops.py", line 52, in tensor_device
[rank3]     dev = torch.device(torch.cuda.current_device())
[rank3]   File "/usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py", line 878, in current_device
[rank3]     _lazy_init()
[rank3]   File "/usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py", line 305, in _lazy_init
[rank3]     raise AssertionError("Torch not compiled with CUDA enabled")
[rank3] AssertionError: Torch not compiled with CUDA enabled
````

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134994
Approved by: https://github.com/fegin
2024-09-17 04:39:08 +00:00
408fe41a45 [DSD][EZ] Minor update in _state_dict_utils.py (#136165)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136165
Approved by: https://github.com/kwen2501
ghstack dependencies: #135725, #135763
2024-09-17 04:32:43 +00:00
dc82d274e6 make view.dtype always return an alias (#136074)
Fixes https://github.com/pytorch/pytorch/issues/136064

In the linked repro, this issue was that there was some code like this:
```
# x has dtype torch.float32
def f(x):
    y = x.view(torch.float32)
    y.copy_(...)
```

Where because `view.dtype` is implemented today to potentially directly return its input, we would end up directly clobbering the proxy for our graph input (replacing its FX proxy value from `arg0_1` to `view_1`). This is not desirable, because we have careful assertions in AOTDispatcher that mutations only ever happen on graph inputs - but this clobbering caused the mutation to appear, from the perspective of the FX graph, like it was happening on a view of the input.

Why is this normally not a problem? Ordinarily, the `ADInplaceOrView` kernel for `view.dtype` will take the output of the view kernel, [and detach() it](https://github.com/pytorch/pytorch/blob/main/tools/autograd/gen_inplace_or_view_type.py#L466) (properly creating a fresh `TensorImpl`).

This does **not** happen, though, if you are executing the kernel from with a `__torch_dispatch__` region: the `ADInplaceOrView` logic has already run above you, so that key will be in the TLS exclude set.

This PR changes eager behavior - at first I considered trying to only change behavior under compile. But this problem isn't technically specific to PT2: if you ever rely on tensor identity from inside of a __torch_dispatch__ call, then we need to make sure the raw `view.dtype` kernel doesn't directly return the input.

I am also making the assumption that "`view.dtype` no-op'ing when the dtype is the same" is not a case worth optimizing in eager mode, and that the overhead of the `TensorImpl` creation is relatively negligible.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136074
Approved by: https://github.com/Skylion007, https://github.com/ezyang, https://github.com/albanD
ghstack dependencies: #136041
2024-09-17 03:40:54 +00:00
d463a81c27 inductor: dont use default_dtype during rng functionalization (#136041)
Fixes https://github.com/pytorch/pytorch/issues/119162

See context at https://github.com/pytorch/pytorch/issues/119162#issuecomment-2349849469

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136041
Approved by: https://github.com/eellison
2024-09-17 03:40:54 +00:00
3f74310784 Back out "Flip triton kernel default layout constraint to "needs_fixed_stride_order" (#135581)" (#136160)
Test Plan: make train-hstu-cint-publish-bf16-tgif-local

Differential Revision: D62766335

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136160
Approved by: https://github.com/muchulee8
2024-09-17 01:06:10 +00:00
37a08b33bb Revert "fix compiled_autograd deadlock throw (#135795)"
This reverts commit 00dc7d435652ad66e9d2feb2660928b632281a98.

Reverted https://github.com/pytorch/pytorch/pull/135795 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/135795#issuecomment-2354233619))
2024-09-16 23:59:56 +00:00
071da87cd7 use csv extention for test report in order for it to be uploaded to s3 (#136128)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136128
Approved by: https://github.com/clee2000
2024-09-16 21:47:46 +00:00
c12536b3c0 [ONNX] Treat CompositeImplicitAutograd ops as normal ops in decomp (#136153)
Since https://github.com/pytorch/pytorch/pull/135080, the CompositeImplicitAutograd (CIA) ops are only decomposed when a decomp function is provided in a table. There is no longer a need to distinguish CIA ops like Upsample and preserve them explicitly. On the ONNX Script torchlib side I will unregister some ops from the following list to make sure some CIA ops are still decomposed.

```
<OpOverload(op='aten.__and__', overload='Scalar')>,
 <OpOverload(op='aten.__and__', overload='Tensor')>,
 <OpOverload(op='aten.__or__', overload='Scalar')>,
 <OpOverload(op='aten.__or__', overload='Tensor')>,
 <OpOverload(op='aten.__xor__', overload='Scalar')>,
 <OpOverload(op='aten.__xor__', overload='Tensor')>,
 <OpOverload(op='aten._add_batch_dim', overload='default')>,
 <OpOverload(op='aten._assert_tensor_metadata', overload='default')>,
 <OpOverload(op='aten._backward', overload='default')>,
 <OpOverload(op='aten._batch_norm_impl_index_backward', overload='default')>,
 <OpOverload(op='aten._cast_Byte', overload='default')>,
 <OpOverload(op='aten._cast_Char', overload='default')>,
 <OpOverload(op='aten._cast_Double', overload='default')>,
 <OpOverload(op='aten._cast_Float', overload='default')>,
 <OpOverload(op='aten._cast_Half', overload='default')>,
 <OpOverload(op='aten._cast_Int', overload='default')>,
 <OpOverload(op='aten._cast_Long', overload='default')>,
 <OpOverload(op='aten._cast_Short', overload='default')>,
 <OpOverload(op='aten._choose_qparams_per_tensor', overload='default')>,
 <OpOverload(op='aten._convolution', overload='deprecated')>,
 <OpOverload(op='aten._convolution_double_backward', overload='default')>,
 <OpOverload(op='aten._convolution_mode', overload='default')>,
 <OpOverload(op='aten._cufft_clear_plan_cache', overload='default')>,
 <OpOverload(op='aten._cufft_get_plan_cache_max_size', overload='default')>,
 <OpOverload(op='aten._cufft_get_plan_cache_size', overload='default')>,
 <OpOverload(op='aten._cufft_set_plan_cache_max_size', overload='default')>,
 <OpOverload(op='aten._debug_has_internal_overlap', overload='default')>,
 <OpOverload(op='aten._dim_arange', overload='default')>,
 <OpOverload(op='aten._embedding_bag_sparse_backward', overload='default')>,
 <OpOverload(op='aten._gather_sparse_backward', overload='default')>,
 <OpOverload(op='aten._grid_sampler_2d_cpu_fallback_backward', overload='default')>,
 <OpOverload(op='aten._has_compatible_shallow_copy_type', overload='default')>,
 <OpOverload(op='aten._is_zerotensor', overload='default')>,
 <OpOverload(op='aten._lu_with_info', overload='default')>,
 <OpOverload(op='aten._nnpack_available', overload='default')>,
 <OpOverload(op='aten._pack_padded_sequence_backward', overload='default')>,
 <OpOverload(op='aten._pad_circular', overload='default')>,
 <OpOverload(op='aten._pad_enum', overload='default')>,
 <OpOverload(op='aten._pad_packed_sequence', overload='default')>,
 <OpOverload(op='aten._propagate_xla_data', overload='default')>,
 <OpOverload(op='aten._remove_batch_dim', overload='default')>,
 <OpOverload(op='aten._reshape_from_tensor', overload='default')>,
 <OpOverload(op='aten._rowwise_prune', overload='default')>,
 <OpOverload(op='aten._saturate_weight_to_fp16', overload='default')>,
 <OpOverload(op='aten._scaled_dot_product_attention_math', overload='default')>,
 <OpOverload(op='aten._shape_as_tensor', overload='default')>,
 <OpOverload(op='aten._sobol_engine_draw', overload='default')>,
 <OpOverload(op='aten._sparse_bsc_tensor_unsafe', overload='default')>,
 <OpOverload(op='aten._sparse_bsr_tensor_unsafe', overload='default')>,
 <OpOverload(op='aten._sparse_compressed_tensor_unsafe', overload='default')>,
 <OpOverload(op='aten._sparse_coo_tensor_unsafe', overload='default')>,
 <OpOverload(op='aten._sparse_csc_tensor_unsafe', overload='default')>,
 <OpOverload(op='aten._sparse_csr_tensor_unsafe', overload='default')>,
 <OpOverload(op='aten._sparse_log_softmax', overload='Dimname')>,
 <OpOverload(op='aten._sparse_log_softmax', overload='int')>,
 <OpOverload(op='aten._sparse_mm', overload='default')>,
 <OpOverload(op='aten._sparse_mm', overload='reduce')>,
 <OpOverload(op='aten._sparse_softmax', overload='Dimname')>,
 <OpOverload(op='aten._sparse_softmax', overload='int')>,
 <OpOverload(op='aten._sparse_sum', overload='default')>,
 <OpOverload(op='aten._sparse_sum', overload='dim_dtype')>,
 <OpOverload(op='aten._sparse_sum', overload='dtype')>,
 <OpOverload(op='aten._test_ambiguous_defaults', overload='a')>,
 <OpOverload(op='aten._test_ambiguous_defaults', overload='b')>,
 <OpOverload(op='aten._test_autograd_multiple_dispatch', overload='ntonly')>,
 <OpOverload(op='aten._test_check_tensor', overload='default')>,
 <OpOverload(op='aten._test_serialization_subcmul', overload='default')>,
 <OpOverload(op='aten._test_string_default', overload='default')>,
 <OpOverload(op='aten._thnn_differentiable_gru_cell_backward', overload='default')>,
 <OpOverload(op='aten._thnn_differentiable_lstm_cell_backward', overload='default')>,
 <OpOverload(op='aten._thnn_fused_lstm_cell_backward', overload='default')>,
 <OpOverload(op='aten._to_cpu', overload='default')>,
 <OpOverload(op='aten._upsample_bicubic2d_aa', overload='vec')>,
 <OpOverload(op='aten._upsample_bilinear2d_aa', overload='vec')>,
 <OpOverload(op='aten._upsample_nearest_exact1d', overload='default')>,
 <OpOverload(op='aten._upsample_nearest_exact1d', overload='vec')>,
 <OpOverload(op='aten._upsample_nearest_exact2d', overload='default')>,
 <OpOverload(op='aten._upsample_nearest_exact2d', overload='vec')>,
 <OpOverload(op='aten._upsample_nearest_exact3d', overload='default')>,
 <OpOverload(op='aten._upsample_nearest_exact3d', overload='vec')>,
 <OpOverload(op='aten._use_cudnn_rnn_flatten_weight', overload='default')>,
 <OpOverload(op='aten._validate_sparse_bsc_tensor_args', overload='default')>,
 <OpOverload(op='aten._validate_sparse_bsr_tensor_args', overload='default')>,
 <OpOverload(op='aten._validate_sparse_compressed_tensor_args', overload='default')>,
 <OpOverload(op='aten._validate_sparse_coo_tensor_args', overload='default')>,
 <OpOverload(op='aten._validate_sparse_csc_tensor_args', overload='default')>,
 <OpOverload(op='aten._validate_sparse_csr_tensor_args', overload='default')>,
 <OpOverload(op='aten._version', overload='default')>,
 <OpOverload(op='aten._weight_norm', overload='default')>,
 <OpOverload(op='aten._weight_norm_differentiable_backward', overload='default')>,
 <OpOverload(op='aten.absolute', overload='default')>,
 <OpOverload(op='aten.adaptive_avg_pool1d', overload='default')>,
 <OpOverload(op='aten.adaptive_avg_pool2d', overload='default')>,
 <OpOverload(op='aten.adaptive_avg_pool3d', overload='default')>,
 <OpOverload(op='aten.adaptive_max_pool1d', overload='default')>,
 <OpOverload(op='aten.affine_grid_generator_backward', overload='default')>,
 <OpOverload(op='aten.align_as', overload='default')>,
 <OpOverload(op='aten.align_tensors', overload='default')>,
 <OpOverload(op='aten.all', overload='dimname')>,
 <OpOverload(op='aten.any', overload='dimname')>,
 <OpOverload(op='aten.arccos', overload='default')>,
 <OpOverload(op='aten.arccosh', overload='default')>,
 <OpOverload(op='aten.arcsin', overload='default')>,
 <OpOverload(op='aten.arcsinh', overload='default')>,
 <OpOverload(op='aten.arctan', overload='default')>,
 <OpOverload(op='aten.arctan2', overload='default')>,
 <OpOverload(op='aten.arctanh', overload='default')>,
 <OpOverload(op='aten.argsort', overload='default')>,
 <OpOverload(op='aten.argsort', overload='dimname')>,
 <OpOverload(op='aten.argsort', overload='stable')>,
 <OpOverload(op='aten.argwhere', overload='default')>,
 <OpOverload(op='aten.atleast_1d', overload='Sequence')>,
 <OpOverload(op='aten.atleast_2d', overload='Sequence')>,
 <OpOverload(op='aten.atleast_3d', overload='Sequence')>,
 <OpOverload(op='aten.avg_pool1d', overload='default')>,
 <OpOverload(op='aten.bilinear', overload='default')>,
 <OpOverload(op='aten.broadcast_tensors', overload='default')>,
 <OpOverload(op='aten.can_cast', overload='default')>,
 <OpOverload(op='aten.cat', overload='names')>,
 <OpOverload(op='aten.cdist', overload='default')>,
 <OpOverload(op='aten.chain_matmul', overload='default')>,
 <OpOverload(op='aten.chalf', overload='default')>,
 <OpOverload(op='aten.choose_qparams_optimized', overload='default')>,
 <OpOverload(op='aten.clip', overload='Tensor')>,
 <OpOverload(op='aten.clip', overload='default')>,
 <OpOverload(op='aten.column_stack', overload='default')>,
 <OpOverload(op='aten.combinations', overload='default')>,
 <OpOverload(op='aten.concat', overload='default')>,
 <OpOverload(op='aten.concat', overload='names')>,
 <OpOverload(op='aten.concatenate', overload='default')>,
 <OpOverload(op='aten.concatenate', overload='names')>,
 <OpOverload(op='aten.conv1d', overload='default')>,
 <OpOverload(op='aten.conv1d', overload='padding')>,
 <OpOverload(op='aten.conv2d', overload='default')>,
 <OpOverload(op='aten.conv2d', overload='padding')>,
 <OpOverload(op='aten.conv3d', overload='default')>,
 <OpOverload(op='aten.conv3d', overload='padding')>,
 <OpOverload(op='aten.conv_tbc_backward', overload='default')>,
 <OpOverload(op='aten.conv_transpose1d', overload='default')>,
 <OpOverload(op='aten.conv_transpose2d', overload='input')>,
 <OpOverload(op='aten.conv_transpose3d', overload='input')>,
 <OpOverload(op='aten.corrcoef', overload='default')>,
 <OpOverload(op='aten.cosine_embedding_loss', overload='default')>,
 <OpOverload(op='aten.cosine_similarity', overload='default')>,
 <OpOverload(op='aten.cov', overload='default')>,
 <OpOverload(op='aten.cross', overload='default')>,
 <OpOverload(op='aten.cross_entropy_loss', overload='default')>,
 <OpOverload(op='aten.ctc_loss', overload='IntList')>,
 <OpOverload(op='aten.ctc_loss', overload='Tensor')>,
 <OpOverload(op='aten.cudnn_is_acceptable', overload='default')>,
 <OpOverload(op='aten.cummax', overload='dimname')>,
 <OpOverload(op='aten.cummaxmin_backward', overload='default')>,
 <OpOverload(op='aten.cummin', overload='dimname')>,
 <OpOverload(op='aten.cumprod', overload='dimname')>,
 <OpOverload(op='aten.cumprod_backward', overload='default')>,
 <OpOverload(op='aten.cumsum', overload='dimname')>,
 <OpOverload(op='aten.cumulative_trapezoid', overload='dx')>,
 <OpOverload(op='aten.cumulative_trapezoid', overload='x')>,
 <OpOverload(op='aten.data', overload='default')>,
 <OpOverload(op='aten.det', overload='default')>,
 <OpOverload(op='aten.diag', overload='default')>,
 <OpOverload(op='aten.diagflat', overload='default')>,
 <OpOverload(op='aten.diff', overload='default')>,
 <OpOverload(op='aten.divide', overload='Scalar')>,
 <OpOverload(op='aten.divide', overload='Scalar_mode')>,
 <OpOverload(op='aten.divide', overload='Tensor')>,
 <OpOverload(op='aten.divide', overload='Tensor_mode')>,
 <OpOverload(op='aten.dstack', overload='default')>,
 <OpOverload(op='aten.einsum', overload='default')>,
 <OpOverload(op='aten.embedding_backward', overload='default')>,
 <OpOverload(op='aten.embedding_bag', overload='default')>,
 <OpOverload(op='aten.embedding_bag', overload='padding_idx')>,
 <OpOverload(op='aten.embedding_sparse_backward', overload='default')>,
 <OpOverload(op='aten.fake_quantize_per_channel_affine', overload='default')>,
 <OpOverload(op='aten.fake_quantize_per_channel_affine_cachemask_backward', overload='default')>,
 <OpOverload(op='aten.fake_quantize_per_tensor_affine', overload='default')>,
 <OpOverload(op='aten.fake_quantize_per_tensor_affine', overload='tensor_qparams')>,
 <OpOverload(op='aten.fake_quantize_per_tensor_affine_cachemask_backward', overload='default')>,
 <OpOverload(op='aten.fbgemm_linear_fp16_weight', overload='default')>,
 <OpOverload(op='aten.fbgemm_linear_fp16_weight_fp32_activation', overload='default')>,
 <OpOverload(op='aten.fbgemm_linear_int8_weight', overload='default')>,
 <OpOverload(op='aten.fbgemm_linear_int8_weight_fp32_activation', overload='default')>,
 <OpOverload(op='aten.fbgemm_linear_quantize_weight', overload='default')>,
 <OpOverload(op='aten.fbgemm_pack_gemm_matrix_fp16', overload='default')>,
 <OpOverload(op='aten.fbgemm_pack_quantized_matrix', overload='KN')>,
 <OpOverload(op='aten.fbgemm_pack_quantized_matrix', overload='default')>,
 <OpOverload(op='aten.fft_fft', overload='default')>,
 <OpOverload(op='aten.fft_fft2', overload='default')>,
 <OpOverload(op='aten.fft_fftn', overload='default')>,
 <OpOverload(op='aten.fft_fftshift', overload='default')>,
 <OpOverload(op='aten.fft_hfft', overload='default')>,
 <OpOverload(op='aten.fft_hfft2', overload='default')>,
 <OpOverload(op='aten.fft_hfftn', overload='default')>,
 <OpOverload(op='aten.fft_ifft', overload='default')>,
 <OpOverload(op='aten.fft_ifft2', overload='default')>,
 <OpOverload(op='aten.fft_ifftn', overload='default')>,
 <OpOverload(op='aten.fft_ifftshift', overload='default')>,
 <OpOverload(op='aten.fft_ihfft', overload='default')>,
 <OpOverload(op='aten.fft_ihfft2', overload='default')>,
 <OpOverload(op='aten.fft_ihfftn', overload='default')>,
 <OpOverload(op='aten.fft_irfft', overload='default')>,
 <OpOverload(op='aten.fft_irfft2', overload='default')>,
 <OpOverload(op='aten.fft_irfftn', overload='default')>,
 <OpOverload(op='aten.fft_rfft', overload='default')>,
 <OpOverload(op='aten.fft_rfft2', overload='default')>,
 <OpOverload(op='aten.fft_rfftn', overload='default')>,
 <OpOverload(op='aten.fix', overload='default')>,
 <OpOverload(op='aten.flatten_dense_tensors', overload='default')>,
 <OpOverload(op='aten.fliplr', overload='default')>,
 <OpOverload(op='aten.flipud', overload='default')>,
 <OpOverload(op='aten.float_power', overload='Scalar')>,
 <OpOverload(op='aten.float_power', overload='Tensor_Scalar')>,
 <OpOverload(op='aten.float_power', overload='Tensor_Tensor')>,
 <OpOverload(op='aten.frobenius_norm', overload='dim')>,
 <OpOverload(op='aten.gather', overload='dimname')>,
 <OpOverload(op='aten.gather_backward', overload='default')>,
 <OpOverload(op='aten.ger', overload='default')>,
 <OpOverload(op='aten.gradient', overload='array')>,
 <OpOverload(op='aten.gradient', overload='scalararray')>,
 <OpOverload(op='aten.gradient', overload='scalarint')>,
 <OpOverload(op='aten.gradient', overload='scalarrayarray')>,
 <OpOverload(op='aten.gradient', overload='scalarrayint')>,
 <OpOverload(op='aten.gradient', overload='tensorarray')>,
 <OpOverload(op='aten.gradient', overload='tensorarrayint')>,
 <OpOverload(op='aten.greater', overload='Scalar')>,
 <OpOverload(op='aten.greater', overload='Tensor')>,
 <OpOverload(op='aten.greater_equal', overload='Scalar')>,
 <OpOverload(op='aten.greater_equal', overload='Tensor')>,
 <OpOverload(op='aten.grid_sampler', overload='default')>,
 <OpOverload(op='aten.group_norm', overload='default')>,
 <OpOverload(op='aten.gru', overload='data')>,
 <OpOverload(op='aten.gru', overload='input')>,
 <OpOverload(op='aten.gru_cell', overload='default')>,
 <OpOverload(op='aten.hinge_embedding_loss', overload='default')>,
 <OpOverload(op='aten.histogramdd', overload='TensorList_bins')>,
 <OpOverload(op='aten.histogramdd', overload='default')>,
 <OpOverload(op='aten.histogramdd', overload='int_bins')>,
 <OpOverload(op='aten.hstack', overload='default')>,
 <OpOverload(op='aten.index_add', overload='dimname')>,
 <OpOverload(op='aten.index_copy', overload='dimname')>,
 <OpOverload(op='aten.index_fill', overload='Dimname_Scalar')>,
 <OpOverload(op='aten.index_fill', overload='Dimname_Tensor')>,
 <OpOverload(op='aten.index_select', overload='dimname')>,
 <OpOverload(op='aten.index_select_backward', overload='default')>,
 <OpOverload(op='aten.infinitely_differentiable_gelu_backward', overload='default')>,
 <OpOverload(op='aten.inner', overload='default')>,
 <OpOverload(op='aten.instance_norm', overload='default')>,
 <OpOverload(op='aten.inverse', overload='default')>,
 <OpOverload(op='aten.is_complex', overload='default')>,
 <OpOverload(op='aten.is_conj', overload='default')>,
 <OpOverload(op='aten.is_distributed', overload='default')>,
 <OpOverload(op='aten.is_floating_point', overload='default')>,
 <OpOverload(op='aten.is_inference', overload='default')>,
 <OpOverload(op='aten.is_leaf', overload='default')>,
 <OpOverload(op='aten.is_neg', overload='default')>,
 <OpOverload(op='aten.is_nonzero', overload='default')>,
 <OpOverload(op='aten.is_signed', overload='default')>,
 <OpOverload(op='aten.is_vulkan_available', overload='default')>,
 <OpOverload(op='aten.isclose', overload='default')>,
 <OpOverload(op='aten.isfinite', overload='default')>,
 <OpOverload(op='aten.isreal', overload='default')>,
 <OpOverload(op='aten.istft', overload='default')>,
 <OpOverload(op='aten.item', overload='default')>,
 <OpOverload(op='aten.kl_div', overload='default')>,
 <OpOverload(op='aten.kron', overload='default')>,
 <OpOverload(op='aten.kthvalue', overload='dimname')>,
 <OpOverload(op='aten.l1_loss', overload='default')>,
 <OpOverload(op='aten.layer_norm', overload='default')>,
 <OpOverload(op='aten.ldexp', overload='Tensor')>,
 <OpOverload(op='aten.less', overload='Scalar')>,
 <OpOverload(op='aten.less', overload='Tensor')>,
 <OpOverload(op='aten.less_equal', overload='Scalar')>,
 <OpOverload(op='aten.less_equal', overload='Tensor')>,
 <OpOverload(op='aten.linalg_cholesky', overload='default')>,
 <OpOverload(op='aten.linalg_cond', overload='default')>,
 <OpOverload(op='aten.linalg_cond', overload='p_str')>,
 <OpOverload(op='aten.linalg_det', overload='default')>,
 <OpOverload(op='aten.linalg_eigh', overload='default')>,
 <OpOverload(op='aten.linalg_eigvals', overload='default')>,
 <OpOverload(op='aten.linalg_eigvalsh', overload='default')>,
 <OpOverload(op='aten.linalg_inv', overload='default')>,
 <OpOverload(op='aten.linalg_ldl_factor', overload='default')>,
 <OpOverload(op='aten.linalg_lu_factor', overload='default')>,
 <OpOverload(op='aten.linalg_matmul', overload='default')>,
 <OpOverload(op='aten.linalg_matrix_norm', overload='default')>,
 <OpOverload(op='aten.linalg_matrix_norm', overload='str_ord')>,
 <OpOverload(op='aten.linalg_matrix_power', overload='default')>,
 <OpOverload(op='aten.linalg_matrix_rank', overload='atol_rtol_float')>,
 <OpOverload(op='aten.linalg_matrix_rank', overload='atol_rtol_tensor')>,
 <OpOverload(op='aten.linalg_matrix_rank', overload='default')>,
 <OpOverload(op='aten.linalg_matrix_rank', overload='tol_tensor')>,
 <OpOverload(op='aten.linalg_multi_dot', overload='default')>,
 <OpOverload(op='aten.linalg_norm', overload='default')>,
 <OpOverload(op='aten.linalg_norm', overload='ord_str')>,
 <OpOverload(op='aten.linalg_pinv', overload='atol_rtol_float')>,
 <OpOverload(op='aten.linalg_pinv', overload='default')>,
 <OpOverload(op='aten.linalg_pinv', overload='rcond_tensor')>,
 <OpOverload(op='aten.linalg_slogdet', overload='default')>,
 <OpOverload(op='aten.linalg_solve', overload='default')>,
 <OpOverload(op='aten.linalg_solve_ex', overload='default')>,
 <OpOverload(op='aten.linalg_svd', overload='default')>,
 <OpOverload(op='aten.linalg_svdvals', overload='default')>,
 <OpOverload(op='aten.linalg_tensorinv', overload='default')>,
 <OpOverload(op='aten.linalg_tensorsolve', overload='default')>,
 <OpOverload(op='aten.linalg_vander', overload='default')>,
 <OpOverload(op='aten.linalg_vecdot', overload='default')>,
 <OpOverload(op='aten.linear', overload='default')>,
 <OpOverload(op='aten.log_sigmoid', overload='default')>,
 <OpOverload(op='aten.log_softmax', overload='Dimname')>,
 <OpOverload(op='aten.log_softmax', overload='int')>,
 <OpOverload(op='aten.logcumsumexp', overload='dimname')>,
 <OpOverload(op='aten.logdet', overload='default')>,
 <OpOverload(op='aten.logsumexp', overload='names')>,
 <OpOverload(op='aten.lstm', overload='data')>,
 <OpOverload(op='aten.lstm', overload='input')>,
 <OpOverload(op='aten.lstm_cell', overload='default')>,
 <OpOverload(op='aten.lu_solve', overload='default')>,
 <OpOverload(op='aten.margin_ranking_loss', overload='default')>,
 <OpOverload(op='aten.masked_select_backward', overload='default')>,
 <OpOverload(op='aten.matmul', overload='default')>,
 <OpOverload(op='aten.matrix_exp', overload='default')>,
 <OpOverload(op='aten.matrix_exp_backward', overload='default')>,
 <OpOverload(op='aten.matrix_power', overload='default')>,
 <OpOverload(op='aten.max', overload='names_dim')>,
 <OpOverload(op='aten.max', overload='other')>,
 <OpOverload(op='aten.max_pool1d', overload='default')>,
 <OpOverload(op='aten.max_pool1d_with_indices', overload='default')>,
 <OpOverload(op='aten.max_pool2d', overload='default')>,
 <OpOverload(op='aten.max_pool3d', overload='default')>,
 <OpOverload(op='aten.mean', overload='names_dim')>,
 <OpOverload(op='aten.median', overload='names_dim')>,
 <OpOverload(op='aten.meshgrid', overload='default')>,
 <OpOverload(op='aten.meshgrid', overload='indexing')>,
 <OpOverload(op='aten.min', overload='names_dim')>,
 <OpOverload(op='aten.min', overload='other')>,
 <OpOverload(op='aten.mish_backward', overload='default')>,
 <OpOverload(op='aten.mode', overload='dimname')>,
 <OpOverload(op='aten.msort', overload='default')>,
 <OpOverload(op='aten.multilabel_margin_loss', overload='default')>,
 <OpOverload(op='aten.multiply', overload='Scalar')>,
 <OpOverload(op='aten.multiply', overload='Tensor')>,
 <OpOverload(op='aten.nanmean', overload='default')>,
 <OpOverload(op='aten.nanmedian', overload='names_dim')>,
 <OpOverload(op='aten.nanquantile', overload='default')>,
 <OpOverload(op='aten.nanquantile', overload='scalar')>,
 <OpOverload(op='aten.native_channel_shuffle', overload='default')>,
 <OpOverload(op='aten.negative', overload='default')>,
 <OpOverload(op='aten.nested_to_padded_tensor', overload='default')>,
 <OpOverload(op='aten.nll_loss', overload='default')>,
 <OpOverload(op='aten.nll_loss2d', overload='default')>,
 <OpOverload(op='aten.nll_loss_nd', overload='default')>,
 <OpOverload(op='aten.nonzero_numpy', overload='default')>,
 <OpOverload(op='aten.norm', overload='names_ScalarOpt_dim')>,
 <OpOverload(op='aten.norm', overload='names_ScalarOpt_dim_dtype')>,
 <OpOverload(op='aten.norm_except_dim', overload='default')>,
 <OpOverload(op='aten.not_equal', overload='Scalar')>,
 <OpOverload(op='aten.not_equal', overload='Tensor')>,
 <OpOverload(op='aten.nuclear_norm', overload='default')>,
 <OpOverload(op='aten.nuclear_norm', overload='dim')>,
 <OpOverload(op='aten.one_hot', overload='default')>,
 <OpOverload(op='aten.orgqr', overload='default')>,
 <OpOverload(op='aten.outer', overload='default')>,
 <OpOverload(op='aten.output_nr', overload='default')>,
 <OpOverload(op='aten.pad', overload='default')>,
 <OpOverload(op='aten.pad_sequence', overload='default')>,
 <OpOverload(op='aten.pairwise_distance', overload='default')>,
 <OpOverload(op='aten.pdist', overload='default')>,
 <OpOverload(op='aten.pinverse', overload='default')>,
 <OpOverload(op='aten.poisson_nll_loss', overload='default')>,
 <OpOverload(op='aten.prelu', overload='default')>,
 <OpOverload(op='aten.prod', overload='dim_Dimname')>,
 <OpOverload(op='aten.promote_types', overload='default')>,
 <OpOverload(op='aten.qr', overload='default')>,
 <OpOverload(op='aten.quantile', overload='default')>,
 <OpOverload(op='aten.quantile', overload='scalar')>,
 <OpOverload(op='aten.quantized_gru_cell', overload='default')>,
 <OpOverload(op='aten.quantized_lstm_cell', overload='default')>,
 <OpOverload(op='aten.quantized_rnn_relu_cell', overload='default')>,
 <OpOverload(op='aten.quantized_rnn_tanh_cell', overload='default')>,
 <OpOverload(op='aten.relu6', overload='default')>,
 <OpOverload(op='aten.repeat_interleave', overload='self_Tensor')>,
 <OpOverload(op='aten.repeat_interleave', overload='self_int')>,
 <OpOverload(op='aten.result_type', overload='Scalar')>,
 <OpOverload(op='aten.result_type', overload='Scalar_Scalar')>,
 <OpOverload(op='aten.result_type', overload='Scalar_Tensor')>,
 <OpOverload(op='aten.result_type', overload='Tensor')>,
 <OpOverload(op='aten.retains_grad', overload='default')>,
 <OpOverload(op='aten.rms_norm', overload='default')>,
 <OpOverload(op='aten.rnn_relu', overload='data')>,
 <OpOverload(op='aten.rnn_relu', overload='input')>,
 <OpOverload(op='aten.rnn_relu_cell', overload='default')>,
 <OpOverload(op='aten.rnn_tanh', overload='data')>,
 <OpOverload(op='aten.rnn_tanh', overload='input')>,
 <OpOverload(op='aten.rnn_tanh_cell', overload='default')>,
 <OpOverload(op='aten.row_stack', overload='default')>,
 <OpOverload(op='aten.rrelu', overload='default')>,
 <OpOverload(op='aten.scaled_dot_product_attention', overload='default')>,
 <OpOverload(op='aten.scatter', overload='dimname_src')>,
 <OpOverload(op='aten.scatter', overload='dimname_value')>,
 <OpOverload(op='aten.scatter_add', overload='dimname')>,
 <OpOverload(op='aten.selu', overload='default')>,
 <OpOverload(op='aten.silu_backward', overload='default')>,
 <OpOverload(op='aten.size', overload='Dimname')>,
 <OpOverload(op='aten.size', overload='int')>,
 <OpOverload(op='aten.slogdet', overload='default')>,
 <OpOverload(op='aten.slow_conv3d', overload='default')>,
 <OpOverload(op='aten.smm', overload='default')>,
 <OpOverload(op='aten.softmax', overload='Dimname')>,
 <OpOverload(op='aten.softmax', overload='int')>,
 <OpOverload(op='aten.sort', overload='dimname')>,
 <OpOverload(op='aten.sort', overload='dimname_stable')>,
 <OpOverload(op='aten.sparse_bsc_tensor', overload='ccol_row_value')>,
 <OpOverload(op='aten.sparse_bsc_tensor', overload='ccol_row_value_size')>,
 <OpOverload(op='aten.sparse_bsr_tensor', overload='crow_col_value')>,
 <OpOverload(op='aten.sparse_bsr_tensor', overload='crow_col_value_size')>,
 <OpOverload(op='aten.sparse_coo_tensor', overload='indices')>,
 <OpOverload(op='aten.sparse_coo_tensor', overload='indices_size')>,
 <OpOverload(op='aten.sparse_csc_tensor', overload='ccol_row_value')>,
 <OpOverload(op='aten.sparse_csc_tensor', overload='ccol_row_value_size')>,
 <OpOverload(op='aten.sparse_csr_tensor', overload='crow_col_value')>,
 <OpOverload(op='aten.sparse_csr_tensor', overload='crow_col_value_size')>,
 <OpOverload(op='aten.special_digamma', overload='default')>,
 <OpOverload(op='aten.special_erf', overload='default')>,
 <OpOverload(op='aten.special_erfc', overload='default')>,
 <OpOverload(op='aten.special_erfinv', overload='default')>,
 <OpOverload(op='aten.special_exp2', overload='default')>,
 <OpOverload(op='aten.special_expit', overload='default')>,
 <OpOverload(op='aten.special_expm1', overload='default')>,
 <OpOverload(op='aten.special_gammainc', overload='default')>,
 <OpOverload(op='aten.special_gammaincc', overload='default')>,
 <OpOverload(op='aten.special_gammaln', overload='default')>,
 <OpOverload(op='aten.special_i0', overload='default')>,
 <OpOverload(op='aten.special_log1p', overload='default')>,
 <OpOverload(op='aten.special_log_softmax', overload='default')>,
 <OpOverload(op='aten.special_logit', overload='default')>,
 <OpOverload(op='aten.special_logsumexp', overload='default')>,
 <OpOverload(op='aten.special_multigammaln', overload='default')>,
 <OpOverload(op='aten.special_ndtr', overload='default')>,
 <OpOverload(op='aten.special_polygamma', overload='default')>,
 <OpOverload(op='aten.special_psi', overload='default')>,
 <OpOverload(op='aten.special_round', overload='default')>,
 <OpOverload(op='aten.special_sinc', overload='default')>,
 <OpOverload(op='aten.special_softmax', overload='default')>,
 <OpOverload(op='aten.special_xlogy', overload='default')>,
 <OpOverload(op='aten.special_xlogy', overload='other_scalar')>,
 <OpOverload(op='aten.special_xlogy', overload='self_scalar')>,
 <OpOverload(op='aten.square', overload='default')>,
 <OpOverload(op='aten.sspaddmm', overload='default')>,
 <OpOverload(op='aten.std', overload='correction_names')>,
 <OpOverload(op='aten.std', overload='default')>,
 <OpOverload(op='aten.std', overload='dim')>,
 <OpOverload(op='aten.std', overload='names_dim')>,
 <OpOverload(op='aten.std_mean', overload='correction_names')>,
 <OpOverload(op='aten.std_mean', overload='default')>,
 <OpOverload(op='aten.std_mean', overload='dim')>,
 <OpOverload(op='aten.std_mean', overload='names_dim')>,
 <OpOverload(op='aten.stft', overload='center')>,
 <OpOverload(op='aten.stft', overload='default')>,
 <OpOverload(op='aten.stride', overload='Dimname')>,
 <OpOverload(op='aten.stride', overload='int')>,
 <OpOverload(op='aten.subtract', overload='Scalar')>,
 <OpOverload(op='aten.subtract', overload='Tensor')>,
 <OpOverload(op='aten.sum', overload='dim_DimnameList')>,
 <OpOverload(op='aten.sum_to_size', overload='default')>,
 <OpOverload(op='aten.svd', overload='default')>,
 <OpOverload(op='aten.sym_size', overload='int')>,
 <OpOverload(op='aten.sym_stride', overload='int')>,
 <OpOverload(op='aten.take_along_dim', overload='default')>,
 <OpOverload(op='aten.tensordot', overload='default')>,
 <OpOverload(op='aten.thnn_conv2d', overload='default')>,
 <OpOverload(op='aten.tile', overload='default')>,
 <OpOverload(op='aten.to_dense', overload='default')>,
 <OpOverload(op='aten.to_dense_backward', overload='default')>,
 <OpOverload(op='aten.to_mkldnn_backward', overload='default')>,
 <OpOverload(op='aten.to_sparse', overload='default')>,
 <OpOverload(op='aten.to_sparse', overload='sparse_dim')>,
 <OpOverload(op='aten.to_sparse_bsc', overload='default')>,
 <OpOverload(op='aten.to_sparse_bsr', overload='default')>,
 <OpOverload(op='aten.to_sparse_csc', overload='default')>,
 <OpOverload(op='aten.to_sparse_csr', overload='default')>,
 <OpOverload(op='aten.trace_backward', overload='default')>,
 <OpOverload(op='aten.trapezoid', overload='dx')>,
 <OpOverload(op='aten.trapezoid', overload='x')>,
 <OpOverload(op='aten.trapz', overload='dx')>,
 <OpOverload(op='aten.trapz', overload='x')>,
 <OpOverload(op='aten.triplet_margin_loss', overload='default')>,
 <OpOverload(op='aten.true_divide', overload='Scalar')>,
 <OpOverload(op='aten.true_divide', overload='Tensor')>,
 <OpOverload(op='aten.type_as', overload='default')>,
 <OpOverload(op='aten.unflatten_dense_tensors', overload='default')>,
 <OpOverload(op='aten.upsample_bicubic2d', overload='vec')>,
 <OpOverload(op='aten.upsample_bilinear2d', overload='vec')>,
 <OpOverload(op='aten.upsample_linear1d', overload='vec')>,
 <OpOverload(op='aten.upsample_nearest1d', overload='default')>,
 <OpOverload(op='aten.upsample_nearest1d', overload='vec')>,
 <OpOverload(op='aten.upsample_nearest2d', overload='default')>,
 <OpOverload(op='aten.upsample_nearest2d', overload='vec')>,
 <OpOverload(op='aten.upsample_nearest3d', overload='default')>,
 <OpOverload(op='aten.upsample_nearest3d', overload='vec')>,
 <OpOverload(op='aten.upsample_trilinear3d', overload='vec')>,
 <OpOverload(op='aten.value_selecting_reduction_backward', overload='default')>,
 <OpOverload(op='aten.vander', overload='default')>,
 <OpOverload(op='aten.var', overload='correction_names')>,
 <OpOverload(op='aten.var', overload='default')>,
 <OpOverload(op='aten.var', overload='dim')>,
 <OpOverload(op='aten.var', overload='names_dim')>,
 <OpOverload(op='aten.var_mean', overload='correction_names')>,
 <OpOverload(op='aten.var_mean', overload='default')>,
 <OpOverload(op='aten.var_mean', overload='dim')>,
 <OpOverload(op='aten.var_mean', overload='names_dim')>,
 <OpOverload(op='aten.vstack', overload='default')>,
 <OpOverload(op='aten.where', overload='Scalar')>,
 <OpOverload(op='aten.where', overload='ScalarOther')>,
 <OpOverload(op='aten.where', overload='ScalarSelf')>,
 <OpOverload(op='aten.where', overload='default')>,
 <OpOverload(op='aten.wrapped_linear_prepack', overload='default')>,
 <OpOverload(op='aten.wrapped_quantized_linear_prepacked', overload='default')>
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136153
Approved by: https://github.com/xadupre, https://github.com/gramalingam
2024-09-16 21:28:54 +00:00
b76d1b79e6 Add scaling arguments to bsr_dense_addmm (#136104)
As in the title.

Tackles https://github.com/pytorch/ao/pull/821/files#r1759821413

The PR assumes that the existing tuning parameters are good also when using scaling arguments. This needs to be verified as a follow-up task.

Also, this PR redefines triton-contiguous tensors: the tensor must have strides not larger than 1. This will now allow zero strides that previously triggered `contiguous` call although the underlying memory buffer was contiguous.

Re: "a considerable slow-down occurs because tensor data is copied element-wise rather than chunk-wise" - this note should refer to a code (torch or triton?) that implements the element/chunk-wise copy so that we could verify that allowing zero strides indeed would not trigger element-wise copies. Atm, the performance increase in ViT-H benchmarks (that involve using 0 strides) is an evidence that allowing zero strides does not lead to slow-downs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136104
Approved by: https://github.com/cpuhrsch
2024-09-16 20:26:54 +00:00
bfbcdf4967 Revert "[dynamo] Fix support for classmethod(property(...)) (#134968)"
This reverts commit c64ae601ba9eb3ad2cd3402a14f6ac83c0ab7eba.

Reverted https://github.com/pytorch/pytorch/pull/134968 on behalf of https://github.com/jeanschmidt due to Breaking internal signals, we need to skip the new tests on py3.10 ([comment](https://github.com/pytorch/pytorch/pull/134968#issuecomment-2353909010))
2024-09-16 20:26:35 +00:00
3c97b0ab00 Use ncclAlltoAllv and ncclAlltoAll API when supported (#134499)
NCCL does not have an api for ncclAllToAll and ncclAllToAllv, so PyTorch does point to point send/recv. Expose this API if it is supported.

Differential Revision: [D61683836](https://our.internmc.facebook.com/intern/diff/D61683836/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134499
Approved by: https://github.com/shuqiangzhang, https://github.com/eqy
2024-09-16 20:08:06 +00:00
abd16a8c64 [torch/multiprocessing] Use multiprocessing.reduction.register ForkingPickler.register to register custom tensor and storage reductions (#135030)
Right now `multiprocessing.reduction.register()` is simply an alias to `multiprocessing.reduction.ForkingPickler.register()`
https://github.com/python/cpython/blame/main/Lib/multiprocessing/reduction.py#L56, but the top-level `register()` function exposes less of the internal details of `multiprocessing.reduction` module.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135030
Approved by: https://github.com/albanD
2024-09-16 20:07:29 +00:00
a0c7029a75 [c10d][Reland] Remove Option for ProcessGroup and Expose backend Options to reflect the correct code structure (#132931) (#135653)
We introduced the dispatchable backend for a ProcessGroup and collective in https://github.com/pytorch/pytorch/issues/86225. This PR is a follow-up cleanup to clean up the option of a ProcessGroup and ask users to either set timeout or backend later on or directly create backend after creating a PG.

Also PGNCCL is using option class from ProcessGroup but we actually should use Option from backend class. So this PR is to make the type or name to be aligned with what we are doing in cpp side. I don't change the signature for the public API, so they still use args named "pg_options"

We need to make changes to the test to make it aligned with the change.

This is try to reland D62008954 by fixing internal errors.

Differential Revision: [D62483294](https://our.internmc.facebook.com/intern/diff/D62483294/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135653
Approved by: https://github.com/wz337, https://github.com/H-Huang
2024-09-16 19:56:42 +00:00
7537f74277 Refactor FxGraphCache.load into separate functions, so that AOTAutogradCache may access it correctly later (#135491)
Summary:
We refactor FxGraphCache.load into three phases:
- prepare_key, which checks that an inductor input is cacheable and bypasses otherwise
- load_with_key, which tries to lookup the key in the cache
- post compile, where we do some logging and run post compile steps

Splitting it along these lines will allow AOTAutogradCache to use load_with_key and still get access to all of the observability + remote cache logic when accessing FxGraphCache, without needing to pass key components, etc.

Differential Revision: D62314862

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135491
Approved by: https://github.com/oulgen
2024-09-16 19:48:08 +00:00
31715be72a [BE]: Update mypy to 1.11.2 (#133816)
Updates mypy to 1.11.1 to improve type inference

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133816
Approved by: https://github.com/ezyang
2024-09-16 19:44:11 +00:00
38caf10411 [EZ] Fix spelling typo (#136157)
s/toosl/tools/ (spotted by @louie-tsai)
Also, capitalize CUDA

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136157
Approved by: https://github.com/kit1980
2024-09-16 19:30:30 +00:00
c977bb7d03 [Distributed] fix FileSystemWriter __init__ (#136135)
Fixes #135608.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136135
Approved by: https://github.com/Skylion007
2024-09-16 19:11:08 +00:00
717fca2cac Drop outdated section 'Running clang-tidy' in CONTRIBUTING.md (#136146)
Fixes #125920

[Running clang-tidy](https://github.com/pytorch/pytorch/blob/main/CONTRIBUTING.md#running-clang-tidy) section is misleading and outdated. C++ lint is done with lintrunner and covered in [local-linting](https://github.com/pytorch/pytorch/blob/main/CONTRIBUTING.md#local-linting) section.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136146
Approved by: https://github.com/janeyx99
2024-09-16 19:02:21 +00:00
f89ce4dfbb torch.nn.MultiheadAttention: docs: improvement (#136111)
`torch.nn.MultiheadAttention`: docs: improvement
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136111
Approved by: https://github.com/janeyx99
2024-09-16 18:52:20 +00:00
d3647d15e6 Remove accidentally committed code (#136154)
Accidentally left out during rebase

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136154
Approved by: https://github.com/kit1980, https://github.com/albanD
2024-09-16 18:34:20 +00:00
d0cebedb31 Revert "Add Triton CPU as an Inductor backend (#133408)"
This reverts commit e498b02b472e45cfd6b7a08db0d6c1babec655c5.

Reverted https://github.com/pytorch/pytorch/pull/133408 on behalf of https://github.com/jeanschmidt due to Broke internal signals, see D62737208 for more details ([comment](https://github.com/pytorch/pytorch/pull/133408#issuecomment-2353623816))
2024-09-16 18:33:33 +00:00
7fe004f7cf Revert "Add CI for Triton CPU backend (#135342)"
This reverts commit 426580a67db15ec17b2b861a09667bf59927e033.

Reverted https://github.com/pytorch/pytorch/pull/135342 on behalf of https://github.com/jeanschmidt due to Broke internal signals, see D62737208 for more details ([comment](https://github.com/pytorch/pytorch/pull/133408#issuecomment-2353623816))
2024-09-16 18:33:33 +00:00
23c0d2689e [BE][Ez]: Fix missing float16 coverage for adaptive_pool3d_cpu (#136091)
Testing if op info coverage has issues

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136091
Approved by: https://github.com/ezyang
2024-09-16 18:22:16 +00:00
5193f23469 [Pytorch] Cleanup Strobelight URL and shorten for readability (#136102)
Summary:
- Converted strobelight URL prefix to more readable and editable json
- Dump shortened URLs when possible for easier readability

Test Plan:
```
python ./torch/_strobelight/examples/compile_time_profile_example.py
python torch/_strobelight/examples/cli_function_profiler_example.py
```

Differential Revision: D62690292

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136102
Approved by: https://github.com/laithsakka
2024-09-16 18:10:33 +00:00
0199fd4d7e Revert "[inductor] More fixes on the keys of constants and signature dictionaries (#135406)"
This reverts commit e54b559e8860e343692bb5534777b2384a57a613.

Reverted https://github.com/pytorch/pytorch/pull/135406 on behalf of https://github.com/jeanschmidt due to Reverting as it is breaking triton_mtia internal signals @jansel could you have a look and help get those changes merged? ([comment](https://github.com/pytorch/pytorch/pull/135406#issuecomment-2353557481))
2024-09-16 17:58:02 +00:00
b491e2974c [BE][Ez]: Add full half/bfloat16 dtype for unique and isin (#136114)
Fixes #136090

* Add support for isin to tensor half dtypes for CPU (just add a few extra dispatches).
* Seems like the CUDA implementation for bfloat16 was mostly compiled and available all along (it just calls sort internally AND unique). To enable it, we just need to remove an assert to access it (since sort's functionality was updated since the assert was added) and add missing dtype support to unique.
* This unlocks more GPU functionality with minimal code bloat. I also added CPU kernels for the dtypes for parity.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136114
Approved by: https://github.com/malfet
2024-09-16 17:49:12 +00:00
0aa41eb52f [ONNX] Run type promotion test in CI and update the table (#135915)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135915
Approved by: https://github.com/gramalingam, https://github.com/xadupre
2024-09-16 16:46:13 +00:00
090046b936 [effects] Turn off dtype promotion for with_effects lowering (#136039)
By default inductor promotes arguments to the common highest dtype.
Having empty token with dtype=torch.float32 results in dtype promotion for effectful ops during lowering of with_effects.

Disabling dtype promotion for this lowering.

Removing previous workaround making token dtype torch.bool.

Testing:

```
python test/distributed/test_c10d_functional_native.py -k test_inductor_dtypeview_memory_lea
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136039
Approved by: https://github.com/bdhirsh, https://github.com/eellison, https://github.com/zou3519
2024-09-16 16:14:05 +00:00
c33b0580e6 Add decomposition for squeeze_copy (#130941)
* Extracted from #128416

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130941
Approved by: https://github.com/amjames, https://github.com/eellison
2024-09-16 15:46:57 +00:00
13bd1256f9 Delete stable prototype (#135911)
This project ended up going in an entirely different direction, so we can close out all this
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135911
Approved by: https://github.com/izaitsevfb, https://github.com/malfet
2024-09-16 15:32:17 +00:00
d833f49602 [reland][Inductor] Rename cpp_wrapper_cuda.py as cpp_wrapper_gpu.py (#136046)
Summary: Reland https://github.com/pytorch/pytorch/pull/135313 after fixing internal build issues

Test Plan: CI

Differential Revision: D62658837

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136046
Approved by: https://github.com/chenyang78, https://github.com/etaf, https://github.com/jansel
2024-09-16 14:35:19 +00:00
a803cb0531 [AOTI] Refactor how cpp_wrapper specific options are set (#136035)
Summary:
1) When cpp-wrapper is turned on, certain triton specific options need to be set, both for forward and backward. This PR considate the settings in one place.
2) Change config.triton.autotune_at_compile_time to default to None. If the flag is not explicitly set by user, default it to True for cpp-wrapper.

Differential Revision: [D62689940](https://our.internmc.facebook.com/intern/diff/D62689940)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136035
Approved by: https://github.com/chenyang78
2024-09-16 14:32:13 +00:00
bbc3fdbbde Add python 3.13.0t build to Docker images (#136001)
Adds 3.13t python to Docker images
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136001
Approved by: https://github.com/albanD
2024-09-16 12:49:36 +00:00
3117f2cf67 Revert "[BE]: Update mypy to 1.11.2 (#133816)"
This reverts commit 55299cfc223fa838aadd8d6d6fa3ed541fa5acd1.

Reverted https://github.com/pytorch/pytorch/pull/133816 on behalf of https://github.com/jeanschmidt due to seems to have broken https://github.com/pytorch/pytorch/actions/runs/10865710499/job/30155699792 on main ([comment](https://github.com/pytorch/pytorch/pull/133816#issuecomment-2352377684))
2024-09-16 09:11:16 +00:00
951c21d679 [dynamo] simplify implementation for builtins.sum (#133779)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133779
Approved by: https://github.com/jansel, https://github.com/anijain2305
ghstack dependencies: #133778
2024-09-16 04:53:06 +00:00
9961aaa601 [dynamo] simplify implementation for functools.reduce (#133778)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133778
Approved by: https://github.com/jansel, https://github.com/anijain2305
2024-09-16 04:53:06 +00:00
d2207c57f7 [Distributed] add pack-check method for float8_e5m2 (#136115)
Add support for Float8_e5m2, following similar algorithm used for Float8_e4m3fn (i.e. overflow check).

Made `HasNanFP8x8` a template so that it is extendable based on dtype.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136115
Approved by: https://github.com/Skylion007
ghstack dependencies: #135891, #135961
2024-09-15 21:37:43 +00:00
e501ed71d4 Update link in distributed.tensor.parallel.rst (#136103)
dtensor folder was moved

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136103
Approved by: https://github.com/kwen2501, https://github.com/fegin
2024-09-15 19:36:29 +00:00
ab9a7eadd3 Add decomposition for permute_copy (#130944)
* Extracted from #129476

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130944
Approved by: https://github.com/amjames, https://github.com/eellison
2024-09-15 19:35:14 +00:00
a141c6bb0d [pytorch][monitoring] Dynamic backend for WaitCounter (#135967)
Summary: This implements a default backend proxy that tries to look up a backend via dlsym. What this enables is dynamically loading a module with a backend implementation without having it statically linked with the application.

Differential Revision: D62549295

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135967
Approved by: https://github.com/c-p-i-o
2024-09-15 18:07:49 +00:00
dec3403b24 Add some doc for export_for_training (#135918)
Differential Revision: [D62610491](https://our.internmc.facebook.com/intern/diff/D62610491)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135918
Approved by: https://github.com/avikchaudhuri
ghstack dependencies: #135080, #135912
2024-09-15 17:08:12 +00:00
1904b09e61 Create export_for_inference API and expose core_aten as public facing API (#135912)
Differential Revision: [D62606908](https://our.internmc.facebook.com/intern/diff/D62606908)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135912
Approved by: https://github.com/avikchaudhuri
ghstack dependencies: #135080
2024-09-15 17:05:07 +00:00
382fad58b3 Deprecate _preserve_ops and consolidate with decomp_table (#135080)
In this PR, we deprecate _preserve_ops feature in run_decomposition API. We can't kill this API completely because Executorch team depends on it. As the syncing between two repos is non-trivial, I just leave this argument as deprecated for now. In the next PR, i will immediately remove it.

After this PR, run_decompositions will only decompose what's inside the decomp table and preserve the rest by default. Note that this feature is only rolled out to OSS for now. Old code path is protected under IS_FBCODE flag.

Differential Revision: [D62163161](https://our.internmc.facebook.com/intern/diff/D62163161/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135080
Approved by: https://github.com/justinchuby, https://github.com/avikchaudhuri, https://github.com/bdhirsh
2024-09-15 17:01:58 +00:00
357b7fb579 Revert "[Pytorch] Consolidate Strobelight compile time profiler between OSS and fbcode (#135953)"
This reverts commit b8637503c036abb898f6b880b325aeffe6f09c03.

Reverted https://github.com/pytorch/pytorch/pull/135953 on behalf of https://github.com/kollasb due to Broke internal module factory compatibility, revert from Phabricator failed ([comment](https://github.com/pytorch/pytorch/pull/135953#issuecomment-2351381777))
2024-09-15 05:32:38 +00:00
cyy
31e42a45dd Fix redundant move warnings by g++ (#134987)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134987
Approved by: https://github.com/ezyang
2024-09-15 05:28:19 +00:00
e1abd346a3 [audio hash update] update the pinned audio hash (#136106)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned audio hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136106
Approved by: https://github.com/pytorchbot
2024-09-15 04:31:35 +00:00
386884e553 [Traceable FSDP2] Ignore FSDP2 forward hook side-effects in AC; Support FSDP2 + AC (#134997)
> Ignore FSDP2 forward hook side-effects in AC

Under AC, FSDP2 does not rely on forward hook to all-gather weights to do recomputation, instead it relies on pre-backward hook to do this job:
451eaf0ff2/torch/distributed/_composable/fsdp/_fsdp_state.py (L219-L220)

So when we use `speculate_subgraph` to trace the utils.checkpoint AC region, we don't actually need to worry about FSDP2 forward hook's side effects and can safely ignore it, because we are not and we don't expect to re-run the FSDP2 forward hook during backward recomputation.

----

Test commands:
- `pytest -rA test/distributed/_composable/fsdp/test_fully_shard_compile.py::TestFullyShardCompile::test_nested_fully_shard_backend_inductor`
- `pytest -rA test/distributed/_composable/fsdp/test_fully_shard_compile.py::TestFullyShardCompile::test_transformer_backend_inductor`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134997
Approved by: https://github.com/zou3519
ghstack dependencies: #135727
2024-09-15 02:00:17 +00:00
8072ebc36c SKIP llama for dynamic size testing (#135960)
Running Torchbench llama with dynamic size failed with
```
  File "/localdisk/leslie/torch_inductor_community/pytorch/torch/fx/experimental/symbolic_shapes.py", line 4182, in produce_guards
    raise ConstraintViolationError(
torch.fx.experimental.symbolic_shapes.ConstraintViolationError: Constraints violated (L['inputs'][0].size()[0])! For more information, run with TORCH_LOGS="+dynamic".
  - Not all values of RelaxedUnspecConstraint(L['inputs'][0].size()[0]) are valid because L['inputs'][0].size()[0] was inferred to be a constant (32).
```
Skip this model for marking dynamic dim.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135960
Approved by: https://github.com/ezyang
2024-09-15 00:06:49 +00:00
a1a57a424d Optimize dict reconstruct to not codegen untouched values (#134876)
PR changes how `reconstruct` is done for a ConstDict. As of today, it works as follow:
(1) codegen(...) each pair of key/value
(2) create a new dictionary to hold the new items
(3) clear the original dictionary
(4) update the original dict with the one created in (2)

We do a micro optimization in the generated bytecode to:
- Only codegen the items that changed.
- Only clear the original dictionary if a key was removed.

Fixes: #133487

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134876
Approved by: https://github.com/zou3519
2024-09-14 23:25:28 +00:00
a5eb43d8b4 Add TensorReferenceAnalysis and some tests (#135886)
Split out and modified from https://github.com/pytorch/pytorch/pull/130228. There were a bunch of subtle bugs eg. sometimes we need to use torch.ops.aten.{operator}.Tensor vs other times using torch.ops.aten.{operator}.default. Or in the case of pow we need to use Tensor_Tensor. I figured it'd be easier to split out adding TensorReferenceAnalysis and add some tests and do the actual integration in a separate diff.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135886
Approved by: https://github.com/ezyang
2024-09-14 23:09:40 +00:00
391f2d6d50 use a fast expand algorithm (#135999)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135999
Approved by: https://github.com/ezyang
2024-09-14 23:09:34 +00:00
5b21d91197 Fix dividing Mul by factor (#136079)
Fixes https://github.com/pytorch/pytorch/issues/136032

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136079
Approved by: https://github.com/ezyang
2024-09-14 22:14:27 +00:00
426580a67d Add CI for Triton CPU backend (#135342)
Where possible, I have marked failing tests (which we intend to fix or triage) as `@xfail_if_triton_cpu`. This will help us track progress of the Triton CPU backend over time. Tests that I don't think we need to address, or that are flaky, have been marked as skips.

Successful CI run: https://github.com/pytorch/pytorch/actions/runs/10822238062/job/30028284549

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135342
Approved by: https://github.com/jansel
ghstack dependencies: #133408
2024-09-14 21:45:19 +00:00
e498b02b47 Add Triton CPU as an Inductor backend (#133408)
The goal is to use Inductor-generated kernels to stress test the new Triton CPU backend.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133408
Approved by: https://github.com/jansel
2024-09-14 21:45:19 +00:00
55299cfc22 [BE]: Update mypy to 1.11.2 (#133816)
Updates mypy to 1.11.1 to improve type inference

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133816
Approved by: https://github.com/ezyang
2024-09-14 21:40:36 +00:00
c64ae601ba [dynamo] Fix support for classmethod(property(...)) (#134968)
Fixes #134451

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134968
Approved by: https://github.com/yanboliang
2024-09-14 21:00:41 +00:00
7f5abb44af [BE][Ez]: Update pybind11 to 2.13.6. Exposes new conduit cross-compat API (#136087)
Updates pybind11 submodule. The major patchnote is an experimental new function that is added to all pybind11 objects that will make them more compatible across pybind11 version, settings, and frameworks (such as nanobind) called cpp_conduit. No code changes needed on our end except to update
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136087
Approved by: https://github.com/malfet
2024-09-14 20:48:44 +00:00
8df01c8258 [Dynamo] Remove ignored modes from torch function mode stack guard (#135503)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135503
Approved by: https://github.com/anijain2305
ghstack dependencies: #134732, #133137, #135443, #135444, #135422, #135502
2024-09-14 18:52:22 +00:00
860838e9be [Dynamo] Remove ignored modes workaround (#135502)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135502
Approved by: https://github.com/anijain2305
ghstack dependencies: #134732, #133137, #135443, #135444, #135422
2024-09-14 18:52:22 +00:00
1b9daeb240 [Dynamo] Trace enter/exit of TorchFunctionModes (#135422)
This PR implements tracing of with contexts with TorchFunction modes which have the default enter/exit behavior (ie pushing/popping the mode)

Typically the bytecode for a context manager looks like this during a graph break:
1. graph call
2. enter context
3. unsupported code
4. exit context
5. resume call

resume fn structure:
1. enter context
2. jump
...
3. exit context

The issue with torch function modes is that side effects will replay any mutations to the torch function stack performed during tracing. So, we do not need to enter and exit around the unsupported code in the original function (doing so would result in a duplicate torch function mode entry during execution of the unsupported code), and we don't need to enter again in the resume function (the mode that was pushed from the side effects bytecode would still be on the stack).

So for torch function modes the structure of our output code is this:

1. graph call
2. mutate tf mode stack to replay mutations
4. unsupported code
5. on exception restore stack
6. resume function

Then our resume fn looks like this:

1. no-op enter torch function mode
2. jump
3.  exit tf mode

To implement the no-op enter of the torch function mode I added torch function mode in polyfill which no-op enters, but normally exits. This is needed because we still want to trace the with context in the resume function, and exit properly (the exit instructions will still be in the function, so we need to generate instructions to set up the context).

Separately from the bytecode, dynamo also tracks contexts on the block stack, which is how the SETUP_* instructions are implemented. Naturally at a graph break, we exit these block stacks to properly reset the contexts entirely, so that we can re-enter around the unsupported code soundly. However once again, in the torch function mode case, in the event of a graph we do not want to perform any exit side effects because we want to preserve the state of the mode stack as is so that we will properly update the stack with bytecode mentioned in the first section. If we exited here, dynamo would pop the mode off of the symbolic stack, and not update the true python torch function mode stack with the suffix bytecode. All in all, for torch function modes we enter exactly once, update the global torch function mode stack with side effects bytecode, re-read this stack when compiling the resume function, and exit exactly once in the resume function. This matches the semantics of eager exactly.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135422
Approved by: https://github.com/williamwen42
ghstack dependencies: #134732, #133137, #135443, #135444
2024-09-14 18:52:22 +00:00
06caa2d560 [Dynamo] Simplify torch function mode stack guard (#135444)
The semantics of ignored modes previously had edge cases, this eliminates these by in essence filtering any ignored modes out of both the ref stack and the current torch function mode stack. This is purely to fix complexity in #135422.  The ignored modes handling will be removed in a future PR after https://github.com/pytorch/pytorch/pull/135422 lands, since we will then trace through DeviceContexts vs inserting them into the graph which needed these extra workarounds for correctness.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135444
Approved by: https://github.com/anijain2305, https://github.com/williamwen42
ghstack dependencies: #134732, #133137, #135443
2024-09-14 18:52:22 +00:00
14cabdf626 [Dynamo] Support thread local setattr (#135443)
In preparation for tracing through DeviceContext (defb515306/torch/utils/_device.py (L66))
This PR adds support for calling the setattr of thread local objects. These objects have a slots impl, and since this doesn't appear to have any side effects, we call this setattr impl when replaying mutations, since calling `object.__setattr__` on these objects results in a type error.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135443
Approved by: https://github.com/anijain2305
ghstack dependencies: #134732, #133137
2024-09-14 18:52:22 +00:00
5c5c33ac32 [Dynamo] Trace torch function modes entered outside of torch.compile (#133137)
This PR adds initial tracing for torch function modes.

Details:
In essence, this adds tracing into the torch function of modes entered outside of the torch.compile call.
This does not yet support tracing enter/exit of a torch function mode/ tracing set_default_device properly using the new mode infra (this will be a very good stress test for modes). I am adding more PRs to this stack to support these. The overall plan is to support tracing enter/exit and handling graph breaks like we do other torch.* context managers.

Previously landed:
https://github.com/pytorch/pytorch/pull/133135
https://github.com/pytorch/pytorch/pull/133136
https://github.com/pytorch/pytorch/pull/133134
https://github.com/pytorch/pytorch/pull/133133
https://github.com/pytorch/pytorch/pull/133132
https://github.com/pytorch/pytorch/pull/133131
https://github.com/pytorch/pytorch/pull/133729
https://github.com/pytorch/pytorch/pull/133130

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133137
Approved by: https://github.com/jansel, https://github.com/zou3519
ghstack dependencies: #134732
2024-09-14 18:52:22 +00:00
228760b945 [Dynamo] Use custom backend to reenter metadata tf mode when tracing while/cond (#134732)
For tracing cond/while in eager, we trace the HOP with the eager backend with metadata torchfunction mode enabled. HOPs disallow the mutation that occurs in this torch function mode, so it is not able to be traced. As a result, we use a custom backend which enters this mode for tracing these HOPs. Thanks to @ydwu4 for the help with implementing this

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134732
Approved by: https://github.com/ydwu4
2024-09-14 18:52:22 +00:00
b4c84c3167 [AOTI] Fix a fallback op returning None issue (#135997)
Summary: Fixes https://github.com/pytorch/pytorch/issues/135781. In some cases, a fallback can return None in the place of a tensor.

Differential Revision: [D62659039](https://our.internmc.facebook.com/intern/diff/D62659039)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135997
Approved by: https://github.com/chenyang78
2024-09-14 18:12:06 +00:00
b82122beef Only keep ListOfLinears module in basic_modules_benchmarks and add gpu version. (#135730)
All of the previous benchmarks are similar, ListOfLinears should be representative enough.
I copied the previous benchmarks from unit tests without an intention, was just trying to create a large
number of benchmarks to better observe noise.

This PR keeps only one, we can add more as we see value and regressions in the future.
Also this diff adds a GPU version.
```
collecting compile time instruction count for basic_modules_ListOfLinears_eager
compile time instruction count for iteration 0 is 6479525851
compile time instruction count for iteration 1 is 1024432680
compile time instruction count for iteration 2 is 1019417317
compile time instruction count for iteration 3 is 1013603566
compile time instruction count for iteration 4 is 1008853980
compile time instruction count for iteration 5 is 1009541481
compile time instruction count for iteration 6 is 1005025533
compile time instruction count for iteration 7 is 1004116323
compile time instruction count for iteration 8 is 1000828633
compile time instruction count for iteration 9 is 999788323
collecting compile time instruction count for basic_modules_ListOfLinears_inductor
compile time instruction count for iteration 0 is 40837529730
compile time instruction count for iteration 1 is 18411921909
compile time instruction count for iteration 2 is 18383665161
compile time instruction count for iteration 3 is 18348983522
compile time instruction count for iteration 4 is 18349276590
compile time instruction count for iteration 5 is 18353046274
compile time instruction count for iteration 6 is 18346818581
compile time instruction count for iteration 7 is 18340057998
compile time instruction count for iteration 8 is 18331267320
compile time instruction count for iteration 9 is 18328381338
collecting compile time instruction count for basic_modules_ListOfLinears_inductor_gpu
compile time instruction count for iteration 0 is 15408870979
compile time instruction count for iteration 1 is 10949520859
compile time instruction count for iteration 2 is 11058786167
compile time instruction count for iteration 3 is 11003606719
compile time instruction count for iteration 4 is 10896406770
compile time instruction count for iteration 5 is 10982875189
compile time instruction count for iteration 6 is 10931848275
compile time instruction count for iteration 7 is 10956345008
compile time instruction count for iteration 8 is 11045384499
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135730
Approved by: https://github.com/ezyang, https://github.com/anijain2305
2024-09-14 16:45:52 +00:00
b8637503c0 [Pytorch] Consolidate Strobelight compile time profiler between OSS and fbcode (#135953)
Summary:
Move towards consolidating strobelight profiler implementations between OSS and fbcode. This change is a first step towards that.

- Created a new function to abstract out compile time profiling enablement. This function allows profiler to switch between different function profilers (e.g. Thrift based or CLI based)
- Both OSS and Fbcode now use one compile time profiler in torch/_strobelight

Test Plan:
Tested OSS with following commands:
```
python torch/_strobelight/examples/compile_time_profile_example.py
python torch/_strobelight/examples/cli_function_profiler_example.py

TORCH_COMPILE_STROBELIGHT=TRUE TORCHINDUCTOR_FORCE_DISABLE_CACHES=1 python benchmarks/dynamo/huggingface.py --ci --accuracy --timing --explain --inductor --device cuda --training --amp  --only XLNetLMHeadModel
```

See test commands for fbcode in comments.

Differential Revision: D62444551

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135953
Approved by: https://github.com/laithsakka
2024-09-14 16:35:22 +00:00
f97cccf62a [3.13] fix 3.13 pickle error in torch/package (#136049)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136049
Approved by: https://github.com/albanD
ghstack dependencies: #136034
2024-09-14 14:28:09 +00:00
db393fb95e Add Half support for reflection and replication padding on CPU (#135931)
Fixes #135680

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135931
Approved by: https://github.com/Skylion007
2024-09-14 14:18:55 +00:00
23dec79cef Revert "[Dynamo] Use custom backend to reenter metadata tf mode when tracing while/cond (#134732)"
This reverts commit 731b178b56c83966d6e8cdfb0015d22d8f91b4d2.

Reverted https://github.com/pytorch/pytorch/pull/134732 on behalf of https://github.com/mlazos due to broke python test/quantization/pt2e/test_numeric_debugger.py TestNumericDebugger.test_re_export_preserve_handle modified yesterday ([comment](https://github.com/pytorch/pytorch/pull/134732#issuecomment-2350937008))
2024-09-14 10:02:55 +00:00
8c8a3086a7 Revert "[Dynamo] Trace torch function modes entered outside of torch.compile (#133137)"
This reverts commit 4528777e034b157a8329d1879daf52290eea199a.

Reverted https://github.com/pytorch/pytorch/pull/133137 on behalf of https://github.com/mlazos due to broke python test/quantization/pt2e/test_numeric_debugger.py TestNumericDebugger.test_re_export_preserve_handle modified yesterday ([comment](https://github.com/pytorch/pytorch/pull/134732#issuecomment-2350937008))
2024-09-14 10:02:55 +00:00
46f5037007 Revert "[Dynamo] Support thread local setattr (#135443)"
This reverts commit 149d0b716173787df4543186ff74b605aca54e3e.

Reverted https://github.com/pytorch/pytorch/pull/135443 on behalf of https://github.com/mlazos due to broke python test/quantization/pt2e/test_numeric_debugger.py TestNumericDebugger.test_re_export_preserve_handle modified yesterday ([comment](https://github.com/pytorch/pytorch/pull/134732#issuecomment-2350937008))
2024-09-14 10:02:55 +00:00
7975ec3a29 Revert "[Dynamo] Simplify torch function mode stack guard (#135444)"
This reverts commit ce3c74f2744cbc134b95cf8bd53ae5e3fbc67c29.

Reverted https://github.com/pytorch/pytorch/pull/135444 on behalf of https://github.com/mlazos due to broke python test/quantization/pt2e/test_numeric_debugger.py TestNumericDebugger.test_re_export_preserve_handle modified yesterday ([comment](https://github.com/pytorch/pytorch/pull/134732#issuecomment-2350937008))
2024-09-14 10:02:55 +00:00
f3180f0088 Revert "[Dynamo] Trace enter/exit of TorchFunctionModes (#135422)"
This reverts commit 7743149b2be4a9eba7e0997ccdc6abe552bec266.

Reverted https://github.com/pytorch/pytorch/pull/135422 on behalf of https://github.com/mlazos due to broke python test/quantization/pt2e/test_numeric_debugger.py TestNumericDebugger.test_re_export_preserve_handle modified yesterday ([comment](https://github.com/pytorch/pytorch/pull/134732#issuecomment-2350937008))
2024-09-14 10:02:55 +00:00
838c912502 Revert "[Dynamo] Remove ignored modes workaround (#135502)"
This reverts commit 5c67cf180ee53d696f95d7c45dd99a35399e4450.

Reverted https://github.com/pytorch/pytorch/pull/135502 on behalf of https://github.com/mlazos due to broke python test/quantization/pt2e/test_numeric_debugger.py TestNumericDebugger.test_re_export_preserve_handle modified yesterday ([comment](https://github.com/pytorch/pytorch/pull/134732#issuecomment-2350937008))
2024-09-14 10:02:55 +00:00
72b868d034 Revert "[Dynamo] Remove ignored modes from torch function mode stack guard (#135503)"
This reverts commit e77bd0ebd20e96990ccd40518e68bbcfe7fda855.

Reverted https://github.com/pytorch/pytorch/pull/135503 on behalf of https://github.com/mlazos due to broke python test/quantization/pt2e/test_numeric_debugger.py TestNumericDebugger.test_re_export_preserve_handle modified yesterday ([comment](https://github.com/pytorch/pytorch/pull/134732#issuecomment-2350937008))
2024-09-14 10:02:54 +00:00
41b58a1bec OpenReg: Fix issue when copying on the same device (#135956)
Current copy gets wrong value when src and dst are both openreg.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135956
Approved by: https://github.com/albanD
2024-09-14 09:57:45 +00:00
f96a073c9d Use _amp_foreach_non_finite_check_and_unscale_ for CPU grads of ShardedGradScaler (#135232)
Use `_amp_foreach_non_finite_check_and_unscale_` instead of fallback version for CPU grads of `ShardedGradScaler ` as `_amp_foreach_non_finite_check_and_unscale_ ` is supported on CPU https://github.com/pytorch/pytorch/pull/109281.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135232
Approved by: https://github.com/ezyang
2024-09-14 09:53:17 +00:00
a815611db9 [Traceable FSDP2][Partitioner] Must save AC output if output has a backward hook (#135727)
If node is AC region output and has a backward hook on it, we intentionally choose to save it.
This is to work around circular dependencies in Traceable FSDP2+AC.
Example:
```
out = fully_shard(utils.checkpoint(module))(x)
norm_out = layer_norm(out)
```
and there is a circular dependency:
1. In backward, grad_input of layer_norm aka. `out_grad` is actually dependent on `out`.
2. `out` depends on `out`'s backward hook created by FSDP2 (which does all-gather for `module` weights) in order to be recomputed.
3. `out`'s FSDP2 backward hook, as is the case for all eager backward hooks, depends on `out_grad`  -> circular dependency with (1)!

Solution: check whether `out` has a backward hook, and if so, intentionally save `out` in forward graph outputs. With this, we can break the above circular dependency.

----

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135727
Approved by: https://github.com/Chillee
2024-09-14 08:45:58 +00:00
3352c9ac94 Add higher order operator name to the cache bypass exception (#135876)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135876
Approved by: https://github.com/jamesjwu, https://github.com/zou3519
2024-09-14 07:05:29 +00:00
5a2be192d1 [Traceable FSDP2] Don't register RegisterPostBackwardFunction if user intends to use Traceable FSDP2, and assert that compiled autograd is not used when entering RegisterPostBackwardFunction (#135824)
During enablement of Traceable FSDP2 on internal models, sometimes the user only applies torch.compile to some of the FSDP2 instances but not all of them. Such mixed usage pattern is not supported by compiled autograd. Here we try to catch and throw error at such usage pattern, so that the user can fix the usage.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135824
Approved by: https://github.com/awgu
2024-09-14 06:30:12 +00:00
a9bef85263 [CI] Increase open file handles limit to 16K on MacOS (#136061)
May be it will help with flaky failures tracked in https://github.com/pytorch/pytorch/issues/135885

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136061
Approved by: https://github.com/clee2000, https://github.com/kit1980, https://github.com/huydhn, https://github.com/ZainRizvi
2024-09-14 06:16:12 +00:00
44dd218a61 Disable garbage collection during compile_time_instructions count in benchmark base by default. (#135768)
When we measure compile time instruction count, probably we do want in most cases to measure gc instructions
disabling it here by default.
if it is needed we can add an option to allow it, or someone can use the regular total instruction count instead of compile time instruction count.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135768
Approved by: https://github.com/ezyang, https://github.com/anijain2305
2024-09-14 06:15:28 +00:00
1a67e2b680 [MPS] Add native im2col (#135706)
It's called from `torch.unfold` and one of the few remaining vestiges in `MPSFallback.mm`

Strongly inspired by CUDA implementation from 09519eb195/aten/src/ATen/native/cuda/im2col.cuh (L40-L61)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135706
Approved by: https://github.com/albanD
2024-09-14 06:09:36 +00:00
b9b6094793 [ROCm] Skip pointwise associative scan tests due to regression (#135995)
https://github.com/pytorch/pytorch/pull/133012 caused a regression on ROCm causing pointwise scan tests to fail

```
ERROR: test_pointwise_associative_scan_tuple_reverse_True_combine_mode_pointwise_cuda
ERROR: test_pointwise_associative_scan_tuple_reverse_False_combine_mode_pointwise_cuda
ERROR: test_pointwise_associative_scan_complex_pytree_reverse_True_combine_mode_pointwise_cuda
ERROR: test_pointwise_associative_scan_complex_pytree_reverse_False_combine_mode_pointwise_cuda
ERROR: test_pointwise_associative_scan_binary_operator_reverse_True_combine_mode_pointwise_cuda
ERROR: test_pointwise_associative_scan_binary_operator_reverse_False_combine_mode_pointwise_cuda
```

Skipping temporarily while triage is underway.

Full log: https://ossci-raw-job-status.s3.amazonaws.com/log/30067645445

```
  File "/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/_inductor/graph.py", line 1020, in call_function
    out = lowerings[target](*args, **kwargs)  # type: ignore[index]
  File "/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/_inductor/lowering.py", line 363, in wrapped
    out = decomp_fn(*args, **kwargs)
  File "/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/_inductor/lowering.py", line 6245, in associative_scan
    raise RuntimeError("Unable to generate code for associative_scan op")
torch._inductor.exc.LoweringException: RuntimeError: Unable to generate code for associative_scan op
```

NOTE: even "eager" backend fails
```
  File "/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/_higher_order_ops/associative_scan.py", line 338, in associative_scan_op_dense
    raise NotImplementedError("associative_scan is not implemented for eager")
NotImplementedError: associative_scan is not implemented for eager
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135995
Approved by: https://github.com/malfet
2024-09-14 05:40:10 +00:00
911a43f930 [TCPStore] Remove deprecated constructor (#136004)
While looking at TCPStore code again and found it confusing that we still keep the deprecated constructor for TCPStore in cpp while we don't expose it in python via pybind already. I checked both internal and external, all use cases in cpp (aside from unit test fixed in this PR) already moved to using option. So let's remove this legacy constructor to avoid confusion.

Differential Revision: [D62653634](https://our.internmc.facebook.com/intern/diff/D62653634)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136004
Approved by: https://github.com/Skylion007, https://github.com/XilunWu
2024-09-14 04:25:47 +00:00
e77bd0ebd2 [Dynamo] Remove ignored modes from torch function mode stack guard (#135503)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135503
Approved by: https://github.com/anijain2305
ghstack dependencies: #134732, #133137, #135443, #135444, #135422, #135502
2024-09-14 02:41:16 +00:00
5c67cf180e [Dynamo] Remove ignored modes workaround (#135502)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135502
Approved by: https://github.com/anijain2305
ghstack dependencies: #134732, #133137, #135443, #135444, #135422
2024-09-14 02:41:16 +00:00
7743149b2b [Dynamo] Trace enter/exit of TorchFunctionModes (#135422)
This PR implements tracing of with contexts with TorchFunction modes which have the default enter/exit behavior (ie pushing/popping the mode)

Typically the bytecode for a context manager looks like this during a graph break:
1. graph call
2. enter context
3. unsupported code
4. exit context
5. resume call

resume fn structure:
1. enter context
2. jump
...
3. exit context

The issue with torch function modes is that side effects will replay any mutations to the torch function stack performed during tracing. So, we do not need to enter and exit around the unsupported code in the original function (doing so would result in a duplicate torch function mode entry during execution of the unsupported code), and we don't need to enter again in the resume function (the mode that was pushed from the side effects bytecode would still be on the stack).

So for torch function modes the structure of our output code is this:

1. graph call
2. mutate tf mode stack to replay mutations
4. unsupported code
5. on exception restore stack
6. resume function

Then our resume fn looks like this:

1. no-op enter torch function mode
2. jump
3.  exit tf mode

To implement the no-op enter of the torch function mode I added torch function mode in polyfill which no-op enters, but normally exits. This is needed because we still want to trace the with context in the resume function, and exit properly (the exit instructions will still be in the function, so we need to generate instructions to set up the context).

Separately from the bytecode, dynamo also tracks contexts on the block stack, which is how the SETUP_* instructions are implemented. Naturally at a graph break, we exit these block stacks to properly reset the contexts entirely, so that we can re-enter around the unsupported code soundly. However once again, in the torch function mode case, in the event of a graph we do not want to perform any exit side effects because we want to preserve the state of the mode stack as is so that we will properly update the stack with bytecode mentioned in the first section. If we exited here, dynamo would pop the mode off of the symbolic stack, and not update the true python torch function mode stack with the suffix bytecode. All in all, for torch function modes we enter exactly once, update the global torch function mode stack with side effects bytecode, re-read this stack when compiling the resume function, and exit exactly once in the resume function. This matches the semantics of eager exactly.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135422
Approved by: https://github.com/williamwen42
ghstack dependencies: #134732, #133137, #135443, #135444
2024-09-14 02:41:08 +00:00
ce3c74f274 [Dynamo] Simplify torch function mode stack guard (#135444)
The semantics of ignored modes previously had edge cases, this eliminates these by in essence filtering any ignored modes out of both the ref stack and the current torch function mode stack. This is purely to fix complexity in #135422.  The ignored modes handling will be removed in a future PR after https://github.com/pytorch/pytorch/pull/135422 lands, since we will then trace through DeviceContexts vs inserting them into the graph which needed these extra workarounds for correctness.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135444
Approved by: https://github.com/anijain2305, https://github.com/williamwen42
ghstack dependencies: #134732, #133137, #135443
2024-09-14 02:40:59 +00:00
149d0b7161 [Dynamo] Support thread local setattr (#135443)
In preparation for tracing through DeviceContext (defb515306/torch/utils/_device.py (L66))
This PR adds support for calling the setattr of thread local objects. These objects have a slots impl, and since this doesn't appear to have any side effects, we call this setattr impl when replaying mutations, since calling `object.__setattr__` on these objects results in a type error.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135443
Approved by: https://github.com/anijain2305
ghstack dependencies: #134732, #133137
2024-09-14 02:40:52 +00:00
4528777e03 [Dynamo] Trace torch function modes entered outside of torch.compile (#133137)
This PR adds initial tracing for torch function modes.

Details:
In essence, this adds tracing into the torch function of modes entered outside of the torch.compile call.
This does not yet support tracing enter/exit of a torch function mode/ tracing set_default_device properly using the new mode infra (this will be a very good stress test for modes). I am adding more PRs to this stack to support these. The overall plan is to support tracing enter/exit and handling graph breaks like we do other torch.* context managers.

Previously landed:
https://github.com/pytorch/pytorch/pull/133135
https://github.com/pytorch/pytorch/pull/133136
https://github.com/pytorch/pytorch/pull/133134
https://github.com/pytorch/pytorch/pull/133133
https://github.com/pytorch/pytorch/pull/133132
https://github.com/pytorch/pytorch/pull/133131
https://github.com/pytorch/pytorch/pull/133729
https://github.com/pytorch/pytorch/pull/133130

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133137
Approved by: https://github.com/jansel, https://github.com/zou3519
ghstack dependencies: #134732
2024-09-14 02:40:43 +00:00
731b178b56 [Dynamo] Use custom backend to reenter metadata tf mode when tracing while/cond (#134732)
For tracing cond/while in eager, we trace the HOP with the eager backend with metadata torchfunction mode enabled. HOPs disallow the mutation that occurs in this torch function mode, so it is not able to be traced. As a result, we use a custom backend which enters this mode for tracing these HOPs. Thanks to @ydwu4 for the help with implementing this

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134732
Approved by: https://github.com/ydwu4
2024-09-14 02:40:32 +00:00
1786a17fed Revert "Use _amp_foreach_non_finite_check_and_unscale_ for CPU grads of ShardedGradScaler (#135232)"
This reverts commit 51c52061339069a2162e921e5b464fad5a411522.

Reverted https://github.com/pytorch/pytorch/pull/135232 on behalf of https://github.com/CaoE due to wrong commit ([comment](https://github.com/pytorch/pytorch/pull/135232#issuecomment-2350792806))
2024-09-14 02:31:06 +00:00
51c5206133 Use _amp_foreach_non_finite_check_and_unscale_ for CPU grads of ShardedGradScaler (#135232)
Use `_amp_foreach_non_finite_check_and_unscale_` instead of fallback version for CPU grads of `ShardedGradScaler ` as `_amp_foreach_non_finite_check_and_unscale_ ` is supported on CPU https://github.com/pytorch/pytorch/pull/109281.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135232
Approved by: https://github.com/ezyang
2024-09-14 02:20:58 +00:00
2e8d431a8f Fix tensor.data_ptr() representation overflow (#135567)
# Motivation
fix https://github.com/pytorch/pytorch/issues/135550
In PyTorch, [`tensor.data_ptr()`](e889252493/tools/autograd/templates/python_variable_methods.cpp (L204)) is reinterpreted by a [signed int64](e889252493/torch/csrc/autograd/utils/wrap_outputs.h (L50)) data type, which could result in an **overflow issue**, like below:
```python
import torch
a = torch.randn(2).to('xpu')
a.data_ptr()
# one possible output is
-23453392437248
# this is inconsistent with storage.data_ptr()
a.untyped_storage().data_ptr()
# one possible output is
18446720620317114368
```
This PR aims to fix this representation overflow issue to make `tensor.data_ptr()` consistent with [`tensor.untyped_storage().data_ptr()`](c0d2f991b1/torch/csrc/StorageMethods.cpp (L62)). With this PR, the output will become:
```python
import torch
a = torch.randn(2).to('xpu')
a.data_ptr()
# one possible output is
18446720620317114368
# this is consistent with storage.data_ptr()
a.untyped_storage().data_ptr()
# one possible output is
18446720620317114368
```

# Solution
Use `PyLong_FromVoidPtr` to prevent the overflow issue and fit the semantic of `wrap`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135567
Approved by: https://github.com/dvrogozh, https://github.com/EikanWang, https://github.com/albanD
2024-09-14 01:52:04 +00:00
95496e4855 [CI] Check that PyTorch is built with OpenMP (#136060)
Restriction for x86 only builds should have been removed long time ago

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136060
Approved by: https://github.com/clee2000, https://github.com/kit1980, https://github.com/ZainRizvi
2024-09-14 01:51:36 +00:00
5de4cb8cd8 [Inductor UT] Generalize inductor UT for intel GPU (Part 3) (#135827)
[Inductor UT] Reuse Inductor test case for Intel GPU.
Reuse `test/inductor/test_compiled_autograd.py`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135827
Approved by: https://github.com/etaf, https://github.com/desertfire
2024-09-14 01:43:05 +00:00
06bc717410 Fix sum() forward for NJT (#131945)
This PR solves two problems with `sum()` support in NJT:
* `sum()` over a dim with `keepdim=True` returns the wrong shape (i.e. it'll keep the wrong dim). This is a long-standing bug from way back in #112519.
* Historically, we've only supported `sum()` over a dim and not a full reduction. This PR adds the full reduction form (forward only, backward still fails).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131945
Approved by: https://github.com/davidberard98, https://github.com/jananisriram
2024-09-14 00:58:03 +00:00
081c4a966d [BE] Use squeeze/unsqueeze in im2col (#136006)
And move unsqeeze out of the dispatch, as it's dtype agnostic
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136006
Approved by: https://github.com/Skylion007, https://github.com/eqy
2024-09-14 00:35:37 +00:00
4237592b8f [Distributed] add pack-check method for float8_e4m3fn (#135961)
We check 8 x FP8 simultaneously, at size of 8 bytes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135961
Approved by: https://github.com/yifuwang, https://github.com/Skylion007
ghstack dependencies: #135891
2024-09-14 00:32:27 +00:00
a00faf4408 [3.13] fix 3.13 pickle error in serialization.py (#136034)
Error encountered when adding dynamo 3.13 support.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136034
Approved by: https://github.com/albanD
2024-09-14 00:02:40 +00:00
b608ff3bea [Easy] Dont match to mm_plus_mm if not in max autotune (#135929)
It's only an optimization when we tune the triton template.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135929
Approved by: https://github.com/FindHao
2024-09-13 23:38:02 +00:00
b8eef500a6 Fix attr check for quantization spec (#135736)
Summary:
Previously we only checked dtype and is_dynamic to decide if two quantization spec are equivalent
this may not work in some cases, e.g. when people use different qscheme or quant_min/quant_max

This PR added checks for other fields as well

Test Plan:
regression tests

Reviewers:

Subscribers:

Tasks:

Tags:

Differential Revision: [D62530974](https://our.internmc.facebook.com/intern/diff/D62530974)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135736
Approved by: https://github.com/sxu
2024-09-13 23:01:22 +00:00
aad556a0b5 [PT2][Inductor][Optimus] Fix a corner case in remove_split_with_size_one (#135962)
Summary: see context in https://fb.workplace.com/groups/1075192433118967/permalink/1501768230461383/

Test Plan:
# local reproduce
```
CUDA_VISIBLE_DEVICES=3 OC_CAUSE=1 buck2 run mode/opt //scripts/jackiexu0313/pt2:local_model_with_pt2 -- --test_mode batch-split --model_type "mai" --flow_id 642153776
```
P1586356950

# e2e

before fix

f642153776

after fix

Differential Revision: D62625318

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135962
Approved by: https://github.com/jackiexu1992
2024-09-13 22:53:08 +00:00
3c5d44dda5 Cleanup unused runner variants (#136058)
Cleaning up unused runner variants, leaving behind only the few that are actually referenced by workflows

For more details see description in the PR that generated these code changes:
- https://github.com/pytorch/test-infra/pull/5665
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136058
Approved by: https://github.com/wdvr, https://github.com/malfet
2024-09-13 22:50:07 +00:00
e2d3af405f [ONNX] Remove logging apis from public (#133825)
Remove

- torch.onnx.enable_log
- torch.onnx.disable_log
- torch.onnx.set_log_stream
- torch.onnx.log

Because they are not meant for public consumption and has been marked for deprecation.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133825
Approved by: https://github.com/titaiwangms
2024-09-13 22:19:52 +00:00
baff86dafb [MTIA tensor] allow shallow copy between CPU and MTIA tensors (#135871)
Reviewed By: egienvalue, hanzlfs

Differential Revision: D61662214

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135871
Approved by: https://github.com/egienvalue, https://github.com/nautsimon
2024-09-13 22:13:58 +00:00
db5e1b44d2 Fix inductor-micro-benchmark results upload (take 2) (#136052)
I had a brain freeze when I wrote the original fix.  The parameters were in the wrong order.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136052
Approved by: https://github.com/clee2000, https://github.com/kit1980, https://github.com/malfet
2024-09-13 22:05:10 +00:00
a30d5ba16c Fix bug in split-build workflows codegen (#136043)
By just deleting a few rogue lines left out in https://github.com/pytorch/pytorch/pull/135510
If file in workflows folder does not have a `.yml` extensions it will not be launched at all, will it?

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136043
Approved by: https://github.com/kit1980, https://github.com/atalman
2024-09-13 21:29:06 +00:00
46935c8241 Reduce default iterations to 5 . (#135773)
running all benchmarks takes around 15 mins rn, this is the data
https://www.internalfb.com/phabricator/paste/view/P1583590240
the data looks mostly stable, and 5 iterations should be good, specially with our 1.5% threshold.
that said, the diff also add a way to increase the number of iterations for a specific benchmark.

after the change results
https://www.internalfb.com/phabricator/paste/view/P1583618969
time is down to half (7 mins)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135773
Approved by: https://github.com/ezyang, https://github.com/anijain2305
2024-09-13 21:16:38 +00:00
4f407c1884 Only measure compile time instruction count for sum_floordiv benchmark (#135785)
there was a recent strange noise +5%, -5%.
using only compile time :
1) avoid gc time .
2) avoid other operations that are not what we try to measure by this. ==> less probable noise.
```
collecting compile time instruction count for sum_floordiv_regression
compile time instruction count for iteration 0 is 8899290248
compile time instruction count for iteration 1 is 1188830489
compile time instruction count for iteration 2 is 1180579615
compile time instruction count for iteration 3 is 1176263131
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135785
Approved by: https://github.com/avikchaudhuri, https://github.com/anijain2305
2024-09-13 21:14:10 +00:00
2e461e54e8 Add gpu and gpu_dynamic versions of add_loop (#135809)
I am thinking maybe 3 iterations are enough for this one?
- so I am keeping eager and inductor since inductor is 2X eager time
- Eager dynamic is 2X eager so keeping this as well.
- inductor have three tests. (dynamic gpu, gpu and cpu)
I am unsure if am over profiling here happy to trim if anyone have suggestions.
```
collecting compile time instruction count for add_loop_eager
compile time instruction count for iteration 0 is 8213664211
compile time instruction count for iteration 1 is 2798628246
compile time instruction count for iteration 2 is 2796811362
compile time instruction count for iteration 3 is 2794438188
compile time instruction count for iteration 4 is 2794634117
collecting compile time instruction count for add_loop_eager_dynamic
compile time instruction count for iteration 0 is 5724108021
compile time instruction count for iteration 1 is 5499908609
compile time instruction count for iteration 2 is 5569101366
compile time instruction count for iteration 3 is 5493806364
compile time instruction count for iteration 4 is 5493169851
collecting compile time instruction count for add_loop_inductor
compile time instruction count for iteration 0 is 49789381222
compile time instruction count for iteration 1 is 25769347393
compile time instruction count for iteration 2 is 25772594322
compile time instruction count for iteration 3 is 25768695952
compile time instruction count for iteration 4 is 25768032314
collecting compile time instruction count for add_loop_inductor_gpu
compile time instruction count for iteration 0 is 23966942581
compile time instruction count for iteration 1 is 23771950919
compile time instruction count for iteration 2 is 23770784286
compile time instruction count for iteration 3 is 23780160875
compile time instruction count for iteration 4 is 23774634465
collecting compile time instruction count for add_loop_inductor_dynamic_gpu
compile time instruction count for iteration 0 is 41505055086
compile time instruction count for iteration 1 is 41293654089
compile time instruction count for iteration 2 is 41301016100
compile time instruction count for iteration 3 is 41306056207
compile time instruction count for iteration 4 is 41308171566
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135809
Approved by: https://github.com/ezyang, https://github.com/anijain2305
2024-09-13 20:42:31 +00:00
a3d827a28c Use python 3.11 for Large Wheel build (#136042)
Use Python 3.11 in nightly Large wheel builds. Required for Colab testing

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136042
Approved by: https://github.com/kit1980, https://github.com/malfet

Co-authored-by: Sergii Dymchenko <kit1980@gmail.com>
2024-09-13 20:27:11 +00:00
4312794b92 [reland][export] fix re-export custom metadata (#135720)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/134778

The previous D62304294 broke some executorch tests. It has already been reverted.

In this diff, `_collect_param_buffer_metadata()` is modified in a way that when a `call_function` node is encountered and its input nodes include `get_attr`. We skip the fields that have been collected previously and only collect rest of the fields. This prevents over-writing.

Test Plan:
```
buck2 test 'fbcode//mode/dev-nosan' fbcode//executorch/backends/xnnpack/test:test_xnnpack_ops

buck2 test 'fbcode//mode/dev-nosan' fbcode//caffe2/test/quantization:test_quantization -- -r test_re_export_preserve_handle

buck2 test 'fbcode//mode/dev-nosan' fbcode//caffe2/test/quantization:test_quantization -- -r test_run_decompositions_preserve_handle
```

Differential Revision: D62514208

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135720
Approved by: https://github.com/zhxchen17, https://github.com/jerryzh168
2024-09-13 20:15:15 +00:00
b856f3539b Fix script name in the comments (#135507)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135507
Approved by: https://github.com/atalman
2024-09-13 19:59:47 +00:00
835e7bb077 fix requirements.txt installation failure issue on Windows (#134567)
Fixes #134564

Root cause:

The `lintrunner` wheel released on [pypi.org](https://pypi.org/project/lintrunner/#files) only supports Windows 32bit and Linux 64bit. Since compilation of pytorch requires a 64bit env, on windows, the `lintrunner` has to be compiled from source distribution. `Rust` is its dependency for compilation, as indicated in the error message. Meanwhile, Visual Studio environment is needed for linking libraries..

![image](https://github.com/user-attachments/assets/180cd899-8886-43b5-b42f-031f41e81683)

Issue when performing `pip install lintrunner` without a Visual Studio environment activated is shown below.

```bash
>python -m pip install lintrunner
Collecting lintrunner
  Downloading lintrunner-0.12.5.tar.gz (62 kB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Building wheels for collected packages: lintrunner
  Building wheel for lintrunner (pyproject.toml) ... error
  error: subprocess-exited-with-error

  × Building wheel for lintrunner (pyproject.toml) did not run successfully.
  │ exit code: 1
  ╰─> [137 lines of output]
      Running `maturin pep517 build-wheel -i C:\Users\\miniforge3\envs\py310\python.exe --compatibility off`
      📡 Using build options bindings from pyproject.toml
         Compiling proc-macro2 v1.0.79
         Compiling unicode-ident v1.0.12
         Compiling version_check v0.9.4
         Compiling windows_x86_64_msvc v0.52.4
         Compiling winapi v0.3.9
         Compiling serde v1.0.197
         Compiling autocfg v1.2.0
         Compiling syn v1.0.109
         Compiling lazy_static v1.4.0
         Compiling libc v0.2.153
         Compiling equivalent v1.0.1
         Compiling hashbrown v0.14.3
         Compiling memchr v2.7.2
         Compiling yansi v1.0.1
         Compiling unicode-width v0.1.11
         Compiling regex-syntax v0.8.3
         Compiling encode_unicode v0.3.6
         Compiling cfg-if v1.0.0
         Compiling winnow v0.6.5
         Compiling cc v1.0.92
      error: could not compile `windows_x86_64_msvc` (build script) due to 2 previous errors
      warning: build failed, waiting for other jobs to finish...
      error: could not compile `serde` (build script) due to 2 previous errors
      error: could not compile `proc-macro2` (build script) due to 2 previous errors
      error: could not compile `syn` (build script) due to 2 previous errors
      error: could not compile `libc` (build script) due to 2 previous errors
      error: could not compile `winapi` (build script) due to 2 previous errors
      💥 maturin failed
        Caused by: Failed to build a native library through cargo
        Caused by: Cargo build finished with "exit code: 101": `cargo rustc --manifest-path Cargo.toml --message-format json --release --bins --`
      📦 Including license file "LICENSE"
      🔗 Found bin bindings
      error: linker `link.exe` not found
        |
        = note: program not found

      note: the msvc targets depend on the msvc linker but `link.exe` was not found

      note: please ensure that Visual Studio 2017 or later, or Build Tools for Visual Studio were installed with the Visual C++ option.

      note: VS Code is a different product, and is not sufficient.

      error: aborting due to 1 previous error

      error: linker `link.exe` not found
        |
        = note: program not found

      note: the msvc targets depend on the msvc linker but `link.exe` was not found

      note: please ensure that Visual Studio 2017 or later, or Build Tools for Visual Studio were installed with the Visual C++ option.

      note: VS Code is a different product, and is not sufficient.

      error: aborting due to 1 previous error

      error: linker `link.exe` not found
        |
        = note: program not found

      note: the msvc targets depend on the msvc linker but `link.exe` was not found

      note: please ensure that Visual Studio 2017 or later, or Build Tools for Visual Studio were installed with the Visual C++ option.

      note: VS Code is a different product, and is not sufficient.

      error: aborting due to 1 previous error

      error: linker `link.exe` not found
        |
        = note: program not found

      note: the msvc targets depend on the msvc linker but `link.exe` was not found

      note: please ensure that Visual Studio 2017 or later, or Build Tools for Visual Studio were installed with the Visual C++ option.

      note: VS Code is a different product, and is not sufficient.

      error: aborting due to 1 previous error

      error: linker `link.exe` not found
        |
        = note: program not found

      note: the msvc targets depend on the msvc linker but `link.exe` was not found

      note: please ensure that Visual Studio 2017 or later, or Build Tools for Visual Studio were installed with the Visual C++ option.

      note: VS Code is a different product, and is not sufficient.

      error: aborting due to 1 previous error

      error: linker `link.exe` not found
        |
        = note: program not found

      note: the msvc targets depend on the msvc linker but `link.exe` was not found

      note: please ensure that Visual Studio 2017 or later, or Build Tools for Visual Studio were installed with the Visual C++ option.

      note: VS Code is a different product, and is not sufficient.

      error: aborting due to 1 previous error

      Error: command ['maturin', 'pep517', 'build-wheel', '-i', 'C:\\Users\\\\miniforge3\\envs\\py310\\python.exe', '--compatibility', 'off'] returned non-zero exit status 1
      [end of output]

  note: This error originates from a subprocess, and is likely not a problem with pip.
  ERROR: Failed building wheel for lintrunner
Failed to build lintrunner
ERROR: ERROR: Failed to build installable wheels for some pyproject.toml based projects (lintrunner)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134567
Approved by: https://github.com/malfet
2024-09-13 18:43:55 +00:00
b6d6aa49b8 Revert "Validate input types for torch.nn.Linear and torch.nn.Bilinear (#135596)"
This reverts commit e157ce3ebbb3f30d008c15914e82eb74217562f0.

Reverted https://github.com/pytorch/pytorch/pull/135596 on behalf of https://github.com/malfet due to It's too restrictive, should allow other int-like types, such as `numpy.int64` ([comment](https://github.com/pytorch/pytorch/pull/135596#issuecomment-2349714104))
2024-09-13 18:06:56 +00:00
deee21cb78 Revert "[Inductor] Rename cpp_wrapper_cuda.py as cpp_wrapper_gpu.py (#135313)"
This reverts commit 16b37b309f64ddd4e498c57a99191e1d9b3dfdac.

Reverted https://github.com/pytorch/pytorch/pull/135313 on behalf of https://github.com/izaitsevfb due to breaks internal builds ([comment](https://github.com/pytorch/pytorch/pull/135313#issuecomment-2349662091))
2024-09-13 17:53:21 +00:00
3f69410976 [gpu-profiler] Expose active and repeat in os env var (#135757)
Summary: https://fb.workplace.com/groups/ai.efficiency.tools.users/permalink/1855136444971825/

Test Plan:
`buck2 test mode/opt caffe2/test:profiler -- -r test_kineto_profiler_api `

eyes

Differential Revision: D62529249

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135757
Approved by: https://github.com/Yuzhen11
2024-09-13 17:48:27 +00:00
18f9331e5d Revert "[aoti] Fix workspace generation for triton (#135552)"
This reverts commit d3833253928f29ed760b2dccac2b730028a868ca.

Reverted https://github.com/pytorch/pytorch/pull/135552 on behalf of https://github.com/izaitsevfb due to blocks revert of #135313, internal failures, see D62511427 ([comment](https://github.com/pytorch/pytorch/pull/135552#issuecomment-2349641372))
2024-09-13 17:47:36 +00:00
bc0f330169 [trymerge] Manually close merged PR when Github fails (#135890)
Manually close merged PR when Github fails to do it.

Consequences of current design:
Sleeping for 1 min uses up the machine, might result in race conditions, results in merging label to removed a bit later, pr still left open if this api fails too (ie no async clean up job)

Tested in https://github.com/malfet/deleteme/pull/92 by removing the part of the commit message that has "resolved #pr num"
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135890
Approved by: https://github.com/malfet, https://github.com/huydhn
2024-09-13 17:29:24 +00:00
7834c0bb2c [AOTI][Tooling] Add stats summary (mean/min/max, etc) for jit inductor tensor value printing (#135887)
Summary:
As title. Follow up to add stats summary (mean/min/max, etc) for jit inductor tensor value printing as well.

The inductor python wrapper code level printing would look something like this:

 {F1859224287}

Test Plan: CI

Reviewed By: chenyang78

Differential Revision: D62415575

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135887
Approved by: https://github.com/chenyang78
2024-09-13 17:19:25 +00:00
6ef49fe8f1 Revert "Pass ideep:lowp_kind to matmul_forward::compute on cache misses (#135058)"
This reverts commit 3d2431380999252d5401f83d5010b398a32e7597.

Reverted https://github.com/pytorch/pytorch/pull/135058 on behalf of https://github.com/malfet due to It regresses x86 performance ([comment](https://github.com/pytorch/pytorch/pull/135058#issuecomment-2349480861))
2024-09-13 17:09:45 +00:00
a15774563b [ROCm] Enable ROCm support for inductor's dynamic_rblock_scaling (#129663)
As of ROCm 6.1 [hipDeviceProp_t::regsPerMultiprocessor](https://rocm.docs.amd.com/projects/HIP/en/latest/doxygen/html/structhip_device_prop__t.html#a7390d5b180d63978c81aa971060270b4) is now available allowing us to enable this attribute on ROCm.
```
>>> torch.cuda.get_device_properties(0)
_CudaDeviceProperties(name='AMD Instinct MI250X/MI250', major=9, minor=0, gcnArchName='gfx90a:sramecc+:xnack-', total_memory=65520MB, multi_processor_count=104)
>>> torch.cuda.get_device_properties(0).regs_per_multiprocessor
65536
```

With https://github.com/triton-lang/triton/pull/3962we can extract n_regs and n_spells from a triton binary with AMD backend allowing us to enable inductor's dynamic_rblock_scaling on ROCm initially implemented in https://github.com/pytorch/pytorch/pull/115094

Leaving this in draft until following PRs have landed:
- https://github.com/pytorch/pytorch/pull/129361 to bump the triton commit pin
- https://github.com/pytorch/pytorch/pull/128449 to allow us to grab warp_size from device properties instead of hard coding 64 on ROCm.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129663
Approved by: https://github.com/jansel, https://github.com/shunting314
2024-09-13 16:45:39 +00:00
564d00f364 Revert "Fix clang-tidy warnings in Caffe2 code (#134935)"
This reverts commit 7cfd23636c8fa6fcbb8bf3ea34e15b847ec9ad9d.

Reverted https://github.com/pytorch/pytorch/pull/134935 on behalf of https://github.com/izaitsevfb due to breaks internal builds, caffe2 is still used internally ([comment](https://github.com/pytorch/pytorch/pull/134935#issuecomment-2349368152))
2024-09-13 16:42:37 +00:00
ae02d663cd [FlexAttention] Fix output layout (#135882)
We previously only supported the same v_head dim and + qk_head dim. When allowed for different head-dims I accidently kept the same query strides for the output. This PR fixes this bug as well it ensures that we always produce output in the same stride order as the input query.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135882
Approved by: https://github.com/yanboliang, https://github.com/Chillee
2024-09-13 16:36:05 +00:00
ad2f0e9f81 Add remote cache time saved to compilation metrics (#135490)
Summary:
Record remote cache time saved via frame_phase_timing

We add to the "phase" when remote cache hits and saves us time, so that we have a 1:1 correspondence between a frame and time saved.

Test Plan:
Internally run benchmark, see that it's populated in sandbox table after previous diff lands and logger config is actualized.

Show that column exists in table:

https://fburl.com/scuba/logger_staging_jjwu_30582a48f1ff9cf5f4ac50a4c40af/fp2te0ff

Note that an earlier version of D62105258 had the column as a string so the staging table is a bit messed up. But you can see the most recent samples have the column populates as a float.

Reviewed By: aorenste

Differential Revision: D62106921

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135490
Approved by: https://github.com/aorenste
2024-09-13 16:35:51 +00:00
21ffa18ad1 Fix "expand: SymIntArrayRef expected to contain only concrete integers" in AOTInductor (#135933)
Internal xref:
https://fb.workplace.com/groups/1075192433118967/permalink/1501860707118802/

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135933
Approved by: https://github.com/angelayi
2024-09-13 15:23:42 +00:00
eqy
2519e5a8de [CUDA][FP8] Skip rowwise scaling test on sm89 (#135718)
Same reason as #https://github.com/pytorch/pytorch/pull/133612, rowwise scaling implementation is sm90+ specific (e.g., uses TMA)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135718
Approved by: https://github.com/Skylion007
2024-09-13 15:07:20 +00:00
ba6e0f31ab Remove cycle dependency by localizing the import. (#135926)
Summary:
Since https://www.internalfb.com/diff/D62215095 landed there has been many silence errors due to the dependency between functional_tensor and config.

```
 File "/tmp/torch_deploy_zip5YRJC1/torch_python_modules.zip/torch/export/__init__.py", line 64, in <module>
  File "/tmp/torch_deploy_zip5YRJC1/torch_python_modules.zip/torch/export/dynamic_shapes.py", line 23, in <module>
  File "/tmp/torch_deploy_zip5YRJC1/torch_python_modules.zip/torch/export/exported_program.py", line 26, in <module>
  File "/tmp/torch_deploy_zip5YRJC1/torch_python_modules.zip/torch/_higher_order_ops/__init__.py", line 1, in <module>
  File "/tmp/torch_deploy_zip5YRJC1/torch_python_modules.zip/torch/_higher_order_ops/cond.py", line 6, in <module>
  File "/tmp/torch_deploy_zip5YRJC1/torch_python_modules.zip/torch/_subclasses/functional_tensor.py", line 9, in <module>
  File "/tmp/torch_deploy_zip5YRJC1/torch_python_modules.zip/torch/_inductor/config.py", line 44, in <module>
```

https://fburl.com/logarithm/ol5kx0ee
complaining about a cycle dependency

this fix it.

Test Plan: buck test multipy/runtime:test_deploy_embedded_cuda_interp_without_cuda_available -- --run-disabled TorchpyTest.AcquireMultipleSessionsInDifferentPackages

Reviewed By: aorenste

Differential Revision: D62616765

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135926
Approved by: https://github.com/aorenste, https://github.com/oulgen, https://github.com/Skylion007
2024-09-13 15:05:41 +00:00
7ed0563cad Revert "[Dynamo] Use custom backend to reenter metadata tf mode when tracing while/cond (#134732)"
This reverts commit e504fb70693d4a3741c3380b6a989d441e84f737.

Reverted https://github.com/pytorch/pytorch/pull/134732 on behalf of https://github.com/albanD due to Broke tests on main ([comment](https://github.com/pytorch/pytorch/pull/134732#issuecomment-2348886378))
2024-09-13 12:52:58 +00:00
eb7dd91dd1 Revert "[Dynamo] Trace torch function modes entered outside of torch.compile (#133137)"
This reverts commit fafdd588f27e1d56090c6d260d0382c255eaf9eb.

Reverted https://github.com/pytorch/pytorch/pull/133137 on behalf of https://github.com/albanD due to Broke tests on main ([comment](https://github.com/pytorch/pytorch/pull/134732#issuecomment-2348886378))
2024-09-13 12:52:58 +00:00
3f30360d05 Revert "[Dynamo] Support thread local setattr (#135443)"
This reverts commit 30b007bea329f512af3dc4fd4e6c7d145e807b71.

Reverted https://github.com/pytorch/pytorch/pull/135443 on behalf of https://github.com/albanD due to Broke tests on main ([comment](https://github.com/pytorch/pytorch/pull/134732#issuecomment-2348886378))
2024-09-13 12:52:58 +00:00
4734e356d6 Revert "[Dynamo] Simplify torch function mode stack guard (#135444)"
This reverts commit 0c080cb2c78a85a5320fbeadbbb9a2cc640fd89d.

Reverted https://github.com/pytorch/pytorch/pull/135444 on behalf of https://github.com/albanD due to Broke tests on main ([comment](https://github.com/pytorch/pytorch/pull/134732#issuecomment-2348886378))
2024-09-13 12:52:57 +00:00
ac169795a9 Revert "[Dynamo] Trace enter/exit of TorchFunctionModes (#135422)"
This reverts commit 2af3b8ffd84e36b91279174e9106f84b2d2a11f2.

Reverted https://github.com/pytorch/pytorch/pull/135422 on behalf of https://github.com/albanD due to Broke tests on main ([comment](https://github.com/pytorch/pytorch/pull/134732#issuecomment-2348886378))
2024-09-13 12:52:57 +00:00
fca58bfda1 Revert "[Dynamo] Remove ignored modes workaround (#135502)"
This reverts commit 7d5e0dd4b1a8d20fc8624b3085a6f5ddedd89a2e.

Reverted https://github.com/pytorch/pytorch/pull/135502 on behalf of https://github.com/albanD due to Broke tests on main ([comment](https://github.com/pytorch/pytorch/pull/134732#issuecomment-2348886378))
2024-09-13 12:52:57 +00:00
dc71e7a7d4 Revert "[Dynamo] Remove ignored modes from torch function mode stack guard (#135503)"
This reverts commit c56728b643e2b7d796abd7ec45803319e1c5967d.

Reverted https://github.com/pytorch/pytorch/pull/135503 on behalf of https://github.com/albanD due to Broke tests on main ([comment](https://github.com/pytorch/pytorch/pull/134732#issuecomment-2348886378))
2024-09-13 12:52:57 +00:00
1cdf658f4a Revert "[PT2][inductor][Optimus] Add pad_aten_mm_pass pattern to resolve long computation kernel in LCE (#135167)"
This reverts commit eb0fe029337b31bcb3d4b2d1e539895393975d68.

Reverted https://github.com/pytorch/pytorch/pull/135167 on behalf of https://github.com/jithunnair-amd due to Broke ROCm CI eg. https://github.com/pytorch/pytorch/actions/runs/10845542664/job/30097957154 ([comment](https://github.com/pytorch/pytorch/pull/135167#issuecomment-2348847595))
2024-09-13 12:35:05 +00:00
b5c52e96e8 Revert "[dynamo] Fix support for classmethod(property(...)) (#134968)"
This reverts commit bf68e16e94fc05f10d434cdc162a14d02c6ad23c.

Reverted https://github.com/pytorch/pytorch/pull/134968 on behalf of https://github.com/jithunnair-amd due to Broke ROCm CI: eg. https://github.com/pytorch/pytorch/actions/runs/10845542664/job/30097956613 ([comment](https://github.com/pytorch/pytorch/pull/134968#issuecomment-2348837553))
2024-09-13 12:29:03 +00:00
ea2ecab15b [AOTI][reland] Fix assert_function call in cpu autotune template (#135920)
Summary: Reland https://github.com/pytorch/pytorch/pull/135086. In the ABI-compatible mode, assert_function should be AOTI_TORCH_CHECK.

Test Plan: CI

Differential Revision: D62500592

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135920
Approved by: https://github.com/chenyang78
2024-09-13 12:21:57 +00:00
2f53d570fe Update document for autocast on CPU (#135299)
Update document for autocast on CPU due to the support of float16 and changes in the operator list.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135299
Approved by: https://github.com/jgong5, https://github.com/leslie-fang-intel, https://github.com/svekars
2024-09-13 09:11:47 +00:00
31007cf200 [Distributed] add FP8 support to NaN checker (#135891)
Adding support for `torch.float8_e4m3fn` and `torch.float8_e5m2`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135891
Approved by: https://github.com/wconstab
2024-09-13 08:43:54 +00:00
c56728b643 [Dynamo] Remove ignored modes from torch function mode stack guard (#135503)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135503
Approved by: https://github.com/anijain2305
ghstack dependencies: #134732, #133137, #135443, #135444, #135422, #135502
2024-09-13 08:41:32 +00:00
7d5e0dd4b1 [Dynamo] Remove ignored modes workaround (#135502)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135502
Approved by: https://github.com/anijain2305
ghstack dependencies: #134732, #133137, #135443, #135444, #135422
2024-09-13 08:41:32 +00:00
2af3b8ffd8 [Dynamo] Trace enter/exit of TorchFunctionModes (#135422)
This PR implements tracing of with contexts with TorchFunction modes which have the default enter/exit behavior (ie pushing/popping the mode)

Typically the bytecode for a context manager looks like this during a graph break:
1. graph call
2. enter context
3. unsupported code
4. exit context
5. resume call

resume fn structure:
1. enter context
2. jump
...
3. exit context

The issue with torch function modes is that side effects will replay any mutations to the torch function stack performed during tracing. So, we do not need to enter and exit around the unsupported code in the original function (doing so would result in a duplicate torch function mode entry during execution of the unsupported code), and we don't need to enter again in the resume function (the mode that was pushed from the side effects bytecode would still be on the stack).

So for torch function modes the structure of our output code is this:

1. graph call
2. mutate tf mode stack to replay mutations
4. unsupported code
5. on exception restore stack
6. resume function

Then our resume fn looks like this:

1. no-op enter torch function mode
2. jump
3.  exit tf mode

To implement the no-op enter of the torch function mode I added torch function mode in polyfill which no-op enters, but normally exits. This is needed because we still want to trace the with context in the resume function, and exit properly (the exit instructions will still be in the function, so we need to generate instructions to set up the context).

Separately from the bytecode, dynamo also tracks contexts on the block stack, which is how the SETUP_* instructions are implemented. Naturally at a graph break, we exit these block stacks to properly reset the contexts entirely, so that we can re-enter around the unsupported code soundly. However once again, in the torch function mode case, in the event of a graph we do not want to perform any exit side effects because we want to preserve the state of the mode stack as is so that we will properly update the stack with bytecode mentioned in the first section. If we exited here, dynamo would pop the mode off of the symbolic stack, and not update the true python torch function mode stack with the suffix bytecode. All in all, for torch function modes we enter exactly once, update the global torch function mode stack with side effects bytecode, re-read this stack when compiling the resume function, and exit exactly once in the resume function. This matches the semantics of eager exactly.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135422
Approved by: https://github.com/williamwen42
ghstack dependencies: #134732, #133137, #135443, #135444
2024-09-13 08:41:24 +00:00
0c080cb2c7 [Dynamo] Simplify torch function mode stack guard (#135444)
The semantics of ignored modes previously had edge cases, this eliminates these by in essence filtering any ignored modes out of both the ref stack and the current torch function mode stack. This is purely to fix complexity in #135422.  The ignored modes handling will be removed in a future PR after https://github.com/pytorch/pytorch/pull/135422 lands, since we will then trace through DeviceContexts vs inserting them into the graph which needed these extra workarounds for correctness.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135444
Approved by: https://github.com/anijain2305, https://github.com/williamwen42
ghstack dependencies: #134732, #133137, #135443
2024-09-13 08:41:17 +00:00
30b007bea3 [Dynamo] Support thread local setattr (#135443)
In preparation for tracing through DeviceContext (defb515306/torch/utils/_device.py (L66))
This PR adds support for calling the setattr of thread local objects. These objects have a slots impl, and since this doesn't appear to have any side effects, we call this setattr impl when replaying mutations, since calling `object.__setattr__` on these objects results in a type error.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135443
Approved by: https://github.com/anijain2305
ghstack dependencies: #134732, #133137
2024-09-13 08:41:07 +00:00
fafdd588f2 [Dynamo] Trace torch function modes entered outside of torch.compile (#133137)
This PR adds initial tracing for torch function modes.

Details:
In essence, this adds tracing into the torch function of modes entered outside of the torch.compile call.
This does not yet support tracing enter/exit of a torch function mode/ tracing set_default_device properly using the new mode infra (this will be a very good stress test for modes). I am adding more PRs to this stack to support these. The overall plan is to support tracing enter/exit and handling graph breaks like we do other torch.* context managers.

Previously landed:
https://github.com/pytorch/pytorch/pull/133135
https://github.com/pytorch/pytorch/pull/133136
https://github.com/pytorch/pytorch/pull/133134
https://github.com/pytorch/pytorch/pull/133133
https://github.com/pytorch/pytorch/pull/133132
https://github.com/pytorch/pytorch/pull/133131
https://github.com/pytorch/pytorch/pull/133729
https://github.com/pytorch/pytorch/pull/133130

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133137
Approved by: https://github.com/jansel, https://github.com/zou3519
ghstack dependencies: #134732
2024-09-13 08:41:00 +00:00
e504fb7069 [Dynamo] Use custom backend to reenter metadata tf mode when tracing while/cond (#134732)
For tracing cond/while in eager, we trace the HOP with the eager backend with metadata torchfunction mode enabled. HOPs disallow the mutation that occurs in this torch function mode, so it is not able to be traced. As a result, we use a custom backend which enters this mode for tracing these HOPs. Thanks to @ydwu4 for the help with implementing this

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134732
Approved by: https://github.com/ydwu4
2024-09-13 08:40:50 +00:00
b346e99376 remove fast_flush arguments (#135387)
I've removed them from upstream Triton in https://github.com/triton-lang/triton/pull/4485. It looks like most places in the code use the default value of `fast_flush=True` anyway, though there are two PRs from @pearu that use `False`. To my knowledge, there's no reason to use the `False` value.

Differential Revision: [D62325778](https://our.internmc.facebook.com/intern/diff/D62325778)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135387
Approved by: https://github.com/nmacchioni, https://github.com/jansel
2024-09-13 08:13:46 +00:00
7dc1788396 [inductor] Remove the batch fusion passes from being a default (#135922)
Ads team do a search internally to figure out which fusion passes to use.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135922
Approved by: https://github.com/eellison, https://github.com/yanboliang
ghstack dependencies: #135819
2024-09-13 06:07:33 +00:00
9fd54d787d [Inductor UT] Generalize device-bias code in test_triton_kernels.py introduced in #135530 (#135656)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135656
Approved by: https://github.com/EikanWang, https://github.com/zou3519
2024-09-13 05:27:56 +00:00
b38be727eb [Inductor UT] Generalize inductor UT for intel GPU (Part 2) (#134556)
[Inductor UT] Reuse Inductor test case for Intel GPU.
Reuse `test/inductor/test_torchinductor_opinfo.py`
Reuse `test/inductor/test_minifier_isolate.py`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134556
Approved by: https://github.com/etaf, https://github.com/eellison
2024-09-13 05:16:28 +00:00
e54b559e88 [inductor] More fixes on the keys of constants and signature dictionaries (#135406)
Previous PR forgets to change two other places that also create `constants` and `signature`. https://github.com/pytorch/pytorch/pull/135170

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135406
Approved by: https://github.com/jansel
2024-09-13 04:10:41 +00:00
eea5e6ff0f [DCP][DSD] Add a test case to demonstrate the workaround to load full state dict into a 2D model (#135763)
Fix https://github.com/pytorch/pytorch/issues/134095

This is a workaround for loading full state dict into a FSDP1+TP 2D model.
Since named_parameters() in FSDP1 does not return DTensor, we don't have the information to shard the full_state_dict and load it directly into the 2d model. In order to load a full state dict in FSDP1+TP 2D model, we need to do:
- load the full state dict into a 1D FSDP model
- dcp.save the full/shard state dict into storage
- initialize a 2D FSDP1+TP model
- get the default sharded state dict for the 2D model (full_state_dict=False)
- dcp.load the state dict from storage
- load the state dict into the 2D model
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135763
Approved by: https://github.com/fegin
ghstack dependencies: #135725
2024-09-13 03:51:14 +00:00
6df91b5917 real tensor prop for composite ops (#135717)
Fixes #135632

Adds real tensor propagation for decompositions, checking any symbols on their outputs

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135717
Approved by: https://github.com/ezyang
2024-09-13 03:35:16 +00:00
0cdc6a8dcd [DSD] Fix distributed state dict full_state_dict option hang during set_state_dict (#135725)
Fix https://github.com/pytorch/pytorch/issues/134095
This fix distributed state dict full_state_dict option hang during set_state_dict. We switch `_distribute_tensors` in _state_dict_utils.py to use `DTensor.from_local` instead of `distribute_tensor` to support FSDP2+TP 2D strided sharding use case, as `distribute_tensor` cannot handle strided sharding yet. `distribute_tensor` incurs a scatter behind the scenes, while `DTensor.from_local` takes the local slice from the full tensor on each rank to create the DTensor (no collective).  This means it's the user's responsibility to make sure the full_tensor from the full_state_dict is the same across all ranks.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135725
Approved by: https://github.com/fegin
2024-09-13 03:26:36 +00:00
6cdc70bccd [ROCm] skip test_fp8_cast_and_t on non-MI300 machines (#135917)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135917
Approved by: https://github.com/malfet
2024-09-13 02:46:48 +00:00
e6b68359d7 Fix xpu memory stats error (#135818)
# Motivation
fix https://github.com/pytorch/pytorch/issues/135726
After merging two free blocks, I made a stupid mistake of ignoring the correct size to decrease the active memory size, which should be the original block size instead of the merged block size.

# Additional Context
Add a UT to guard this scenario.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135818
Approved by: https://github.com/EikanWang
2024-09-13 02:41:21 +00:00
1c04cbfba6 [BE] Use C10_UNUSED (#135914)
Instead of `(void)foo; // Suppress unused variable`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135914
Approved by: https://github.com/huydhn, https://github.com/eqy
2024-09-13 02:27:07 +00:00
062681a0ed [Profiler] Torch Profiler distributed info is not JSON serializable (#135548)
Summary: To fix https://github.com/pytorch/pytorch/issues/133308 we must create an encoder for numpy values so we can serialize the distributed metadata to JSON.

Test Plan: Added unit test to check that numpy values can be serialized

Differential Revision: D62411619

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135548
Approved by: https://github.com/aaronenyeshi, https://github.com/albanD
2024-09-13 02:22:33 +00:00
8c356ce3da Fix lint errors in fbcode (#135614)
Summary: Fixed a bunch of fbcode imports that happened to work but confused autodeps.  After this autodeps still suggests "improvements" to TARGETS (which breaks our builds) but at least it can find all the imports.

Test Plan:
```
fbpython fbcode/tools/build/buck/linters/lint_autoformat.py --linter=autodeps --default-exec-timeout=1800 -- fbcode/caffe2/TARGETS fbcode/caffe2/test/TARGETS
```
Before:
```
ERROR while processing caffe2/test/TARGETS: Cannot find an owner for "test_export" (from caffe2/test/export/testing.py:229) when processing rule "test_export". Please make sure it's listed in the srcs parameter of another rule. See https://fbur$
ERROR while processing caffe2/test/TARGETS: Cannot find an owner for "testing" (from caffe2/test/export/test_export.py:87) when processing rule "test_export". Please make sure it's listed in the srcs parameter of another rule. See https://fburl$
ERROR while processing caffe2/test/TARGETS: Cannot find an owner for "test_export" (from caffe2/test/export/test_serdes.py:9) when processing rule "test_export". Please make sure it's listed in the srcs parameter of another rule. See https://fb$
ERROR while processing caffe2/test/TARGETS: Cannot find an owner for "testing" (from caffe2/test/export/test_serdes.py:10) when processing rule "test_export". Please make sure it's listed in the srcs parameter of another rule. See https://fburl$
ERROR while processing caffe2/test/TARGETS: Cannot find an owner for "testing" (from caffe2/test/export/test_retraceability.py:7) when processing rule "test_export". Please make sure it's listed in the srcs parameter of another rule. See https:$
ERROR while processing caffe2/test/TARGETS: Cannot find an owner for "test_export" (from caffe2/test/export/test_retraceability.py:6) when processing rule "test_export". Please make sure it's listed in the srcs parameter of another rule. See ht$
ERROR while processing caffe2/test/TARGETS: Cannot find an owner for "testing" (from caffe2/test/export/test_export_nonstrict.py:7) when processing rule "test_export". Please make sure it's listed in the srcs parameter of another rule. See http$
ERROR while processing caffe2/test/TARGETS: Cannot find an owner for "test_export" (from caffe2/test/export/test_export_nonstrict.py:6) when processing rule "test_export". Please make sure it's listed in the srcs parameter of another rule. See $
ERROR while processing caffe2/test/TARGETS: Cannot find an owner for "test_export" (from caffe2/test/export/test_export_training_ir_to_run_decomp.py:8) when processing rule "test_export". Please make sure it's listed in the srcs parameter of an$
ERROR while processing caffe2/test/TARGETS: Cannot find an owner for "testing" (from caffe2/test/export/test_export_training_ir_to_run_decomp.py:10) when processing rule "test_export". Please make sure it's listed in the srcs parameter of anoth$
ERROR while processing caffe2/test/TARGETS: Found "//python/typeshed_internal:typeshed_internal_library" owner for "cv2" but it is protected by visibility rules: [] (from caffe2/test/test_bundled_images.py:7) when processing rule "test_bundled_$
ERROR while processing caffe2/test/TARGETS: Cannot find an owner for "caffe2.test.profiler_test_cpp_thread_lib" (from caffe2/test/profiler/test_cpp_thread.py:29) when processing rule "profiler_test_cpp_thread". Please make sure it's listed in t$
ERROR while processing caffe2/test/TARGETS: Cannot find an owner for "torch._utils_internal.get_file_path_2" (from caffe2/test/test_custom_ops.py:23) when processing rule "custom_ops". Please make sure it's listed in the srcs parameter of anoth$
ERROR while processing caffe2/test/TARGETS: Cannot find an owner for "torch._utils_internal.get_file_path_2" (from caffe2/test/test_public_bindings.py:13) when processing rule "public_bindings". Please make sure it's listed in the srcs paramete$
ERROR while processing caffe2/test/TARGETS: Cannot find an owner for "torch._C._profiler.symbolize_tracebacks" (from caffe2/test/test_cuda.py:3348) when processing rule "test_cuda". Please make sure it's listed in the srcs parameter of another $
ERROR while processing caffe2/test/TARGETS: Cannot find an owner for "torch._C._profiler.gather_traceback" (from caffe2/test/test_cuda.py:3348) when processing rule "test_cuda". Please make sure it's listed in the srcs parameter of another rule$
ERROR while processing caffe2/test/TARGETS: Cannot find an owner for include <torch/csrc/autograd/profiler_kineto.h> (from caffe2/test/profiler/test_cpp_thread.cpp:2) when processing profiler_test_cpp_thread_lib.  Some things to try:
```

Differential Revision: D62049222

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135614
Approved by: https://github.com/oulgen, https://github.com/laithsakka
2024-09-13 02:04:34 +00:00
bf68e16e94 [dynamo] Fix support for classmethod(property(...)) (#134968)
Fixes #134451

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134968
Approved by: https://github.com/yanboliang
2024-09-13 01:14:18 +00:00
eqy
d732df7e56 [Inductor] Disable TF32 in test_slice_scatter_reinplace (#135709)
TF32 linear/matmul numerics seem unrelated to test functionality so disabling it here to abate noisy failures

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135709
Approved by: https://github.com/eellison
2024-09-13 00:30:45 +00:00
c9de2efde6 [Docs] fix inconsistent docs in conv1d, conv2d, and conv3d (#135894)
Addresses https://github.com/pytorch/pytorch/issues/135880
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135894
Approved by: https://github.com/mikaylagawarecki, https://github.com/malfet
2024-09-13 00:19:42 +00:00
1f15c0c7a5 [fx] Replace _snake_case with a regexp (#135822)
~2x speedup on this function, though saves <0.5s overall

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135822
Approved by: https://github.com/oulgen
ghstack dependencies: #135787, #135788, #135820, #135821
2024-09-13 00:18:41 +00:00
a72124add9 [fx] Minor optimization in create_arg (#135821)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135821
Approved by: https://github.com/oulgen
ghstack dependencies: #135787, #135788, #135820
2024-09-13 00:18:41 +00:00
10ca4c0564 [inductor] Use TracerBase directly in LoopBody (#135820)
This skips some unneeded work in the subclass.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135820
Approved by: https://github.com/oulgen
ghstack dependencies: #135787, #135788
2024-09-13 00:18:41 +00:00
d3aab9642b [inductor] Optimize can_fuse_vertical() (#135788)
An O(n^2) to O(n) improvement by not comparing all pairs of deps.

Before:
![image](https://github.com/user-attachments/assets/797cd1bd-5d53-4374-8e76-ffce4232d7f9)

After:
![image](https://github.com/user-attachments/assets/1e61bf29-adba-41a4-839e-f028130fa979)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135788
Approved by: https://github.com/oulgen
ghstack dependencies: #135787
2024-09-13 00:18:41 +00:00
67a929eea8 [inductor] Remove unused check (#135787)
I think this is unreachable code because mode is always None on reads.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135787
Approved by: https://github.com/oulgen
2024-09-13 00:18:41 +00:00
f576960bbc do not expand in replace/simplify if no changes (#135863)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135863
Approved by: https://github.com/ezyang
2024-09-13 00:12:01 +00:00
1aba224cfd Update nightly PyTorch version to 2.6.0 (#135916)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135916
Approved by: https://github.com/kit1980
2024-09-13 00:08:52 +00:00
d383325392 [aoti] Fix workspace generation for triton (#135552)
Fixes #131337

- add `arg_type` for workspace_arg, the type is consistent with the type in `generate_workspace_allocation()`.
- do not generate example tensors for `workspace`, and use `generate_workspace_allocation()` instead.
- add workspace allocation generation code to `kernel_autotune_calls`. e.g.
```python
    workspace = empty_strided_cuda((1280, ), (1, ), torch.uint8)
    workspace.zero_()
    .....
    triton_spl_fused_add_cumprod_0.run(buf2, arg0_1, arg1_1, workspace, 1, 10000, grid=split_scan_grid(1, 10000), stream=stream0)
    del buf2, arg0_1, arg1_1, workspace
```
-  add `empty_strided_cuda = torch._C._dynamo.guards._empty_strided_cuda` to the header of triton autotune code.

The generated cpp has lines like below, so we also implement a `zero_()` for ` AtenTensorHandle `.

```cpp
    static constexpr int64_t int_array_0[] = {1280L, };
    static constexpr int64_t int_array_1[] = {1L, };
    AtenTensorHandle workspace_handle;
    AOTI_TORCH_ERROR_CODE_CHECK(aoti_torch_empty_strided(1, int_array_0, int_array_1, cached_torch_dtype_uint8, cached_torch_device_type_cuda,  0, &workspace_handle));

        RAIIAtenTensorHandle workspace(workspace_handle);
        workspace.zero_();
```

- Fix handle grid_fn  for grid computation. Pass in "RBLOCK" to `split_scan_grid`
-  Fix dynamic shapes:
Without the fix we generate code that looks like this `workspace = empty_strided_cuda((32*((255 + s0) // 256), ), (1, ), torch.uint8)` when doing triton autotune and `s0` is not defined.

The solution approach is to use `V.graph.sizevars.size_hint(nbytes)` to realize the workspace size for triton autotune. Note that we only realize it for triton autotune code, but not for the cpp cuda code.

- We also generate slightly different cpp code depending on if `abi_compatible` is turned on.
```cpp
RAIIAtenTensorHandle workspace(workspace_handle);
AOTI_TORCH_ERROR_CODE_CHECK(aoti_torch_zero_(workspace.get()));
```
vs

```cpp
    at::Tensor workspace = at::detail::empty_strided_cuda({8L*(c10::div_floor_integer(static_cast<int64_t>((255L + s0)), static_cast<int64_t>(256L))), }, {1L, }, at::kByte, c10::DeviceType::CUDA);
    workspace.zero_();
```

Test Plan:

```
TORCHINDUCTOR_ABI_COMPATIBLE=1 TORCHINDUCTOR_CPP_WRAPPER=1  python test/inductor/test_torchinductor.py -k GPUTests.test_consecutive_split_cumprod_cuda
python test/inductor/test_cuda_cpp_wrapper.py TestCudaWrapper.test_consecutive_split_cumprod_cuda_cuda_wrapper
python test/inductor/test_cuda_cpp_wrapper.py DynamicShapesCudaWrapperCudaTests.test_consecutive_split_cumprod_cuda_dynamic_shapes_cuda_wrapper
TORCHINDUCTOR_ABI_COMPATIBLE=1 python test/inductor/test_cuda_cpp_wrapper.py TestCudaWrapper.test_consecutive_split_cumprod_cuda_cuda_wrapper
TORCHINDUCTOR_CPP_WRAPPER=1  python test/inductor/test_torchinductor.py -k GPUTests.test_consecutive_split_cumprod_cuda
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135552
Approved by: https://github.com/desertfire
2024-09-12 23:53:09 +00:00
00dc7d4356 fix compiled_autograd deadlock throw (#135795)
Fixes #135298

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135795
Approved by: https://github.com/xmfan
2024-09-12 23:24:57 +00:00
1760bbc259 [FlexAttention] Ensure q/k/v and block_mask on excact the same device (#135823)
Fixes #134739

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135823
Approved by: https://github.com/BoyuanFeng
2024-09-12 23:11:01 +00:00
fb9d8e3248 [ROCm] Use ieee precision for fp32 in flex attention (#135702)
3bebc09be9

Brought in a change to flex_attention to allow TF32 precision, this largely lacks support on ROCm side and we should use ieee.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135702
Approved by: https://github.com/jeffdaily, https://github.com/drisspg
2024-09-12 23:00:48 +00:00
aaabfc8930 [Easy] Check if quant registered in constant folding (#135875)
Belated fix for https://github.com/pytorch/pytorch/issues/110904

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135875
Approved by: https://github.com/shunting314
2024-09-12 22:16:39 +00:00
63d6cd351a [dynamo] support torch.nn.attention.sdpa_kernel context manager (#135404)
Fixes https://github.com/pytorch/pytorch/issues/134608

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135404
Approved by: https://github.com/jansel, https://github.com/drisspg
2024-09-12 22:04:48 +00:00
3de9e474df Revert "Check function declarations of Core ML code (#135467)"
This reverts commit bc1b8f094d24de27432f4c29f0729e85a6b5ba63.

Reverted https://github.com/pytorch/pytorch/pull/135467 on behalf of https://github.com/malfet due to This breaks ios periodic jobs, see https://github.com/pytorch/pytorch/actions/runs/10797026668/job/29947377532 ([comment](https://github.com/pytorch/pytorch/pull/135467#issuecomment-2347322784))
2024-09-12 22:04:35 +00:00
3e1a4ea132 Revert "[DSD] Fix distributed state dict full_state_dict option hang during set_state_dict (#135725)"
This reverts commit 83c594ebd6dfa517fdd67ae23929cc60d5fa325d.

Reverted https://github.com/pytorch/pytorch/pull/135725 on behalf of https://github.com/ZainRizvi due to This is breaking lint. See [GH job link](https://github.com/pytorch/pytorch/actions/runs/10835983999/job/30068709508) [HUD commit link](83c594ebd6) ([comment](https://github.com/pytorch/pytorch/pull/135725#issuecomment-2347303272))
2024-09-12 21:47:38 +00:00
e157ce3ebb Validate input types for torch.nn.Linear and torch.nn.Bilinear (#135596)
Adding validation checks to check the input types and display better error messages for the same.
Fixes https://github.com/pytorch/pytorch/issues/135463

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135596
Approved by: https://github.com/malfet
2024-09-12 21:28:37 +00:00
b897ab0540 [export] ignore mark_dynamic() in export (#135536)
Previously we were accomodating `torch._dynamo.mark_dynamic()` for export's dynamic shapes. Here we clean things up and ignore it, requiring users to specify an export input for `dynamic_shapes`.

Note: there's 4 decorators relevant to export, `mark_dynamic, maybe_mark_dynamic, mark_static, mark_unbacked`. User calls that involve export have only been `mark_dynamic()`, and we use `maybe_mark_dynamic` under the hood for `Dim.AUTO`, but we could start using others. One reason I decided to not warn and just silently ignore is these decorators cause the tensors to carry dynamic info, and it'll be hard to tell whether the markers are from export or user calls when re-exporting with the same inputs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135536
Approved by: https://github.com/avikchaudhuri
2024-09-12 21:22:19 +00:00
3d24313809 Pass ideep:lowp_kind to matmul_forward::compute on cache misses (#135058)
Optimized dynamic quantization for aarch64 was enabled by #126687 and #134897

This PR fixes an issue for aarch64 where on a [cache miss](https://github.com/pytorch/pytorch/blob/main/aten/src/ATen/native/quantized/cpu/qlinear_dynamic.cpp#L592) (e.g. if input dimensions change) [ideep::matmul_forward::compute ](https://github.com/intel/ideep/blob/pytorch-rls-v3.5.3-2/include/ideep/operators/matmul.hpp#L160) (wrongly) runs with the [default lowp_kind (u8s8)](https://github.com/intel/ideep/blob/pytorch-rls-v3.5.3-2/include/ideep/operators/matmul.hpp#L174) which is not supported by oneDNN+ACL (Arm Compute Library), causing the workload to fall back to a much slower oneDNN gemm:jit kernel

Example:
```python
import torch

DIM = 4096
INPUT_SIZE1 = 32
INPUT_SIZE2 = 16

class LinearNet(torch.nn.Module):
   def __init__(self):
        super().__init__()
        self.fc1 = torch.nn.Linear(DIM, DIM, bias=False)

   def forward(self, x):
        x = self.fc1(x)
        return x

input1 = torch.randn(size=(INPUT_SIZE1, DIM))
input2 = torch.randn(size=(INPUT_SIZE2, DIM))

with torch.no_grad():
    model = LinearNet()
    model =  torch.ao.quantization.quantize_dynamic(model,{torch.nn.Linear})

    model(input1)   # this goes to ACL lowp_gemm
    print("="*50)
    model(input2)   # this goes to gemm:jit without this PR, and to ACL with this PR
```
In the code snippet above:
- The matmul from `model(input1)` goes to oneDNN+ACL (in both cases, with and without the PR)
- The matmul from `model(input2)`: **Without this PR**: there's a cache miss (different input shapes) and matmul_forward::compute is run with the default lowp_kind (u8s8). Hence the matmul falls back to gemm:jit in oneDNN. However, **With this PR** the matmul goes to oneDNN+ACL which is around 10x faster than oneDNN+jit.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135058
Approved by: https://github.com/jondea, https://github.com/malfet
2024-09-12 20:30:20 +00:00
cd472bb1e3 [torch][fx] Add new replacement_callback to materialize a replacement just in time (#135553)
Summary:
Sometimes we only want to generate a replacement for a matched pattern
once we know some information about the nodes in the pattern.

So far, we have found this the most useful to do matches based on specific
shapes of tensors flowing into functions.
Use a callback function similar to `match_filters`. By default this isn't used.

Had to make `replacement` a None-able parameter because Callable was
already used to detect a case where a graph needed to be traced.

Differential Revision: D62412628

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135553
Approved by: https://github.com/SherlockNoMad
2024-09-12 18:52:14 +00:00
f032135bbf Add batching rule for torch.scatter_reduce (#135547)
Fixes #134797

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135547
Approved by: https://github.com/zou3519
2024-09-12 18:51:21 +00:00
525bec804c NJT <-> padded dense conversions (#125947)
This PR:
* Implements the pre-existing `nt.to_padded_tensor(padding_val)` ATen op via the FBGEMM kernel + appropriate view gymnastics (since that kernel only handles 2D values)
* Introduces a new `_nested_from_padded_tensor` op for the reverse conversion, implemented via the reverse FBGEMM kernel + view gymnastics
    * Note: there is currently no public API for this; design booted to a future PR

TODO:
* ~~Propagate min / max sequence length via the new factory function `_nested_from_padded_tensor`~~
* ~~Verify that Inductor does computation fusion via test logic~~

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125947
Approved by: https://github.com/soulitzer
2024-09-12 17:54:25 +00:00
83c594ebd6 [DSD] Fix distributed state dict full_state_dict option hang during set_state_dict (#135725)
Fix https://github.com/pytorch/pytorch/issues/134095
This fix distributed state dict full_state_dict option hang during set_state_dict. We switch `_distribute_tensors` in _state_dict_utils.py to use `DTensor.from_local` instead of `distribute_tensor` to support FSDP2+TP 2D strided sharding use case, as `distribute_tensor` cannot handle strided sharding yet. `distribute_tensor` incurs a scatter behind the scenes, while `DTensor.from_local` takes the local slice from the full tensor on each rank to create the DTensor (no collective).  This means it's the user's responsibility to make sure the full_tensor from the full_state_dict is the same across all ranks.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135725
Approved by: https://github.com/fegin
2024-09-12 17:43:57 +00:00
c1277945d3 [AOTI][Tooling] Support debug printing for inductor level extern kernel call such as externkernel.addmm, bmm, etc. (#135731)
Summary:
As title.

Effect after merging this diff would look something like this:

```
        print('inductor: before_launch - triton_poi_fused_0 - buf0', buf0)
        triton_poi_fused_0.run(buf0, 6, grid=grid(6), stream=stream0)
        print('inductor: after_launch - triton_poi_fused_0 - buf0', buf0)
        buf1 = empty_strided_cuda((16, 6), (6, 1), torch.float32)
        # Topologically Sorted Source Nodes: [linear], Original ATen: [aten.addmm]
        print('inductor: before_launch - extern_kernels.addmm - buf0', buf0)
        extern_kernels.addmm(buf0, reinterpret_tensor(arg2_1, (16, 16), (16, 1), 0), reinterpret_tensor(L__self___weight, (16, 6), (1, 16), 0), alpha=1, beta=1, out=buf1)
        print('inductor: after_launch - extern_kernels.addmm - buf0', buf0)
```

Context: D62272588 only support major triton kernel jit inductor debug printing codegen

Test Plan: CI & OSS CI

Reviewed By: chenyang78, ColinPeppler

Differential Revision: D62397017

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135731
Approved by: https://github.com/ColinPeppler
2024-09-12 17:31:10 +00:00
dab7d646d5 Use a better decomposition for split_with_sizes (#135728)
This decomposition has less checks and improves the performance
of torch.compile.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135728
Approved by: https://github.com/ezyang
2024-09-12 16:38:51 +00:00
7647c398ff Allow optional positional arguments for torch.func.functional_call (#134643)
This PR resolves #134408. Add an additional test and have passed the local test.

Do you think we should add a post-check to ensure `args` and `kwargs` are not both `None`? It seems to be possible to have modules without inputs.

This PR does not include any such post-check.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134643
Approved by: https://github.com/zou3519
2024-09-12 15:22:06 +00:00
d67cc58181 [ONNX] Fix symbolic values and numpy implementation (#135786)
1. Remove `__eq__` to make `SymbolicTensor` hashable and test for that
2. Update the `__array__` method so that it works for tensor on GPU

Fixes https://github.com/pytorch/pytorch/issues/135700
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135786
Approved by: https://github.com/titaiwangms
2024-09-12 14:24:43 +00:00
dddaadac6c [dynamo] Dont graph break on inner torch.compile (#135819)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135819
Approved by: https://github.com/jansel
2024-09-12 11:39:09 +00:00
02169364e1 [inductor] Split reduction loops when there is no shared reads (#134307)
Fixes #129102

![image](https://github.com/user-attachments/assets/0d00f75b-2bb9-4ce6-a0d9-2daceaff539c)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134307
Approved by: https://github.com/shunting314
2024-09-12 09:45:08 +00:00
c30042fbeb [GPT-fast] Update compilation time target for Llama & Mixtral (#135817)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135817
Approved by: https://github.com/xmfan, https://github.com/huydhn
2024-09-12 07:13:44 +00:00
6700175531 [Inductor] simplify indexing_exprs in LoopBody._init_with_copy (#135574)
This PR uses `var_ranges` information to simplify `indexing_exprs` in `LoopBody._init_with_copy` to to reduce occurrences of `FloorDiv` and `ModularIndexing` in the `indexing_exprs`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135574
Approved by: https://github.com/jgong5, https://github.com/leslie-fang-intel, https://github.com/jansel
2024-09-12 06:56:34 +00:00
de8a8653c0 [dtensor][BE] replace compute_local_shape with compute_local_shape_and_global_offset (#135554)
**Summary**
1. This PR removes the public API `compute_local_shape` and replace its use with the more general API `compute_local_shape_and_global_offset`.
2. To keep `compute_local_shape_and_global_offset` consistent with `compute_local_shape` on empty shards, it now returns local tensor shape `(0,)` for empty shards which is more aligned with DTensor's semantics on non-participating ranks.

**Test**
`pytest test/distributed/_tensor/test_dtensor.py`
`pytest test/distributed/_tensor/test_init.py`
`pytest test/distributed/_tensor/test_tensor_ops.py`

Differential Revision: [D62415591](https://our.internmc.facebook.com/intern/diff/D62415591)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135554
Approved by: https://github.com/tianyu-l, https://github.com/wz337
2024-09-12 06:30:09 +00:00
86335e9135 [reland 3/3][fx] Bypass custom __setattr__ in Node.__init__ (#135735)
Relands #135079 whcih was reverted by #135562

I broke this up into three parts to test internally.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135735
Approved by: https://github.com/oulgen
2024-09-12 05:50:39 +00:00
14e3f3c062 [aoti] Remove nlohmann/json.hpp from header (#135765)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135765
Approved by: https://github.com/malfet
2024-09-12 05:38:51 +00:00
9852c6d236 xpu: fix 3rd party builds on systems with cmake<3.25 (#135767)
Cmake LINUX variable is available on starting from cmake 3.25. Better to use CMAKE_SYSTEM_NAME instead to relax cmake version requirement.

See: https://cmake.org/cmake/help/v3.25/variable/LINUX.html
Fixes: #135766
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135767
Approved by: https://github.com/malfet, https://github.com/guangyey
2024-09-12 05:31:01 +00:00
6354271178 [inductor] Skip unused call to get_estimated_runtime() (#135776)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135776
Approved by: https://github.com/oulgen
ghstack dependencies: #135445, #135446
2024-09-12 05:22:23 +00:00
12902f6ecf [inductor] Cache get_operation_names/get_buffer_names (#135446)
Before:
![image](https://github.com/user-attachments/assets/db5b6fce-d849-4512-a21d-7a09efc72311)

After:
![image](https://github.com/user-attachments/assets/097e340c-03b2-491e-ad36-132350b37892)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135446
Approved by: https://github.com/oulgen
ghstack dependencies: #135445
2024-09-12 05:22:23 +00:00
3decb676aa [inductor] Optimize cache_on_self (#135445)
This is a small compile time win, but also makes profiles more readable.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135445
Approved by: https://github.com/oulgen
2024-09-12 05:22:23 +00:00
8d68a02905 OpenReg: Split the daemon into drvier/executor (#135646)
Split the daemon into a proper user-process driver vs device-process executor.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135646
Approved by: https://github.com/albanD
2024-09-12 05:03:46 +00:00
28330a8a39 [reland 1/3][fx] Bypass custom __setattr__ in Node.__init__ (#135733)
Relands #135079 whcih was reverted by #135562

I broke this up into three parts to test internally.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135733
Approved by: https://github.com/oulgen
2024-09-12 04:29:37 +00:00
eaba287adb [dynamo] Bug fix for _torchdynamo_inline source handling (#135612)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135612
Approved by: https://github.com/drisspg
2024-09-12 04:05:08 +00:00
cyy
f5f1d0a753 Fix build warnings for torch_python (#134981)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134981
Approved by: https://github.com/ezyang
2024-09-12 03:59:34 +00:00
5bc238c73e torch.hub: add get_dir/set_dir type hints (#134906)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134906
Approved by: https://github.com/Skylion007
2024-09-12 03:53:29 +00:00
79223114db Avoid inserting extra transpose when the input to group norm is NHWC (#135575)
When the input format for group norm is NHWC and the device is privateuseone, it introduces an additional transpose operation. To avoid this issue, a check for the privateuseone device needs to be added here.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135575
Approved by: https://github.com/ezyang
2024-09-12 03:36:05 +00:00
cyy
7cfd23636c Fix clang-tidy warnings in Caffe2 code (#134935)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134935
Approved by: https://github.com/ezyang
2024-09-12 03:27:09 +00:00
0d1d69fd25 Update torch-xpu-ops pin (ATen XPU implementation) (#135647)
Release cycle for PyTorch 2.5
1. Fixing runtime error on Windows: Fail to load torch_xpu_ops_unary_binary_kernels.dll as the bin size is large.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135647
Approved by: https://github.com/EikanWang
2024-09-12 03:16:08 +00:00
21a64d57b1 [BE] typing for decorators - masked/_ops (#135108)
Differential Revision: D62184735

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135108
Approved by: https://github.com/Skylion007
2024-09-12 01:34:09 +00:00
1a74952925 "Remove BLOCK_LIST" (#135729)
Summary:
Skip test_prepare_qat_conv_bn_fusion_getitem_placeholder when we use training ir, since it's only for bn-getitem pattern, but the pattern doesn't exist in training ir.

Remove BLOCK_LIST since it's empty.
Now all internal unittests will use training ir.

Test Plan:
```
buck2 run 'fbcode//mode/dev-nosan'  caffe2/test/quantization:test_quantization -- -r test_prepare_qat_conv_bn_fusion_getitem_placeholder
buck2 run 'fbcode//mode/dev-nosan'  caffe2/test:quantization_pt2e_qat -- -r test_prepare_qat_conv_bn_fusion_getitem_placeholder
```

Differential Revision: D62387987

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135729
Approved by: https://github.com/tugsbayasgalan
2024-09-12 01:22:06 +00:00
a130ed828a Fix the upload of x86 micro benchmark results (#135780)
Upload stats workflow currently skips this https://github.com/pytorch/pytorch/actions/runs/10807251335/job/29977650639, this is a miss from https://github.com/pytorch/pytorch/pull/135042.  So, the workflow is running but nothing has been uploaded yet.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135780
Approved by: https://github.com/atalman
2024-09-12 01:16:38 +00:00
eb0fe02933 [PT2][inductor][Optimus] Add pad_aten_mm_pass pattern to resolve long computation kernel in LCE (#135167)
Summary:
We observed another long computation issue for OBA_AFOC pyper model, thus adding a pattern to avoid the perf regression

- Only happens in A100
- Do not want to use force_shape_pad since it will pad all GEMMs, which may not be optimal. Optimus pass has more flexisibility to customized GEMM shape and do corresponding padding
- To enable, we pass the pass to config, where "k_threshold_to_pad" can be customized

inductor_config.patch(post_grad_fusion_options={"pad_aten_mm_pass": {"k_threshold_to_pad" : 8388608}})

Test Plan:
# unit test

```
buck2 test mode/opt //caffe2/test/inductor:pad_mm
```
Buck UI: https://www.internalfb.com/buck2/58b0f272-f405-45be-bc8d-aec2dc4d5841
Test UI: https://www.internalfb.com/intern/testinfra/testrun/10133099209954651
Network: Up: 9.0KiB  Down: 142B  (reSessionID-8eb71a37-a5ca-4aff-a4f1-93ade3e47e4e)
Jobs completed: 9. Time elapsed: 3:18.0s.
Cache hits: 0%. Commands: 3 (cached: 0, remote: 0, local: 3)
Tests finished: Pass 17. Fail 0. Fatal 0. Skip 0. Build failure 0

# e2e test
see [D62388582](https://www.internalfb.com/diff/D62388582)

Differential Revision: D62220158

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135167
Approved by: https://github.com/jackiexu1992
2024-09-12 00:51:34 +00:00
d270e2d240 [FSDP2] better error msg for cpu offloading (#135156)
when cpu offloading is enabled, if user load a gpu state dict, FSDP2 will throw a less obvious error at backward
```
RuntimeError: attempting to assign a gradient with device type 'cpu' to a tensor with device type 'cuda'. Please ensure that the gradient and the tensor are on the same device
```

this PR throws error more explicitly by specifying which parameters should be moved because of cpu offloading

```
FSDP parameters should be materialized on cpu when enabling cpu offloading. For example, load cpu state dict or call module.to_empty(device="cpu"). Found following parameters on non-cpu device: ['0.weight']
```

`pytest -s test/distributed/_composable/fsdp/test_fully_shard_state_dict.py -k test_dp_state_dict_cpu_offload`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135156
Approved by: https://github.com/awgu
2024-09-12 00:05:07 +00:00
16b37b309f [Inductor] Rename cpp_wrapper_cuda.py as cpp_wrapper_gpu.py (#135313)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135313
Approved by: https://github.com/jansel, https://github.com/desertfire
ghstack dependencies: #135312
2024-09-11 23:59:54 +00:00
13ee85ca5e [Inductor] Generalize cuda cpp wrapper as common triton based GPU cpp wrapper, will be reused by xpu in next PR. (#135312)
[Inductor] Generalize cuda cpp wrapper as common triton based GPU cpp wrapper, will be reused by xpu in next PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135312
Approved by: https://github.com/jansel, https://github.com/desertfire, https://github.com/eellison
2024-09-11 23:59:54 +00:00
94d2471d1f [Traceable FSDP2] Use .copy_ instead of .set_ for unsharded_param inplace update; Replace unsharded_param graph input usage with graph intermediate; Support FSDP2+LoRA (#133730)
Using `fsdp.set_` for unsharded_param inplace update causes difficult-to-debug errors when enabling Traceable FSDP2 on TorchTune models. In this PR, we change it to use `fsdp.copy_` which fixes the error and also strictly follows eager semantics (i.e. if user explictly stores an alias of the unsharded_param during execution of the user's module code, that alias will get updated correctly when the unsharded_param is copy_ into; whereas if we just swap out unsharded_param storage via set_, that user-saved alias will not get updated, which is not good).

This PR also implements the graph pass to remove the resizes and copy if there is a resize_(full) -> copy_ -> resize_(0) pattern.

------

Test commands:
- `pytest -rA test/distributed/_composable/fsdp/test_fully_shard_compile.py::TestFullyShardCompile::test_transformer_backend_inductor`
- `pytest -rA test/distributed/_composable/fsdp/test_fully_shard_compile.py::TestFullyShardCompile::test_nested_fully_shard_backend_inductor`
- `pytest -rA test/distributed/_composable/fsdp/test_fully_shard_compile.py::TestFullyShardCompile::test_trace_fsdp_copy_`
- `pytest -rA test/dynamo/test_repros.py::ReproTests::test_partitioner_cse_respects_mutation_boundaries`
- `pytest -rA test/dynamo/test_repros.py::ReproTests::test_fsdp_set_input_mutation_applied_when_input_gets_no_gradients`
- `pytest -rA test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mutation_op_matching`
- `python test/inductor/test_distributed_patterns.py DistributedPatternTests.test_fake_distributed_aot_eager`
- `PYTORCH_OPINFO_SAMPLE_INPUT_INDEX=1 PYTORCH_TEST_WITH_CROSSREF=1 python test/functorch/test_aotdispatch.py TestEagerFusionOpInfoCPU.test_aot_autograd_exhaustive_norm_cpu_float32`
- `python test/distributed/test_inductor_collectives.py TestCollectivesInductor.test_backwards`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133730
Approved by: https://github.com/bdhirsh
2024-09-11 23:01:05 +00:00
5ca46be15e Fix/torch cat doc attr (#135698)
The `torch.cat` attr name for tensors in the docs differs from the method signature, unlike other methods.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135698
Approved by: https://github.com/albanD

Co-authored-by: Alexander Jipa <azzhipa@amazon.com>
2024-09-11 22:32:55 +00:00
9a04cfbeff fix for fp16 (#134106)
This PR is a replacement for https://github.com/pytorch/pytorch/pull/133085 for pushing a quick fix for RMSNorm.
The original author is @kkontny

Previous PR summary:
Since FP16 has quite small dynamic range it is very easy to overflow while computing `at::pow(input, 2)` , and it happens in real world computation.

I've tried to use `nn.RMSNorm` fused implementation instead of `LlamaRMSNorm` inside `transformers` implementation of Llama (`src/transformers/models/llama/modeling_llama.py`). It started to give wrong answers in Fp16 while still giving good in FP32. I figured out happens due to overflow while computing square of the input tensor.

Original `LLamaRMSNorm` implementation upcasts input to fp32 to prevent this and give better numerical stability.

```
class LlamaRMSNorm(nn.Module):
    def __init__(self, hidden_size, eps=1e-6):
        """
        LlamaRMSNorm is equivalent to T5LayerNorm
        """
        super().__init__()
        self.weight = nn.Parameter(torch.ones(hidden_size))
        self.variance_epsilon = eps

    def forward(self, hidden_states):
        input_dtype = hidden_states.dtype
        hidden_states = hidden_states.to(torch.float32)
        variance = hidden_states.pow(2).mean(-1, keepdim=True)
        hidden_states = hidden_states * torch.rsqrt(variance + self.variance_epsilon)
        return self.weight * hidden_states.to(input_dtype)
```

Proposed commit fixed the issue. FP16 in RMSNorm has to be treated in special way, to be usable in real world implementations.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134106
Approved by: https://github.com/mikaylagawarecki, https://github.com/eqy
2024-09-11 22:02:07 +00:00
66db61f0d1 [ONNX] Update fake mode usage in onnx docs (#135512)
Update fake mode usage in onnx docs
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135512
Approved by: https://github.com/justinchuby

Co-authored-by: Justin Chu <justinchuby@users.noreply.github.com>
2024-09-11 21:29:04 +00:00
c025f7becc Revert "[Partitioner] Reuse partition to check whether nodes exist (#135317)"
This reverts commit e004d539da3335d97a8134c9081245628f18eb67.

Reverted https://github.com/pytorch/pytorch/pull/135317 on behalf of https://github.com/izaitsevfb due to BC-breaking, breaks executorch and internal meta builds ([comment](https://github.com/pytorch/pytorch/pull/135317#issuecomment-2344730294))
2024-09-11 21:27:53 +00:00
8c4e1148b8 Refactoring byte_order (#135558)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135558
Approved by: https://github.com/mikaylagawarecki
2024-09-11 21:06:43 +00:00
e20ee39558 Expand bitwise ops to unsigned types (#135525)
Fixes https://github.com/pytorch/pytorch/issues/135436

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135525
Approved by: https://github.com/ezyang
2024-09-11 20:48:52 +00:00
74fd1bf965 [ROCm] Update to AOTriton 0.7b (#134498)
Notable changes:
1. Enable CudaGraph related tests
2. Fix UT problems
3. EXPERIMENTAL Navi31 support. User should enable Navi31 support with Env Var `TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL=1`

Know Problem:
1. `test/test_transformers.py` will massive failures and/or NaN outputs with `--use-pytest`
    + Update: Confirmed skip `class TestSDPAPrivateUse1Only` can fix the problem with `--use-pytest`

Note:
AOTriton 0.7b adds support to nestedtenosrs+SDPA but need more work (and consequently a separate PR) to enable it.

Fixes #133540

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134498
Approved by: https://github.com/pruthvistony, https://github.com/jeffdaily, https://github.com/malfet
2024-09-11 20:34:01 +00:00
5d964a5eb7 [Export] Fix SDPA decomposition (#135297)
Summary: Update SDPA decomposition to match updated stride from D62009189 which aligns strides with the `aten._scaled_dot_product_attention_math.default`, which makes `t.permute().continuous().permute()` no longer necessary.

Test Plan: CI

Differential Revision: D62278378

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135297
Approved by: https://github.com/drisspg
2024-09-11 20:21:59 +00:00
118d7e1480 [Inductor] add _dynamo.reset to test_cat_slice_cat_cuda (#135694)
Summary: test_cat_slice_cat_cuda runs inductor multiple times and check counters["inductor"] in between, and thus we need to reset properly.

Differential Revision: D62500331

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135694
Approved by: https://github.com/masnesral
2024-09-11 20:07:11 +00:00
dd47f6f623 Simplify expr before getting implications in _maybe_evaluate_static (#135499)
Fixes #134268

Previously we weren't simplifying these expressions before calling get_implications, resulting in inconsistent application of FloorDiv/CleanDiv. See #134268  for more details.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135499
Approved by: https://github.com/ezyang
2024-09-11 19:48:29 +00:00
e05ea2b179 Add decomposition for transpose_copy (#130943)
* Extracted from #128416
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130943
Approved by: https://github.com/amjames, https://github.com/eellison
2024-09-11 19:45:22 +00:00
ad75b09d89 Replace capture_pre_autograd_graph with export_for_training in torch tests (#135623)
Summary: as title

Test Plan:
```
buck2 run 'fbcode//mode/dev-nosan' fbcode//caffe2/test:test_export -- -r test_conv_dynamic
buck2 run 'fbcode//mode/dev-nosan' fbcode//caffe2/test:fx -- -r matcher
 buck2 run 'fbcode//mode/dev-nosan' fbcode//caffe2/test/quantization:test_quantization -- -r x86
```

CI

Differential Revision: D62448302

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135623
Approved by: https://github.com/tugsbayasgalan
2024-09-11 19:23:08 +00:00
a2cb9b7331 Flip triton kernel default layout constraint to "needs_fixed_stride_order" (#135581)
This is to match the default layout constraint for custom operators. By
default, Inductor should match the stride order of inputs to a triton
kernel.

Test Plan:
- existing tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135581
Approved by: https://github.com/eellison
ghstack dependencies: #135530
2024-09-11 18:43:18 +00:00
451eaf0ff2 Log full exception trace when error raised in Dynamo (#135697)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135697
Approved by: https://github.com/Skylion007
2024-09-11 18:14:33 +00:00
09519eb195 Support rolling over a percentage of workflows (#134816)
In order to support adding a rollover percentage, this ended up being a complete rewrite of runner_determinator.py.

Details of the new format are in the comments up top.

On the plus side, this now includes some unit tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134816
Approved by: https://github.com/PaliC, https://github.com/zxiiro
2024-09-11 18:01:26 +00:00
5314ae2660 Don't use exception chaining for BackendCompilerFailed (#135545)
Commandeered from https://github.com/pytorch/pytorch/pull/135496 as I'm now helping @ezyang ship dynamic float arguments in PT2.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135545
Approved by: https://github.com/ezyang
2024-09-11 17:49:18 +00:00
da587de9cb [ROCm] [BUGFIX] Re-enable rocm-specific tuning parameters v2 (#133852)
Small bug fix - https://github.com/pytorch/pytorch/pull/124592 replaced the torch.version.hip with device_props but made a mistake in porting the original logic.

The original code was:
`if torch.version.hip is not None:`

Which was incorrectly replaced by:
`if self.device_props.type != "hip":`

Another occurence of https://github.com/pytorch/pytorch/pull/130617

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133852
Approved by: https://github.com/masnesral, https://github.com/malfet
2024-09-11 17:21:40 +00:00
82a4df2d5f [CI] [ROCm] Run rocm workflow on every push to main branch (#135644)
Dial the frequency back up from https://github.com/pytorch/pytorch/pull/131637

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135644
Approved by: https://github.com/huydhn
2024-09-11 17:21:05 +00:00
18a9030952 [CI] Fix update slow tests (#135390)
* Add pytorchbot to list of approvers for file
* Add labels to the auto created PR

The auto generated PR is currently not merging due to some failing tests on slow workflow that were supposed to be moved back to normal

idk if this has much value, clearly we've been managing without the update
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135390
Approved by: https://github.com/ZainRizvi
2024-09-11 17:02:17 +00:00
03f23d07b4 Optimize ShapeEnv.replace (#135652)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135652
Approved by: https://github.com/ezyang
ghstack dependencies: #135621, #135622
2024-09-11 16:50:59 +00:00
8c738c9270 Improve performance of sympy_generic_le (#135622)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135622
Approved by: https://github.com/ezyang
ghstack dependencies: #135621
2024-09-11 16:20:03 +00:00
7ddacaf40a Improve performance of canonicalize_bool_expr (#135621)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135621
Approved by: https://github.com/ezyang
2024-09-11 16:20:03 +00:00
183c32fd3b Revert "[Dynamo] Trace torch function modes entered outside of torch.compile (#133137)"
This reverts commit 0d15122092c27fec1143b800bab7c996d126b547.

Reverted https://github.com/pytorch/pytorch/pull/133137 on behalf of https://github.com/clee2000 due to something in this stack broke functorch/test_control_flow.py::TestControlFlow::test_scan_simple_graph [GH job link](https://github.com/pytorch/pytorch/actions/runs/10804912306/job/29980571390) [HUD commit link](444b52ff40), newly added test yesterday ([comment](https://github.com/pytorch/pytorch/pull/133137#issuecomment-2344054339))
2024-09-11 15:57:00 +00:00
3ab12e2596 Revert "[Dynamo] Support thread local setattr (#135443)"
This reverts commit 160c228a4bd60ceffa62b045a6b0a6f9413835c5.

Reverted https://github.com/pytorch/pytorch/pull/135443 on behalf of https://github.com/clee2000 due to something in this stack broke functorch/test_control_flow.py::TestControlFlow::test_scan_simple_graph [GH job link](https://github.com/pytorch/pytorch/actions/runs/10804912306/job/29980571390) [HUD commit link](444b52ff40), newly added test yesterday ([comment](https://github.com/pytorch/pytorch/pull/135443#issuecomment-2344042800))
2024-09-11 15:53:55 +00:00
596e93b506 Revert "[dynamo] Bug fix for _torchdynamo_inline source handling (#135612)"
This reverts commit 5c3d0a2dedbc0e85f3b256ce56ac674078a5fae1.

Reverted https://github.com/pytorch/pytorch/pull/135612 on behalf of https://github.com/clee2000 due to broke inductor/test_cpu_select_algorithm.py::TestSelectAlgorithmCPU::test_linear_input_transpose_bias_True_cpu_float32 [GH job link](https://github.com/pytorch/pytorch/actions/runs/10805518363/job/29982386304) [HUD commit link](5c3d0a2ded), bad TD ([comment](https://github.com/pytorch/pytorch/pull/135612#issuecomment-2344039370))
2024-09-11 15:51:12 +00:00
f96e8041b1 Revert "[Dynamo] Simplify torch function mode stack guard (#135444)"
This reverts commit 444b52ff40cf4afce7bc3fdcf021a88eab3b954c.

Reverted https://github.com/pytorch/pytorch/pull/135444 on behalf of https://github.com/clee2000 due to something in this stack broke functorch/test_control_flow.py::TestControlFlow::test_scan_simple_graph [GH job link](https://github.com/pytorch/pytorch/actions/runs/10804912306/job/29980571390) [HUD commit link](444b52ff40), newly added test yesterday ([comment](https://github.com/pytorch/pytorch/pull/135444#issuecomment-2344036843))
2024-09-11 15:48:27 +00:00
7cf9c81918 Revert "[Dynamo] Use custom backend to reenter metadata tf mode when tracing while/cond (#134732)"
This reverts commit 6a3edfcc1e474e6ebd0c06624000a6d6bf1a0dee.

Reverted https://github.com/pytorch/pytorch/pull/134732 on behalf of https://github.com/clee2000 due to broke functorch/test_control_flow.py::TestControlFlow::test_scan_simple_graph [GH job link](https://github.com/pytorch/pytorch/actions/runs/10804912306/job/29980571390) [HUD commit link](444b52ff40), newly added test yesterday ([comment](https://github.com/pytorch/pytorch/pull/134732#issuecomment-2344016694))
2024-09-11 15:39:21 +00:00
49e0b88aab Fix test_triton_kernel_float64_constant (#135583)
Summary: Landed https://github.com/pytorch/pytorch/pull/135260 too soon and the test in that PR doesn't do exactly what I tested (actually test different dtypes).

Test Plan: `python test/inductor/test_triton_kernels.py -k float64_constant`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135583
Approved by: https://github.com/isuruf, https://github.com/eellison, https://github.com/Skylion007
2024-09-11 15:16:23 +00:00
ee8c5cc1cc For S444023: Back out "deprecate search_autotune_cache (#133628)" (#135186)
Summary: For S444023

Test Plan:
Revert prevented the NaN errors - f639391901
Training job ran for 7767 iterations. NaN errors show up within the first 1k.

Reviewed By: nmacchioni

Differential Revision: D62224747

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135186
Approved by: https://github.com/kit1980
2024-09-11 14:08:40 +00:00
ce4d146f56 ATen | Fix MPSCNNNeuron creation on Mac Catalyst. (#135595)
Summary:
These are still utilized directly when using relu/sigmoid/tanh tensors directly from here: https://fburl.com/code/k6n7ofzd
However, on Mac Catalyst we always were returning `nil`, as such in most cases yielding the entire graph completely useless and most often just stray `MPSTemporaryImage` references that were never written into.

This fixes the issue completely by making sure that we always return the valid kernels back, so they can be executed.

Test Plan: Test with segmentation net that uses a combination of relu and other tensors together - run this via Mac Catalyst build - it works! {F1858576745}

Reviewed By: MichaelTay

Differential Revision: D62430010

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135595
Approved by: https://github.com/MichaelTay
2024-09-11 11:12:23 +00:00
0226fcaacf Disable cuda specific restrictions in _scaled_mm for other devices (#135579)
Fixes #135576

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135579
Approved by: https://github.com/drisspg
2024-09-11 11:05:38 +00:00
4cde5096c4 [Inductor][FlexAttention] Supports dynamic shapes with block mask (#135629)
Fixes #134560 and #135206

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135629
Approved by: https://github.com/drisspg
2024-09-11 08:10:50 +00:00
443c015393 [Distributed] Improve efficiency of NaN checker (#135414)
Some customers would like to run the NaN checks on the fly, so we are improving its efficiency.

## Benchmarking
Allreduce 2G floats. `TORCH_NCCL_NAN_CHECK=1`
Red kernel: ncclAllreduce
Blue kernel: Nan check

<img width="1093" alt="Screenshot 2024-09-06 at 10 00 05 PM" src="https://github.com/user-attachments/assets/5501bc31-024f-4115-adb2-dd66eb4025d3">

## Comparison with torch ops:
Let's say a user manually check for NaNs with the following torch ops before all-reduce:
```
torch.any(torch.isnan(x))
```
<img width="1091" alt="Screenshot 2024-09-06 at 10 14 53 PM" src="https://github.com/user-attachments/assets/1f8b5f63-c955-4612-bb96-241b6c69959b">

So our perf is on-par with torch ops.

## Changes
- Load from vidmem using "big packs" of 16 bytes
- Bump `blockDim.x` from 256 to 512
- Separate loads and checks into two loops, each of 8 iterations
- Unroll the loops
- Templated functions for checking NaN in a "big pack" based on dtype

Special thanks to @jbachan from NCCL!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135414
Approved by: https://github.com/wconstab
2024-09-11 07:53:42 +00:00
4ae6d7c18f Back out "[pytorch][PR] [export] fix re-export custom metadata" (#135634)
Summary: Broke some tests. Revert this diff

Test Plan: CI

Differential Revision: D62474337

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135634
Approved by: https://github.com/tugsbayasgalan
2024-09-11 06:16:26 +00:00
3084b7b5c0 [cuDNN][SDPA] Support attn_bias in cuDNN (#130482)
CC @drisspg

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130482
Approved by: https://github.com/drisspg, https://github.com/Skylion007, https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2024-09-11 05:59:25 +00:00
5c3d0a2ded [dynamo] Bug fix for _torchdynamo_inline source handling (#135612)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135612
Approved by: https://github.com/drisspg
ghstack dependencies: #135588
2024-09-11 05:23:42 +00:00
c608b17f60 [PTD][BE][c10d] Add some code documents for TCPStore code and cosmetic changes to libUVStore code (#130496)
While designing something else when TCPStore is needed. I spent some time digging into the codebase of TCPStore and found that the code is a little bit challenging to understand without proper documents. Although people from OSS community must be smarter than me, I still want to document my findings in the code so that devs and users can use them as a reference down the road.

Also for libuv, we need to make private variables with a "_", so it's a pure renaming of private variables such as `tcpServer`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130496
Approved by: https://github.com/wconstab
2024-09-11 04:42:25 +00:00
444b52ff40 [Dynamo] Simplify torch function mode stack guard (#135444)
The semantics of ignored modes previously had edge cases, this eliminates these by in essence filtering any ignored modes out of both the ref stack and the current torch function mode stack. This is purely to fix complexity in #135422.  The ignored modes handling will be removed in a future PR after https://github.com/pytorch/pytorch/pull/135422 lands, since we will then trace through DeviceContexts vs inserting them into the graph which needed these extra workarounds for correctness.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135444
Approved by: https://github.com/anijain2305, https://github.com/williamwen42
ghstack dependencies: #134732, #133137, #135443
2024-09-11 04:18:22 +00:00
160c228a4b [Dynamo] Support thread local setattr (#135443)
In preparation for tracing through DeviceContext (defb515306/torch/utils/_device.py (L66))
This PR adds support for calling the setattr of thread local objects. These objects have a slots impl, and since this doesn't appear to have any side effects, we call this setattr impl when replaying mutations, since calling `object.__setattr__` on these objects results in a type error.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135443
Approved by: https://github.com/anijain2305
ghstack dependencies: #134732, #133137
2024-09-11 04:18:22 +00:00
0d15122092 [Dynamo] Trace torch function modes entered outside of torch.compile (#133137)
This PR adds initial tracing for torch function modes.

Details:
In essence, this adds tracing into the torch function of modes entered outside of the torch.compile call.
This does not yet support tracing enter/exit of a torch function mode/ tracing set_default_device properly using the new mode infra (this will be a very good stress test for modes). I am adding more PRs to this stack to support these. The overall plan is to support tracing enter/exit and handling graph breaks like we do other torch.* context managers.

Previously landed:
https://github.com/pytorch/pytorch/pull/133135
https://github.com/pytorch/pytorch/pull/133136
https://github.com/pytorch/pytorch/pull/133134
https://github.com/pytorch/pytorch/pull/133133
https://github.com/pytorch/pytorch/pull/133132
https://github.com/pytorch/pytorch/pull/133131
https://github.com/pytorch/pytorch/pull/133729
https://github.com/pytorch/pytorch/pull/133130

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133137
Approved by: https://github.com/jansel, https://github.com/zou3519
ghstack dependencies: #134732
2024-09-11 04:18:22 +00:00
6a3edfcc1e [Dynamo] Use custom backend to reenter metadata tf mode when tracing while/cond (#134732)
For tracing cond/while in eager, we trace the HOP with the eager backend with metadata torchfunction mode enabled. HOPs disallow the mutation that occurs in this torch function mode, so it is not able to be traced. As a result, we use a custom backend which enters this mode for tracing these HOPs. Thanks to @ydwu4 for the help with implementing this

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134732
Approved by: https://github.com/ydwu4
2024-09-11 04:18:22 +00:00
356f14e7b7 Fix the output of FileCheck when not run and add unit tests (#135345)
When FileCheck is destructed without execution, it should output all rules.
For example:
```
>>> fc = FileCheck().check("test")
>>> del fc
You have not run this instance of FileCheck!
FileCheck checks:
        CHECK: test
```

Additionally, unit tests for the Python interface of FileCheck will be added.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135345
Approved by: https://github.com/eellison
2024-09-11 04:13:24 +00:00
34dc8f69a1 Adding entry-point based support for out-of-tree rendezvous plugins (#132633)
Fixes #127519

Currently in torchrun rendezvous, there are only two rendezvous backends supported out of the box: `C10d` and `Etcd`. The changes in this PR enables the distributed elastic users to bring their out-of-tree rendezvous backend implementations as Python packages.

#### AUTHORING NEW PLUGIN
Any new plugin will be a python package exposing entry-points. For example, the structure of redis plugin is as follows:

```
plugin_root
|_ pyproject.toml
|_ src
   |_ redis
      |_ __init__.py
      |_ redis_store.py
      |_ redis_backend.py
```

The contents of the `pyproject.toml` should indicate that this is exposes a torchrun entry-point by mentioning the group name `torchrun.plugins`. The `pyproject.toml` for redis plugin would be as follows:

```
[project]
name = "redis"
version = "0.0.1"

[project.entry-points.'torchrun.plugins']
redis = 'redis'
```

The `src/redis/__init__.py` file would contain functions that return the plugin name and plugin handler. The contents of `__init__.py` for redis would be as follows:

```
def getPluginHandler():
    def _create_redis_handler(params: RendezvousParameters):
        from redis_rendezvous_backend import create_backend
        backend, store = create_backend(params)
        return create_handler(store, backend, params)
    return _create_redis_handler
```

The files `redis_store` and `redis_backend` contain the implementation of [Store](41189b0da4/torch/_C/_distributed_c10d.pyi (L171)) and [RendezvousBackend](e782918b8e/torch/distributed/elastic/rendezvous/dynamic_rendezvous.py (L61)) respectively.

#### USER EXPERIENCE
Before using the plugin for the first time, the user has to install the plugin packages. For example, the published packages can be installed using `pip3 install <plugin-name>` and the plugin is in local file systemcan be installed using `pip3 install -e <plugin-location>`.

Once installed, the new backend can be used in torchrun as follows:

```
torchrun --rdzv-backend=redis --rdzv-endpoint=redis-container:6379 --nnodes=3 --nproc-per-node=1 --max-restarts=3 --rdzv-id=1 test.py
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132633
Approved by: https://github.com/fduwjj
2024-09-11 03:35:02 +00:00
cd9ee49a69 [aoti] Add cpp loader (#135374)
* Added a cpp loader, AOTIModelPackageLoader, which can load the .pt2, build the .so, and create a runner. The python-facing API is that users can directly call the `run` function, whereas in cpp users can directly access the `runner_` if they are more familiar with that. I couldn't figure out how to bind the `get_runner()` function to python...
* Added a new config, `aot_inductor.package_cpp_only` which will **not** package the so. This means that whenever the package is loaded, we will need to build the so. This is turned off by default so that new environments do not need to rebuild their so. The `package_cpp_only` is a feature which torchchat intends to use to provide flexibility to users.
* Added a new config, `aot_inductor.metadata` which stores user-provided metadata, serialized to the pt2 as a json file. It also stores the device used when exporting, "cuda" or "cpu", so that during load time, we can use that data to determine which AOTIModelContainerRunner to use. The metadata can be accessed through `loader.get_metadata()`. TODO is to move this metadata to the toplevel `package_aoti` function so that we can remove the metadata as a config.
* Separated out `package_aoti` as a standalone function, instead of it automatically being called in inductor. This is to prepare for the case where users will compile multiple models, and want to bundle it in one package. The specific use case is in torchchat, where we want to package the separately-exported encoder and decoder layers. An example of how to use this is in `test_multiple_methods`.
* `load_package` will load a singular model, given the model name.
* The loader doesn't support windows for now, I think I need to add some more casing to make the build commands work on windows?

Differential Revision: [D62329906](https://our.internmc.facebook.com/intern/diff/D62329906)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135374
Approved by: https://github.com/desertfire, https://github.com/malfet
2024-09-11 03:00:01 +00:00
26e5572dd2 Bump triton xpu pin and release version (#135638)
Similar with https://github.com/pytorch/pytorch/pull/135627

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135638
Approved by: https://github.com/atalman
2024-09-11 00:56:15 +00:00
693897df42 [dynamo] Missing guard source keys for corner case of NNModuleVariabl… (#135041)
Potentially fixes - https://fb.workplace.com/groups/1286739428954016/permalink/1319662695661689/

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135041
Approved by: https://github.com/ezyang
2024-09-11 00:43:26 +00:00
3bf6be457d [MPS] Add missing dispatch to rshift.Tensor (#135607)
Missed it while working on https://github.com/pytorch/pytorch/pull/131813
Test plan: `python -c "import torch;print(torch.randint(100, 500, (64,), device='mps') >> torch.tensor([3,], device='mps'))"`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135607
Approved by: https://github.com/manuelcandales
2024-09-11 00:20:53 +00:00
492f064f15 [ONNX] Add assertion nodes to ignoring list (#135591)
Fixes #135419

PS: there are 104 empty output nodes, I suggest we add them one by one when we run into them.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135591
Approved by: https://github.com/justinchuby
2024-09-11 00:18:17 +00:00
29408ea81a Add option to tweak inductor stride settings for user-defined triton kernels (#135530)
Previously, Inductor was allowed to modify the stride/storage_offset
(layout) for inputs to user-defined triton kernels. This can cause
silent incorrectness because most triton kernels are written for a
specific striding pattern (usually contiguous).

This PR adds a config to allow the user to choose Inductor's behavior on
this. The options are:
- "flexible_layout" (default): Inductor can modify the layout for inputs
  to user-defined triton kernels as much as it wants.
- "needs_fixed_stride_order": Inductor must preserve the stride order
  (when compared to tracing) for inputs to user-defined triton kernels.

This matches our handling for custom operators. In the future, we'll
want a "needs_exact_strides" option (this is the safest option).

Test Plan:
- new test

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135530
Approved by: https://github.com/FindHao, https://github.com/oulgen
2024-09-11 00:11:17 +00:00
02dcb07765 Add boolean support in pack segments ops for both cpu and cuda impls (#132897) (#135620)
Summary:

Same as int types, forward only.

bypass-github-export-checks diff has been synced to github

Test Plan:
buck test mode/dev-nosan //caffe2/torch/fb/sparsenn:test -- test_pack_segments
https://www.internalfb.com/intern/testinfra/testconsole/testrun/16888498646804437/

Reviewed By: garroud

Differential Revision: D60785563

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135620
Approved by: https://github.com/kit1980

Co-authored-by: Haoming Lu <haominglu@meta.com>
2024-09-11 00:03:17 +00:00
5c38aa72c0 [dynamo][dicts][nv-embed] Support update with kwargs (#135588)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135588
Approved by: https://github.com/yanboliang
2024-09-10 23:50:23 +00:00
5134ba7458 Bump triton pin and release version (#135627)
Update the pin and release version to sync with https://github.com/triton-lang/triton/tree/release/3.1.x

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135627
Approved by: https://github.com/Chillee, https://github.com/drisspg, https://github.com/malfet
2024-09-10 23:46:36 +00:00
e48ee2cf50 [ONNX] Fix scaled_dot_product_attention with float scale (#135594)
Fixes #125158

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135594
Approved by: https://github.com/justinchuby
2024-09-10 23:04:02 +00:00
eb38ee21ba [ROCm] slow torch.sum optimization by increasing max_values_per_thread in reduce config (#135397)
Fixes #132964

This change is to optimize torch.sum() performance by increasing max_values_per_thread in setReduceConfig() for ROCm platform.
By increasing this parameter, it uses fewer threadblocks and improved the performance.

Test:
Tested on MI300x and H100, and now the MI300x perf improved to 3205GByte/s from ~1690GByte/s for the test case and is slightly better than H100 (3136GByte/s).

Also tested with other different sizes of tensors and also see perf improvement.

```python
import torch
from triton.testing import do_bench

x = torch.randn(2**30, device='cuda')

ms = do_bench(lambda: x.sum(dim=-1))

bandwidth_gbyte = x.numel() * x.dtype.itemsize / (10**9)

time_s = ms / 1000

bw_per_second = bandwidth_gbyte / time_s

print(bw_per_second)
```

Co-author: @carlobertolli

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135397
Approved by: https://github.com/eqy, https://github.com/malfet
2024-09-10 21:03:01 +00:00
8057b72763 [ez][inductor] don't benchmark cloning if there are no mutated args (#135533)
When a kernel does not have mutated args (this is quite common?), benchmarking the cost of cloning actually benchmarks a no-op. This still takes >100ms since triton.testing.do_bench will allocate 100 ms budget to run the kernel.
Skipping this benchmarking can save quite some compilation time if the code path is hit multiple times. Let's say, if the code path is hit 100 times when the graph is large, we would save >10s.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135533
Approved by: https://github.com/jansel
ghstack dependencies: #135531
2024-09-10 20:54:31 +00:00
7b17918dc9 [inductor] fix a device sync issue for benchmarking fusion (#135531)
Fix https://github.com/pytorch/pytorch/issues/134768 .

When we benchmark the latency for a fused node set, we do benchmarking twice:
1. benchmark the latency of the kernel including cloning mutated args
2. benchmark the latency of cloning mutated args without running the kernel

We subtract result 2 from result 1 to get the latency of the kernel itself.

But when the tensors are not on the cuda device 0, we get equal number for result 1 and result 2 no matter how much work the kernel does. The root cause is, in `triton.testing.do_bench` the `torch.cuda.synchronize` call sync the current cuda device (which is device 0 if it's not overriden). But since the tensors and kernels are located on another device, the sync actually does nothing (unless there happens to be other kernels on the device 0).

The fix is to set the correct current device in our benchmarking code.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135531
Approved by: https://github.com/jansel
2024-09-10 20:54:31 +00:00
66c45f3ed9 [export] fix re-export custom metadata (#135282)
Fixes #134778

When a model is exported and debug handles are added to the "custom" field of non-placeholder and non-output nodes in the graph, re-exporting it will change the metadata of placeholder nodes (the "custom" field will be added or copied to these nodes, depending whether `ExportedProgram` or `ExportedProgram.module()` is passed to `generate_numeric_debug_handle()`).

This occurs because when we re-export the model, `placeholder` nodes are unlifted to `get_attr` nodes. These nodes remain as `get_attr` after being exported to `gm_torch_level`.  Their metadata are modified [here](https://github.com/pytorch/pytorch/blob/main/torch/export/_trace.py#L1347) based on `params_buffers_to_node_meta` which is collected [here](https://github.com/pytorch/pytorch/blob/main/torch/export/_trace.py#L1312).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135282
Approved by: https://github.com/jerryzh168, https://github.com/zhxchen17, https://github.com/tugsbayasgalan
2024-09-10 20:15:02 +00:00
0a9d55d2ee Revert "[AOTI] Fix assert_function call in cpu autotune template (#135086)"
This reverts commit 16c3b8f87cfa9cb5acee8104820baa389e7ee2bd.

Reverted https://github.com/pytorch/pytorch/pull/135086 on behalf of https://github.com/izaitsevfb due to breaks internal tests, see D62405818 ([comment](https://github.com/pytorch/pytorch/pull/135086#issuecomment-2341889428))
2024-09-10 19:51:16 +00:00
4ca65d3323 [CI] Increase sharding for jobs that are timing out (#135582)
Increase sharding for
* slow grad check
* slow cuda tests slow / linux-focal-cuda12.1-py3.10-gcc9-sm86 / test
* avx

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135582
Approved by: https://github.com/huydhn, https://github.com/malfet
2024-09-10 19:45:13 +00:00
c932b39739 [FSDP2] Added _set_unshard_async_op (#135523)
This PR adds a private API `_set_unshard_async_op` that allows for running pre-forward and pre-backward all-gathers using the `async_op=True` path so that all-gather allocations happen in the default stream to avoid inter-stream fragmentation.

If using this option, forward requires explicit prefetching e.g. via the `unshard(async_op=True)` API for overlap. fp32 -> bf16 casts and the all-gather copy-in will not overlap with compute.

Differential Revision: [D62401551](https://our.internmc.facebook.com/intern/diff/D62401551)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135523
Approved by: https://github.com/weifengpy
2024-09-10 19:28:02 +00:00
1f15973657 [AOTI][Tooling][7/n] Add debug printing support for JIT inductor codegen path as well (#135285)
Summary:
1.  Add the debug printer call to a level lower for triton kernel python wrapper codegen path
2. Add `torch.save()` for jit inductor as well
3. This also fixes the issue introduced in D61949020 (at python wrapper code level for triton kernel not printing)

Test Plan:
```
AOT_INDUCTOR_DEBUG_INTERMEDIATE_VALUE_PRINTER=1  TORCHINDUCTOR_FORCE_DISABLE_CACHES=1 TORCHINDUCTOR_ABI_COMPATIBLE=1 TORCH_COMPILE_DEBUG=1 TORCH_LOGS="+graph, inductor, +schedule, output_code" buck2 run -c fbcode.enable_gpu_sections=true -c fbcode.nvcc_arch=h100 @//mode/opt fbcode//caffe2/test/inductor:test_aot_inductor -- -r test_addmm_abi_compatible_cuda
```

Differential Revision: D62272588

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135285
Approved by: https://github.com/chenyang78
2024-09-10 19:24:58 +00:00
fc88ba260f [amdsmi][torch] Update amdsmi API usages (#135504)
Summary: In ROCm 6.2.0 there were API name changes-- we check if the new APIs exist and use them in this diff; see 7b2463abe0 for the changes

Test Plan: CI

Differential Revision: D62325661

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135504
Approved by: https://github.com/eqy, https://github.com/houseroad
2024-09-10 19:15:39 +00:00
bf8d0e3107 [inductor] Enable subprocess parallel compile internally with killswitch (#132467)
Differential Revision: [D60629630](https://our.internmc.facebook.com/intern/diff/D60629630)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132467
Approved by: https://github.com/eellison
2024-09-10 19:05:46 +00:00
3a1239a248 [Profiler] Harden Record Function Kwargs (#135365)
Summary:
In S445839, we had HTA break because of the "stream" parameter that was added to gpu traces. This brought up discussions regarding hardening our post processing of said inputs as to not break JSON schema as well as downstream tools. For this reason, this diff does the following.

1. Only allow int, double, bool and string values to be processed as kwinputs for JSON output. We can handle lists if needed in the future.
2. Make sure that any boolean is lowercase  when a string so that the JSON does not break when parsing it
3. Force stream parameter to be an int

Test Plan: Added unit tests to ensure that the list of requirements above is true for kwargs only.

Differential Revision: D62304843

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135365
Approved by: https://github.com/aaronenyeshi
2024-09-10 18:44:05 +00:00
4f9f1775d8 Fix flaky TestCudaWrapper.test_randint_cuda_cuda_wrapper (#135370)
Summary: This test is flaky when run after `test_dynamic_shapes_persistent_reduction_mixed_x_dim_cuda_cuda_wrapper` because the TestCase sets config options globally in its setUp() that stick around for subsequent tests. For test isolation, we use a contextlib.ExitStack pattern in other tests to patch the config options and restore them in tearDown(). Update all TestCases in `test/inductor/test_combo_kernels.py` to use that pattern.

Test Plan:
```
python test/inductor/test_combo_kernels.py
python test/inductor/test_cuda_cpp_wrapper.py TestCudaWrapper.test_dynamic_shapes_persistent_reduction_mixed_x_dim_cuda_cuda_wrapper TestCudaWrapper.test_randint_cuda_cuda_wrapper
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135370
Approved by: https://github.com/jansel
2024-09-10 18:43:14 +00:00
5e0788befb Migrate remaining jobs to use runner determinator (#134867)
At this point all self-hosted runner jobs should be using the runner determinator to switch between LF and Meta runners. This change updates the remaining jobs that have not yet been migrated over.

Issue: https://lf-pytorch.atlassian.net/browse/PC-25

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134867
Approved by: https://github.com/ZainRizvi
2024-09-10 18:14:00 +00:00
440f8f57af Revert "[fx] Bypass custom __setattr__ in Node.__init__ (#135079)" (#135562)
This reverts commit 66da3b3b2acacb116a9b23e91b24934830eaf6b8.

#135079 breaks internal tests and needs to be reverted. Revert with mergebot doesn't work as this PR is technically part of the stack, but, according to @jansel, it should be possible to revert it individually.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135562
Approved by: https://github.com/jansel, https://github.com/seemethere
2024-09-10 18:07:11 +00:00
e004d539da [Partitioner] Reuse partition to check whether nodes exist (#135317)
The time complexity of find node whether in NodeList is O(n). Reuse partition to speed up due to partition.nodes is hash table and has same elements.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135317
Approved by: https://github.com/ezyang
2024-09-10 17:45:29 +00:00
c4b84a46a9 Add more logging to TunableOp validators (#135396)
Summary: Add more logging to TunableOp validators

Test Plan:
Verified additional logging when loading kernel selections:
```
ROCBLAS_VERSION validation: expect 4.0.0-72e57364-dirty to match 4.0.0-72e57364-dirty
GCN_ARCH_NAME validation: expect gfx942:sramecc+:xnack- to match gfx942:sramecc+:xnack-
HIPBLASLT_VERSION validation: expect 800-a15e4178 to match 800-a15e4178
ROCM_VERSION validation: expect 6.0.0.0-12969-1544e39 to match 6.0.0.0-12969-1544e39
PT_VERSION validation: expect 2.5.0 to match 2.5.0
```

```
[qizixi@devgpu039.atn3 /data/users/qizixi/fbsource/fbcode (f9305317d|remote/master)]$ PYTORCH_TUNABLEOP_VERBOSE=1 buck2 run mode/{opt,amd-gpu} -c fbcode.e
nable_gpu_sections=true //scripts/xdwang/example:fc_llama -- --enable-tuning
File changed: fbcode//hipblas_tuning_pt_llama0.csv
Buck UI: https://www.internalfb.com/buck2/1ed2fac4-743e-49ef-805f-7fb6b9300022
Network: Up: 0B  Down: 0B
Jobs completed: 4189. Time elapsed: 0.2s.
BUILD SUCCEEDED
Enabled tuning
- Run Linear (matmul) 2 x 1280 x 8192, dtype = torch.bfloat16
INFO:2024-09-06 14:38:07 2834864:2835138 CuptiActivityProfiler.cpp:260] HIP versions. Roctracer: 4.1; Runtime: 60032830; Driver: 60032830
INFO:2024-09-06 14:38:07 2834864:2836083 DynoConfigLoader.cpp:61] Setting communication fabric enabled = 0
reading tuning results from hipblas_tuning_pt_llama0.csv
Validator PT_VERSION=2.5.0
Validator ROCM_VERSION=6.0.0.0-12969-1544e39
Validator HIPBLASLT_VERSION=800-a15e4178
Validator GCN_ARCH_NAME=gfx942:sramecc+:xnack-
Validator ROCBLAS_VERSION=4.0.0-72e57364-dirty
ROCBLAS_VERSION validation: expect 4.0.0-72e57364-dirty to match 4.0.0-72e57364-dirty
GCN_ARCH_NAME validation: expect gfx942:sramecc+:xnack- to match gfx942:sramecc+:xnack-
HIPBLASLT_VERSION validation: expect 800-a15e4178 to match 800-a15e4178
ROCM_VERSION validation: expect 6.0.0.0-12969-1544e39 to match 6.0.0.0-12969-1544e39
PT_VERSION validation: expect 2.5.0 to match 2.5.0
Loading results
Avg time: 13.165860176086426 us, Achieved 3.19 TFLOPS, 1598.24 GB/s

- Run Linear (matmul) 2 x 8192 x 1024, dtype = torch.bfloat16
Avg time: 13.230760097503662 us, Achieved 2.54 TFLOPS, 1271.14 GB/s

- Run Linear (matmul) 2 x 7168 x 8192, dtype = torch.bfloat16
Avg time: 26.804399490356445 us, Achieved 8.76 TFLOPS, 4384.90 GB/s

- Run Linear (matmul) 2 x 8192 x 3584, dtype = torch.bfloat16
Avg time: 13.407809734344482 us, Achieved 8.76 TFLOPS, 4384.14 GB/s

2x1280x8192-torch.bfloat16,13.165860176086426,3.18574247630113,1598.237845349412
2x8192x1024-torch.bfloat16,13.230760097503662,2.536092541374924,1271.1420867780075
2x7168x8192-torch.bfloat16,26.804399490356445,8.762778814892096,4384.9040543618985
2x8192x3584-torch.bfloat16,13.407809734344482,8.759112362638383,4384.138585247748
```

Reviewed By: leitian

Differential Revision: D62322830

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135396
Approved by: https://github.com/eqy
2024-09-10 17:20:59 +00:00
cyy
bc1b8f094d Check function declarations of Core ML code (#135467)
Relax the restrictions.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135467
Approved by: https://github.com/ezyang
2024-09-10 16:05:22 +00:00
f65a564fa2 [inductor] Flip custom_op_default_layout_constraint (#135239)
By default, Inductor should respect the stride order of input Tensors to
custom operators.

Test Plan:
- new tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135239
Approved by: https://github.com/albanD
ghstack dependencies: #135391
2024-09-10 14:27:43 +00:00
386b313028 Handle KeyError for compiler collective in scalars too (#135385)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135385
Approved by: https://github.com/jansel
2024-09-10 12:33:04 +00:00
6d7cbc20d2 Add dynamo itertools.pairwise support (#135416)
Fixes #133766

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135416
Approved by: https://github.com/XuehaiPan, https://github.com/jansel

Co-authored-by: Xuehai Pan <XuehaiPan@pku.edu.cn>
2024-09-10 11:37:59 +00:00
ca16956b20 [Inductor] Generalize device guard codegen for cpp_wrapper mode. (#134761)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134761
Approved by: https://github.com/jansel, https://github.com/EikanWang
ghstack dependencies: #134693
2024-09-10 10:11:52 +00:00
67735d1ee8 [Inductor] Generalize is_cuda to specific device_type to make cpp_wrapper mode be extensible (#134693)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134693
Approved by: https://github.com/ezyang, https://github.com/EikanWang, https://github.com/jansel
2024-09-10 10:11:13 +00:00
6e13f5eb38 [FlexAttention] Add broadcast support for kv batch dimension (#135505)
This PR adds broadcast support for KV batch dimension.

## Details
Consider Q of shape `[Bq, Hq, Q_LEN, D]`, and K, V of shape `[Bkv, Hkv, KV_LEN, D]`. Prior to this diff, we require `Bq == Bkv`. However, for some use cases, we may have Bkv < Bq. For example, in paged attention, we provide K, V of shape `[1, Hkv, MAX_LEN, D]`, while still providing Q of shape `[Bq, Hq, Q_LEN, D]`. Here, MAX_LEN is the maximal number of tokens supported by paged attention.

This PR relax this requirement to be `Bq == Bkv or (Bq > 1 and Bkv == 0)`. This support covers both flex decoding, flex attention forward and backward.

## Benchmark
GPU: H100

We see negligible (1%~2%) performance change from this PR when `Bq == Bkv`.

```
python benchmarks/transformer/score_mod.py --calculate-bwd
```
### Perf before this PR

**FWD**

| Type    |   Speedup | score_mod     | mask_mod   | dtype          | shape(B,Hq,M,Hkv,N,D)        |
|---------|-----------|---------------|------------|----------------|------------------------------|
| Average |     0.743 |               |            |                |                              |
| Max     |     0.955 | head_bias     | None       | torch.bfloat16 | (2, 16, 1024, 2, 1024, 64)   |
| Min     |     0.548 | relative_bias | None       | torch.bfloat16 | (16, 16, 1024, 2, 1024, 128) |

**BWD**

| Type    |   Speedup | score_mod   | mask_mod   | dtype          | shape(B,Hq,M,Hkv,N,D)       |
|---------|-----------|-------------|------------|----------------|-----------------------------|
| Average |     0.834 |             |            |                |                             |
| Max     |     1.261 | head_bias   | None       | torch.bfloat16 | (8, 16, 512, 16, 512, 64)   |
| Min     |     0.456 | None        | causal     | torch.bfloat16 | (2, 16, 1024, 2, 1024, 128) |

<details>
<summary> Full performance sweep </summary>

| score_mod     | mask_mod   | dtype          | shape(B,Hq,M,Hkv,N,D)         |   fwd_eager_time |   fwd_compiled_time |   bwd_eager_time |   bwd_compiled_time |   fwd_speedup |   bwd_speedup |
|---------------|------------|----------------|-------------------------------|------------------|---------------------|------------------|---------------------|---------------|---------------|
| None          | None       | torch.bfloat16 | (2, 16, 512, 16, 512, 64)     |           15.264 |              17.184 |          107.040 |             140.800 |         0.888 |         0.760 |
| None          | causal     | torch.bfloat16 | (2, 16, 512, 16, 512, 64)     |           15.840 |              19.744 |          112.576 |             140.064 |         0.802 |         0.804 |
| relative_bias | None       | torch.bfloat16 | (2, 16, 512, 16, 512, 64)     |           15.232 |              17.344 |           87.744 |             142.496 |         0.878 |         0.616 |
| head_bias     | None       | torch.bfloat16 | (2, 16, 512, 16, 512, 64)     |           15.264 |              17.184 |          108.192 |             143.328 |         0.888 |         0.755 |
| None          | None       | torch.bfloat16 | (2, 16, 512, 16, 512, 128)    |           19.904 |              22.400 |          106.432 |             136.512 |         0.889 |         0.780 |
| None          | causal     | torch.bfloat16 | (2, 16, 512, 16, 512, 128)    |           19.424 |              26.752 |           91.712 |             106.688 |         0.726 |         0.860 |
| relative_bias | None       | torch.bfloat16 | (2, 16, 512, 16, 512, 128)    |           19.808 |              22.432 |           89.024 |             101.920 |         0.883 |         0.873 |
| head_bias     | None       | torch.bfloat16 | (2, 16, 512, 16, 512, 128)    |           19.840 |              22.272 |           88.896 |             102.592 |         0.891 |         0.867 |
| None          | None       | torch.bfloat16 | (2, 16, 1024, 16, 1024, 64)   |           30.240 |              32.416 |          116.768 |             112.256 |         0.933 |         1.040 |
| None          | causal     | torch.bfloat16 | (2, 16, 1024, 16, 1024, 64)   |           29.536 |              37.024 |          113.664 |             102.688 |         0.798 |         1.107 |
| relative_bias | None       | torch.bfloat16 | (2, 16, 1024, 16, 1024, 64)   |           30.656 |              32.800 |          116.992 |             127.008 |         0.935 |         0.921 |
| head_bias     | None       | torch.bfloat16 | (2, 16, 1024, 16, 1024, 64)   |           30.592 |              32.480 |          116.928 |             112.160 |         0.942 |         1.043 |
| None          | None       | torch.bfloat16 | (2, 16, 1024, 16, 1024, 128)  |           40.448 |              61.920 |          198.656 |             204.512 |         0.653 |         0.971 |
| None          | causal     | torch.bfloat16 | (2, 16, 1024, 16, 1024, 128)  |           37.760 |              62.528 |          189.536 |             170.624 |         0.604 |         1.111 |
| relative_bias | None       | torch.bfloat16 | (2, 16, 1024, 16, 1024, 128)  |           40.896 |              62.368 |          198.304 |             205.824 |         0.656 |         0.963 |
| head_bias     | None       | torch.bfloat16 | (2, 16, 1024, 16, 1024, 128)  |           40.448 |              61.952 |          198.432 |             203.648 |         0.653 |         0.974 |
| None          | None       | torch.bfloat16 | (2, 16, 4096, 16, 4096, 64)   |          318.528 |             355.904 |          947.232 |            1162.496 |         0.895 |         0.815 |
| None          | causal     | torch.bfloat16 | (2, 16, 4096, 16, 4096, 64)   |          199.776 |             252.128 |          677.792 |             813.184 |         0.792 |         0.834 |
| relative_bias | None       | torch.bfloat16 | (2, 16, 4096, 16, 4096, 64)   |          316.512 |             363.328 |          947.712 |            1361.984 |         0.871 |         0.696 |
| head_bias     | None       | torch.bfloat16 | (2, 16, 4096, 16, 4096, 64)   |          317.984 |             356.864 |          947.264 |            1165.024 |         0.891 |         0.813 |
| None          | None       | torch.bfloat16 | (2, 16, 4096, 16, 4096, 128)  |          446.656 |             734.656 |         1664.288 |            2172.960 |         0.608 |         0.766 |
| None          | causal     | torch.bfloat16 | (2, 16, 4096, 16, 4096, 128)  |          278.688 |             467.648 |         1182.624 |            1339.296 |         0.596 |         0.883 |
| relative_bias | None       | torch.bfloat16 | (2, 16, 4096, 16, 4096, 128)  |          447.872 |             744.096 |         1662.944 |            2196.544 |         0.602 |         0.757 |
| head_bias     | None       | torch.bfloat16 | (2, 16, 4096, 16, 4096, 128)  |          448.128 |             732.928 |         1663.072 |            2156.800 |         0.611 |         0.771 |
| None          | None       | torch.bfloat16 | (2, 16, 512, 2, 512, 64)      |           15.648 |              16.640 |          107.520 |             143.008 |         0.940 |         0.752 |
| None          | causal     | torch.bfloat16 | (2, 16, 512, 2, 512, 64)      |           15.776 |              18.240 |          129.056 |             141.920 |         0.865 |         0.909 |
| relative_bias | None       | torch.bfloat16 | (2, 16, 512, 2, 512, 64)      |           15.168 |              16.640 |          103.616 |             139.648 |         0.912 |         0.742 |
| head_bias     | None       | torch.bfloat16 | (2, 16, 512, 2, 512, 64)      |           15.616 |              16.640 |          128.608 |             164.448 |         0.938 |         0.782 |
| None          | None       | torch.bfloat16 | (2, 16, 512, 2, 512, 128)     |           19.776 |              21.952 |          125.344 |             170.304 |         0.901 |         0.736 |
| None          | causal     | torch.bfloat16 | (2, 16, 512, 2, 512, 128)     |           19.776 |              23.712 |          104.288 |             196.896 |         0.834 |         0.530 |
| relative_bias | None       | torch.bfloat16 | (2, 16, 512, 2, 512, 128)     |           19.072 |              21.952 |          102.080 |             177.056 |         0.869 |         0.577 |
| head_bias     | None       | torch.bfloat16 | (2, 16, 512, 2, 512, 128)     |           19.648 |              21.920 |          109.920 |             170.848 |         0.896 |         0.643 |
| None          | None       | torch.bfloat16 | (2, 16, 1024, 2, 1024, 64)    |           30.464 |              31.936 |          127.808 |             228.832 |         0.954 |         0.559 |
| None          | causal     | torch.bfloat16 | (2, 16, 1024, 2, 1024, 64)    |           29.472 |              33.856 |          113.152 |             215.072 |         0.871 |         0.526 |
| relative_bias | None       | torch.bfloat16 | (2, 16, 1024, 2, 1024, 64)    |           30.496 |              32.160 |          116.576 |             231.744 |         0.948 |         0.503 |
| head_bias     | None       | torch.bfloat16 | (2, 16, 1024, 2, 1024, 64)    |           30.464 |              31.904 |          116.320 |             229.824 |         0.955 |         0.506 |
| None          | None       | torch.bfloat16 | (2, 16, 1024, 2, 1024, 128)   |           40.480 |              61.440 |          176.448 |             345.312 |         0.659 |         0.511 |
| None          | causal     | torch.bfloat16 | (2, 16, 1024, 2, 1024, 128)   |           38.304 |              59.424 |          169.312 |             371.360 |         0.645 |         0.456 |
| relative_bias | None       | torch.bfloat16 | (2, 16, 1024, 2, 1024, 128)   |           40.960 |              61.760 |          176.512 |             358.912 |         0.663 |         0.492 |
| head_bias     | None       | torch.bfloat16 | (2, 16, 1024, 2, 1024, 128)   |           40.352 |              61.696 |          176.512 |             344.928 |         0.654 |         0.512 |
| None          | None       | torch.bfloat16 | (2, 16, 4096, 2, 4096, 64)    |          316.224 |             357.728 |          905.728 |            1668.448 |         0.884 |         0.543 |
| None          | causal     | torch.bfloat16 | (2, 16, 4096, 2, 4096, 64)    |          199.904 |             248.416 |          636.544 |            1109.088 |         0.805 |         0.574 |
| relative_bias | None       | torch.bfloat16 | (2, 16, 4096, 2, 4096, 64)    |          314.880 |             363.616 |          906.304 |            1658.176 |         0.866 |         0.547 |
| head_bias     | None       | torch.bfloat16 | (2, 16, 4096, 2, 4096, 64)    |          316.160 |             354.368 |          906.080 |            1649.024 |         0.892 |         0.549 |
| None          | None       | torch.bfloat16 | (2, 16, 4096, 2, 4096, 128)   |          446.912 |             739.840 |         1555.808 |            2521.952 |         0.604 |         0.617 |
| None          | causal     | torch.bfloat16 | (2, 16, 4096, 2, 4096, 128)   |          279.776 |             463.904 |         1068.928 |            1849.888 |         0.603 |         0.578 |
| relative_bias | None       | torch.bfloat16 | (2, 16, 4096, 2, 4096, 128)   |          446.080 |             748.960 |         1553.504 |            2629.888 |         0.596 |         0.591 |
| head_bias     | None       | torch.bfloat16 | (2, 16, 4096, 2, 4096, 128)   |          446.208 |             740.608 |         1558.880 |            2524.960 |         0.602 |         0.617 |
| None          | None       | torch.bfloat16 | (8, 16, 512, 16, 512, 64)     |           33.568 |              41.280 |          170.016 |             147.584 |         0.813 |         1.152 |
| None          | causal     | torch.bfloat16 | (8, 16, 512, 16, 512, 64)     |           30.688 |              43.040 |          159.552 |             146.720 |         0.713 |         1.087 |
| relative_bias | None       | torch.bfloat16 | (8, 16, 512, 16, 512, 64)     |           34.112 |              41.504 |          170.112 |             152.672 |         0.822 |         1.114 |
| head_bias     | None       | torch.bfloat16 | (8, 16, 512, 16, 512, 64)     |           34.240 |              41.152 |          170.272 |             134.976 |         0.832 |         1.261 |
| None          | None       | torch.bfloat16 | (8, 16, 512, 16, 512, 128)    |           48.672 |              76.416 |          295.296 |             263.648 |         0.637 |         1.120 |
| None          | causal     | torch.bfloat16 | (8, 16, 512, 16, 512, 128)    |           45.088 |              72.576 |          281.920 |             237.664 |         0.621 |         1.186 |
| relative_bias | None       | torch.bfloat16 | (8, 16, 512, 16, 512, 128)    |           48.032 |              76.672 |          295.520 |             265.248 |         0.626 |         1.114 |
| head_bias     | None       | torch.bfloat16 | (8, 16, 512, 16, 512, 128)    |           48.096 |              76.096 |          295.456 |             262.112 |         0.632 |         1.127 |
| None          | None       | torch.bfloat16 | (8, 16, 1024, 16, 1024, 64)   |           93.920 |             111.232 |          401.568 |             382.944 |         0.844 |         1.049 |
| None          | causal     | torch.bfloat16 | (8, 16, 1024, 16, 1024, 64)   |           68.192 |              95.232 |          338.752 |             326.816 |         0.716 |         1.037 |
| relative_bias | None       | torch.bfloat16 | (8, 16, 1024, 16, 1024, 64)   |           93.984 |             111.840 |          401.856 |             444.224 |         0.840 |         0.905 |
| head_bias     | None       | torch.bfloat16 | (8, 16, 1024, 16, 1024, 64)   |           94.176 |             110.496 |          401.600 |             383.136 |         0.852 |         1.048 |
| None          | None       | torch.bfloat16 | (8, 16, 1024, 16, 1024, 128)  |          131.488 |             227.040 |          727.424 |             739.712 |         0.579 |         0.983 |
| None          | causal     | torch.bfloat16 | (8, 16, 1024, 16, 1024, 128)  |           95.616 |             169.760 |          616.864 |             574.112 |         0.563 |         1.074 |
| relative_bias | None       | torch.bfloat16 | (8, 16, 1024, 16, 1024, 128)  |          131.680 |             228.672 |          727.616 |             746.048 |         0.576 |         0.975 |
| head_bias     | None       | torch.bfloat16 | (8, 16, 1024, 16, 1024, 128)  |          131.104 |             225.696 |          727.904 |             735.392 |         0.581 |         0.990 |
| None          | None       | torch.bfloat16 | (8, 16, 4096, 16, 4096, 64)   |         1227.296 |            1386.656 |         3720.192 |            4539.904 |         0.885 |         0.819 |
| None          | causal     | torch.bfloat16 | (8, 16, 4096, 16, 4096, 64)   |          691.360 |             831.712 |         2515.872 |            3067.808 |         0.831 |         0.820 |
| relative_bias | None       | torch.bfloat16 | (8, 16, 4096, 16, 4096, 64)   |         1228.192 |            1403.136 |         3715.520 |            5309.280 |         0.875 |         0.700 |
| head_bias     | None       | torch.bfloat16 | (8, 16, 4096, 16, 4096, 64)   |         1229.024 |            1384.992 |         3715.904 |            4550.368 |         0.887 |         0.817 |
| None          | None       | torch.bfloat16 | (8, 16, 4096, 16, 4096, 128)  |         1784.832 |            2865.888 |         6539.840 |            8460.224 |         0.623 |         0.773 |
| None          | causal     | torch.bfloat16 | (8, 16, 4096, 16, 4096, 128)  |         1017.408 |            1660.480 |         4369.824 |            5056.992 |         0.613 |         0.864 |
| relative_bias | None       | torch.bfloat16 | (8, 16, 4096, 16, 4096, 128)  |         1792.448 |            2904.864 |         6546.080 |            8537.024 |         0.617 |         0.767 |
| head_bias     | None       | torch.bfloat16 | (8, 16, 4096, 16, 4096, 128)  |         1795.552 |            2856.864 |         6544.672 |            8400.160 |         0.629 |         0.779 |
| None          | None       | torch.bfloat16 | (8, 16, 512, 2, 512, 64)      |           34.240 |              38.880 |          148.832 |             179.936 |         0.881 |         0.827 |
| None          | causal     | torch.bfloat16 | (8, 16, 512, 2, 512, 64)      |           31.168 |              38.080 |          138.528 |             167.552 |         0.818 |         0.827 |
| relative_bias | None       | torch.bfloat16 | (8, 16, 512, 2, 512, 64)      |           34.240 |              39.168 |          148.512 |             181.248 |         0.874 |         0.819 |
| head_bias     | None       | torch.bfloat16 | (8, 16, 512, 2, 512, 64)      |           34.240 |              38.784 |          148.864 |             180.224 |         0.883 |         0.826 |
| None          | None       | torch.bfloat16 | (8, 16, 512, 2, 512, 128)     |           48.832 |              76.352 |          253.632 |             295.968 |         0.640 |         0.857 |
| None          | causal     | torch.bfloat16 | (8, 16, 512, 2, 512, 128)     |           45.760 |              65.792 |          239.040 |             290.752 |         0.696 |         0.822 |
| relative_bias | None       | torch.bfloat16 | (8, 16, 512, 2, 512, 128)     |           48.768 |              76.576 |          253.312 |             304.032 |         0.637 |         0.833 |
| head_bias     | None       | torch.bfloat16 | (8, 16, 512, 2, 512, 128)     |           48.768 |              76.192 |          253.600 |             296.096 |         0.640 |         0.856 |
| None          | None       | torch.bfloat16 | (8, 16, 1024, 2, 1024, 64)    |           93.728 |             109.728 |          357.696 |             498.912 |         0.854 |         0.717 |
| None          | causal     | torch.bfloat16 | (8, 16, 1024, 2, 1024, 64)    |           68.704 |              92.288 |          295.616 |             386.240 |         0.744 |         0.765 |
| relative_bias | None       | torch.bfloat16 | (8, 16, 1024, 2, 1024, 64)    |           93.632 |             111.392 |          357.408 |             512.448 |         0.841 |         0.697 |
| head_bias     | None       | torch.bfloat16 | (8, 16, 1024, 2, 1024, 64)    |           93.280 |             109.952 |          357.696 |             501.440 |         0.848 |         0.713 |
| None          | None       | torch.bfloat16 | (8, 16, 1024, 2, 1024, 128)   |          131.392 |             230.496 |          612.224 |             807.552 |         0.570 |         0.758 |
| None          | causal     | torch.bfloat16 | (8, 16, 1024, 2, 1024, 128)   |           96.512 |             165.184 |          502.624 |             672.384 |         0.584 |         0.748 |
| relative_bias | None       | torch.bfloat16 | (8, 16, 1024, 2, 1024, 128)   |          131.360 |             232.608 |          612.064 |             832.320 |         0.565 |         0.735 |
| head_bias     | None       | torch.bfloat16 | (8, 16, 1024, 2, 1024, 128)   |          131.008 |             230.528 |          612.640 |             804.320 |         0.568 |         0.762 |
| None          | None       | torch.bfloat16 | (8, 16, 4096, 2, 4096, 64)    |         1227.968 |            1377.408 |         3477.920 |            5324.384 |         0.892 |         0.653 |
| None          | causal     | torch.bfloat16 | (8, 16, 4096, 2, 4096, 64)    |          695.264 |             824.544 |         2268.224 |            3210.208 |         0.843 |         0.707 |
| relative_bias | None       | torch.bfloat16 | (8, 16, 4096, 2, 4096, 64)    |         1228.640 |            1404.576 |         3476.832 |            5463.456 |         0.875 |         0.636 |
| head_bias     | None       | torch.bfloat16 | (8, 16, 4096, 2, 4096, 64)    |         1228.416 |            1378.752 |         3478.048 |            5367.712 |         0.891 |         0.648 |
| None          | None       | torch.bfloat16 | (8, 16, 4096, 2, 4096, 128)   |         1788.736 |            2867.712 |         6039.520 |            8616.256 |         0.624 |         0.701 |
| None          | causal     | torch.bfloat16 | (8, 16, 4096, 2, 4096, 128)   |         1021.952 |            1653.824 |         3866.208 |            5306.848 |         0.618 |         0.729 |
| relative_bias | None       | torch.bfloat16 | (8, 16, 4096, 2, 4096, 128)   |         1786.752 |            2896.352 |         6044.128 |            8871.360 |         0.617 |         0.681 |
| head_bias     | None       | torch.bfloat16 | (8, 16, 4096, 2, 4096, 128)   |         1786.080 |            2868.672 |         6040.160 |            8550.144 |         0.623 |         0.706 |
| None          | None       | torch.bfloat16 | (16, 16, 512, 16, 512, 64)    |           57.504 |              71.552 |          312.768 |             255.040 |         0.804 |         1.226 |
| None          | causal     | torch.bfloat16 | (16, 16, 512, 16, 512, 64)    |           49.472 |              71.104 |          285.696 |             243.520 |         0.696 |         1.173 |
| relative_bias | None       | torch.bfloat16 | (16, 16, 512, 16, 512, 64)    |           58.112 |              72.896 |          312.768 |             288.256 |         0.797 |         1.085 |
| head_bias     | None       | torch.bfloat16 | (16, 16, 512, 16, 512, 64)    |           57.952 |              71.680 |          312.768 |             255.552 |         0.808 |         1.224 |
| None          | None       | torch.bfloat16 | (16, 16, 512, 16, 512, 128)   |           82.336 |             144.256 |          580.128 |             500.160 |         0.571 |         1.160 |
| None          | causal     | torch.bfloat16 | (16, 16, 512, 16, 512, 128)   |           76.160 |             123.712 |          552.544 |             447.648 |         0.616 |         1.234 |
| relative_bias | None       | torch.bfloat16 | (16, 16, 512, 16, 512, 128)   |           82.400 |             145.184 |          580.032 |             504.032 |         0.568 |         1.151 |
| head_bias     | None       | torch.bfloat16 | (16, 16, 512, 16, 512, 128)   |           82.368 |             143.904 |          580.192 |             499.936 |         0.572 |         1.161 |
| None          | None       | torch.bfloat16 | (16, 16, 1024, 16, 1024, 64)  |          177.216 |             209.568 |          787.872 |             747.712 |         0.846 |         1.054 |
| None          | causal     | torch.bfloat16 | (16, 16, 1024, 16, 1024, 64)  |          121.984 |             168.256 |          651.968 |             628.256 |         0.725 |         1.038 |
| relative_bias | None       | torch.bfloat16 | (16, 16, 1024, 16, 1024, 64)  |          177.088 |             211.488 |          788.320 |             864.352 |         0.837 |         0.912 |
| head_bias     | None       | torch.bfloat16 | (16, 16, 1024, 16, 1024, 64)  |          177.440 |             208.576 |          787.424 |             749.120 |         0.851 |         1.051 |
| None          | None       | torch.bfloat16 | (16, 16, 1024, 16, 1024, 128) |          249.472 |             441.376 |         1405.440 |            1431.648 |         0.565 |         0.982 |
| None          | causal     | torch.bfloat16 | (16, 16, 1024, 16, 1024, 128) |          172.960 |             312.064 |         1172.064 |            1096.448 |         0.554 |         1.069 |
| relative_bias | None       | torch.bfloat16 | (16, 16, 1024, 16, 1024, 128) |          249.632 |             446.336 |         1405.408 |            1448.480 |         0.559 |         0.970 |
| head_bias     | None       | torch.bfloat16 | (16, 16, 1024, 16, 1024, 128) |          250.944 |             440.128 |         1406.624 |            1421.952 |         0.570 |         0.989 |
| None          | None       | torch.bfloat16 | (16, 16, 4096, 16, 4096, 64)  |         2418.720 |            2747.936 |         7330.432 |            9023.712 |         0.880 |         0.812 |
| None          | causal     | torch.bfloat16 | (16, 16, 4096, 16, 4096, 64)  |         1353.696 |            1608.480 |         4941.696 |            6078.752 |         0.842 |         0.813 |
| relative_bias | None       | torch.bfloat16 | (16, 16, 4096, 16, 4096, 64)  |         2427.456 |            2746.816 |         7329.792 |           10539.968 |         0.884 |         0.695 |
| head_bias     | None       | torch.bfloat16 | (16, 16, 4096, 16, 4096, 64)  |         2426.688 |            2763.168 |         7336.256 |            9057.536 |         0.878 |         0.810 |
| None          | None       | torch.bfloat16 | (16, 16, 4096, 16, 4096, 128) |         3554.240 |            5634.400 |        12919.872 |           16843.489 |         0.631 |         0.767 |
| None          | causal     | torch.bfloat16 | (16, 16, 4096, 16, 4096, 128) |         2003.648 |            3250.784 |         8610.144 |           10015.424 |         0.616 |         0.860 |
| relative_bias | None       | torch.bfloat16 | (16, 16, 4096, 16, 4096, 128) |         3582.080 |            5710.944 |        12923.328 |           17011.871 |         0.627 |         0.760 |
| head_bias     | None       | torch.bfloat16 | (16, 16, 4096, 16, 4096, 128) |         3581.920 |            5618.144 |        12934.528 |           16745.888 |         0.638 |         0.772 |
| None          | None       | torch.bfloat16 | (16, 16, 512, 2, 512, 64)     |           57.120 |              71.232 |          269.760 |             295.680 |         0.802 |         0.912 |
| None          | causal     | torch.bfloat16 | (16, 16, 512, 2, 512, 64)     |           49.408 |              65.312 |          242.304 |             253.952 |         0.756 |         0.954 |
| relative_bias | None       | torch.bfloat16 | (16, 16, 512, 2, 512, 64)     |           57.504 |              72.544 |          269.632 |             298.976 |         0.793 |         0.902 |
| head_bias     | None       | torch.bfloat16 | (16, 16, 512, 2, 512, 64)     |           57.760 |              71.040 |          269.600 |             296.640 |         0.813 |         0.909 |
| None          | None       | torch.bfloat16 | (16, 16, 512, 2, 512, 128)    |           82.336 |             147.168 |          466.080 |             487.456 |         0.559 |         0.956 |
| None          | causal     | torch.bfloat16 | (16, 16, 512, 2, 512, 128)    |           76.704 |             115.040 |          435.392 |             453.248 |         0.667 |         0.961 |
| relative_bias | None       | torch.bfloat16 | (16, 16, 512, 2, 512, 128)    |           81.856 |             147.424 |          465.920 |             499.552 |         0.555 |         0.933 |
| head_bias     | None       | torch.bfloat16 | (16, 16, 512, 2, 512, 128)    |           81.760 |             146.656 |          466.176 |             485.984 |         0.557 |         0.959 |
| None          | None       | torch.bfloat16 | (16, 16, 1024, 2, 1024, 64)   |          176.608 |             206.976 |          678.080 |             866.976 |         0.853 |         0.782 |
| None          | causal     | torch.bfloat16 | (16, 16, 1024, 2, 1024, 64)   |          121.664 |             164.768 |          538.240 |             636.160 |         0.738 |         0.846 |
| relative_bias | None       | torch.bfloat16 | (16, 16, 1024, 2, 1024, 64)   |          176.608 |             209.664 |          677.696 |             883.424 |         0.842 |         0.767 |
| head_bias     | None       | torch.bfloat16 | (16, 16, 1024, 2, 1024, 64)   |          177.440 |             207.840 |          677.248 |             868.288 |         0.854 |         0.780 |
| None          | None       | torch.bfloat16 | (16, 16, 1024, 2, 1024, 128)  |          250.272 |             449.536 |         1163.424 |            1420.832 |         0.557 |         0.819 |
| None          | causal     | torch.bfloat16 | (16, 16, 1024, 2, 1024, 128)  |          173.472 |             305.376 |          929.408 |            1104.544 |         0.568 |         0.841 |
| relative_bias | None       | torch.bfloat16 | (16, 16, 1024, 2, 1024, 128)  |          249.376 |             454.976 |         1163.648 |            1455.296 |         0.548 |         0.800 |
| head_bias     | None       | torch.bfloat16 | (16, 16, 1024, 2, 1024, 128)  |          250.368 |             450.144 |         1163.520 |            1409.984 |         0.556 |         0.825 |
| None          | None       | torch.bfloat16 | (16, 16, 4096, 2, 4096, 64)   |         2416.576 |            2726.208 |         6835.520 |           10442.784 |         0.886 |         0.655 |
| None          | causal     | torch.bfloat16 | (16, 16, 4096, 2, 4096, 64)   |         1357.440 |            1590.752 |         4433.664 |            5975.296 |         0.853 |         0.742 |
| relative_bias | None       | torch.bfloat16 | (16, 16, 4096, 2, 4096, 64)   |         2427.360 |            2747.040 |         6853.056 |           10670.784 |         0.884 |         0.642 |
| head_bias     | None       | torch.bfloat16 | (16, 16, 4096, 2, 4096, 64)   |         2441.120 |            2718.944 |         6836.640 |           10433.792 |         0.898 |         0.655 |
| None          | None       | torch.bfloat16 | (16, 16, 4096, 2, 4096, 128)  |         3555.392 |            5620.960 |        11944.000 |           16504.801 |         0.633 |         0.724 |
| None          | causal     | torch.bfloat16 | (16, 16, 4096, 2, 4096, 128)  |         2010.848 |            3241.152 |         7636.064 |            9870.464 |         0.620 |         0.774 |
| relative_bias | None       | torch.bfloat16 | (16, 16, 4096, 2, 4096, 128)  |         3557.440 |            5688.352 |        11935.744 |           17090.496 |         0.625 |         0.698 |
| head_bias     | None       | torch.bfloat16 | (16, 16, 4096, 2, 4096, 128)  |         3562.720 |            5630.432 |        11939.168 |           16392.033 |         0.633 |         0.728 |

</details>

### Perf after this PR

**FWD**

| Type    |   Speedup | score_mod     | mask_mod   | dtype          | shape(B,Hq,M,Hkv,N,D)      |
|---------|-----------|---------------|------------|----------------|----------------------------|
| Average |     0.776 |               |            |                |                            |
| Max     |     1.006 | None          | None       | torch.bfloat16 | (2, 16, 1024, 2, 1024, 64) |
| Min     |     0.566 | relative_bias | None       | torch.bfloat16 | (16, 16, 512, 2, 512, 128) |

**BWD**

| Type    |   Speedup | score_mod   | mask_mod   | dtype          | shape(B,Hq,M,Hkv,N,D)       |
|---------|-----------|-------------|------------|----------------|-----------------------------|
| Average |     0.817 |             |            |                |                             |
| Max     |     1.150 | None        | causal     | torch.bfloat16 | (16, 16, 512, 16, 512, 128) |
| Min     |     0.454 | None        | causal     | torch.bfloat16 | (2, 16, 1024, 2, 1024, 128) |

<details>
<summary> Full performance sweep </summary>

| score_mod     | mask_mod   | dtype          | shape(B,Hq,M,Hkv,N,D)         |   fwd_eager_time |   fwd_compiled_time |   bwd_eager_time |   bwd_compiled_time |   fwd_speedup |   bwd_speedup |
|---------------|------------|----------------|-------------------------------|------------------|---------------------|------------------|---------------------|---------------|---------------|
| None          | None       | torch.bfloat16 | (2, 16, 512, 16, 512, 64)     |           15.680 |              17.056 |           64.544 |              73.376 |         0.919 |         0.880 |
| None          | causal     | torch.bfloat16 | (2, 16, 512, 16, 512, 64)     |           15.712 |              19.872 |           65.408 |              72.864 |         0.791 |         0.898 |
| relative_bias | None       | torch.bfloat16 | (2, 16, 512, 16, 512, 64)     |           16.160 |              17.280 |           64.896 |              73.888 |         0.935 |         0.878 |
| head_bias     | None       | torch.bfloat16 | (2, 16, 512, 16, 512, 64)     |           16.192 |              17.120 |           64.896 |              75.424 |         0.946 |         0.860 |
| None          | None       | torch.bfloat16 | (2, 16, 512, 16, 512, 128)    |           19.648 |              22.496 |           89.184 |              82.592 |         0.873 |         1.080 |
| None          | causal     | torch.bfloat16 | (2, 16, 512, 16, 512, 128)    |           20.320 |              26.816 |           91.264 |              82.880 |         0.758 |         1.101 |
| relative_bias | None       | torch.bfloat16 | (2, 16, 512, 16, 512, 128)    |           20.096 |              22.528 |           89.184 |              83.776 |         0.892 |         1.065 |
| head_bias     | None       | torch.bfloat16 | (2, 16, 512, 16, 512, 128)    |           19.680 |              22.432 |           89.184 |             120.096 |         0.877 |         0.743 |
| None          | None       | torch.bfloat16 | (2, 16, 1024, 16, 1024, 64)   |           32.384 |              32.512 |          119.232 |             128.960 |         0.996 |         0.925 |
| None          | causal     | torch.bfloat16 | (2, 16, 1024, 16, 1024, 64)   |           30.176 |              37.248 |          113.664 |             119.520 |         0.810 |         0.951 |
| relative_bias | None       | torch.bfloat16 | (2, 16, 1024, 16, 1024, 64)   |           32.512 |              32.928 |          119.264 |             131.456 |         0.987 |         0.907 |
| head_bias     | None       | torch.bfloat16 | (2, 16, 1024, 16, 1024, 64)   |           32.448 |              32.704 |          119.200 |             128.352 |         0.992 |         0.929 |
| None          | None       | torch.bfloat16 | (2, 16, 1024, 16, 1024, 128)  |           41.952 |              62.176 |          199.040 |             214.304 |         0.675 |         0.929 |
| None          | causal     | torch.bfloat16 | (2, 16, 1024, 16, 1024, 128)  |           39.744 |              62.880 |          189.504 |             179.968 |         0.632 |         1.053 |
| relative_bias | None       | torch.bfloat16 | (2, 16, 1024, 16, 1024, 128)  |           41.472 |              62.784 |          199.136 |             217.664 |         0.661 |         0.915 |
| head_bias     | None       | torch.bfloat16 | (2, 16, 1024, 16, 1024, 128)  |           42.048 |              61.952 |          199.168 |             214.496 |         0.679 |         0.929 |
| None          | None       | torch.bfloat16 | (2, 16, 4096, 16, 4096, 64)   |          341.184 |             357.632 |          980.256 |            1328.896 |         0.954 |         0.738 |
| None          | causal     | torch.bfloat16 | (2, 16, 4096, 16, 4096, 64)   |          212.576 |             252.960 |          673.888 |             824.864 |         0.840 |         0.817 |
| relative_bias | None       | torch.bfloat16 | (2, 16, 4096, 16, 4096, 64)   |          340.000 |             363.296 |          980.768 |            1375.808 |         0.936 |         0.713 |
| head_bias     | None       | torch.bfloat16 | (2, 16, 4096, 16, 4096, 64)   |          340.768 |             356.832 |          980.960 |            1326.272 |         0.955 |         0.740 |
| None          | None       | torch.bfloat16 | (2, 16, 4096, 16, 4096, 128)  |          459.392 |             737.120 |         1678.240 |            2205.248 |         0.623 |         0.761 |
| None          | causal     | torch.bfloat16 | (2, 16, 4096, 16, 4096, 128)  |          292.672 |             468.096 |         1178.016 |            1371.584 |         0.625 |         0.859 |
| relative_bias | None       | torch.bfloat16 | (2, 16, 4096, 16, 4096, 128)  |          462.144 |             745.312 |         1680.000 |            2252.512 |         0.620 |         0.746 |
| head_bias     | None       | torch.bfloat16 | (2, 16, 4096, 16, 4096, 128)  |          462.112 |             736.576 |         1679.008 |            2216.480 |         0.627 |         0.758 |
| None          | None       | torch.bfloat16 | (2, 16, 512, 2, 512, 64)      |           16.064 |              16.704 |          105.120 |             120.768 |         0.962 |         0.870 |
| None          | causal     | torch.bfloat16 | (2, 16, 512, 2, 512, 64)      |           15.552 |              18.144 |          107.136 |             121.696 |         0.857 |         0.880 |
| relative_bias | None       | torch.bfloat16 | (2, 16, 512, 2, 512, 64)      |           16.096 |              16.768 |          102.688 |             120.864 |         0.960 |         0.850 |
| head_bias     | None       | torch.bfloat16 | (2, 16, 512, 2, 512, 64)      |           16.032 |              16.576 |          104.736 |             124.672 |         0.967 |         0.840 |
| None          | None       | torch.bfloat16 | (2, 16, 512, 2, 512, 128)     |           19.392 |              21.952 |          104.736 |             174.656 |         0.883 |         0.600 |
| None          | causal     | torch.bfloat16 | (2, 16, 512, 2, 512, 128)     |           20.128 |              23.712 |          105.216 |             199.008 |         0.849 |         0.529 |
| relative_bias | None       | torch.bfloat16 | (2, 16, 512, 2, 512, 128)     |           19.904 |              21.888 |          103.744 |             179.520 |         0.909 |         0.578 |
| head_bias     | None       | torch.bfloat16 | (2, 16, 512, 2, 512, 128)     |           19.968 |              21.952 |          104.640 |             177.312 |         0.910 |         0.590 |
| None          | None       | torch.bfloat16 | (2, 16, 1024, 2, 1024, 64)    |           32.096 |              31.904 |          118.720 |             231.968 |         1.006 |         0.512 |
| None          | causal     | torch.bfloat16 | (2, 16, 1024, 2, 1024, 64)    |           30.528 |              33.952 |          112.480 |             218.304 |         0.899 |         0.515 |
| relative_bias | None       | torch.bfloat16 | (2, 16, 1024, 2, 1024, 64)    |           32.160 |              32.224 |          118.752 |             237.312 |         0.998 |         0.500 |
| head_bias     | None       | torch.bfloat16 | (2, 16, 1024, 2, 1024, 64)    |           32.128 |              32.032 |          118.240 |             233.120 |         1.003 |         0.507 |
| None          | None       | torch.bfloat16 | (2, 16, 1024, 2, 1024, 128)   |           41.312 |              61.280 |          177.408 |             350.688 |         0.674 |         0.506 |
| None          | causal     | torch.bfloat16 | (2, 16, 1024, 2, 1024, 128)   |           39.552 |              59.360 |          168.832 |             371.488 |         0.666 |         0.454 |
| relative_bias | None       | torch.bfloat16 | (2, 16, 1024, 2, 1024, 128)   |           41.984 |              61.696 |          177.376 |             360.416 |         0.680 |         0.492 |
| head_bias     | None       | torch.bfloat16 | (2, 16, 1024, 2, 1024, 128)   |           41.312 |              61.760 |          177.184 |             355.744 |         0.669 |         0.498 |
| None          | None       | torch.bfloat16 | (2, 16, 4096, 2, 4096, 64)    |          339.744 |             357.888 |          939.712 |            1665.376 |         0.949 |         0.564 |
| None          | causal     | torch.bfloat16 | (2, 16, 4096, 2, 4096, 64)    |          212.608 |             248.832 |          633.280 |            1122.848 |         0.854 |         0.564 |
| relative_bias | None       | torch.bfloat16 | (2, 16, 4096, 2, 4096, 64)    |          339.712 |             363.232 |          940.448 |            1689.440 |         0.935 |         0.557 |
| head_bias     | None       | torch.bfloat16 | (2, 16, 4096, 2, 4096, 64)    |          341.056 |             355.264 |          940.128 |            1641.152 |         0.960 |         0.573 |
| None          | None       | torch.bfloat16 | (2, 16, 4096, 2, 4096, 128)   |          460.736 |             741.024 |         1569.824 |            2559.552 |         0.622 |         0.613 |
| None          | causal     | torch.bfloat16 | (2, 16, 4096, 2, 4096, 128)   |          293.856 |             464.192 |         1066.240 |            1840.416 |         0.633 |         0.579 |
| relative_bias | None       | torch.bfloat16 | (2, 16, 4096, 2, 4096, 128)   |          460.704 |             753.152 |         1570.112 |            2641.088 |         0.612 |         0.594 |
| head_bias     | None       | torch.bfloat16 | (2, 16, 4096, 2, 4096, 128)   |          460.832 |             745.536 |         1570.144 |            2602.560 |         0.618 |         0.603 |
| None          | None       | torch.bfloat16 | (8, 16, 512, 16, 512, 64)     |           35.680 |              41.280 |          171.840 |             158.176 |         0.864 |         1.086 |
| None          | causal     | torch.bfloat16 | (8, 16, 512, 16, 512, 64)     |           31.360 |              42.976 |          158.912 |             139.264 |         0.730 |         1.141 |
| relative_bias | None       | torch.bfloat16 | (8, 16, 512, 16, 512, 64)     |           35.168 |              41.600 |          171.648 |             161.344 |         0.845 |         1.064 |
| head_bias     | None       | torch.bfloat16 | (8, 16, 512, 16, 512, 64)     |           35.136 |              41.152 |          171.808 |             158.336 |         0.854 |         1.085 |
| None          | None       | torch.bfloat16 | (8, 16, 512, 16, 512, 128)    |           48.832 |              76.384 |          295.680 |             277.696 |         0.639 |         1.065 |
| None          | causal     | torch.bfloat16 | (8, 16, 512, 16, 512, 128)    |           45.632 |              72.512 |          281.760 |             250.752 |         0.629 |         1.124 |
| relative_bias | None       | torch.bfloat16 | (8, 16, 512, 16, 512, 128)    |           49.504 |              76.608 |          295.584 |             279.712 |         0.646 |         1.057 |
| head_bias     | None       | torch.bfloat16 | (8, 16, 512, 16, 512, 128)    |           48.864 |              75.904 |          295.456 |             277.568 |         0.644 |         1.064 |
| None          | None       | torch.bfloat16 | (8, 16, 1024, 16, 1024, 64)   |           99.392 |             111.232 |          408.640 |             442.656 |         0.894 |         0.923 |
| None          | causal     | torch.bfloat16 | (8, 16, 1024, 16, 1024, 64)   |           71.392 |              95.168 |          338.784 |             341.760 |         0.750 |         0.991 |
| relative_bias | None       | torch.bfloat16 | (8, 16, 1024, 16, 1024, 64)   |           99.808 |             112.256 |          408.608 |             456.160 |         0.889 |         0.896 |
| head_bias     | None       | torch.bfloat16 | (8, 16, 1024, 16, 1024, 64)   |          100.032 |             110.816 |          408.512 |             444.192 |         0.903 |         0.920 |
| None          | None       | torch.bfloat16 | (8, 16, 1024, 16, 1024, 128)  |          135.040 |             226.112 |          726.880 |             774.176 |         0.597 |         0.939 |
| None          | causal     | torch.bfloat16 | (8, 16, 1024, 16, 1024, 128)  |           99.904 |             169.696 |          616.448 |             607.104 |         0.589 |         1.015 |
| relative_bias | None       | torch.bfloat16 | (8, 16, 1024, 16, 1024, 128)  |          135.488 |             228.384 |          727.776 |             782.368 |         0.593 |         0.930 |
| head_bias     | None       | torch.bfloat16 | (8, 16, 1024, 16, 1024, 128)  |          135.744 |             225.664 |          728.000 |             773.600 |         0.602 |         0.941 |
| None          | None       | torch.bfloat16 | (8, 16, 4096, 16, 4096, 64)   |         1324.192 |            1387.808 |         3866.944 |            5217.184 |         0.954 |         0.741 |
| None          | causal     | torch.bfloat16 | (8, 16, 4096, 16, 4096, 64)   |          738.464 |             832.608 |         2507.392 |            3146.688 |         0.887 |         0.797 |
| relative_bias | None       | torch.bfloat16 | (8, 16, 4096, 16, 4096, 64)   |         1326.016 |            1404.256 |         3867.872 |            5382.624 |         0.944 |         0.719 |
| head_bias     | None       | torch.bfloat16 | (8, 16, 4096, 16, 4096, 64)   |         1326.144 |            1386.688 |         3867.552 |            5203.264 |         0.956 |         0.743 |
| None          | None       | torch.bfloat16 | (8, 16, 4096, 16, 4096, 128)  |         1847.488 |            2866.336 |         6612.704 |            8597.696 |         0.645 |         0.769 |
| None          | causal     | torch.bfloat16 | (8, 16, 4096, 16, 4096, 128)  |         1066.592 |            1660.640 |         4357.696 |            5174.016 |         0.642 |         0.842 |
| relative_bias | None       | torch.bfloat16 | (8, 16, 4096, 16, 4096, 128)  |         1850.464 |            2905.408 |         6616.928 |            8793.280 |         0.637 |         0.752 |
| head_bias     | None       | torch.bfloat16 | (8, 16, 4096, 16, 4096, 128)  |         1848.896 |            2834.720 |         6623.872 |            8637.920 |         0.652 |         0.767 |
| None          | None       | torch.bfloat16 | (8, 16, 512, 2, 512, 64)      |           36.384 |              38.656 |          150.336 |             182.624 |         0.941 |         0.823 |
| None          | causal     | torch.bfloat16 | (8, 16, 512, 2, 512, 64)      |           31.360 |              38.112 |          137.664 |             171.840 |         0.823 |         0.801 |
| relative_bias | None       | torch.bfloat16 | (8, 16, 512, 2, 512, 64)      |           36.608 |              39.040 |          150.528 |             183.872 |         0.938 |         0.819 |
| head_bias     | None       | torch.bfloat16 | (8, 16, 512, 2, 512, 64)      |           36.064 |              38.656 |          150.560 |             183.520 |         0.933 |         0.820 |
| None          | None       | torch.bfloat16 | (8, 16, 512, 2, 512, 128)     |           49.344 |              76.352 |          253.920 |             301.440 |         0.646 |         0.842 |
| None          | causal     | torch.bfloat16 | (8, 16, 512, 2, 512, 128)     |           46.720 |              65.824 |          239.424 |             296.384 |         0.710 |         0.808 |
| relative_bias | None       | torch.bfloat16 | (8, 16, 512, 2, 512, 128)     |           49.248 |              76.416 |          253.728 |             307.808 |         0.644 |         0.824 |
| head_bias     | None       | torch.bfloat16 | (8, 16, 512, 2, 512, 128)     |           49.376 |              76.288 |          253.728 |             304.736 |         0.647 |         0.833 |
| None          | None       | torch.bfloat16 | (8, 16, 1024, 2, 1024, 64)    |           99.264 |             110.144 |          364.960 |             503.072 |         0.901 |         0.725 |
| None          | causal     | torch.bfloat16 | (8, 16, 1024, 2, 1024, 64)    |           71.136 |              92.384 |          294.432 |             393.056 |         0.770 |         0.749 |
| relative_bias | None       | torch.bfloat16 | (8, 16, 1024, 2, 1024, 64)    |           99.200 |             111.360 |          365.152 |             512.640 |         0.891 |         0.712 |
| head_bias     | None       | torch.bfloat16 | (8, 16, 1024, 2, 1024, 64)    |           99.264 |             110.240 |          365.088 |             504.224 |         0.900 |         0.724 |
| None          | None       | torch.bfloat16 | (8, 16, 1024, 2, 1024, 128)   |          135.680 |             230.336 |          613.472 |             816.896 |         0.589 |         0.751 |
| None          | causal     | torch.bfloat16 | (8, 16, 1024, 2, 1024, 128)   |          100.256 |             165.088 |          502.144 |             676.480 |         0.607 |         0.742 |
| relative_bias | None       | torch.bfloat16 | (8, 16, 1024, 2, 1024, 128)   |          135.008 |             232.480 |          613.184 |             836.672 |         0.581 |         0.733 |
| head_bias     | None       | torch.bfloat16 | (8, 16, 1024, 2, 1024, 128)   |          135.232 |             230.624 |          613.536 |             827.136 |         0.586 |         0.742 |
| None          | None       | torch.bfloat16 | (8, 16, 4096, 2, 4096, 64)    |         1324.064 |            1378.688 |         3631.808 |            5308.384 |         0.960 |         0.684 |
| None          | causal     | torch.bfloat16 | (8, 16, 4096, 2, 4096, 64)    |          731.776 |             826.688 |         2263.168 |            3241.344 |         0.885 |         0.698 |
| relative_bias | None       | torch.bfloat16 | (8, 16, 4096, 2, 4096, 64)    |         1316.128 |            1403.200 |         3625.088 |            5550.688 |         0.938 |         0.653 |
| head_bias     | None       | torch.bfloat16 | (8, 16, 4096, 2, 4096, 64)    |         1311.904 |            1378.880 |         3616.320 |            5353.696 |         0.951 |         0.675 |
| None          | None       | torch.bfloat16 | (8, 16, 4096, 2, 4096, 128)   |         1837.856 |            2887.392 |         6121.632 |            8586.656 |         0.637 |         0.713 |
| None          | causal     | torch.bfloat16 | (8, 16, 4096, 2, 4096, 128)   |         1066.976 |            1654.368 |         3843.136 |            5291.040 |         0.645 |         0.726 |
| relative_bias | None       | torch.bfloat16 | (8, 16, 4096, 2, 4096, 128)   |         1854.208 |            2896.832 |         6130.112 |            8745.984 |         0.640 |         0.701 |
| head_bias     | None       | torch.bfloat16 | (8, 16, 4096, 2, 4096, 128)   |         1860.512 |            2889.344 |         6135.648 |            8750.592 |         0.644 |         0.701 |
| None          | None       | torch.bfloat16 | (16, 16, 512, 16, 512, 64)    |           60.640 |              71.552 |          315.968 |             296.512 |         0.847 |         1.066 |
| None          | causal     | torch.bfloat16 | (16, 16, 512, 16, 512, 64)    |           50.784 |              71.040 |          284.288 |             258.880 |         0.715 |         1.098 |
| relative_bias | None       | torch.bfloat16 | (16, 16, 512, 16, 512, 64)    |           61.312 |              72.704 |          315.680 |             302.016 |         0.843 |         1.045 |
| head_bias     | None       | torch.bfloat16 | (16, 16, 512, 16, 512, 64)    |           60.800 |              71.776 |          316.320 |             297.152 |         0.847 |         1.065 |
| None          | None       | torch.bfloat16 | (16, 16, 512, 16, 512, 128)   |           84.576 |             144.416 |          580.576 |             535.936 |         0.586 |         1.083 |
| None          | causal     | torch.bfloat16 | (16, 16, 512, 16, 512, 128)   |           76.064 |             123.648 |          553.344 |             481.376 |         0.615 |         1.150 |
| relative_bias | None       | torch.bfloat16 | (16, 16, 512, 16, 512, 128)   |           84.160 |             145.248 |          581.024 |             540.000 |         0.579 |         1.076 |
| head_bias     | None       | torch.bfloat16 | (16, 16, 512, 16, 512, 128)   |           84.512 |             143.552 |          581.088 |             535.776 |         0.589 |         1.085 |
| None          | None       | torch.bfloat16 | (16, 16, 1024, 16, 1024, 64)  |          189.152 |             209.408 |          798.400 |             868.704 |         0.903 |         0.919 |
| None          | causal     | torch.bfloat16 | (16, 16, 1024, 16, 1024, 64)  |          127.552 |             168.800 |          650.816 |             663.328 |         0.756 |         0.981 |
| relative_bias | None       | torch.bfloat16 | (16, 16, 1024, 16, 1024, 64)  |          189.376 |             211.360 |          798.080 |             895.552 |         0.896 |         0.891 |
| head_bias     | None       | torch.bfloat16 | (16, 16, 1024, 16, 1024, 64)  |          189.440 |             208.576 |          797.888 |             873.152 |         0.908 |         0.914 |
| None          | None       | torch.bfloat16 | (16, 16, 1024, 16, 1024, 128) |          257.536 |             441.760 |         1408.960 |            1514.720 |         0.583 |         0.930 |
| None          | causal     | torch.bfloat16 | (16, 16, 1024, 16, 1024, 128) |          179.328 |             312.096 |         1170.368 |            1177.472 |         0.575 |         0.994 |
| relative_bias | None       | torch.bfloat16 | (16, 16, 1024, 16, 1024, 128) |          259.264 |             446.944 |         1408.768 |            1530.400 |         0.580 |         0.921 |
| head_bias     | None       | torch.bfloat16 | (16, 16, 1024, 16, 1024, 128) |          258.080 |             440.480 |         1408.864 |            1514.144 |         0.586 |         0.930 |
| None          | None       | torch.bfloat16 | (16, 16, 4096, 16, 4096, 64)  |         2595.808 |            2771.456 |         7616.704 |           10405.248 |         0.937 |         0.732 |
| None          | causal     | torch.bfloat16 | (16, 16, 4096, 16, 4096, 64)  |         1435.744 |            1610.336 |         4927.520 |            6220.000 |         0.892 |         0.792 |
| relative_bias | None       | torch.bfloat16 | (16, 16, 4096, 16, 4096, 64)  |         2595.264 |            2745.056 |         7611.232 |           10631.392 |         0.945 |         0.716 |
| head_bias     | None       | torch.bfloat16 | (16, 16, 4096, 16, 4096, 64)  |         2576.256 |            2735.456 |         7626.400 |           10346.976 |         0.942 |         0.737 |
| None          | None       | torch.bfloat16 | (16, 16, 4096, 16, 4096, 128) |         3679.744 |            5634.816 |        13077.056 |           17182.528 |         0.653 |         0.761 |
| None          | causal     | torch.bfloat16 | (16, 16, 4096, 16, 4096, 128) |         2099.360 |            3250.176 |         8589.664 |           10236.672 |         0.646 |         0.839 |
| relative_bias | None       | torch.bfloat16 | (16, 16, 4096, 16, 4096, 128) |         3676.800 |            5716.288 |        13073.088 |           17311.071 |         0.643 |         0.755 |
| head_bias     | None       | torch.bfloat16 | (16, 16, 4096, 16, 4096, 128) |         3679.136 |            5570.496 |        13070.720 |           17192.863 |         0.660 |         0.760 |
| None          | None       | torch.bfloat16 | (16, 16, 512, 2, 512, 64)     |           61.600 |              71.008 |          272.320 |             300.000 |         0.868 |         0.908 |
| None          | causal     | torch.bfloat16 | (16, 16, 512, 2, 512, 64)     |           50.176 |              65.344 |          241.568 |             258.912 |         0.768 |         0.933 |
| relative_bias | None       | torch.bfloat16 | (16, 16, 512, 2, 512, 64)     |           61.120 |              72.512 |          272.672 |             305.408 |         0.843 |         0.893 |
| head_bias     | None       | torch.bfloat16 | (16, 16, 512, 2, 512, 64)     |           61.248 |              71.136 |          272.640 |             301.120 |         0.861 |         0.905 |
| None          | None       | torch.bfloat16 | (16, 16, 512, 2, 512, 128)    |           83.872 |             146.784 |          466.912 |             496.832 |         0.571 |         0.940 |
| None          | causal     | torch.bfloat16 | (16, 16, 512, 2, 512, 128)    |           76.704 |             115.072 |          435.584 |             462.112 |         0.667 |         0.943 |
| relative_bias | None       | torch.bfloat16 | (16, 16, 512, 2, 512, 128)    |           83.392 |             147.392 |          466.656 |             504.448 |         0.566 |         0.925 |
| head_bias     | None       | torch.bfloat16 | (16, 16, 512, 2, 512, 128)    |           83.360 |             146.688 |          466.656 |             499.040 |         0.568 |         0.935 |
| None          | None       | torch.bfloat16 | (16, 16, 1024, 2, 1024, 64)   |          189.024 |             207.584 |          684.768 |             873.568 |         0.911 |         0.784 |
| None          | causal     | torch.bfloat16 | (16, 16, 1024, 2, 1024, 64)   |          126.944 |             164.288 |          536.192 |             645.984 |         0.773 |         0.830 |
| relative_bias | None       | torch.bfloat16 | (16, 16, 1024, 2, 1024, 64)   |          188.768 |             209.760 |          684.096 |             897.504 |         0.900 |         0.762 |
| head_bias     | None       | torch.bfloat16 | (16, 16, 1024, 2, 1024, 64)   |          189.408 |             207.776 |          685.024 |             876.384 |         0.912 |         0.782 |
| None          | None       | torch.bfloat16 | (16, 16, 1024, 2, 1024, 128)  |          259.168 |             449.536 |         1167.936 |            1433.280 |         0.577 |         0.815 |
| None          | causal     | torch.bfloat16 | (16, 16, 1024, 2, 1024, 128)  |          180.000 |             305.312 |          928.000 |            1113.920 |         0.590 |         0.833 |
| relative_bias | None       | torch.bfloat16 | (16, 16, 1024, 2, 1024, 128)  |          258.464 |             455.136 |         1167.808 |            1462.848 |         0.568 |         0.798 |
| head_bias     | None       | torch.bfloat16 | (16, 16, 1024, 2, 1024, 128)  |          257.824 |             450.208 |         1167.744 |            1448.000 |         0.573 |         0.806 |
| None          | None       | torch.bfloat16 | (16, 16, 4096, 2, 4096, 64)   |         2598.368 |            2729.120 |         7134.400 |           10381.632 |         0.952 |         0.687 |
| None          | causal     | torch.bfloat16 | (16, 16, 4096, 2, 4096, 64)   |         1435.456 |            1591.040 |         4424.768 |            6035.808 |         0.902 |         0.733 |
| relative_bias | None       | torch.bfloat16 | (16, 16, 4096, 2, 4096, 64)   |         2594.752 |            2725.952 |         7128.384 |           10822.496 |         0.952 |         0.659 |
| head_bias     | None       | torch.bfloat16 | (16, 16, 4096, 2, 4096, 64)   |         2597.888 |            2716.960 |         7101.568 |           10385.440 |         0.956 |         0.684 |
| None          | None       | torch.bfloat16 | (16, 16, 4096, 2, 4096, 128)  |         3647.648 |            5581.632 |        12089.952 |           16667.233 |         0.654 |         0.725 |
| None          | causal     | torch.bfloat16 | (16, 16, 4096, 2, 4096, 128)  |         2093.952 |            3241.440 |         7579.392 |            9847.936 |         0.646 |         0.770 |
| relative_bias | None       | torch.bfloat16 | (16, 16, 4096, 2, 4096, 128)  |         3650.528 |            5650.688 |        12105.568 |           16963.680 |         0.646 |         0.714 |
| head_bias     | None       | torch.bfloat16 | (16, 16, 4096, 2, 4096, 128)  |         3680.064 |            5585.312 |        12117.504 |           16935.040 |         0.659 |         0.716 |

</details>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135505
Approved by: https://github.com/Chillee
2024-09-10 09:30:02 +00:00
23b1486185 [MPS] Allow nan mean reduction in nll_loss (#135434)
This PR allows results from `nn_loss` to be `nan`, which is the same behavior as with CUDA and CPU https://github.com/pytorch/pytorch/pull/64572#issuecomment-926504162.

Fixes #134431

Ref #64572 #119108
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135434
Approved by: https://github.com/malfet
2024-09-10 08:37:59 +00:00
9902b349cb [Inductor] Make static_input_idxs a set for faster lookup (#135314)
`static_input_idxs` is only used for lookups. With large models, this is a large list. This takes over a millisecond in some cases.

Profile before change:
<img width="824" alt="image" src="https://github.com/user-attachments/assets/002a0775-fd2f-4d27-8cf2-812b502d7d5e">

Profile after change: gaps are smaller, 1ms speedup before launching the cuda graph
<img width="794" alt="image" src="https://github.com/user-attachments/assets/12a0a0b9-2cc1-4d53-ac87-9bd5140a46f5">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135314
Approved by: https://github.com/oulgen
2024-09-10 07:27:55 +00:00
5a9ac83e94 Fix doc (#135551)
Differential Revision: [D62412667](https://our.internmc.facebook.com/intern/diff/D62412667/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135551
Approved by: https://github.com/yushangdi
ghstack dependencies: #135549
2024-09-10 07:18:44 +00:00
1adf28a5c0 [inductor] print triton float64 constants correctly (#135260)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135260
Approved by: https://github.com/jansel
2024-09-10 07:05:02 +00:00
c18052da0e Add some minor doc improvement and ban using training IR for unflattener (#135549)
Title

Differential Revision: [D62412490](https://our.internmc.facebook.com/intern/diff/D62412490/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135549
Approved by: https://github.com/yushangdi
2024-09-10 06:48:42 +00:00
c0d2f991b1 Increase TRITON_MAX_BLOCK['X'] (#135181)
Fixes #135028

As title, increase `TRITON_MAX_BLOCK['X']` to 4096 and fix an error, thanks to @Chillee: https://github.com/pytorch/pytorch/pull/133300/files#r1744706189

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135181
Approved by: https://github.com/jansel
2024-09-10 05:54:37 +00:00
e889252493 Implementation of scan (#134102)
This operation is supposed to be the pendant to the `associative_scan`, but can operate with non-associative functions.

@ydwu4

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134102
Approved by: https://github.com/ydwu4
2024-09-10 04:51:16 +00:00
6546c6186d do not raise when flatten_fn_with_keys not found when suggesting fixes (#135518)
Test Plan: added test

Differential Revision: D62395371

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135518
Approved by: https://github.com/zhxchen17
2024-09-10 03:47:36 +00:00
1d9fefff19 [DCP] Fixes the stateless optimizer issue of distributed state_dict (#135535)
Some optimizers don't have states that can cause get_state_dict/set_state_dict behave incorrectly. This PR fixes the issues.

fixes: https://github.com/pytorch/pytorch/issues/133415

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135535
Approved by: https://github.com/wz337
2024-09-10 03:10:00 +00:00
7ec17b49cf Fix dynamo benchmark skip logic for cpu device (#135193)
Fixes #132380, adjust torchbench and huggingface skip models list, then we can remove `--no-skip` when running benchmarks on 3 suites.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135193
Approved by: https://github.com/chuanqi129, https://github.com/jansel
2024-09-10 03:02:19 +00:00
146921007a [inductor] [cpp] fix the input contiguous check in max-autotune (#134982)
## Description
Fixes the FP32 accuracy failure of `resmlp_12_224` and BF16 accuracy failure of `volo_d1_224` in timm.

In this PR, we check whether input is contiguous using the following way:
If it has `FixedLayout`, we know the accurate strides. For `FlexibleLayout`, if its data is a `ComputedBuffer`, we could get the fill order of the buffer to decide whether it's contiguous. For the other cases, we won't use GEMM template as we can't infer whether it's contiguous.

## Additional context
The current GEMM template only supports this case: `input.get_stride()[-1] == 1`. In `resmlp_12_224`, when we run into this check, the layout of `input` is a `FlexibleLayout`. The reason is that when realizing the input which is a `View` IR, the `convert_to_reinterpret_view` call fails:
d14fe3ffed/torch/_inductor/ir.py (L4712-L4715)

And it finally runs into this `copy_input` and returns a `FlexibleLayout`.
d14fe3ffed/torch/_inductor/ir.py (L4722)

When checking its stride, this `FlexibleLayout` indeed satisfies `input.get_stride()[-1] == 1` but it is later decided as a `FixedLayout` with `size = (3072, 196), stride = (1, 3072)`, which is not supported by the GEMM template, thus causing accuracy issue in this model.
The `FlexibleLayout` is converted to `FixedLayout` during [CppPackedGemmTemplate.add_choices](d14fe3ffed/torch/_inductor/mkldnn_lowerings.py (L1051)) which calls [slice_nd](d14fe3ffed/torch/_inductor/codegen/cpp_template_kernel.py (L150)) when rendering the kernel (`slice_nd(X)`). When creating the `SliceView` IR, [as_storage_and_layout](d14fe3ffed/torch/_inductor/ir.py (L2288)) invokes
[decide_layout](d14fe3ffed/torch/_inductor/ir.py (L2135)) and converts it to a `FixedLayout` with `size = (3072, 196), stride = (1, 3072)`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134982
Approved by: https://github.com/jgong5, https://github.com/leslie-fang-intel, https://github.com/jansel
2024-09-10 02:47:38 +00:00
a71e5509bc [inductor]Add profiler to operatorbench (#135515)
Add profiling to operatorbench. The new argument `--profile` is added and the profiling trace is like the following figure.
<img width="954" alt="image" src="https://github.com/user-attachments/assets/5b00d6e3-4905-4a77-a5e9-9f62620a5fd5">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135515
Approved by: https://github.com/shunting314
2024-09-10 02:33:30 +00:00
136e28f616 Enable forward AD in functional.affine_grid (#135494)
Fixes #121411
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135494
Approved by: https://github.com/zou3519, https://github.com/soulitzer
2024-09-10 00:07:07 +00:00
39a61795e3 remove amax_ptr from scaled_gemm (#135421)
amax was removed from _scaled_mm by #128683. Remove it from the internal at::cuda::blas::scaled_gemm, as well.  This allows hipBLASLt to find additional solutions rather than forcing amax to be used and then discarding the result.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135421
Approved by: https://github.com/drisspg, https://github.com/eqy
2024-09-09 23:04:36 +00:00
b4feec9782 [xplat][XNNPACK] don't prefer static linkage in xplat for main target (#135529)
Building XNNPACK as a static library has some issues because of multiple global params floating around.

Let's try to get rid of it in xplat and see how it fares.

Differential Revision: [D60776152](https://our.internmc.facebook.com/intern/diff/D60776152/)

**NOTE FOR REVIEWERS**: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D60776152/)!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135529
Approved by: https://github.com/kimishpatel, https://github.com/mcr229, https://github.com/kirklandsign
2024-09-09 22:47:01 +00:00
d81731615f [Dynamo] Adding CallFunctionNoArgsSource and (#135425)
CallFunctionNoArgsGuardAccessor to support torch.cuda.current_device()

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135425
Approved by: https://github.com/anijain2305
2024-09-09 22:46:00 +00:00
e2f9a83b85 [ONNX] Drop final None values as inputs for nodes in exporter graph (#135520)
When value for an optional input is not provided, it is defaulted to `None`, which gets translates to "" in the onnx graph. To avoid this, if we have a list of inputs and the final few are all `None`, strip them in the graph.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135520
Approved by: https://github.com/justinchuby

Co-authored-by: Justin Chu <justinchuby@users.noreply.github.com>
2024-09-09 22:28:41 +00:00
70a65a8bd5 Revert "NJT <-> padded dense conversions (#125947)"
This reverts commit 09a5e88bef04d5485b70d8f65f46a675aaa52942.

Reverted https://github.com/pytorch/pytorch/pull/125947 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it is failing dynamo test 09a5e88bef, maybe a landrace ([comment](https://github.com/pytorch/pytorch/pull/125947#issuecomment-2339228570))
2024-09-09 22:01:09 +00:00
689d278543 Revert "Add __init__.py to shape inference folder. (#135461)"
This reverts commit dced0d6d9f05f0962f74a3c6227f774111c15715.

Reverted https://github.com/pytorch/pytorch/pull/135461 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it exposes some public function without appropriate doc. I will reopen the issue with hi-prio so that it can be fixed properly ([comment](https://github.com/pytorch/pytorch/pull/135461#issuecomment-2339218382))
2024-09-09 21:55:13 +00:00
9b764491e3 Use upload-artifact@v4.4.0 for create_release.yml (#135528)
Fixes failure: https://github.com/pytorch/pytorch/actions/runs/10780281005/job/29895846007

Due broken sync
```
actions/upload-artifact@v2
and
actions/download-artifact@v4.1.7
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135528
Approved by: https://github.com/kit1980, https://github.com/malfet
2024-09-09 20:48:52 +00:00
cbc6b30a24 Fix broken E2E tests on Linux machines (#135394)
Summary:
I'm not entirely sure why this is failing with an `ImportError` (according to lastnameye a super class of `ModuleNotFoundError`s), but on our E2E tests on Linux machines (but not Macs?), we're seeing the import failure not getting caught --
`ImportError: cannot import name 'parutil' from 'libfb.py' (/data/sandcastle/boxes/eden-trunk-hg-full-fbsource/buck-out/v2/gen/fbsource/d0c916ec8d40ce11/arvr/libraries/ctrl/studies/replay/__ctrl-r__/ctrl-r#link-tree/libfb/py/__init__.py)` from this test run https://www.internalfb.com/sandcastle/workflow/2522015791331601269, an instance of this job:  https://www.internalfb.com/intern/test/844425085172858?ref_report_id=0 is the overall job

Test Plan:
`arc skycastle schedule tools/skycastle/workflows2/ctrl/js_tests.sky:test_js_e2e_replay_tests --sandcastle-spec-overrides '{"type": "fbcode", "unicastle_size": "I1_MEDIUM"}'`
->
https://www.internalfb.com/sandcastle/workflow/256705178764255769

Differential Revision: D62321167

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135394
Approved by: https://github.com/laithsakka
2024-09-09 20:18:08 +00:00
5b368de7f7 Revert "[ONNX] Update fake mode usage in onnx docs (#135512)"
This reverts commit a13c118994b4f118388d97a35abcb91a396cd437.

Reverted https://github.com/pytorch/pytorch/pull/135512 on behalf of https://github.com/davidberard98 due to failing test  https://github.com/pytorch/pytorch/actions/runs/10778813316/job/29891679127 ([comment](https://github.com/pytorch/pytorch/pull/135512#issuecomment-2338999090))
2024-09-09 20:15:12 +00:00
09a5e88bef NJT <-> padded dense conversions (#125947)
This PR:
* Implements the pre-existing `nt.to_padded_tensor(padding_val)` ATen op via the FBGEMM kernel + appropriate view gymnastics (since that kernel only handles 2D values)
* Introduces a new `_nested_from_padded_tensor` op for the reverse conversion, implemented via the reverse FBGEMM kernel + view gymnastics
    * Note: there is currently no public API for this; design booted to a future PR

TODO:
* ~~Propagate min / max sequence length via the new factory function `_nested_from_padded_tensor`~~
* ~~Verify that Inductor does computation fusion via test logic~~

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125947
Approved by: https://github.com/soulitzer
2024-09-09 19:37:32 +00:00
a4e6a0b240 [split build] move periodic split builds into own concurrency group (#135510)
To avoid nightly workflows cancelling each other
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135510
Approved by: https://github.com/clee2000, https://github.com/huydhn, https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2024-09-09 19:35:57 +00:00
4ab232d0c4 Fix symbolic number's type and tensor's dtype mismatch bug in Tensor ctor (#135433)
Fixes #135432

In the current implementation, if we try to store a symbolic number in Tensor's constructor, it assumes that the tensor's dtype and the symbolic number's type are matched, which is not the case.

In other words, if we try to store a `SymInt`, current implementation assumes tensor's dtype is `torch.int32`, `torch.int64` or something. And if we try to store a `SymFloat`, it assumes tensor's dtype is `torch.float32` or `torch.float64`. However, the tensor's dtype could also be `torch.float32` or something else when we try to store `SymInt`, which would be wrong.

This PR stores symbolic numbers by tensor's scalar type by wrapping `SymInt` and `SymFoat`'s guarded number into a PyObject.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135433
Approved by: https://github.com/ezyang
2024-09-09 19:32:18 +00:00
2032f107d7 Don't try to tag s390x docker images (#135509)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135509
Approved by: https://github.com/atalman
2024-09-09 19:07:48 +00:00
5f7d956362 Fix bugs blocking flipping the default layout constraint for custom ops (#135391)
Fixes two things:
- For regular PyTorch ops, the default layout constraint tag is always
flexible_layout. This was a bug with #135238
- Mark the new quantized _wrapped_linear_prepack ops as flexible_layout.
  The metas for these are incorrect, I didn't want to fix them (and
  changing the default requires the metas actually be correct).

Test Plan:
- The next PR up in the stack. The PRs are split because the next one is
  riskier.

foo

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135391
Approved by: https://github.com/albanD
2024-09-09 18:24:21 +00:00
a13c118994 [ONNX] Update fake mode usage in onnx docs (#135512)
Update fake mode usage in onnx docs
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135512
Approved by: https://github.com/justinchuby
2024-09-09 18:10:37 +00:00
21241bfeee [CP] Extend CP to support load-balancing shards (#132442)
This PR extends the current ring attention to support load-balancing shards -- the context/sequence is divided into `2 * world_size` shards and each rank gets `rank` and `(world_size * 2 - rank - 1)` shards. The data re-shuffling is done in the `context_parallel` API.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132442
Approved by: https://github.com/wconstab
2024-09-09 18:04:38 +00:00
73a6fc6e30 Revert "[Inductor] Make static_input_idxs a set for faster lookup (#135314)"
This reverts commit 011cae9570fb3c44b7f6f0c8004c470579ed21da.

Reverted https://github.com/pytorch/pytorch/pull/135314 on behalf of https://github.com/ZainRizvi due to Lint is failing on this file in trunk. See [GH job link](https://github.com/pytorch/pytorch/actions/runs/10777258770/job/29885960050) [HUD commit link](011cae9570) ([comment](https://github.com/pytorch/pytorch/pull/135314#issuecomment-2338678219))
2024-09-09 17:33:01 +00:00
09287e3af4 [MPS] Add regression test for fft.fftfreq (#135440)
The issue reported in #135223 was already solved in #128393. This PR adds a regression test for it.

Fixes #135223

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135440
Approved by: https://github.com/ezyang
2024-09-09 17:12:36 +00:00
16c3b8f87c [AOTI] Fix assert_function call in cpu autotune template (#135086)
Summary: In the ABI-compatible mode, assert_function should be AOTI_TORCH_CHECK.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135086
Approved by: https://github.com/chenyang78, https://github.com/angelayi
ghstack dependencies: #134857
2024-09-09 16:54:12 +00:00
9c6dff4941 [AOTI] Add C shim for aten.mkldnn_rnn_layer in cpp wrapper (#134857)
Summary: Support aten.mkldnn_rnn_layer in the ABI-compatible mode. Because aten.mkldnn_rnn_layer is an aten op, it is easier to add a C shim function for it.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134857
Approved by: https://github.com/angelayi
2024-09-09 16:54:12 +00:00
0eb425a563 [Release] Apply Release changes scripts after release 2.4 (#135495)
Based on additional changes required for https://github.com/pytorch/pytorch/pull/128347
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135495
Approved by: https://github.com/kit1980
2024-09-09 16:49:04 +00:00
011cae9570 [Inductor] Make static_input_idxs a set for faster lookup (#135314)
`static_input_idxs` is only used for lookups. With large models, this is a large list. This takes over a millisecond in some cases.

Profile before change:
<img width="824" alt="image" src="https://github.com/user-attachments/assets/002a0775-fd2f-4d27-8cf2-812b502d7d5e">

Profile after change: gaps are smaller, 1ms speedup before launching the cuda graph
<img width="794" alt="image" src="https://github.com/user-attachments/assets/12a0a0b9-2cc1-4d53-ac87-9bd5140a46f5">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135314
Approved by: https://github.com/oulgen
2024-09-09 16:24:58 +00:00
dfb2b661f7 Use float data type for Half var_sum in batchnorm stats updating on CPU (#126525)
Using float data type for Half `var_sum` in batchnorm stats updating on CPU to avoid `var_sum` overflow since the representation range of Half is small.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126525
Approved by: https://github.com/jgong5, https://github.com/peterbell10
2024-09-09 15:31:38 +00:00
5a69e0ebbe [MPS] Update decorator comments with issue ref (#135448)
Updating the comments with references to better places for context now that the bugs have been identified.

xref #135442 #135447 #134184

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135448
Approved by: https://github.com/ezyang
2024-09-09 15:18:52 +00:00
5e145861f2 [ONNX] Improves documentation of ONNX exporter (#135372)
The PR updates the documentation to reflect the changes introduced in pytorch 2.5 and related to onnx exporter.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135372
Approved by: https://github.com/justinchuby

Co-authored-by: Justin Chu <justinchuby@users.noreply.github.com>
2024-09-09 15:09:01 +00:00
c35b953531 Fix wrong error msg (#135423)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135423
Approved by: https://github.com/ezyang
2024-09-09 13:28:31 +00:00
dced0d6d9f Add __init__.py to shape inference folder. (#135461)
Fixes #135196

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135461
Approved by: https://github.com/ezyang
2024-09-09 13:27:58 +00:00
c0436c5701 [inductor][cpp][gemm] fix perf regression xcit_large_24_p8_224 (#134686) (#135438)
Fix #134686.

PR https://github.com/pytorch/pytorch/pull/132729 makes GEMM template faster for one of the GEMMs in xcit_large_24_p8_224:
SingleProcess AUTOTUNE benchmarking takes 1.7088 seconds and 1.9207 seconds precompiling
AUTOTUNE linear_unary(12544x3072, 768x3072, 768)
  cpp_packed_gemm_2 2.9371 ms 100.0%
  _linear_pointwise 3.1584 ms 93.0%

But it is slower than Aten in the e2e run due to different cache behavior. The access to the input data (12544x3072) is LLC latency bound and bottlenecks seen due to the memory synchronization (data transfers and coherence updates across processors). This PR tries to mitigate the problem by cooperatively loading different chunks of input data from different processors that share the input data.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135438
Approved by: https://github.com/leslie-fang-intel
2024-09-09 05:16:02 +00:00
cyy
60e8dc4374 Check function declarations in Caffe2 code (#134925)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134925
Approved by: https://github.com/ezyang
2024-09-09 05:03:29 +00:00
e6c3f58584 Fix example: Address broadcasting error in the addition of `attn_bias… (#135427)
…` and `attn_mask`, and correct device assignment for newly created variables in the method.

Fix example: Address broadcasting error in the addition of `attn_bias` and `attn_mask`, and correct device assignment for newly created variables in the method.

1. Adding `attn_bias += attn_mask` results in a broadcasting error. The expected shape of `attn_bias` is (L, S), so the output should also have the shape (L, S). However, when the input shape is (N, num_heads, L, S), broadcasting occurs, leading to an output shape of (N, num_heads, L, S), which is not desired.
2. `attn_bias` is a newly created variable within the method, but it is not assigned to the correct device.

**This is my retry of PR #130209 . The PR has been merged into commit `d4a79d4a7c746068d25fe5cf9333495561f4ce1f`, but the modifications were overwritten by subsequent commits.**

Co-authored-by: mikaylagawarecki <mikaylagawarecki@gmail.com>
@mikaylagawarecki  provided a more elegant implementation.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135427
Approved by: https://github.com/ezyang
2024-09-09 03:47:34 +00:00
90e12cf63d Fix return type of nansum example. (#135435)
One of the examples in the documentation of `torch.nansum` contains a wrong return type. This fixes it.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135435
Approved by: https://github.com/ezyang
2024-09-09 03:34:52 +00:00
44c08f4984 [Partitioner] Query whether nodes exist in graph faster (#135316)
Find node if exist in graph.nodes (linked list) take too long time. Using graph._find_nodes_lookup_table (hash table) instead to speed up.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135316
Approved by: https://github.com/ezyang
2024-09-09 03:34:02 +00:00
b6186353c6 enable lazy_init for hpu (#135203)
enables lazy_init for hpu device
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135203
Approved by: https://github.com/ezyang
2024-09-09 03:32:20 +00:00
b7eb7256fb docs: torch.nn.utils.rnn.pack_padded_sequence: docs improve (#135417)
docs: `torch.nn.utils.rnn.pack_padded_sequence`: docs improve

/cc @mikaylagawarecki
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135417
Approved by: https://github.com/ezyang
2024-09-09 03:16:11 +00:00
c1ae78be92 [inductor] calibration inductor windows uts (18/N) (#135449)
skip test_quantized_* UTs of `test/inductor/test_cpu_select_algorithm.py`.
Windows inductor don't support quantize so far.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135449
Approved by: https://github.com/ezyang
2024-09-09 03:10:54 +00:00
defb515306 [NJT]Add permute ops support (#135336)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135336
Approved by: https://github.com/davidberard98
2024-09-08 21:00:41 +00:00
31c4e0d37d [inductor] Cleanup analysis done at lowering time (#135412)
Before this we would take multiple passes over the body of each IRNode as we did lowering.  This combines most analysis into `OpCounterCSE` so it can be done in a single pass.

Before:
![image](https://github.com/user-attachments/assets/0047db09-4258-4491-a9a6-b078e183092a)

After:
![image](https://github.com/user-attachments/assets/1e03adcb-8303-4bb1-8bbb-cc42dacd44d7)

This stack:
![image](https://github.com/user-attachments/assets/d6b50b24-c30c-4d23-8b1a-344b3ba65d7a)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135412
Approved by: https://github.com/oulgen
ghstack dependencies: #135286, #135306, #135377, #135400
2024-09-08 18:02:36 +00:00
53290ca00b [inductor] Refactor BaseSchedulerNode.__init__ (#135400)
Might be a small compile time improvement since we remove a call to extract_read_writes().

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135400
Approved by: https://github.com/oulgen
ghstack dependencies: #135286, #135306, #135377
2024-09-08 18:02:36 +00:00
16f5155992 [inductor] Fast path for extract_read_writes without tracing (#135377)
Before (bottom of stack):
![image](https://github.com/user-attachments/assets/13060ff9-b31d-42a9-8e8f-c50b2bf3dc2f)

After (this PR):
![image](https://github.com/user-attachments/assets/7d190821-b614-46b7-9e9e-9087443df654)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135377
Approved by: https://github.com/oulgen
ghstack dependencies: #135286, #135306
2024-09-08 18:02:32 +00:00
37144be03d [inductor] Remove ReadWrites.op_counts (#135306)
This was (almost) unused.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135306
Approved by: https://github.com/oulgen
ghstack dependencies: #135286
2024-09-08 18:02:28 +00:00
3bdc54ed18 [inductor] Refactor LoopBody.memory_usage (#135286)
This is preparing for some other changes where I speed up extract_read_writes tracing.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135286
Approved by: https://github.com/oulgen
2024-09-08 18:02:24 +00:00
cyy
2196f32475 [22/N] Fix clang-tidy warnings in jit (#135319)
Follows #134537
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135319
Approved by: https://github.com/titaiwangms
2024-09-08 17:18:29 +00:00
cfc227ad43 [reland][dtensor] move DTensor to public namespace (#134203)
reland of https://github.com/pytorch/pytorch/pull/133113

I have to create a new PR because the previous reverted PR could not either be rebased, or imported successfully :(

----

Moving DTensor to be in the public namespace, to formally add the documentation page that includes all the public APIs. This includes:

* many path renames and path import fixes
* a dedicated doc page without too much content yet (adding in the next PRs)
* To preserve the BC for users still using the torch.distributed._tensor, I added a shim script to redirect old path calls to the new module

The BC preserving is evidented by the fact that all DTensor tests are still working without changing the public imports. So it's safe to land the changes

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134203
Approved by: https://github.com/tianyu-l
2024-09-08 17:08:40 +00:00
20cab91a12 [dynamo] Remove skip from jit freeze tests (#135281)
Fixes https://github.com/pytorch/pytorch/issues/119781
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135281
Approved by: https://github.com/zou3519
2024-09-08 15:11:12 +00:00
a6fae2e811 Use BRGEMM for Half flash attention forward kernel (#131879)
Use oneDNN BRGEMM on packed data to get better performance on the 5th generation of Xeon where Intel® Advanced Matrix Extensions (AMX) will have fp16 support, e.g. amx-fp16.
Multiple models have achieved acceleration, for instance, FP16 stable diffusion v2.1 has achieved over 50% improvement.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131879
Approved by: https://github.com/jgong5, https://github.com/peterbell10
ghstack dependencies: #131878
2024-09-08 12:32:23 +00:00
042f2f7746 [ONNX] Re-raise the exception if the dynamic shapes cannot be refined (#135418)
Improve error reporting. Otherwise users will just see not being able to refine shapes most of the time.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135418
Approved by: https://github.com/titaiwangms
2024-09-08 05:30:34 +00:00
fd494dd426 Change wrapped_linear_prepack and wrapped_quantized_linear_prepacked to private by adding _ as prefix (#135401)
Summary: In https://github.com/pytorch/pytorch/pull/134232, we added two new ops wrapped_linear_prepack and wrapped_quantized_linear_prepacked. From the review comments and offline discussion, we are changing them to private by adding `_` as prefix

Differential Revision: D62325142

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135401
Approved by: https://github.com/houseroad
2024-09-08 04:16:24 +00:00
8334cb2fb9 remove commented out breakpoints (#135363)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135363
Approved by: https://github.com/oulgen
2024-09-08 02:15:45 +00:00
e72ed4717e [Dynamo] Fix Huggingface PretrainedConfig get non const attr (#135413)
Fixes #135329

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135413
Approved by: https://github.com/anijain2305
2024-09-07 19:16:29 +00:00
3bebc09be9 [FlexAttention] Align the matmul tensorcore usage (#135168)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135168
Approved by: https://github.com/Chillee
2024-09-07 16:33:41 +00:00
a2db22e6bb [inductor] Catch BrokenProcessPool and print a more helpful message. (#135120)
Summary: BrokenProcessPool means a parallel-compile subprocess exited, which we never expect. It's likely due to a crash, so print a more meaningful error message and instructions that it's probably easier to debug by turning off parallel compile. Output looks like:
```
...
  File "/data/users/slarsen/pytorch/torch/_inductor/runtime/compile_tasks.py", line 45, in _reload_python_module
    exec(code, mod.__dict__, mod.__dict__)
  File "/tmp/torchinductor_slarsen/4q/c4qw7xk5lbb7whg5txnk4hwbc7z6kepak3o666tr3d64gcad5r5b.py", line 815, in <module>
    async_compile.wait(globals())
  File "/data/users/slarsen/pytorch/torch/_inductor/async_compile.py", line 265, in wait
    raise RuntimeError(
RuntimeError: A compilation subprocess exited unexpectedly. This is likely due to a crash. To facilitate debugging, you can re-run with TORCHINDUCTOR_COMPILE_THREADS=1 to cause compilation to occur in the main process.
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135120
Approved by: https://github.com/Chillee
2024-09-07 16:33:37 +00:00
eac5e12548 [inductor] Move LoopBody to its own file (#135257)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135257
Approved by: https://github.com/oulgen
2024-09-07 16:29:15 +00:00
18479c5f70 [Doc] update max-autotune for CPU (#134986)
The current doc for `max-autotune` is applicable only for GPU. This PR adds the corresponding content for CPU.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134986
Approved by: https://github.com/jgong5, https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2024-09-07 13:42:40 +00:00
f7c0c06692 Add oneDNN BRGEMM support on CPU (#131878)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131878
Approved by: https://github.com/jgong5, https://github.com/peterbell10
2024-09-07 13:22:30 +00:00
b53d97c7be [Intel GPU] Add XPU memory-related APIs (#129919)
# Motivation
According to https://github.com/pytorch/pytorch/issues/116322, we will help unify the device allocator. So we introduce a simple xpu device allocator only with the key functionality first. And expect to add some memory statistics-related functionality after the unification.
But now, some memory statistic-related APIs listed in https://github.com/pytorch/pytorch/issues/127929 are requested. We need more time to unify the device allocator. In order to facilitate the user experience, we expect to support these memory statistic-related APIs before the unification.

# Additional Context
Fixes: #127929

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129919
Approved by: https://github.com/dvrogozh, https://github.com/abhilash1910, https://github.com/gujinghui, https://github.com/EikanWang, https://github.com/albanD
ghstack dependencies: #130923
2024-09-07 11:15:17 +00:00
6c1da66407 [Reland] Refactor caching device allocator utils (#130923)
# Motivation
Following [[RFC] Intel GPU Runtime Upstreaming for Allocator ](https://github.com/pytorch/pytorch/issues/116322), this PR aims to refactor caching device allocator utils to improve code reuse usage.
This is the first PR, we could prepare some follow-up PRs continuing to refactor the device caching allocator.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130923
Approved by: https://github.com/EikanWang, https://github.com/gujinghui, https://github.com/albanD, https://github.com/eqy
2024-09-07 11:14:17 +00:00
d7c97e7245 [inductor][cpp][gemm] cache blocking config for dynamic shapes (#133538)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133538
Approved by: https://github.com/leslie-fang-intel
ghstack dependencies: #135277, #133447

Co-authored-by: Wu, Chunyuan <chunyuan.wu@intel.com>
2024-09-07 11:09:30 +00:00
be9f4ffe88 [inductor][cpp][gemm] enable dynamic M for k-slicing (#133447)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133447
Approved by: https://github.com/leslie-fang-intel
ghstack dependencies: #135277

Co-authored-by: Wu, Chunyuan <chunyuan.wu@intel.com>
2024-09-07 11:09:30 +00:00
692faa9bc6 [inductor][cpp][gemm] reduce memory alloc overhead by allocating local acc once per thread (#135277)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135277
Approved by: https://github.com/leslie-fang-intel

Co-authored-by: Wu, Chunyuan <chunyuan.wu@intel.com>
2024-09-07 11:09:25 +00:00
32f3af72b7 [ONNX] Support FakeTensor in ONNXProgram (#135399)
Sync with https://github.com/justinchuby/torch-onnx/compare/v0.1.20...v0.1.21 to support FakeTensors in ONNXProgram. Specifically, this PR implements the `apply_weights` method to allow users to supply a dictionary of concrete tensors to replace FakeTensors in the exported model weights.

An error is raised when users try to serialize a FakeTensor to avoid segfaults.

Also fixed a bug in `.save()` when `keep_initializers_as_inputs` is True and `include_initializers` is False.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135399
Approved by: https://github.com/titaiwangms
2024-09-07 04:48:18 +00:00
ebab5c85c4 [FlexAttention] Skip very small block size unit tests on H100 due to Triton bug (#135393)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135393
Approved by: https://github.com/BoyuanFeng
2024-09-07 04:35:22 +00:00
3d734d837b [ONNX] Handle mixed sequence inputs properly (#135378)
Previously, when an input contains a mixture of `Value` and python constants like `[SymbolicTensor('sym_size_int_3', type=Tensor(INT64), shape=[], producer=node_Shape_0, index=0), 512]`, we get errors like

```pytb
Traceback (most recent call last):
  File "/Users/justinc/Documents/GitHub/torch-onnx/src/torch_onnx/_building.py", line 367, in _call_op
    converted_named_inputs = _process_python_constants_and_sequences(
                             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/justinc/Documents/GitHub/torch-onnx/src/torch_onnx/_building.py", line 275, in _process_python_constants_and_sequences
    raise TypeError(
TypeError: Constant input '[SymbolicTensor('sym_size_int_3', type=Tensor(INT64), shape=[], producer=node_Shape_0, index=0), 512]' of type '<class 'list'>' is not supported
```

This PR updates Sequence handling to support this case, as well as variadic inputs and ONNX Sequence inputs.

Synced from https://github.com/justinchuby/torch-onnx/pull/187
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135378
Approved by: https://github.com/titaiwangms
2024-09-07 03:07:39 +00:00
c92227c41a [quant][pt2e] fix placeholder typo and related quantization tests (#135379)
A previous typo on "placeholder" and related tests in quantization are fixed.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135379
Approved by: https://github.com/jerryzh168
2024-09-07 02:31:43 +00:00
e6a0221fc6 [Inductor] Optionally allow padding on non-GPU devices (#135280)
This is the OSS component of a larger MTIA diff.

Currently, Inductor disables padding for non-GPU devices. We need to change this behavior to enable padding on MTIA.

This PR adds a config option to enable padding on the CPU, or any other non-GPU device. In the future, we might want to enable padding on all devices by default. However, that might require supporting device-dependent padding defaults, since CPUs will likely use different settings than H100 GPUs.

Differential Revision: D61038114

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135280
Approved by: https://github.com/jfix71, https://github.com/shunting314
2024-09-07 02:19:14 +00:00
a6b9d444fb [ONNX] Refactor exporter errors (#135180)
Refactor exporter errors to combine old errors and new errors for API consistency.

This PR also

1. Removes the `_C._check_onnx_proto(proto)` call in the old exporter. We don't need the ONNX checker because it is limited.
2. Removes the `OnnxExporterError` defined in the dynamo module. This class unnecessarily stores the onnx program object, making it very bulky. Instead, we revert to use the plain OnnxExporterError defined in the `errors` module and use it as the base class for all errors.
3. Continues to expose `OnnxExporterError` in `torch.onnx` and the rest of the errors in `torch.onnx.errors`.
4. Removes the `CheckerError` and `InvalidExportOptionsError` from `torch.onnx`. This is BC breaking but should have low impact.
5. I did not rename existing errors out of compatibility considerations, even though `ExporterError` would have been more succinct.

Fixes https://github.com/pytorch/pytorch/issues/135125
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135180
Approved by: https://github.com/titaiwangms
2024-09-07 00:50:15 +00:00
d42b0c8f22 Add release matrix for 2.5 (#135383)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135383
Approved by: https://github.com/huydhn
2024-09-07 00:49:53 +00:00
941d094dd1 [Dynamo][DTensor] Fixes SymNodeVariable() is not a constant error in Compiled DDP + TP unit test (#135315)
Before the fix, the unit test will fail at forward Dynamo tracing:
```
  File "/data/users/willfeng/pytorch/test/distributed/_composable/test_replicate_with_compiler.py", line 415, in test_ddp_tp
    loss = compiled_replicate_model(data).sum()
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
...
torch._dynamo.exc.InternalTorchDynamoError: SymNodeVariable() is not a constant

from user code:
   File "/data/users/willfeng/pytorch/torch/distributed/tensor/parallel/_data_parallel_utils.py", line 34, in _unflatten_tensor
    result = DTensor.from_local(
```
After the fix, the compilation fails at a later step (Compiled Autograd tracing), due to needing "pre-dispatch tracing of backward graph" feature (see details at https://github.com/pytorch/pytorch/issues/127797#issuecomment-2291695474).

I believe this PR is a net improvement, because it should also fix the 1D Traceable FSDP2 failure case on internal models (https://github.com/pytorch/pytorch/issues/130978#issuecomment-2319476690), which is much harder to build a minimal unit test for.

Fixes https://github.com/pytorch/pytorch/issues/130978.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135315
Approved by: https://github.com/bdhirsh
2024-09-07 00:11:25 +00:00
b1a934741e Change test_constant_prop_preserve_metadata (#135268)
Summary: In new export_for_training, "stack_trace" does not exist in node meta anymore.

Test Plan:
```
buck run fbcode//mode/dev-nosan fbcode//caffe2/test:quantization_pt2e -- -r test_constant_prop_preserve_metadata
```

Reviewed By: angelayi

Differential Revision: D62219974

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135268
Approved by: https://github.com/angelayi
2024-09-07 00:02:35 +00:00
0c661f3e1a [Split Build] Refactor split build binary builds into their own workflows and move split build binary builds to periodic (#134624)
As we need to move split build binary tests from trunk to periodic this pr, refactors those jobs out into its own workflow to achieve this.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134624
Approved by: https://github.com/malfet
2024-09-06 23:57:56 +00:00
2c7e314803 [Inductor][CPP] Fix the issue of view dtype (#135301)
**Summary**
Fix issue: https://github.com/pytorch/pytorch/issues/135160, it's a regression introduced by https://github.com/pytorch/pytorch/pull/134569, where the dtype of `to_dtype_bitcast` was incorrectly handled when using the scalarize implementation.

**TestPlan**
```
python -u -m pytest -s -v test/inductor/test_cpu_repro.py -k test_view_dtype
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135301
Approved by: https://github.com/jgong5, https://github.com/jansel
2024-09-06 23:36:44 +00:00
ead4407f57 [inductor] Fix loop split optimization (#135303)
Fix https://github.com/pytorch/pytorch/issues/135274.

Improve the check whether the div expr matches: add a check whether `split_var` is in `original_body.iter_vars`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135303
Approved by: https://github.com/jgong5, https://github.com/leslie-fang-intel, https://github.com/jansel
2024-09-06 23:06:25 +00:00
2f5b40c099 [aoti test] Disable FP8 funz dtypes in fp8 runtime check test (#135373)
Fixing https://github.com/pytorch/pytorch/issues/126734

Key is the funz FP8 types are for AMD only.

source: https://github.com/openxla/stablehlo/blob/main/rfcs/20230321-fp8_fnuz.md

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135373
Approved by: https://github.com/chenyang78
2024-09-06 23:05:47 +00:00
993b5647ab [export] fix placeholder name collision tests by removing map call (#135366)
The current test is failing because of the current unstable state of map. torch.compile and non-strict export are taking two seperate routes unlike cond and while_loop. This pr fix the test it self. We'll fix map in follow up PRs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135366
Approved by: https://github.com/angelayi
2024-09-06 22:02:50 +00:00
2ab26806f1 Require tlparse for failing tests in test_structured_trace.py (#135376)
Summary: These tests are currently failing internally. Per discussion, skip if tlparse is unavailable

Test Plan:
```
feature remove tlparse
buck2 test 'fbcode//mode/opt' fbcode//caffe2/test/dynamo:test_dynamo -- --run-disabled --regex test_structured_trace.py
feature install tlparse
buck2 test 'fbcode//mode/opt' fbcode//caffe2/test/dynamo:test_dynamo -- --run-disabled --regex test_structured_trace.py
```

Differential Revision: D62310342

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135376
Approved by: https://github.com/ezyang
2024-09-06 21:53:41 +00:00
b1612569f6 [BE] Clarify defaulting behavior in optimizer (#135384)
Fixes #135340

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135384
Approved by: https://github.com/drisspg, https://github.com/jainapurva
2024-09-06 21:52:55 +00:00
dc0e818738 [FR] Automatically infer a common filename prefix (#135158)
Save the annoyance of specifying this on the command line each time
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135158
Approved by: https://github.com/fduwjj, https://github.com/c-p-i-o
ghstack dependencies: #135157
2024-09-06 21:44:27 +00:00
06e414d7fe [FR] Make trace_dir a required argument (#135157)
Ensures users get a clean error if they forget to specify the dir, and
improves the help message.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135157
Approved by: https://github.com/c-p-i-o, https://github.com/fduwjj
2024-09-06 21:44:27 +00:00
a681260caf Revert "[ONNX] Refactor exporter errors (#135180)"
This reverts commit 5eebd9315a72422d59b6f8d8ca8e4e573e231d5c.

Reverted https://github.com/pytorch/pytorch/pull/135180 on behalf of https://github.com/clee2000 due to I think this broke test_public_bindings.py::TestPublicBindings::test_correct_module_names [GH job link](https://github.com/pytorch/pytorch/actions/runs/10743909338/job/29800779403) [HUD commit link](5eebd9315a), possibly a landrace with the PR that landed before it ([comment](https://github.com/pytorch/pytorch/pull/135180#issuecomment-2334844191))
2024-09-06 21:39:18 +00:00
95e976a63f [dynamo] recursively skip frames when Dynamo cache limit is hit (#135144)
Fixes https://github.com/pytorch/pytorch/pull/135144 and [T197117723](https://www.internalfb.com/intern/tasks/?t=197117723).

In general, adds `SkipCodeRecursiveException` to Dynamo - when raised in Dynamo, convert_frame will return a `skip_code_recursive_flag` back to C Dynamo, signaling it to skip the current frame and all recursive calls.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135144
Approved by: https://github.com/jansel, https://github.com/anijain2305
2024-09-06 21:38:53 +00:00
306ac44eaa [ez][TD] Fix request for issue body returns None (#135389)
I assumed it would be empty string if the body is empty, but its just None
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135389
Approved by: https://github.com/malfet
2024-09-06 21:02:01 +00:00
a7643baceb Revert expectFailureIf condition on tests with torch.compile on Windows (#134759)
Fixes #134716

This PR reverts some changes introduced in 6eae569546 (#133987)

torch.compile is not available on Windows, tests should be expected to fail.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134759
Approved by: https://github.com/malfet
2024-09-06 20:51:55 +00:00
a4030e37be [dynamo] reland map/zip iterator related changes (#135074)
Differential Revision: [D62211019](https://our.internmc.facebook.com/intern/diff/D62211019)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135074
Approved by: https://github.com/jansel, https://github.com/anijain2305, https://github.com/mlazos
2024-09-06 20:38:02 +00:00
22e1fb6faa [test][easy] Add debug utils for cpu select algorithm test (#135038)
Summary: Add debug utils to debug a flaky test in fbcode ci.

Some context: https://github.com/pytorch/pytorch/pull/126545

Test Plan: ci

Differential Revision: D62005445

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135038
Approved by: https://github.com/jgong5, https://github.com/XuehaiPan
2024-09-06 20:30:49 +00:00
2a4890e315 [ONNX] Clean up the missed lines from previous PRs (#135368)
Some missed deleted lines

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135368
Approved by: https://github.com/justinchuby
2024-09-06 20:27:52 +00:00
3ce433aef2 [TCPStore] use wait counters (#135283)
This replaces the existing TCPStore counters with the new shared wait counters. There's no users of the tcpstore counters so should be completely safe to remove.

Test plan:

Existing tests + build

There's no OSS backend for wait counters so can't write any tests with them currently.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135283
Approved by: https://github.com/c-p-i-o
2024-09-06 19:54:25 +00:00
7f2d20e687 Run all autograd node post hooks (#134728)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134728
Approved by: https://github.com/albanD, https://github.com/soulitzer
2024-09-06 19:44:28 +00:00
32fd29c1ea [ONNX] Properly handle Attributes in traceable functions (#135367)
Previously the attributes were sent in as Attr objects even when we call the function as a plain Python function. Turning them into python objects.

From https://github.com/justinchuby/torch-onnx/pull/186
Related https://github.com/microsoft/onnxscript/issues/1846

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135367
Approved by: https://github.com/justinchuby
2024-09-06 19:35:22 +00:00
5eebd9315a [ONNX] Refactor exporter errors (#135180)
Refactor exporter errors to combine old errors and new errors for API consistency.

This PR also

1. Removes the `_C._check_onnx_proto(proto)` call in the old exporter. We don't need the ONNX checker because it is limited.
2. Removes the `OnnxExporterError` defined in the dynamo module. This class unnecessarily stores the onnx program object, making it very bulky. Instead, we revert to use the plain OnnxExporterError defined in the `errors` module and use it as the base class for all errors.
3. Continues to expose `OnnxExporterError` in `torch.onnx` and the rest of the errors in `torch.onnx.errors`.
4. Removes the `CheckerError` and `InvalidExportOptionsError` from `torch.onnx`. This is BC breaking but should have low impact.
5. I did not rename existing errors out of compatibility considerations, even though `ExporterError` would have been more succinct.

Fixes https://github.com/pytorch/pytorch/issues/135125
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135180
Approved by: https://github.com/titaiwangms
2024-09-06 19:10:56 +00:00
a15aabc975 Add MaskedTensor passthrough: unfold, F.Unfold, F.Fold, stack (#125262)
Hi,
I noticed the `unfold` operator was missing on MaskedTensor.

I tested that my change works when calling unfold and backward on a `MaskedTensor` but I didn't find the tests for the dispatch of such operation. Where is it?
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125262
Approved by: https://github.com/cpuhrsch
2024-09-06 19:06:23 +00:00
b143426db3 [Inductor] Use argument names as the key for the constants dict and the signature dict (#135170)
Referencing how triton constructs these dictionaries

ca3fb5f6fa/python/triton/runtime/jit.py (L639)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135170
Approved by: https://github.com/htyu
2024-09-06 19:05:00 +00:00
13ba0a2e5c Run bypassed graph compile outside the except block to avoid chaining of exceptions (#135175)
Fixes #135172

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135175
Approved by: https://github.com/masnesral, https://github.com/ezyang
2024-09-06 19:03:57 +00:00
8520ce5f78 Fix incorrect trace of post-accumulate grad hook on tensor with zero dims (#135226)
Fix incorrect trace of post-accumulate grad hook on tensor with zero dimensions

Fixes #135207

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135226
Approved by: https://github.com/xmfan
2024-09-06 18:19:54 +00:00
196748d491 [elastic] support local_addr across all rendezvous impls (#135262)
Summary:
There was a regression introduced in https://github.com/pytorch/pytorch/pull/125743 that made `local_addr` no longer used. This fixes that by passing `local_addr` to `RendezvousStoreInfo.build` everywhere it's used.

This also fixes a number of tests allowing them to be run in parallel which hugely sped up the testing cycle as this change touches many different rendezvous implementations. This required a few fixes in unrelated tests.

Test Plan:
Added tests for the common rendezvous implementations that `local_addr` to prevent future regressions.

```
buck2 test @//mode/dev-nosan fbcode//caffe2/test/distributed/elastic/... fbcode//caffe2/torch/distributed/elastic/... -- --stress-runs 3
```

To vet the parallelism changes I also ran with 3 stress runs each to identify flakiness caused by parallelism.

Differential Revision: D62256407

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135262
Approved by: https://github.com/fduwjj, https://github.com/wz337
2024-09-06 17:55:43 +00:00
177e4f4218 remove _check call on item() for torch.istft (#135234)
Fixes #135014

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135234
Approved by: https://github.com/tugsbayasgalan
2024-09-06 17:31:25 +00:00
3988b3468b [aoti][easy] remove breakpoint() in wrapper.py (#134807)
Differential Revision: D61687146

Remove an unintended breakpoint in code.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134807
Approved by: https://github.com/YUNQIUGUO
2024-09-06 17:25:05 +00:00
04118d8617 [export] Record the global torch version in serialization. (#135243)
Summary: In general I think it will be useful to also record the global torch version in the EP, so that we can track them in the logging in addition to the schema version.

Test Plan: CI

Reviewed By: henryoier

Differential Revision: D62252626

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135243
Approved by: https://github.com/yushangdi
2024-09-06 17:02:06 +00:00
24482e5c68 [torch][fx] Set maximum warning count during fx.Graph.lint (#135069)
Summary:
resnet152 spent about 15 minutes writing warning messages in _unlift
during `to_executorch` because they're all written to unbuffered stderr
by the `warnings` module.

These warnings are almost always about get_attr nodes referencing a
non-existent name:
```lang=py
warnings.warn(f'Node {node} target {node.target} {atom} of {seen_qualname} does '
  'not reference an nn.Module, nn.Parameter, or buffer, which is '
  'what \'get_attr\' Nodes typically target'
)
```
I'm not aware of a way to configure the warnings module to write this out
at most once, so I'm just going to disable the lint for now.

Test Plan:
Re-ran resnet152 with Executorch and the XNNPackBackend, it is much faster now

Differential Revision: D62156090

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135069
Approved by: https://github.com/yushangdi
2024-09-06 16:41:59 +00:00
c0ec599f27 Update submodule ideep to include aarch64 change (#134897)
This PR is per ARM request, which is in https://github.com/intel/ideep/issues/334.

Context for the request is: Arm team has upstreamed the dynamic quantization changes, all the PRs were merged (torch, ideep, oneDNN), but without this ideep submodule update, the feature will not work. The change is isolated to only matmul operator and quantization path alone.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134897
Approved by: https://github.com/jgong5, https://github.com/atalman, https://github.com/snadampal
2024-09-06 16:40:26 +00:00
7074de43c0 Porting to GCC 15 (#135188)
uint8_t is found on cstdint header

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135188
Approved by: https://github.com/Skylion007
2024-09-06 16:16:53 +00:00
771dcce11d [AOTI][Tooling][6/n] Fix long dtype input tensors calling mean() in aoti_torch_print_tensor_handle (#135072)
Differential Revision: D61635232

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135072
Approved by: https://github.com/hl475, https://github.com/ColinPeppler
2024-09-06 15:59:32 +00:00
de74aafff4 error on exporting ScriptModule (#135302)
Test Plan: added test

Differential Revision: D62279179

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135302
Approved by: https://github.com/yushangdi
2024-09-06 15:12:40 +00:00
ad29a2c0dc Add Inductor config for default stride behavior (#135238)
By default, Inductor is allowed to manipulate the layout
(strides+storage offset) of input tensors to custom operators.

We want to change it so that the default is that Inductor should respect
the stride order of input tensors to custom operators.

This PR adds a config to toggle the behavior, in the next PR up we'll
change the default. We also make the following changes:
- We add a new operator Tag (flexible_layout), which means that
inductor is allowed to manipulate the layout. When we flip the default,
users can specify they want the old behavior by using this tag.

This is a reland of https://github.com/pytorch/pytorch/pull/126986,
which was previously reverted due to silent incorrectness. We've since
fixed the silent incorrectness
(https://github.com/pytorch/pytorch/pull/133639)

Test Plan:
- new test

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135238
Approved by: https://github.com/albanD
2024-09-06 14:48:24 +00:00
3a9e33dca8 [torchelastic] Don't do signal handling when off the main thread (#135088)
Summary:
In multiprocessing, signal handling is not possible if the thread is not the main thread. This resulted in the following error:
> "ValueError('signal only works in main thread of the main interpreter')"

To address this issue, the diff checks whether the thread is the main thread and, if not, skips signal handling.

Test Plan:
Before this change, MAST job failed:
https://fburl.com/mlhub/iq2m10v8

With this change, MAST job succeeded:
https://fburl.com/mlhub/q6kb8343

Differential Revision: D62166943

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135088
Approved by: https://github.com/d4l3k
2024-09-06 14:47:03 +00:00
a086882d72 [inductor][triton] mark workspace args as mutated (#134648)
SplitScan makes use of a workspace arg that needs to be zeroed before it is used - then, it is used to communicate between thread blocks during the triton kernel implementation. It is mutated during during the execution of the kernel, so it should be marked as such.

Before this PR, it is not marked as mutated; AFAIK this is fine during normal execution, but during autotuning it causes problems. The workspace starts off zeroed (as expected), but during autotuning the kernel will be executed multiple times and the workspace does not get re-set between executions, resulting in incorrect data. If the data is used for indexing, then you can fail device-side asserts (and the results after the initial run (with autotuning) could be wrong). The test added in this PR repros the issue when the fix is removed.

When we mark the arg as mutated, then the arg gets cloned before autotuning, so that the arg passed to the kernel during autotuning will always be zeroed as expected.
804852c1f9/torch/_inductor/runtime/triton_heuristics.py (L685-L689)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134648
Approved by: https://github.com/peterbell10, https://github.com/jansel
2024-09-06 14:23:37 +00:00
84ae6b7d6b AOTDispatcher: limit cases when we detach() graph inputs to non-leaves (#134193)
This PR is slightly a revival / update to the discussion from https://github.com/pytorch/pytorch/pull/98960:

Part of FSDP2's tracing strategy right now is that:

(1) it is painful/difficult to handle the case where we have multiple graph input tensors that are aliased to each other and at least one of them is duplicated

(2) we already have longstanding in logic to remove duplicate input tensors from the graph in dynamo. Morally, FSDP2 gives us duplicate input tensors in the backward graph for every `unsharded_param`, because we have (a) the `unsharded_param` being closed over by the backward hook to resize/allgather, and (b) the same `unsharded_param` being saved for backward by autograd (we now guarantee in the partitioner that we will always save the base tensor for backward and recompute views)

(3) However, we were still seeing cases where the `unsharded_param` showed up twice in the backward graph inputs, as distinct tensor objects (with different python ids) instead of being true duplicates that dynamo can de-dup.

It turns on that this was because we were `.detach()`ing the `unsharded_param` in AOTDispatcher before plumbing it through the compiled forward (and so autograd would save a detach'd version of the `unsharded_param`). This is precisely because of the logic from https://github.com/pytorch/pytorch/pull/98960.

However, re-reading the detailed comments, it seems unnecessary to do a detach() on a graph input that is a (leaf) `nn.Parameter`, even if it happens to get no gradients in the backward. Since it is a leaf, we don't have to worry about the autograd engine "continuing to backprop through the graph beyond the current tensor" (the leaf has no other grad_fn for autograd to backprop through).

So this PR makes us a bit less aggressive about calling detach() on inputs: we only do it when:

(1) our graph input statically will get a `None` gradient (and also has no metadata mutations, the existing state)

(2) **and** our graph input is a non-leaf tensor (so detach()ing is actually required to prevent autograd from incorrectly backpropping past the non-leaf.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134193
Approved by: https://github.com/yf225

Co-authored-by: Will Feng <yf225@cornell.edu>
2024-09-06 14:06:48 +00:00
60a097a071 [CD] Update binary_linux_test.sh to include calling builder smoke test (#133869)
Run smoke test

Fixes #1969

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133869
Approved by: https://github.com/atalman

Co-authored-by: Andrey Talman <atalman@fb.com>
2024-09-06 13:27:24 +00:00
13bae39e22 [inductor] [cpp] improve cache blocking for is_dynamic_M (#131306)
## Performance
Models with >= 3% performance speedup are listed below:

### AMP single-thread dynamic shape (measured on CPU with AMX support)
No regressions

| Model Family | Model Name | Speedup |
|--------------|------------|---------|
torchbench | soft_actor_critic| 3%

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131306
Approved by: https://github.com/jgong5, https://github.com/leslie-fang-intel
ghstack dependencies: #135275

Co-authored-by: Jiong Gong <jiong.gong@intel.com>
2024-09-06 13:21:24 +00:00
4ef6c05f65 [inductor][cpp][gemm] fix autotune runtime error from linear_binary fusion (#135275)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135275
Approved by: https://github.com/leslie-fang-intel
2024-09-06 13:21:23 +00:00
d6b9bd3e60 Also handle compiler collective when input variable doesn't exist on all ranks (#135147)
Internal xref:
https://fb.workplace.com/groups/3095840833991792/permalink/3810738595835342/

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135147
Approved by: https://github.com/jansel
2024-09-06 13:18:36 +00:00
d0591f4658 Ignore fresh unbacked when doing recursive make_fx inside HOPs (#135053)
Internal xref: https://fb.workplace.com/groups/6829516587176185/posts/7705964779531357/

This now also incorporates a test from https://github.com/pytorch/pytorch/pull/133585 (which it fixes) and the prep PR https://github.com/pytorch/pytorch/pull/134407 Including the PR desc from that:

I am trying to fix a problem reported by user in [fb.workplace.com/groups/6829516587176185/permalink/7705964779531357](https://fb.workplace.com/groups/6829516587176185/permalink/7705964779531357/) The summary of this problem is that when we do collect metadata analysis in AOTAutograd, we accumulate pending unbacked symbols which are going to be discarded at the end of the trace. However, if we do a recursive make_fx inside tracing, as occurs with torch.cond, we end up seeing that there are pending unbacked symbols that aren't associated with a binding, even though it's spurious (they've leaked into the inner make_fx call from the outer AOTAutograd analysis).

In https://github.com/pytorch/pytorch/pull/133588 I tried to just prevent adding the symbols to the pending list at all in the first place. But this itself caused some problems which were fixed in https://github.com/pytorch/pytorch/pull/124785 . The problem fixed in that PR is that when we allocate tangents that have unbacked size, something prevented them from having correct unbacked SymInts when ignore fresh unbacked SymInts was enabled. So I had patched it at the time by just not suppressing pending symbols and clearing them out some other way.

I think... I was wrong in that PR? That is to say, it was OK to avoid putting the fresh unbacked symbols in the pending list; the real problem was suppressing unbacked renamings. But there doesn't seem to be a good reason to suppress these; this PR shows that it doesn't actually fail any tests if you do these anyway. Intuitively, this makes sense, because you can't trigger renamings unless you're actually adding unbacked symbols to the pending set.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135053
Approved by: https://github.com/ydwu4
2024-09-06 13:13:15 +00:00
b5dea061c8 check compilation status before query cudnn version in conv (#135332)
This PR is created for fixing the https://github.com/pytorch/pytorch/issues/135322.  The cudnn compilation status should be check firstly before querying version, otherwise, conv may trigger runtimeerror before any check in other non-cuda backends.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135332
Approved by: https://github.com/EikanWang, https://github.com/atalman
2024-09-06 12:50:04 +00:00
041960a1ce [Dynamo] Automatically in-graph traceable tensor subclass ctors (#135151)
Fixes https://github.com/pytorch/pytorch/issues/114389

Previously, dynamo would attempt to trace through the `__init__` of traceable tensor subclasses, since their constructors are AOT dispatcher traceable by definition, dynamo should automatically put these in the graph like we do for any other tensors. Not doing this is difficult because dynamo would need to apply mutations post tensor subclass creation in the graph.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135151
Approved by: https://github.com/bdhirsh
2024-09-06 12:23:38 +00:00
67c7924ea1 [inductor] Fix gen_transposed_tile_load_store (#135307)
Recent PR: https://github.com/pytorch/pytorch/pull/131745 bring new VLA logical in cpp codegen. And it will raise build fail error on MSVC and error code is `Compiler Error C2131`: https://learn.microsoft.com/en-us/cpp/error-messages/compiler-errors-1/compiler-error-c2131?view=msvc-170

reproduce UT:
```cmd
pytest test\inductor\test_torchinductor_dynamic_shapes.py -v -k test_large_block_sizes_dynamic_shapes_cpu
```

Original generated code:
```c++
alignas(16) float tmp1[static_cast<int64_t>(((-256LL)*(c10::div_floor_integer(static_cast<int64_t>(ks1), static_cast<int64_t>(16LL)))) + (16LL*ks1))];
```

Changes:
allocate a large-enough fixed-sized buffer.

New genarated code:
```c++
alignas(16) float tmp1[16*16];
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135307
Approved by: https://github.com/jgong5, https://github.com/jansel
2024-09-06 10:44:08 +00:00
217ba7b2ab [Docs] Update FileCheck doc (#135199)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135199
Approved by: https://github.com/soulitzer
2024-09-06 08:18:38 +00:00
758d515d98 [Inductor][CPP] Select tiling factor for lower precision data types (#133830)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133830
Approved by: https://github.com/jgong5, https://github.com/jansel
2024-09-06 08:12:37 +00:00
60d98b4cfb Update torch-xpu-ops pin (ATen XPU implementation) (#135300)
Release cycle for PyTorch 2.5
1. Bugfixing: correct reduction logic in cdist kernel.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135300
Approved by: https://github.com/EikanWang
2024-09-06 07:30:09 +00:00
590a3e9f8a [export][training ir migration] quantized_decomposed.quantize_per_tensor decomposition (#134525)
Summary:
In graph of  TestXNNPACKQuantizer.test_dynamic_linear_with_con test, some quantized_decomposed.quantize_per_tensor.default ops are becoming quantized_decomposed.dequantize_per_tensor.tensor ops when using the new training ir.

This is because we lift params/buffers before calling make_fx. So previously, for the graph that’s passed to make_fx,`graph.L__self___linear1.weight` is a tensor
now in training ir, graph.L__self___linear1.weight is a FakeTensor. This caused the node overload to be different.

Test Plan:
```
buck2 run 'fbcode//mode/dev-nosan' fbcode//caffe2/test/quantization:test_quantization -- -r test_dynamic_linear_with_conv
```

Differential Revision: D61364547

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134525
Approved by: https://github.com/tugsbayasgalan, https://github.com/jerryzh168
2024-09-06 07:06:06 +00:00
764ee6e3f9 [FlexAttention] Specify padding_value for boundary checked loads (#134573)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134573
Approved by: https://github.com/Chillee
2024-09-06 06:47:26 +00:00
67f98a99a4 [DeviceMesh][Easy] Make RuntimeError a bit more descriptive by including the actual world_size (#135271)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135271
Approved by: https://github.com/fduwjj
2024-09-06 06:23:20 +00:00
e020a8755a [Fix][FR][ez] Remove debugging logs (#135308)
Removing the print added during debugging process.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135308
Approved by: https://github.com/wz337
2024-09-06 06:14:33 +00:00
7ffb3b201c [inductor] Remove LoopBody.reads,writes,other (#135256)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135256
Approved by: https://github.com/oulgen
ghstack dependencies: #135070, #135076, #135082, #135084, #135079, #135235
2024-09-06 06:11:55 +00:00
f946bf88c4 [inductor] Skip retracing an existing LoopBody (#135235)
This is roughly a 7% speedup in inductor compile time for hf_Bert_large.  The time spent in `LoopBody.__init__` improves from 15% to 8% of `fx_codegen_and_compile`.

Before
![image](https://github.com/user-attachments/assets/7de0f28e-35bd-472f-b4be-b52733d2a85c)

After
![image](https://github.com/user-attachments/assets/5f0cf11a-43c5-43ae-b13c-f32383a75a7f)

Overall
![image](https://github.com/user-attachments/assets/6a369d8c-fb5e-4ad2-9504-0fc745ad6568)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135235
Approved by: https://github.com/oulgen
ghstack dependencies: #135070, #135076, #135082, #135084, #135079
2024-09-06 06:11:55 +00:00
66da3b3b2a [fx] Bypass custom __setattr__ in Node.__init__ (#135079)
Before:
![image](https://github.com/user-attachments/assets/5f0a6ae6-6049-44d0-b5f2-a549a23ad97f)

After:
![image](https://github.com/user-attachments/assets/51c9f91b-f8a0-4043-8362-65813feec823)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135079
Approved by: https://github.com/oulgen
ghstack dependencies: #135070, #135076, #135082, #135084
2024-09-06 06:11:46 +00:00
41e653456e [RDP] Fix "No module named 'libfb’" (#135244)
Summary:
D62215095 Introduced an import error to arvr pipelines as the is_fbcode() function does not work as intended.

This changes is_fbcode() to be a much stricter check.

Test Plan:
```
buck2 run arvr/mode/platform010/opt-stripped //arvr/libraries/depthlink/clients/mr_replay:pipeline_runner -c bolt.use_eva3_sim=True -- --config_file arvr/libraries/depthlink/clients/mr_replay/configs/runner_config.yaml --features DEPTH
```

Differential Revision: D62237502

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135244
Approved by: https://github.com/aorenste
2024-09-06 04:52:31 +00:00
e40a0a9359 Add randomness checking for sdpa vmap (#135176)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135176
Approved by: https://github.com/zou3519
2024-09-06 04:50:49 +00:00
c05a7adb36 [inductor][debug] fix draw_buffers (#135266)
**Before:**
![image](https://github.com/user-attachments/assets/aac756f3-1349-4647-9da3-87cf105cf647)

**After:**
<img width="791" alt="image" src="https://github.com/user-attachments/assets/d72c663c-e598-42fa-ac40-9e58956f1ec1">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135266
Approved by: https://github.com/yf225
2024-09-06 04:12:41 +00:00
5f57be7571 [Distributed] Change function call in test to non-deprecated to eliminate warning (#134938)
Migrate function call in test to eliminate warning message in below and reduce the chance of test fail when methods removed

-  from deprecated `save_state_dict` change to `save`
-  from deprecated `load_state_dict` change to `load`

Warning message:
```bash
pytorch/test/distributed/checkpoint/test_fsdp_model_state.py:37: FutureWarning: `save_state_dict` is deprecated and will be removed in future versions.Please use `save` instead.

```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134938
Approved by: https://github.com/wz337, https://github.com/fegin
2024-09-06 03:25:09 +00:00
29d72c1100 [inductor] check intel compiler minimal version (#135209)
On Windows: early version icx has `-print-file-name` issue, and can't preload correctly for inductor. Add minimal version check for Intel compiler.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135209
Approved by: https://github.com/ezyang
2024-09-06 03:21:07 +00:00
3b1a334c0f [Inductor][CPP] Avoid mistake wgt tensor delete (#135100)
**Summary**
Fix issue: https://github.com/pytorch/pytorch/issues/134998: Previously, we only checked if the `get_attr` FX node for the weight had a single user node. However, two `get_attr` nodes may share the same tensor and should not be deleted in such cases. In this PR, we add the count of users for tensor along with the num of users for nodes to decide whether this tensor can be deleted or not.

**TestPlan**
```
 python test/inductor/test_cpu_select_algorithm.py -k test_linear_wgt_multi_users
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135100
Approved by: https://github.com/jgong5
2024-09-06 03:13:36 +00:00
07689a38bf [Inductor] Fix AOT weight alignment issue on CPU (#135205)
**Summary**
Fix issue: https://github.com/pytorch/pytorch/issues/135027. On CPU, the `consts_size` used to generate `_binary_constants_bin_start` is not padded to `ALIGN_BYTES`, while `serialized_weights` is, causing a failure in the 16K alignment check.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135205
Approved by: https://github.com/jgong5, https://github.com/desertfire
2024-09-06 03:06:51 +00:00
06a7dc21c1 Remove dead expect_rational (#135105)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135105
Approved by: https://github.com/malfet
2024-09-06 02:57:27 +00:00
d9a18173fa Report qualname of exception type rather than <class 'RuntimeError'> (#135146)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135146
Approved by: https://github.com/Skylion007, https://github.com/albanD, https://github.com/yanboliang
ghstack dependencies: #135148, #135145
2024-09-06 02:56:50 +00:00
d8543e3162 Include exception type qualname when rewrapping InternalTorchDynamoError (#135145)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135145
Approved by: https://github.com/drisspg, https://github.com/anijain2305
ghstack dependencies: #135148
2024-09-06 02:56:50 +00:00
ad01fc194d Consolidate raise and rewrap raise error branches (#135148)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135148
Approved by: https://github.com/anijain2305, https://github.com/albanD, https://github.com/yanboliang, https://github.com/malfet
2024-09-06 02:56:46 +00:00
e162414963 add instrumentation of CCA stats for reserved and allocated memory size (#135231)
As titled
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135231
Approved by: https://github.com/c-p-i-o
2024-09-06 02:48:56 +00:00
9e5a797771 Improve test_public_bindings import module error reporting (#135258)
Error was hard to understand without message. Render it now. See https://github.com/pytorch/pytorch/pull/135259 for it in action.

Example failure:

```
2024-09-05T20:04:45.3022000Z FAILED [5.9524s] test_public_bindings.py::TestPublicBindings::test_modules_can_be_imported - AssertionError: String comparison failed: '' != "torch._logging.scribe failed to import w[112 chars].py)"
2024-09-05T20:04:45.3025413Z + torch._logging.scribe failed to import with error ImportError: cannot import name 'TypeAlias' from 'typing' (/opt/conda/envs/py_3.9/lib/python3.9/typing.py)
2024-09-05T20:04:45.3026990Z
```

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135258
Approved by: https://github.com/albanD
2024-09-06 02:40:03 +00:00
b46a1b9e2d Use Python 3.9 on all libtorch jobs (#135245)
Part of the migration py3.8->3.9

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135245
Approved by: https://github.com/izaitsevfb
2024-09-06 02:27:22 +00:00
9688014820 aarch64: extend matmul heuristic checks to all neoverse platforms (#134548)
for aarch64 neoverse platforms there are two gemm backends available
for matmul operator on PyTorch: (1) Arm Compute Library and (2) OpenBLAS.
While Arm Compute Library provides better performance over OpenBLAS,
it has overhead for the kernel launch time, and hence we use OpenBLAS
for smaller tensor compute. The heuristic was originally implemented for
neoverse_v1. This commit extends the heuristic to other neoverse platforms

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134548
Approved by: https://github.com/malfet
2024-09-06 01:40:50 +00:00
8f6e73f068 [ONNX] Enable experimental exporter logic to dynamo_export and support refine dynamic_shapes (#134976)
(1) Enable experimental exporter logic to dynamo_export
(2) Refine dynamic shapes and retry export in export strategies
(3) Delete `torch_export_graph_extractor` and use the new export logic
(4) Disable ExportedProgram test in `test_fx_onnx_with_onnxruntime.py`, as ONNXProgram is different now.

Fixes https://github.com/pytorch/pytorch/issues/126479
Fixes #135183
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134976
Approved by: https://github.com/justinchuby
2024-09-06 01:29:56 +00:00
1e57ef08fa [AOTI] Support MKLDNN qconv ops in cpp wrapper (#134795)
Summary: Similar to https://github.com/pytorch/pytorch/pull/134475, support qconv in the ABI-compatible mode for cpp-wrapper Inductor.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134795
Approved by: https://github.com/leslie-fang-intel, https://github.com/chunyuan-w, https://github.com/angelayi
ghstack dependencies: #134475, #134783
2024-09-06 01:01:53 +00:00
614b86d602 [AOTI] Support MKLDNN qlinear ops in cpp wrapper (#134783)
Summary: Similar to https://github.com/pytorch/pytorch/pull/134475, support qlinear in the ABI-compatible mode for cpp-wrapper Inductor.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134783
Approved by: https://github.com/leslie-fang-intel, https://github.com/chunyuan-w, https://github.com/angelayi
ghstack dependencies: #134475
2024-09-06 01:01:53 +00:00
0b96dfb736 [AOTI] Support MKLDNN conv ops in cpp wrapper (#134475)
Summary: Partially fix https://github.com/pytorch/pytorch/issues/123040. In the ABI-compatible mode, MKLDNN fallback ops do not have C shim implementations and thus need to go through the custom ops launch path. Other MLKDNN ops will be fixed in following PRs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134475
Approved by: https://github.com/leslie-fang-intel, https://github.com/chunyuan-w, https://github.com/angelayi
2024-09-06 01:01:53 +00:00
62b221d5cc Add Percentages to Function Events (#135155)
Summary: Users have recently asked that the profiler contains self/total CPU and device percentages to FunctionEvents so that teams can process the data procedurely. Some of it could be done mathematically via subroutines but since we already have the information in the _build_table, lets build it there.

Test Plan: Check that we have the same table as before but also check that the parameters we check also have the expected values

Differential Revision: D62210351

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135155
Approved by: https://github.com/shanw-meta, https://github.com/kit1980
2024-09-06 00:39:11 +00:00
66dd4577b1 Track base of FunctionalTensor in inference mode. (#135141)
The idea behind the tracking is the following, whenever we see a tensor if the tensors is a root tensors (does not have any view metas ) when we consider is as the base of the all the tensors that shares its storage.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135141
Approved by: https://github.com/zou3519
2024-09-06 00:10:25 +00:00
cyy
cc28634172 [Submodule] Bump pybind11 to v2.13.5 (#135202)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135202
Approved by: https://github.com/Skylion007
2024-09-06 00:09:00 +00:00
c83cdf068b [DTensor] Fix view op replicating on tensor dim when the size of the tensor dim = 1 (#135054)
We found a corner case that when a tensor dimension is 1, calling `view(1)` would result in an unexpected replication (see case 1 below). When the tensor dimension to shard is not 1, no matter whether the tensor dimension is evenly-shardable across the mesh dimension, it won't cause an implicit replication behind the scenes if view doesn't change the size of the given tensor dimension (see case 2 and 3).

When the tensor dimension to shard is of size 1, it is not being added to shardable_dims here:
https://github.com/pytorch/pytorch/blob/main/torch/distributed/_tensor/ops/_view_ops.py#L518

```
# uneven case where the size of the tensor dimension to shard is 1
p = torch.randn(1,2)
mesh = init_device_mesh(“cuda”, (2,))
dtensor = distribute_tensor(p, mesh, [Shard(0)])
t = dtensor.view(1, 2)
# this would result in replication, meaning t is now replicated across all ranks.

# uneven case where the size of the tensor dimension to shard is not 1
p = torch.randn(3, 2)
mesh = init_device_mesh(“cuda”, (2,))
dtensor = distribute_tensor(p, mesh, [Shard(0)])
t = dtensor.view(3, 2) # this would not result in replication.
# this would not result in replication, meaning t stays as sharded.

# even case
p = torch.randn(2,2)
dtensor = distribute_tensor(p, mesh, [Shard(0)])
t = dtensor.view(2, 2)
# this would not result in replication, meaning t stays as sharded.
```

Differential Revision: [D62155606](https://our.internmc.facebook.com/intern/diff/D62155606)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135054
Approved by: https://github.com/tianyu-l, https://github.com/wanchaol
2024-09-06 00:03:54 +00:00
28ccfba248 [ONNX] Delete ONNXProgramSerializer (#135261)
Fixes #135182

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135261
Approved by: https://github.com/justinchuby
2024-09-05 23:52:51 +00:00
b2386bdca1 [debug] Add helper to run cProfile on a function (#135084)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135084
Approved by: https://github.com/oulgen
ghstack dependencies: #135070, #135076, #135082
2024-09-05 23:41:30 +00:00
bdfc8d9f96 [fx] Don't use generators in map_aggregate (#135082)
While the generators avoid a copy, they are slow.

Before:
![image](https://github.com/user-attachments/assets/70a55a9a-0595-4105-b0ab-22cf77c7409c)

After:
![image](https://github.com/user-attachments/assets/cecb9c59-ae36-47de-8b08-cab2c7cb3d57)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135082
Approved by: https://github.com/oulgen
ghstack dependencies: #135070, #135076
2024-09-05 23:41:30 +00:00
70779dded8 [fx] Compile time optimization in Node.__update_args_kwargs (#135076)
Before this we took two passes over all of the args.

Before:
![image](https://github.com/user-attachments/assets/24ce5628-03f4-4983-9f2d-5ddf0ca5816e)

After:
![image](https://github.com/user-attachments/assets/c9681aa2-32f0-4f6b-a598-fc6f90ffafb5)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135076
Approved by: https://github.com/Chillee
ghstack dependencies: #135070
2024-09-05 23:41:30 +00:00
ea231300d1 [inductor] Improve compile time regression from MemoryDep.normalize (#135070)
Possible fix for #135056

Before
![image](https://github.com/user-attachments/assets/3962cb85-e808-4fd4-991f-471ff5ef7eae)

After
![image](https://github.com/user-attachments/assets/2322d48d-6518-4518-baca-336027b5cda8)

Measured based on:
```
python benchmarks/dynamo/torchbench.py --ci --accuracy --timing --explain --inductor --device cuda --training --only hf_Bert_large --stats -n1
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135070
Approved by: https://github.com/Chillee
2024-09-05 23:41:30 +00:00
8f66995459 Revert "Support rolling over a percentage of workflows (#134816)"
This reverts commit fc890b55b51098437b6149abf1026a8b2aaee389.

Reverted https://github.com/pytorch/pytorch/pull/134816 on behalf of https://github.com/malfet due to Causes lint to intermittently fail ([comment](https://github.com/pytorch/pytorch/pull/134816#issuecomment-2332902609))
2024-09-05 23:39:41 +00:00
144fde4fd2 [MPS] Add support for autocast in MPS (#99272)
Fixes https://github.com/pytorch/pytorch/issues/88415

Need to run inductor/test_cpu_select_algorithm

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99272
Approved by: https://github.com/malfet

Co-authored-by: Siddharth Kotapati <skotapati@apple.com>
Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
Co-authored-by: Roy Hvaara <roy@lightyear.no>
2024-09-05 23:23:17 +00:00
43f4947d44 fix fake tensor tolist implementation (#135131)
Summary:
When exporting for training with `tolist`, we do not hit `FunctionalTensor.tolist` since we do not functionalize. Unfortunately, this means we hit `FakeTensor.tolist`, which creates unbacked symints that are not backed by proxies.

Rather than trying to patch up this low-level implementation, we replace it with essentially what `FunctionalTensor.tolist` does, which is higher-level: we essentially desugar to `item()` calls and let it take care of unbacked symints.

Test Plan:
Some expected failures are gone now.
Also found a test for `tolist` that was written when `FunctionalTensor.tolist` was implemented but not really doing much; repurposed it now to exercise more modes.

Differential Revision: D62197742

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135131
Approved by: https://github.com/ezyang
2024-09-05 23:20:31 +00:00
65e1c34061 [rfc] scuba for flight recorder (#134794)
Summary: Record flight recorder status in a scuba table.

Test Plan: Testing with timing out a job. Will post results soon.

Differential Revision: D61729221

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134794
Approved by: https://github.com/fduwjj
2024-09-05 23:18:10 +00:00
830247c355 [Intel Triton] Update Intel Triton to release/2.5.0 (#134074)
This PR relands https://github.com/pytorch/pytorch/pull/134053

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134074
Approved by: https://github.com/EikanWang
2024-09-05 22:46:31 +00:00
4262755b5a [cond] fix typo in cond codegen (#134708)
As titled.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134708
Approved by: https://github.com/jansel
2024-09-05 22:38:24 +00:00
3825607144 Add torch._logging.scribe (#135224)
See https://github.com/pytorch/pytorch/pull/135138 for a usage example. Meta only, see https://docs.google.com/document/d/1JpbAQvRhTmuxjnKKjT7qq57dsnV84nxSLpWJo1abJuE/edit#heading=h.9wi46k7np6xw for context

fbscribelogger is a library that allows us to write to scribe, which is Meta's logging infrastructure, when you have appropriate access token (this token is available for jobs running on main, as well as authorized jobs with the ci-scribe label). The resulting data is accessible via Scuba (a real time in-memory database) and Hive (a more traditional SQL persisted database).

Here's the motivating use case. Suppose there is somewhere in PyTorch's codebase where you'd like to log an event, and then you'd like to find all the situations where this log is called. If PyTorch is rolled out to our internal users, we have some FB-oriented APIs (like torch._utils_internal.signpost_event) with which you can do this. But you have to actually land your PR to main, wait for it to be ingested to fbcode, and then wait for us to actually roll out this version, before you get any data. But what if you want the results within the next few hours? Instead, you can use torch._logging.scribe to directly write to our logging infrastructure *from inside CI jobs.* The most convenient approach is to log unstructured JSON blobs to `open_source_signpost` (added in this PR; you can also add your own dedicated table as described in the GDoc above). After adding logging code to your code, you can push your PR to CI, add 'ci-scribe' label, and in a few hours view the results in Scuba, e.g., (Meta-only) https://fburl.com/scuba/torch_open_source_signpost/z2mq8o4l If you want continuous logging on all commits on master, you can land your PR and it will be continuously get logging for all CI runs that happen on main.

Eventually, if your dataset is important enough, you can consider collaborating with PyTorch Dev Infra to get the data collected in our public AWS cloud so that OSS users can view it without access to Meta's internal users. But this facility is really good for prototyping / one-off experiments. It's entirely self serve: just add your logging, run your PR CI with ci-scribe, get results, do analysis in Scuba.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135224
Approved by: https://github.com/Skylion007
2024-09-05 22:37:13 +00:00
eqy
3c8f71ff93 [cuDNN][64-bit indexing] cuDNN v9.3+ supports non-batch-splittable convolutions with > 2**31 elements (#134890)
For longstanding issues such as #95024

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134890
Approved by: https://github.com/Skylion007
2024-09-05 22:22:45 +00:00
fc890b55b5 Support rolling over a percentage of workflows (#134816)
In order to support adding a rollover percentage, this ended up being a complete rewrite of runner_determinator.py.

Details of the new format are in the comments up top.

On the plus side, this now includes some unit tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134816
Approved by: https://github.com/PaliC, https://github.com/zxiiro
2024-09-05 22:21:45 +00:00
058a69d91a [fbcode][dynamo] Turn on guard_nn_modules using justknobs_check (#134928)
As Title

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134928
Approved by: https://github.com/ezyang
2024-09-05 22:05:54 +00:00
6c5920d515 Tune int8 AMX WoQ micro-kernel for CPU (#134832)
This patch prevents performance regression against the default ATen implementation for LLaMA 3.1 int8 GPTQ WoQ workload.

Uses AMX micro-kernel only if `M` >= `block_m`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134832
Approved by: https://github.com/jgong5
2024-09-05 22:01:14 +00:00
116fd474da [export] Expand coverage to more copied sym ops for unflattener. (#135119)
Test Plan:
buck2 test 'fbcode//mode/opt' fbcode//torchrec/ir/tests:test_serializer -- --run-disabled

```
File changed: fbcode//caffe2/torch/export/unflatten.py
Buck UI: https://www.internalfb.com/buck2/2e0377e7-e2b6-4bd0-8133-a787245165a0
Test UI: https://www.internalfb.com/intern/testinfra/testrun/5066549824883887
Network: Up: 0B  Down: 0B
Jobs completed: 16. Time elapsed: 10.2s.
Tests finished: Pass 6. Fail 0. Fatal 0. Skip 0. Build failure 0
```

Differential Revision: D62190172

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135119
Approved by: https://github.com/yushangdi
2024-09-05 21:58:20 +00:00
a5d70cf545 [PyTorch] Add isfinite to BFloat16-math.h (#135052)
Missing function from <cmath>.

Differential Revision: [D62148884](https://our.internmc.facebook.com/intern/diff/D62148884/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135052
Approved by: https://github.com/PaliC, https://github.com/albanD
ghstack dependencies: #135031
2024-09-05 21:50:36 +00:00
7fe819d917 [PyTorch] Fix -Wshadow -Werror build in BFloat16-inl.h (#135031)
`float_t` is required to exists in C99 math.h, which causes -Wshadow to fire. We don't need the alias, fortunately.

Differential Revision: [D62135908](https://our.internmc.facebook.com/intern/diff/D62135908/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135031
Approved by: https://github.com/albanD
2024-09-05 21:48:21 +00:00
f63571060c Revert "Use actions/upload-artifact@v4.4.0 for rest of workflows (#135264)"
This reverts commit 9c0b03020b7204ca5d5dbe18174bab005f79c47b.

Reverted https://github.com/pytorch/pytorch/pull/135264 on behalf of https://github.com/atalman due to broke CI ([comment](https://github.com/pytorch/pytorch/pull/135264#issuecomment-2332674607))
2024-09-05 21:43:05 +00:00
38fead8f7c [hop] preserve metadata in re-tracing hop subgraph by running with interpreter (#135159)
In this way, the interpreter.run can preserve the current metadata of subgraphs correctly when tracing the subgraphs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135159
Approved by: https://github.com/tugsbayasgalan
2024-09-05 21:36:56 +00:00
24a223c49d Run inductor micro benchmark on x86 metal runner (#135042)
This enables inductor micro benchmark on CPU (x86):

* Running on AWS metal runner for more accurate benchmark
* I add a new `arch` column, which will be either x86_64 or arm64 for CPU or GPU name for GPU.  We can use this later to differentiate between different setup, i.e. cuda (a100) vs cuda (a10g) or cpu (x86_64) vs cpu (arm64)

The next step would be to run this one cpu arm64, and cuda (a10g).

### Testing
Here is the CSV results from my test run https://github.com/pytorch/pytorch/actions/runs/10709344180

```
name,metric,target,actual,dtype,device,arch,is_model
mlp_layer_norm_gelu,flops_utilization,0.8,17.36,bfloat16,cpu,x86_64,False
gather_gemv,memory_bandwidth(GB/s),990,170.80,int8,cpu,x86_64,False
gather_gemv,memory_bandwidth(GB/s),1060,204.78,bfloat16,cpu,x86_64,False
Mixtral-8x7B-v0.1,token_per_sec,175,26.68,int8,cpu,x86_64,True
Mixtral-8x7B-v0.1,memory_bandwidth(GB/s),1130,171.91,int8,cpu,x86_64,True
Mixtral-8x7B-v0.1,compilation_time(s),162,47.36,int8,cpu,x86_64,True
gemv,memory_bandwidth(GB/s),870,236.36,int8,cpu,x86_64,False
gemv,memory_bandwidth(GB/s),990,305.71,bfloat16,cpu,x86_64,False
Llama-2-7b-chat-hf,token_per_sec,94,14.01,bfloat16,cpu,x86_64,True
Llama-2-7b-chat-hf,memory_bandwidth(GB/s),1253,185.18,bfloat16,cpu,x86_64,True
Llama-2-7b-chat-hf,compilation_time(s),162,74.99,bfloat16,cpu,x86_64,True
Llama-2-7b-chat-hf,token_per_sec,144,25.09,int8,cpu,x86_64,True
Llama-2-7b-chat-hf,memory_bandwidth(GB/s),957,165.83,int8,cpu,x86_64,True
Llama-2-7b-chat-hf,compilation_time(s),172,70.69,int8,cpu,x86_64,True
layer_norm,memory_bandwidth(GB/s),950,172.03,bfloat16,cpu,x86_64,False
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135042
Approved by: https://github.com/yanboliang
2024-09-05 21:31:36 +00:00
e4920a1364 [Traceable FSDP2][Dynamo] allow tracing through auto_functionalized HOP (#135169)
If an `auto_functionalized` HOP is included in backward graph due to activation checkpointing, we will run into a scenario where Compiled Autograd Dynamo tracing will need to trace through the `auto_functionalized` HOP. This PR adds support for it.

Test commands:
- `pytest -rA test/inductor/test_compiled_autograd.py::TestCompiledAutograd::test_trace_auto_functionalized`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135169
Approved by: https://github.com/zou3519
2024-09-05 21:22:45 +00:00
bc5ecf83d7 [training ir migration] Fix quantization tests (#135184)
Summary:
Fixed some quantization tests for new training ir:

Fix batch norm node pattern matcher. In training ir, we have `aten.batch_norm` node instead of `aten._native_batch_norm_legit` and `aten._native_batch_norm_legit_no_training`.

Test Plan:
```
buck run fbcode//mode/dev-nosan fbcode//caffe2/test:quantization_pt2e
```

Reviewed By: tugsbayasgalan

Differential Revision: D62209819

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135184
Approved by: https://github.com/tugsbayasgalan
2024-09-05 21:19:28 +00:00
e55c0f59e5 Revert "[Reland] Refactor caching device allocator utils (#130923)"
This reverts commit 9809080b9ed657a8c0ea0383be7cbdce3a26e05e.

Reverted https://github.com/pytorch/pytorch/pull/130923 on behalf of https://github.com/kit1980 due to breaking internal builds - Error: Relocation overflow has occured ([comment](https://github.com/pytorch/pytorch/pull/130923#issuecomment-2332640961))
2024-09-05 21:16:14 +00:00
a4cf9653ee Revert "Remove Caffe2 code from tool scripts (#134941)"
This reverts commit c818ecd1698a28d9fadf4a81453a89914b18374a.

Reverted https://github.com/pytorch/pytorch/pull/134941 on behalf of https://github.com/kit1980 due to breaking internal builds - The path `caffe2/operators/hip/gather_op.cuh` does not exist ([comment](https://github.com/pytorch/pytorch/pull/134941#issuecomment-2332636624))
2024-09-05 21:12:54 +00:00
9c0b03020b Use actions/upload-artifact@v4.4.0 for rest of workflows (#135264)
To be consistent with https://github.com/pytorch/pytorch/pull/135263 and rest of workflows. Use v4.4.0.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135264
Approved by: https://github.com/kit1980, https://github.com/malfet
2024-09-05 21:05:06 +00:00
034717a029 [ROCm] remove triton-rocm commit pin and merge pins with triton.txt (#133438)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133438
Approved by: https://github.com/jithunnair-amd, https://github.com/malfet

Co-authored-by: Jithun Nair <37884920+jithunnair-amd@users.noreply.github.com>
2024-09-05 20:36:45 +00:00
9c38b00999 [export] Add ability to run eagerly on UnflattenedModule (#133996)
Summary:
Added the contextmanager, `_disable_interpreter`, which is meant to put around a call to `unflatten`. This will generate an UnflattendModule and sub-InterpreterModules which will not use torch.fx.Interpreter to run eagerly. We want to have this as a state of the module instead of a contextmanager around running the module because it's not clear where we are calling the unflattened module.

This seems to improve the performance: https://fb.workplace.com/groups/1075192433118967/posts/1473590629945810/?comment_id=1473621763276030

Test Plan: CI

Differential Revision: D60939034

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133996
Approved by: https://github.com/pianpwk
2024-09-05 20:28:42 +00:00
8efe547046 Use actions/upload-artifact@v4.4.0 for triton builds (#135263)
Same as: https://github.com/pytorch/pytorch/pull/135139
Fixes upload failure: https://github.com/pytorch/pytorch/actions/runs/10722567217/job/29748125015
fix regression introduced by https://github.com/pytorch/pytorch/pull/135068

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135263
Approved by: https://github.com/kit1980, https://github.com/huydhn
2024-09-05 20:03:39 +00:00
82d00acfee Allow cross-device copies for cpu scalars in refs (#135140)
This copies our eager-mode behavior where someone can do torch.add(a, b, out=c)
where a and b are CPU scalar tensors and c is a CUDA tensor.

Fixes https://github.com/pytorch/pytorch/issues/121619 by side effect (we get into a situation where we're writing a CPU scalar into a FakeTensor that is actually a meta tensor)

Test Plan:
- new test

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135140
Approved by: https://github.com/williamwen42, https://github.com/yanboliang
2024-09-05 19:08:48 +00:00
098431a29d Update Resize.cpp with new device type (#135117)
Update Resize.cpp with new device type

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135117
Approved by: https://github.com/egienvalue
2024-09-05 18:53:13 +00:00
be660ea2d3 [PT2] Directly set meta.val in group_batch_fusion_aten (#135078)
Summary: instead of using FakeTensorProp after the pass

Differential Revision: D62162640

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135078
Approved by: https://github.com/frank-wei
2024-09-05 18:17:06 +00:00
52c7c89ea4 [Inductor][CPP] Leverage full bits for BF16/FP16 vectorization (#126502)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126502
Approved by: https://github.com/jgong5, https://github.com/jansel
2024-09-05 17:17:46 +00:00
1efd341d15 [fake_tensor] Move unrecognized_type NotImplemented before ConstProp (#135033)
We should not try to do ConstProp on the unrecognized types (e.g. Subclasses).
In case of those types throwing NotImplemented will jump to the next torch_dispatch.

Test:
```
 python test/functorch/test_aotdispatch.py -k test_aot_test_subclasses_with_tensor_factories
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135033
Approved by: https://github.com/zou3519, https://github.com/bdhirsh
2024-09-05 17:09:41 +00:00
a096f2899d Add torch.serialization.skip_data context manager (#134504)
## Semantic

The semantic is
(1) By default `torch.serialization.skip_data(materialize_fake_tensors=False)` will make `torch.save` skip writing storages (but reserve space for them in the checkpoint).

```python
import torch
import torch.nn as nn

sd = nn.Linear(3, 5).state_dict()
with torch.serialization.skip_data():
    torch.save(sd, 'foo.pt')
print(torch.load('foo.pt', weights_only=True))
```

(2)  With `torch.serialization.skip_data(materialize_fake_tensors=True)`If FakeTensor is passed to `torch.save` the pickler will treat these FakeTensors as being "materialized" space will be reserved in the checkpoint for the associated storage bytes, and when loading the type will be Tensor instead of FakeTensor)

```python
import torch
import torch.nn as nn
from torch._subclasses.fake_tensor import FakeTensorMode

with FakeTensorMode():
    m = nn.Linear(3, 5, dtype=torch.float16, device='cuda')

sd = m.state_dict()
with torch.serialization.skip_data(materialize_fake_tensors=True):
    torch.save(sd, 'bla.pt')
print(torch.load('bla.pt', weights_only=True))
# OrderedDict([('weight', tensor([[0., 0., 0.],
#        [0., 0., 0.],
#        [0., 0., 0.],
#        [0., 0., 0.],
#        [0., 0., 0.]], device='cuda:0', dtype=torch.float16)), ('bias', tensor([0., 0., 0., 0., 0.], device='cuda:0', dtype=torch.float16))])

```

## Follow Ups

- [ ] `torch.load` semantic for skip_data context manager
- [ ] Mechanism for getting offsets of storages saved via this method (for writing in a separate pass)

Differential Revision: [D62238610](https://our.internmc.facebook.com/intern/diff/D62238610)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134504
Approved by: https://github.com/albanD
2024-09-05 16:53:39 +00:00
dbeb8a1691 Render log filepaths that are not anchored in torch's directory in a reasonable way (#135165)
For example, if I do TORCH_LOGS=fbscribelogger I'll get:

```
I0904 17:59:07.567000 3672513 fbscribelogger/__init__.py:161] stop
```

instead of

```
I0904 12:46:15.332000 2930287 ../../../../../home/ezyang/local/a/pytorch-env/lib/python3.10/site-packages/fbscribelogger/__init__.py:161] stop
```

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135165
Approved by: https://github.com/Skylion007
2024-09-05 16:48:09 +00:00
b1f72e2984 Gradient scaler for DTensor (#132816)
Solve the request [here](https://github.com/pytorch/pytorch/issues/120003#issuecomment-2248805798).
Enable DTensor input in gradient scaler's APIs, especially on `.unscale_()`
Related dispatch strategy is added to accept DTensor input.

To enable found_inf to conduct reduce action across devices, we add allreduce at dispatch with args after dispatch strategy and kernel.
Since `aten._amp_foreach_non_finite_check_and_unscale_.default` is an inplace_op, grad_scale as the arg[0] with be inplaced, so that redesign a strategy or refactoring the kernel would not help

Test files are testing 2 parts under 1-d(dp) and 2-d(dp,tp) cases:
1. whether the non-inf values unscaled
2. whether all DTensors at each device could found inf even not at their device.
3. If inf not found, will new parameters generates
4. if inf found, will scale be updated

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132816
Approved by: https://github.com/XilunWu, https://github.com/weifengpy, https://github.com/wanchaol
2024-09-05 16:44:32 +00:00
bb3c2408f4 [inductor][test] in test_unbacked_symints, replace inductor's skipCUDAIf with common device type's skipcudaif (#133936)
Differential Revision: D61506212

Use `skipCUDAIf` from `torch.testing._internal.common_device_type` if we create the test class with `instantiate_device_type_tests`.

`instantiate_device_type_tests` would make sure the class has attr device_type, which works with`skipCUDAIf` from `torch.testing._internal.common_device_type`.

Also skipping test_vertical_pointwise_reduction_fusion for cpu test class, since the test expects cuda.

FAILED [0.0026s] test/inductor/test_unbacked_symints.py::TestUnbackedSymintsCPU::test_vertical_pointwise_reduction_fusion_cpu - AttributeError: 'TestUnbackedSymintsCPU' object has no attribute 'device'

repro:
```
CUDA_VISIBLE_DEVICES="" pytest test/inductor/test_unbacked_symints.py -k cpu -v
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133936
Approved by: https://github.com/ColinPeppler, https://github.com/desertfire
2024-09-05 16:40:14 +00:00
2c99f17a32 Implement VariableTracker.python_type() (#134215)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134215
Approved by: https://github.com/amjames, https://github.com/jansel
2024-09-05 16:35:47 +00:00
0043dcd79e Switch torch pt2e xnnpack tests to use export_for_training (#134788)
Migrate all the callsites inside the pt2e XNNPACK tests to use export_for_training.

Differential Revision: D61994553

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134788
Approved by: https://github.com/mergennachin
2024-09-05 16:11:18 +00:00
2e2fb668fa Upgrade expecttest to 0.2.1 (#135136)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135136
Approved by: https://github.com/albanD, https://github.com/atalman, https://github.com/Skylion007
2024-09-05 16:05:35 +00:00
9d24f945ba [CI] Use larger instance for building triton whl (#135201)
When running CI jobs of "Build Triton Wheels", it failed due to the lack of resources. This PR uses a larger runner to avoid these issues.

The failure message is like:

```
Process completed with exit code 137.
```

Related running actions:
Failed actions: https://github.com/pytorch/pytorch/actions/runs/10714445036
Success actions: https://github.com/pytorch/pytorch/actions/runs/10716710830

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135201
Approved by: https://github.com/chuanqi129, https://github.com/atalman
2024-09-05 14:36:23 +00:00
ecbd715363 [Intel GPU][Windows] Fix overriding default CMAKE_CXX_FLAGS (#135093)
The root cause is that `/EHsc` is part of the default `CMAKE_CXX_FLAGS` in CMake.
Fix to not override the default `CMAKE_CXX_FLAGS`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135093
Approved by: https://github.com/EikanWang, https://github.com/atalman
2024-09-05 12:52:43 +00:00
58f2477a26 [Dynamo] Support builtin function frozenset (#134563)
Support builtin function frozenset in dynamo

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134563
Approved by: https://github.com/anijain2305, https://github.com/EikanWang, https://github.com/jansel
2024-09-05 12:15:10 +00:00
43dcb4bb61 Revise CPU vectorization ISA support API (#135075)
Revising (mostly renaming) CPU vectorization ISA support API (non-frontend-user-facing). Also added AVX512_BF16 ISA detection API.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135075
Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5, https://github.com/ezyang
2024-09-05 12:14:56 +00:00
50d1e37079 [AOTI] Fix a unbacked symint retrieve bug (#134670)
Summary: Fix https://github.com/pytorch/pytorch/issues/134081. When a unbacked symint is computed as the shape of a tensor from a tuple, generated C++ code needs to use std::get<> to extract the tensor.

Differential Revision: [D62142113](https://our.internmc.facebook.com/intern/diff/D62142113)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134670
Approved by: https://github.com/angelayi, https://github.com/22quinn, https://github.com/chenyang78
2024-09-05 11:34:14 +00:00
b99ef1a02e Update torch-xpu-ops pin (ATen XPU implementation) (#135185)
Release cycle for PyTorch 2.5
1. Update specific AOT targets for Windows. On Windows, AOT target list prefers Intel client GPUs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135185
Approved by: https://github.com/EikanWang
2024-09-05 10:05:23 +00:00
8a5c8e5db9 Update unbacked symints in masked_select more precisely (#134899)
## Summary
At the moment, the fake impl for `masked_select` simply sets the upper range while updating its size-like SymInt to `sys.maxsize`(9223372036854775807, max value for an unsigned int64) if the there are any SymInts in the original input tensor shape. This PR constrains the range more intelligently by using the upper ranges of each SymInt in the input tensor shape.

This solves an issue where an model being lowered to Executorch errors during memory planning because the memory allocated for `masked_select` ended up exceeded the 64-bit address space (`INT_MAX * size(dtype)`).

## Test plan
- Passes existing unit tests (tests case where upper bound is inf)
- Added unit test to verify upper bound reduction calculation
- Tested end-to-end by exporting with TORCH_LOGS="export" and ensuring that the range for `masked_select`'s SymInt size has the correct upper bound
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134899
Approved by: https://github.com/ezyang
2024-09-05 09:01:06 +00:00
c7328dff7f Enhance the stability of the complex divide code (#134647)
In C++, when a floating-point literal (e.g., 3.14) is compared with a variable of type float, the literal is by default interpreted as a double.
```c++
float f = 3.14f;
if (f == 3.14) {
    // Do something
}
```
If a device does not support double, an error will occur.
This PR addresses the issue of complex64 errors on machines that do not support double operations.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134647
Approved by: https://github.com/EikanWang, https://github.com/albanD
2024-09-05 08:36:37 +00:00
749dc6ceda [inductor] [cpp] use_local_acc if template_buffer_has_other_users (#135081)
Fix the compilation error of `coat_lite_mini` in timm and `YituTechConvBert` in HF:
```
/tmp/tmpuu94adg_/nf/cnf3zm677wbfjzzll522zvjp57g44udzfnj66ac2t5b2odvfqpts.cpp:239:33: error: invalid conversion from ‘const float*’ to ‘float*’ [-fpermissive]
  239 |                                 &(in_ptr2[static_cast<int64_t>(n_start + (192L*m_start) + (Nr*nci) + ((-1L)*Nr*nc))]),
      |                                 ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
      |                                 |
      |                                 const float*
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135081
Approved by: https://github.com/jgong5
ghstack dependencies: #134984
2024-09-05 08:31:31 +00:00
eaeae0ac95 [c10d] Change collective to take in a list of tensors so it work fully for all collectives (#135049)
We found that currently, we only pass one input and output tensor to the function `collective`, and this causes NaNCheck, work numel stats and FR input/output sizes not accurate for all-to-all, scatter and reduce. So we want to let the collective take in a list of tensors to ensure it works for all collectives inside PGNCCL.

This partially revert what we did in https://github.com/pytorch/pytorch/pull/119421, and down the road we will have another round of cleanup on the collective to make it cleaner. For now, at least for the sake of correctness, we changed it back.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135049
Approved by: https://github.com/kwen2501
2024-09-05 07:56:56 +00:00
5a0e7a408f restore CSE'd node metadata in runtime asserts pass (#134516)
Adds val, and optionally stack_trace & nn_module_stack metadata back to SymInt compute nodes that we CSE, with a hook on `graph.create_node()`. Not sure if there's other metadata we want to populate here?

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134516
Approved by: https://github.com/ezyang
2024-09-05 07:50:04 +00:00
81a8624296 [Intel GPU] Customized XPU behaviour in indexing, group norm (#134453)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134453
Approved by: https://github.com/EikanWang, https://github.com/albanD
ghstack dependencies: #133980
2024-09-05 07:41:57 +00:00
731fd3172a [inductor] [cpp] generate reindexer for each epilogue_node (#134984)
Fixes the FP32 accuracy failure of `levit_128` in timm.

Previously, we used `Y` which is the output of the final epilogue node to calculate the reindexer. We actually need to use each epilogue node to calculate the reindexer from the GEMM output to the epilogue node.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134984
Approved by: https://github.com/jgong5
2024-09-05 07:08:31 +00:00
9d705605dd Fix decomp behaviour in export training IR (#134801)
Subset of changes in https://github.com/pytorch/pytorch/pull/132901, can't land the previous one because it is too complicated. Rest of the change will be implemented as follow up after export design meeting. This part just makes the training IR -> inference IR decomp to have the same path as normal export.

Differential Revision: [D62000525](https://our.internmc.facebook.com/intern/diff/D62000525)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134801
Approved by: https://github.com/avikchaudhuri, https://github.com/angelayi
2024-09-05 06:37:44 +00:00
05feb6e4ed [Inductor] support masked vectorization for the tail_loop for dynamic shapes (#131745)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131745
Approved by: https://github.com/jgong5, https://github.com/leslie-fang-intel, https://github.com/jansel
2024-09-05 06:17:48 +00:00
7b280c31ba [export] dynamic_shapes serialization, load/dump (#134718)
Adds utility functions `_dump_dynamic_shapes` and `_load_dynamic_shapes`.

- `_dump_dynamic_shapes`: dynamic shapes spec -> serialized format:
    - takes in the `dynamic_shapes` pytree object you'd feed into `export()`, and dumps into serialized format
- `_load_dynamic_shapes`: serialized format -> dynamic shapes spec
    - takes the serialized format, and produces a `dynamic_shapes` object you feed into `export()`

For example with dumping:
```
dx = Dim("dx", min=4, max=16)
dy = dx + 1

inputs = (
    [
        torch.randn(4, 4),
        torch.randn(5, 4),
    ],
    torch.randn(4),
    torch.randn(4, 4),
    "hello",
)
dynamic_shapes = {
    "a": [
        (dx, 4),
        (dy, 4),
    ],
    "b": (Dim.AUTO,),
    "c": None,
    "d": None,
}
out = _dump_dynamic_shapes(dynamic_shapes, inputs)
```

would generate the following output:
```
DynamicShapesSpec(
    dynamic_shapes=(
        [
            ['dx', 4],
            ['dx + 1', 4],
        ],
        ['_DimHint.STATIC'],
        ['_DimHint.STATIC', '_DimHint.STATIC'],
        None,
    ),
    dims={
        'dx': RootDim(
            min=4,
            max=16,
            derived=['dx + 1'],
        ),
    },
)
```

The serialized format contains 2 keys, `dynamic_shapes` and `dims.`
- `dynamic_shapes` is the pytree structure matching the input to `export()`, with strings in place of Dim names and enums, and ints/Nones otherwise. Each tensor is represented with a list of shapes, non-tensors with Nones.
- `dims` contain min/max range and derived dims info for each root dim.

The test cases show some roundtrippability guarantees for these functions. Definitely taking naming suggestions for them :)

Follow up: utility function to extract serializable format from ExportedProgram.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134718
Approved by: https://github.com/avikchaudhuri
2024-09-05 05:39:44 +00:00
f2a7228aed [executorch hash update] update the pinned executorch hash (#135162)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned executorch hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135162
Approved by: https://github.com/pytorchbot
2024-09-05 04:21:51 +00:00
8fb1281db9 [Traceable FSDP2] Skip _backward_prefetch under compile, and rely on compiler pass to have prefetching (#135163)
Before this PR, when traceable FSDP2 + AC is run, an error would be thrown:
```
  File "/data/users/willfeng/pytorch/torch/_dynamo/variables/builtin.py", line 1449, in call_getitem
    return args[0].call_method(tx, "__getitem__", args[1:], kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/users/willfeng/pytorch/torch/_dynamo/variables/lists.py", line 435, in call_method
    return super().call_method(tx, name, args, kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/users/willfeng/pytorch/torch/_dynamo/variables/lists.py", line 392, in call_method
    return super().call_method(tx, name, args, kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/users/willfeng/pytorch/torch/_dynamo/variables/lists.py", line 131, in call_method
    return self.getitem_const(tx, value)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/users/willfeng/pytorch/torch/_dynamo/variables/lists.py", line 106, in getitem_const
    return self.items[index]
Error: Index out of bound

from user code:
   File "<eval_with_key>.5", line 105, in forward
    aot0_trace_wrapped = torch__dynamo__trace_wrapped_higher_order_op_self_invoke(aot0_tangents_1, bw_state = aot0_primals_34);  aot0_tangents_1 = None
  File "/data/users/willfeng/pytorch/torch/_dynamo/_trace_wrapped_higher_order_op.py", line 74, in self_invoke
    return _trace_wrapped_op(*args, **dyn_kwargs, **kwargs)
  File "/data/users/willfeng/pytorch/torch/_dynamo/external_utils.py", line 132, in call_hook_from_backward_state
    return getattr(bw_state, hook_name)(*args, **kwargs)
  File "/data/users/willfeng/pytorch/torch/distributed/_composable/fsdp/_fsdp_state.py", line 271, in _pre_backward
    self._fsdp_param_group.pre_backward(default_prefetch)
  File "/data/users/willfeng/pytorch/torch/distributed/_composable/fsdp/_fsdp_param_group.py", line 332, in pre_backward
    self._backward_prefetch()
  File "/data/users/willfeng/pytorch/torch/distributed/_composable/fsdp/_fsdp_param_group.py", line 417, in _backward_prefetch
    target_fsdp_param_group = self.comm_ctx.post_forward_order[target_index]
```

Since it's okay to rely on the compiler to recover the "prefetching" pattern, we will skip this `_backward_prefetch()` code path during tracing to avoid the error, and have a compiler pass (in future PR) to achieve the equivalent prefetching overlap.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135163
Approved by: https://github.com/awgu
2024-09-05 03:32:04 +00:00
a7a53b796b [Intel GPU]device guard codegen for XPU (#133980)
This PR is a supplement to #130082. The previous PR  #130082 fulfill the basic functionality of codegen, while we found it fails to handle the device sameness check in lots of uts.  Current PR is aimed to facilitate the XPU device guard code generation.

With current PR, the code snippet in `RegisterXPU.cpp` is as follows, where we can see the device guard is successfully generated.
```c++
namespace {
at::Tensor & wrapper_XPU_Tensor_float_out_normal_out(const at::Tensor & mean, double std, ::std::optional<at::Generator> generator, at::Tensor & out) {
  std::optional<Device> common_device = std::nullopt;
(void)common_device; // Suppress unused variable warning
  c10::impl::check_and_update_common_device(common_device, out, "wrapper_XPU_Tensor_float_out_normal_out", "out");
  c10::impl::check_and_update_common_device(common_device, mean, "wrapper_XPU_Tensor_float_out_normal_out", "mean");
  const OptionalDeviceGuard device_guard(device_of(out));
  return at::native::normal_out(mean, std, generator, out);
}
} // anonymous namespace
```
Nevertheless, without current change, the generated code is
```c++
namespace {
at::Tensor & wrapper_XPU_Tensor_float_out_normal_out(const at::Tensor & mean, double std, ::std::optional<at::Generator> generator, at::Tensor & out) {
    // No device check
  // DeviceGuard omitted
  return at::native::normal_out(mean, std, generator, out);
}
} // anonymous namespace
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133980
Approved by: https://github.com/EikanWang, https://github.com/malfet
2024-09-05 01:53:31 +00:00
30b98940b8 Fix typo in comment (#135111)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135111
Approved by: https://github.com/aorenste, https://github.com/oulgen
2024-09-05 01:39:04 +00:00
724faac260 [FSDP] casting input args with dataclass(frozen=True) (#135067)
resolve: https://github.com/pytorch/pytorch/pull/135029

when enabling mixed precision, FSDP cast input args to desired dtype by calling `_apply_to_tensors`. When input args has `dataclass(frozen=True)`, we hit following runtime error, because of using `setattr` in `_apply_to_tensors`

`dataclasses.FrozenInstanceError: cannot assign to field 'some_key'`. The fix is to use dataclasses api `dataclasses.replace`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135067
Approved by: https://github.com/awgu
2024-09-05 01:19:53 +00:00
04e11c7eed Update current scripts used for setting up s390x runners (#129866)
Update current scripts used for setting up s390x runners

Just a documentation update.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129866
Approved by: https://github.com/malfet, https://github.com/huydhn
2024-09-05 01:17:54 +00:00
a3e0d4bf07 [FlexAttention] Fix mismatched backward strides for eager impl (#135152)
# Fixes:
The first repro from: https://github.com/pytorch/pytorch/issues/134888

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135152
Approved by: https://github.com/Chillee
2024-09-05 01:14:53 +00:00
27d86f93fe Remove redundant code (#134955)
Remove GetPrivateUse1HooksInterface
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134955
Approved by: https://github.com/Skylion007
2024-09-05 01:11:32 +00:00
32f45f01a9 [dynamo] Retire CompileProfiler (#135133)
Fixes confusion in https://github.com/pytorch/pytorch/issues/113443

We have TORCH_LOGS that supersedes CompileProfiler

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135133
Approved by: https://github.com/ezyang
ghstack dependencies: #135039, #135121, #135129, #135130
2024-09-05 01:08:40 +00:00
4a661e089a [FR] Add version based logic to FR script and make traces print can be filtered (#135154)
This PR makes version passing around the version, so that we can have different behaviors for different versions of FR dump. This PR also adds the logic of filtering to certain PG(desc) and ranks to show their traces.

Some minor refactors to make the name more accurate and util function working.

<img width="1180" alt="image" src="https://github.com/user-attachments/assets/4ef8a2d6-1296-4a45-b9a7-6d3b48fbe233">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135154
Approved by: https://github.com/wconstab
2024-09-05 00:59:32 +00:00
105ac2418c Fix binary builds artifact download (#135139)
By upgrading upload-artifacts action to v4.4.0

As artifact store layout is different between v3 and v4 actions and artifacts uploaded by v3 can not be downloaded by v4

Should fix`Unable to download artifact(s): Artifact not found for name: libtorch-cpu-shared-with-deps-release`, which could be seen for example [here](https://github.com/pytorch/pytorch/actions/runs/10707740040/job/29690137218#step:7:29)

I.e. fix regression introduced by https://github.com/pytorch/pytorch/pull/135068

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135139
Approved by: https://github.com/atalman, https://github.com/huydhn
2024-09-05 00:43:34 +00:00
560f449d8f Fix: use clone_preserve_strides in auto_functionalized_v2 (#135142)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135142
Approved by: https://github.com/zou3519
ghstack dependencies: #134409
2024-09-05 00:39:48 +00:00
956da79bda [CUDA][AMP] Fix autocast_dtype (#133938)
Fixes #132715

The failure in #132715 is due to `autocast_dtype` being a thread-local variable. It causes inconsistencies between `get_autocast_dtype()` among different threads.

To be exact, what is happening in the following: The amp dtype is set to `bfloat16` on main thread. The `backward` call runs on a side thread, so `at::autocast::prioritize` fails because `lower_precision_fp` defaults to `float16`:
6f738d6434/aten/src/ATen/autocast_mode.h (L221-L225)

This PR makes `autocast_dtype` thread-global so it consistent among all threads of forward and backward passes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133938
Approved by: https://github.com/soulitzer
2024-09-05 00:07:32 +00:00
977a909250 [CI] Build pytorch wheel with Torch XPU Operators on Windows (#133151)
# Description
This pipeline enables the CI build on Windows with PR labeled with ciflow/xpu. This will build torch binary with Torch XPU Operators on Windows using Vision Studio BuildTools 2022.

# Changes
1. Install xpu batch file (install_xpu.bat) - Check if build machine has oneAPI in environment, and if the version of it is latest. If not, install the latest public released oneAPI in the machine.
2. GHA callable pipeline (_win-build.yml) - Set vc_year and use_xpu as parameter to set build wheel environment.
3.  GHA workflow (xpu.yml) - Add a new windows build job and pass parameters to it.
4.  Build wheels script (.ci/pytorch/win-test-helpers/build_pytorch.bat) - Prepare environment for building, e.g. install oneAPI bundle.

# Note
1. For building wheels on Intel GPU, you need Vision Studio BuildTools version >= 2022
2. This pipeline requires to use Vision Studio BuildTools 2022 to build wheels. For now, we specify "windows.4xlarge.nonephemeral" as build machine label in the yaml file. We will request to add self-hosted runners with Intel GPU and Vision Studio BuildTools 2022 installed soon.

Work for #114850

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133151
Approved by: https://github.com/chuanqi129, https://github.com/atalman

Co-authored-by: chuanqiw <chuanqi.wang@intel.com>
2024-09-05 00:02:46 +00:00
b3ef0c99f5 [PP] Fix zero bubble composability with DP (#134052)
Moved all the backward functions (`stage_backward_input`, `stage_backward_weight`, `stage_backward`) under the same `backward_maybe_with_nosync` function which controls the logic of the data parallel wrappers.

FSDP was not working with zero bubble PP because there will be twice as many "backward" calls and we update the weight gradients after `autograd.grad` is called. As a result, we need to manually call the FSDP `post_backward_hook()` after the weights have the correct gradients.

Fixes the tests:
`python test/distributed/_composable/test_composability/test_pp_composability.py ComposabilityTest.test_manual_with_data_parallel_dp_type_FSDP_ScheduleClass0_use_new_runtime_False`

`python test/distributed/_composable/test_composability/test_pp_composability.py ComposabilityTest.test_manual_with_data_parallel_dp_type_DDP_ScheduleClass0_use_new_runtime_False`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134052
Approved by: https://github.com/kwen2501
2024-09-04 23:46:29 +00:00
43c9b4e0e6 Fix unintentional deduplication of returned tensors (#134726)
When CSE was used, returned tensors that had gone through identical
processing steps but were distinct from a data perspective were pruned
out of the graph.  This commit protects tensors which are directly
output from being pruned, and adds a test for this behavior.

Closes #88813 and #114344

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134726
Approved by: https://github.com/amjames, https://github.com/zou3519, https://github.com/bdhirsh
2024-09-04 23:42:56 +00:00
00a8666708 [ONNX] Support output_names in dynamic_axes when dynamo=True (#135134)
Previous to this PR, if output_names shows in dynamic_axes, it errors when we turn it to dynamic_shapes of torch.export, as we only recognized input_names.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135134
Approved by: https://github.com/justinchuby
2024-09-04 23:42:13 +00:00
eqy
4f70b3cfae [CUDA][complex][TF32] Update test_noncontiguous_samples tolerances for complex64 (#134526)
Recent cuDNN heuristics change surfaces same TF32 issue as `float32`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134526
Approved by: https://github.com/ezyang
2024-09-04 23:37:16 +00:00
359077fa43 [export] Fix indentation (#135128)
Summary: as title

Test Plan: CI

Differential Revision: D62195680

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135128
Approved by: https://github.com/tugsbayasgalan
2024-09-04 23:26:36 +00:00
9810ce9ca7 [PP] Go back to export instead of _export (#134299)
Reverts https://github.com/pytorch/pytorch/pull/130998 because FakeTensor + real device suffice to work around the autocast issue in HF.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134299
Approved by: https://github.com/lessw2020
2024-09-04 23:25:17 +00:00
804852c1f9 [dynamo] Search for _torchdynamo_inline only for functions (#135130)
Issue seen in https://github.com/pytorch/pytorch/issues/93633

Fixes https://github.com/pytorch/pytorch/issues/93633

Unable to create a testcase

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135130
Approved by: https://github.com/williamwen42, https://github.com/yanboliang
ghstack dependencies: #135039, #135121, #135129
2024-09-04 23:02:59 +00:00
13a4a0c60d [Inductor] Apply loop split optimization in codegen_node (#132389)
This PR applies loop split optimization in codegen_node to avoid non-contiguous load. When the vector is loaded in a non-contiguous manner due to a division in the index, we eliminate the division by splitting the loop to avoid non-contiguous load.

Example:
```
import torch
import torch.nn as nn

class GNReLU(torch.nn.Module):
    def __init__(self, num_groups, num_channels):
        super(GNReLU, self).__init__()
        self.gn = nn.GroupNorm(num_groups, num_channels)

    def forward(self, x):
        return torch.nn.functional.relu(self.gn(x))

input = torch.randn(2, 960, 96, 96).to(memory_format=torch.channels_last)
m = GNReLU(32, 960).eval()
compiled_m = torch.compile(m)

with torch.no_grad():
    compiled_m(input)
```

Generated code:

- Before:
```
cpp_fused_native_group_norm_relu_0 = async_compile.cpp_pybinding(['const float*', 'const float*', 'const float*', 'float*', 'float*', 'float*'], '''
#include "/tmp/torchinductor_jiayisun/vu/cvuckxaygqfovv2zu2byqhcmiejbke7mdhf2rpgpr5mlscdev2hg.h"
extern "C"  void kernel(const float* in_ptr0,
                       const float* in_ptr1,
                       const float* in_ptr2,
                       float* out_ptr0,
                       float* out_ptr1,
                       float* out_ptr2)
{
    #pragma omp parallel num_threads(56)
    {
        int tid = omp_get_thread_num();
        {
            #pragma omp for collapse(2)
            for(long x0=static_cast<long>(0L); x0<static_cast<long>(2L); x0+=static_cast<long>(1L))
            {
                for(long x1=static_cast<long>(0L); x1<static_cast<long>(32L); x1+=static_cast<long>(1L))
                {
                    {
                        Welford<float> tmp_acc0 = Welford<float>();
                        Welford<at::vec::Vectorized<float>> tmp_acc0_vec = Welford<at::vec::Vectorized<float>>();
                        Welford<at::vec::Vectorized<float>> masked_tmp_acc0_vec = Welford<at::vec::Vectorized<float>>();
                        static WeightRecp<at::vec::Vectorized<float>> wrecps0(static_cast<long>(17280L));
                        for(long x2=static_cast<long>(0L); x2<static_cast<long>(9216L); x2+=static_cast<long>(1L))
                        {
                            for(long x3=static_cast<long>(0L); x3<static_cast<long>(16L); x3+=static_cast<long>(16L))
                            {
                                auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + static_cast<long>(x3 + (30L*x1) + (960L*x2) + (8847360L*x0)), 16);
                                tmp_acc0_vec = welford_combine(tmp_acc0_vec, tmp0, &wrecps0);
                            }
                            for(long x3=static_cast<long>(16L); x3<static_cast<long>(30L); x3+=static_cast<long>(14L))
                            {
                                auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + static_cast<long>(x3 + (30L*x1) + (960L*x2) + (8847360L*x0)), 14);
                                masked_tmp_acc0_vec = welford_combine(masked_tmp_acc0_vec, tmp0, 14, &wrecps0);
                            }
                        }
                        tmp_acc0 = welford_combine(tmp_acc0, welford_vec_reduce_all(masked_tmp_acc0_vec));
                        tmp_acc0 = welford_combine(tmp_acc0, welford_vec_reduce_all(tmp_acc0_vec));
                        out_ptr0[static_cast<long>(x1 + (32L*x0))] = static_cast<float>(tmp_acc0.mean);
                        out_ptr1[static_cast<long>(x1 + (32L*x0))] = static_cast<float>(tmp_acc0.m2);
                    }
                }
            }
        }
        {
            #pragma omp for collapse(2)
            for(long x0=static_cast<long>(0L); x0<static_cast<long>(2L); x0+=static_cast<long>(1L))
            {
                for(long x1=static_cast<long>(0L); x1<static_cast<long>(9216L); x1+=static_cast<long>(1L))
                {
                    for(long x2=static_cast<long>(0L); x2<static_cast<long>(960L); x2+=static_cast<long>(16L))
                    {
                        auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + static_cast<long>(x2 + (960L*x1) + (8847360L*x0)), 16);
                        auto tmp1 =
                        [&]
                        {
                            __at_align__ std::array<float, 16> tmpbuf;
                            #pragma GCC unroll 16
                            for (long x2_inner = 0; x2_inner < 16; x2_inner++)
                            {
                                tmpbuf[x2_inner] = out_ptr0[static_cast<long>((32L*x0) + (c10::div_floor_integer((x2 + x2_inner), 30L)))];
                            }
                            return at::vec::Vectorized<float>::loadu(tmpbuf.data(), 16);
                        }
                        ()
                        ;
                        auto tmp3 =
                        [&]
                        {
                            __at_align__ std::array<float, 16> tmpbuf;
                            #pragma GCC unroll 16
                            for (long x2_inner = 0; x2_inner < 16; x2_inner++)
                            {
                                tmpbuf[x2_inner] = out_ptr1[static_cast<long>((32L*x0) + (c10::div_floor_integer((x2 + x2_inner), 30L)))];
                            }
                            return at::vec::Vectorized<float>::loadu(tmpbuf.data(), 16);
                        }
                        ()
                        ;
                        auto tmp12 = at::vec::Vectorized<float>::loadu(in_ptr1 + static_cast<long>(x2), 16);
                        auto tmp14 = at::vec::Vectorized<float>::loadu(in_ptr2 + static_cast<long>(x2), 16);
                        auto tmp2 = tmp0 - tmp1;
                        auto tmp4 = static_cast<float>(276480.0);
                        auto tmp5 = at::vec::Vectorized<float>(tmp4);
                        auto tmp6 = tmp3 / tmp5;
                        auto tmp7 = static_cast<float>(1e-05);
                        auto tmp8 = at::vec::Vectorized<float>(tmp7);
                        auto tmp9 = tmp6 + tmp8;
                        auto tmp10 = tmp9.rsqrt();
                        auto tmp11 = tmp2 * tmp10;
                        auto tmp13 = tmp11 * tmp12;
                        auto tmp15 = tmp13 + tmp14;
                        auto tmp16 = at::vec::clamp_min(tmp15, decltype(tmp15)(0));
                        tmp16.store(out_ptr2 + static_cast<long>(x2 + (960L*x1) + (8847360L*x0)));
                    }
                }
            }
        }
    }
}
''')

async_compile.wait(globals())
del async_compile

def call(args):
    arg2_1, = args
    args.clear()
    assert_size_stride(arg2_1, (2, 960, 96, 96), (8847360, 1, 92160, 960))
    buf0 = empty_strided_cpu((2, 32, 1, 1), (32, 1, 64, 64), torch.float32)
    buf1 = empty_strided_cpu((2, 32, 1, 1), (32, 1, 64, 64), torch.float32)
    buf3 = empty_strided_cpu((2, 960, 96, 96), (8847360, 1, 92160, 960), torch.float32)
    cpp_fused_native_group_norm_relu_0(arg2_1, _frozen_param3, _frozen_param2, buf0, buf1, buf3)
    del arg2_1
    return (buf3, )
```

- After:
```
cpp_fused_native_group_norm_relu_0 = async_compile.cpp_pybinding(['const float*', 'const float*', 'const float*', 'float*', 'float*', 'float*'], '''
#include "/tmp/torchinductor_jiayisun/vu/cvuckxaygqfovv2zu2byqhcmiejbke7mdhf2rpgpr5mlscdev2hg.h"
extern "C"  void kernel(const float* in_ptr0,
                       const float* in_ptr1,
                       const float* in_ptr2,
                       float* out_ptr0,
                       float* out_ptr1,
                       float* out_ptr2)
{
    #pragma omp parallel num_threads(56)
    {
        int tid = omp_get_thread_num();
        {
            #pragma omp for collapse(2)
            for(long x0=static_cast<long>(0L); x0<static_cast<long>(2L); x0+=static_cast<long>(1L))
            {
                for(long x1=static_cast<long>(0L); x1<static_cast<long>(32L); x1+=static_cast<long>(1L))
                {
                    {
                        Welford<float> tmp_acc0 = Welford<float>();
                        Welford<at::vec::Vectorized<float>> tmp_acc0_vec = Welford<at::vec::Vectorized<float>>();
                        Welford<at::vec::Vectorized<float>> masked_tmp_acc0_vec = Welford<at::vec::Vectorized<float>>();
                        static WeightRecp<at::vec::Vectorized<float>> wrecps0(static_cast<long>(17280L));
                        for(long x2=static_cast<long>(0L); x2<static_cast<long>(9216L); x2+=static_cast<long>(1L))
                        {
                            for(long x3=static_cast<long>(0L); x3<static_cast<long>(16L); x3+=static_cast<long>(16L))
                            {
                                auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + static_cast<long>(x3 + (30L*x1) + (960L*x2) + (8847360L*x0)), 16);
                                tmp_acc0_vec = welford_combine(tmp_acc0_vec, tmp0, &wrecps0);
                            }
                            for(long x3=static_cast<long>(16L); x3<static_cast<long>(30L); x3+=static_cast<long>(14L))
                            {
                                auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + static_cast<long>(x3 + (30L*x1) + (960L*x2) + (8847360L*x0)), 14);
                                masked_tmp_acc0_vec = welford_combine(masked_tmp_acc0_vec, tmp0, 14, &wrecps0);
                            }
                        }
                        tmp_acc0 = welford_combine(tmp_acc0, welford_vec_reduce_all(masked_tmp_acc0_vec));
                        tmp_acc0 = welford_combine(tmp_acc0, welford_vec_reduce_all(tmp_acc0_vec));
                        out_ptr0[static_cast<long>(x1 + (32L*x0))] = static_cast<float>(tmp_acc0.mean);
                        out_ptr1[static_cast<long>(x1 + (32L*x0))] = static_cast<float>(tmp_acc0.m2);
                    }
                }
            }
        }
        {
            #pragma omp for collapse(2)
            for(long x0=static_cast<long>(0L); x0<static_cast<long>(2L); x0+=static_cast<long>(1L))
            {
                for(long x1=static_cast<long>(0L); x1<static_cast<long>(9216L); x1+=static_cast<long>(1L))
                {
                    #pragma GCC ivdep
                    for(long x2=static_cast<long>(0L); x2<static_cast<long>(32L); x2+=static_cast<long>(1L))
                    {
                        for(long x3=static_cast<long>(0L); x3<static_cast<long>(16L); x3+=static_cast<long>(16L))
                        {
                            auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + static_cast<long>(x3 + (30L*x2) + (960L*x1) + (8847360L*x0)), 16);
                            auto tmp1 = out_ptr0[static_cast<long>(x2 + (32L*x0))];
                            auto tmp4 = out_ptr1[static_cast<long>(x2 + (32L*x0))];
                            auto tmp12 = at::vec::Vectorized<float>::loadu(in_ptr1 + static_cast<long>(x3 + (30L*x2)), 16);
                            auto tmp14 = at::vec::Vectorized<float>::loadu(in_ptr2 + static_cast<long>(x3 + (30L*x2)), 16);
                            auto tmp2 = at::vec::Vectorized<float>(tmp1);
                            auto tmp3 = tmp0 - tmp2;
                            auto tmp5 = static_cast<float>(276480.0);
                            auto tmp6 = tmp4 / tmp5;
                            auto tmp7 = static_cast<float>(1e-05);
                            auto tmp8 = decltype(tmp6)(tmp6 + tmp7);
                            auto tmp9 = 1 / std::sqrt(tmp8);
                            auto tmp10 = at::vec::Vectorized<float>(tmp9);
                            auto tmp11 = tmp3 * tmp10;
                            auto tmp13 = tmp11 * tmp12;
                            auto tmp15 = tmp13 + tmp14;
                            auto tmp16 = at::vec::clamp_min(tmp15, decltype(tmp15)(0));
                            tmp16.store(out_ptr2 + static_cast<long>(x3 + (30L*x2) + (960L*x1) + (8847360L*x0)));
                        }
                        for(long x3=static_cast<long>(16L); x3<static_cast<long>(30L); x3+=static_cast<long>(14L))
                        {
                            auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + static_cast<long>(x3 + (30L*x2) + (960L*x1) + (8847360L*x0)), 14);
                            auto tmp1 = out_ptr0[static_cast<long>(x2 + (32L*x0))];
                            auto tmp4 = out_ptr1[static_cast<long>(x2 + (32L*x0))];
                            auto tmp12 = at::vec::Vectorized<float>::loadu(in_ptr1 + static_cast<long>(x3 + (30L*x2)), 14);
                            auto tmp14 = at::vec::Vectorized<float>::loadu(in_ptr2 + static_cast<long>(x3 + (30L*x2)), 14);
                            auto tmp2 = at::vec::Vectorized<float>(tmp1);
                            auto tmp3 = tmp0 - tmp2;
                            auto tmp5 = static_cast<float>(276480.0);
                            auto tmp6 = tmp4 / tmp5;
                            auto tmp7 = static_cast<float>(1e-05);
                            auto tmp8 = decltype(tmp6)(tmp6 + tmp7);
                            auto tmp9 = 1 / std::sqrt(tmp8);
                            auto tmp10 = at::vec::Vectorized<float>(tmp9);
                            auto tmp11 = tmp3 * tmp10;
                            auto tmp13 = tmp11 * tmp12;
                            auto tmp15 = tmp13 + tmp14;
                            auto tmp16 = at::vec::clamp_min(tmp15, decltype(tmp15)(0));
                            tmp16.store(out_ptr2 + static_cast<long>(x3 + (30L*x2) + (960L*x1) + (8847360L*x0)), 14);
                        }
                    }
                }
            }
        }
    }
}
''')

async_compile.wait(globals())
del async_compile

def call(args):
    arg2_1, = args
    args.clear()
    assert_size_stride(arg2_1, (2, 960, 96, 96), (8847360, 1, 92160, 960))
    buf0 = empty_strided_cpu((2, 32, 1, 1), (32, 1, 64, 64), torch.float32)
    buf1 = empty_strided_cpu((2, 32, 1, 1), (32, 1, 64, 64), torch.float32)
    buf3 = empty_strided_cpu((2, 960, 96, 96), (8847360, 1, 92160, 960), torch.float32)
    cpp_fused_native_group_norm_relu_0(arg2_1, _frozen_param3, _frozen_param2, buf0, buf1, buf3)
    del arg2_1
    return (buf3, )
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132389
Approved by: https://github.com/leslie-fang-intel, https://github.com/jansel

Co-authored-by: Jiong Gong <jiong.gong@intel.com>
2024-09-04 22:42:46 +00:00
87842cc658 [dynamo][super] Corner case where the class is not present in the __mro__ (#135129)
I could not come up with a testcase. This was seen in https://github.com/pytorch/pytorch/issues/93633

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135129
Approved by: https://github.com/yanboliang
ghstack dependencies: #135039, #135121
2024-09-04 22:30:09 +00:00
d9ae92cd6e [Dynamo] Support for proxying frozen dataclasses (#134846)
Fixes https://github.com/pytorch/pytorch/issues/133858

Details: Previously Dynamo would treat dataclasses as UserDefinedVariables. This was non-desirable if we would like to proxy the value into the graph, which is needed for TensorSubclassMetadata. To rectify this, frozen dataclasses are now able to be proxied similarly to NamedTuples. We require the object to be frozen, because if arbitrary mutation were allowed, we would need to replay those mutations in the graph after construction of the object.

For tracing construction of the variable, the generated `__init__` for the dataclass uses `object.__setattr__` because frozen dataclasses throw errors on the usual `__setattr__` invocation. With this treatment, no special handling is needed in dynamo for frozen dataclass construction.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134846
Approved by: https://github.com/bdhirsh, https://github.com/anijain2305
2024-09-04 22:17:00 +00:00
ed06772e35 [TorchElastic] add warning when users try to pass a "use_libuv" argument to create_c10d_store (#135062)
**Summary**
Extend the warning message to be more self-explained

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135062
Approved by: https://github.com/shuqiangzhang
2024-09-04 22:05:51 +00:00
fb1c580892 [BE][optim] Make pyright recognize exported symbols (#135043)
Follows pattern introduced by https://github.com/pytorch/pytorch/pull/80955 which [pyright](https://github.com/microsoft/pyright) prefers over `__all__` symbol, see https://github.com/microsoft/pylance-release/issues/2953#issuecomment-1168956296
Fixes https://github.com/pytorch/pytorch/issues/134985

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135043
Approved by: https://github.com/janeyx99
2024-09-04 21:53:46 +00:00
2276940f8c Make Dynamo inline through torch._library.custom_ops.autograd (#135066)
Fixes https://github.com/pytorch/pytorch/issues/135057

The bug was: in the situation that Dynamo graph breaks in the forward
and Compiled Autograd uses Dynamo to introspect the backward, we end up
running into a "Unsupported: inlining through SKIPFILES" error. The
solution is to mark the entirety of this module as inlineable.

Test Plan:
- new test

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135066
Approved by: https://github.com/bdhirsh, https://github.com/williamwen42, https://github.com/yanboliang
2024-09-04 21:48:28 +00:00
4e6df83d19 [PT] Add out variant for avg_pool1d and adaptive_avg_pool1d (#135051)
Test Plan: CI

Differential Revision: D62148410

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135051
Approved by: https://github.com/SS-JIA
2024-09-04 21:20:01 +00:00
a8611da86f [dynamo][backend match] Optimize backend match for common case (#135121)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135121
Approved by: https://github.com/williamwen42
ghstack dependencies: #135039
2024-09-04 21:02:29 +00:00
09a339fc06 [Flex Attention] update __getitem__ without tree_map_only to support compile (#134627)
Adds a helper function for getting the block mask for a specific row index during decoding. We need this change to avoid the pytree + torch.compile issue #134731. Tested in gpt-fast [pr](https://github.com/pytorch-labs/gpt-fast/pull/196).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134627
Approved by: https://github.com/Chillee
2024-09-04 20:09:41 +00:00
741d52c69f Revert "Add support for 32KB multi_tensor_apply kernel arguments (#134373)"
This reverts commit 08184aa85cf183198ebdf2fd7a49fe7bc4842c13.

Reverted https://github.com/pytorch/pytorch/pull/134373 on behalf of https://github.com/drisspg due to See https://github.com/pytorch/pytorch/issues/135126 for more details ([comment](https://github.com/pytorch/pytorch/pull/134373#issuecomment-2329839011))
2024-09-04 19:44:29 +00:00
dd7cd182ab [AIInfra][DCP] All gather keys checkpoint utils bug fix (#135045)
Summary: All gather keys checkpoint utils bug fix. Dist. get_world_size should have the process group passed in to avoid inconsistent world size in case the process group has changed. This is common in the tests.

Test Plan: UTs

Reviewed By: Saiteja64

Differential Revision: D61578832

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135045
Approved by: https://github.com/MeetVadakkanchery, https://github.com/LucasLLC
2024-09-04 18:49:34 +00:00
eb0fd17bc4 [Profiler] Fix Raw Metadata Iterator (#135096)
Summary:
D62008788 added an extra parameter to the RawTensorMetadata struct. For some reason this causes some corrupted accesses in other tests as described in T200685032.

Once this is removed the tests pass. Going forward we need to document how to add parameters to this portion of the code as the AppendOnlyLists seem to be very rigid.

Test Plan: Ran all the tests locally and they all passed.

Differential Revision: D62171089

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135096
Approved by: https://github.com/aaronenyeshi
2024-09-04 18:41:50 +00:00
c88c19c6de Revert "restore CSE'd node metadata in runtime asserts pass (#134516)"
This reverts commit 1dfb1052395d908ed6e67288c9357e16022da272.

Reverted https://github.com/pytorch/pytorch/pull/134516 on behalf of https://github.com/pianpwk due to breaking NestedTensor test ([comment](https://github.com/pytorch/pytorch/pull/134516#issuecomment-2329738450))
2024-09-04 18:41:21 +00:00
873abfc18e [inductor] fix compile time regression due the (disabled) loop ordering after fusion (#135071)
It's a bit surprised that the code added in Scheduler.fusable_read_and_write would increase compilation time.

Here are some number I get from a H100 on BertForMaskedLM:
- without the fix, cold start compilation time is around 82s
- with the fix, cold start compilation time is around 76s.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135071
Approved by: https://github.com/jansel
2024-09-04 18:36:59 +00:00
d7b57c4d63 Fix tensor.data access under inference_mode and compile (#134878)
Fixes https://github.com/pytorch/pytorch/issues/134798

In the regular Tensor case, when you call Tensor.data, there's a check
for if inference mode is active. If it is active, then we don't set the
version counter. We replicate this check for Tensor Subclasses (the bug
was we were trying to set the version counter on a FakeTensor in
inference_mode).

Test Plan:
- new test

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134878
Approved by: https://github.com/bdhirsh
2024-09-04 17:55:41 +00:00
0d193a0adf Add ExecuTorch warning to mobile_optimizer (#134697)
Preview: https://docs-preview.pytorch.org/pytorch/pytorch/134697/mobile_optimizer.html

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134697
Approved by: https://github.com/ali-khosh, https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2024-09-04 17:47:14 +00:00
193c547461 [inductor] Refactor simplify erase_nodes() (#134822)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134822
Approved by: https://github.com/shunting314
ghstack dependencies: #134748, #134749
2024-09-04 17:32:07 +00:00
2ddf3ed707 [inductor] Allow cudagraphs with unused CPU inputs (#134749)
This pattern was preventing cudagraphs from kicking in on torch_multimodal_clip, resulting in `1.6529 → 3.3471` speedup.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134749
Approved by: https://github.com/shunting314
ghstack dependencies: #134748
2024-09-04 17:32:07 +00:00
cff1158200 [inductor] Pass to fix device on index(..., [iota]) (#134748)
This pattern was preventing cudagraphs from kicking in on torch_multimodal_clip, resulting in `1.6529 → 3.3471` speedup.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134748
Approved by: https://github.com/shunting314
2024-09-04 17:31:58 +00:00
7858045491 Revert "Fix set_unbacked_bindings when list of Tensors is returned (#133585)"
This reverts commit 2a49296d7563150d67bb00bd4c97bc5aafaa77df.

Reverted https://github.com/pytorch/pytorch/pull/133585 on behalf of https://github.com/ezyang due to fails torchrec tests ([comment](https://github.com/pytorch/pytorch/pull/133585#issuecomment-2329602983))
2024-09-04 17:21:32 +00:00
8759ed2ac5 Revert "Compute and do renamings even when ignoring fresh unbacked symbols (#134407)"
This reverts commit 46cb2af7d822681298370bab9d49b3cba5546dd5.

Reverted https://github.com/pytorch/pytorch/pull/134407 on behalf of https://github.com/ezyang due to need to back out https://github.com/pytorch/pytorch/pull/133585 ([comment](https://github.com/pytorch/pytorch/pull/134407#issuecomment-2329597388))
2024-09-04 17:18:21 +00:00
fc07e6bf56 Revert "Ignore fresh unbacked when doing recursive make_fx inside HOPs (#135053)"
This reverts commit a178a053ad2c8e42d1b684ed38385b9646ec3b74.

Reverted https://github.com/pytorch/pytorch/pull/135053 on behalf of https://github.com/ezyang due to need to back out https://github.com/pytorch/pytorch/pull/133585 ([comment](https://github.com/pytorch/pytorch/pull/134407#issuecomment-2329597388))
2024-09-04 17:18:21 +00:00
c8ab9b06a2 Redesign custom op functionlaization for better re-inplace (#134409)
- The new implementation (auto_functionalized_v2) is enabled by default but can be disable
 using an inductor flag.
- In export mode the old implementation is used.

**Motiviation**
Previous functionalization fails to re-inplace arguments when they are view over other tensors.
see issue https://github.com/pytorch/pytorch/issues/131192
The new functionalization is easier to re-inplace for views.

**A) Functionalizations pass**
consider a program:

```

func(t)
    x = t[0]
    y = t[1]
    foo(x, y) # custom operator with x, y mutable
    return (x, y, t)
```

- To functionalize `foo` we generate a function that operates on the base tensors of the inputs;  (x.base() and y.base())
and record how to regenerates the views out of the base for argument x by recording ```ViewInfo=(x.base(), x.size(), x.stride, x,storage_offset())```

- Due to some limitations on the torch.export arguments format, we have to generate alot of arguments, but this is something we can simplify in the future, for the example above we get the following function.

   ```
   auto_functionalized = torch.ops.higher_order.auto_functionalized(torch.ops.mylib.foo.default,
     _x_base_index = 0, _x_size = (), _x_stride = (), _x_storage_offset = 0 ,
     _y_base_index = 0,_y_size = (), _y_stride = (), _y_storage_offset = 1   ,
     _all_bases = [arg0_1])
   ```
 -  In the code above:
        - _all_bases[t]: refers to a unique set of bases for all foo arguments.
        - for each argument x we have _x_base_index, _x_size, _x_stride, _x_storage_offset that can be used to (1)  regenerate x from _all_bases[_x_base_index] or a copy of a the base.

-  the output of auto_functionalized is foo output , followed by x tensors one for each base in  _all_bases, that is a copy of the base tensor after observing the mutations of the all the arguments that are views of that base.

-  for each use of a base in _all_bases or a view of it , that are after the call to foo, replace it with a view of the new output

 for the function above after functionalization we get :
 ```
    def forward(self, arg0_1: "f32[2][1]cpu"):
        auto_functionalized = torch.ops.higher_order.auto_functionalized(torch.ops.mylib.foo.default, _x_base_index = 0, _x_size = (), _x_stride = (), _x_storage_offset = 0, _y_base_index = 0, _y_size = (), _y_stride = (), _y_storage_offset = 1, _all_bases = [arg0_1])
        getitem_1: "f32[2][1]cpu" = auto_functionalized[1];  auto_functionalized = None
        copy_: "f32[2][1]cpu" = torch.ops.aten.copy_.default(arg0_1, getitem_1);  arg0_1 = copy_ = None

        # No stacktrace found for following nodes
        select_2: "f32[][]cpu" = torch.ops.aten.select.int(getitem_1, 0, 0)
        select_3: "f32[][]cpu" = torch.ops.aten.select.int(getitem_1, 0, 1);  getitem_1 = None
        return (select_2, select_3)
```

**B) Semantics of  auto_functionalize**
The new semantics of auto_functionalize is as the following:
1. For each base in all_bases, copy the base and create all_bases copies. (if a base is inplaced we do not need to copy it)
2. For each arg, regenerate the arg from the copy of its base using the view information above.
3. return the original foo output followed by the new bases.

**C) Re-inplace pass**
since auto_functionalize not copy the bases, what we actually inplace is the bases.
 (run just like before but on the beses instead of args).

1. For each base b in _all_bases check if there is any use of base (or its aliases/views) after auto_functionalize (before its overwritten with a copy) if there is not any, then inplace it (avoid copying it in step 1 above).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134409
Approved by: https://github.com/zou3519
2024-09-04 17:08:58 +00:00
195ac85fb6 [Profiler] Allow kwinputs to be non-string values (#134893)
Summary: When we process keyword arguments in profiler today we assume that all values will be strings. This breaks HTA because it assumes that "stream" and other values similar to it will be ints. To fix this we will only put quotes around strings for ivalues.

Test Plan: Add chrome trace export in unit tests and check that stream does not have quotes around it

Differential Revision: D62056059

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134893
Approved by: https://github.com/sanrise, https://github.com/izaitsevfb
2024-09-04 16:34:10 +00:00
60dfe1b35e Fix lint after Bump actions/download-artifact update (#135109)
Fixes lint after auto-generated PR: 367a78495f

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135109
Approved by: https://github.com/ezyang, https://github.com/huydhn
2024-09-04 15:26:17 +00:00
8bfd4916d6 fast path for sympy gcd in floordiv (#134880)
Summary:
Re-implementation of https://github.com/pytorch/pytorch/pull/134150, which was reverted because of some internal tests hanging (case B). The original motivation was to get some other internal test unstuck (case A).

The root cause is that sympy.gcd is both very clever as well as can blow up in some cases. This PR introduces a fast path with an appropriate fallback to sympy.gcd that ensures that both cases A and B go through.

Test Plan:
See the included test for specific examples.
Also https://fb.workplace.com/groups/1075192433118967/posts/1491493248155548/?comment_id=1491938994777640&reply_comment_id=1492622821375924

Differential Revision: D62043315

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134880
Approved by: https://github.com/ezyang
2024-09-04 14:56:49 +00:00
67208f08bd [CD] Enable XPU nightly build on Windows (#134312)
Depends on https://github.com/pytorch/builder/pull/1975 land. Works for https://github.com/pytorch/pytorch/issues/114850
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134312
Approved by: https://github.com/atalman
2024-09-04 14:46:36 +00:00
6c5669903f Fix Invalid NaN comparison due to infinity-zero multiply on latest sympy (#135044)
Fixes https://github.com/pytorch/pytorch/issues/133735

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135044
Approved by: https://github.com/zou3519
2024-09-04 14:13:09 +00:00
a178a053ad Ignore fresh unbacked when doing recursive make_fx inside HOPs (#135053)
Internal xref: https://fb.workplace.com/groups/6829516587176185/posts/7705964779531357/

I'm not sure this is the right approach though...

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135053
Approved by: https://github.com/ydwu4
ghstack dependencies: #134407
2024-09-04 13:25:08 +00:00
46cb2af7d8 Compute and do renamings even when ignoring fresh unbacked symbols (#134407)
This is a bit twisty and I don't entirely understand the situation, but here's my best explanation.

In https://github.com/pytorch/pytorch/pull/133588 I am trying to fix a problem reported by user in https://fb.workplace.com/groups/6829516587176185/permalink/7705964779531357/ The summary of this problem is that when we do collect metadata analysis in AOTAutograd, we accumulate pending unbacked symbols which are going to be discarded at the end of the trace. However, if we do a recursive make_fx inside tracing, as occurs with torch.cond, we end up seeing that there are pending unbacked symbols that aren't associated with a binding, even though it's spurious (they've leaked into the inner make_fx call from the outer AOTAutograd analysis).

In #133588 I tried to just prevent adding the symbols to the pending list at all in the first place. But this itself caused some problems which were fixed in https://github.com/pytorch/pytorch/pull/124785 . The problem fixed in that PR is that when we allocate tangents that have unbacked size, something prevented them from having correct unbacked SymInts when ignore fresh unbacked SymInts was enabled. So I had patched it at the time by just not suppressing pending symbols and clearing them out some other way.

I think... I was wrong in that PR? That is to say, it was OK to avoid putting the fresh unbacked symbols in the pending list; the real problem was suppressing unbacked renamings. But there doesn't seem to be a good reason to suppress these; this PR shows that it doesn't actually fail any tests if you do these anyway. Intuitively, this makes sense, because you can't trigger renamings unless you're actually adding unbacked symbols to the pending set.

But I don't entirely understand all the interactions. I just know that this seems to not cause tests to fail, and it should fix the internal issue (which I need to add a UT for.)

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134407
Approved by: https://github.com/ydwu4
2024-09-04 13:25:07 +00:00
5690f003a6 C10_DIAGNOSTIC_PUSH_AND_IGNORED_IF_DEFINED and C10_DIAGNOST should be used in pairs (#135004)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135004
Approved by: https://github.com/aaronenyeshi
2024-09-04 13:14:23 +00:00
dcf05fcb14 Fix stale job using non-existant ARC runner (#134863)
The ARC CI system has been shutdown so this job is currently using a runner that doesn't exist.

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134863
Approved by: https://github.com/ZainRizvi
2024-09-04 12:57:10 +00:00
a8467c17c3 Remove specific lazy initialization of PrivateUse1 (#135002)
As the title stated, lazy initialization of PrivateUse1 can been
removed because maybe_initialize_device have supported PrivateUse1
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135002
Approved by: https://github.com/albanD
2024-09-04 11:45:45 +00:00
80a6d60829 Moving _run_autocast_outofplace to basic class named TestAutocast to reduce redundance (#134460)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134460
Approved by: https://github.com/EikanWang, https://github.com/ezyang
2024-09-04 10:48:58 +00:00
c2ff9fe042 [fp8 rowwise] Retune the tile heuristics to increase perf (#134781)
I propose a new heuristic function to select tile tile size, cluster size, and transposition given M, N and K. It improves the performance across the board (on average) while remaining simple and relying only on a handful of kernels (to limit build time and binary size).

Across the shapes I benchmarked, the new heuristic gives a (geometric) mean speedup of +16.5%. Some shapes worsen, but 98.6% of the shapes retain their old performance (up to 5% to allow for noise) or improve it.
![image](https://github.com/user-attachments/assets/bca30583-ac32-4af6-a4f9-37164bdb2430)

I benchmarked on over 5.4k different shapes:
- For M and N I swept across all values which are the sums of two powers of 2 (limited to multiples of 64, capped at 16,384)
- For K I only used powers of 2 between 1,024 and 8,192 (based on the intuition that the optimal config doesn't depend on K, which turned out to be the case)

Here's the detailed speedup for each shape
![image](https://github.com/user-attachments/assets/acac4318-9ee0-455d-861b-c764b8c13d22)

<details>
<summary>
This is the code I used to benchmark
</summary>

```
import torch
import torch.utils.benchmark

s = set()

for i in range(6, 15):
    s.add(2**i)
    for j in range(6, i):
        s.add(2**i + 2**j)

ms = [i for i in sorted(s) if i <= 2**14]
ns = [i for i in sorted(s) if i <= 2**14]
ks = [2**i for i in range(10, 14)]

def make_graph(n_iters, f):
    g = torch.cuda.CUDAGraph()
    with torch.cuda.graph(g):
        for _ in range(n_iters):
            f()
    return g

def rowwise_scale(t, dtype_t):
    min_v, max_v = torch.finfo(dtype_t).min, torch.finfo(dtype_t).max
    scale_t = torch.clamp(t.abs().amax(dim=-1, keepdim=True).float(), min=1e-12) / max_v
    t_fp8 = (t / scale_t).clamp(min=min_v, max=max_v).to(dtype_t)
    return t_fp8, scale_t

for m in ms:
    for n in ns:
        for k in ks:
            a = torch.randn((m, k), device="cuda", dtype=torch.float)
            b_t = torch.randn((n, k), device="cuda", dtype=torch.float)
            a_fp8, scale_a = rowwise_scale(a, torch.float8_e4m3fn)
            b_t_fp8, scale_b_t = rowwise_scale(b_t, torch.float8_e4m3fn)
            func = lambda: torch._scaled_mm(
                a_fp8,
                b_t_fp8.t(),
                scale_a=scale_a,
                scale_b=scale_b_t.t(),
                bias=None,
                use_fast_accum=True,
                out_dtype=torch.bfloat16
            )
            print(f"{m=},{n=},{k=}")
            print(torch.utils.benchmark.Timer("g.replay()", globals={"g": make_graph(1000, func)}).blocked_autorange(min_run_time=1).mean / 1000)
```
</details>

<details>
<summary>
This is the code I used for the plots
</summary>

```
from itertools import islice

import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.cm import ScalarMappable
from matplotlib.colors import FuncNorm
from mpl_toolkits.axes_grid1 import ImageGrid

def batched(iterable, n):
    iterator = iter(iterable)
    while batch := tuple(islice(iterator, n)):
        yield batch

def try_to_convert(v):
    if v == "False":
        return False
    if v == "True":
        return True
    return int(v)

def get_from_paste(filename):
    text = open(filename, "rt").read()
    headers = []
    data = []
    for config, value in batched(text.splitlines(), 2):
        config_elems = config.split(",")
        if not headers:
            headers = [e.partition("=")[0] for e in config_elems]
        data.append((*(try_to_convert(e.partition("=")[-1]) for e in config_elems), float(value)))
    return pd.DataFrame(data, columns=headers + ["latency"])

old_latencies = get_from_paste(...)
new_latencies = get_from_paste(...)

ratios = pd.merge(new_latencies, old_latencies, how="left", left_on=["m", "n", "k"], right_on=["m", "n", "k"], suffixes=("_new", "_old"))
ratios = ratios.assign(ratio=ratios.latency_old / ratios.latency_new)

fig = plt.figure(figsize=(40.0, 10.0))
grid = ImageGrid(
    fig,
    111,
    nrows_ncols=(1, 4),
    axes_pad=0.5,
    share_all=True,
    cbar_location="right",
    cbar_mode="single",
    cbar_size="7%",
    cbar_pad=0.15,
)

log_amax = np.max(np.abs(np.log(ratios.ratio.to_numpy())))

for K, ax in zip([1024, 2048, 4096, 8192], grid):
    pivoted = ratios[(ratios.k == K)].pivot_table(index="m", columns="n", values="ratio")
    im = ax.imshow(np.log(pivoted.to_numpy()), origin="lower", vmin=-log_amax, vmax=log_amax, cmap="PiYG")
    m_vals, n_vals = pivoted.axes
    ax.set_xticks(np.arange(len(n_vals)), labels=[f"N={i}" for i in n_vals.values], fontsize=12)
    ax.set_yticks(np.arange(len(m_vals)), labels=[f"M={i}" for i in m_vals.values], fontsize=12)
    plt.setp(ax.get_xticklabels(), rotation=90, ha="right", rotation_mode="anchor")
    ax.grid(False)
    ax.set_title(f"K={K}", fontsize=20)

norm = FuncNorm((lambda x: np.log(x), lambda x: np.exp(x)), np.exp(-log_amax), np.exp(log_amax))
ax.cax.colorbar(ScalarMappable(norm=norm, cmap="PiYG"))
plt.show()

counts, bins = np.histogram(np.log(ratios.ratio.to_numpy()), bins=500)
plt.stairs(counts, np.exp(bins), fill=True)
plt.xscale("function", functions=(lambda x: np.log(x), lambda x: np.exp(x)))
```
</details>

I only benchmarked fast_accum=True and out_dtype=torch.bfloat16 supposing that these are the most commonly-used flags (e.g., with fast_accum=False row-wise scaling is much slower than tensor-wise scaling hence unpractical).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134781
Approved by: https://github.com/drisspg, https://github.com/eqy
ghstack dependencies: #134773
2024-09-04 09:17:28 +00:00
eec8fa038e [fp8 rowwise] Support transposing operands in order to change output layout (#134773)
On some occasion, a column-major output layout is more efficient (it's unclear if it's because of better store coalescing for some tile shapes, or whether it's just that it's CUTLASS's default and thus it's better optimized).

At this stage I only add a flag that allows to transpose, but the hardest will be deciding on a new heuristic to turn it on selectively. This will be in a follow-up PR.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134773
Approved by: https://github.com/drisspg
2024-09-04 09:17:28 +00:00
679b8fe426 Update generate-xnnpack-wrappers.py parsing to handle build identifier (#134724)
Fixes an issue after updating XNNPACK where parsing the XNNPACK CMakeLists breaks. I'm just ignored the generated build identifier for now, since it's not used and we would need to update the buck build to generate it at build time.

Remove unused ukernels_xop XNNPACK target as it has no sources (after the recent update) and causes buck1 to complain.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134724
Approved by: https://github.com/mcr229
2024-09-04 08:45:46 +00:00
1dfb105239 restore CSE'd node metadata in runtime asserts pass (#134516)
Adds val, and optionally stack_trace & nn_module_stack metadata back to SymInt compute nodes that we CSE, with a hook on `graph.create_node()`. Not sure if there's other metadata we want to populate here?

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134516
Approved by: https://github.com/ezyang
2024-09-04 05:56:28 +00:00
9f00317997 rationalize STATIC vs. None (#134877)
Summary:
A bit of refactoring to prepare to remove `None` as a way to specify static dimensions in dynamic shapes, given we already have `Dim.STATIC` for the same purpose. We will now warn whenever this happens. However no tests were modified because problematic uses of `None` still need to behave as they do today, until we are ready to remove support. It should be easy to port tests by replacing the warning function to raise instead.

Note that other uses of `None`, such as for entire values (tensor or non-tensor) remain as is. Moving forward this should be the only purpose of `None` (at least externally).

Finally, there's a bit of confusion in our representation now because `AUTO` also internally transforms to `None`. Renamed dynamic_shapes to transformed_dynamic_shapes where this happens. Overall the two forms (pre and post transformation) have different properties so should probably not be represented in the same format in the future.

Test Plan: existing

Differential Revision: D62040729

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134877
Approved by: https://github.com/pianpwk
2024-09-04 05:34:26 +00:00
9809080b9e [Reland] Refactor caching device allocator utils (#130923)
# Motivation
Following [[RFC] Intel GPU Runtime Upstreaming for Allocator ](https://github.com/pytorch/pytorch/issues/116322), this PR aims to refactor caching device allocator utils to improve code reuse usage.
This is the first PR, we could prepare some follow-up PRs continuing to refactor the device caching allocator.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130923
Approved by: https://github.com/EikanWang, https://github.com/gujinghui, https://github.com/albanD, https://github.com/eqy
2024-09-04 05:31:08 +00:00
6448d351db [inductor] clean up cpp_builder code. (#134909)
Clean up cpp_builder duplication code.

Hi @henrylhtsang , could you please help on land internally?

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134909
Approved by: https://github.com/henrylhtsang
2024-09-04 05:29:08 +00:00
2c9b4d2052 [executorch hash update] update the pinned executorch hash (#135077)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned executorch hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135077
Approved by: https://github.com/pytorchbot
2024-09-04 05:17:29 +00:00
6b05aafc57 Add specializations for VecMaskLoad and VecMaskCast (#126501)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126501
Approved by: https://github.com/jgong5, https://github.com/jansel
ghstack dependencies: #126500
2024-09-04 05:12:52 +00:00
ffd1e214df Back out "[FSDP2] Set ctx.set_materialize_grads(False) for post-backward (#133498)" (#135059)
Summary:
Original commit changeset: 96513cbc425f

Original Phabricator Diff: D61291210

There is some evidence that FB-FM-v4 has better NE with Set ctx.set_materialize_grads(False), especially when pairing up with prefetching.

See https://www.internalfb.com/intern/anp/view/?id=5732259

Test Plan:
export NUM_WORKERS=128
export BATCH_SIZE=1024
export CONFIG_FILE="mast_joint_arch_exploration_cmf_updated_fbfm_v3_fsdp2.yaml"

export ENTITLEMENT=ads_global_tc_2k_training_large_short
buck2 run mode/opt //aps_models/ads/icvr:icvr_launcher -c fbcode.platform010_cuda_version=12 -c hpc_comms.use_nccl=2.17.1 -- mode=${CONFIG_FILE} launcher.tags='[ads_ranking_taxonomy_monetization_genai]' launcher.data_project=pytorch_at_scale launcher.max_retries=10 launcher.fbl_entitl
ement=${ENTITLEMENT} launcher.oncall=pytorch_training_enablement launcher.hardware=GRANDTETON launcher.num_workers=${NUM_WORKERS} data_loader.dataset.batch_size=${BATCH_SIZE} training.planner.proposer=dynamic_col_dim training.planner.proposer.optim_target=h
bm 2>&1| tee ~/tmp/log.mast

Differential Revision: D62009163

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135059
Approved by: https://github.com/awgu
2024-09-04 04:50:32 +00:00
cyy
c818ecd169 Remove Caffe2 code from tool scripts (#134941)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134941
Approved by: https://github.com/ezyang
2024-09-04 03:47:58 +00:00
9e6f4f3f77 [dynamo] Use __eq__ for backend match (#135039)
Fixes https://github.com/pytorch/pytorch/issues/131150

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135039
Approved by: https://github.com/jansel
2024-09-04 03:35:18 +00:00
367a78495f Bump actions/download-artifact from 2 to 4.1.7 in /.github/workflows (#135068)
Bumps [actions/download-artifact](https://github.com/actions/download-artifact) from 2 to 4.1.7.
- [Release notes](https://github.com/actions/download-artifact/releases)
- [Commits](https://github.com/actions/download-artifact/compare/v2...v4.1.7)

---
updated-dependencies:
- dependency-name: actions/download-artifact
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-09-03 20:33:57 -07:00
362ecd9817 [inductor] Skip the sub-process pool until it's ready (#133508)
Summary: Torch-compiling a quick script can be a bit slower than it needs to be: even though we initialize the subprocess pool early, it still might not be ready by the time we try to compile the first Triton kernel. Instead, let's use the single-threaded path until the pool has successfully completed a no-op job.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133508
Approved by: https://github.com/Chillee
2024-09-04 03:26:55 +00:00
7600e9b36f [ONNX] Use the stable APIs in onnxscript and sync the latest logic (#134782)
Use the stable apis from onnxscript: https://github.com/microsoft/onnxscript/issues/1827
Sync with torch-onnx at https://github.com/justinchuby/torch-onnx/compare/v0.1.12...v0.1.15.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134782
Approved by: https://github.com/titaiwangms
2024-09-04 03:10:20 +00:00
982e27e532 [halide-backend] Update CI pin (#130258)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130258
Approved by: https://github.com/eellison
2024-09-04 03:08:49 +00:00
ae3aa8ff73 [AOTI][Tooling][5/n] Refactor the debug printer call to a level lower (#134789)
Summary:
1. Move the debug printer call a level lower -> at here
:https://www.internalfb.com/code/fbsource/[931d7bbb9e7cf2dcb926f42718f56fc940903eec]/fbcode/caffe2/torch/_inductor/codegen/cpp_wrapper_cuda.py?lines=335
2. Add UT for validating debug printer for user defined triton kernel codegen

The benefit of having the debug printer call happens at a more centralized place is 1) reduce the duplicate debug printer related logic code scattered everywhere in the codebase 2) it can handle more triton kernel codegen path as long as it invokes this `generate_kernel_call()` for example,  it can automatically handle/support user_defined_kernel 's debug printing which is a pretty common use case we encounter in debugging

Test Plan:
```AOT_INDUCTOR_DEBUG_INTERMEDIATE_VALUE_PRINTER=2 TORCHINDUCTOR_FORCE_DISABLE_CACHES=1  TORCHINDUCTOR_ABI_COMPATIBLE=1 TORCH_COMPILE_DEBUG=1 TORCH_LOGS="+graph, inductor, +schedule, output_code" buck2 run -c fbcode.enable_gpu_sections=true -c fbcode.nvcc_arch=h100 @//mode/opt fbcode//caffe2/test/inductor:test_aot_inductor -- -r test_aoti_debug_printer_user_defined_triton_kernel_abi_compatible_cuda```

Also verified that templateKernel codegen path still works

Differential Revision: D61949020

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134789
Approved by: https://github.com/ColinPeppler
2024-09-04 02:41:30 +00:00
ea89f01281 Remove unused comment (#135034)
As part of my rampup I've been reading through some of @ezyang's diffs. I noticed in https://github.com/pytorch/pytorch/pull/133439 there was a comment that he forgot to remove. This diff removes that comment.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135034
Approved by: https://github.com/albanD
2024-09-04 02:32:26 +00:00
175485097a [EASY] Typofix (#135022)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135022
Approved by: https://github.com/albanD
2024-09-04 01:59:40 +00:00
15c25c4580 Fix dim mismatch logic automatic dynamic not working with compiler collectives (#135025)
Fixes
https://fb.workplace.com/groups/3095840833991792/permalink/3810738595835342/

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135025
Approved by: https://github.com/albanD
2024-09-04 01:50:21 +00:00
4ebf6b04a8 Turn on expanded index path for Half on CPU (#133553)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133553
Approved by: https://github.com/yanbing-j, https://github.com/jgong5, https://github.com/peterbell10
2024-09-04 00:56:56 +00:00
e000cf0ad9 Fix license metadata in setup.py (#129219)
Package metadata in setup.py lists license as BSD-3 which is not a valid SPDX id. The correct id would be BSD-3-Clause.

Specifying an SPDX id is beneficial to license compliance scanning.

*Taking up #129123 from my personal account.*
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129219
Approved by: https://github.com/malfet, https://github.com/kit1980
2024-09-04 00:21:22 +00:00
45743019cf [PT2][Optimus] Skip meta update on symblic shape (#134975)
Summary: We noticed that there will be runtime error to do the dim broadcast when the meta example value has symbolic shape, thus we skip it.

Test Plan:
```
buck2 run mode/opt //caffe2/benchmarks/dynamo/fb:torchbench_run_ads_dhen_5x_training -- -m ads_dhen_5x -t training
```

P1559019921

Differential Revision: D62115015

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134975
Approved by: https://github.com/xuzhao9
2024-09-04 00:05:51 +00:00
9ffcca7060 [Profiler] Handle Tensor Sizes/Strides Parsing Error (#134862)
Summary:
Currently some jobs are encountering the following trace, P1539415198. This suggests that when we are parsing through tensors the path is prone to encountering an invalid address. This is is possibly occurring because for some reason the sizes() and strides() of a Tensor seem to not be of the same dimensions. We assume such when iterating through the shapes to get the Ivalue generator. When browsing some of the tensor implementations, I found that some of the size and stride paths are different which could be the cause of this issue. Regardless, the profiler should be flexible enough to handle such issues without bringing down the whole main thread.

If the crashes still persist, it will still give us a data point as to where they are occurring and we can rule out the strides/sizes as the culprit

Test Plan: This change doesn't break anything in the happy path, just makes sure the bad path is not exited abruptly. We should use this in order to debug what the events are having mismatching dimensions between sizes and strides.

Differential Revision: D62008788

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134862
Approved by: https://github.com/aaronenyeshi
2024-09-03 23:46:38 +00:00
f05b716d6d Add validator to ensure runner determinator script is kept in sync (#134800)
We keep two copies of the runner-determinator script:
1. In runner_determinator.py, for ease of testing.  This however is not actually executed during CI
2. Embedded in _runner-determinator.yml.  This is what CI uses.

Why the duplication? Short version: Because of how github CI works, during a given CI run the workflow yml files could actually come from the main branch, while the remaining files get read from the local commit.
This can lead to a newer version of _runner-determinator.yml trying to invoke an older version of runner_determintor.py than it was actually designed for. Chaos ensues.

We mitigate this by embedding the script into the yml file.  But we still keep the script around because it's much easier to run tests against.

This workflow's job is to ensure that if one edits the script in one of those two locations then they remember to update it in the other location as well
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134800
Approved by: https://github.com/zxiiro, https://github.com/PaliC
ghstack dependencies: #134796
2024-09-03 23:29:04 +00:00
469429b959 Refactor runner determinator (#134796)
Some minor refactorings to make the code easier to parse and easier to add unit tests for.  Keeping this as a separate PR for ease of review, since it should have zero functional behavior changes
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134796
Approved by: https://github.com/zxiiro, https://github.com/PaliC
2024-09-03 23:29:04 +00:00
c044deb9ce Revert "c10d/logging: add C10D_LOCK_GUARD (#134131)"
This reverts commit f33bcbe5fd67e6b18be259ad2f0dc11c74157075.

Reverted https://github.com/pytorch/pytorch/pull/134131 on behalf of https://github.com/kit1980 due to See D61985186 ([comment](https://github.com/pytorch/pytorch/pull/134131#issuecomment-2327556381))
2024-09-03 22:35:14 +00:00
2fd36086bc Revert "Add torch.serialization.skip_data context manager (#134504)"
This reverts commit 94db935749b8de99d8c3ab23fb880c67c8f3e67a.

Reverted https://github.com/pytorch/pytorch/pull/134504 on behalf of https://github.com/kit1980 due to See D62082697 ([comment](https://github.com/pytorch/pytorch/pull/134504#issuecomment-2327542276))
2024-09-03 22:21:27 +00:00
85fa019697 [Docs] Fix call to deprecated function (#135037)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135037
Approved by: https://github.com/janeyx99, https://github.com/jbschlosser
2024-09-03 20:57:11 +00:00
14c8ef5198 autolabel aotinductor->export (#135040)
"module: aotinductor" will automatically add "oncall: export".

Test Plan:
- none
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135040
Approved by: https://github.com/ydwu4
2024-09-03 20:17:51 +00:00
c40e622966 [inductor] add openmp config for intel conpiler on Linux. (#134973)
Config `openmp` for Intel Compiler on Linux.

Base on this PR, we can confirm the Intel optimized libraries are work built well.
<img width="1039" alt="image" src="https://github.com/user-attachments/assets/838d5114-c778-4961-9cfe-39a814647089">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134973
Approved by: https://github.com/jgong5, https://github.com/jansel
2024-09-03 20:10:21 +00:00
272f3b9fe1 [FlexAttention] Update tolerance for failing test (#135035)
Summary: Address: T198937061

Test Plan: buck2 test 'fbcode//mode/opt' fbcode//caffe2/test/inductor:flex_attention -- --exact 'caffe2/test/inductor:flex_attention - test_no_q_info_compile_False (caffe2.test.inductor.test_flex_attention.TestBlockMask)' --run-disabled

Differential Revision: D62137797

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135035
Approved by: https://github.com/Chillee
2024-09-03 20:09:21 +00:00
e7731b3f8a [TorchElastic] make torch elastic not have to realize TCPStore backend type and rely on c10d to decide which backend to use (#134882)
D53335860 and D56435815 added an option to torch elastic allowing users to choose a TCPStore backend type to use via
1) explicit argument passing in user code when instantiating `MastRendezvousHandler`
2) pass `--use_libuv` command line argument to `torchrun`.

The motivation was to offer a quick way to roll back to non-libuv TCPStore backend since we were making libuv the default in `c10d` code. Now we think that it's better to have torch elastic to not realize the TCPStore backend type but rely on `c10d`'s mechanism to decide which backend to use for torch elastic as well. In this sense, the TCPStore backend type used by torch elastic will be identical to that in pytorch.

PyTorch TCPStore uses the environment variable `USE_LIBUV` to determine the backend type:
when `USE_LIBUV="0"`, the non-libuv backend will be used.
when `USE_LIBUV="1"`, the libuv backend will be used. And this is the default option.

Differential Revision: [D58259590](https://our.internmc.facebook.com/intern/diff/D58259590/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134882
Approved by: https://github.com/shuqiangzhang
2024-09-03 19:43:21 +00:00
71383dd3da [MPS] Fix bachnorm_2d for channels last (#134618)
By skipping gather of input tensor if memory_layout is channels_last, which is a first step towards fixing  https://github.com/pytorch/pytorch/issues/134580

Though underlying problem is much more interesting, i.e. MPS does not have a generic support for channels last, but `c10::is_contiguoius()` is true for channels last layout.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134618
Approved by: https://github.com/albanD
2024-09-03 19:20:11 +00:00
758d787901 Added complex support for torch.logsumexp (#133187)
Added complex support for `torch.logsumexp`. Implemented complex backward pass for `torch.logsumexp`.

Fixes #133047

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133187
Approved by: https://github.com/amjames, https://github.com/lezcano
2024-09-03 17:28:36 +00:00
6c3767452d Move auto functionalize tests in their own test file (#134834)
title + use `with torch.library._scoped_library as lib` when needed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134834
Approved by: https://github.com/zou3519
ghstack dependencies: #134831
2024-09-03 17:09:03 +00:00
2e0b114c06 add a new Guage API with an empty backend to PyTorch core (#134883)
Summary:
The current use case is to continuously measure the total allocated and reserved CUDA memory size from CUDACachingAllocator, and export their distribution (min, max, p90 etc) over time as timeseries.

The current callback-based API does not work because the backend decides when the measurement is taken, so data points between two measurements may not be recorded. The distribution (e.g. max) as such will not be accurate.

This new API closely follow the design of the existing WaitCounter API otherwise.

This is not quite a synchronous version of DynamicCounter, as summing multiple data points does not make sense to my use case

Test Plan: CI

Differential Revision: D61837528

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134883
Approved by: https://github.com/c-p-i-o
2024-09-03 17:08:47 +00:00
7804c089c6 [BE] Update numpy version to 2.0.2 (#134875)
It's long time to abandon pre-release version

Partially addresses https://github.com/pytorch/pytorch/issues/134868
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134875
Approved by: https://github.com/justinchuby, https://github.com/clee2000, https://github.com/kit1980, https://github.com/atalman, https://github.com/Skylion007
2024-09-03 17:02:35 +00:00
1b9f51bd88 [ONNX] Bump onnxscript version in CI; temporarily remove op test (#133748)
Bump onnxscript version in CI to 0.1.0.dev20240831, and temporarily remove the fx consistency test. We will add a better version back later.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133748
Approved by: https://github.com/titaiwangms
2024-09-03 16:30:07 +00:00
27677ead7c Revert "[ONNX] Bump onnxscript version in CI; temporarily remove op test (#133748)"
This reverts commit 6eed63c8b9c4f54a573bb51960d252cd42bfab0c.

Reverted https://github.com/pytorch/pytorch/pull/133748 on behalf of https://github.com/ZainRizvi due to The version bump appears to be pulling in an unavailable numpy version? [GH job link](https://github.com/pytorch/pytorch/actions/runs/10686076754/job/29620426371) [HUD commit link](6eed63c8b9) ([comment](https://github.com/pytorch/pytorch/pull/133748#issuecomment-2326932868))
2024-09-03 16:19:47 +00:00
a258844a32 Properly handle empty CPUINFO variable (#134916)
Fixes https://github.com/pytorch/pytorch/issues/134915

But I did not root cause why CPUINFO is totally empty to begin with...

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134916
Approved by: https://github.com/Skylion007
2024-09-03 15:59:59 +00:00
f927bcb934 Revert "[Inductor] Apply loop split optimization in codegen_node (#132389)"
This reverts commit 3cb5d251224b3fb59b5a10c6fefbb4c84eb565a6.

Reverted https://github.com/pytorch/pytorch/pull/132389 on behalf of https://github.com/ZainRizvi due to Hi, this seems to be breaking in trunk. See test_dataloader.py::TestDataLoader::test_segfault [GH job link](https://github.com/pytorch/pytorch/actions/runs/10660461216/job/29556282081) [HUD commit link](de3a641476) ([comment](https://github.com/pytorch/pytorch/pull/132389#issuecomment-2326843129))
2024-09-03 15:40:45 +00:00
6eed63c8b9 [ONNX] Bump onnxscript version in CI; temporarily remove op test (#133748)
Bump onnxscript version in CI to 0.1.0.dev20240831, and temporarily remove the fx consistency test. We will add a better version back later.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133748
Approved by: https://github.com/titaiwangms
2024-09-03 15:33:09 +00:00
33ba952e31 [subclasses] Do not fakeTensor const prop subclass args (#134855)
The issue:

Const propagation checks only if arguments do not have FakeTensor. If argument is Subclass, it will pass this condition.

As a result Const Propogation execution happens without FakeTensorMode and having tensor factories inside Subclass.__torch_dispatch__ results that this Tensor is not Fakified.

Solution:

If we have subclasses arguments, do not count that const propagation is doable

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134855
Approved by: https://github.com/zou3519
2024-09-03 13:31:49 +00:00
2a49296d75 Fix set_unbacked_bindings when list of Tensors is returned (#133585)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133585
Approved by: https://github.com/albanD
2024-09-03 12:23:31 +00:00
2443507acc Update torch-xpu-ops pin (ATen XPU implementation) (#134983)
Release cycle for PyTorch 2.5
1. Enable Windows build in latest torch-xpu-ops. Resolved large bin issue.
2. Refine test infrastructure for compatibility on different HW platforms.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134983
Approved by: https://github.com/EikanWang
2024-09-03 12:14:37 +00:00
39935e0fde Update cpuinfo submodule (#134891)
Last time it was done in June by https://github.com/pytorch/pytorch/pull/127505
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134891
Approved by: https://github.com/Skylion007
2024-09-03 09:29:59 +00:00
23a2161ad1 Changed addmv to be a decomposition and not a fallback (#134823)
Overall seems to be faster

![image](https://github.com/user-attachments/assets/0cbea76e-fb78-4634-9265-047de0291549)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134823
Approved by: https://github.com/jansel
ghstack dependencies: #134813, #134818, #134819
2024-09-03 06:33:31 +00:00
9856bc50a2 Switch nanmedian to not cuda synchronize (#134819)
Generally, this seems to be faster.

![image](https://github.com/user-attachments/assets/43a86c6f-236d-4ba1-aae0-14e3d88ae401)

And as an added benefit, it works great with cudagraphs and such :)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134819
Approved by: https://github.com/Skylion007, https://github.com/eqy
ghstack dependencies: #134813, #134818
2024-09-03 06:33:31 +00:00
6fce1faa10 change multinomial to use async asserts instead of a synchronization (#134818)
Fixes https://github.com/pytorch/pytorch/issues/134442

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134818
Approved by: https://github.com/ezyang
ghstack dependencies: #134813
2024-09-03 06:33:24 +00:00
db193d1e29 add msg to _assert_async (#134813)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134813
Approved by: https://github.com/ezyang, https://github.com/eqy, https://github.com/albanD
2024-09-03 06:33:18 +00:00
d14fe3ffed [Inductor][CPP] Turns on inline_inbuilt_nn_modules for CPP GEMM template testing (#132487)
**Summary**
The CPP GEMM template testing has been skipped with turning on `inline_inbuilt_nn_modules ` as in https://github.com/pytorch/pytorch/issues/131929.  Since https://github.com/pytorch/pytorch/pull/132334 has landed to fix the issues. Turn on this flag back since it's default.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132487
Approved by: https://github.com/anijain2305, https://github.com/jgong5
2024-09-03 05:05:50 +00:00
a00fad0177 Add specializations for vectorized conversion between float and BF16/FP16 (#126500)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126500
Approved by: https://github.com/jgong5, https://github.com/jansel
2024-09-03 02:09:12 +00:00
45f11094b6 [ONNX] Delete op_level_debug from torch.onnx.ExportOptions (#134961)
op_level_debug helped to identify missing operators, and wrongly implemented operators at the time that dynamo exporter relied on nearest matching and torchlib was just created. However, right now, with dispatcher logic improved and torchlib becomes mature, we no longer need it.

PS: op-level-debug diagnostics rule is not deleted in this PR, as it auto generates lint error code, and need more time to fix. We can delete it when we retire sarif.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134961
Approved by: https://github.com/justinchuby
2024-09-02 23:38:39 +00:00
4c1dd13ba3 [BE] better type annotation for torch.types (#129559)
Closes #129525

- #129525

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129559
Approved by: https://github.com/ezyang
2024-09-02 15:35:32 +00:00
76710d4f95 Corrected docstring of `solve_triangular` (#129766)
**Description**
The arguments docstring of [torch.linalg.solve_triangular](https://pytorch.org/docs/stable/generated/torch.linalg.solve_triangular.html#torch.linalg.solve_triangular) incorrectly describes the shape of the ``A`` argument if the option ``left=True``.

The argument ``A`` should have shape $k \times k$ if ``left=False`` in line with the rest of the docstring and the implementation.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129766
Approved by: https://github.com/lezcano
2024-09-02 13:30:30 +00:00
ee03530fd9 Add a test to avoid decorator based regression for cprofile traces (#133086)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133086
Approved by: https://github.com/aorenste
2024-09-02 12:53:34 +00:00
FEI
16de25b1dc fix tensor_repr(at::Tensor) (#134762) (#134764)
Fixes #134762
@ezyang @antocuni
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134764
Approved by: https://github.com/ezyang

Co-authored-by: Edward Z. Yang <ezyang@meta.com>
2024-09-02 06:05:10 +00:00
3daca187aa [Inductor] Allow customizing the padding format (#133939)
Based on https://github.com/pytorch/pytorch/pull/130956.

Inductor already supports padding through the `config.comprehensive_padding` option, but the padding format involves a few heuristics that are specific to Nvidia GPUs:
  - When we pad, it is always aligned to the next multiple of 128 bytes.
  - Strides smaller than 1024 are not padded.
  - Only intermediate values are padded, not outputs.

 The last of these is not really GPU-specific, but there are certain cases where we may want to override it. For example, padding outputs is useful on hardware accelerators with specific memory alignment requirements, or for applications where performance is more important than conformity with eager mode.

 This PR surfaces padding parameters up to Inductor's config module, so the user can control them.
   - `config.pad_outputs`: choose whether to pad outputs (default: `False`)
   - `config.padding_alignment_bytes`: choose the alignment size for padding (default: `128`)
   - `config.padding_stride_threshold`:  choose the smallest stride that we will pad. For example, setting this to 0 will pad all unaligned strides. (default: `1024`)

 **Test plan**
 Added a new test in `test_padding.py` which tries various combinations of these options, checking that the output strides match our expectations.

  These changes should not affect perf, because the defaults are identical to Inductor's current behavior.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133939
Approved by: https://github.com/shunting314

Co-authored-by: Yueming Hao <yhao@meta.com>
2024-09-02 05:56:33 +00:00
de3a641476 [executorch hash update] update the pinned executorch hash (#134914)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned executorch hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134914
Approved by: https://github.com/pytorchbot
2024-09-02 03:52:40 +00:00
3cb5d25122 [Inductor] Apply loop split optimization in codegen_node (#132389)
This PR applies loop split optimization in codegen_node to avoid non-contiguous load. When the vector is loaded in a non-contiguous manner due to a division in the index, we eliminate the division by splitting the loop to avoid non-contiguous load.

Example:
```
import torch
import torch.nn as nn

class GNReLU(torch.nn.Module):
    def __init__(self, num_groups, num_channels):
        super(GNReLU, self).__init__()
        self.gn = nn.GroupNorm(num_groups, num_channels)

    def forward(self, x):
        return torch.nn.functional.relu(self.gn(x))

input = torch.randn(2, 960, 96, 96).to(memory_format=torch.channels_last)
m = GNReLU(32, 960).eval()
compiled_m = torch.compile(m)

with torch.no_grad():
    compiled_m(input)
```

Generated code:

- Before:
```
cpp_fused_native_group_norm_relu_0 = async_compile.cpp_pybinding(['const float*', 'const float*', 'const float*', 'float*', 'float*', 'float*'], '''
#include "/tmp/torchinductor_jiayisun/vu/cvuckxaygqfovv2zu2byqhcmiejbke7mdhf2rpgpr5mlscdev2hg.h"
extern "C"  void kernel(const float* in_ptr0,
                       const float* in_ptr1,
                       const float* in_ptr2,
                       float* out_ptr0,
                       float* out_ptr1,
                       float* out_ptr2)
{
    #pragma omp parallel num_threads(56)
    {
        int tid = omp_get_thread_num();
        {
            #pragma omp for collapse(2)
            for(long x0=static_cast<long>(0L); x0<static_cast<long>(2L); x0+=static_cast<long>(1L))
            {
                for(long x1=static_cast<long>(0L); x1<static_cast<long>(32L); x1+=static_cast<long>(1L))
                {
                    {
                        Welford<float> tmp_acc0 = Welford<float>();
                        Welford<at::vec::Vectorized<float>> tmp_acc0_vec = Welford<at::vec::Vectorized<float>>();
                        Welford<at::vec::Vectorized<float>> masked_tmp_acc0_vec = Welford<at::vec::Vectorized<float>>();
                        static WeightRecp<at::vec::Vectorized<float>> wrecps0(static_cast<long>(17280L));
                        for(long x2=static_cast<long>(0L); x2<static_cast<long>(9216L); x2+=static_cast<long>(1L))
                        {
                            for(long x3=static_cast<long>(0L); x3<static_cast<long>(16L); x3+=static_cast<long>(16L))
                            {
                                auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + static_cast<long>(x3 + (30L*x1) + (960L*x2) + (8847360L*x0)), 16);
                                tmp_acc0_vec = welford_combine(tmp_acc0_vec, tmp0, &wrecps0);
                            }
                            for(long x3=static_cast<long>(16L); x3<static_cast<long>(30L); x3+=static_cast<long>(14L))
                            {
                                auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + static_cast<long>(x3 + (30L*x1) + (960L*x2) + (8847360L*x0)), 14);
                                masked_tmp_acc0_vec = welford_combine(masked_tmp_acc0_vec, tmp0, 14, &wrecps0);
                            }
                        }
                        tmp_acc0 = welford_combine(tmp_acc0, welford_vec_reduce_all(masked_tmp_acc0_vec));
                        tmp_acc0 = welford_combine(tmp_acc0, welford_vec_reduce_all(tmp_acc0_vec));
                        out_ptr0[static_cast<long>(x1 + (32L*x0))] = static_cast<float>(tmp_acc0.mean);
                        out_ptr1[static_cast<long>(x1 + (32L*x0))] = static_cast<float>(tmp_acc0.m2);
                    }
                }
            }
        }
        {
            #pragma omp for collapse(2)
            for(long x0=static_cast<long>(0L); x0<static_cast<long>(2L); x0+=static_cast<long>(1L))
            {
                for(long x1=static_cast<long>(0L); x1<static_cast<long>(9216L); x1+=static_cast<long>(1L))
                {
                    for(long x2=static_cast<long>(0L); x2<static_cast<long>(960L); x2+=static_cast<long>(16L))
                    {
                        auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + static_cast<long>(x2 + (960L*x1) + (8847360L*x0)), 16);
                        auto tmp1 =
                        [&]
                        {
                            __at_align__ std::array<float, 16> tmpbuf;
                            #pragma GCC unroll 16
                            for (long x2_inner = 0; x2_inner < 16; x2_inner++)
                            {
                                tmpbuf[x2_inner] = out_ptr0[static_cast<long>((32L*x0) + (c10::div_floor_integer((x2 + x2_inner), 30L)))];
                            }
                            return at::vec::Vectorized<float>::loadu(tmpbuf.data(), 16);
                        }
                        ()
                        ;
                        auto tmp3 =
                        [&]
                        {
                            __at_align__ std::array<float, 16> tmpbuf;
                            #pragma GCC unroll 16
                            for (long x2_inner = 0; x2_inner < 16; x2_inner++)
                            {
                                tmpbuf[x2_inner] = out_ptr1[static_cast<long>((32L*x0) + (c10::div_floor_integer((x2 + x2_inner), 30L)))];
                            }
                            return at::vec::Vectorized<float>::loadu(tmpbuf.data(), 16);
                        }
                        ()
                        ;
                        auto tmp12 = at::vec::Vectorized<float>::loadu(in_ptr1 + static_cast<long>(x2), 16);
                        auto tmp14 = at::vec::Vectorized<float>::loadu(in_ptr2 + static_cast<long>(x2), 16);
                        auto tmp2 = tmp0 - tmp1;
                        auto tmp4 = static_cast<float>(276480.0);
                        auto tmp5 = at::vec::Vectorized<float>(tmp4);
                        auto tmp6 = tmp3 / tmp5;
                        auto tmp7 = static_cast<float>(1e-05);
                        auto tmp8 = at::vec::Vectorized<float>(tmp7);
                        auto tmp9 = tmp6 + tmp8;
                        auto tmp10 = tmp9.rsqrt();
                        auto tmp11 = tmp2 * tmp10;
                        auto tmp13 = tmp11 * tmp12;
                        auto tmp15 = tmp13 + tmp14;
                        auto tmp16 = at::vec::clamp_min(tmp15, decltype(tmp15)(0));
                        tmp16.store(out_ptr2 + static_cast<long>(x2 + (960L*x1) + (8847360L*x0)));
                    }
                }
            }
        }
    }
}
''')

async_compile.wait(globals())
del async_compile

def call(args):
    arg2_1, = args
    args.clear()
    assert_size_stride(arg2_1, (2, 960, 96, 96), (8847360, 1, 92160, 960))
    buf0 = empty_strided_cpu((2, 32, 1, 1), (32, 1, 64, 64), torch.float32)
    buf1 = empty_strided_cpu((2, 32, 1, 1), (32, 1, 64, 64), torch.float32)
    buf3 = empty_strided_cpu((2, 960, 96, 96), (8847360, 1, 92160, 960), torch.float32)
    cpp_fused_native_group_norm_relu_0(arg2_1, _frozen_param3, _frozen_param2, buf0, buf1, buf3)
    del arg2_1
    return (buf3, )
```

- After:
```
cpp_fused_native_group_norm_relu_0 = async_compile.cpp_pybinding(['const float*', 'const float*', 'const float*', 'float*', 'float*', 'float*'], '''
#include "/tmp/torchinductor_jiayisun/vu/cvuckxaygqfovv2zu2byqhcmiejbke7mdhf2rpgpr5mlscdev2hg.h"
extern "C"  void kernel(const float* in_ptr0,
                       const float* in_ptr1,
                       const float* in_ptr2,
                       float* out_ptr0,
                       float* out_ptr1,
                       float* out_ptr2)
{
    #pragma omp parallel num_threads(56)
    {
        int tid = omp_get_thread_num();
        {
            #pragma omp for collapse(2)
            for(long x0=static_cast<long>(0L); x0<static_cast<long>(2L); x0+=static_cast<long>(1L))
            {
                for(long x1=static_cast<long>(0L); x1<static_cast<long>(32L); x1+=static_cast<long>(1L))
                {
                    {
                        Welford<float> tmp_acc0 = Welford<float>();
                        Welford<at::vec::Vectorized<float>> tmp_acc0_vec = Welford<at::vec::Vectorized<float>>();
                        Welford<at::vec::Vectorized<float>> masked_tmp_acc0_vec = Welford<at::vec::Vectorized<float>>();
                        static WeightRecp<at::vec::Vectorized<float>> wrecps0(static_cast<long>(17280L));
                        for(long x2=static_cast<long>(0L); x2<static_cast<long>(9216L); x2+=static_cast<long>(1L))
                        {
                            for(long x3=static_cast<long>(0L); x3<static_cast<long>(16L); x3+=static_cast<long>(16L))
                            {
                                auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + static_cast<long>(x3 + (30L*x1) + (960L*x2) + (8847360L*x0)), 16);
                                tmp_acc0_vec = welford_combine(tmp_acc0_vec, tmp0, &wrecps0);
                            }
                            for(long x3=static_cast<long>(16L); x3<static_cast<long>(30L); x3+=static_cast<long>(14L))
                            {
                                auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + static_cast<long>(x3 + (30L*x1) + (960L*x2) + (8847360L*x0)), 14);
                                masked_tmp_acc0_vec = welford_combine(masked_tmp_acc0_vec, tmp0, 14, &wrecps0);
                            }
                        }
                        tmp_acc0 = welford_combine(tmp_acc0, welford_vec_reduce_all(masked_tmp_acc0_vec));
                        tmp_acc0 = welford_combine(tmp_acc0, welford_vec_reduce_all(tmp_acc0_vec));
                        out_ptr0[static_cast<long>(x1 + (32L*x0))] = static_cast<float>(tmp_acc0.mean);
                        out_ptr1[static_cast<long>(x1 + (32L*x0))] = static_cast<float>(tmp_acc0.m2);
                    }
                }
            }
        }
        {
            #pragma omp for collapse(2)
            for(long x0=static_cast<long>(0L); x0<static_cast<long>(2L); x0+=static_cast<long>(1L))
            {
                for(long x1=static_cast<long>(0L); x1<static_cast<long>(9216L); x1+=static_cast<long>(1L))
                {
                    #pragma GCC ivdep
                    for(long x2=static_cast<long>(0L); x2<static_cast<long>(32L); x2+=static_cast<long>(1L))
                    {
                        for(long x3=static_cast<long>(0L); x3<static_cast<long>(16L); x3+=static_cast<long>(16L))
                        {
                            auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + static_cast<long>(x3 + (30L*x2) + (960L*x1) + (8847360L*x0)), 16);
                            auto tmp1 = out_ptr0[static_cast<long>(x2 + (32L*x0))];
                            auto tmp4 = out_ptr1[static_cast<long>(x2 + (32L*x0))];
                            auto tmp12 = at::vec::Vectorized<float>::loadu(in_ptr1 + static_cast<long>(x3 + (30L*x2)), 16);
                            auto tmp14 = at::vec::Vectorized<float>::loadu(in_ptr2 + static_cast<long>(x3 + (30L*x2)), 16);
                            auto tmp2 = at::vec::Vectorized<float>(tmp1);
                            auto tmp3 = tmp0 - tmp2;
                            auto tmp5 = static_cast<float>(276480.0);
                            auto tmp6 = tmp4 / tmp5;
                            auto tmp7 = static_cast<float>(1e-05);
                            auto tmp8 = decltype(tmp6)(tmp6 + tmp7);
                            auto tmp9 = 1 / std::sqrt(tmp8);
                            auto tmp10 = at::vec::Vectorized<float>(tmp9);
                            auto tmp11 = tmp3 * tmp10;
                            auto tmp13 = tmp11 * tmp12;
                            auto tmp15 = tmp13 + tmp14;
                            auto tmp16 = at::vec::clamp_min(tmp15, decltype(tmp15)(0));
                            tmp16.store(out_ptr2 + static_cast<long>(x3 + (30L*x2) + (960L*x1) + (8847360L*x0)));
                        }
                        for(long x3=static_cast<long>(16L); x3<static_cast<long>(30L); x3+=static_cast<long>(14L))
                        {
                            auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + static_cast<long>(x3 + (30L*x2) + (960L*x1) + (8847360L*x0)), 14);
                            auto tmp1 = out_ptr0[static_cast<long>(x2 + (32L*x0))];
                            auto tmp4 = out_ptr1[static_cast<long>(x2 + (32L*x0))];
                            auto tmp12 = at::vec::Vectorized<float>::loadu(in_ptr1 + static_cast<long>(x3 + (30L*x2)), 14);
                            auto tmp14 = at::vec::Vectorized<float>::loadu(in_ptr2 + static_cast<long>(x3 + (30L*x2)), 14);
                            auto tmp2 = at::vec::Vectorized<float>(tmp1);
                            auto tmp3 = tmp0 - tmp2;
                            auto tmp5 = static_cast<float>(276480.0);
                            auto tmp6 = tmp4 / tmp5;
                            auto tmp7 = static_cast<float>(1e-05);
                            auto tmp8 = decltype(tmp6)(tmp6 + tmp7);
                            auto tmp9 = 1 / std::sqrt(tmp8);
                            auto tmp10 = at::vec::Vectorized<float>(tmp9);
                            auto tmp11 = tmp3 * tmp10;
                            auto tmp13 = tmp11 * tmp12;
                            auto tmp15 = tmp13 + tmp14;
                            auto tmp16 = at::vec::clamp_min(tmp15, decltype(tmp15)(0));
                            tmp16.store(out_ptr2 + static_cast<long>(x3 + (30L*x2) + (960L*x1) + (8847360L*x0)), 14);
                        }
                    }
                }
            }
        }
    }
}
''')

async_compile.wait(globals())
del async_compile

def call(args):
    arg2_1, = args
    args.clear()
    assert_size_stride(arg2_1, (2, 960, 96, 96), (8847360, 1, 92160, 960))
    buf0 = empty_strided_cpu((2, 32, 1, 1), (32, 1, 64, 64), torch.float32)
    buf1 = empty_strided_cpu((2, 32, 1, 1), (32, 1, 64, 64), torch.float32)
    buf3 = empty_strided_cpu((2, 960, 96, 96), (8847360, 1, 92160, 960), torch.float32)
    cpp_fused_native_group_norm_relu_0(arg2_1, _frozen_param3, _frozen_param2, buf0, buf1, buf3)
    del arg2_1
    return (buf3, )
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132389
Approved by: https://github.com/leslie-fang-intel, https://github.com/jansel

Co-authored-by: Jiong Gong <jiong.gong@intel.com>
2024-09-02 00:28:34 +00:00
c140fa1426 Reorg cache code to make it simpler (#134911)
Summary:
Pull the big nested function out of the middle of cached_autotune() into its own class.

Also refactor creating the autotune cache itself out - which gets shared in the next diff.

Test Plan: unit tests

Differential Revision: D60677501

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134911
Approved by: https://github.com/oulgen
2024-09-02 00:27:40 +00:00
0cbcef12bd Stop adding useless prefix to error message here, you're pushing the important info off the screen. (#133108)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133108
Approved by: https://github.com/Skylion007
2024-09-01 23:11:17 +00:00
208442ea18 Don't setup try-except handler when Dynamo compiling (#133239)
The reraise is not supported and so this just gunks up our actual exception handling. You can trigger this by hitting an exception inside of an NN module that has hooks on it. You end up graph breaking on the reraise here, and losing the inner stack trace from the actual exception that was raised.

This might be kind of controversial.  An alternate strategy is to support reraises in Dynamo or something but IDK this doesn't feel like the right place to apply force.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133239
Approved by: https://github.com/anijain2305
2024-09-01 22:26:46 +00:00
ea01aec8b1 Move FunctionSchema implementations to cpp file (#133856)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133856
Approved by: https://github.com/bdhirsh, https://github.com/albanD
2024-09-01 19:50:35 +00:00
2dadc2c8fc Log fx graph cache bypass reasons (#134792)
Summary: Lets track when we bypass and why

Test Plan: unit tests

Differential Revision: D61994739

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134792
Approved by: https://github.com/jamesjwu
2024-09-01 19:02:09 +00:00
cyy
1595e755af [Reland] [Torchgen] Pass mutable to cpp.valuetype_type (#134549)
Reland of #121415

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134549
Approved by: https://github.com/ezyang
2024-09-01 15:15:38 +00:00
eqy
b1a00b7b6d Abate -Wsign-compare warning spam in Indexing.cu (#134805)
Fix for warning spam like
```
 warning: comparison of integer expressions of different signedness: ‘long int’ and ‘uint64_t’ {aka ‘long unsigned int’} [-Wsign-compare]
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134805
Approved by: https://github.com/janeyx99
2024-09-01 10:48:07 +00:00
cyy
d03f767cae Check function declarations of Vulkan code (#134550)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134550
Approved by: https://github.com/ezyang
2024-09-01 09:38:35 +00:00
c25b64a057 expose host_emptyCache to python, fix a bug in freeing cudaHostRegist… (#134919)
…ered memory

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134919
Approved by: https://github.com/eqy
2024-09-01 09:07:25 +00:00
caa04e0cae [ET] codegen: bool array as array ref (#134886)
Test Plan: CI

Differential Revision: D62046959

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134886
Approved by: https://github.com/larryliu0820
2024-09-01 01:33:43 +00:00
29b7852dc1 drop gil in couple places (leads to deadlocks) (#134910)
Per title

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134910
Approved by: https://github.com/eqy
2024-09-01 00:05:53 +00:00
7239b8a4f1 Clean up RemoteCache classes (#134032)
Summary:
The existing RemoteCacheBackend classes were a bit haphazard - some of them accepted bytes only, some accepted objects, some returned different types of objects than were passed in.

Update them to be more consistent:

1. RemoteCacheBackend is an implementation of a backend: Redis, Memcache, Manifold, LocalFile

2. RemoteCacheSerde is an implementation of a serde protocol - to turn structured objects (dict, list, etc) into bytes: RemoteCacheJsonSerde (json encoding), RemoteCachePassthroughSerde (strictly bytes only)

3. RemoteCache is the cache implementation itself, mixing a RemoteCacheBackend along with an RemoteCacheSerde to provide structured caching.

Other than simply reorganizing the existing cache code this also fixes the Redis autotune caching for OSS.

Test Plan: unit tests

Reviewed By: oulgen

Differential Revision: D61178859

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134032
Approved by: https://github.com/oulgen, https://github.com/bhack
2024-08-31 20:18:59 +00:00
590d96be64 [inductor] move test_fuse_large_params to slow test. (#134900)
Move `test_fuse_large_params` to slow test. This case spend about 1.5 minutes.

<img width="855" alt="image" src="https://github.com/user-attachments/assets/adf16dcf-d398-4d66-8dda-0c9cafc4e351">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134900
Approved by: https://github.com/jansel
2024-08-31 18:08:11 +00:00
f4641ca481 [Inductor] Remove VecChecker and fallback non-supported Vec op to Scalar impl with a for loop (#134569)
Fall back non-vectorized op by scalar impl + for loop.

Example code:
```
cpp_fused_igammac_0 = async_compile.cpp_pybinding(['const double*', 'const double*', 'double*'], '''
#include "/tmp/torchinductor_root/z4/cz4j2mmotlx3z2b7u4fbjtdt4x6plhd67ljwzg5bk7ekv4xz6y7q.h"
extern "C"  void kernel(const double* in_ptr0,
                       const double* in_ptr1,
                       double* out_ptr0)
{
    {
        for(int64_t x0=static_cast<int64_t>(0L); x0<static_cast<int64_t>(48L); x0+=static_cast<int64_t>(8L))
        {
            auto tmp0 = at::vec::VectorizedN<double,2>::loadu(in_ptr0 + static_cast<int64_t>(x0), 8);
            auto tmp1 = in_ptr1[static_cast<int64_t>(0L)];
            auto tmp2 = at::vec::VectorizedN<double,2>(tmp1);
            auto tmp3 =
            [&]()
            {
                __at_align__ std::array<double, 8> tmpbuf0;
                tmp0.store(tmpbuf0.data(), 8);
                __at_align__ std::array<double, 8> tmpbuf1;
                tmp2.store(tmpbuf1.data(), 8);
                __at_align__ std::array<double, 8> tmpbuf_out;
                for (int i = 0; i < 8; i++)
                {
                    tmpbuf_out[i] = calc_igammac(tmpbuf0[i], tmpbuf1[i]);
                }
                return at::vec::VectorizedN<double, 2>::loadu(tmpbuf_out.data(), 8);
            }
            ()
            ;
            tmp3.store(out_ptr0 + static_cast<int64_t>(x0), 8);
        }
        #pragma omp simd simdlen(4)
        for(int64_t x0=static_cast<int64_t>(48L); x0<static_cast<int64_t>(50L); x0+=static_cast<int64_t>(1L))
        {
            auto tmp0 = in_ptr0[static_cast<int64_t>(x0)];
            auto tmp1 = in_ptr1[static_cast<int64_t>(0L)];
            auto tmp2 = calc_igammac(tmp0, tmp1);
            out_ptr0[static_cast<int64_t>(x0)] = tmp2;
        }
    }
}
''')

```

`frexp` are difficult to be handled by common `fallback` since it returns two `cse_var` 2ba60a1618/torch/_inductor/codegen/cpp.py (L752-L766)
So we added a special function to do that.
```
cpp_fused_frexp_0 = async_compile.cpp_pybinding(['const double*', 'double*', 'int32_t*'], '''
#include "/tmp/torchinductor_root/z4/cz4j2mmotlx3z2b7u4fbjtdt4x6plhd67ljwzg5bk7ekv4xz6y7q.h"
extern "C"  void kernel(const double* in_ptr0,
                       double* out_ptr0,
                       int32_t* out_ptr1)
{
    {
        for(int64_t x0=static_cast<int64_t>(0L); x0<static_cast<int64_t>(16L); x0+=static_cast<int64_t>(8L))
        {
            auto tmp0 = at::vec::VectorizedN<double,2>::loadu(in_ptr0 + static_cast<int64_t>(x0), 8);
            at::vec::Vectorized<int32_t> tmp1;
            at::vec::VectorizedN<double, 2> tmp2;
            [&]()
            {
                __at_align__ std::array<double, 8> tmpbuf;
                tmp0.store(tmpbuf.data(), 8);
                __at_align__ std::array<int32_t, 8> tmpbuf_exponent;
                __at_align__ std::array<double, 8> tmpbuf_mantissa;
                for (int i = 0; i < 8; i++)
                {
                    tmpbuf_mantissa[i] = std::frexp(tmpbuf[i], &tmpbuf_exponent[i]);
                }
                tmp1 = at::vec::Vectorized<int32_t>::loadu(tmpbuf_exponent.data(), 8);
                tmp2 = at::vec::VectorizedN<double, 2>::loadu(tmpbuf_mantissa.data(), 8);
            }
            ();
            tmp2.store(out_ptr0 + static_cast<int64_t>(x0), 8);
            tmp1.store(out_ptr1 + static_cast<int64_t>(x0), 8);
        }
        #pragma omp simd simdlen(4)
        for(int64_t x0=static_cast<int64_t>(16L); x0<static_cast<int64_t>(20L); x0+=static_cast<int64_t>(1L))
        {
            auto tmp0 = in_ptr0[static_cast<int64_t>(x0)];
            int32_t tmp1;
            auto tmp2 = std::frexp(tmp0, &tmp1);
            out_ptr0[static_cast<int64_t>(x0)] = tmp2;
            out_ptr1[static_cast<int64_t>(x0)] = tmp1;
        }
    }
}
''')
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134569
Approved by: https://github.com/jgong5, https://github.com/jansel
2024-08-31 11:19:57 +00:00
16f119e62a Update compiled optimizer tests for tensor betas (#134169)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134169
Approved by: https://github.com/anijain2305, https://github.com/eellison
ghstack dependencies: #134166, #134167, #134168
2024-08-31 10:24:39 +00:00
4e71418566 [dynamo] rewrite addcmul_ to remove graph break (#134168)
Context: Adding support for the beta parameters to be tensors

Details: Similarly to the previous two PRs addcmul_ is used with the tensor betas as the value argument. When this occurs, an item() call is invoked in the aten op. To avoid this graph break, addcmul_ is decomposed into its constrituent ops to avoid this.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134168
Approved by: https://github.com/anijain2305
ghstack dependencies: #134166, #134167
2024-08-31 10:24:39 +00:00
3fb4c6bc38 [dynamo] Rewrite foreach pow to broadcast scalar argument (#134167)
Context: Adding support for the beta parameters to be tensors

Details:
In this PR similarly to the previous, foreach_pow calls item() on the first argument when it is a scalar tensor. In this case, we broadcast that scalar tensor into a list of aliases of that tensor to avoid the item() call, and this results in a device copy of the scalar tensor. Once again, I dont think we can change the foreach_pow API due to BC concerns, so this op rewrite allows us to avoid a graph break, generate semantically the same code, and not affect eager.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134167
Approved by: https://github.com/anijain2305
ghstack dependencies: #134166
2024-08-31 10:24:35 +00:00
471c33f007 [dynamo] Rewrite foreach_lerp to avoid aten item call (#134166)
Context: Adding support for the beta parameters to be tensors

Details:
In order to add support for the beta params to be tensors without graph breaks in the Adam family of optimizers it is necessary to support foreach_lerp(x, y, s) where s is a scalar tensor. Today, this isn't possible because when `s` is a scalar, internally the aten op calls item() on it to extract the value and distribute it to each of the ops on the individual list indices. To support this in dynamo without graph breaks, I decompose the lerp into its constituent ops which support a scalar tensor in the list argument positions which do not result in an item() call. To be clear the item() call is more performant for eager I think and for BC I don't think we can modify that API, so this allows us to have performance in eager and no graph breaks in compile.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134166
Approved by: https://github.com/anijain2305
2024-08-31 10:24:31 +00:00
eed0d76682 [dynamo][itertools] refactor itertools.islice to use polyfill (#133876)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133876
Approved by: https://github.com/jansel
ghstack dependencies: #133864, #133894
2024-08-31 10:08:07 +00:00
ec660c383e [dynamo] reduce overhead for PolyfilledFunctionVariable.call_function (#134842)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134842
Approved by: https://github.com/jansel
2024-08-31 09:12:46 +00:00
d9cc693719 [jit] Change argument names (#134828)
It seems like a bug.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134828
Approved by: https://github.com/janeyx99

Co-authored-by: Jane (Yuan) Xu <31798555+janeyx99@users.noreply.github.com>
2024-08-31 08:42:30 +00:00
136badae64 [inductor] preload icx built in math libs (#134870)
Intel Compiler implenmented more math libraries than clang, for performance proposal.
We need preload them like openmp library.

reproduce UT:
```cmd
pytest test/inductor/test_cpu_cpp_wrapper.py -v -k test_silu_cpu_dynamic_shapes_cpp_wrapper
```

Depends of module:
<img width="804" alt="Image" src="https://github.com/user-attachments/assets/9a672e03-ebf5-4ebb-b182-09180e6f7841">

Local test pass:
<img width="857" alt="image" src="https://github.com/user-attachments/assets/afbb8c1c-8fcc-4d64-a3ad-c8521b137d2d">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134870
Approved by: https://github.com/jansel
2024-08-31 04:50:31 +00:00
090d9cf410 [Dynamo][autograd.Function][vmap] support torch._C._are_functorch_transforms_active (#134889)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134889
Approved by: https://github.com/jansel
2024-08-31 04:39:09 +00:00
34b85d301f [executorch hash update] update the pinned executorch hash (#134894)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned executorch hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134894
Approved by: https://github.com/pytorchbot
2024-08-31 04:16:41 +00:00
64fad53b50 [Inductor] Support passing module map parameter to Triton make_ir API (#134774)
In https://github.com/triton-lang/triton/pull/4539 the `make_ir` API was modified to accept a new `module_map` parameter. Update the Inductor callsite accordingly, preserving backwards compatibility following the existing code.

Fixes #134674

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134774
Approved by: https://github.com/EikanWang, https://github.com/zou3519, https://github.com/jansel
2024-08-31 03:38:08 +00:00
aef5da50f4 Cleanup unused pytorch.version (#134381)
This file doesn't seem to be used anywhere? checking CI...
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134381
Approved by: https://github.com/zou3519
2024-08-31 02:50:05 +00:00
86e03a64e1 Revert "[Inductor] Allow customizing the padding format (#133939)"
This reverts commit 8b258b3b14408986a1d4142cff5a153c798ceecc.

Reverted https://github.com/pytorch/pytorch/pull/133939 on behalf of https://github.com/ZainRizvi due to sorry but this PR is causing issues with diff train imports reverting it for now but it can be merged back in as-is ([comment](https://github.com/pytorch/pytorch/pull/133939#issuecomment-2322635388))
2024-08-31 00:38:30 +00:00
f95085fd91 [BE][MPS] Prefer xfail to skip (#134858)
This essentially undoes large skips on everything but MacOS Sequoia to nn.modules made by https://github.com/pytorch/pytorch/pull/128393

Instead it uses existing `xfail`, but guards it on `_macos15_or_newer` boolean

Before the change if run on MacOS 14:
```
 % python3 ../test/test_modules.py -v -k Hardswish 2>&1|tail -n3
Ran 57 tests in 0.053s

OK (skipped=32)
```
After
```
% python3 ../test/test_modules.py -v -k Hardswish 2>&1|tail -n3
Ran 57 tests in 0.229s

OK (skipped=10, expected failures=2)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134858
Approved by: https://github.com/janeyx99
2024-08-31 00:29:48 +00:00
050ad925f3 [benchmark] Add to torchbench relative path search (#134871)
Add to relative path search in benchmark. This enables user to run `torchbench.py` inside the `pytorch/benchmark/dynamo` folder when `torchbench` repo is cloned in the same level as `pytorch`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134871
Approved by: https://github.com/FindHao
2024-08-31 00:28:22 +00:00
a854c3a25e [dynamo] refactor builtins.enumerate to use polyfill (#133894)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133894
Approved by: https://github.com/jansel
ghstack dependencies: #133864
2024-08-31 00:17:27 +00:00
ebbdeeede1 [dynamo][itertools] refactor itertools.chain and itertools.chain.from_iterable to use polyfills (#133864)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133864
Approved by: https://github.com/jansel
2024-08-31 00:11:54 +00:00
5dad6a5a84 [ONNX][DORT] Lazy-import onnxruntime (#134662)
Currently, if installed, `onnxruntime` will be imported when importing `torch._inductor` (which will be imported by some other library, e.g. transformer-engine):

```
  /mnt/c.py(53)<module>()
-> from torch._inductor.utils import maybe_profile
  /usr/local/lib/python3.10/site-packages/torch/_inductor/utils.py(49)<module>()
-> import torch._export
  /usr/local/lib/python3.10/site-packages/torch/_export/__init__.py(25)<module>()
-> import torch._dynamo
  /usr/local/lib/python3.10/site-packages/torch/_dynamo/__init__.py(2)<module>()
-> from . import convert_frame, eval_frame, resume_execution
  /usr/local/lib/python3.10/site-packages/torch/_dynamo/convert_frame.py(48)<module>()
-> from . import config, exc, trace_rules
  /usr/local/lib/python3.10/site-packages/torch/_dynamo/trace_rules.py(52)<module>()
-> from .variables import (
  /usr/local/lib/python3.10/site-packages/torch/_dynamo/variables/__init__.py(38)<module>()
-> from .higher_order_ops import (
  /usr/local/lib/python3.10/site-packages/torch/_dynamo/variables/higher_order_ops.py(14)<module>()
-> import torch.onnx.operators
  /usr/local/lib/python3.10/site-packages/torch/onnx/__init__.py(62)<module>()
-> from ._internal.onnxruntime import (
  /usr/local/lib/python3.10/site-packages/torch/onnx/_internal/onnxruntime.py(37)<module>()
-> import onnxruntime  # type: ignore[import]
```

This issue breaks generated triton kernel because it imported torch, and unexpected runtime libraries as well.

I've also added a test for this specific case under `test/onnx`, perhaps we should add more somewhere else?

Related issue: https://github.com/huggingface/accelerate/pull/3056
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134662
Approved by: https://github.com/justinchuby
2024-08-31 00:06:28 +00:00
2384f77d76 [XPU] Fix Windows XPU build (#134276)
Linker flag check doesn't work correctly with MSVC and linking torch_xpu with torch_cpu_library for windows MSVC works without any errors
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134276
Approved by: https://github.com/EikanWang, https://github.com/atalman
2024-08-30 23:51:40 +00:00
e688b78791 [Dynamo][autograd.Function] Trace fwd graph under no_grad mode (#134872)
Fixes #134820

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134872
Approved by: https://github.com/zou3519
2024-08-30 22:24:18 +00:00
8b258b3b14 [Inductor] Allow customizing the padding format (#133939)
Based on https://github.com/pytorch/pytorch/pull/130956.

Inductor already supports padding through the `config.comprehensive_padding` option, but the padding format involves a few heuristics that are specific to Nvidia GPUs:
  - When we pad, it is always aligned to the next multiple of 128 bytes.
  - Strides smaller than 1024 are not padded.
  - Only intermediate values are padded, not outputs.

 The last of these is not really GPU-specific, but there are certain cases where we may want to override it. For example, padding outputs is useful on hardware accelerators with specific memory alignment requirements, or for applications where performance is more important than conformity with eager mode.

 This PR surfaces padding parameters up to Inductor's config module, so the user can control them.
   - `config.pad_outputs`: choose whether to pad outputs (default: `False`)
   - `config.padding_alignment_bytes`: choose the alignment size for padding (default: `128`)
   - `config.padding_stride_threshold`:  choose the smallest stride that we will pad. For example, setting this to 0 will pad all unaligned strides. (default: `1024`)

 **Test plan**
 Added a new test in `test_padding.py` which tries various combinations of these options, checking that the output strides match our expectations.

  These changes should not affect perf, because the defaults are identical to Inductor's current behavior.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133939
Approved by: https://github.com/shunting314

Co-authored-by: Yueming Hao <yhao@meta.com>
2024-08-30 20:34:11 +00:00
a1ba8e61d1 Revert "[ROCm] remove triton-rocm commit pin and merge pins with triton.txt (#133438)"
This reverts commit 5e8bf29148a590318f678620f84be8f4d5ffff5c.

Reverted https://github.com/pytorch/pytorch/pull/133438 on behalf of https://github.com/ZainRizvi due to This still breaks linux binary builds. Added the appropriate labels to ensure tests can pass. See [GH job link](https://github.com/pytorch/pytorch/actions/runs/10626427003/job/29460479554) [HUD commit link](5e8bf29148) ([comment](https://github.com/pytorch/pytorch/pull/133438#issuecomment-2322246198))
2024-08-30 20:00:41 +00:00
f6398eb0fa dynamic shapes for combo_kenel/foreach_kernel (#134477)
This PR add dynamic shapes support to foreach and combo kernels for horizontal fusion.
A flag `combo_kernel_foreach_dynamic_shapes` (default False to avoid disturb production workflows) is added to _inductor/config.py. Setting it to True enables automatic dynamic shapes for foreach kernels. It is always enabled for combo kernels cases. Added unit cases.

This PR also fixes a flaky test case for [T198833257](https://www.internalfb.com/intern/tasks/?t=198833257)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134477
Approved by: https://github.com/mlazos
2024-08-30 19:58:20 +00:00
db17a9898d regenerate ci workflows for binary builds with new g4dn runners (#133404)
Fixes #103104

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133404
Approved by: https://github.com/ZainRizvi
2024-08-30 19:53:22 +00:00
98b813d0d4 Enable cudagraphs in cpp wrapper (#133885)
Fixes https://github.com/pytorch/pytorch/issues/130878

Summary: Enables cudagraphs in cpp wrapper by clearing inputs.

Generated, non-cpp wrapper code:
```python
def call(args):
    arg0_1, = args
    args.clear()
    assert_size_stride(arg0_1, (10, ), (1, ))
    with torch.cuda._DeviceGuard(0):
        torch.cuda.set_device(0)
        buf0 = empty_strided_cuda((10, ), (1, ), torch.float32)
        # Topologically Sorted Source Nodes: [sin], Original ATen: [aten.sin]
        stream0 = get_raw_stream(0)
        triton_poi_fused_sin_0.run(arg0_1, buf0, 10, grid=grid(10), stream=stream0)
        del arg0_1
    return (buf0, )
```
vs generated cpp wrapper code:
```python
def _wrap_func(f):
    def g(args):
        input_tensors = [arg if isinstance(arg, torch.Tensor) else torch.tensor(arg) for arg in args]
        input_handles = torch._C._aoti.unsafe_alloc_void_ptrs_from_tensors(input_tensors)
        # new:
        args.clear()
        # end new

        output_handles = f(input_handles)
        output_tensors = torch._C._aoti.alloc_tensors_by_stealing_from_void_ptrs(output_handles)
        return output_tensors

    return g

```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133885
Approved by: https://github.com/eellison, https://github.com/desertfire
2024-08-30 18:48:37 +00:00
bdfa94b787 [RFC] Make fr trace script a console scripts (#134729)
We want to make fr analyzer script a command after users `pip install torch`, that's why we want to mimic what torchrun is doing.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134729
Approved by: https://github.com/c-p-i-o, https://github.com/malfet
ghstack dependencies: #134528, #134780
2024-08-30 18:17:06 +00:00
a0d0c6b7e6 Used torch.equal in test_foreach_copy_with_multi_dtypes (#134861)
`self.assertEqual` allows some tolerance, but here, we want to show that `_foreach_copy_` gives bitwise equivalent results. Let us use `torch.equal` then.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134861
Approved by: https://github.com/Skylion007, https://github.com/janeyx99, https://github.com/crcrpar
2024-08-30 18:04:41 +00:00
1993a2aa9e [FR] Make pg_name unique, show P2P collective status and fix bugs when running the script as command (#134780)
Fixes a bunches of bugs in the script when running with the generated command and 3D parallel.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134780
Approved by: https://github.com/c-p-i-o
ghstack dependencies: #134528
2024-08-30 18:03:17 +00:00
15f5a4858b [inductor] enable Intel Compiler(icx-cl) for inductor windows (#134772)
This PR is enable Intel Compiler (`icx-cl`) for Windows inductor, likes previous PR: https://github.com/pytorch/pytorch/pull/134444 which enable clang.

Changes:
1. Fix icx-cl crash by wrong decode args, the right decode should be "utf-8".
2. Add intel compiler check, and intel compiler Windows drivers check(icx-cl).
3. Add Intel compiler openmp args config.
4. Add intel compiler openmp binary preload.

For intel compiler openmp binary path:
<img width="788" alt="image" src="https://github.com/user-attachments/assets/54c76356-018d-4bef-a9b7-0ea150fd7aba">

For performance, Intel compiler(`icx-cl`) is much better performance than MSVC(`cl`):
<img width="875" alt="image" src="https://github.com/user-attachments/assets/67865faf-b1de-4535-917a-486b72527204">

Append `clang-cl` performance data:
<img width="821" alt="image" src="https://github.com/user-attachments/assets/476f4568-bf58-457f-b73d-4e57f49be384">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134772
Approved by: https://github.com/jgong5, https://github.com/jansel
2024-08-30 17:51:46 +00:00
9e0ddc0e14 [inductor] don't allow triton config pre_hook (#134633)
The caching autotuner caches triton configs, and it doesn't try to hash or save the pre_hook from the config if it exists. If we had a config that had a pre_hook, then we might autotune -> save the config (without the pre_config) -> later load the saved config and try to run it, but this time without the pre_hook.

So this PR adds an assert and deletes the pre_hook handling. We can be confident that we didn't have functional pre_hooks, because the pre_hook handling tries to use `self.arg_name`, which doesn't exist.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134633
Approved by: https://github.com/shunting314, https://github.com/jansel
2024-08-30 17:39:37 +00:00
e21d7b77ce Update ForeachfuncInfo.sample_inputs_func to yield scalars & scalarlists that are more friendly to test_meta (#134552)
for `test_meta.py` to see more "PASSED" instead of "XFAIL".

`pytest test_meta.py -k "_foreach_"` ran 6400 test cases and:
- This PR: 4702 passed, 260 skipped, 73732 deselected, 1698 xfailed
- main (92c4771853892193d73d87bd60eca4dc7efc51d8): 3906 passed, 260 skipped, 73732 deselected, 2494 xfailed
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134552
Approved by: https://github.com/janeyx99
2024-08-30 17:30:50 +00:00
577a93514f [dynamo][dynamic][heuristic] Mark tuple getitem integers as static (#134734)
Fixes issue seen in https://github.com/pytorch/pytorch/issues/132872#issuecomment-2314574656

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134734
Approved by: https://github.com/jansel
ghstack dependencies: #134653, #134713
2024-08-30 17:06:57 +00:00
08184aa85c Add support for 32KB multi_tensor_apply kernel arguments (#134373)
## Benchmark

On H100 SXM (HBM2e, 500W TDP), CUDA Toolkit=12.2, Driver Version=535.154.05, with [this script](https://gist.github.com/yifuwang/178c1f4bf951c5794ea79c04d90e44fa) (`torch._foreach_copy_`):

**Baseline**
```
https://interncache-all.fbcdn.net/manifold/perfetto-artifacts/tree/ui/index.html#!/?url=https://interncache-all.fbcdn.net/manifold/perfetto_internal_traces/tree/shared_trace/yifu_tmp0g_x4sys
device ms: 0.891, cpu ms: 7.200
memory bandwidth: 1457.727 GB/s
```

Single iteration trace:
<img width="1432" alt="image" src="https://github.com/user-attachments/assets/8ef54365-0265-4281-a0f0-d4c2f448300e">

**This PR**
```
https://interncache-all.fbcdn.net/manifold/perfetto-artifacts/tree/ui/index.html#!/?url=https://interncache-all.fbcdn.net/manifold/perfetto_internal_traces/tree/shared_trace/yifu_tmp3jqiugli
device ms: 0.683, cpu ms: 6.745
memory bandwidth: 1902.010 GB/s
```

Single iteration trace:
<img width="1074" alt="image" src="https://github.com/user-attachments/assets/e52acad1-d09b-492c-9611-6d69e339f3ac">

## Binary Size and Kernel Specialization
The binary size for `libtorch_cuda.so` increased 6MB (243MB -> 249MB).

```
// NOTE: [32KB kernel argument size support]
// 32KB kernel argument size support has three requirements:
// - CUDART_VERSION >= 12010
// - Driver version >= 530
// - GPU arch >= VOLTA
//
// Due to minor version compatibility, it possible for binaries built with
// CUDART_VERSION >= 12010 to run with driver version < 530. Since driver
// version can only be checked at runtime, if CUDART_VERSION >= 12010, we have
// to build both 4KB and 32KB kernels and determine the appropriate kernel to
// dispatch at runtime.
//
// - If CUDART_VERSION < 12010, only 4KB kernels will be instantiated.
//
// - If CUDART_VERSION >= 12010:
//   - Host code:
//     - We always instantiate the launching stub for both 4KB and 32KB kernels.
//   - Device code:
//     - If __CUDA_ARCH__ >= 700, we always instantiate both 4KB and 32KB
//     kernels.
//     - If __CUDA_ARCH__ < 700, it's not possible to even compile an empty
//     32KB kernel (formal parameter space overflowed). Thus, we only
//     instantiate a declaration for 32KB kernels. This is valid as long as the
//     declaration-only kernel is not launched.
//
// - At runtime, we dispatch to the 32KB kernel if driver version >= 530 and
// GPU arch >= VOLTA.
//
// - TODO(yifu): once there's a CUDART version that is not compatible with any
// driver version below 530, we can determine at compile time to not compile
// the kernels for 4KB kernel argument size.
//
// https://developer.nvidia.com/blog/cuda-12-1-supports-large-kernel-parameters/
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134373
Approved by: https://github.com/eqy, https://github.com/crcrpar, https://github.com/janeyx99
2024-08-30 16:52:28 +00:00
a19a7524f6 [export] Make sure getitem replacement are synced with module call graph. (#134830)
Summary: When we are placing nodes in the graph, we should also replace the references in module_call_graph.

Test Plan:
buck2 run 'fbcode//mode/opt' torchrec/fb/ir/tests:test_serializer -- --filter-regex test_serialize_deserialize_vlea
buck2 test 'fbcode//mode/opt' fbcode//torchrec/fb/ir/tests:test_serializer -- --exact 'torchrec/fb/ir/tests:test_serializer - torchrec.fb.ir.tests.test_serializer.TestSerializer: test_serialize_empty_value_vlea' --run-disabled

buck2 test 'fbcode//mode/opt' fbcode//torchrec/fb/ir/tests:test_serializer -- --exact 'torchrec/fb/ir/tests:test_serializer - torchrec.fb.ir.tests.test_serializer.TestSerializer: test_deserialized_device_vle' --run-disabled

Differential Revision: D62014035

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134830
Approved by: https://github.com/angelayi
2024-08-30 16:47:05 +00:00
f5b0caee71 Rewrite unsafe_remove_auto_functionalized_pass using decompose_auto_functionalized (#134831)
`unsafe_remove_auto_functionalized_pass` can be written as using `decompose_auto_functionalized`, this way we do not have to update it each time we do a change to `auto_functionalize` (Ex https://github.com/pytorch/pytorch/pull/134409) , and we avoid duplicate logics implemented in two different ways.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134831
Approved by: https://github.com/zou3519
2024-08-30 16:27:53 +00:00
351ba3e67c Revert "[c10d] Remove Option for ProcessGroup and Expose backend Options to reflect the correct code structure (#132931)"
This reverts commit 65864d01341d006955579b145f78547314ceb14b.

Reverted https://github.com/pytorch/pytorch/pull/132931 on behalf of https://github.com/ZainRizvi due to This PR is breaking builds internally due to the removal of ProcessGroup::Options ([comment](https://github.com/pytorch/pytorch/pull/132931#issuecomment-2321862402))
2024-08-30 16:27:40 +00:00
994438040c Improvements for associative_scan - combine_mode (#133012)
This is part of a series of PRs to improve the functionality of the `associatve_scan` functionality. This specific PR introduces a `combine_mode`, which can be either `pointwise` (default) or `generic`. In case of `generic`, the `associative_scan` is more flexible and allows also to perform non-pointwise functions. This PR has been derived from https://github.com/pytorch/pytorch/pull/129307.

@ydwu4 @Chillee @zou3519

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133012
Approved by: https://github.com/ydwu4
2024-08-30 16:09:53 +00:00
c6ecf57dd2 Revert "[dynamo] simplify implementation for functools.reduce (#133778)"
This reverts commit b5f1ffa7ab0988184497788f2738e1769888ab7d.

Reverted https://github.com/pytorch/pytorch/pull/133778 on behalf of https://github.com/ZainRizvi due to This is still failing internally with the same error about 'Graph break due to unsupported builtin _functools.reduce' ([comment](https://github.com/pytorch/pytorch/pull/133778#issuecomment-2321787968))
2024-08-30 16:06:10 +00:00
7a85c488a8 Revert "[dynamo] simplify implementation for builtins.sum (#133779)"
This reverts commit eaa449fbf0fe528a0827ee9b5bcfcd307a7c658d.

Reverted https://github.com/pytorch/pytorch/pull/133779 on behalf of https://github.com/ZainRizvi due to This is still failing internally with the same error about 'Graph break due to unsupported builtin _functools.reduce' ([comment](https://github.com/pytorch/pytorch/pull/133778#issuecomment-2321787968))
2024-08-30 16:06:10 +00:00
1ad08c7a5b Revert "[dynamo][itertools] refactor itertools.chain and itertools.chain.from_iterable to use polyfills (#133864)"
This reverts commit 1b703669576223024eb84a76c53b7ec5ed8bb270.

Reverted https://github.com/pytorch/pytorch/pull/133864 on behalf of https://github.com/ZainRizvi due to This is still failing internally with the same error about 'Graph break due to unsupported builtin _functools.reduce' ([comment](https://github.com/pytorch/pytorch/pull/133778#issuecomment-2321787968))
2024-08-30 16:06:10 +00:00
8aa44e14cf Revert "[dynamo] refactor builtins.enumerate to use polyfill (#133894)"
This reverts commit a2566adfb6064235db6d950568435fb6ef58a598.

Reverted https://github.com/pytorch/pytorch/pull/133894 on behalf of https://github.com/ZainRizvi due to This is still failing internally with the same error about 'Graph break due to unsupported builtin _functools.reduce' ([comment](https://github.com/pytorch/pytorch/pull/133778#issuecomment-2321787968))
2024-08-30 16:06:09 +00:00
10c31e96df Revert "[dynamo][itertools] refactor itertools.islice to use polyfill (#133876)"
This reverts commit 7d12e6dceb94a221288f21c0e79ce8ca667d657a.

Reverted https://github.com/pytorch/pytorch/pull/133876 on behalf of https://github.com/ZainRizvi due to This is still failing internally with the same error about 'Graph break due to unsupported builtin _functools.reduce' ([comment](https://github.com/pytorch/pytorch/pull/133778#issuecomment-2321787968))
2024-08-30 16:06:09 +00:00
d261a1751a [HOP] fix export x inline_inbuilt_nn_modules (#133731)
TLDR; this PR supports exporting cond x inine_inbuilt nn modules flag by inling into tracing code in proxy_tensor.py _symbolic_trace.py (internally, the pattern is make_fx(record_module_stack)(torch.compile(f))).

We have two special treatments for following cases:

1. _ModuleStackTracer will wrap all the nn modules into _AttrProxy. This _AttrProxy has several subtiles which make it hard to inline in dynamo like overriding _modules with a property method and overrides the `__getattr__`,  which mutates captured states when calling `__getattr__`.

Solution to this is that we unwrap the _AttrProxy and get its corresponding nn_module (a 1-1 correspondence). So that dynamo symbolically traces the original nn module instead of tracing _AttrProxy.

2. The tracer applies a bunch of patches the `__getattr__` and `__call__` of nn.Module for tracking reasons. This doesn't work well with dynamo. The immediate error we see is `torch._dynamo.exc.Unsupported: 'inline in skipfiles: WeakKeyDictionary.__contains__ | __contains__ /home/yidi/.conda/envs/pytorch/lib/python3.10/weakref.py` caused by a weakdict in PythonKeyTracer.

Solution to this is that we remove the patches during dynamo symbolic convert temporally. So that dynamo has a clean environment. make_fx will be trace the transformed bytecode of dynamo and patches nn modules there instead.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133731
Approved by: https://github.com/anijain2305
ghstack dependencies: #134775
2024-08-30 15:58:20 +00:00
932c4ca5a0 make make_fx collective test single threaded (#134775)
make_fx is not thread-safe due to mutating and patching global states. It's difficult and low roi to make it thread-safe so just turn the tracing test into a single-thread test.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134775
Approved by: https://github.com/yifuwang
2024-08-30 15:58:20 +00:00
eqy
c07e566baf [CUDA][P2P] Check device capability in requires_cuda_p2p_access (#134523)
Tests seem to fail on e.g., Volta without this given the compile time meacros used e.g., in 79b7fff188/torch/csrc/distributed/c10d/intra_node_comm.cu (L487)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134523
Approved by: https://github.com/yifuwang, https://github.com/Skylion007
2024-08-30 14:08:55 +00:00
92f282ca52 Enable batch matmul for result sizes > 2**32 the tensor can be split along batch axis (#133430)
Fixes #131865. Addresses the issue seen when running llama v3.1 8B parameter model on MPS backend where the batch matmul output size can go over the 32-bit indexing limit of MPS tensors, causing an assert.

Test case to reproduce the issue with the dimensions encountered in llama v3.1 and verify this fix works around it:

```
import torch
device='mps'
a = torch.randn([32, 20064, 128], dtype=torch.float32,device=device)
b = torch.randn([32, 128, 20064], dtype=torch.float32, device=device)
res = torch.bmm(a, b)
```

Notably the current change only works as long as the individual output matrix in the bmm does not exceed the number of elements 2**32. This lets us split up the computation along the batch axis to avoid going over the limit.

Added a TORCH_CHECK to raise an error if the individual matrix dimensions are too large to handle for this op until a more general workaround tiling the matmuls is available.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133430
Approved by: https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2024-08-30 14:08:43 +00:00
50efbb9f1e [DeviceMesh][Test] Add a unit test for get_local_rank for flattened mesh (#134603)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134603
Approved by: https://github.com/fduwjj
ghstack dependencies: #133838, #133839, #134048
2024-08-30 08:13:37 +00:00
0f8bec4399 [dynamo] mark_static_nn_module (#134713)
Fixes issue seen in https://github.com/pytorch/pytorch/issues/132872#issuecomment-2314574656

With this API, we can mark the offending module as static in detectron2.

Today's world - Consider user defined nn module int attributes automatic dynamic. Use the API in this PR to make them static if you want.

Alternative work - Consider all int attributes of any user defined nn module class static. And then introduce an API - `torch._dynamo.mark_nn_module_attribute_dynamic`. The default being static is worrying if users have `counter` in their model which is updated in each forward invocation.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134713
Approved by: https://github.com/jansel
ghstack dependencies: #134653
2024-08-30 07:01:06 +00:00
a5630239ad [dynamo] Improve minifier error message when fp64 not supported (#134737)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134737
Approved by: https://github.com/anijain2305
2024-08-30 06:42:32 +00:00
1011e0ae98 Generalize devices specific UTs for dynamo (#130714)
## Motivation
This is follow up to PR:https://github.com/pytorch/pytorch/pull/126970, adding facility to run content for Intel Gaudi devices.
We intend to extend similar generalization for the rest of the content in test/dynamo  which is currently being written to work specifically for cuda devices. Other devices can add onto it if support is available.

## Changes
 carve out bert related content to another class
 use instantiate_device_type utility to instantiate this class for devices which support the functionality

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130714
Approved by: https://github.com/anijain2305
2024-08-30 05:02:47 +00:00
7a694f6683 [justknobs] Override __bool__ method (#134799)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134799
Approved by: https://github.com/ezyang
2024-08-30 04:54:02 +00:00
75b86b1554 [executorch hash update] update the pinned executorch hash (#134736)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned executorch hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134736
Approved by: https://github.com/pytorchbot
2024-08-30 04:11:51 +00:00
5e8bf29148 [ROCm] remove triton-rocm commit pin and merge pins with triton.txt (#133438)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133438
Approved by: https://github.com/jithunnair-amd, https://github.com/malfet

Co-authored-by: Jithun Nair <37884920+jithunnair-amd@users.noreply.github.com>
2024-08-30 03:38:35 +00:00
1f1e2eeb9d [inductor] Install tlparse for test\dynamo\test_structured_trace.py UTs. (#134806)
Install tlparse for test\dynamo\test_structured_trace.py UTs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134806
Approved by: https://github.com/ezyang
2024-08-30 03:16:03 +00:00
0d5f978795 add basic nn modules diff time benchmarks (#134658)
benchmarks several shapes of basic nn modules. in both eager and inductor

```
collecting compile time instruction count for basic_modules_ListOfLinears_inductor
compile time instruction count for iteration 0 is 48602516013
compile time instruction count for iteration 1 is 20424350269
compile time instruction count for iteration 2 is 20440350455
compile time instruction count for iteration 3 is 20419269999
compile time instruction count for iteration 4 is 20430782200
compile time instruction count for iteration 5 is 20455049622
compile time instruction count for iteration 6 is 20157290712
compile time instruction count for iteration 7 is 20455324001
compile time instruction count for iteration 8 is 20450158317
compile time instruction count for iteration 9 is 20492987748
collecting compile time instruction count for basic_modules_ListOfLinears_eager
compile time instruction count for iteration 0 is 961328334
compile time instruction count for iteration 1 is 958887896
compile time instruction count for iteration 2 is 958792214
compile time instruction count for iteration 3 is 958375977
compile time instruction count for iteration 4 is 958568525
compile time instruction count for iteration 5 is 958152305
compile time instruction count for iteration 6 is 959322800
compile time instruction count for iteration 7 is 958332703
compile time instruction count for iteration 8 is 958092100
compile time instruction count for iteration 9 is 958095277
collecting compile time instruction count for basic_modules_ModuleForwardHasGraphBreak_inductor
compile time instruction count for iteration 0 is 3572145793
compile time instruction count for iteration 1 is 3503323973
compile time instruction count for iteration 2 is 3501962432
compile time instruction count for iteration 3 is 3501746084
compile time instruction count for iteration 4 is 3500687361
compile time instruction count for iteration 5 is 3822254676
compile time instruction count for iteration 6 is 3498356846
compile time instruction count for iteration 7 is 3499019157
compile time instruction count for iteration 8 is 3500780314
compile time instruction count for iteration 9 is 3500257458
collecting compile time instruction count for basic_modules_ModuleForwardHasGraphBreak_eager
compile time instruction count for iteration 0 is 1844838754
compile time instruction count for iteration 1 is 1843476862
compile time instruction count for iteration 2 is 1844761450
compile time instruction count for iteration 3 is 1845371742
compile time instruction count for iteration 4 is 1845159665
compile time instruction count for iteration 5 is 1845035802
compile time instruction count for iteration 6 is 1844895007
compile time instruction count for iteration 7 is 1844697922
compile time instruction count for iteration 8 is 1844780885
compile time instruction count for iteration 9 is 1844493990
collecting compile time instruction count for basic_modules_SequentialWithDuplicatedModule_inductor
compile time instruction count for iteration 0 is 1597839479
compile time instruction count for iteration 1 is 1348225351
compile time instruction count for iteration 2 is 1347340818
compile time instruction count for iteration 3 is 1348170800
compile time instruction count for iteration 4 is 1348637747
compile time instruction count for iteration 5 is 1678366444
compile time instruction count for iteration 6 is 1348412420
compile time instruction count for iteration 7 is 1348461578
compile time instruction count for iteration 8 is 1347420149
compile time instruction count for iteration 9 is 1349748195
collecting compile time instruction count for basic_modules_SequentialWithDuplicatedModule_eager
compile time instruction count for iteration 0 is 137721777
compile time instruction count for iteration 1 is 139065517
compile time instruction count for iteration 2 is 137130552
compile time instruction count for iteration 3 is 137506030
compile time instruction count for iteration 4 is 137089838
compile time instruction count for iteration 5 is 137477395
compile time instruction count for iteration 6 is 138550452
compile time instruction count for iteration 7 is 137568409
compile time instruction count for iteration 8 is 136968468
compile time instruction count for iteration 9 is 137481664
collecting compile time instruction count for basic_modules_ModuleComparison_inductor
compile time instruction count for iteration 0 is 917209684
compile time instruction count for iteration 1 is 899154426
compile time instruction count for iteration 2 is 898145079
compile time instruction count for iteration 3 is 899817018
compile time instruction count for iteration 4 is 899184687
compile time instruction count for iteration 5 is 898172885
compile time instruction count for iteration 6 is 899958951
compile time instruction count for iteration 7 is 899348186
compile time instruction count for iteration 8 is 897745404
compile time instruction count for iteration 9 is 899581123
collecting compile time instruction count for basic_modules_ModuleComparison_eager
compile time instruction count for iteration 0 is 113165302
compile time instruction count for iteration 1 is 112724376
compile time instruction count for iteration 2 is 112774611
compile time instruction count for iteration 3 is 114465211
compile time instruction count for iteration 4 is 112689572
compile time instruction count for iteration 5 is 112726465
compile time instruction count for iteration 6 is 112853691
compile time instruction count for iteration 7 is 112295238
compile time instruction count for iteration 8 is 114022136
compile time instruction count for iteration 9 is 112664932
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134658
Approved by: https://github.com/anijain2305
ghstack dependencies: #133834, #134635, #134649, #134652
2024-08-30 02:13:52 +00:00
a645a18d2e [reland][dtensor][MTPG] make sharding prop lru cache not shared among threads (#134509)
**Summary**
reland of https://github.com/pytorch/pytorch/pull/134294

Fixes #131446
Fixes #126852
Fixes #126868
Fixes #126493

The PR was reverted due to CI red signal in https://github.com/pytorch/pytorch/actions/runs/10537099590/job/29201744658. It seems that the `gaussian_nll_loss` test had been flaky before my original PR #134294 . Therefore this PR also removes the `xfail` mark on this specific test to make CI signal green.

See the error message below:
```
2024-08-24T13:42:01.3228990Z ==================================== RERUNS ====================================
2024-08-24T13:42:01.3229530Z _ TestDTensorOpsCPU.test_dtensor_op_db_nn_functional_gaussian_nll_loss_cpu_float32 _
2024-08-24T13:42:01.3229710Z Unexpected success
2024-08-24T13:42:01.3230235Z _ TestDTensorOpsCPU.test_dtensor_op_db_nn_functional_gaussian_nll_loss_cpu_float32 _
2024-08-24T13:42:01.3230407Z Unexpected success
2024-08-24T13:42:01.3230594Z =================================== FAILURES ===================================
2024-08-24T13:42:01.3231128Z _ TestDTensorOpsCPU.test_dtensor_op_db_nn_functional_gaussian_nll_loss_cpu_float32 _
2024-08-24T13:42:01.3231296Z Unexpected success
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134509
Approved by: https://github.com/tianyu-l, https://github.com/wz337
2024-08-30 02:13:45 +00:00
27ffa67984 Support __class__ attr for tuple and list variables (#134099)
Fixes #134086

This supports __class__ attribute for TupleVariable and ListVariable. And allows to construct a tuple or list by using __class__ attribute. This patch also fix a bug in NamedTupleVariable which misses a return on calling super var_getattr.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134099
Approved by: https://github.com/anijain2305, https://github.com/jansel
2024-08-30 01:57:49 +00:00
cf11fc0dcb dynamo: Only log if we've disabled eval_frame once. (#134529)
This spams logs pretty badly otherwise

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134529
Approved by: https://github.com/chuanhaozhuge, https://github.com/oulgen
2024-08-30 00:35:25 +00:00
8b68912dfc Correctly detect "Rate limit exceeded" error (#134785)
Currently all 403 errors are treated as "Rate limit exceeded":
https://github.com/pytorch/pytorch/actions/runs/10622019167/job/29445336924

[Github docs](https://docs.github.com/en/rest/using-the-rest-api/rate-limits-for-the-rest-api?apiVersion=2022-11-28#exceeding-the-rate-limit) claim:
> If you exceed your primary rate limit, you will receive a 403 or 429 response, and the x-ratelimit-remaining header will be 0. You should not retry your request until after the time specified by the x-ratelimit-reset header.

After this change:
https://github.com/pytorch/pytorch/actions/runs/10622365327/job/29446456395

Note, the 403 error in the jobs above is a separate issue, this PR addresses only the logging.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134785
Approved by: https://github.com/clee2000
2024-08-29 23:58:15 +00:00
3402a5d865 fix windows xpu build issue (#133845)
# Motivation
If build XPU via oneAPI 2024.2, it will fail because `sycl-preview.lib` exists in windows. And linking the unexpected lib results in `error LNK2019: unresolved external symbol`.

# Solution
Use explicitly `sycl-preview` in linux build only.

# Additional Context
For `find_library`, please note that the variable will not be updated if it has been stored.
```
If the library is found the result is stored in the variable and the search will not be repeated unless the variable is cleared.
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133845
Approved by: https://github.com/min-jean-cho, https://github.com/EikanWang, https://github.com/atalman, https://github.com/malfet
2024-08-29 23:53:32 +00:00
3775fc982d [Inductor][CPP] Fix Index name error (#134645)
**Summary**

Fix the comment: https://github.com/pytorch/pytorch/pull/122961#issuecomment-2313930242. For all of the cases we see in the 3 test suits (TorchBench, Timms, Huggingface) we expect:

* `_node` is a FX Node with target in ["index_expr", "load", "store"]
* `_node.args[1 if _node.target == "index_expr" else 2]` is another FX node with target `get_index`
* `_node.args[1 if _node.target == "index_expr" else 2].args[0]` is a str for the name of this index expression

It seems not true in some FB internal testcase from the failure log posted in above link. So, add the condition check to work around it.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134645
Approved by: https://github.com/jgong5, https://github.com/masnesral
2024-08-29 23:33:15 +00:00
d13ce2e2b5 [c10d] release gil lock during eager init (#134779)
Summary:
We found that if we init the pG in a background thread, it would block
the main thread till init is complete. This is because in the pybinding
we never release the GIL lock
Test Plan:
existing CI on eager init

Tags:

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134779
Approved by: https://github.com/c-p-i-o
2024-08-29 23:25:33 +00:00
71ff168dbb pytorch: llvm_codegen: prefix JIT generated functions with 8B of data so jitted code can be called from ASAN+UBSAN on LLVM17 (llvm/llvm-project#65253) (#134572)
Summary:
Similar workaround was already applied elsewhere in pytorch https://github.com/pytorch/pytorch/pull/133623 {D61348865}

LLVM17 UBSAN change discussion https://github.com/llvm/llvm-project/issues/104505

Here we also have to associate the data with the function with `setPrefixData(dummyPrefixData)` to prevent this workaround being disabled by the `optimize(*module_);` call which  could change layout/remove the unused variable/etc.

Differential Revision: D61845799

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134572
Approved by: https://github.com/atalman
2024-08-29 23:15:13 +00:00
496e57283d add add_loop benchmarks (#134652)
This benchmark measure the cost of compiling the following function in eager and inductor
its basically two benchmarks.

```
        @torch.compile(backend=self.backend, fullgraph=True)
        def f(a, b):
            result = a.clone()
            for i in range(1000):
                if i % 3 == 0:
                    result = result + b
                elif i % 3 == 1:
                    result = result + 8 * b
                else:
                    result = result.sin()
            return result
```

 PYTHONPATH=$(pwd) python benchmarks/add_loop.py out
 ```
collecting compile time instruction count for add_loop_eager
compile time instruction count for iteration 0 is 8286649663
compile time instruction count for iteration 1 is 2838971338
compile time instruction count for iteration 2 is 2834263023
compile time instruction count for iteration 3 is 2829447493
compile time instruction count for iteration 4 is 2830904231
compile time instruction count for iteration 5 is 2830281077
compile time instruction count for iteration 6 is 2831466595
compile time instruction count for iteration 7 is 2830732164
compile time instruction count for iteration 8 is 2831088056
compile time instruction count for iteration 9 is 2831204407

collecting compile time instruction count for add_loop_inductor
compile time instruction count for iteration 0 is 32585687849
compile time instruction count for iteration 1 is 11747553436
compile time instruction count for iteration 2 is 11746959875
compile time instruction count for iteration 3 is 11749479461
compile time instruction count for iteration 4 is 11750053711
compile time instruction count for iteration 5 is 11750793958
compile time instruction count for iteration 6 is 11751673576
compile time instruction count for iteration 7 is 11754552912
compile time instruction count for iteration 8 is 11753723127
compile time instruction count for iteration 9 is 11759059942
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134652
Approved by: https://github.com/anijain2305
ghstack dependencies: #133834, #134635, #134649
2024-08-29 23:04:01 +00:00
65864d0134 [c10d] Remove Option for ProcessGroup and Expose backend Options to reflect the correct code structure (#132931)
We introduced the dispatchable backend for a ProcessGroup and collective in https://github.com/pytorch/pytorch/issues/86225. This PR is a follow-up cleanup to clean up the option of a ProcessGroup and ask users to either set timeout or backend later on or directly create backend after creating a PG.

Also PGNCCL is using option class from ProcessGroup but we actually should use Option from backend class. So this PR is to make the type or name to be aligned with what we are doing in cpp side. I don't change the signature for the public API, so they still use args named "pg_options"

We need to make changes to the test to make it aligned with the change.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132931
Approved by: https://github.com/H-Huang
2024-08-29 22:40:12 +00:00
8b4c487581 Fix AOTInductor complication on ROCM (#134522)
Summary:
Original PR (https://github.com/pytorch/pytorch/pull/124123) is broken by cpp_builder refactoring

So resubmit it to fix

Test Plan: Test with command here: https://www.internalfb.com/phabricator/paste/view/P1549765548

Differential Revision: D61827208

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134522
Approved by: https://github.com/frank-wei
2024-08-29 21:59:04 +00:00
1e92d7b688 [inductor] move loop ordering after fusion (#126254)
Restart the work from PR https://github.com/pytorch/pytorch/pull/100331 in this new PR since it's hard to rebase. It would be expected that some code is copy/pasted from the previous PR and main idea is the same.

Previously we see relatively large compilation time increase due to too many loop orders being considered. This PR tries to continue the work by doing pruning and only considering loop orders that we know for sure are relevant (i.e. do it on demand).

Some manually created cases that loop ordering matters are added as unit tests. The PR can make sure inductor does not miss fusion opportunities for them.

This PR should solve the not-able to fusion problem in https://github.com/pytorch/pytorch/issues/130015

Right now there is still significant increase of compilation time. I'll disable the feature by default. Later on after the compilation time issue is resolved, I'll enable it  by default.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126254
Approved by: https://github.com/jansel
2024-08-29 21:50:07 +00:00
416a7894fe [Windows][XPU] Disable Kineto PTI on Windows only (#134620)
Disable Kineto + XPU PTI on Windows only.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134620
Approved by: https://github.com/guangyey, https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2024-08-29 20:58:55 +00:00
7d12e6dceb [dynamo][itertools] refactor itertools.islice to use polyfill (#133876)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133876
Approved by: https://github.com/jansel
ghstack dependencies: #133769, #133778, #133779, #133864, #133894
2024-08-29 20:56:16 +00:00
a2566adfb6 [dynamo] refactor builtins.enumerate to use polyfill (#133894)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133894
Approved by: https://github.com/jansel
ghstack dependencies: #133769, #133778, #133779, #133864
2024-08-29 20:56:16 +00:00
1b70366957 [dynamo][itertools] refactor itertools.chain and itertools.chain.from_iterable to use polyfills (#133864)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133864
Approved by: https://github.com/jansel
ghstack dependencies: #133769, #133778, #133779
2024-08-29 20:56:16 +00:00
eaa449fbf0 [dynamo] simplify implementation for builtins.sum (#133779)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133779
Approved by: https://github.com/jansel, https://github.com/anijain2305
ghstack dependencies: #133769, #133778
2024-08-29 20:56:16 +00:00
b5f1ffa7ab [dynamo] simplify implementation for functools.reduce (#133778)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133778
Approved by: https://github.com/jansel, https://github.com/anijain2305
ghstack dependencies: #133769
2024-08-29 20:56:16 +00:00
e09324e7da [dynamo] simplify polyfill registration for builtins.all and builtins.any (#133769)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133769
Approved by: https://github.com/jansel
2024-08-29 20:56:16 +00:00
b977abd5de [Inductor] Fix error checking for scaled_mm lowering (#134765)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134765
Approved by: https://github.com/Skylion007
2024-08-29 20:18:42 +00:00
6180574771 Move py 3.8->3.9 pull, trunk, inductor, prerioric CI tests (#133624)
Part of Deprecation of python 3.8 and moving to 3.9. Related to: https://github.com/pytorch/pytorch/issues/120718
Except XPU and ROCM jobs

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133624
Approved by: https://github.com/Skylion007, https://github.com/malfet, https://github.com/ZainRizvi
2024-08-29 19:15:59 +00:00
202e5cc87d [inductor] Fix error in debug_str_extra (#134747)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134747
Approved by: https://github.com/Skylion007, https://github.com/shunting314
2024-08-29 19:09:50 +00:00
43e1df64f8 register all entry_point backends on first attempt (#132546)
fixes: https://github.com/pytorch/pytorch/issues/131360

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132546
Approved by: https://github.com/jansel
2024-08-29 18:59:29 +00:00
5470fcd5b9 [5/N] Reconcile barrier and NaN checker (#134707)
By using a zeros() tensor instead of empty() tensor.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134707
Approved by: https://github.com/shuqiangzhang, https://github.com/wconstab
ghstack dependencies: #134345, #134357, #134701
2024-08-29 18:51:12 +00:00
d91b49dbaa expandable_segments <-> other allocator options (#134338)
Previously setting  garbage_collection_threshold or max_split_size_mb along with expandable_segments:True could cause the allocator to hit assert failures when running nearly out of memory. This PR ensures garbage_collection and max_split freeing do not accidentally try to release expandable segments.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134338
Approved by: https://github.com/ezyang
2024-08-29 18:43:59 +00:00
3fc6e47d42 [AOTI] Fix cosmetic indentation issue in cuda cpp wrapper codegen for DeferredCudaKernelLine/GridLine (#134705)
Summary:
Follow up fix for D61018114, D61800622

Increase indentation for `loadKernel` `launchKernel` and `Grid` lines.

Test Plan:
```
TORCH_LOGS="+graph, inductor, +schedule, output_code" buck2 run -c fbcode.enable_gpu_sections=true -c fbcode.nvcc_arch=h100 @//mode/opt fbcode//caffe2/test/inductor:test_aot_inductor -- -r test_zero_grid_with_unbacked_symbols_abi_compatible_cuda
```
```
TORCH_LOGS="+graph, inductor, +schedule, output_code" buck2 run -c fbcode.enable_gpu_sections=true -c fbcode.nvcc_arch=h100 @//mode/opt fbcode//caffe2/test/inductor:test_aot_inductor -- -r test_zero_grid_with_backed_symbols_abi_compatible_cuda
```

Differential Revision: D61927248

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134705
Approved by: https://github.com/ColinPeppler
2024-08-29 18:38:45 +00:00
5573c17877 [BE][Ez]: Update ruff to 0.6.3 (#134769)
Mostly bugfix release, updating because it fixes an edgecase with a rule we are using

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134769
Approved by: https://github.com/albanD
2024-08-29 18:35:47 +00:00
ce96146623 [PT2] Fix node metadata setting in group_batch_fusion_aten (#134543)
Summary: Current impl results in `meta` missing fields like`val`, use `FakeTensorProp` to update the information

Differential Revision: D61832932

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134543
Approved by: https://github.com/frank-wei
2024-08-29 18:32:04 +00:00
348d02a983 Changed masked out rows logsumexp to be -inf and not zero (#134650)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134650
Approved by: https://github.com/yanboliang, https://github.com/BoyuanFeng, https://github.com/drisspg
2024-08-29 17:22:52 +00:00
36a6516290 [export] use single FQN for param_buffer_mapping (#134500)
Fixes #133252

In strict mode, we have this routine for mapping traced parameters to their FQNs using tensor ids. Currently we assume there's at least 1 unique FQN for each traced parameter, but this seems to break with parameter reuse when call_module nodes are present. Adding a test case where this breaks.

Fixes this by assigning the same FQN to all traced parameters with the same tensor id. This is fine because we return the original state_dict for the EP, and the unflattener has its own routine of handling aliasing: https://github.com/pytorch/pytorch/pull/125758
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134500
Approved by: https://github.com/angelayi
2024-08-29 17:06:31 +00:00
d9d95dc55e [4/N] Test NaN checker against broadcast (#134701)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134701
Approved by: https://github.com/wconstab
ghstack dependencies: #134345, #134357
2024-08-29 17:00:07 +00:00
ab646cd805 Revert "[reland][dtensor][MTPG] make sharding prop lru cache not shared among threads (#134509)"
This reverts commit ba5aec88c678fe4b9ad101602c29726724f56e21.

Reverted https://github.com/pytorch/pytorch/pull/134509 on behalf of https://github.com/ZainRizvi due to Sorry but this fails internally. For details see D61953754 ([comment](https://github.com/pytorch/pytorch/pull/134509#issuecomment-2318323161))
2024-08-29 16:39:19 +00:00
26aea277f7 [3/N] Set correct device to CUDA guards (#134357)
In `collective()`, `pointToPoint()` and `collectiveCoalesced()`, CUDA guards were created with an unset (default) CUDA device. This is the reason for the IMA facing the NaN checker in issue https://github.com/pytorch/pytorch/issues/134062.

With this fix, `torch.cuda.set_device(device)` is not needed to work around the IMA.

Also refactored a couple places where the guard is created -- preferably we create the guard with a known device, rather than setting the device later.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134357
Approved by: https://github.com/wconstab, https://github.com/shuqiangzhang
ghstack dependencies: #134345
2024-08-29 16:25:27 +00:00
d503217ea4 [inductor] calibration inductor windows uts (15/N) (#134586)
Fix `test_logs_out` UT on Windows. make `test/dynamo/test_logging.py` all UTs pass on Windows.

Changes:
1. Close `NamedTemporaryFile` to release file handle to avoid PermissionError issue.
2. `PermissionError` setup as `delete=False`, let file not be auto deleted.
3. Open log file as "utf-8" to align with Linux.
4. Process wrap difference for Windows.
5. Delete tmp file manually.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134586
Approved by: https://github.com/jansel
2024-08-29 16:18:40 +00:00
9953f55f4c [2/N] Add flag to control which rank should perform NaN check (#134345)
Fixes https://github.com/pytorch/pytorch/issues/134062.
For example, in case of broadcast / scatter, only the root rank should perform the NaN check.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134345
Approved by: https://github.com/shuqiangzhang, https://github.com/wconstab
2024-08-29 16:13:15 +00:00
387d3fc296 [AOTI] Switch benchmarking to use export non-strict mode (#130977)
Summary: Switch the export part used by AOTInductor benchmarking from strict to non-strict, and switch it from producing torch IR to aten IR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130977
Approved by: https://github.com/angelayi
ghstack dependencies: #134639
2024-08-29 16:08:52 +00:00
0dbc72887b [CPU][flash attention] make the stride of output align with input (#134656)
Fixes #133671

Currently, the output of CPU flash attention has a fixed layout, no matter what the input is. This PR makes the stride of output align with input q/k/v, which is the same behavior as math backend.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134656
Approved by: https://github.com/jgong5, https://github.com/drisspg
2024-08-29 16:04:25 +00:00
4fcd15a667 Fix test_sgd_weight_decay_xpu accuracy error (#134744)
Fixes #134743

This PR adds `test_sgd_weight_decay_xpu` in `KERNEL_COUNT_OVERRIDES` to override.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134744
Approved by: https://github.com/EikanWang, https://github.com/desertfire
2024-08-29 15:12:40 +00:00
594162f7ab [dynamo] Support reading attributes from pybind objects (#134630)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134630
Approved by: https://github.com/jansel
2024-08-29 15:06:52 +00:00
92e38a476f preserve aten::to device in export training (#134622)
Summary:
With training IR, we cannot rely on trapping `to()` in `FunctionalTensor` because the regular decomposition kicks it first, and that can cause it to be optimized away.

So instead we preserve it until we functionalize, and then replace it explicitly with `_to_copy()`.

Test Plan: expected test failures go away

Differential Revision: D61883878

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134622
Approved by: https://github.com/zhxchen17, https://github.com/tugsbayasgalan
2024-08-29 14:53:30 +00:00
092349dcdd Never CSE aten.empty in the partitioner (#134703)
aten.empty is almost always fusible into its consumer, so we never CSE
it. This fixes a bug that looks like the following:

```py
@torch.library.custom_op("_reinplacing::sin_cos", mutates_args={"out_sin", "out_cos"})
def sin_cos(x: torch.Tensor, out_sin: torch.Tensor, out_cos: torch.Tensor) -> None:
    out_sin.copy_(x.sin())
    out_cos.copy_(x.cos())

@torch.compile
def f(x):
    out0 = torch.empty_like(x)
    out1 = torch.empty_like(x)
    sin_cos(x, out0, out1)
    return x.clone(), out0, out1

x = torch.randn(3, requires_grad=True)
f(x)
```

- cse would de-duplicate the empty nodes
- reinplacing would add an additional clone (because it can't write to
  both tensors at the same time)
- the clone lowers into a new buffer + a copy_ kernel
- the copy_ kernel is unnecessary because "empty" is special - all reinplacing needed was an additional
  buffer, it doesn't matter what the values are.

We could attempt to fix this on the reinplacing side but this seemed
better as a partitioner heuristic and the reinplacing fix is a bit more
tricky (we'd need to identify that the op never reads from the empty
node).

Test Plan:
- new test (the old number was 27, the new number is 21, so this PR
  helped).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134703
Approved by: https://github.com/yf225
ghstack dependencies: #134466, #134490, #134491
2024-08-29 13:51:19 +00:00
70853b792a [dynamo][itertools] support itertools.tee (#133771)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133771
Approved by: https://github.com/jansel
ghstack dependencies: #133801
2024-08-29 13:36:52 +00:00
9e806c1a60 [dynamo] simplify implementation for os.fspath (#133801)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133801
Approved by: https://github.com/anijain2305
2024-08-29 13:36:52 +00:00
d01a7a9faa [dynamo] Graph break on FSDP flat_param inconsistent tensor and grad dtype (#134614)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134614
Approved by: https://github.com/awgu, https://github.com/yf225
ghstack dependencies: #134610, #134590, #134621
2024-08-29 09:14:42 +00:00
fb35d1e01f [raland][dynamo][exceptions] Support raise from None (#134621)
The PR was reverted because this PR traced more code and surfaced a latent bug. Resubmitting w/o any changes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134621
Approved by: https://github.com/jansel
ghstack dependencies: #134610, #134590
2024-08-29 09:14:42 +00:00
2bf622685d [dynamo][dicts] Support hasattr on dicts (#134590)
Fixes - https://github.com/pytorch/pytorch/issues/134577

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134590
Approved by: https://github.com/Skylion007
ghstack dependencies: #134610
2024-08-29 09:14:42 +00:00
2446dead35 [dynamo][exceptions] Use exception subclass whenever possible (#134610)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134610
Approved by: https://github.com/drisspg, https://github.com/jansel
2024-08-29 09:14:42 +00:00
cfb642bb6b [DTensor] Extend implicit replication to replicate DTensor for foreach ops so model doesn't have to be fully tp-ed when using 2D (#134551)
Fixes [134212](https://github.com/pytorch/pytorch/issues/134212)

Currently, when we use 2D FSDP with TP, `optimizer.step()` would fail if the model were not fully tensor parallelized. If we don't have the entire model tensor parallelized when doing 2D, we would have both 1D and 2D DTensor parameters. As foreach is turned on by default, `optimizer.step()` would fail as cross mesh op is not allowed. Error as follows:

```
NotImplementedError: aten._foreach_mul_.Scalar: DTensor does not support cross-mesh operation yet!Got meshes: DeviceMesh('cuda', [[0, 1], [2, 3]], mesh_dim_names=('dp', 'tp')) DeviceMesh('cuda', [1, 3], mesh_dim_names=('dp',))
```

In this PR, we extend implicit_replication to replicate DTensor in missing dimensions for foreach ops. If users don't want to fully tensor parallelize the model when using 2D, they have the option of using the `implicit_replication()` context manager for `optimizer.step()`. In this case, we would swap out the 1D DTensorSpec and replace it with 2D DTensorSpec. However, we don't want to turn this on by default yet, as we want the users to be aware that the tp dimension is replicated if a layer is not tp-ed.

With implicit implication turning on, try replicate dtensor spec in missing dimension would work for most cases for foreach case except when the first DTensor in the list is one that also need to be replicated. This is currently a limitation, which I don't have a good solution yet. Currently, with this change, we can handle most of the cases except the case that the first DTensor's ndim is not the largest.
```
[2D_DTensor, 1D_DTensor...] ---> Implicit_replication() can handle this.
[1D_DTensor, 2D_DTensor...] ---> Implicit_replication() can't handle this.
```

This change doesn't affect the existing default behavior, as `implicit_replication()` is not turned on by default.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134551
Approved by: https://github.com/tianyu-l
2024-08-29 09:01:31 +00:00
3645634f3c [1/N] Move NaN check onto NCCL stream (#134300)
So that the tensor's lifetime management is the same as the management built for the NCCL, pre and post kernels.
Also so that on visualizers, they show up in the NCCL stream line. Otherwise if they show up in the compute line, user may get confused (my code does not have these kernels).

The check is thus moved after the point where we depend NCCL stream from the last compute kernel.

Also moved declaration of `checkForNan` from Utils.hpp to NCCLUtils.hpp, and renamed Utils.cu to NCCLUtils.cu.

Differential Revision: [D61957573](https://our.internmc.facebook.com/intern/diff/D61957573)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134300
Approved by: https://github.com/shuqiangzhang, https://github.com/wconstab
2024-08-29 08:28:49 +00:00
578b8d75e5 [2nd try][Traceable FSDP2] Allow tracing through FSDP2 impl in trace_rules.py (#134539)
The previous PR https://github.com/pytorch/pytorch/pull/133532 caused stuck compilation issue on internal models. In this 2nd attempt PR, we gate the trace_rules.py changes with `if not torch._dynamo.config.skip_fsdp_hooks:`, so that they don't take effect for current graph-break FSDP2 (which relies on the default config value `skip_fsdp_hooks=True`), and will only take effect when we are using Traceable FSDP2 (in which case the user needs to proactively set `skip_fsdp_hooks=False`).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134539
Approved by: https://github.com/ckluk2, https://github.com/yanboliang
2024-08-29 06:28:16 +00:00
834d8b0965 [Inductor][mkldnn] Bug fix: incorrect codegen arg order for qconv (#134579)
Fixes #133448

The arg order for mkldnn qconv IR became incorrect after PR #132367 . This PR fixes the bug.

**Test plan**
`python test/inductor/test_mkldnn_pattern_matcher.py -k qconv`
`python test/inductor/test_cpu_cpp_wrapper.py -k qconv`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134579
Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5
2024-08-29 06:20:52 +00:00
b0a6d9ad27 [DTensor] Add pointwise ops strategy for aten.isinf, aten.isneginf, aten.isposinf (#134699)
Fixes #ISSUE_NUMBER

Need it for https://github.com/facebookresearch/optimizers/blob/main/distributed_shampoo/utils/shampoo_preconditioner_list.py#L671

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134699
Approved by: https://github.com/tianyu-l
2024-08-29 06:01:12 +00:00
da9e61ef70 Get accumulate dtype for Intel GPU (#134465)
Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at bottom):

There are two function variants to get accumulated dtype for a given dtype:

- Func1: `c10::ScalarType toAccumulateType(c10::ScalarType type, c10::DeviceType device)`
- Func2: `c10::ScalarType toAccumulateType(c10::ScalarType type, bool is_cuda)`

The Func1 is general enough to support different devices, while the Func2 only supports CUDA and CPU. This PR intends to add the Intel GPU path in the Func1. And we expect users to invoke the Func1 to ensure compatibility for different devices.

* __->__ #134465

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134465
Approved by: https://github.com/Skylion007, https://github.com/atalman
2024-08-29 05:27:57 +00:00
94db935749 Add torch.serialization.skip_data context manager (#134504)
## Semantic

The semantic is
(1) By default `torch.serialization.skip_data(materialize_fake_tensors=False)` will make `torch.save` skip writing storages (but reserve space for them in the checkpoint).

```python
import torch
import torch.nn as nn

sd = nn.Linear(3, 5).state_dict()
with torch.serialization.skip_data():
    torch.save(sd, 'foo.pt')
print(torch.load('foo.pt', weights_only=True))
```

(2)  With `torch.serialization.skip_data(materialize_fake_tensors=True)`If FakeTensor is passed to `torch.save` the pickler will treat these FakeTensors as being "materialized" space will be reserved in the checkpoint for the associated storage bytes, and when loading the type will be Tensor instead of FakeTensor)

```python
import torch
import torch.nn as nn
from torch._subclasses.fake_tensor import FakeTensorMode

with FakeTensorMode():
    m = nn.Linear(3, 5, dtype=torch.float16, device='cuda')

sd = m.state_dict()
with torch.serialization.skip_data(materialize_fake_tensors=True):
    torch.save(sd, 'bla.pt')
print(torch.load('bla.pt', weights_only=True))
# OrderedDict([('weight', tensor([[0., 0., 0.],
#        [0., 0., 0.],
#        [0., 0., 0.],
#        [0., 0., 0.],
#        [0., 0., 0.]], device='cuda:0', dtype=torch.float16)), ('bias', tensor([0., 0., 0., 0., 0.], device='cuda:0', dtype=torch.float16))])

```

## Follow Ups

- [ ] `torch.load` semantic for skip_data context manager
- [ ] Mechanism for getting offsets of storages saved via this method (for writing in a separate pass)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134504
Approved by: https://github.com/albanD
2024-08-29 04:52:52 +00:00
297b42012d [PyTorch] Use pinned memory for zero_cuda_out (#134712)
Summary: This diff creates a pinned tensor for copying from device for the zero_out op.

Differential Revision: D61759262

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134712
Approved by: https://github.com/zyan0
2024-08-29 04:46:08 +00:00
a32255481b [caffe2][hipify] remove un-used flag from pybind_utils.h (#134404)
Summary:
Encountered issues related to AMD build when working on https://www.internalfb.com/diff/D60739324?dst_version_fbid=2203158110057105 (see stack trace P1545717562)

Looking at the file history, seems that the flag is no longer used so I propose to remove it.  Alternatively, I could change the `#ifdef` to check both `USE_C10D_NCCL` and  `USE_ROCM` and include the corresponding AMD header files.

Let me know what is more preferred way.

Test Plan: Sandcastle

Differential Revision: D61762129

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134404
Approved by: https://github.com/malfet
2024-08-29 04:09:44 +00:00
4655eb3ee2 Uses MemPoolContext to route allocations from CUDACachingAllocator (#134685)
Re-open of https://github.com/pytorch/pytorch/pull/133599 that was mistakenly closed by issuing `ghstack land`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134685
Approved by: https://github.com/ezyang
2024-08-29 03:56:31 +00:00
4b4ba7ab06 [NJT] Support NJT SDPA + meta-device flop counting (#134289)
A user wants to use the flop counter with meta devices. This previously caused problems for SDPA+NJT:

1. autocast check: `torch.is_autocast_enabled("meta")` fails because `meta` is not valid for autocasting. If we skip this, we run into the next error
2. math backend: conversion to NST requires getting concrete offsets in a list of python integers, which doesn't work on a meta tensor b2eb0e8c6a/torch/nested/_internal/sdpa.py (L809-L815)
3. (fixed in the previous PR, #134288) - if we force using flash attention backend for flop counting, `_flash_attention_forward` previously didn't support meta tensors.

In this PR, we check specifically for FlopCounterMode, and, if it's enabled and combined with meta tensors, (a) skip autocasting and (b) force it down the flash attention path. This isn't generally safe for tracing (e.g. if you actually care which kernels you are running), but in the absence of actual device information, we have to make some assumptions. By specifically checking for FlopCounterMode, this should reduce the chance of unintended side effects for other meta tensor users.

Note: fake tensor would solve a bunch of these issues, but it's not a viable solution right now for the user.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134289
Approved by: https://github.com/soulitzer
ghstack dependencies: #134288
2024-08-29 03:43:42 +00:00
17e9c2d1e7 Add oneDNN support for Half LSTM on CPU (#132607)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132607
Approved by: https://github.com/jgong5, https://github.com/peterbell10
2024-08-29 03:40:10 +00:00
41e36e2b46 Reflect check_labels status as a signal (#134711)
Fixes the workflow when meta-exported diff (co-dev) doesn't have the required labels, but the signal is suppressed due to job failure (e.g. [see this run](https://github.com/pytorch/pytorch/actions/runs/10590994706/job/29347663526?pr=134484)).

With this change the workflow status correctly reflects the status of the check.

# Testing
* [illegal pr_num](https://github.com/pytorch/pytorch/actions/runs/10603163898/job/29386843591)
* [successful run](https://github.com/pytorch/pytorch/actions/runs/10603279052/job/29387230110) (topic label present)
* no labels: [check fails](https://github.com/pytorch/pytorch/actions/runs/10603310368/job/29387333864)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134711
Approved by: https://github.com/clee2000
2024-08-29 03:11:16 +00:00
4f9c68454a [inductor]Let output or input_as_strided match exact strides (#130956)
Fixes #130394

TorchInductor doesn't respect original strides of outputs. It opens up optimization opportunities like changing up memory layout. But for some cases, such as the case in https://github.com/pytorch/pytorch/issues/130394, we do need the output match the exact stride as required. The correctness is the first priority goal. So, this PR adds a new API `ir.ExternKernel.require_exact_strides(x, exact_strides, allow_padding=False)` to fix the issue.  This PR enables dense and non-dense outputs' strides follow the strides required by semantics.

The comparison between the original and after this fix for the test is the below.

```python
@triton.jit
def triton_(in_ptr0, out_ptr0, xnumel, XBLOCK : tl.constexpr):
    xnumel = 128
    xoffset = tl.program_id(0) * XBLOCK
    xindex = xoffset + tl.arange(0, XBLOCK)[:]
    xmask = xindex < xnumel
    x0 = xindex % 8
    x1 = (xindex // 8)
-   x2 = xindex
    tmp0 = tl.load(in_ptr0 + (x0 + (16*x1)), xmask)
    tmp1 = tmp0 + tmp0
-   tl.store(out_ptr0 + (x2), tmp1, xmask)
+   tl.store(out_ptr0 + (x0 + (16*x1)), tmp1, xmask)

def call(args):
    arg0_1, = args
    args.clear()
    assert_size_stride(arg0_1, (16, 8), (16, 1))
    with torch.cuda._DeviceGuard(0):
        torch.cuda.set_device(0)
-       buf1 = empty_strided_cuda((16, 8), (8, 1), torch.float32)
+       buf1 = empty_strided_cuda((16, 8), (16, 1), torch.float32)
        stream0 = get_raw_stream(0)
        triton_poi_fused_add_copy_0.run(arg0_1, buf1, 128, grid=grid(128), stream=stream0)
        del arg0_1
    return (buf1, )
```

The buf1 is created with exact stride required by users, and its values are written in same stride with the input.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130956
Approved by: https://github.com/eellison, https://github.com/blaine-rister, https://github.com/desertfire
2024-08-29 03:06:58 +00:00
4811dc3de9 Revert "[dynamo] simplify polyfill registration for builtins.all and builtins.any (#133769)"
This reverts commit cc3a76edbac4a48381db6ccc44a83927f80c545b.

Reverted https://github.com/pytorch/pytorch/pull/133769 on behalf of https://github.com/ZainRizvi due to Sorry but this has been discovered to be causing a performance regression internally ([comment](https://github.com/pytorch/pytorch/pull/133769#issuecomment-2316620213))
2024-08-29 03:00:47 +00:00
f65df5edae Revert "[dynamo][itertools] support itertools.tee (#133771)"
This reverts commit 1dbd3476de07d7f07489e243cb7a43073e8c25c1.

Reverted https://github.com/pytorch/pytorch/pull/133771 on behalf of https://github.com/ZainRizvi due to Sorry, have to revert this in order to be able to revert https://github.com/pytorch/pytorch/pull/133769 ([comment](https://github.com/pytorch/pytorch/pull/133771#issuecomment-2316611158))
2024-08-29 02:49:30 +00:00
eaec9e80b8 Revert "[dynamo] simplify implementation for os.fspath (#133801)"
This reverts commit 74341e1150f10b8aaddd33a165e686724424071f.

Reverted https://github.com/pytorch/pytorch/pull/133801 on behalf of https://github.com/ZainRizvi due to Sorry, have to revert this in order to be able to revert https://github.com/pytorch/pytorch/pull/133769 ([comment](https://github.com/pytorch/pytorch/pull/133771#issuecomment-2316611158))
2024-08-29 02:49:30 +00:00
76f975948e [inductor] Cleanup generate_node_schedule (#134306)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134306
Approved by: https://github.com/shunting314
2024-08-29 02:45:14 +00:00
cccb121d4e [Inductor] add inductor config: masked_vec (#134566)
This PR adds inductor config: masked_vec to control enable/disable masked vectorization for the tail_loop, and enable by default.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134566
Approved by: https://github.com/jgong5, https://github.com/jansel
2024-08-29 02:29:06 +00:00
c5f114747e fix flakiness in update_hint_benchmark.py (#134649)
```
compile time instruction count for iteration 1 is 10732129038
compile time instruction count for iteration 2 is 10719776783
compile time instruction count for iteration 3 is 10729546868
compile time instruction count for iteration 4 is 10737655132
compile time instruction count for iteration 5 is 10732564252
compile time instruction count for iteration 6 is 10728721234
compile time instruction count for iteration 7 is 10733354271
compile time instruction count for iteration 8 is 10719588972
compile time instruction count for iteration 9 is 10706311856
```
1. add torch.manual_seed(0), inputs was not the same across iterations
2. disable gc.
3. remove loop (not needed since compilation happen once only)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134649
Approved by: https://github.com/aorenste
ghstack dependencies: #133834, #134635
2024-08-29 02:22:05 +00:00
f0fceed432 Revert "[dynamo][exceptions] Use exception subclass whenever possible (#134610)"
This reverts commit 880e3d18a406777dbea6aeaf14443b0e3a8b441c.

Reverted https://github.com/pytorch/pytorch/pull/134610 on behalf of https://github.com/ZainRizvi due to Sorry, I had to revert this in order to revert another PR ([comment](https://github.com/pytorch/pytorch/pull/134610#issuecomment-2316568553))
2024-08-29 02:02:12 +00:00
67d7040fce Revert "[dynamo][dicts] Support hasattr on dicts (#134590)"
This reverts commit c566f2465f41b8081caed205fcf5fe973fd970b3.

Reverted https://github.com/pytorch/pytorch/pull/134590 on behalf of https://github.com/ZainRizvi due to Sorry, I had to revert this in order to revert another PR ([comment](https://github.com/pytorch/pytorch/pull/134610#issuecomment-2316568553))
2024-08-29 02:02:12 +00:00
40cebde3bc Revert "[raland][dynamo][exceptions] Support raise from None (#134621)"
This reverts commit e96dc3665a1d48434c02e17f7faed41f779cee2c.

Reverted https://github.com/pytorch/pytorch/pull/134621 on behalf of https://github.com/ZainRizvi due to Sorry, I had to revert this in order to revert another PR ([comment](https://github.com/pytorch/pytorch/pull/134610#issuecomment-2316568553))
2024-08-29 02:02:12 +00:00
c35d1f7b3a Revert "[dynamo] Graph break on FSDP flat_param inconsistent tensor and grad dtype (#134614)"
This reverts commit e4a5958ab58e2f9b5b9c336a1d2a6449784d88d3.

Reverted https://github.com/pytorch/pytorch/pull/134614 on behalf of https://github.com/ZainRizvi due to Sorry, I had to revert this in order to revert another PR ([comment](https://github.com/pytorch/pytorch/pull/134610#issuecomment-2316568553))
2024-08-29 02:02:12 +00:00
25531eb735 Revert "[2nd try][Traceable FSDP2] Allow tracing through FSDP2 impl in trace_rules.py (#134539)"
This reverts commit 26e392132d3039345de6aaf8643e7330f7fc3cbc.

Reverted https://github.com/pytorch/pytorch/pull/134539 on behalf of https://github.com/ZainRizvi due to Sorry, I had to revert this in order to revert another PR ([comment](https://github.com/pytorch/pytorch/pull/134539#issuecomment-2316568257))
2024-08-29 01:59:02 +00:00
cbf5ba1e97 Revert "[1/N] Move NaN check onto NCCL stream (#134300)"
This reverts commit 94caba4899096f160eca9628acddba6032755b3b.

Reverted https://github.com/pytorch/pytorch/pull/134300 on behalf of https://github.com/kwen2501 due to This is breaking builds of MTIA ([comment](https://github.com/pytorch/pytorch/pull/134300#issuecomment-2316559704))
2024-08-29 01:50:22 +00:00
33d0c11b26 Revert "[2/N] Add flag to control which rank should perform NaN check (#134345)"
This reverts commit 2fe7e332c7a61f025ccbcdbbb4875c6bf0b9afdf.

Reverted https://github.com/pytorch/pytorch/pull/134345 on behalf of https://github.com/kwen2501 due to This is breaking builds of MTIA ([comment](https://github.com/pytorch/pytorch/pull/134300#issuecomment-2316559704))
2024-08-29 01:50:22 +00:00
43dc17fd00 Revert "[3/N] Set correct device to CUDA guards (#134357)"
This reverts commit afc76c6f2d46d7726012507ec5c67b4c04e21723.

Reverted https://github.com/pytorch/pytorch/pull/134357 on behalf of https://github.com/kwen2501 due to This is breaking builds of MTIA ([comment](https://github.com/pytorch/pytorch/pull/134300#issuecomment-2316559704))
2024-08-29 01:50:22 +00:00
503c0dd923 Revert "Add MaskedTensor support to *_like API (#128637)"
This reverts commit b6e51711a0ea6174806e75ab6e208d2d910b45f5.

Reverted https://github.com/pytorch/pytorch/pull/128637 on behalf of https://github.com/ZainRizvi due to Actually, seems like it was this commit that introduced the failure: test_maskedtensor.py::TestOperatorsCUDA::test_like_empty_like_layout1_cuda_bool [GH job link](https://github.com/pytorch/pytorch/actions/runs/10604690725/job/29392898277) [HUD commit link](b6e51711a0) ([comment](https://github.com/pytorch/pytorch/pull/128637#issuecomment-2316554188))
2024-08-29 01:42:52 +00:00
1285443994 Revert "Add torch.serialization.skip_data context manager (#134504)"
This reverts commit 202600bc2384cb19a29b8fca503bafc289158c32.

Reverted https://github.com/pytorch/pytorch/pull/134504 on behalf of https://github.com/mikaylagawarecki due to This is breaking Windows docs tests due to NamedTemporaryFile on Windows not working well ([comment](https://github.com/pytorch/pytorch/pull/134504#issuecomment-2316543901))
2024-08-29 01:30:49 +00:00
e7711d6c7d [MPS] Fix SDP training (#134719)
Check whether the input tensors require grad. If required, then we don't get into the fast path and fall back to composite implicit.

Fixes #134678
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134719
Approved by: https://github.com/malfet
2024-08-29 01:28:53 +00:00
ca03a14cf7 hang dim hint constants off Dim (#134702)
Summary: Retry landing https://github.com/pytorch/pytorch/pull/134484

Test Plan: (see original)

Differential Revision: D61925860

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134702
Approved by: https://github.com/pianpwk
2024-08-29 01:02:01 +00:00
7a554e96b4 [AOTI][Tooling] Follow up to print location of saved file path for torch.pickle_save() (#134651)
Summary:
- Follow up to add torch.pickle_save() log for saved file path

- Minor debug printer code refine

Test Plan: CI

Differential Revision: D61883239

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134651
Approved by: https://github.com/muchulee8
2024-08-28 23:58:37 +00:00
202600bc23 Add torch.serialization.skip_data context manager (#134504)
## Semantic

The semantic is
(1) By default `torch.serialization.skip_data(materialize_fake_tensors=False)` will make `torch.save` skip writing storages (but reserve space for them in the checkpoint).

```python
import torch
import torch.nn as nn

sd = nn.Linear(3, 5).state_dict()
with torch.serialization.skip_data():
    torch.save(sd, 'foo.pt')
print(torch.load('foo.pt', weights_only=True))
```

(2)  With `torch.serialization.skip_data(materialize_fake_tensors=True)`If FakeTensor is passed to `torch.save` the pickler will treat these FakeTensors as being "materialized" space will be reserved in the checkpoint for the associated storage bytes, and when loading the type will be Tensor instead of FakeTensor)

```python
import torch
import torch.nn as nn
from torch._subclasses.fake_tensor import FakeTensorMode

with FakeTensorMode():
    m = nn.Linear(3, 5, dtype=torch.float16, device='cuda')

sd = m.state_dict()
with torch.serialization.skip_data(materialize_fake_tensors=True):
    torch.save(sd, 'bla.pt')
print(torch.load('bla.pt', weights_only=True))
# OrderedDict([('weight', tensor([[0., 0., 0.],
#        [0., 0., 0.],
#        [0., 0., 0.],
#        [0., 0., 0.],
#        [0., 0., 0.]], device='cuda:0', dtype=torch.float16)), ('bias', tensor([0., 0., 0., 0., 0.], device='cuda:0', dtype=torch.float16))])

```

## Follow Ups

- [ ] `torch.load` semantic for skip_data context manager
- [ ] Mechanism for getting offsets of storages saved via this method (for writing in a separate pass)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134504
Approved by: https://github.com/albanD
2024-08-28 23:53:17 +00:00
f997b2b8e6 Revert "Add MaskedTensor passthrough: unfold, F.Unfold, F.Fold, stack (#125262)"
This reverts commit f685018ea9d08f98cbd7106028db134f967f74d3.

Reverted https://github.com/pytorch/pytorch/pull/125262 on behalf of https://github.com/ZainRizvi due to Hi, this PR appears to be calling maskedtensor tests to fail on main. Please rebase your changes onto the latest trunk build to repro the failure. test_maskedtensor.py::TestOperatorsCUDA::test_like_empty_like_layout1_cuda_bool [GH job link](https://github.com/pytorch/pytorch/actions/runs/10604716811/job/29393256312) [HUD commit link](f685018ea9) ([comment](https://github.com/pytorch/pytorch/pull/125262#issuecomment-2316387447))
2024-08-28 23:10:07 +00:00
6dd3f81aaf Add export_for_training as public API (#134677)
Differential Revision: [D61912084](https://our.internmc.facebook.com/intern/diff/D61912084)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134677
Approved by: https://github.com/avikchaudhuri, https://github.com/zhxchen17
2024-08-28 22:32:10 +00:00
a7933acd5a Improve custom ops aliasing error message (#134688)
Fixes https://github.com/pytorch/pytorch/issues/134278

Test Plan:
- tested locally
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134688
Approved by: https://github.com/yushangdi
ghstack dependencies: #134466, #134490, #134491, #134690, #134692
2024-08-28 22:22:04 +00:00
dd443f418a Improve opcheck docs. (#134692)
Fixes https://github.com/pytorch/pytorch/issues/134119
From user feedback, it's difficult to understand what the tests do. We
clarify the docs more.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134692
Approved by: https://github.com/albanD
ghstack dependencies: #134466, #134490, #134491, #134690
2024-08-28 22:22:04 +00:00
afc76c6f2d [3/N] Set correct device to CUDA guards (#134357)
In `collective()`, `pointToPoint()` and `collectiveCoalesced()`, CUDA guards were created with an unset (default) CUDA device. This is the reason for the IMA facing the NaN checker in issue https://github.com/pytorch/pytorch/issues/134062.

With this fix, `torch.cuda.set_device(device)` is not needed to work around the IMA.

Also refactored a couple places where the guard is created -- preferably we create the guard with a known device, rather than setting the device later.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134357
Approved by: https://github.com/wconstab, https://github.com/shuqiangzhang
ghstack dependencies: #134300, #134345
2024-08-28 22:17:11 +00:00
5ff97e79ee Skip test_mutable_custom_op_fixed_layout2 on ROCM (#134690)
ROCM doesn't trigger the layout optimization that makes the test case
valid so we're going to skip the checks.

Should fix the following (I'll close them later)
- https://github.com/pytorch/pytorch/issues/134481
- https://github.com/pytorch/pytorch/issues/134519

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134690
Approved by: https://github.com/FindHao
ghstack dependencies: #134466, #134490, #134491
2024-08-28 22:12:24 +00:00
2fe7e332c7 [2/N] Add flag to control which rank should perform NaN check (#134345)
Fixes https://github.com/pytorch/pytorch/issues/134062.
For example, in case of broadcast / scatter, only the root rank should perform the NaN check.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134345
Approved by: https://github.com/shuqiangzhang, https://github.com/wconstab
ghstack dependencies: #134300
2024-08-28 21:53:39 +00:00
26ec06e45d [amd][lowering] hipify shim v2 headers (#134689)
Summary: The default c_shim version was switched to 2 for HIP in D60674018. This results in some linking errors where shim function symbols are missing from the compiled .so file (eg. P1551186492) when building lowering benchmark scripts since the required files aren't included. Hipify the shim v2 generated header files as well since they're needed during codegen when the buck binaries are executed.

Reviewed By: frank-wei, zoranzhao, henryoier

Differential Revision: D61865202

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134689
Approved by: https://github.com/zoranzhao
2024-08-28 21:53:24 +00:00
7b3da5f297 Revert "[dynamo] Cache _dynamo.disable results (#134272)"
This reverts commit dbef2b05b4d81e891f7497f92f730a22bebe445d.

Reverted https://github.com/pytorch/pytorch/pull/134272 on behalf of https://github.com/anijain2305 due to Peak mem increase detected internally ([comment](https://github.com/pytorch/pytorch/pull/134272#issuecomment-2316308170))
2024-08-28 21:51:43 +00:00
20b62fed21 Create processes in parallel in mp.start_processes for forkserver (#134629)
Summary:
This is to fix the pytorch issue filed https://github.com/pytorch/pytorch/issues/133010
one way to fix this problem is to enable parallel start processes in mp.start_processes.
What else in the diff:
refactored a test case api_test which was repeating a lot of tests due to the inheritance.
added unit test for forkserver when parallel start is on.

Test Plan: Added unit tests

Differential Revision: D61878552

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134629
Approved by: https://github.com/d4l3k
2024-08-28 21:34:32 +00:00
f685018ea9 Add MaskedTensor passthrough: unfold, F.Unfold, F.Fold, stack (#125262)
Hi,
I noticed the `unfold` operator was missing on MaskedTensor.

I tested that my change works when calling unfold and backward on a `MaskedTensor` but I didn't find the tests for the dispatch of such operation. Where is it?
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125262
Approved by: https://github.com/cpuhrsch
2024-08-28 21:30:39 +00:00
b6e51711a0 Add MaskedTensor support to *_like API (#128637)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128637
Approved by: https://github.com/cpuhrsch
2024-08-28 21:28:23 +00:00
4c16797e71 [c10d FR analyzer] Output a meaningful debug report for users (#134528)
- This PR generates a more useful output log for users: P1552399180.
- It also fixes the logic when we check the all-gather size mismatch.
- Add dtype check for collective input/output
- We store more context information for error match_state so that we can report them in the file.
- Disable the size match for alltoall because we don't log the size for all inputs/outputs.
- Correct some types for func args specification.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134528
Approved by: https://github.com/c-p-i-o
2024-08-28 21:22:47 +00:00
de35d3062f Runtime Estimator for estimating GPU compute time (#134243)
This PR adds a basic Runtime Estimator for single-device models.
It estimates the GPU runtime in milliseconds using various estimation methods under the ``FakeTensorMode``.
It provides a ``TorchDispatchMode`` based context manager that can estimate the eager runtime of PyTorch functions. It supports two estimation modes, benchmarking (`operator-level-benchmark`) and roofline cost modeling (`operator-level-cost-model`).
For modules executed under this context manager, it agggregates the forward and backward operation runtimes and records their execution orders.

```
import torch
from torch import nn, optim
from torch._subclasses.fake_tensor import FakeTensorMode
from torch.distributed._tools.runtime_estimator import RuntimeEstimator
from torch.testing._internal.distributed._tensor.common_dtensor import (
    ModelArgs,
    Transformer,
)

if __name__ == "__main__":
    def _train_step(
        model: nn.Module,
        optimizer: optim.Optimizer,
        inp: torch.Tensor,
    ):
        out = model(inp)
        loss = out.sum()
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

    dev = torch.cuda.current_device()
    vocab_size = 8192
    bsz, seq_len = 32, 1024
    model_args = ModelArgs(
        n_layers=4,
        n_heads=12,
        vocab_size=vocab_size,
        max_seq_len=seq_len,
        dim=768,
        dropout_p=0.1,
    )
    runtime_estimator = RuntimeEstimator()

    with FakeTensorMode():
        with torch.device(dev):
            model = Transformer(model_args)
        optimizer = optim.Adam(model.parameters(), lr=1e-2, foreach=True)
        inp = torch.randint(0, model_args.vocab_size, (bsz, model_args.max_seq_len), device=dev)
        with runtime_estimator("operator-level-benchmark"):
            _train_step(model, optimizer, inp)
        with runtime_estimator("operator-level-cost-model"):
            _train_step(model, optimizer, inp)

    # Actual model runtime
    with torch.device(dev):
        model = Transformer(model_args)
    optimizer = optim.Adam(model.parameters(), lr=1e-2, foreach=True)
    inp = torch.randint(0, model_args.vocab_size, (bsz, model_args.max_seq_len), device=dev)
    warmup_iters, actual_iters = 2, 5
    start_event = torch.cuda.Event(enable_timing=True)
    end_event = torch.cuda.Event(enable_timing=True)
    for _ in range(warmup_iters):
        _train_step(model, optimizer, inp)
    start_event.record()
    for _ in range(actual_iters):
        _train_step(model, optimizer, inp)
    end_event.record()
    torch.cuda.synchronize()
    measured_time = start_event.elapsed_time(end_event) / actual_iters
    print(f"Actual total_time: {measured_time:.3f} ms")
  ```

<img width="506" alt="Screenshot 2024-08-26 at 11 27 15 PM" src="https://github.com/user-attachments/assets/04d243c9-21a6-4389-8c20-80958980788c">

@weifengpy @xuanzhang816 @gnadathur

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134243
Approved by: https://github.com/weifengpy
2024-08-28 20:06:54 +00:00
cae817c862 [ET][CodeGen] Remove TORCH_API from NativeFunctions.h declarations (#134245)
Summary:
Remove TORCH_API from the generated executorch/kernels/portable/NativeFunctions.h declarations

These generated declarations are using ET tensors. They don't need to have the TORCH_API macro prefixed to them, since in this case TORCH_API is just empty. See [codegen/macros.h](https://www.internalfb.com/code/fbsource/[d12d7d3accfb12932368e0216124f2d735c51d73]/fbcode/executorch/codegen/macros.h)

Test Plan: CI

Differential Revision: D61490943

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134245
Approved by: https://github.com/larryliu0820
2024-08-28 19:58:37 +00:00
b07d0a22f5 [hop] require hops to override __call__. (#134352)
Fixes https://github.com/pytorch/pytorch/issues/133719 by making `__call__` of hops an abstractmethod.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134352
Approved by: https://github.com/zou3519
2024-08-28 19:56:40 +00:00
66c33d5989 Revert "[2/N] Add flag to control which rank should perform NaN check (#134345)"
This reverts commit be7752ead3824e79f5ede6a2f59715b415a2f776.

Reverted https://github.com/pytorch/pytorch/pull/134345 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/134345#issuecomment-2316133024))
2024-08-28 19:51:59 +00:00
23e26b84af Revert "[3/N] Set correct device to CUDA guards (#134357)"
This reverts commit 13114da4ef9d14978ea1dfc0fefb236cb4000435.

Reverted https://github.com/pytorch/pytorch/pull/134357 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/134357#issuecomment-2316121423))
2024-08-28 19:44:55 +00:00
3b40b07efb Update PyTorch for XNNPACK 87ee0b4 (#134518)
Summary: Update XNNPACK library version.

Test Plan: Combined diff CI is clean: D61586079 (all changes, has to be split out for export).

Differential Revision: D61822610

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134518
Approved by: https://github.com/mcr229
2024-08-28 19:24:04 +00:00
042b733ddd [dynamo][freezing] Set is_static_type to false after marking an input static (#134653)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134653
Approved by: https://github.com/mlazos
2024-08-28 19:22:37 +00:00
aa31e7019a [FSDP] Made clip_grad_norm_ norm compute order deterministic (#134673)
Fixes https://github.com/pytorch/pytorch/issues/134393

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134673
Approved by: https://github.com/weifengpy
ghstack dependencies: #134152
2024-08-28 18:44:11 +00:00
47ba47a81f [compiled autograd] error instead of deadlock on reentrant autograd (#134530)
reentrant calls autograd multiple times using the same thread, so it passes all the thread checks and hangs waiting for the lock it holds in another scope

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134530
Approved by: https://github.com/jansel
ghstack dependencies: #134514
2024-08-28 17:54:31 +00:00
c352b6aaaf [compiled autograd][cpp node] point c++ custom autograd functions tracing error to google doc (#134514)
`RuntimeError: Attempting to trace a potentially unsafe C++ autograd function: torch::autograd::CppNode<CustomOpAutogradFunction>. It may be possible to trace it safely, please refer to the instructions in: https://docs.google.com/document/d/11VucFBEewzqgkABIjebZIzMvrXr3BtcY1aGKpX61pJY/.`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134514
Approved by: https://github.com/yf225
2024-08-28 17:54:31 +00:00
ba5aec88c6 [reland][dtensor][MTPG] make sharding prop lru cache not shared among threads (#134509)
**Summary**
reland of https://github.com/pytorch/pytorch/pull/134294

Fixes #131446
Fixes #126852
Fixes #126868
Fixes #126493

The PR was reverted due to CI red signal in https://github.com/pytorch/pytorch/actions/runs/10537099590/job/29201744658. It seems that the `gaussian_nll_loss` test had been flaky before my original PR #134294 . Therefore this PR also removes the `xfail` mark on this specific test to make CI signal green.

See the error message below:
```
2024-08-24T13:42:01.3228990Z ==================================== RERUNS ====================================
2024-08-24T13:42:01.3229530Z _ TestDTensorOpsCPU.test_dtensor_op_db_nn_functional_gaussian_nll_loss_cpu_float32 _
2024-08-24T13:42:01.3229710Z Unexpected success
2024-08-24T13:42:01.3230235Z _ TestDTensorOpsCPU.test_dtensor_op_db_nn_functional_gaussian_nll_loss_cpu_float32 _
2024-08-24T13:42:01.3230407Z Unexpected success
2024-08-24T13:42:01.3230594Z =================================== FAILURES ===================================
2024-08-24T13:42:01.3231128Z _ TestDTensorOpsCPU.test_dtensor_op_db_nn_functional_gaussian_nll_loss_cpu_float32 _
2024-08-24T13:42:01.3231296Z Unexpected success
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134509
Approved by: https://github.com/tianyu-l, https://github.com/wz337
2024-08-28 17:51:44 +00:00
310eb6d8c6 [AOTI] Fix test_aoti_inference CPU build issue (#134675)
Summary: Fixes https://github.com/pytorch/pytorch/issues/130311. We need to guard CUDA-only code in test_aoti_inference with macros so that it won't fail for CPU-only platform.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134675
Approved by: https://github.com/atalman, https://github.com/chunyuan-w
2024-08-28 17:42:19 +00:00
633a9a3b13 add back sum_floordiv benchmark. (#134635)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134635
Approved by: https://github.com/avikchaudhuri, https://github.com/oulgen
ghstack dependencies: #133834
2024-08-28 17:38:24 +00:00
b8859dc4b8 [PyTorch Pin Memory Allocator] Optimize the free list implementation and add lock sharding (#134154)
Summary: This diff addresses the lock contention issue in free list implementation of CachingHost/Pinned allocator. We add a different data structure for free list and also add lock sharding based on allocation size.

Differential Revision: D61623367

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134154
Approved by: https://github.com/guangyey, https://github.com/jgong5, https://github.com/zyan0, https://github.com/EikanWang, https://github.com/jiayisuse
2024-08-28 17:12:01 +00:00
40de63be09 parameterized test_graph_optims and test_graph_scaling_fused_optimizers (#133749)
Fixes #123451

This is a rework of a reverted pull request, https://github.com/pytorch/pytorch/pull/125127.
The test failure is fixed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133749
Approved by: https://github.com/janeyx99
2024-08-28 16:34:06 +00:00
c7338f457c [DCP] Fixes the BC issue where the traversal doesn't support versions before 2.4 (#134158)
The original DCP doesn't flattening all the containers, which can cause issues, https://github.com/pytorch/pytorch/pull/125335 intends to solve the issue by flattening all the dictionaries.

Unfortunately, it breaks the checkpoints that are saved before 2.4. This
also shows some issues of the DCP:

1. DCP should record version in the metadata.
2. DCP should have a nice way to load old state_dict.
3. DCP should unflatten all containers (map, list) not just map.

This PR only addresses issue 2 to unblock users. Issue 1 and issue 3 need to be addressed in the future.

@pradeepfn Please let me know if this summary matches our discussion.

Fixes https://github.com/pytorch/pytorch/issues/133923

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134158
Approved by: https://github.com/wz337, https://github.com/pradeepfn
2024-08-28 16:31:44 +00:00
13d40f6fc5 Revert "hang dim hint constants off Dim (#134484)"
This reverts commit c142af7209a423a05504fdec50680333f5a37629.

Reverted https://github.com/pytorch/pytorch/pull/134484 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/134484#issuecomment-2315749549))
2024-08-28 16:05:42 +00:00
2c88a923a7 Revert "Refactor caching device allocator utils (#130923)"
This reverts commit c45ca8092dddf718563a1a754de798ad25eae1ee.

Reverted https://github.com/pytorch/pytorch/pull/130923 on behalf of https://github.com/ZainRizvi due to Sorry but this appears to be causing internal tests to fail with errors like `error: no type named 'DeviceStats' in namespace 'xxx::xxx:xxxAllocator'; did you mean 'DeviceStatus'?` ([comment](https://github.com/pytorch/pytorch/pull/130923#issuecomment-2315730155))
2024-08-28 15:56:08 +00:00
d52aff3e73 Revert "Adding entry-point based support for out-of-tree rendezvous plugins (#132633)"
This reverts commit 136b19b062f62c81ea3ed8fb306debe9d7720e93.

Reverted https://github.com/pytorch/pytorch/pull/132633 on behalf of https://github.com/ZainRizvi due to Sorry but this is causing internal tests to fail with the error `ImportError: cannot import name '_register_out_of_tree_handlers' from 'torch.distributed.elastic.rendezvous.registry'` ([comment](https://github.com/pytorch/pytorch/pull/132633#issuecomment-2315716201))
2024-08-28 15:49:18 +00:00
85d9946001 [CI] change conda to miniforge for XPU images (#134455)
The `.ci/docker` change with `ciflow/xpu` label will trigger docker images rebuild on xpu runner, but xpu runner can't use miniconda, change to miniforge. Works for https://github.com/pytorch/pytorch/issues/114850
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134455
Approved by: https://github.com/atalman
2024-08-28 15:16:14 +00:00
208b922327 [Intel GPU] Remove special dispatch logic for xpu in adaptive_avg_pooling (#132217)
We now align the dispatch logic for XPU with CUDA in the adaptive average pooling operation.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132217
Approved by: https://github.com/EikanWang, https://github.com/atalman, https://github.com/albanD, https://github.com/malfet
2024-08-28 15:06:35 +00:00
e6bf1710ff [Inductor][Refactor] Rename CPU benchmark test configs (#134639)
Summary: benchmarks/dynamo/ci_expected_accuracy/update_expected.py expects a benchmark run config is named as {config}_{benchmark}, and CPU tests should follow the same naming convention.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134639
Approved by: https://github.com/huydhn
2024-08-28 14:49:55 +00:00
c142af7209 hang dim hint constants off Dim (#134484)
Summary: Recently https://github.com/pytorch/pytorch/pull/133620 added support for automatic dynamic shapes, where a new enum, `DIM`, was introduced to provide hints like `AUTO` and `STATIC`. This PR is a nominal change where we expose the hints via the existing public `Dim` API, and remove `DIM` from the public API. The main motivation is to avoid having users need to import too many things.

Test Plan: existing

Differential Revision: D61807361

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134484
Approved by: https://github.com/angelayi
2024-08-28 14:35:40 +00:00
3e42f21eee Bucketize fix to include number and tensor inputs (#133652)
Fixes #132222

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133652
Approved by: https://github.com/ezyang
2024-08-28 13:35:41 +00:00
bb22132c8d [aotd] Make effects op registry WeakKeyDictionary (#134470)
Op is used as a Dictionary Key, while op can be deregistered as a result this Key will be holding this op from deallocation.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134470
Approved by: https://github.com/zou3519
2024-08-28 12:12:00 +00:00
97c8a0739e [Dynamo] Support inspect.signature.Parameter getattr (#134636)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134636
Approved by: https://github.com/Chillee, https://github.com/anijain2305
2024-08-28 09:59:41 +00:00
26e392132d [2nd try][Traceable FSDP2] Allow tracing through FSDP2 impl in trace_rules.py (#134539)
The previous PR https://github.com/pytorch/pytorch/pull/133532 caused stuck compilation issue on internal models. In this 2nd attempt PR, we gate the trace_rules.py changes with `if not torch._dynamo.config.skip_fsdp_hooks:`, so that they don't take effect for current graph-break FSDP2 (which relies on the default config value `skip_fsdp_hooks=True`), and will only take effect when we are using Traceable FSDP2 (in which case the user needs to proactively set `skip_fsdp_hooks=False`).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134539
Approved by: https://github.com/ckluk2, https://github.com/yanboliang
2024-08-28 08:57:56 +00:00
8693322ef0 [Dynamo][autograd.Function] Support mark_non_differentiable (#134087)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134087
Approved by: https://github.com/zou3519
2024-08-28 08:12:37 +00:00
d01415409b [PGNCCL] Improve logic to infer device for barrier (#134617)
Fixes #134391, #124714

The above issues reported that `dist.barrier()` could hang in some cases.
The culprit is that ProcessGroupNCCL inferred a wrong device to perform the dummy all-reduce.

After the PR, the following will be the order of device selection:
- 1st choice: `opts.device_ids`, if provided by user via `barrier(opts)`.
- 2nd choice: bound device id, if provided to `init_process_group` via `device_id` arg.
- 3rd choice: `usedDeviceIdxs_` recorded in current PG. Will have a value from previous collectives.
- 4th choice: `globalRank() % localDeviceCount_`. This can only happen when `dist.barrier()` is the first call of the PG.

What's new:
- Added the 2nd choice.
- In the 4th choice, we use `globalRank()` instead of group-local rank, because the group-local rank can be offset wrt the device id if intra-node GPUs are sharded into multiple dimensions.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134617
Approved by: https://github.com/yifuwang, https://github.com/shuqiangzhang
2024-08-28 08:12:09 +00:00
e4a5958ab5 [dynamo] Graph break on FSDP flat_param inconsistent tensor and grad dtype (#134614)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134614
Approved by: https://github.com/awgu, https://github.com/yf225
ghstack dependencies: #134610, #134590, #134621
2024-08-28 07:35:24 +00:00
e96dc3665a [raland][dynamo][exceptions] Support raise from None (#134621)
The PR was reverted because this PR traced more code and surfaced a latent bug. Resubmitting w/o any changes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134621
Approved by: https://github.com/jansel
ghstack dependencies: #134610, #134590
2024-08-28 07:35:23 +00:00
c566f2465f [dynamo][dicts] Support hasattr on dicts (#134590)
Fixes - https://github.com/pytorch/pytorch/issues/134577

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134590
Approved by: https://github.com/Skylion007
ghstack dependencies: #134610
2024-08-28 07:35:18 +00:00
880e3d18a4 [dynamo][exceptions] Use exception subclass whenever possible (#134610)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134610
Approved by: https://github.com/drisspg, https://github.com/jansel
2024-08-28 07:35:12 +00:00
bf7db4e4f9 [Inductor UT] Generalize inductor UT for intel GPU (#133309)
[Inductor UT] Generalize Inductor test case for Intel GPU.

- Reuse `test/inductor/test_decompose_mem_bound_mm.py`
- Reuse `test/inductor/test_inplacing_pass.py`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133309
Approved by: https://github.com/EikanWang, https://github.com/jansel, https://github.com/etaf
2024-08-28 06:17:43 +00:00
2ba60a1618 fix torch.prod vectorized path for bool (#128009)
Fix https://github.com/pytorch/pytorch/issues/127866.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128009
Approved by: https://github.com/jgong5, https://github.com/albanD
2024-08-28 05:27:50 +00:00
89929d9abc [AOTI][Tooling][4/n] Add torch.save() for individual intermediate tensor (#133871)
Differential Revision: D61415304

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133871
Approved by: https://github.com/ColinPeppler
2024-08-28 04:48:00 +00:00
ca77f0a986 [executorch hash update] update the pinned executorch hash (#133386)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned executorch hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133386
Approved by: https://github.com/pytorchbot
2024-08-28 04:16:42 +00:00
e3308d835d [audio hash update] update the pinned audio hash (#134632)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned audio hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134632
Approved by: https://github.com/pytorchbot
2024-08-28 04:16:25 +00:00
cyy
bb4dfe90b8 [Reland] [1/N] Fix clang-tidy warnings in inductor (#134544)
Reland #131979 and exclude aoti_torch_index_put_out changes.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134544
Approved by: https://github.com/ColinPeppler
2024-08-28 04:05:06 +00:00
71d0eff6e7 Back out "[pytorch][PR] [export] Schematize nn_module_stack serialization" (#134628)
Summary: Breaking backward compatibilities for serialization and deserialization

Differential Revision: D61888223

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134628
Approved by: https://github.com/angelayi
2024-08-28 03:45:46 +00:00
cyy
ec3f52dd27 [21/N] Fix clang-tidy warnings in jit (#134537)
Follows #133399

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134537
Approved by: https://github.com/Skylion007
2024-08-28 03:22:01 +00:00
5beb859e74 [BE] no need to print stream in comm abort (#134362)
Strictly speaking, NCCL communicator has nothing to do with CUDA streams. Thus, we don't need to print stream in comm abort's message.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134362
Approved by: https://github.com/fduwjj, https://github.com/wconstab
2024-08-28 02:14:18 +00:00
f33bcbe5fd c10d/logging: add C10D_LOCK_GUARD (#134131)
This adds logs if we can't acquire locks in NCCLUtils and ProcessGroupNCCL for 30s.

This is motivated by some deadlocks were seeing and it's unclear if it's in NCCL or on the PyTorch side of things.

This required replacing most `std::mutex` with `std::timed_mutex` and `std::condition_variable_any` as appropriate.

Test plan:

existing CI for regressions

will add unit tests on `C10D_LOCK_GUARD`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134131
Approved by: https://github.com/c-p-i-o, https://github.com/fduwjj
2024-08-28 01:40:42 +00:00
c45ca8092d Refactor caching device allocator utils (#130923)
# Motivation
Following [[RFC] Intel GPU Runtime Upstreaming for Allocator ](https://github.com/pytorch/pytorch/issues/116322), this PR aims to refactor caching device allocator utils to improve code reuse usage.
This is the first PR, we could prepare some follow-up PRs continuing to refactor the device caching allocator.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130923
Approved by: https://github.com/EikanWang, https://github.com/gujinghui, https://github.com/albanD, https://github.com/eqy
2024-08-28 01:35:23 +00:00
d96254631e [CD] Fix docker builds by installing setuptools after python build (#134631)
Follow up after https://github.com/pytorch/pytorch/pull/134595

Same error happens silently before the error addressed in the above PR (and build continues and builds invalid Docker):
```
#47 457.5 Traceback (most recent call last):
#47 457.5   File "<string>", line 1, in <module>
#47 457.5   File "/opt/_internal/cpython-3.12.0/lib/python3.12/site-packages/wheel/pep425tags.py", line 3, in <module>
#47 457.5     import distutils.util
#47 457.5 ModuleNotFoundError: No module named 'distutils'
#47 457.5 + local abi_tag=
#47 457.5 + ln -s /opt/_internal/cpython-3.12.0 /opt/python/
#47 457.5 + rm -f Python-3.12.0.tgz
```

The fix in  https://github.com/pytorch/pytorch/pull/134595 is no longer needed since we will install setuptools right after python installation.

Link: https://github.com/pytorch/pytorch/actions/runs/10584642913/job/29329366729#step:6:6041
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134631
Approved by: https://github.com/kit1980
2024-08-28 01:17:41 +00:00
2b95da7ef4 allow conv_bn mixed dtype folding in post-grad (#133968)
This PR relaxes the condition to allow conv_bn mixed dtype folding in post-grad.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133968
Approved by: https://github.com/leslie-fang-intel, https://github.com/jansel
2024-08-28 01:02:09 +00:00
f7467c3b95 using new device-agnostic api instead of old api like torch.cpu or torch.cuda (#134448)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134448
Approved by: https://github.com/guangyey, https://github.com/shink, https://github.com/albanD
2024-08-28 01:01:49 +00:00
0c7856973b [export] enumerate unsupported sympy.Functions (#134271) (#134598)
Summary:
There's 2 concepts of unsupported sympy.Functions in symbolic_shapes:
1) unsupported by the export solver, meaning the solver doesn't know how to provide useful fixes for those functions
2) unsupported by the sympy interpreter - meaning we can't reify them into FX nodes because the functions aren't present in PythonReferenceAnalysis

This splits the current call into a call for each version, with the Export solver the only user of 1). For 1), we enumerate the functions in _sympy/functions.py, and subtract the functions we know we can support. For 2) there's only 3 functions we've seen pop up in test cases.

cc jgong5 mingfeima XiaobingSuper sanchitintel ashokei jingxu10

Differential Revision: D61863394

Pulled By: pianpwk

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134598
Approved by: https://github.com/angelayi
2024-08-28 00:34:38 +00:00
3b33f26513 Add device daemon (#131814)
Base implementation aiming towards https://github.com/pytorch/rfcs/pull/64

Details of the implementation and next steps in https://github.com/pytorch/pytorch/blob/gh/albanD/3/head/test/cpp_extensions/open_registration_extension/README.md

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131814
Approved by: https://github.com/ezyang
2024-08-27 23:32:07 +00:00
d6091c8726 Add compile time instruction count metric (#133834)
PYTHONPATH=$(pwd) python benchmarks/update_hint_benchmark.py out
as of this diff, compile_time_instruction_count counts the number of instruction from within
convert_frame.compile_inner
```
update_hint_regression,compile_time_instruction_count,10522459165
```
 will add result from CI once populated.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133834
Approved by: https://github.com/aorenste
2024-08-27 23:29:02 +00:00
ef0f5919c7 [ROCm][Inductor][CK] Fix codegen after ck signature change (#134483)
MakeArgument signature was changed in https://github.com/ROCm/composable_kernel/pull/1453 adding splitK argument to universal gemm templates which are used to codegen addmm and matmul

(part of the series started at #125453 )

# Testing
`pytest test/inductor/test_ck_backend.py`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134483
Approved by: https://github.com/ColinPeppler
2024-08-27 23:25:42 +00:00
5ead965026 [export] don't duck size for DIM.AUTO (#134486)
Summary: apparently DIM.AUTO leads to duck sizing, I didn't catch this. Doing the least intrusive fix possible by using `torch._dynamo.maybe_mark_dynamic()` under the hood.

Test Plan: added test

Differential Revision: D61809344

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134486
Approved by: https://github.com/avikchaudhuri
2024-08-27 23:00:26 +00:00
30094bedbc Revert "[dynamo][dicts] Support hasattr on dicts (#134590)"
This reverts commit d23c0150f3ba5fd1162358e9e7b0e72e7308c87e.

Reverted https://github.com/pytorch/pytorch/pull/134590 on behalf of https://github.com/anijain2305 due to causing trunk CI failures ([comment](https://github.com/pytorch/pytorch/pull/134590#issuecomment-2313705582))
2024-08-27 22:52:52 +00:00
d966d91e37 [FlexAttention] Fix Sparse block multiple to ceildiv instead for floor div (#134538)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134538
Approved by: https://github.com/yanboliang
ghstack dependencies: #134507, #134511
2024-08-27 22:04:57 +00:00
f5c67917d3 [FlexAttention] Remove unused code (#134511)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134511
Approved by: https://github.com/yanboliang
ghstack dependencies: #134507
2024-08-27 22:04:57 +00:00
856a8410f2 [FlexAttention] Create new variables for the subgraphs (#134507)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134507
Approved by: https://github.com/yanboliang, https://github.com/BoyuanFeng
2024-08-27 22:04:57 +00:00
41e512a4cd [EZ] Restore test_unicode_comments (#134589)
This reverts changes introduced by test_jit.py by 43737bd78a and adds lint suppression for this it

As test name suggests it should have an unicode comment to make sure our parser can handle it

Part of the fix for https://github.com/pytorch/pytorch/issues/134422
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134589
Approved by: https://github.com/aorenste, https://github.com/Skylion007
2024-08-27 21:51:06 +00:00
1ba39ec1d0 Add test case test_arange_length_with_float32_dtype (#134415)
Adding a test as a followup from https://github.com/pytorch/pytorch/pull/134296

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134415
Approved by: https://github.com/ezyang
2024-08-27 21:36:23 +00:00
b58a0c3c4d [split build] fix distributed problems (#134502)
Should fix the issue where USE_C10D_NCCL was not getting propagated to libtorch_python.so
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134502
Approved by: https://github.com/yifuwang
2024-08-27 21:12:58 +00:00
289486d007 Move attention kernels back from fake_impls to meta_registrations (#134288)
See #121528 for additional context.

In #120682, we moved the attention kernels from meta_registrations to fake_impls with the intent of fixing the device handling for seed/offset: these are typically on CPU. We needed to put the registrations in fake_impls to do this because meta_registrations doesn't have a way to specify device, whereas fake_impls does. But when we tried to actually fix the device types (#120839), we had to revert the PR because it broke cudagraph handling (during which seed/offset _are_ on CUDA).

Now, we want to put the registrations back in meta_registrations so that we can call these kernels with meta tensors. The use case is later in this stack - we want to be able to use the flop counter with these kernels.

Also - I specifically skip the `compare_tensor_meta()` check in test_fake / test_fake_autocast tests for the `_efficient_attention_forward` and `_flash_attention_forward` kernels, which fails because of the device mismatch from the seed/offset tensors. Then we can un-skip these opinfos. I verified that the efficient_attention_forward bug (#120842) is now caught by these opinfos if I revert the fix from this PR.

Differential Revision: [D61687369](https://our.internmc.facebook.com/intern/diff/D61687369)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134288
Approved by: https://github.com/drisspg
2024-08-27 21:10:36 +00:00
39ca96398b Update label_to_label with oncall: pt2 hierarchy. (#134582)
Test Plan:
- None
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134582
Approved by: https://github.com/clee2000
2024-08-27 21:05:40 +00:00
cyy
b567ca0f51 Remove unused imported names in python files (#134438)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134438
Approved by: https://github.com/zou3519
2024-08-27 20:44:04 +00:00
d23c0150f3 [dynamo][dicts] Support hasattr on dicts (#134590)
Fixes - https://github.com/pytorch/pytorch/issues/134577

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134590
Approved by: https://github.com/Skylion007
ghstack dependencies: #134039
2024-08-27 20:43:40 +00:00
16b8146c9e Exclude test_transformers and unit tests which require recent GPU arch (#132895)
This PR is to exclude test_transformers on ROCm temporarily and skip some unit tests which require recent GPU arch.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132895
Approved by: https://github.com/jithunnair-amd, https://github.com/pruthvistony, https://github.com/malfet
2024-08-27 20:40:53 +00:00
44dadf2506 [Fix] Check name when registering privateuse1 backend (#134071)
do some checks when registering privateuse1 backend to avoid using in-tree deivce names

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134071
Approved by: https://github.com/albanD
2024-08-27 20:28:30 +00:00
f754c0ae1b [easy] rm duplicate definition for inductor in TORCH_LOGS documentation (#134480)
already defined in
2eb9339b71/torch/_logging/_internal.py (L286-L287)

Test Plan: Sandcastle run

Differential Revision: D61806088

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134480
Approved by: https://github.com/eellison, https://github.com/mlazos
2024-08-27 20:15:10 +00:00
fe6d0e3a04 Do not compute unnecessary tensor!=0 for bool tensors in count_nonzero (#134254)
Updated aten/src/ATen/native/TensorAdvancedIndexing.cpp to only reduce non-bool tensors before computing a sum

Since I have no expertise for MPS, I did leave the MPS backend untouched. Also, in `count_nonzero_impl` for CPU, I assumed the comparison can be optimized by the compiler for boolean values? 90c821814e/aten/src/ATen/native/TensorAdvancedIndexing.cpp (L2262-L2264) Fixes #133983

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134254
Approved by: https://github.com/albanD
2024-08-27 20:09:29 +00:00
b744ed6816 Add a cpu_dispatch_key parameter to the cpu_fallback function (#134321)
Fixes #134322
Add a cpu_dispatch_key parameter to the cpu_fallback function to support fallback, for example, to SparseCPU.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134321
Approved by: https://github.com/albanD
2024-08-27 19:57:57 +00:00
adf401f822 Links to contributors' GitHub accounts (#133787)
Maintainers have the links to their GitHub profiles, but the major contributors do not have them.
I added the links to the contributors' GitHub accounts in case anyone wants to follow them.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133787
Approved by: https://github.com/albanD
2024-08-27 19:56:08 +00:00
534f43ddce [Doc] Fix rendering of the unicode characters (#134597)
https://github.com/pytorch/pytorch/pull/124771 introduced unicode escape sequences inside raw strings, which were not rendered correctly. Also fix typo in `\uue0 ` escape sequence (should have been `\u00e0`)
Fix it by relying on [string literal concatenation](https://docs.python.org/3/reference/lexical_analysis.html#string-literal-concatenation) to join raw and regular strings together during lexical analysis stage

Fixes https://github.com/pytorch/pytorch/issues/134422

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134597
Approved by: https://github.com/aorenste, https://github.com/Skylion007
2024-08-27 19:52:46 +00:00
3ef4c27ab3 Update pt2e numeric debugger to use node.meta["custom"] field (#134040)
Summary:
With https://github.com/pytorch/pytorch/pull/131912 we now have a "custom" field in node.meta that can be preserved
in

* copy/deepcopy
* run_decompositions()
* serialization
* re-exporting

So we refactored numeric debugger to use this.

Test Plan:
python test/test_quantization.py TestNumericDebugger

Reviewers:

Subscribers:

Tasks:

Tags:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134040
Approved by: https://github.com/tarun292
2024-08-27 19:51:03 +00:00
ed494603c7 [inductor] calibration inductor windows uts (16/N) (#134587)
skip UT for `test/inductor/test_compiled_autograd.py`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134587
Approved by: https://github.com/jansel
2024-08-27 19:45:02 +00:00
b094972051 [inductor] calibration inductor windows uts (17/N) (#134588)
skip UTs for `test/inductor/test_minifier_isolate.py`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134588
Approved by: https://github.com/jansel
2024-08-27 19:41:17 +00:00
9d0e0e6f1d [inductor] calibration inductor windows uts (14/N) (#134585)
skip UT for `test/dynamo/test_exc.py`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134585
Approved by: https://github.com/jansel
2024-08-27 19:40:56 +00:00
05ac7cd760 [MPS] Remove superfluous label/link (#134090)
This was probably intended to be a comment. I removed it since the issue is already linked in the warning below.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134090
Approved by: https://github.com/albanD
2024-08-27 19:37:33 +00:00
d5aefadb17 [CD] Fix docker builds by installing setuptools (#134595)
Seeing failures like this:
```
#49 844.6 //build_scripts/manylinux1-check.py:6: DeprecationWarning: The distutils package is deprecated and slated for removal in Python 3.12. Use setuptools or check PEP 632 for potential alternatives
.....
[python 3/3] RUN bash build_scripts/build.sh && rm -r build_scripts:
846.9 ...it did, yay.
846.9 + for PYTHON in '/opt/python/*/bin/python'
846.9 + /opt/python/cpython-3.12.0/bin/python build_scripts/manylinux1-check.py
847.0 Traceback (most recent call last):
847.0   File "//build_scripts/manylinux1-check.py", line 55, in <module>
847.0     if is_manylinux1_compatible():
847.0        ^^^^^^^^^^^^^^^^^^^^^^^^^^
847.0   File "//build_scripts/manylinux1-check.py", line 6, in is_manylinux1_compatible
847.0     from distutils.util import get_platform
847.0 ModuleNotFoundError: No module named 'distutils'
------
```
PR: https://github.com/pytorch/pytorch/pull/134455

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134595
Approved by: https://github.com/kit1980, https://github.com/seemethere, https://github.com/malfet
2024-08-27 19:31:44 +00:00
a4b44dd2ef [AOTI] Introduce DeferredCudaGridLine for cuda cpp wrapper (#129268)
Summary: Similar to https://github.com/pytorch/pytorch/pull/129135, use DeferredCudaGridLine to create a deferred grid computation line when generating cpp wrapper.

Differential Revision: [D61800622](https://our.internmc.facebook.com/intern/diff/D61800622)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129268
Approved by: https://github.com/angelayi
2024-08-27 19:23:25 +00:00
5fd670e0ef [ROCM] Properly disable Flash Attention/Efficient Attention with environment variables (#133866)
Now `USE_FLASH_ATTENTION=0 USE_MEM_EFF_ATTENTION=0 python setup.py` can compile correctly

Fixes #125230

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133866
Approved by: https://github.com/jithunnair-amd, https://github.com/jeffdaily, https://github.com/malfet
2024-08-27 18:24:29 +00:00
5b392d22c6 Revert "fix stuck floordiv (#134150)"
This reverts commit 92c4771853892193d73d87bd60eca4dc7efc51d8.

Reverted https://github.com/pytorch/pytorch/pull/134150 on behalf of https://github.com/anijain2305 due to compile time regression internal ([comment](https://github.com/pytorch/pytorch/pull/134150#issuecomment-2313230404))
2024-08-27 18:23:44 +00:00
0159ebb654 [dtensor] add test for local_map decorator (#127752)
**Summary**
This PR is a follow-up of #126924 to address reviewer's comments:
1) add a test case to show the use of `local_map` as a function decorator.
2) simplify the logic of handling different data types of `out_placements`.
3) correct variable naming in test cases to match math formulas.

**Test**
see #126924

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127752
Approved by: https://github.com/wanchaol
2024-08-27 18:22:23 +00:00
8de0d7690c Use newer toAccumulateType signature in Normalization.cpp (#134540)
Which fixes BatchNorm behavior for if called with empty tensors on MPS backed. Removed `expectedFailureMPS` in test_nn.py, deleted expected failure in `test_mps.py` and adjusted `skipIfMPS` to `expectedFailureMPS`  in BatchNorm2d OpInfo decorator, but restrict it only to the memory format tests

Test Plan: CI + `python3 -c "import torch; print(torch.nn.BatchNorm2d(3, device='mps')(torch.rand(0, 3, 2, 2, device='mps')))"`

Fixes https://github.com/pytorch/pytorch/issues/134423

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134540
Approved by: https://github.com/Skylion007, https://github.com/albanD
2024-08-27 18:09:20 +00:00
68b1a09422 Integrate device agnostic APIs in FSDP library [1/n] (#134337)
Summary: For MTIA FSDP support, we need to ensure the FSDP library code handles accelerator devices not limited to CUDA.

Test Plan: CI

Reviewed By: hanzlfs

Differential Revision: D60587415

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134337
Approved by: https://github.com/LucasLLC, https://github.com/awgu
2024-08-27 17:31:11 +00:00
13049cd6e5 [aotinductor][UserDefinedTritonKernel] fix case with non-constexpr params declared after autotuned params (#134520)
## Context
In some user Triton kernels, we have this set-up for whatever reason.
```
@triton.jit
def mykernel(
  param0,
  param1,
  param2,
  param3: tl.constexpr,   # autotuned
  param4,                 # non-constexpr
):
  ...
```

This is an edge case because it's a general practice to declare all constexprs params at the end.

And this will be an issue for AOTI because it fails to codegen all 4 params. That will surface as a device-side error: CUDA IMA, invalid argument...

```
>     void* kernel_args_var_0[] = {&var_0, &var_1, &var_2};
---
<     CUdeviceptr var_3;
<     AOTI_TORCH_ERROR_CODE_CHECK(aoti_torch_get_data_ptr(buf0, reinterpret_cast<void**>(&var_3)));
<     void* kernel_args_var_0[] = {&var_0, &var_1, &var_2, &var_3};
```

## Root-cause
* `kernel.constexpr` from the Kernel side-table contains the indices for all `constexpr` params that includes autotuned params.
* `raw_args`, that gets passed to wrapper codegen, excludes autotuned args.
* In the wrapper codegen, we try to find non-constexpr args using `kernel.constexpr` & `raw_args`. This is okay unless there's a `raw_arg` after an autotuned param in the function signature.

79b7fff188/torch/_inductor/codegen/cpp_wrapper_cuda.py (L118-L126)

## Fix
We try to fix this, by calculating the right constexprs wrt `raw_args`.

An illustration
```
         raw_args: [arg0, arg1, arg2, arg4]
 kernel.arg_names: [param0, param1, param2, param3, param4]
kernel.constexprs: [3]                      # param3 is autotuned; this is correct wrt kernel.arg_names
constexpr_indices: []                       # this is correct wrt raw_args
```

Differential Revision: [D61831625](https://our.internmc.facebook.com/intern/diff/D61831625)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134520
Approved by: https://github.com/oulgen
2024-08-27 17:20:27 +00:00
13114da4ef [3/N] Set correct device to CUDA guards (#134357)
In `collective()`, `pointToPoint()` and `collectiveCoalesced()`, CUDA guards were created with an unset (default) CUDA device. This is the reason for the IMA facing the NaN checker in issue https://github.com/pytorch/pytorch/issues/134062.

With this fix, `torch.cuda.set_device(device)` is not needed to work around the IMA.

Also refactored a couple places where the guard is created -- preferably we create the guard with a known device, rather than setting the device later.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134357
Approved by: https://github.com/wconstab, https://github.com/shuqiangzhang
ghstack dependencies: #134300, #134345
2024-08-27 16:38:15 +00:00
be7752ead3 [2/N] Add flag to control which rank should perform NaN check (#134345)
Fixes https://github.com/pytorch/pytorch/issues/134062.
For example, in case of broadcast / scatter, only the root rank should perform the NaN check.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134345
Approved by: https://github.com/shuqiangzhang, https://github.com/wconstab
ghstack dependencies: #134300
2024-08-27 16:33:59 +00:00
9dc4bd7466 Create a JustknobConfig for use in config (#134161)
This is designed to be a more ergonomic interface on top of justknob_feature (see https://github.com/pytorch/pytorch/pull/134151 for just the PR with the base commits).

The idea is that people stop having to think about this as much, and can just do JustkobsConfig("//the:thing", "FORCE_THING") and it'll do the right thing.

Primarily sending this to see how people feel about the API, and using it for new config changes.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134161
Approved by: https://github.com/ezyang
2024-08-27 16:07:33 +00:00
94caba4899 [1/N] Move NaN check onto NCCL stream (#134300)
So that the tensor's lifetime management is the same as the management built for the NCCL, pre and post kernels.
Also so that on visualizers, they show up in the NCCL stream line. Otherwise if they show up in the compute line, user may get confused (my code does not have these kernels).

The check is thus moved after the point where we depend NCCL stream from the last compute kernel.

Also moved declaration of `checkForNan` from Utils.hpp to NCCLUtils.hpp, and renamed Utils.cu to NCCLUtils.cu.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134300
Approved by: https://github.com/shuqiangzhang, https://github.com/wconstab
2024-08-27 16:02:27 +00:00
c582602245 Update partitioner's is_fusible heuristic to respect triton kernels (#134491)
mutated arguments to triton kernels are fusible into the triton kernel.

Test Plan:
- new test

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134491
Approved by: https://github.com/Chillee
ghstack dependencies: #134364, #134466, #134490
2024-08-27 15:57:32 +00:00
761cf91e3c [DeviceMesh] Add get_all_submeshes in _MeshEnv (#134275)
Adding a private helper method for Shampoo HSDP use cases.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134275
Approved by: https://github.com/XilunWu
2024-08-27 14:51:19 +00:00
d028b810fe Fix flaky GroupNorm ModuleInfo test (#133899)
Fixes https://github.com/pytorch/pytorch/issues/98677

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133899
Approved by: https://github.com/albanD
2024-08-27 14:45:51 +00:00
2033934ff8 Clarify error messages for NEWOBJ and BUILD in weights_only unpickler (#134346)
Clarify that `add_safe_globals` will allow types for these instructions

Some types do not appear as `GLOBAL` and are only caught in `BUILD`, example from hf slack is `numpy.dtypes.UInt32DType`

```python
import torch
import numpy as np
from tempfile import TemporaryDirectory
from pathlib import Path
from codecs import encode

torch.serialization.add_safe_globals([encode, np.dtype, np.core.multiarray._reconstruct, np.ndarray])

with TemporaryDirectory() as tempdir:
    p = Path(tempdir)
    r2 = np.random.get_state()
    torch.save(r2, p / "r2.pkl")
    torch.load(p / "r2.pkl", weights_only=True)
```

Yields (error comes from BUILD)
```
UnpicklingError: Weights only load failed. Re-running `torch.load` with `weights_only` set to `False` will likely succeed, but it can result in arbitrary code execution. Do it only if you got the file from a trusted source.
 Please file an issue with the following so that we can make `weights_only=True` compatible with your use case: WeightsUnpickler error: Can only build Tensor, parameter or OrderedDict objects, but got <class 'numpy.dtypes.UInt32DType'>
```

The reasoning is that `numpy.dtypes.UInt32DType` is constructed via `REDUCE` with `func =<class 'numpy.dtype'>` and `args= ('u4', False, True)`, clarify the error message that doing `add_safe_globals` on these will also allow them

After this PR error message becomes

```
_pickle.UnpicklingError: Weights only load failed. Re-running `torch.load` with `weights_only` set to `False` will likely succeed, but it can result in arbitrary code execution. Do it only if you got the file from a trusted source.
Please file an issue with the following so that we can make `weights_only=True` compatible with your use case: WeightsUnpickler error: Can only build Tensor, Parameter, OrderedDict or types allowlisted via `add_safe_globals`, but got <class 'numpy.dtypes.UInt32DType'>
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134346
Approved by: https://github.com/albanD
2024-08-27 14:45:39 +00:00
2ac710e667 Make torch.serialization.set_default_mmap_options usable as a context manager (#134371)
As title

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134371
Approved by: https://github.com/albanD
2024-08-27 14:45:29 +00:00
0fa0ac80e4 Do not use <filesystem> on Linux (#134494)
Because right now it leads to symbol conflict from binary builds.
Use of `std::filesystem::file_exists` was introduced by https://github.com/pytorch/pytorch/pull/126601 and in this PR it is replaced with a very straightforward implementation that calls `stat` on the given path, which is a classic C-way of checking for the file existence.

This PR should be reverted once one figures out how to keep `std::filesystem` methods linked into the binary private

Fixes symptoms of https://github.com/pytorch/pytorch/issues/133437

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134494
Approved by: https://github.com/atalman, https://github.com/d4l3k
2024-08-27 14:44:10 +00:00
3418708abf Revert "[FlexAttention] Create new variables for the subgraphs (#134507)"
This reverts commit 4d0a44d34a46af6dcc764d55269b30ac537822a0.

Reverted https://github.com/pytorch/pytorch/pull/134507 on behalf of https://github.com/albanD due to Broke lint due to too long line ([comment](https://github.com/pytorch/pytorch/pull/134507#issuecomment-2312505955))
2024-08-27 13:05:27 +00:00
87a3f664e1 Revert "[FlexAttention] Remove unused code (#134511)"
This reverts commit 767c47d3c0ee3fc7804918a08de3f94874143a03.

Reverted https://github.com/pytorch/pytorch/pull/134511 on behalf of https://github.com/albanD due to Broke lint due to too long line ([comment](https://github.com/pytorch/pytorch/pull/134507#issuecomment-2312505955))
2024-08-27 13:05:27 +00:00
3e10a1eb5a Revert "[FlexAttention] Fix Sparse block multiple to ceildiv instead for floor div (#134538)"
This reverts commit a34320a6f225061a3b5fe130a5a8fe35ed7a40f9.

Reverted https://github.com/pytorch/pytorch/pull/134538 on behalf of https://github.com/albanD due to Broke lint due to too long line ([comment](https://github.com/pytorch/pytorch/pull/134507#issuecomment-2312505955))
2024-08-27 13:05:27 +00:00
c7cbcdad76 Update partitioner's is_fusible heuristic to respect auto_functionalized (#134490)
We say Node a is fusible into node b if node b is an auto_functionalized
node that may reinplace node a later on.

This PR also changes aten.empty to be recomputable w.r.t the Partitioner
(it is, like aten.zeros, cheap to recompute and fusible into other ops).

Fixes https://github.com/pytorch/pytorch/issues/134468

Test Plan:
- new test

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134490
Approved by: https://github.com/Chillee
ghstack dependencies: #134364, #134466
2024-08-27 13:05:01 +00:00
dde5974b13 Implementation for rng ops on hpu and xpu (#133068)
implementation for high_order_op::run_and_save_rng_state and high_order_op::run_with_rng_state on hpu

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133068
Approved by: https://github.com/jgong5, https://github.com/EikanWang, https://github.com/jansel, https://github.com/anijain2305
2024-08-27 11:34:37 +00:00
FEI
ef8236f12b Provide default value None for the attn_bias parameter(#133981) (#133986)
Fixes #133981

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133986
Approved by: https://github.com/ezyang
2024-08-27 11:10:43 +00:00
a34320a6f2 [FlexAttention] Fix Sparse block multiple to ceildiv instead for floor div (#134538)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134538
Approved by: https://github.com/yanboliang
ghstack dependencies: #134495, #134507, #134511
2024-08-27 09:53:19 +00:00
767c47d3c0 [FlexAttention] Remove unused code (#134511)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134511
Approved by: https://github.com/yanboliang
ghstack dependencies: #134495, #134507
2024-08-27 09:53:19 +00:00
4d0a44d34a [FlexAttention] Create new variables for the subgraphs (#134507)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134507
Approved by: https://github.com/yanboliang, https://github.com/BoyuanFeng
ghstack dependencies: #134495
2024-08-27 09:53:13 +00:00
f480385277 Remove explicit Amz2023 reference from jobs (#134355)
Changes jobs to go back to using the default AMI.

Note: This is only a cleanup PR. It does NOT introduce any behavior changes in CI

Now that the default variant uses the Amazon 2023 AMI and has been shown to be stable for a week, it's time to remove the explicit amz2023 references and go back to using the default variant.

After a week or two, when this is rolled out to most people, we can remove the variants from scale config as well.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134355
Approved by: https://github.com/jeanschmidt
2024-08-27 08:51:42 +00:00
0916d72e99 Fix the warning for cat operators with same qparams (#133999)
Summary:
Currently the warning is printed when the cat inputs have same qparam, leading to a flood of warnings.
This diff emits the warning only when cat inputs don't have the same qparam.

Test Plan: CI

Reviewed By: aprotopopov

Differential Revision: D60638609

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133999
Approved by: https://github.com/tarun292
2024-08-27 08:21:39 +00:00
3515090006 Fix TypeError when itering NoneType in instantiate_device_type_tests() (#134457)
Fixes #134454

Fix TypeError introduced by https://github.com/pytorch/pytorch/pull/133082, which uses iter for NoneType of default args ``except_for`` and ``only_for``.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134457
Approved by: https://github.com/shink, https://github.com/albanD
2024-08-27 07:13:36 +00:00
136b19b062 Adding entry-point based support for out-of-tree rendezvous plugins (#132633)
Fixes #127519

Currently in torchrun rendezvous, there are only two rendezvous backends supported out of the box: `C10d` and `Etcd`. The changes in this PR enables the distributed elastic users to bring their out-of-tree rendezvous backend implementations as Python packages.

#### AUTHORING NEW PLUGIN
Any new plugin will be a python package exposing entry-points. For example, the structure of redis plugin is as follows:

```
plugin_root
|_ pyproject.toml
|_ src
   |_ redis
      |_ __init__.py
      |_ redis_store.py
      |_ redis_backend.py
```

The contents of the `pyproject.toml` should indicate that this is exposes a torchrun entry-point by mentioning the group name `torchrun.plugins`. The `pyproject.toml` for redis plugin would be as follows:

```
[project]
name = "redis"
version = "0.0.1"

[project.entry-points.'torchrun.plugins']
redis = 'redis'
```

The `src/redis/__init__.py` file would contain functions that return the plugin name and plugin handler. The contents of `__init__.py` for redis would be as follows:

```
def getPluginHandler():
    def _create_redis_handler(params: RendezvousParameters):
        from redis_rendezvous_backend import create_backend
        backend, store = create_backend(params)
        return create_handler(store, backend, params)
    return _create_redis_handler
```

The files `redis_store` and `redis_backend` contain the implementation of [Store](41189b0da4/torch/_C/_distributed_c10d.pyi (L171)) and [RendezvousBackend](e782918b8e/torch/distributed/elastic/rendezvous/dynamic_rendezvous.py (L61)) respectively.

#### USER EXPERIENCE
Before using the plugin for the first time, the user has to install the plugin packages. For example, the published packages can be installed using `pip3 install <plugin-name>` and the plugin is in local file systemcan be installed using `pip3 install -e <plugin-location>`.

Once installed, the new backend can be used in torchrun as follows:

```
torchrun --rdzv-backend=redis --rdzv-endpoint=redis-container:6379 --nnodes=3 --nproc-per-node=1 --max-restarts=3 --rdzv-id=1 test.py
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132633
Approved by: https://github.com/wconstab
2024-08-27 07:09:41 +00:00
4a18fcf7af [inductor] calibration inductor windows uts (12/N) (#134428)
enable Windows inductor UTs for `test/inductor/test_torchinductor_codegen_dynamic_shapes.py`

Failed by depends on https://github.com/pytorch/pytorch/pull/134429, need to rebase after https://github.com/pytorch/pytorch/pull/134429 merged.
```cmd
2024-08-25T23:57:23.2747794Z Windows CI does not have necessary dependencies for test_torchinductor_dynamic_shapes yet
2024-08-25T23:57:23.2748541Z Traceback (most recent call last):
2024-08-25T23:57:23.2749593Z   File "C:\actions-runner\_work\pytorch\pytorch\test\inductor\test_torchinductor_codegen_dynamic_shapes.py", line 30, in <module>
2024-08-25T23:57:23.2750688Z     from inductor.test_torchinductor_dynamic_shapes import (
2024-08-25T23:57:23.2751877Z   File "C:\actions-runner\_work\pytorch\pytorch\test\inductor\test_torchinductor_dynamic_shapes.py", line 46, in <module>
2024-08-25T23:57:23.2752876Z     raise unittest.SkipTest("requires sympy/functorch/filelock")
2024-08-25T23:57:23.2753545Z unittest.case.SkipTest: requires sympy/functorch/filelock
2024-08-25T23:57:23.2754077Z Got exit code 1
2024-08-25T23:57:23.2754874Z No stepcurrent file found. Either pytest didn't get to run (e.g. import error) or file got deleted (contact dev infra)
```

Local test pass:
<img width="1892" alt="image" src="https://github.com/user-attachments/assets/241ab082-6026-4f33-b3ac-7e9ef7da744d">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134428
Approved by: https://github.com/jansel
2024-08-27 05:43:07 +00:00
0b81f700aa [PT2/Profiler] Add Context Info to Torch-Compiled Regions (#132765)
Summary:
We want to add compile IDs and frames to each Torch-Compiled Region in order to help users cross reference the section they are checking alongside data obtained from tools, such as tlparse.
This diff operates on the assumption that each graph section will enter and exit a CompileContext before it is ran to either compile the graph or look it up in the cache. Based on this assuption, we can save the value of the graph section from the exited CompileContext in eval_frame.c using a Python C API. After this, we can create a new interface in cpp shim to wrap around the record_function in order to pass in the new keyword argument for "context".

Test Plan:
Enhance test_profiler_dynamo_compiled_region to look for kwinputs as well as a name to see that the context is now labeled. Also changed test to run graph with more contexts so that we test a wider range of profiling.

Differential Revision: D60803317

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132765
Approved by: https://github.com/anijain2305
2024-08-27 04:55:04 +00:00
de57a6e806 Back out "[dynamo][exception] Support raise exception from None (#134028)" (#134513)
Summary:
The original diff is causing the error "attempting to assign a gradient with dtype 'c10::BFloat16' to a tensor with dtype ‘float".

The context is in: https://fb.workplace.com/groups/1075192433118967/permalink/1491357138169159/

Test Plan: After reverting, the above issue is gone, details are in https://fb.workplace.com/groups/1075192433118967/permalink/1491357138169159/

Differential Revision: D61820520

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134513
Approved by: https://github.com/anijain2305
2024-08-27 02:57:14 +00:00
02b0b524b5 [inductor] Turn on UT: test_randint_int64_mod (#134510)
It fixed by https://github.com/pytorch/pytorch/pull/134229, turn on it.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134510
Approved by: https://github.com/ezyang
2024-08-27 02:33:07 +00:00
d0147290d8 [BE][Easy][dynamo] ensure trace_rules.MOD_INLINELIST in alphabetical order (#134246)
Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at bottom):

* __->__ #134246
* #133987

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134246
Approved by: https://github.com/yanboliang
2024-08-27 02:29:43 +00:00
cyy
2ee201a7d0 [CMake] Remove BUILDING_WITH_TORCH_LIBS (#134434)
Since BUILDING_WITH_TORCH_LIBS is not used now.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134434
Approved by: https://github.com/ezyang
2024-08-27 01:48:21 +00:00
bdfc1d3987 Remove unnecessary expect_true in split_with_sizes (#133439)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133439
Approved by: https://github.com/albanD
2024-08-27 01:34:00 +00:00
c7ca89a11a Improve print stack/locals printing in comptime (#133651)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133651
Approved by: https://github.com/anijain2305
2024-08-27 01:29:50 +00:00
58771315d3 Unify lowerings for auto_functionalized and triton_kernel_wrapper_functional (#134466)
Fixes https://github.com/pytorch/pytorch/issues/134372

The triton_kernel_wrapper_functional lowering was causing problems (it
was generating small kernels with nans in it, probably from realizing
aten.empty nodes. Instead of having its own manual lowering, we change
triton_kernel_wrapper_functional to go the same route as
auto_functionalized where we decompose the node into clone + mutation
nodes.

Test Plan:
- new test
- existing tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134466
Approved by: https://github.com/oulgen, https://github.com/eellison
ghstack dependencies: #134364
2024-08-27 00:53:05 +00:00
141a9c7204 Revert "[export] enumerate unsupported sympy.Functions (#134271)"
This reverts commit ddd71e34797f3bb56a048058e007a2df87c5755f.

Reverted https://github.com/pytorch/pytorch/pull/134271 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/134271#issuecomment-2311353460))
2024-08-27 00:45:00 +00:00
4df10a6340 [FlexAttention] Fix bug when checking whether to return LSE (#134495)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134495
Approved by: https://github.com/yanboliang, https://github.com/Chillee, https://github.com/BoyuanFeng
2024-08-27 00:31:46 +00:00
b98d33c155 [inductor] calibration inductor windows uts (13/N) (#134429)
enable Windows inductor UTs for `test/inductor/test_torchinductor_dynamic_shapes.py`

Local test pass:
<img width="1885" alt="image" src="https://github.com/user-attachments/assets/4b96b6d9-715f-4c94-8059-9ee0afaaa574">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134429
Approved by: https://github.com/jansel
2024-08-27 00:16:16 +00:00
74341e1150 [dynamo] simplify implementation for os.fspath (#133801)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133801
Approved by: https://github.com/anijain2305
ghstack dependencies: #133771
2024-08-27 00:08:04 +00:00
1dbd3476de [dynamo][itertools] support itertools.tee (#133771)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133771
Approved by: https://github.com/jansel
2024-08-27 00:08:04 +00:00
43bbd781f2 Back out "[Traceable FSDPS] Allow tracing through FSDP2 impl in trace_rules.py (#133532)" (#134478)
Summary:
Original commit changeset: 0215a41433e9

Original Phabricator Diff: D61432583

D61432583 causes FSDP2 stuck in PT2 compilation when applied to FB-FM-v4.

With D61432583:
https://www.internalfb.com/mast/job/aps-ckluk-745e763d6a

After backing out D61432583:
https://www.internalfb.com/mast/job/aps-ckluk-f9604ea1f9

Test Plan:
hg graft D61774888
scripts/ckluk/aps/mast_joint_arch_exploration_cmf_updated_fbfm_v3_fsdp2_qps.sh

Differential Revision: D61802689

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134478
Approved by: https://github.com/yf225
2024-08-27 00:07:28 +00:00
46ecc673ae [ROCm] Prevent accidental enablement of efficient attention. (#133331)
Currently Efficient attention and Flash attention share the same set of GPU
kernels on ROCM and have common limitations on head sizes.

Fixes https://github.com/pytorch/pytorch/issues/132004

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133331
Approved by: https://github.com/malfet, https://github.com/jithunnair-amd
2024-08-27 00:03:45 +00:00
0be6584203 [Inductor UT] Refine test case test_codegen_upcast_to_fp32_upcast to pass on XPU. (#134474)
[Inductor UT] Refine test case test_codegen_upcast_to_fp32_upcast to pass on XPU.
Fix issue: #134476

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134474
Approved by: https://github.com/jansel
2024-08-26 23:59:26 +00:00
1565940114 [MPS] Add test/test_nn.py to test suite (#134184)
This PR increases test coverage by including the tests in `test/test_nn.py` in the test suite of MPS.

Some of the tests are decorated with `@expectedFailureMPS` for various reasons. Either that the op is not implemented, or that the outputs do not align. Those tests that contain differing results should be investigated further to rule out any live bugs.

```bash
$ python test/run_test.py --mps --verbose -k TestNN
Running test batch 'tests to run' cost 84.76 seconds
```

Ref #133520

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134184
Approved by: https://github.com/albanD, https://github.com/malfet
2024-08-26 23:48:23 +00:00
79b7fff188 Fix docstring for torch.signal.windows.nuttall (#134512)
This partially fixes regression introduced by https://github.com/pytorch/pytorch/pull/124771 but also just improves `z_n` rendering, by using MathML
In 2.3 it was [rendered](https://pytorch.org/docs/2.3/generated/torch.signal.windows.nuttall.html#torch.signal.windows.nuttall)
as
<img width="177" alt="image" src="https://github.com/user-attachments/assets/2c15d1f9-13ad-483f-bb66-41fa3fa4ba9c">

With this change it'll be [rendered](https://docs-preview.pytorch.org/pytorch/pytorch/134512/generated/torch.signal.windows.nuttall.html#torch.signal.windows.nuttall) as
```math
z_n = \frac{2 \pi n}{M}
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134512
Approved by: https://github.com/kit1980, https://github.com/aorenste, https://github.com/atalman
2024-08-26 22:51:43 +00:00
ddd71e3479 [export] enumerate unsupported sympy.Functions (#134271)
There's 2 concepts of unsupported sympy.Functions in symbolic_shapes:
1) unsupported by the export solver, meaning the solver doesn't know how to provide useful fixes for those functions
2) unsupported by the sympy interpreter - meaning we can't reify them into FX nodes because the functions aren't present in PythonReferenceAnalysis

This splits the current call into a call for each version, with the Export solver the only user of 1). For 1), we enumerate the functions in _sympy/functions.py, and subtract the functions we know we can support. For 2) there's only 3 functions we've seen pop up in test cases.

Differential Revision: D61677956

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134271
Approved by: https://github.com/avikchaudhuri
2024-08-26 22:44:12 +00:00
55236d0cb7 TestForeach::test_parity: Remove check for error message text (#134251)
Previously, error messages were expected to be string equivalent to
error messages thrown by the ref function.  This check fails for dozens
of torch functions, and doesn't appear to add much value for the end
user.  This commit removes this check.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134251
Approved by: https://github.com/amjames, https://github.com/janeyx99
ghstack dependencies: #134253, #134344
2024-08-26 22:40:54 +00:00
ef8c474fcf Add the fast path for bfloat16 lgamma (#134344)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134344
Approved by: https://github.com/amjames, https://github.com/janeyx99
ghstack dependencies: #134253
2024-08-26 22:40:54 +00:00
3c5883e550 Fix test_parity xfail for sigmoid (#134253)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134253
Approved by: https://github.com/amjames, https://github.com/janeyx99
2024-08-26 22:40:54 +00:00
a23dae22d5 Update AC pass use_reentrant message (#134472)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134472
Approved by: https://github.com/albanD
2024-08-26 21:57:38 +00:00
dbef2b05b4 [dynamo] Cache _dynamo.disable results (#134272)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134272
Approved by: https://github.com/yf225, https://github.com/jansel
2024-08-26 21:04:15 +00:00
28a4db84f2 [ARM] Fix infinite recursion in unwind (#134387)
Fixes #119905

The `TORCH_SHOW_CPP_STACKTRACES=1` setting on ARM causes infinite recursive unwind because on failure a `StackTraceFetcher` attempts to unwind the <ins>failed instruction</ins>: 5ad759ca33/torch/csrc/profiler/combined_traceback.cpp (L25)
then the unwind itself fails:
5ad759ca33/torch/csrc/profiler/unwind/unwind.cpp (L10-L12)
and it causes another attempt to unwind the failure in `unwind()`...

In summary, the executed instruction is equivalent to:
```C++
std::vector<void*> unwind() {
  // some instructions ...
  return unwind();
}
```
This PR replaces `TORCH_CHECK` by `TORCH_WARN_ONCE` as it will not cause an uncontrolled recursion. The only side effect would be an empty back-trace.

Huge thanks to @nWEIdia who found the root cause!

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134387
Approved by: https://github.com/eqy, https://github.com/nWEIdia, https://github.com/malfet
2024-08-26 21:02:31 +00:00
900c5083ed [inductor] calibration inductor windows uts (9/N) (#134425)
enable Windows inductor UTs of `test/inductor/test_binary_folding.py`

Failed UT depends on https://github.com/pytorch/pytorch/pull/134427
Need to rebase after https://github.com/pytorch/pytorch/pull/134427 merged.
```cmd
2024-08-25T23:32:23.0905727Z Traceback (most recent call last):
2024-08-25T23:32:23.0906516Z   File "C:\actions-runner\_work\pytorch\pytorch\test\inductor\test_binary_folding.py", line 18, in <module>
2024-08-25T23:32:23.0908200Z     from inductor.test_inductor_freezing import TestCase
2024-08-25T23:32:23.0909883Z   File "C:\actions-runner\_work\pytorch\pytorch\test\inductor\test_inductor_freezing.py", line 39, in <module>
2024-08-25T23:32:23.0911128Z     raise unittest.SkipTest("requires sympy/functorch/filelock")
2024-08-25T23:32:23.0911801Z unittest.case.SkipTest: requires sympy/functorch/filelock
2024-08-25T23:32:23.0912370Z Got exit code 1
2024-08-25T23:32:23.0913155Z No stepcurrent file found. Either pytest didn't get to run (e.g. import error) or file got deleted (contact dev infra)
```

Local test pass:
<img width="1898" alt="image" src="https://github.com/user-attachments/assets/4a6e3f66-4bbc-4aab-8f0d-2e2318046e53">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134425
Approved by: https://github.com/ezyang, https://github.com/jansel
2024-08-26 20:57:41 +00:00
68624cf089 [dynamo][guards] De-dupe DUPLICATE_INPUT guard (#134354)
Hard to write a test.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134354
Approved by: https://github.com/jansel
2024-08-26 20:48:57 +00:00
af82dc816a Fix lint failures (#134488)
Introduced by https://github.com/pytorch/pytorch/pull/131000

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134488
Approved by: https://github.com/Skylion007, https://github.com/msaroufim, https://github.com/albanD, https://github.com/atalman
2024-08-26 20:13:21 +00:00
2588b5e51a Move module_tracker to logging for confused hierarchy (#134467)
Fixes https://github.com/pytorch/pytorch/issues/134242

Make sure to never raise an error when confused. Logs for confusion can be enabled with `TORCH_LOGS="torch.utils.module_tracker"` or the usual python systems.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134467
Approved by: https://github.com/malfet
2024-08-26 19:39:08 +00:00
a0e062c6f1 Add mean.dtype_out (#133506)
Give it a try and see if CI is happy
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133506
Approved by: https://github.com/bdhirsh
2024-08-26 19:26:11 +00:00
eqy
3541e450af Support larger page sizes with use_mmap_weights (#131000)
Fixes e.g., `test_large_mmaped_weights_non_abi_compatible_cuda` on machines with 64K page size

CC @malfet @tinglvv @nWEIdia

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131000
Approved by: https://github.com/malfet
2024-08-26 18:35:55 +00:00
3322ee236d [aoti] remove c_shim_version v1 logic (#134283)
Summary: Previously, https://github.com/pytorch/pytorch/pull/132750 and https://github.com/pytorch/pytorch/pull/133105 set c_shim_version to 2 for all cases. So removing c_shim_version logic.

Test Plan: ci

Differential Revision: D61574695

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134283
Approved by: https://github.com/desertfire
2024-08-26 18:29:40 +00:00
1d231ff8ba [HOO] add hints_wrapper to support passing context hints (#132860)
Fixes #126393

The implementation code is based on feedback here (https://github.com/pytorch/pytorch/pull/121639#issuecomment-2223948842).

Hints are passed as kwargs of hints_wrapper op. It also supports nested hints.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132860
Approved by: https://github.com/ydwu4, https://github.com/zou3519
2024-08-26 18:21:22 +00:00
1ccc8f0200 [dynamo][super] Improve handling of getattr on super (#134039)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134039
Approved by: https://github.com/yanboliang, https://github.com/jansel
2024-08-26 18:20:39 +00:00
1dd4b9221b [inductor] enable clang for Windows inductor (#134444)
Changes:
1. Add Windows clang-cl compiler check.
2. Add openmp config for clang-cl.
3. Preload libomp.dll when use clang.
4. Add compiler flags syntax check for `clang` and `clang++`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134444
Approved by: https://github.com/jgong5, https://github.com/jansel, https://github.com/malfet
2024-08-26 18:19:59 +00:00
0a3c064c12 [inductor] fix _maybe_subprocess_run not support Windows path (#134365)
Windows file path use `\` as delimiter, it is also a escape character. We need translate all path `\` to `/`. which like Linux.

Reproduce UTs:
```cmd
pytest test\dynamo\test_minifier.py -v -k test_after_dynamo_cpu_accuracy_error
```

Error message:
```cmd
____________________________________________________________________________________________________________ MinifierTests.test_after_dynamo_cpu_accuracy_error _____________________________________________________________________________________________________________
Traceback (most recent call last):
  File "D:\xu_git\dnnl_cb\pytorch\test\dynamo\test_minifier.py", line 40, in test_after_dynamo_cpu_accuracy_error
    self._test_after_dynamo(
  File "D:\xu_git\dnnl_cb\pytorch\test\dynamo\test_minifier.py", line 27, in _test_after_dynamo
    self._run_full_test(run_code, "dynamo", expected_error, isolate=False)
  File "C:\Users\Xuhan\.conda\envs\win_mkl_static\lib\site-packages\torch\_dynamo\test_minifier_common.py", line 235, in _run_full_test
    self.assertIn(expected_error, test_proc.stderr.decode("utf-8"))
  File "C:\Users\Xuhan\.conda\envs\win_mkl_static\lib\unittest\case.py", line 1112, in assertIn
    self.fail(self._formatMessage(msg, standardMsg))
  File "C:\Users\Xuhan\.conda\envs\win_mkl_static\lib\unittest\case.py", line 675, in fail
    raise self.failureException(msg)
AssertionError: 'AccuracyError' not found in 'Traceback (most recent call last):\n  File "C:\\Users\\Xuhan\\.conda\\envs\\win_mkl_static\\lib\\site-packages\\torch\\_dynamo\\test_minifier_common.py", line 114, in _maybe_subprocess_run\n    exec(code, {"__name__": "__main__", "__compile_source__": code})\n  File "<string>", line 9\n    torch._dynamo.config.debug_dir_root = "C:\\Users\\Xuhan\\AppData\\Local\\Temp\\tmpufu9t3pc"\n                                                                                         ^\nSyntaxError: (unicode error) \'unicodeescape\' codec can\'t decode bytes in position 2-3: truncated \\UXXXXXXXX escape\n'

To execute this test, run the following from the base repo dir:
    python test\dynamo\test_minifier.py MinifierTests.test_after_dynamo_cpu_accuracy_error

This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
--------------------------------------------------------------------------------------------------------------------------- Captured stdout call ----------------------------------------------------------------------------------------------------------------------------
test stdout:
test stderr: Traceback (most recent call last):
  File "C:\Users\Xuhan\.conda\envs\win_mkl_static\lib\site-packages\torch\_dynamo\test_minifier_common.py", line 114, in _maybe_subprocess_run
    exec(code, {"__name__": "__main__", "__compile_source__": code})
  File "<string>", line 9
    torch._dynamo.config.debug_dir_root = "C:\Users\Xuhan\AppData\Local\Temp\tmpufu9t3pc"
                                                                                         ^
SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 2-3: truncated \UXXXXXXXX escape

--------------------------------------------------------------------------------------------------------------------------- Captured stderr call ----------------------------------------------------------------------------------------------------------------------------
running test
```
Local test passed:
<img width="849" alt="image" src="https://github.com/user-attachments/assets/4a4eecc2-7c08-4de6-9395-546b69803b16">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134365
Approved by: https://github.com/jansel, https://github.com/jgong5
2024-08-26 17:48:11 +00:00
78128cbdd8 [CD] Use ephemeral arm64 runners for nightly and docker builds (#134473)
Follow up after adding linux arm64 ephemeral instances: https://github.com/pytorch/pytorch/pull/134469
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134473
Approved by: https://github.com/malfet
2024-08-26 17:47:20 +00:00
0f5b052dba [inductor] calibration inductor windows uts (11/N) (#134427)
enable Windows inductor UTs of `test/inductor/test_inductor_freezing.py`

Local test pass:
<img width="1891" alt="image" src="https://github.com/user-attachments/assets/f3a873b4-abb5-4047-92f8-8e6da7c67315">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134427
Approved by: https://github.com/jansel
2024-08-26 17:43:58 +00:00
cyy
73604eed0c [20/N] Fix clang-tidy warnings in jit (#133399)
Follows #133067

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133399
Approved by: https://github.com/Skylion007
2024-08-26 17:43:52 +00:00
019b80855f [inductor] calibration inductor windows uts (10/N) (#134426)
enable Windows inductor UT of `test/inductor/test_efficient_conv_bn_eval.py`

Local test pass:
<img width="1892" alt="image" src="https://github.com/user-attachments/assets/8a94c5e4-68bf-4a6f-8a1b-60d6ede14882">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134426
Approved by: https://github.com/jansel
2024-08-26 17:43:36 +00:00
7ff576072f [inductor] calibration inductor windows uts (8/N) (#134424)
enable Windows inductor UTs of `test/inductor/test_benchmark_fusion.py`

Local test pass:
<img width="1912" alt="image" src="https://github.com/user-attachments/assets/5be34b0c-9411-4430-927e-3313245f7c13">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134424
Approved by: https://github.com/ezyang
2024-08-26 17:38:53 +00:00
adcce538b7 Revert "Allow mp.start_processes to create processes in parallel (#133707)"
This reverts commit 3546628a2a167ace6060737eeccf8ee8fd87ddc0.

Reverted https://github.com/pytorch/pytorch/pull/133707 on behalf of https://github.com/ZainRizvi due to sorry but trunk has been consistently broken since this PR was merged. See: [GH job link](https://github.com/pytorch/pytorch/actions/runs/10529617600/job/29191757055) [HUD commit link](3546628a2a) ([comment](https://github.com/pytorch/pytorch/pull/133707#issuecomment-2310709523))
2024-08-26 17:31:10 +00:00
d0ac5d55ba Memory optimization for DSD for TorchTune LoRA (#134025)
Optimize memory cost at [PR#129635](https://github.com/pytorch/pytorch/pull/129635)

There are 2 main part of the optimization here:
1. optimize the tensor distributing part, postpone the full_tensor generation, which avoids the memory overlap, saves around 50% peak memory at 2 param test case.
2. apply `assign=True` for the `load_state_dict`, saves memory cost at the state dict loading by assigning the input param, around 50% peak memory at loading part.

Future work:
Memory optimization to the opt will be conducted in the next PR

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134025
Approved by: https://github.com/fegin

Co-authored-by: Rachel Guo <guorachel@meta.com>
2024-08-26 17:24:25 +00:00
fc61aae70f Remove color in CI (#133517)
Remove color by default to make CI logs easier to read

Example of color
<img width="569" alt="image" src="https://github.com/user-attachments/assets/0da13544-98b1-47be-8383-64a5b3fd8951">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133517
Approved by: https://github.com/ZainRizvi
2024-08-26 16:58:06 +00:00
42955e04f1 Revert "[dynamo] Cache _dynamo.disable results (#134272)"
This reverts commit a699bd11551e9755bb9238c6b82c369880789397.

Reverted https://github.com/pytorch/pytorch/pull/134272 on behalf of https://github.com/ZainRizvi due to Fails internal tests ([comment](https://github.com/pytorch/pytorch/pull/134272#issuecomment-2310649115))
2024-08-26 16:57:53 +00:00
e94bdc7876 Revert "[dynamo][guards] De-dupe DUPLICATE_INPUT guard (#134354)"
This reverts commit cdb9df5efe78142b7a612ae9c938ddf8a8850d10.

Reverted https://github.com/pytorch/pytorch/pull/134354 on behalf of https://github.com/ZainRizvi due to Fails internal tests ([comment](https://github.com/pytorch/pytorch/pull/134272#issuecomment-2310649115))
2024-08-26 16:57:53 +00:00
a6fac0e969 Use ephemeral runners for windows nightly builds (#134463)
This is definition of windows.4xlarge:

```
  windows.4xlarge:
    disk_size: 256
    instance_type: c5d.4xlarge
    is_ephemeral: true
    max_available: 420
    os: windows
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134463
Approved by: https://github.com/jeanschmidt
2024-08-26 16:33:19 +00:00
b417e32da2 [CD] fix xpu nightly wheel test env (#134395) (#134464)
Due to the https://github.com/pytorch/builder/pull/1972 landed, it will source xpu env duplicated in nightly wheel test.
Works for https://github.com/pytorch/pytorch/issues/114850

Realnd of #134395 to be landed with pytorchmergebot
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134464
Approved by: https://github.com/jeanschmidt

Co-authored-by: Wang, Chuanqi <chuanqi.wang@intel.com>
2024-08-26 15:35:48 +00:00
c507f402f1 Add linux arm64 ephemeral runners (#134469)
Should be landed with: https://github.com/pytorch/test-infra/pull/5593

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134469
Approved by: https://github.com/jeanschmidt, https://github.com/clee2000
2024-08-26 15:32:45 +00:00
17e8a51ff2 Revert "[inductor]Let output or input_as_strided match exact strides (#130956)"
This reverts commit a63efee5cd422db0aabe5d02d2fe35fef9be7978.

Reverted https://github.com/pytorch/pytorch/pull/130956 on behalf of https://github.com/ZainRizvi due to sorry but this seems to cause internal tests to fail. Please see D61771533 for details ([comment](https://github.com/pytorch/pytorch/pull/130956#issuecomment-2310490049))
2024-08-26 15:31:23 +00:00
1c4780e69a Revert "c10d/logging: add C10D_LOCK_GUARD (#134131)"
This reverts commit 4c28a0eb0ba437c1b7db559f63f8bec17bd48f69.

Reverted https://github.com/pytorch/pytorch/pull/134131 on behalf of https://github.com/ZainRizvi due to Sorry but this causes formatting errors internally which make it fail to build. See D61759282 ([comment](https://github.com/pytorch/pytorch/pull/134131#issuecomment-2310455878))
2024-08-26 15:19:27 +00:00
50e90d7203 Revert "[dynamo] simplify implementation for functools.reduce (#133778)"
This reverts commit 6c0b15e3828b8e2a0bd726a3e5d4e98c8ced5efe.

Reverted https://github.com/pytorch/pytorch/pull/133778 on behalf of https://github.com/ZainRizvi due to Sorry, but this breaks internal tests because of using functools ([comment](https://github.com/pytorch/pytorch/pull/133778#issuecomment-2310445169))
2024-08-26 15:16:17 +00:00
472c7cf962 Revert "[dynamo] simplify implementation for builtins.sum (#133779)"
This reverts commit 8d90392fb02ce5e6854e6b4dbcdc4a7bbd55f8e2.

Reverted https://github.com/pytorch/pytorch/pull/133779 on behalf of https://github.com/ZainRizvi due to Sorry, but this breaks internal tests because of using functools ([comment](https://github.com/pytorch/pytorch/pull/133778#issuecomment-2310445169))
2024-08-26 15:16:17 +00:00
3d7f3f6a55 Revert "[dynamo][itertools] support itertools.tee (#133771)"
This reverts commit 0e49b2f18e78386c8ed9ce540a8017411c7ab0cd.

Reverted https://github.com/pytorch/pytorch/pull/133771 on behalf of https://github.com/ZainRizvi due to Sorry, but this breaks internal tests because of using functools ([comment](https://github.com/pytorch/pytorch/pull/133778#issuecomment-2310445169))
2024-08-26 15:16:17 +00:00
e1fc4362fb Revert "[dynamo] simplify implementation for os.fspath (#133801)"
This reverts commit c5f6b72041144c00e240bcfdc783a5597c3d8928.

Reverted https://github.com/pytorch/pytorch/pull/133801 on behalf of https://github.com/ZainRizvi due to Sorry, but this breaks internal tests because of using functools ([comment](https://github.com/pytorch/pytorch/pull/133778#issuecomment-2310445169))
2024-08-26 15:16:17 +00:00
bb67ff2ba7 Migrate Windows bin jobs to runner determinator (#134231)
Update Windows binary workflows to use the runner determinator script.

Closes: pytorch/ci-infra#262

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134231
Approved by: https://github.com/ZainRizvi
2024-08-26 14:56:00 +00:00
27d97b9649 Remove unnecessary test skip (#134250)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134250
Approved by: https://github.com/amjames, https://github.com/janeyx99
2024-08-26 14:34:53 +00:00
be96ccf77c Revert "[CD] fix xpu nightly wheel test env (#134395)" (#134461)
This reverts commit 96738c9d756fbd64e6f2eba67f711d3e18f1630c.

Merged without pytorchmergebot command by mistake

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134461
Approved by: https://github.com/jeanschmidt
2024-08-26 13:40:17 +00:00
96738c9d75 [CD] fix xpu nightly wheel test env (#134395) 2024-08-26 08:53:15 -04:00
1ff226d88c [inductor] support vec for atomic add (#131314)
Depends on https://github.com/pytorch/pytorch/pull/130827 to have correct `index_expr` dtype

Support vec for atomic add by scalar implementation.
TestPlan:
```
python test/inductor/test_cpu_repro.py -k test_scatter_using_atomic_add_vec
```
Generated code for `test_scatter_using_atomic_add_vec`
```
cpp_fused_scatter_0 = async_compile.cpp_pybinding(['const float*', 'const int64_t*', 'const float*', 'float*'], '''
#include "/tmp/torchinductor_root/nn/cnnpkaxivwaa5rzng6qsyc4ao42vschogi3yk33ukwv3emlvxeqq.h"
extern "C"  void kernel(const float* in_ptr0,
                       const int64_t* in_ptr1,
                       const float* in_ptr2,
                       float* out_ptr0)
{
    {
        for(long x0=static_cast<long>(0L); x0<static_cast<long>(16L); x0+=static_cast<long>(16L))
        {
            auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + static_cast<long>(x0), 16);
            tmp0.store(out_ptr0 + static_cast<long>(x0));
        }
        #pragma omp simd simdlen(8)
        for(long x0=static_cast<long>(16L); x0<static_cast<long>(25L); x0+=static_cast<long>(1L))
        {
            auto tmp0 = in_ptr0[static_cast<long>(x0)];
            out_ptr0[static_cast<long>(x0)] = tmp0;
        }
    }
    {
        for(long x0=static_cast<long>(0L); x0<static_cast<long>(16L); x0+=static_cast<long>(16L))
        {
            auto tmp0 = at::vec::VectorizedN<int64_t,2>::loadu(in_ptr1 + static_cast<long>(x0), 16);
            auto tmp12 = at::vec::Vectorized<float>::loadu(in_ptr2 + static_cast<long>(x0), 16);
            auto tmp1 = 25L;
            auto tmp2 = c10::convert<int64_t>(tmp1);
            auto tmp3 = at::vec::VectorizedN<int64_t,2>(tmp2);
            auto tmp4 = tmp0 + tmp3;
            auto tmp5 = static_cast<int64_t>(0);
            auto tmp6 = at::vec::VectorizedN<int64_t,2>(tmp5);
            auto tmp7 = at::vec::VecMask<int64_t,2>(tmp0 < tmp6);
            auto tmp8 = decltype(tmp4)::blendv(tmp0, tmp4, tmp7.template cast<int64_t,2>());
            auto tmp9 =
            [&]
            {
                __at_align__ std::array<int64_t, 16> tmpbuf;
                tmp8.store(tmpbuf.data());
                return tmpbuf;
            }
            ()
            ;
            auto tmp10 =
            [&]
            {
                __at_align__ std::array<int64_t, 16> tmpbuf;
                #pragma GCC unroll 16
                for (long x0_inner = 0; x0_inner < 16; x0_inner++)
                {
                    tmpbuf[x0_inner] = static_cast<long>(tmp9[x0_inner]);
                }
                return at::vec::VectorizedN<int64_t,2>::loadu(tmpbuf.data(), 16);
            }
            ()
            ;
            TORCH_CHECK((at::vec::VecMask<int64_t,2>((at::vec::VectorizedN<int64_t,2>(0) <= tmp10) & (tmp10 < at::vec::VectorizedN<int64_t,2>(25L)))).all_masked(), "index out of bounds: 0 <= tmp10 < 25L");
            atomic_add_vec(out_ptr0, tmp8, tmp12);
        }
        #pragma omp simd simdlen(8)
        for(long x0=static_cast<long>(16L); x0<static_cast<long>(20L); x0+=static_cast<long>(1L))
        {
            auto tmp0 = in_ptr1[static_cast<long>(x0)];
            auto tmp9 = in_ptr2[static_cast<long>(x0)];
            auto tmp1 = 25L;
            auto tmp2 = c10::convert<int64_t>(tmp1);
            auto tmp3 = decltype(tmp0)(tmp0 + tmp2);
            auto tmp4 = tmp0 < 0;
            auto tmp5 = tmp4 ? tmp3 : tmp0;
            auto tmp6 = tmp5;
            auto tmp7 = c10::convert<int64_t>(tmp6);
            TORCH_CHECK((0 <= tmp7) & (tmp7 < 25L), "index out of bounds: 0 <= tmp7 < 25L");
            atomic_add(&out_ptr0[static_cast<long>(tmp5)], static_cast<float>(tmp9));
        }
    }
}
''')
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131314
Approved by: https://github.com/jgong5, https://github.com/leslie-fang-intel
2024-08-26 10:36:51 +00:00
bf5c7bf06d [FR] Fix the bug in FR script (e.g., checking all ranks dump check) (#134383)
We somehow convert the rank to string which makes the ranks check fail. This fix now convert them all to int.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134383
Approved by: https://github.com/c-p-i-o
2024-08-26 08:21:14 +00:00
92c4771853 fix stuck floordiv (#134150)
Summary: Fixes https://github.com/pytorch/pytorch/issues/134133

Test Plan:
Tested on the small repro in the linked issue with different lengths N (replacing 100), recording N vs. time taken in nanoseconds:
10 127268319
20 220839662
30 325463125
40 429259441
50 553136055
60 670799769
70 999170514
80 899014103
90 997168902
100 1168202035
110 1388556619
120 1457488235
130 1609816470
140 2177889877
150 1917560313
160 2121096113
170 2428502334
180 4117450755
190 4003068224

So N ~ 200 takes ~5s. Previously even smaller N would go for >1 min.

Didn't add a perf test because ezyang is planning to build a benchmark.

Also tested on https://www.internalfb.com/diff/D61560171, which now gets past the stuck point.

Differential Revision: D61619660

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134150
Approved by: https://github.com/ezyang
2024-08-26 07:27:59 +00:00
c5f6b72041 [dynamo] simplify implementation for os.fspath (#133801)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133801
Approved by: https://github.com/anijain2305
ghstack dependencies: #133769, #133778, #133779, #133771
2024-08-26 07:12:15 +00:00
38f97ec8e3 [pt2] Add meta for poisson (#134103)
Because aten.poisson doesn't have meta function registered, there is one additional eager execution of this op during compilation phase of torch.compile.

There are more ops without meta registration. Is there any reason for it?
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134103
Approved by: https://github.com/ezyang
2024-08-26 06:14:38 +00:00
ed86ac2f25 [BE] typing for decorators - fx/_compatibility (#134054)
Summary: See #131429

Test Plan: unit tests pass

Differential Revision: D61493706

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134054
Approved by: https://github.com/oulgen
2024-08-26 04:00:27 +00:00
7b6b10417d Remove ansi escape chars in assertExpectedInline and add options to skip comments and to skip empty lines (#134248)
I had a night mare rewriting tests in test_misc.py specifically :
1. graphs can have comments that refers to my files "/lsakka/.." we really dont care about comments add option to ignore comments.
2. empty lines added when EXPECTTEST_ACCEPT=1  are changed with linter causing tests to fail or linter fail!
add flag to ignore empty lines.
3. EXPECTTEST_ACCEPT fails when the text have some not readable characters. those should not effect comparing strings, also those causes weird diffs comments when tests fails. I removed ansi_escape chars https://github.com/pytorch/pytorch/pull/133045

this is used in

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134248
Approved by: https://github.com/aorenste
ghstack dependencies: #133639, #134364
2024-08-26 02:03:44 +00:00
2ec149cd3e [inductor] fix test_functional_call_sequential_params_and_buffers expectation on Windows (#134394)
This UT actual code only one empty line wrap difference(`linear` and `add`) between Windows/Linux, and the context is right.
Reproduce UTs:
```cmd
pytest test\dynamo\test_higher_order_ops.py -v -k test_functional_call_sequential_params_and_buffers
```

We can add `empty_line_normalizer` to fix it.

```cmd
______________________________________________________________________________________________ FuncTorchHigherOrderOpTests.test_functional_call_sequential_params_and_buffers _______________________________________________________________________________________________
Traceback (most recent call last):
  File "D:\xu_git\dnnl_cb\pytorch\test\dynamo\test_higher_order_ops.py", line 3676, in test_functional_call_sequential_params_and_buffers
    self.assertExpectedInline(
  File "C:\Users\Xuhan\.conda\envs\win_mkl_static\lib\site-packages\torch\testing\_internal\common_utils.py", line 2871, in assertExpectedInline
    return super().assertExpectedInline(actual if isinstance(actual, str) else str(actual), expect, skip + 1)
  File "C:\Users\Xuhan\.conda\envs\win_mkl_static\lib\site-packages\expecttest\__init__.py", line 271, in assertExpectedInline
    self.assertMultiLineEqualMaybeCppStack(expect, actual, msg=help_text)
  File "C:\Users\Xuhan\.conda\envs\win_mkl_static\lib\site-packages\expecttest\__init__.py", line 292, in assertMultiLineEqualMaybeCppStack
    self.assertMultiLineEqual(expect, actual, *args, **kwargs)
  File "C:\Users\Xuhan\.conda\envs\win_mkl_static\lib\unittest\case.py", line 1226, in assertMultiLineEqual
    self.fail(self._formatMessage(msg, standardMsg))
  File "C:\Users\Xuhan\.conda\envs\win_mkl_static\lib\unittest\case.py", line 675, in fail
    raise self.failureException(msg)
AssertionError: 'clas[509 chars]one\n        add: "f32[1, 1]" = linear + l_buf[69 chars],)\n' != 'clas[509 chars]one\n\n        add: "f32[1, 1]" = linear + l_b[71 chars],)\n'
  class GraphModule(torch.nn.Module):
      def forward(self, L_params_l1_weight_: "f32[1, 1]", L_params_l1_bias_: "f32[1]", L_buffers_buffer_: "f32[1]", L_inputs_: "f32[1, 1]"):
          l_params_l1_weight_ = L_params_l1_weight_
          l_params_l1_bias_ = L_params_l1_bias_
          l_buffers_buffer_ = L_buffers_buffer_
          l_inputs_ = L_inputs_

          linear: "f32[1, 1]" = torch._C._nn.linear(l_inputs_, l_params_l1_weight_, l_params_l1_bias_);  l_inputs_ = l_params_l1_weight_ = l_params_l1_bias_ = None
+ <<<< (difference is here )
          add: "f32[1, 1]" = linear + l_buffers_buffer_;  linear = l_buffers_buffer_ = None
          return (add,)
 : To accept the new output, re-run test with envvar EXPECTTEST_ACCEPT=1 (we recommend staging/committing your changes before doing this)

To execute this test, run the following from the base repo dir:
    python test\dynamo\test_higher_order_ops.py FuncTorchHigherOrderOpTests.test_functional_call_sequential_params_and_buffers

This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
========================================================================================================================== short test summary info ==========================================================================================================================
FAILED [0.4275s] test/dynamo/test_higher_order_ops.py::FuncTorchHigherOrderOpTests::test_functional_call_sequential_params_and_buffers - AssertionError: 'clas[509 chars]one\n        add: "f32[1, 1]" = linear + l_buf[69 chars],)\n' != 'clas[509 chars]one\n\n        add: "f32[1, 1]" = linear + l_b[71 chars],)\n'
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134394
Approved by: https://github.com/jansel

Co-authored-by: Jason Ansel <jansel@jansel.net>
2024-08-26 01:41:20 +00:00
7af38eb98b Fix unexpected inference_mode interaction with torch.autograd.functional.jacobian (#130307)
Fixes #128264

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130307
Approved by: https://github.com/soulitzer
2024-08-25 22:14:02 +00:00
dc1959e6a7 [inductor] calibration inductor windows uts (7/N) (#134420)
Disable UTs on Windows: `test/dynamo/test_misc.py`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134420
Approved by: https://github.com/jansel
2024-08-25 20:39:54 +00:00
97fd087cdb [inductor] calibration inductor windows uts (6/N) (#134419)
Disable UTs for Windows: `test/dynamo/test_aot_autograd_cache.py`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134419
Approved by: https://github.com/jansel
2024-08-25 20:39:34 +00:00
b5dd60fa75 Fix namespace issues with qnnpack (#134336)
After this I think all `using namespace` will have been eliminated from PyTorch header files. Internally, `-Wheader-hygiene` will prevent more from being added.

Test Plan: Sandcastle

Differential Revision: D61679037

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134336
Approved by: https://github.com/Skylion007
2024-08-25 19:50:01 +00:00
7940f2428f [torch/package_importer] add compatibility name mapping (#134376)
Summary:
This enables patching extern modules to provide compatibility with serialized code depending on different versions of those extern modules.

The main motivation is to enable Numpy upgrade. In the recent release many alias to builtin types were deprecated and removed [1]. This breaks loading pickled modules that reference the removed aliases. While the proper solution is to re-generate pickled modules, it's not always feasible.

This proposes a way to define mapping with a new type, for a module member. It is only set if it's not present in the loaded module, thus removes the need to check for exact versions.

https://numpy.org/doc/stable/release/1.20.0-notes.html#using-the-aliases-of-builtin-types-like-np-int-is-deprecated

Differential Revision: D61556888

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134376
Approved by: https://github.com/SherlockNoMad
2024-08-25 19:34:46 +00:00
816061843a [Distributed/Profiler] Fix input/output dimension overflow (#134360)
Summary: When using ParamCommsDebugInfo, the input elements and output elements are stored in `int` instead of `int64_t`

Test Plan: Run HTA with new outputted values and make sure overflow does not occur

Reviewed By: fengxizhou

Differential Revision: D61728747

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134360
Approved by: https://github.com/fengxizhou, https://github.com/jeanschmidt
2024-08-25 16:25:56 +00:00
eqy
e93ca12c88 [CUDNN][SDPA] Fix unsupported trivial stride-1 transpose case (#134031)
Fixes #134001
Incorrect assumption that two same-shape tensors being contiguous meant that they would have the same stride

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134031
Approved by: https://github.com/drisspg, https://github.com/Skylion007

Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com>
2024-08-25 14:31:30 +00:00
08d111250a [ez][c10d] change ERROR to WARNING (#134349)
Summary:
Change error to warning because TCPStore can be torn down during a normal shutdown. It's OK if we're unable to access TCPStore. Should not be an error.

Test Plan:
Ran locally

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134349
Approved by: https://github.com/fduwjj, https://github.com/wconstab
2024-08-25 14:22:55 +00:00
4648848696 Revert "[ROCm] remove triton-rocm commit pin and merge pins with triton.txt (#133438)"
This reverts commit f71c3d265ab52589f983dd252d61461db4e7dbbd.

Reverted https://github.com/pytorch/pytorch/pull/133438 on behalf of https://github.com/jeanschmidt due to seems to have introduced breakages in linux binary builds ([comment](https://github.com/pytorch/pytorch/pull/133438#issuecomment-2308787310))
2024-08-25 11:20:30 +00:00
e5563f7ad7 Revert "[dtensor][MTPG] make sharding prop lru cache not shared among threads (#134294)"
This reverts commit eb15b1a016c6facaf8605dde2c20b5de1586542d.

Reverted https://github.com/pytorch/pytorch/pull/134294 on behalf of https://github.com/jeanschmidt due to seems to have introduced https://github.com/pytorch/pytorch/actions/runs/10537099590/job/29201744658 ([comment](https://github.com/pytorch/pytorch/pull/134294#issuecomment-2308785949))
2024-08-25 11:16:04 +00:00
268092db83 [DeviceMesh] Allow _flatten() to take in an optional mesh_dim_name (#134048)
If a mesh_dim_name is given, we will use the given mesh_dim_name to name the new flattened dim.
Otherwise, the default is a string concatentaing the mesh_dim_names of the given submesh with each mesh_dim_name separated by "_".

For example, if we have a 3D mesh DeviceMesh([[[0, 1], [2, 3]], [[4, 5], [6, 7]]], mesh_dim_names=("dp", "cp", "tp")), calling mesh_3d["dp", "cp"]._flatten() will create a 1D submesh DeviceMesh([0, 1, 2, 3], mesh_dim_names=("dp_cp",)) on rank 0, 1, 2, 3 and a 1D submesh DeviceMesh([4, 5, 6, 7], mesh_dim_names=("dp_cp",)) on rank 4, 5, 6, 7.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134048
Approved by: https://github.com/fegin
ghstack dependencies: #133838, #133839
2024-08-25 10:36:01 +00:00
326db8af4c Replace sympy Min/Max with reimplementations (#133319)
Sympy's implementation of Min/Max displays asymptotically bad behavior on `TORCH_COMPILE_CPROFILE=1 python torchrec/distributed/tests/test_pt2_multiprocess.py TestPt2Train.test_compile_multiprocess`. Evidence profile:

![image](https://github.com/user-attachments/assets/142301e9-3a18-4370-b9db-19b32ece7ee8)

On this test case, we spend 42% of all time compiling the network on ShapeEnv.replace, which in turn spends all of its time in xreplace.

The problem appears to be find_localzeros call. By vendoring the implementations of Min/Max, we can potentially reduce the cost of this operation.

The implementation is copy-pasted sympy/functions/elementary/miscellaneous.py but with some adjustments:

* I deleted logic related to differentatiation, evalf and heaviside, as it's not relevant to PyTorch reasoning
* There's some massaging to appease PyTorch's linters, including a lot of noqa and type: ignore (which I could potentially refactor away with substantive changes, but that's better as its own change)
* I deleted the second loop iteration for is_connected, as an attempt at initial optimization (this also simplifies the port, since I can omit some code). I'll comment at that point what the exact difference is.

Before this change, the test in question takes 100s with 40 features; post this change, afterwards, it takes only 69s.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133319
Approved by: https://github.com/Skylion007
2024-08-25 05:05:59 +00:00
8db8ac700d line by line logging (#134298)
Summary:
Today there is no good mechanism to detect progress of non-strict export line-by-line in user code. This caused some pain recently in trying to find the exact line of user code that was triggering a bug where the process appeared stuck because deep down something was calling some symbolic shapes code that was suffering some exponential blowup.

This PR adds a environment variable for extended debugging that will log the line of user code corresponding to every torch function call. It only works in non-strict export for now. Prefix setting this environment variable with `TORCH_LOGS`  enabled for `export` logs at `DEBUG` level (i.e., with a `+` prefix), i.e.,.:

```
TORCHEXPORT_EXTENDED_DEBUG_CURRENT_LOC=1 TORCH_LOGS="+export" ...
```

This will show logs with something like:
```
...
prim::device called at .../example.py:4284 in foo
TensorBase.item called at .../example.py:4277 in bar
...
```

We already have an existing place to intercept torch functions where we process data-dependent errors in non-strict, so parking the logging there. An alternative place we could be doing this is where we add `stack_trace` metadata when generating code, but unfortunately at least the example that motivated this gets stuck before generating code, so that would be too late.

Test Plan: ran it on some sample commands

Differential Revision: D61692156

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134298
Approved by: https://github.com/angelayi
2024-08-25 02:57:11 +00:00
907c32faac [inductor] calibration inductor windows uts (4/N) (#134401)
skip failed UTs of `test/dynamo/test_unspec.py`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134401
Approved by: https://github.com/ezyang
2024-08-25 00:32:29 +00:00
74ef74be36 [inductor] calibration inductor windows uts (3/N) (#134400)
skip Windows UT of `test/dynamo/test_trace_rules.py`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134400
Approved by: https://github.com/ezyang
2024-08-24 23:48:50 +00:00
d33d68e326 [Profiler] Add test to make sure FunctionEvents are processed lazily (#134359)
Summary: Create simple test that checks that FunctionEvent build tree happens lazily by checking that the metrics for it changes before and after call.

Test Plan: Make sure test passes in CI

Reviewed By: briancoutinho

Differential Revision: D61685429

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134359
Approved by: https://github.com/briancoutinho
2024-08-24 23:03:19 +00:00
af4c87953e [inductor] calibration inductor windows uts (5/N) (#134402)
skip UTs of `test/dynamo/test_repros.py`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134402
Approved by: https://github.com/ezyang
2024-08-24 23:00:11 +00:00
94f92fbd88 Use integer divison in arange length calculation when start/end/step are integral (#134296)
Fixes #133338

Test Plan:

```
TORCH_LOGS=dynamic python
import torch

torch._dynamo.config.capture_scalar_outputs = True

@torch.compile()
def f(x):
    y = x.item()
    torch._check_is_size(y)
    r = torch.arange(y, dtype=torch.float32)
    torch._check(r.size(0) == y)
    return r

f(torch.tensor([300]))
```

Before and after diff. Verify the following line

```
I0813 11:05:44.890000 652898 torch/fx/experimental/symbolic_shapes.py:5198] [0/0] runtime_assert Eq(CeilToInt(IntTrueDiv(u0, 1)), u0) [guard added] at aa.py:10 in f (_dynamo/utils.py:2092 in run_node), for more info run with TORCHDYNAMO_EXTENDED_DEBUG_GUARD_ADDED="Eq(CeilToInt(IntTrueDiv(u0, 1)), u0)"
```

no longer shows in the logs. Also verify CI passes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134296
Approved by: https://github.com/aorenste
2024-08-24 21:09:28 +00:00
1a0d00f1f4 [traced-graph][sparse] enable to_dense() for compressed (#133371)
Fixes https://github.com/pytorch/pytorch/issues/133174

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133371
Approved by: https://github.com/ezyang
2024-08-24 20:33:23 +00:00
050aa67e41 [traced-graph][sparse] fix restrictive assert for sparse add (#134037)
exporting sparse addition can be CPU/Meta this fixes the overly restrictive assert and adds an exporting test
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134037
Approved by: https://github.com/ezyang
2024-08-24 20:26:47 +00:00
90fb83749e [inductor] fix test torch package working with trace on windows (#134397)
Current temporary directory path is hard code. Fixed by get temporary directory path by API.

Reproduce UTs:
```cmd
python test/dynamo/test_dynamic_shapes.py -v -k test_torch_package_working_with_trace_dynamic_shapes
```

Error message:
```cmd
________________________________________________________________________________________________ DynamicShapesMiscTests.test_torch_package_working_with_trace_dynamic_shapes ________________________________________________________________________________________________
Traceback (most recent call last):
  File "D:\xu_git\dnnl_cb\pytorch\test\dynamo\test_misc.py", line 7199, in test_torch_package_working_with_trace
    with package.PackageExporter(path) as exp:
  File "C:\Users\Xuhan\.conda\envs\win_mkl_static\lib\site-packages\torch\package\package_exporter.py", line 237, in __init__
    self.zip_file = torch._C.PyTorchFileWriter(f)
RuntimeError: Parent directory /tmp does not exist.

To execute this test, run the following from the base repo dir:
    python test\dynamo\test_dynamic_shapes.py DynamicShapesMiscTests.test_torch_package_working_with_trace_dynamic_shapes

This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
========================================================================================================================== short test summary info ==========================================================================================================================
FAILED [0.0080s] test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_torch_package_working_with_trace_dynamic_shapes - RuntimeError: Parent directory /tmp does not exist.
==================================================================================================================== 1 failed, 1665 deselected in 4.00s =====================================================================================================================
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134397
Approved by: https://github.com/ezyang
2024-08-24 20:25:44 +00:00
9cd53b3212 Add Arm copyright line to LICENSE (#133982)
Some historical commits from arm:
- 2021 664126bab5f3f2a275e82b7bde127132cff7f34e
- 2023 2630144786e906b40abbe017294d404bcfe3c6ae
- 2024 ce6130014156fa9555ce3d16c5f9a84cbdadf8f4

See https://github.com/pytorch/pytorch/pull/126687 for initial discussion.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133982
Approved by: https://github.com/malfet
2024-08-24 18:41:06 +00:00
50d5aa8c10 Enable optimized dynamic quantization on aarch64 (#126687)
oneDNN+ACL has optimized kernels for s8s8 matmul, so input is signed. This change leaves behaviour on all other platforms the same. This change requires https://github.com/intel/ideep/pull/313 to go in, and oneDNN 3.5 for the optimized kernels. This change speeds up dynamic quantized linear by ~10x.

Also, do you have a policy on copyright headers? Arm's usual policy when contributing to open source projects is to include a copyright header on any file which is modified. Would this be acceptable? If not, is there somewhere else suitable to note copyright?

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126687
Approved by: https://github.com/jgong5, https://github.com/malfet, https://github.com/snadampal

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2024-08-24 18:40:12 +00:00
f71c3d265a [ROCm] remove triton-rocm commit pin and merge pins with triton.txt (#133438)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133438
Approved by: https://github.com/jithunnair-amd, https://github.com/malfet
2024-08-24 18:26:49 +00:00
6245d5b87b [CI] Update XPU ci test python version to 3.9 (#134214)
Works for https://github.com/pytorch/pytorch/issues/114850
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134214
Approved by: https://github.com/EikanWang, https://github.com/malfet
2024-08-24 18:11:36 +00:00
a63efee5cd [inductor]Let output or input_as_strided match exact strides (#130956)
Fixes #130394

TorchInductor doesn't respect original strides of outputs. It opens up optimization opportunities like changing up memory layout. But for some cases, such as the case in https://github.com/pytorch/pytorch/issues/130394, we do need the output match the exact stride as required. The correctness is the first priority goal. So, this PR adds a new API `ir.ExternKernel.require_exact_strides(x, exact_strides, allow_padding=False)` to fix the issue.  This PR enables non-dense outputs' strides follow the strides required by semantics.

The comparison between the original and after this fix for the test is the below.

```python
@triton.jit
def triton_(in_ptr0, out_ptr0, xnumel, XBLOCK : tl.constexpr):
    xnumel = 128
    xoffset = tl.program_id(0) * XBLOCK
    xindex = xoffset + tl.arange(0, XBLOCK)[:]
    xmask = xindex < xnumel
    x0 = xindex % 8
    x1 = (xindex // 8)
-   x2 = xindex
    tmp0 = tl.load(in_ptr0 + (x0 + (16*x1)), xmask)
    tmp1 = tmp0 + tmp0
-   tl.store(out_ptr0 + (x2), tmp1, xmask)
+   tl.store(out_ptr0 + (x0 + (16*x1)), tmp1, xmask)

def call(args):
    arg0_1, = args
    args.clear()
    assert_size_stride(arg0_1, (16, 8), (16, 1))
    with torch.cuda._DeviceGuard(0):
        torch.cuda.set_device(0)
-       buf1 = empty_strided_cuda((16, 8), (8, 1), torch.float32)
+       buf1 = empty_strided_cuda((16, 8), (16, 1), torch.float32)
        stream0 = get_raw_stream(0)
        triton_poi_fused_add_copy_0.run(arg0_1, buf1, 128, grid=grid(128), stream=stream0)
        del arg0_1
    return (buf1, )
```

The buf1 is created with exact stride required by users, and its values are written in same stride with the input.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130956
Approved by: https://github.com/eellison, https://github.com/blaine-rister
2024-08-24 17:04:05 +00:00
cdb9df5efe [dynamo][guards] De-dupe DUPLICATE_INPUT guard (#134354)
Hard to write a test.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134354
Approved by: https://github.com/jansel
ghstack dependencies: #134272
2024-08-24 15:17:56 +00:00
d433a603af [BE] use torch.amp.autocast instead of torch.cuda.amp.autocast (#134291)
torch.cuda.amp.autocast / torch.cpu.amp.autocast are deprecated and spew a ton of warnings when these tests run. This PR: Update to just use torch.amp.autocast(device).

Note: this uncovers a bug in the test: when `device` is CUDA, it actually shows up as "cuda:0" - so previously, this test was _always_ using `torch.cpu.amp.autocast` even for `cuda` device. This PR fixes this, and uncovers additional bugs in `pinverse` and `linalg.pinv`; `linalg.pinv` was already failing before on CPU, but now the test also catches failures on CUDA, (and this PR adds to the skipped-test list).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134291
Approved by: https://github.com/YuqingJ
2024-08-24 15:07:49 +00:00
a1061009c9 [PT2] use statically_known_true in slice_noop (#134270)
Summary:
# context
* when fixing the graph break in _maybe_compute_kjt_to_jt_dict, we encountered this issue P1539489731:
```
[rank0]:   ATTENTION: guard_size_oblivious would fix the error, evaluating expression to False.
[rank0]:   Maybe you need to add guard_size_oblivious to framework code, see doc below for more guidance.
[rank0]:
[rank0]:   Potential framework code culprit (scroll up for full backtrace):
[rank0]:     File "/data/users/hhy/fbsource/buck-out/v2/gen/fbcode/61f992c26f3f2773/aps_models/ads/icvr/__icvr_launcher_live__/icvr_launcher_live#link-tree/torch/_inductor/fx_passes/post_grad.py", line 671, in slice_noop
[rank0]:       if start == 0 and end >= 2**63 - 1 and step == 1:
```
* change the condition logic to be compatible with SymInt

Test Plan:
# commands
* run test
```
TORCH_SHOW_CPP_STACKTRACES=1 TORCHDYNAMO_EXTENDED_DEBUG_CPP=1 TORCH_LOGS="+graph_code,output_code,dynamic,aot,guards,verbose_guards,recompiles,graph_breaks" TORCH_TRACE=/var/tmp/tt buck2 run fbcode//mode/opt fbcode//aps_models/ads/icvr:icvr_launcher_live -- mode=fmc/local_ig_fm_v4_mini training.pipeline_type=pt2 2>&1 | tee -a `date +"%Y.%m.%d.%H.%M"`.`sl whereami`.log
```
* tlparse
```
ls -thl /var/tmp/tt | head -9 && tlparse `ls -t /var/tmp/tt/* | head -1`
```

Differential Revision: D61677207

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134270
Approved by: https://github.com/ezyang
2024-08-24 13:58:51 +00:00
ff77c67d16 Use ephemeral runners for linux nightly builds (#134367)
Should be landed with https://github.com/pytorch/test-infra/pull/5590
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134367
Approved by: https://github.com/kit1980, https://github.com/malfet, https://github.com/seemethere
2024-08-24 12:49:07 +00:00
ff7d94c67e [compiled autograd] fix saved tensor hook firing count (#134361)
SavedVariable constructor calls the pack hooks, we don't want to call them for the proxy tensor since it is proxying a tensor that already had called the pack hook during forward.

Using the same fix as https://github.com/pytorch/pytorch/pull/123196

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134361
Approved by: https://github.com/jansel
ghstack dependencies: #134186, #134200, #134205, #134286, #134290, #134162, #134163
2024-08-24 12:06:36 +00:00
929de1d0d4 Re-enable skipped compiled autograd eager tests (#134163)
Originally disabled in: https://github.com/pytorch/pytorch/pull/131700#discussion_r1727153445, but the failure is no longer in CI

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134163
Approved by: https://github.com/soulitzer
ghstack dependencies: #134186, #134200, #134205, #134286, #134290, #134162
2024-08-24 12:06:36 +00:00
ad8bdfae1e add compiled_autograd to programmatic set_logs API (#134162)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134162
Approved by: https://github.com/yf225, https://github.com/jansel
ghstack dependencies: #134186, #134200, #134205, #134286, #134290
2024-08-24 12:06:36 +00:00
1431663693 [compiled autograd] finish classifying tests (#134290)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134290
Approved by: https://github.com/yf225
ghstack dependencies: #134186, #134200, #134205, #134286
2024-08-24 12:06:36 +00:00
0b228a2af8 [compiled autograd] match eager behavior for ctx.saved_variables (#134286)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134286
Approved by: https://github.com/jansel
ghstack dependencies: #134186, #134200, #134205
2024-08-24 12:06:36 +00:00
6cc57c64b2 [compiled autograd] match eager behavior for post acc grad hooks (#134205)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134205
Approved by: https://github.com/jansel
ghstack dependencies: #134186, #134200
2024-08-24 12:06:36 +00:00
d7a25e1d8c [compiled autograd] add config patching for certain eager tests (#134200)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134200
Approved by: https://github.com/jansel
ghstack dependencies: #134186
2024-08-24 12:06:36 +00:00
0d9208a398 [compiled autograd] match eager behavior for inplace detached activations (#134186)
Fixes `TestAutograd.test_saved_variable_saved_original_inplace_detach` when ran under compiled autograd

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134186
Approved by: https://github.com/jansel
2024-08-24 12:06:36 +00:00
ccafc93be5 [AOTI][CPU] Make int8 qlinear work (#134368)
Summary:
This diff will decompose torch.ops._quantized.wrapped_quantized_linear into torch.ops._quantized.wrapped_linear_prepack and torch.ops._quantized.wrapped_quantized_linear_prepacked for AOTI, and added the corresponding impl into shim

The way it works will be similar to what we did previously for fbgemm fp16 dynamic qlinear. We will do constant folding for packed weight during runtime (warm up) to achieve the speed up

Reviewed By: desertfire

Differential Revision: D61396144

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134368
Approved by: https://github.com/houseroad
2024-08-24 08:25:25 +00:00
eb15b1a016 [dtensor][MTPG] make sharding prop lru cache not shared among threads (#134294)
**Summary**
Before this PR, `sharding propagator` is shared among threads. The result is the cache result of rank 0 would be accessible by other ranks e.g. rank 1 and this could lead to wrong DTensor resharding. This PR fixes it by making the cache a local variable at thread level, and it fixes `dstack` test (#126493), `inner` (https://github.com/pytorch/pytorch/issues/126852), and `vstack` (https://github.com/pytorch/pytorch/issues/126868). It also fixes `poisson_nll` (https://github.com/pytorch/pytorch/issues/131446) as a bi-product.

**Test**
`pytest test/distributed/_tensor/test_dtensor_ops.py`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134294
Approved by: https://github.com/wz337, https://github.com/awgu
2024-08-24 05:56:45 +00:00
1034f456ef [inductor] fix munge_exc not support windows path (#134348)
Windows file path use `\` as delimiter, it is also a escape character. We need translate all path `\` to `/`. which like Linux.

Reproduce UT:
```cmd
pytest test\dynamo\test_higher_order_ops.py -v -k test_vmap_grad_vmap_guard_fail
```
Error msg:
```cmd
________________________________________________________________________________________________________ HigherOrderOpVmapGuardTests.test_vmap_grad_vmap_guard_fail _________________________________________________________________________________________________________
Traceback (most recent call last):
  File "C:\Users\Xuhan\.conda\envs\win_mkl_static\lib\site-packages\torch\testing\_internal\logging_utils.py", line 89, in test_fn
    fn(self, records)
  File "D:\xu_git\dnnl_cb\pytorch\test\dynamo\test_higher_order_ops.py", line 2714, in test_vmap_grad_vmap_guard_fail
    munge_exc(record.getMessage()),
  File "C:\Users\Xuhan\.conda\envs\win_mkl_static\lib\site-packages\torch\testing\_internal\common_utils.py", line 5252, in munge_exc
    s = re.sub(file, os.path.basename(file), s)
  File "C:\Users\Xuhan\.conda\envs\win_mkl_static\lib\re.py", line 209, in sub
    return _compile(pattern, flags).sub(repl, string, count)
  File "C:\Users\Xuhan\.conda\envs\win_mkl_static\lib\re.py", line 303, in _compile
    p = sre_compile.compile(pattern, flags)
  File "C:\Users\Xuhan\.conda\envs\win_mkl_static\lib\sre_compile.py", line 788, in compile
    p = sre_parse.parse(p, flags)
  File "C:\Users\Xuhan\.conda\envs\win_mkl_static\lib\sre_parse.py", line 955, in parse
    p = _parse_sub(source, state, flags & SRE_FLAG_VERBOSE, 0)
  File "C:\Users\Xuhan\.conda\envs\win_mkl_static\lib\sre_parse.py", line 444, in _parse_sub
    itemsappend(_parse(source, state, verbose, nested + 1,
  File "C:\Users\Xuhan\.conda\envs\win_mkl_static\lib\sre_parse.py", line 526, in _parse
    code = _escape(source, this, state)
  File "C:\Users\Xuhan\.conda\envs\win_mkl_static\lib\sre_parse.py", line 370, in _escape
    raise source.error("incomplete escape %s" % escape, len(escape))
re.error: incomplete escape \x at position 2

To execute this test, run the following from the base repo dir:
    python test\dynamo\test_higher_order_ops.py HigherOrderOpVmapGuardTests.test_vmap_grad_vmap_guard_fail

This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
--------------------------------------------------------------------------------------------------------------------------- Captured stdout call ----------------------------------------------------------------------------------------------------------------------------
frames [('total', 2), ('ok', 2)]
inductor []
inline_call []
stats [('calls_captured', 38), ('unique_graphs', 2)]
--------------------------------------------------------------------------------------------------------------------------- Captured stderr call ----------------------------------------------------------------------------------------------------------------------------
V0824 01:29:00.148000 27840 torch\_dynamo\guards.py:2787] [0/1] [__recompiles] Recompiling function fn in D:\xu_git\dnnl_cb\pytorch\test\dynamo\test_higher_order_ops.py:2699
V0824 01:29:00.148000 27840 torch\_dynamo\guards.py:2787] [0/1] [__recompiles]     triggered by the following guard failure(s):
V0824 01:29:00.148000 27840 torch\_dynamo\guards.py:2787] [0/1] [__recompiles]     - 0/0: torch._functorch.pyfunctorch.compare_functorch_state([('Vmap', 1, 'error')])  # _dynamo\output_graph.py:479 in init_ambient_guards
========================================================================================================================== short test summary info ==========================================================================================================================
FAILED [0.7452s] test/dynamo/test_higher_order_ops.py::HigherOrderOpVmapGuardTests::test_vmap_grad_vmap_guard_fail - re.error: incomplete escape \x at position 2
```
Local test passed:
<img width="860" alt="image" src="https://github.com/user-attachments/assets/90f0d780-0639-4c03-8d7c-6f227c93a3fc">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134348
Approved by: https://github.com/jansel
2024-08-24 05:51:35 +00:00
0694918aeb [export] Temporarily bypass torch_fn in partitioner (#134292)
Summary:
"torch_fn" is not correct for the decomposed add node from batch norm. This is a temporary workaround to bypass torch fn.

For example, for the graph below (test_qat_conv2d_unary graph):
```
graph():
    %conv_weight : [num_users=1] = get_attr[target=conv.weight]
    %bn_weight : [num_users=1] = get_attr[target=bn.weight]
    %bn_bias : [num_users=1] = get_attr[target=bn.bias]
    %bn_running_mean : [num_users=1] = get_attr[target=bn.running_mean]
    %bn_running_var : [num_users=1] = get_attr[target=bn.running_var]
    %bn_num_batches_tracked : [num_users=1] = get_attr[target=bn.num_batches_tracked]
    %x : [num_users=1] = placeholder[target=x]
    %conv2d : [num_users=1] = call_function[target=torch.ops.aten.conv2d.default](args = (%x, %conv_weight, None, [1, 1], [1, 1]), kwargs = {})
    %add_ : [num_users=0] = call_function[target=torch.ops.aten.add_.Tensor](args = (%bn_num_batches_tracked, 1), kwargs = {})
    %batch_norm : [num_users=1] = call_function[target=torch.ops.aten.batch_norm.default](args = (%conv2d, %bn_weight, %bn_bias, %bn_running_mean, %bn_running_var, True, 0.1, 1e-05, True), kwargs = {})
    %relu : [num_users=1] = call_function[target=torch.ops.aten.relu.default](args = (%batch_norm,), kwargs = {})
    %max_pool2d : [num_users=1] = call_function[target=torch.ops.aten.max_pool2d.default](args = (%relu, [3, 3], [3, 3]), kwargs = {})
    return (max_pool2d,)
```

the add_ node has `'torch_fn': ('add__1', 'method_descriptor.add_'),` in its meta.

If we run the line below in `_annotate_qat_conv2d_bn_binary_unary`, we'll have a partition without output nodes.

```
 find_sequential_partitions(
            gm, [torch.nn.Conv2d, torch.nn.BatchNorm2d, operator.add, torch.nn.ReLU]
        )
````

```
partition_list
[
SourcePartition(nodes=[conv_weight, conv2d], source=<class 'torch.nn.modules.conv.Conv2d'>, input_nodes=[x], output_nodes=[conv2d], params=[conv_weight]),

SourcePartition(nodes=[bn_weight, bn_bias, bn_running_mean, bn_running_var, bn_num_batches_tracked, add_, batch_norm], source=<class 'torch.nn.modules.batchnorm.BatchNorm2d'>, input_nodes=[conv2d], output_nodes=[batch_norm], params=[bn_num_batches_tracked, bn_running_var, bn_bias, bn_weight, bn_running_mean]),

SourcePartition(nodes=[add_], source='add_', input_nodes=[bn_num_batches_tracked], output_nodes=[], params=[])
]
```
We should not have the last partition.

Test Plan:
```
buck2 run 'fbcode//mode/dev-nosan' fbcode//caffe2/test/quantization:test_quantization -- -r test_qat_conv2d
```

Differential Revision: D61569049

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134292
Approved by: https://github.com/angelayi
2024-08-24 05:50:18 +00:00
f260cc2edf Enable DTensor sharding propagation of native_layer_norm_backward to more fully accommodate optional args (#133502)
Fixes #133499

### The issue

Testing a variety of TP `requires_grad` patterns (validating maximally flexible finetuning) revealed `DTensor` sharding propagation of `aten.native_layer_norm_backward` (default) fails with an `IndexError` for certain `requires_grad` patterns (pattern 1) (e.g. `output_mask` `[True, False, False]`) and an `AssertionError` for others (pattern 2) (e.g. output mask `[False, True, *]`). Please see issue #133499 for a full description of the observed failure patterns along with reproduction.

### Use Cases and Remediation

Failure pattern 1 is potentially problematic for a variety of finetuning scenarios. Though failure pattern 2 is really an xfail right now since it's not fully supported, IMHO there are use cases (e.g. especially wrt to mechanistic interpretability research, but certain finetuning scenarios too potentially) that justify supporting this output mask (especially since supporting it is fairly straightforward I think).

In this PR I propose some modest changes that:
  * Address the aforementioned failure modes.
  * Add a couple tests that I'm hopeful will help ensure `DTenso`r op dispatch (which is so well implemented and such a pleasure working with btw! 🚀 🎉) accommodates a wide variety of (potentially unanticipated) `requires_grad` patterns as it evolves.

To address both failure modes, I'm proposing the following changes:
1. To [`torch.distributed._tensor.ops._math_ops.layer_norm_bwd_strategy`](7b269cc484/torch/distributed/_tensor/ops/_math_ops.py (L873)):
  - Refactor conditional `output_mask` handling such that the input and output specs in the`PlacementStrategy`s of the returned `output_strategy.strategies` list remain aligned with the `op_schema.args_spec` (whose definition does not change at runtime based upon unused optional args).
2. To [`torch.distributed._tensor._sharding_prop.propagate_op_sharding_non_cached`](7b269cc484/torch/distributed/_tensor/_sharding_prop.py (L256-L262)):
  - When iterating through the active `op_schema.args_spec` to build the relevant `expected_input_specs` list, filter any `None` `desired_specs`.
3. To [`torch/distributed/_tensor/_op_schema.OpSchema._inplace_rewrap_schema_suggestion`](7b269cc484/torch/distributed/_tensor/_op_schema.py (L418))
  - When inputs need a redistribute, for runtime-unrequired (`None` arguments in the aligned `suggestion_args_schema`), ignore the associated `suggestion_args_spec`

### Implementation considerations:

- Regarding `1`, to avoid changing the op strategy return args ([`op_strategy`](cf81180007/torch/distributed/_tensor/_sharding_prop.py (L234))), the change in `1` allows `None` elements to exist temporarily in `PlacementStrategy.input_specs` (treating it as `Sequence[DTensorSpec | None] | None` when it's `Sequence[DTensorSpec] | None`. This could be addressed in any number of ways but I thought it best to leave that for a subsequent PR since it could have broader ramifications (e.g. allowing op_strategies to return an output_strategy.input_specs` mask explicitly, explicitly allowing `None`s in `PlacementStrategy.input_specs`, creating a `Null` DTensorSpec etc.). That's why I'm using an ignore arg-type directive there for now.
- Regarding `2` and `3` above, I don't introspect `op_schema.op._schema.arguments` to verify any `None` arguments are `torch.OptionalType`, leaving adherence to the schema contract the responsibility of the given op. Regarding `2`, I assume any `desired_spec` will be either a `DTensorSpec` or `None`, so only `None` can be Falsy in this context.
- I considered altering the active `args_schema`, which could be inspected and aligned with the active `output_strategy.input_specs` in some cases and avoid the changes in `3`, but I think that would rely on one of (among other possibilities):
    - all supported op signatures having optional Tensors (`DTensorSpec`) args after required tensors (which isn't a planned required as far as I know),
    -  (somewhat brittle) heuristic-driven arg alignment
    -  only supporting kwargs etc.

### Added Tests

To facilitate detection of future `requires_grad` pattern op failure modes as `DTensor` evolves, I added the following two tests:

1. `test/distributed/_tensor/test_math_ops.py DistMathOpsTest.test_layer_norm_bwd_req_grad`
    - Tests `native_layer_norm_backward` specifically with 20 subtests that sweep valid `output_mask` patterns along in different LayerNorm dimensionality and `elementwise_affine` configurations.

2. `test/distributed/tensor/parallel/test_tp_examples.py DistTensorParallelExampleTest.test_transformer_req_grad`
    - Samples a subset of `requires_grad` patterns in a more realistic (relative to the `LayerNorm`-specific test) Transformer usage context with different `dtype` and `is_seq_parallel` configurations. Note since there was substantial overlap with the existing `test_transformer_training` test, I took the opportunity to refactor that test to allow relevant code-sharing. I also added an `ExpCommCounts` `NamedTuple` to facilitate the addition of additional `requires_grad` patterns that we may want to test in the future which may result in different comm counts. I created the separate `requires_grad` test to allow decoupling the multi-iteration `test_transformer_training` test and allow addition of new `requires_grad` scenarios as desired while being mindful of resources.

Thanks again to the PyTorch distributed team for your immensely valuable contributions to the open-source ML community!

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133502
Approved by: https://github.com/XilunWu
2024-08-24 05:49:54 +00:00
8d3c6494ae [Inductor][FlexAttention] Rename IS_LAST_BLOCK to CHECK_BLOCK_BOUNDARY (#134378)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134378
Approved by: https://github.com/drisspg
2024-08-24 04:40:01 +00:00
5ad759ca33 [inductor] calibration inductor windows uts (2/N) (#134358)
skip unsupported UTs of `test\inductor\test_compile_worker.py`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134358
Approved by: https://github.com/jansel
2024-08-24 04:08:59 +00:00
5ae9c01794 [DTensor] Add naive replicate strategy for aten._linalg_eigh.default (#134284)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134284
Approved by: https://github.com/awgu
2024-08-24 03:50:05 +00:00
962e1f6ca7 [DTensor] Add aten.any.default,dim,out to linear_reduction_strategy (#134206)
For `aten.any`, we can use `reduce_op="sum"` as the linear reduction op.

When we do `all_reduce` with `reduce_op="sum"` on bool tensor, if one rank returns `torch.Tensor([True]) `, then the reduction result is `torch.Tensor([True]) `. Only when all ranks return `torch.Tensor([False]) ` would the reduction result be `torch.Tensor([False]) `. This matches with `any`'s behavior.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134206
Approved by: https://github.com/tianyu-l, https://github.com/chuanhaozhuge
2024-08-24 03:49:46 +00:00
5d39b14b68 [DeviceMesh] Add DeviceMesh slicing support for flatten mesh dim (#133839)
Add DeviceMesh slicing support such that we could do the following:
```
mesh_3d = init_device_mesh(
    self.device_type, (2, 2, 2), mesh_dim_names=("replicate", "shard", "cp")
)
shard_cp_mesh = mesh_3d["shard", "cp"]._flatten()
hsdp_mesh = mesh_3d["replicate", "shard_cp"]
# we can get the corresponding group of the flatten mesh through

group = shard_cp_mesh.get_group()
# or
group = mesh_3d["shard_cp"].get_group()
# or
mesh_3d.get_group(mesh_dim="shard_cp")
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133839
Approved by: https://github.com/fegin
ghstack dependencies: #133838
2024-08-24 03:49:29 +00:00
195abdb85c ppc64le: VSX Support for Inductor (#132746)
### Description

This PR extends the `VecISA` class to include support for VSX on the `ppc64le` architecture within the Inductor backend. This enhancement enables vectorization support, resulting in performance improvements when using `torch.compile()` on `ppc64le`.

### Fixes

- Resolved the `test_acosh_with_negative_large_input` test case in `test_cpu_repro.py` by implementing `acosh` for VSX.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132746
Approved by: https://github.com/jansel
2024-08-24 03:36:09 +00:00
519342962d Pass process group info into NcclWork (#134269)
Summary: Pass process group info into NcclWork

Test Plan: buck2 run mode/dev-nosan kineto/libkineto/fb/integration_tests:pytorch_execution_trace_integration_test

Differential Revision: D61677160

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134269
Approved by: https://github.com/wconstab
2024-08-24 01:04:43 +00:00
e2a87fb1e9 [ONNX] Update exporter logic (#134304)
Sync the exporter logic with torch-onnx at https://github.com/justinchuby/torch-onnx/compare/v0.1.12...v0.1.15.

https://github.com/pytorch/pytorch/issues/129277

- Create a `testing` module to facilitate testing model accuracy. The model is internal
- Improve decomp table
- Improve model verification logic
- Add tests

The next PRs will enable OpInfo tests and clean up existing code.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134304
Approved by: https://github.com/titaiwangms
2024-08-24 00:49:54 +00:00
a1d0b4d568 Add option to skip functional passes in the pattern matcher's replacement graph (#134364)
The pattern matcher runs DCE and remove_noop_ops on the replacement
graph by default. Previously we had a switch for the DCE. This PR
changes that switch to also control if we run remove_noop_ops.

The context was that there is silent incorrectness with
auto_functionalized. We use the Pattern matcher to decompose
auto_functionalized into a mutable op + clones; remove_noop_ops were
deleting the clones.

Future: can try #134363

Test Plan:
- new test. I wasn't able to produce a silently incorrect example so I
  settled for asserting that clones still exist in the post-grad graph.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134364
Approved by: https://github.com/eellison
ghstack dependencies: #133639
2024-08-24 00:38:55 +00:00
2c8fc3f4ce [inductor] Move imports to top of file in generated code (#134195)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134195
Approved by: https://github.com/eellison
ghstack dependencies: #134194
2024-08-24 00:35:57 +00:00
1aa0e35a04 [inductor] Remove dead code in multi_kernel.py (#134194)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134194
Approved by: https://github.com/eellison
2024-08-24 00:35:57 +00:00
4ff1a4dd0f [export] support set_grad_enabled hop in dynamo to enable re-tracing (#134281)
As titled. We added dynamo support for wrap_with_set_grad_enabled hop to support re-trace an exported program.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134281
Approved by: https://github.com/tugsbayasgalan
2024-08-24 00:35:53 +00:00
9dc47f5e62 [FlexAttention]Fix how we realize input buffers (#134351)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134351
Approved by: https://github.com/Chillee
2024-08-24 00:31:00 +00:00
4c28a0eb0b c10d/logging: add C10D_LOCK_GUARD (#134131)
This adds logs if we can't acquire locks in NCCLUtils and ProcessGroupNCCL for 30s.

This is motivated by some deadlocks were seeing and it's unclear if it's in NCCL or on the PyTorch side of things.

This required replacing most `std::mutex` with `std::timed_mutex` and `std::condition_variable_any` as appropriate.

Test plan:

existing CI for regressions

will add unit tests on `C10D_LOCK_GUARD`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134131
Approved by: https://github.com/c-p-i-o, https://github.com/fduwjj
2024-08-24 00:27:39 +00:00
e52e93e8fd Update scale-config files with linux.24xlarge.ephemeral (#134380)
Add linux.24xlarge.ephemeral  to scale config
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134380
Approved by: https://github.com/kit1980, https://github.com/ZainRizvi
2024-08-24 00:01:39 +00:00
54ff320519 [export] refactor ExportGraphSignature construction (#134059)
Refactors construction of ExportGraphSignature object for export & training IR, explicitly creating AOTAutograd signature for training IR. This will be helpful for upcoming refactors for placeholder naming & runtime asserts prettifying.

Changes:
- dedups `make_argument_spec` call, moved to export/graph_signature.py
- `_sig_to_specs` wrapped into new function `_convert_to_export_graph_signature`, directly converts GraphSignature -> ExportGraphSignature
- `_make_fx_helper` explicitly creates AOTAutograd GraphSignature object
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134059
Approved by: https://github.com/angelayi, https://github.com/ydwu4
2024-08-23 23:29:28 +00:00
aa9f4cc733 [Inductor][CPP] Support vectorization of remainder (#129849)
**Summary**
When check the vectorization status among 3 test suit, we found some operators disabled vectorization with message `Disabled vectorization: op: remainder`. In this PR, we add vectorization support of this op.

**Test Plan**
```
python -u -m pytest -s -v test/inductor/test_cpu_repro.py -k test_vec_remainder
python -u -m pytest -s -v test/inductor/test_cpu_repro.py -k test_int_div_vec
```

Differential Revision: [D61147014](https://our.internmc.facebook.com/intern/diff/D61147014)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129849
Approved by: https://github.com/jgong5, https://github.com/lezcano
2024-08-23 23:26:51 +00:00
286f2dba9f [2/N refactor NCCLPG error logs][c10d] Make msg in monitoring thread in NCCLPG more accurate and simpler (#134036)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134036
Approved by: https://github.com/wconstab
2024-08-23 23:21:28 +00:00
2cfc2da527 [export] Make move_to_device_pass function public (#134263)
Summary:
This is a follow-up of https://github.com/pytorch/pytorch/pull/133660

Here we make the `move_to_device_pass()` function publich so users can call it by `from torch.export.passes import move_to_device_pass`

Test Plan: CI

Differential Revision: D61671310

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134263
Approved by: https://github.com/angelayi
2024-08-23 23:18:30 +00:00
c638a40a93 [Caffe2] Remove unused AVX512 code (#133160)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133160
Approved by: https://github.com/albanD
2024-08-23 23:16:16 +00:00
1f19ccb5b3 [Inductor/Triton] Customize triton codegen to optionally preserve input dtype on tl.load (#132406)
Differential Revision: D60536337

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132406
Approved by: https://github.com/jfix71, https://github.com/blaine-rister
2024-08-23 22:58:43 +00:00
8ff3a5be1b [export] basic auto dynamic shapes (#133620)
Starter version of automatic dynamic shapes for export.

Creates enums `DIM.AUTO`, `DIM.STATIC`, allowing user to specify `AUTO` for dims in dynamic_shapes specs, meaning that corresponding dims are treated as dynamic, and relevant guards will do what's necessary (e.g. refine ValueRanges, set replacements based on equality, or even set static) without raising ConstraintViolationErrors. Basically allows the user to say, "a bunch of these dims can be dynamic, let export do model analysis and return the program with maximum possible dynamism, without complaining".

The usage for specifying `dynamic_shapes` is now:
```
AUTO -> dynamic by default, return whatever produce_guards() says, even if it's static
None/int/STATIC -> static
Dim/DerivedDim -> same as before - will complain if the min/max range is invalid, or if dims related to this are unspecified.
```

Caveat 1: specifying `AUTO` for a dim won't guarantee it'll be dynamic:

- specifying `AUTO` for a dim will return the maximum possible dynamism given your program and other specified constraints, but this can still mean you'll get a static program. For example, with the program below, x is specified dynamic, but it's equal to y, which is specified static, and with how we currently do things we won't promote y to dynamic, but will demote(?) x to static. So this can be surprising if you don't fully know your model, and/or missed one of your other inputs when specifying auto-dynamic shapes.
```
class Foo(torch.nn.Module):
    def forward(self, x, y):
        return x + y
inputs = (torch.randn(6), torch.randn(6))
export(Foo(), inputs, dynamic_shapes={"x": (DIM.AUTO,), "y": None})
```

Caveat 2: specifying `AUTO` and Dims in the same spec is still problematic:

- The way Dims/DerivedDims are currently handled is very strict. A Dim represents a symbol, and we require a user to specify the symbol for all dims governed by the symbol - that's why we've seen errors in the past like `The values of x must always be related to y by ...`, asking the user to specify the exact relation as in the program. We also require the specified min/max range to be a subset of the valid range from model analysis. All this doesn't compose well with specifying `AUTO` just yet - for example in the program below, ideal behavior could be to return a dynamic program, where `dx = x.size(0) = y.size(0)` has range (3,6). Unfortunately this crashes, and correct behavior is to specify `dx` for both inputs. So currently we raise a UserError and crash if both Dims + `AUTO` are present in the spec.
```
class Foo(torch.nn.Module):
    def forward(self, x, y):
        return x + y
inputs = (torch.randn(6), torch.randn(6))
export(Foo(), inputs, dynamic_shapes={"x": (DIM.AUTO,), "y": {0: Dim("dx", min=3, max=6)}})  # this doesn't work, because x & y and related
```

Implementation details:

This is done by setting `assume_static_by_default=False`, and doing a transform on the `dynamic_shapes` spec to preserve semantics. `assume_static_by_default=False` will treat unspecified dims or Nones as dynamic. This is the opposite of what `export.export()` currently does - unspecified Dims/Nones are treated as static. Historically this static-by-default behavior, where the user deals with fewer guards, has been desirable, and we would like to respect that in this implementation. So this internal spec transformation is added, `_transform_shapes_for_default_dynamic()`, does the spec conversion necessary to be compatbile with dynamic by default. Specifically, AUTOs are converted into Nones, and Nones/unspecified dims are filled in with explicitly static constraints.

For example, this would look like, for a 3-d tensor: `{0: DIM.AUTO, 1: None, 2: Dim("dx")} -> {0: None, 1: 32, 2: Dim("dx")}`

This does seem overly complicated, but it's done to preserve dynamic shapes semantics for `torch._dynamo.export()`, which already uses `assume_static_by_default=False`, and follows the same process for generating shape constraints , via `_process_dynamic_shapes`. There the semantics are:
```
None/unspecified: dynamic by default
Dim/DerivedDim: also a strict assertion
```

If we don't care about BC for `_dynamo.export(dynamic_shapes)`, then we can just modify semantics for `_process_dynamic_shapes()` and change all the relevant tests in `test/dynamo/test_export.py`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133620
Approved by: https://github.com/avikchaudhuri
2024-08-23 22:56:39 +00:00
f5a2a22dc4 [export] Fix unflattener to respect nn.Parameter requires_grad (#134353)
Summary: Fixes P1539870235

Test Plan: CI

Differential Revision: D61726403

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134353
Approved by: https://github.com/pianpwk
2024-08-23 22:49:34 +00:00
eaa2c0e009 Improves error message when passing wrong tensor type to torch.nn.functional.one_hot (#134209)
The function expects a Tensor of type LongTensor. It currently throws the following error: "one_hot is only applicable to index tensor." which, imo, does not provide the user with enough information on what the problem is.

PR simply adds extra information to the error message on this specific scenario.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134209
Approved by: https://github.com/mikaylagawarecki
2024-08-23 22:40:05 +00:00
09a82f3d24 [EZ][BE] Delete references to non-existing AWS_SCCACHE secrets (#134370)
First of all, none of the binary builds should be using sccache for security and reliability reasons (as distributed cache can become corrupted/compromised), but even if they do all authentication to AWS service shoudl be done via OIDC

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134370
Approved by: https://github.com/seemethere, https://github.com/atalman
2024-08-23 22:23:48 +00:00
adf0f50cc7 [Compile] Add NEON implementation for bf16->fp32 cast (#134297)
This changes assembly generated for the following routine
```cpp
void bfloat16tofloat(c10::BFloat16* in, float* out) {
        auto tmp0 = at::vec::Vectorized<c10::BFloat16>::loadu(in, 8);
        auto tmp1 = at::vec::convert<float>(tmp0);
        tmp1.store(out);
}
```
from
```asm
bfloat16tofloat(c10::BFloat16*, float*):
0000000000000034        stp     x29, x30, [sp, #-0x10]!
0000000000000038        mov     x29, sp
000000000000003c        sub     x9, sp, #0x90
0000000000000040        and     sp, x9, #0xffffffffffffffe0
0000000000000044        mov     x8, #0x0
0000000000000048        adrp    x9, 0 ; 0x0
000000000000004c        ldr     x9, [x9]
0000000000000050        ldr     x9, [x9]
0000000000000054        str     x9, [sp, #0x88]
0000000000000058        stp     xzr, xzr, [sp, #0x10]
000000000000005c        ldr     q0, [x0]
0000000000000060        str     q0, [sp]
0000000000000064        ldr     q1, [sp, #0x10]
0000000000000068        stp     q0, q1, [sp, #0x20]
000000000000006c        add     x9, sp, #0x40
0000000000000070        add     x10, sp, #0x20
0000000000000074        add     x11, x10, x8
0000000000000078        ldp     d0, d1, [x11]
000000000000007c        shll.4s v0, v0, #16
0000000000000080        shll.4s v1, v1, #16
0000000000000084        stp     q0, q1, [x9], #0x20
0000000000000088        add     x8, x8, #0x10
000000000000008c        cmp     x8, #0x20
0000000000000090        b.ne    0x74
0000000000000094        add     x8, sp, #0x40
0000000000000098        ld1.4s  { v0, v1 }, [x8]
000000000000009c        st1.4s  { v0, v1 }, [x1]
00000000000000a0        ldr     x8, [sp, #0x88]
00000000000000a4        adrp    x9, 0 ; 0x0
00000000000000a8        ldr     x9, [x9]
00000000000000ac        ldr     x9, [x9]
00000000000000b0        cmp     x9, x8
00000000000000b4        b.ne    0xc4
00000000000000b8        mov     sp, x29
00000000000000bc        ldp     x29, x30, [sp], #0x10
00000000000000c0        ret
00000000000000c4        bl      0xc4
```
to
```asm
bfloat16tofloat(c10::BFloat16*, float*):
0000000000000034        ldr     q0, [x0]
0000000000000038        shll.4s v1, v0, #16
000000000000003c        shll2.4s        v2, v0, #16
0000000000000040        st1.4s  { v1, v2 }, [x1]
0000000000000044        ret
```

And as result speeds up `python3 torchchat.py generate stories110M --num-samples 3 --compile --device cpu --dtype bfloat16` from 33 to 90 tokens/sec

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134297
Approved by: https://github.com/kimishpatel
2024-08-23 22:22:59 +00:00
69813dbbfd [export] Schematize nn_module_stack serialization (#134049)
`nn_module_stack` was previously serialized to string by adding commas between the module_path and module_type. This error prone when the `nn_module_stack` itself contains commas.

This PR fixes this by creating a dictionary to store the `nn_module_stack` and serialize it to string via `json.dumps()`

Fixes #131941

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134049
Approved by: https://github.com/angelayi
2024-08-23 21:50:01 +00:00
78d69bfe11 [SymmetricMemory] introduce multicast support, multimem_all_reduce_ and multimem_one_shot_all_reduce (#133424)
### Summary
- Added multicast support to SymmetricMemory. If the cuda runtime and cuda driver have multicast support, SymmetricMemory associate all peer buffers with a multicast object and exposes the multicast virtual address.
- Implemented `multimem_all_reduce_` and `multimem_one_shot_all_reduce` based on the multicast support. The two variants shows different performance characteristic for different message size. We plan to use Inductor for collective algo selection (and required symmetric memory buffer allocation).

### Benchmark

8xH100 (non-standard version with HBM2e at 650W). NVSwitch V3 with NVLS support.

![image](https://github.com/user-attachments/assets/4998a16b-c2c0-4797-9dd0-1da2303df947)

![image](https://github.com/user-attachments/assets/278ad361-52cb-4864-82c6-bb67e8d0a3fe)

Differential Revision: [D61682507](https://our.internmc.facebook.com/intern/diff/D61682507)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133424
Approved by: https://github.com/yf225, https://github.com/weifengpy
2024-08-23 20:09:20 +00:00
2ca7f0fc5c [Minimizer] for sequential mode, respect find_all setting (#134339)
Summary: Currently, for sequential mode, minimizer search terminates after a node is excluded via the user defined exclusion_fn. However, on some occasions we would like the search to continue past that for the remaining nodes. In this diff I am changing the termination criteria to respect the find_all setting, where we continue sequential search if it is set.

Test Plan: CI

Differential Revision: D61720262

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134339
Approved by: https://github.com/jfix71
2024-08-23 19:59:43 +00:00
58e2cf364b Make DTensor sharding propagation for scaled_dot_product_efficient_attention and scaled_dot_product_flash_attention more conservatively cached (#134146)
Fixes #134050

### The issue

The current `DTensor` sharding propagation caching policy for  `aten.scaled_dot_product_efficient_attention` (default) can result in silently incorrect gradients or trigger an IMA after cuda kernel launch in mixed `require_grad` configurations. Please see issue #134050 for a full description of the observed failure patterns along with reproduction. Note `aten.scaled_dot_product_flash_attention` presents a similar concern so this PR addresses both [as discussed here.](https://github.com/pytorch/pytorch/issues/134050#issuecomment-2299887602)

### Remediation

While there are a number of ways this could be addressed, the most straightforward remediation is to modify the sharding propagation caching policy of [`aten._scaled_dot_product_efficient_attention.default`](b03381cac2/torch/distributed/_tensor/ops/_matrix_ops.py (L337-L340)), registering it with `schema_info=RuntimeSchemaInfo(4)` to prevent cache sharing between differing `compute_log_sumexp` values i.e.

```python
@register_op_strategy(aten._scaled_dot_product_efficient_attention.default, schema_info=RuntimeSchemaInfo(4))
def scaled_dot_product_efficient_attention_strategy(
...
```

[As discussed here](https://github.com/pytorch/pytorch/issues/134050#issuecomment-2299887602),  since `aten::_scaled_dot_product_flash_attention` could be affected by a similar issue wrt `return_debug_mask`, this PR adjusts the sharding propagation caching policy for that op as well.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134146
Approved by: https://github.com/tianyu-l
2024-08-23 19:43:30 +00:00
157de30f53 [sparse] Update cuSPARSELt to v0.6.2 (#134022)
Summary:

This PR updated cuSPARSELt to v0.6.2. I think we should land
https://github.com/pytorch/pytorch/pull/128534 first though.

Most of this PR is just enabling tests to run when cuSPARSELt v0.6.2 is
available.

Unfortunately was running into a bug with fp32 support on Hopper, so I
removed fp32 support from the cuSPARSELt backend. I think this should be
fine since almost everybody uses the bfloat/float16/int8 kernels.

Test Plan:

Reviewers:

Subscribers:

Tasks:

Tags:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134022
Approved by: https://github.com/jerryzh168, https://github.com/malfet
ghstack dependencies: #128534
2024-08-23 19:34:53 +00:00
74a9001ada [aoti] Add additional custom op input type support (#132454)
Summary:
Added support for more custom op input types, now only missing dtype,
layout, memory format as input type, since we need to add some more testing for
mapping the types to their integer values
([previous
comment](https://github.com/pytorch/pytorch/pull/126215#discussion_r1617428066)).

This PR also replaces the `DynamicArg` struct's `serialized_arg_val` with
`list_item_types`, which stores an optional list of strings, where each string
represents the type of the value within this list. This is only used for
parsing lists of optional tensors, where we need to know if a specific value in
the list should be a tensor, or a None. Replacing with a list of strings is
also better than storing the actual json format because then we don't need to
parse the json string during the runtime, and can just loop over a preprocessed
list of strings.

Test Plan: `buck2 run @//mode/dev-nosan //caffe2/test/inductor:test_aot_inductor -- -r "test_custom_"`

Reviewed By: desertfire

Differential Revision: D60295995

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132454
Approved by: https://github.com/desertfire
2024-08-23 19:11:36 +00:00
f8fbfe5846 Always emit end events even on failure, use thread local storage for stack (#134279)
Summary:
We should always emit an end event in a finally block so that if a unit test or job fails, the stack is still correct.

Also, we use thread local storage for the stack, so that in multithreaded scenarios the stack will still be correctly added.

Test Plan:
Run benchmark and see that everything still works
Run
```
TORCH_LOGS=dynamo buck run test/functorch:test_aotdispatch -- -r test_backward_mutation_on_grad_out
```
With some extra logging to see that start events with the correct stack are emitted, and the end events are also emitted even though the test fails at runtime.

Differential Revision: D61682556

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134279
Approved by: https://github.com/aorenste
2024-08-23 18:13:13 +00:00
a23d86c178 [hop] ban creating hop by directly instantiating HigherOrderOperator. (#133645)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133645
Approved by: https://github.com/zou3519
2024-08-23 17:28:02 +00:00
3546628a2a Allow mp.start_processes to create processes in parallel (#133707)
Summary:
Background discussion in https://fb.workplace.com/groups/319878845696681/posts/1226087421742481

and pytorch issue filed https://github.com/pytorch/pytorch/issues/133010

one way to fix this problem is to add an option to parallel start processes on pytorch side.

Test Plan: Tested aps run in problem and things are in parallel now (next diff)

Differential Revision: D61301603

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133707
Approved by: https://github.com/d4l3k, https://github.com/ezyang
2024-08-23 17:11:20 +00:00
afd081c9d4 [inductor] Fix needs_fixed_stride_order silent incorrectness (#133639)
Fixes #128084

The approach is option 2 of what Elias suggested in the comment
thread:
- We require tensors to have the correct stride at usage. This may
  involve a clone; if there was a clone and then a mutation into it
  then we copy_ back the result of the mutation.

The reason why I went this approach was because it was the easiest and
Inductor already works really hard to remove additional clones/copy_.

There are some cases that this doesn't generate efficient code for; for
example, if the tensor is a view, we don't change the base of the view
to have the right stride order, instead we do a clone.
The view case isn't very common so I'm ignoring it for now but we could
improve this in the future.

Test Plan:
- tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133639
Approved by: https://github.com/eellison
2024-08-23 17:07:58 +00:00
2553278bae .github/merge_rules.yaml: added multiprocessing to Distributed (#134262)
This allows the Distributed team to approve changes to torch.multiprocessing which is used by torchelastic/run.

Example PR: https://github.com/pytorch/pytorch/pull/133707

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134262
Approved by: https://github.com/wconstab, https://github.com/PaliC
2024-08-23 17:07:20 +00:00
6eae569546 [dynamo][fix] always use POSIX-style path in trace_rule.py (#133987)
We are hardcoding some path in string in POSIX style. This will lead to different results on Windows. This PR force all paths to be in POSIX-style.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133987
Approved by: https://github.com/jansel
2024-08-23 16:28:57 +00:00
2eef749b31 [Inductor][FlexAttention] Fix IS_DIVISIBLE bug and add unit tests (#134055)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134055
Approved by: https://github.com/Chillee
2024-08-23 16:11:09 +00:00
8ae4f82243 [aotd] Support HOP effects in backward (#132638)
Support of effectful operations in backward:

1/ AOTD collects metadata from forward fn only, so we can have usage of effectful ops in backward, that were not used in forward => Allowing tokens discovery during joint function .

FunctionalTensorMode holds _tokens, in Joint function after tracing forward we memoize _tokens as `_tokens_forward_output`.

2/ Tokens are added as primals inputs (forward) in EffectTokensWrapper.
Tokens that will be used in backward are in partitioner saved values. We do not have control on which positions they are saved in forward outputs.

2/ If new tokens discovered in backward after tracing joint_fn, the result graph will be manually added in the end of primals.
_aot_autograd/utils.py

3/ All effectful ops during backward are marked with 'must_be_in_backward' partitioner_tag, to prevent partiitoner to place them in forward.

For that functional_tensor_mode got new optional state `self._effects_partitioner_tag` for effectful ops, to set after tracing forward.

There are additional changes in partitioner to improve functionality of 'must_be_in_backward'

4/ Unlift tokens now should run for both forward and backward.
- As saved for backward tokens are placed on non static places - we identify input and output tokens to erase, by input and output of `with_effects` operation
- In forward we can have input tokens, discovered in backward, that are not used in with_effects ops in forward, but saved for backward. We identify them by position in forward inputs.

5/ Adding aot debug logging for graphs before unlifting and before adding additional primal for backward tokens.

Tests:
```
python test/higher_order_ops/test_with_effects.py
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132638
Approved by: https://github.com/bdhirsh
2024-08-23 15:30:58 +00:00
7fd3b69886 Revert "[dynamo][super] Improve handling of getattr on super (#134039)"
This reverts commit 1da3a049dac3c78554506d5ef9ede55b7c2b774d.

Reverted https://github.com/pytorch/pytorch/pull/134039 on behalf of https://github.com/jeanschmidt due to broke internal torchrec signals, see [D61670727](https://www.internalfb.com/diff/D61670727) ([comment](https://github.com/pytorch/pytorch/pull/134039#issuecomment-2307151643))
2024-08-23 13:57:04 +00:00
09127b096c Revert "[inductor] Fix needs_fixed_stride_order silent incorrectness (#133639)"
This reverts commit 8604c0a150b12e0ba3f9a6faaf52498370f21368.

Reverted https://github.com/pytorch/pytorch/pull/133639 on behalf of https://github.com/jeanschmidt due to Broke internal fbgemm signals, see [D61670495](https://www.internalfb.com/diff/D61670495) ([comment](https://github.com/pytorch/pytorch/pull/133639#issuecomment-2307133060))
2024-08-23 13:48:04 +00:00
75c22dd8bf Revert "[dynamo][fix] always use POSIX-style path in trace_rule.py (#133987)"
This reverts commit b23779ef0af8d4f06e667da460c43d264359f1f0.

Reverted https://github.com/pytorch/pytorch/pull/133987 on behalf of https://github.com/albanD due to This breaks windows trunk jobs ([comment](https://github.com/pytorch/pytorch/pull/133987#issuecomment-2306956764))
2024-08-23 12:08:56 +00:00
0e49b2f18e [dynamo][itertools] support itertools.tee (#133771)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133771
Approved by: https://github.com/jansel
ghstack dependencies: #133769, #133778, #133779
2024-08-23 10:13:12 +00:00
8d90392fb0 [dynamo] simplify implementation for builtins.sum (#133779)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133779
Approved by: https://github.com/jansel
ghstack dependencies: #133769, #133778
2024-08-23 10:10:19 +00:00
6c0b15e382 [dynamo] simplify implementation for functools.reduce (#133778)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133778
Approved by: https://github.com/jansel
ghstack dependencies: #133769
2024-08-23 09:10:44 +00:00
cc3a76edba [dynamo] simplify polyfill registration for builtins.all and builtins.any (#133769)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133769
Approved by: https://github.com/jansel
2024-08-23 09:05:24 +00:00
ca3f48dd5b [XPU] Set make triton install pre-built whl by default (#130313)
Now the user could install the pre-built `triton` for xpu by calling the following:

```Bash
export USE_XPU=1
make triton
```

[Dev Only]: If the user wishes to build it from the source, one could set an additional flag:

```Bash
export TRITON_XPU_BUILD_FROM_SOURCE=1
export USE_XPU=1
make triton
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130313
Approved by: https://github.com/chuanqi129, https://github.com/EikanWang, https://github.com/atalman
2024-08-23 07:36:34 +00:00
55cdcef0f7 [fp8 rowwise] Work around CUDA Invalid Memory Access bug (#134227)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134227
Approved by: https://github.com/drisspg, https://github.com/eqy
ghstack dependencies: #134223, #134224, #134225, #134226
2024-08-23 07:27:55 +00:00
9d81767d43 [fp8 rowwise] Rework dispatch logic (#134226)
It's likely a matter of opinion, but I find this new version to have less duplication, even if it might have more boilerplate.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134226
Approved by: https://github.com/drisspg
ghstack dependencies: #134223, #134224, #134225
2024-08-23 07:27:55 +00:00
0afb4872aa [fp8 rowwise] Support non-contiguous inputs and clarify checks (#134225)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134225
Approved by: https://github.com/drisspg
ghstack dependencies: #134223, #134224
2024-08-23 07:27:52 +00:00
9f8d3f511f [fp8 rowwise] Some clean-up (#134224)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134224
Approved by: https://github.com/drisspg
ghstack dependencies: #134223
2024-08-23 07:27:48 +00:00
2f198605ac [fp8 rowwise] Simplify epilogue visitor tree via common blocks (#134223)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134223
Approved by: https://github.com/drisspg
2024-08-23 07:27:41 +00:00
25b2e46573 [dynamo] add max iterator limit while inlining generators (#134233)
Related:

- #133879

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134233
Approved by: https://github.com/jansel
2024-08-23 07:03:31 +00:00
673b9bd561 [WIP] [Inductor UT] Reuse inductor UT for intel GPU test/inductor/test_multi_kernel.py (#133943)
[Inductor UT] Reuse Inductor test case for Intel GPU.
Reuse `test/inductor/test_multi_kernel.py`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133943
Approved by: https://github.com/EikanWang, https://github.com/jansel

Co-authored-by: Justin Chu <justinchu@microsoft.com>
Co-authored-by: Jesse Cai <jcjessecai@gmail.com>
Co-authored-by: Sahdev Zala <spzala@us.ibm.com>
Co-authored-by: rzou <zou3519@gmail.com>
Co-authored-by: FFFrog <ljw1101.vip@gmail.com>
Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
Co-authored-by: yanbing-j <yanbing.jiang@intel.com>
Co-authored-by: Will Feng <yf225@cornell.edu>
Co-authored-by: Bin Bao <binbao@meta.com>
Co-authored-by: Yiming Zhou <yimingzhou@meta.com>
Co-authored-by: Yanbo Liang <ybliang8@gmail.com>
2024-08-23 05:52:29 +00:00
80846caa8c [inductor] fix dynamic size array(vla) build error on msvc v4 (#134221)
MSVC don't support dynamic array.
Ref: https://stackoverflow.com/questions/56555406/creating-dynamic-sized-array-using-msvc-c-compiler

We tried to solutions:
1. use std::vector to instead of it in previous PR: https://github.com/pytorch/pytorch/pull/134140, but it changed variable's type and failed at UTs.
2. Use `std::unique_ptr` to instead of it in PR: https://github.com/pytorch/pytorch/pull/134156, @jansel reviewed and give comments:  https://github.com/pytorch/pytorch/pull/134156#pullrequestreview-2253091693. It is make sense, allocation memory maybe make code run slower.
3. Use fixed size array to instead of it in PR: https://github.com/pytorch/pytorch/pull/134210, fixed size is hard to process the situlation, reserved size if small than CPU number.
> a. Use min() function limited is local test failed: https://github.com/pytorch/pytorch/pull/134210#issuecomment-2304447729
> b. Dynamic select fixed size or dynamic array: https://github.com/pytorch/pytorch/pull/134210#issuecomment-2304128666 . It makes code too complex to maintains.

Discussed with origin PR(https://github.com/pytorch/pytorch/pull/115620) author @zhuhaozhe, we think:
1. MSVC it the only one compiler, which not support VLA.
2. MSVC it worse performance than other compilers, use `std::unique_ptr` for MSVC and make it works.
3. For other compilers, keep using current `VLA` code.
4. For Windows users, they can use `clang-cl` or `icx` to get better performance than MSVC.
5. Discussed with @jansel , we need to move compiler check to python side, and make output code cleaner.

Reproduce UT:
```cmd
pytest test/inductor/test_cpu_repro.py -v -k test_reduction_with_dynamic_threads
```

Error msg:
```cmd
C:/Users/Xuhan/AppData/Local/Temp/tmpncykej5v/a4/ca4534cazplidnf7vopaaxaifqkjiyhxm3h2gsylgztputbaeybx.cpp(13): error C2131: expression did not evaluate to a constant
C:/Users/Xuhan/AppData/Local/Temp/tmpncykej5v/a4/ca4534cazplidnf7vopaaxaifqkjiyhxm3h2gsylgztputbaeybx.cpp(13): note: failure was caused by a read of a variable outside its lifetime
C:/Users/Xuhan/AppData/Local/Temp/tmpncykej5v/a4/ca4534cazplidnf7vopaaxaifqkjiyhxm3h2gsylgztputbaeybx.cpp(13): note: see usage of 'max_threads'
C:/Users/Xuhan/AppData/Local/Temp/tmpncykej5v/a4/ca4534cazplidnf7vopaaxaifqkjiyhxm3h2gsylgztputbaeybx.cpp(16): error C3863: array type 'float [max_threads]' is not assignable
```
Genarated code:
```c++

#include "C:/Users/Xuhan/AppData/Local/Temp/tmpt6mxcjzi/j2/cj22tgrdgh42wbunl7gdptg2lintcziox2kmr7rdbcc6n2njrhgx.h"
extern "C" __declspec(dllexport) void kernel(const float* in_ptr0,
                       const float* in_ptr1,
                       float* out_ptr0,
                       float* out_ptr1)
{
    {
        {
            float tmp_acc0 = 0;
            at::vec::Vectorized<float> tmp_acc0_vec = at::vec::Vectorized<float>(0);
            int max_threads = omp_get_max_threads();
            float tmp_acc0_arr[max_threads];
            for (int tid = 0; tid < max_threads; tid++)
            {
                tmp_acc0_arr[tid] = 0;
            }
            at::vec::Vectorized<float> tmp_acc0_vec_arr[max_threads];
            for (int tid = 0; tid < max_threads; tid++)
            {
                tmp_acc0_vec_arr[tid] = at::vec::Vectorized<float>(0);
            }
            #pragma omp parallel
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134221
Approved by: https://github.com/zhuhaozhe, https://github.com/jansel
2024-08-23 05:40:08 +00:00
49b9f2d8b0 [inductor] fix signbit build fail on Windows. (#134229)
Reproduce UT:
```cmd
pytest test/inductor/test_torchinductor.py -v -k test_randint_int64_mod_cpu
```

Error message:
```cmd
cl : Command line warning D9025 : overriding '/openmp' with '/openmp:experimental'
c6airoloxwj4prmlejdyo5ybp43xa2yo5rbnpk4ttw3oifu6noor.cpp
C:/Users/Xuhan/AppData/Local/Temp/tmpx1fj2bd4/6a/c6airoloxwj4prmlejdyo5ybp43xa2yo5rbnpk4ttw3oifu6noor.cpp(23): error C2668: 'signbit': ambiguous call to overloaded function
C:\Program Files (x86)\Windows Kits\10\include\10.0.22000.0\ucrt\corecrt_math.h(309): note: could be 'bool signbit(float) noexcept'
C:\Program Files (x86)\Windows Kits\10\include\10.0.22000.0\ucrt\corecrt_math.h(314): note: or       'bool signbit(double) noexcept'
C:\Program Files (x86)\Windows Kits\10\include\10.0.22000.0\ucrt\corecrt_math.h(319): note: or       'bool signbit(long double) noexcept'
C:/Users/Xuhan/AppData/Local/Temp/tmpx1fj2bd4/6a/c6airoloxwj4prmlejdyo5ybp43xa2yo5rbnpk4ttw3oifu6noor.cpp(23): note: while trying to match the argument list '(__int64)'
C:/Users/Xuhan/AppData/Local/Temp/tmpx1fj2bd4/6a/c6airoloxwj4prmlejdyo5ybp43xa2yo5rbnpk4ttw3oifu6noor.cpp(24): error C2668: 'signbit': ambiguous call to overloaded function
C:\Program Files (x86)\Windows Kits\10\include\10.0.22000.0\ucrt\corecrt_math.h(309): note: could be 'bool signbit(float) noexcept'
C:\Program Files (x86)\Windows Kits\10\include\10.0.22000.0\ucrt\corecrt_math.h(314): note: or       'bool signbit(double) noexcept'
C:\Program Files (x86)\Windows Kits\10\include\10.0.22000.0\ucrt\corecrt_math.h(319): note: or       'bool signbit(long double) noexcept'
C:/Users/Xuhan/AppData/Local/Temp/tmpx1fj2bd4/6a/c6airoloxwj4prmlejdyo5ybp43xa2yo5rbnpk4ttw3oifu6noor.cpp(24): note: while trying to match the argument list '(int64_t)'
```

Genarated code:
```c++

#include "C:/Users/Xuhan/AppData/Local/Temp/tmpcjnxnvkl/4f/c4ff4q4pxgo3yprbo2nkfopkt3qgi6rmptfpgpl2iylgtunvizwn.h"
extern "C" __declspec(dllexport) void kernel(const int64_t* in_ptr0,
                       int64_t* out_ptr0)
{
    #pragma omp parallel num_threads(8)
    {
        int tid = omp_get_thread_num();
        {
            #pragma omp for
            for(int64_t x0=static_cast<int64_t>(0LL); x0<static_cast<int64_t>(20LL); x0+=static_cast<int64_t>(1LL))
            {
                auto tmp0 = in_ptr0[static_cast<int64_t>(0LL)];
                auto tmp1 = x0;
                auto tmp2 = c10::convert<int32_t>(tmp1);
                auto tmp3 = static_cast<int64_t>(-5);
                auto tmp4 = static_cast<int64_t>(5);
                auto tmp5 = randint64_cpu(tmp0, tmp2, tmp3, tmp4);
                auto tmp6 = static_cast<int64_t>(10);
                auto tmp7 = mod(tmp5, tmp6);
                auto tmp8 = static_cast<int32_t>(0);
                auto tmp9 = tmp7 != tmp8;
                auto tmp10 = std::signbit(tmp7);
                auto tmp11 = std::signbit(tmp6);
                auto tmp12 = tmp10 != tmp11;
                auto tmp13 = tmp9 & tmp12;
                auto tmp14 = decltype(tmp7)(tmp7 + tmp6);
                auto tmp15 = tmp13 ? tmp14 : tmp7;
                out_ptr0[static_cast<int64_t>(x0)] = tmp15;
            }
        }
    }
}
```

Fixed by cast `std::signbit` to `long double`: https://learn.microsoft.com/en-us/cpp/c-runtime-library/reference/signbit?view=msvc-170

Local test passed:
<img width="848" alt="image" src="https://github.com/user-attachments/assets/e4467256-a068-40ef-a6ff-19b442e9116d">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134229
Approved by: https://github.com/jansel
2024-08-23 05:40:05 +00:00
311af3b988 Add new ops wrapped_linear_prepack and wrapped_quantized_linear_prepacked (#134232)
Summary:
This diff adds two new operators torch.ops._quantized.wrapped_linear_prepack and torch.ops._quantized.wrapped_quantized_linear_prepacked. It is a decomposition of the op torch.ops._quantized.wrapped_quantized_linear added in the previous diff.

We decomposed in this way as packed weight could be computed early so we don;t need to do it in every forward in AOTI

Reviewed By: jerryzh168

Differential Revision: D61395887

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134232
Approved by: https://github.com/houseroad
2024-08-23 04:54:26 +00:00
b23779ef0a [dynamo][fix] always use POSIX-style path in trace_rule.py (#133987)
We are hardcoding some path in string in POSIX style. This will lead to different results on Windows. This PR force all paths to be in POSIX-style.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133987
Approved by: https://github.com/jansel
2024-08-23 04:33:05 +00:00
a699bd1155 [dynamo] Cache _dynamo.disable results (#134272)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134272
Approved by: https://github.com/yf225, https://github.com/jansel
2024-08-23 04:20:50 +00:00
b454c51060 remove dynamic_dim (#134211)
Summary: As promised in https://github.com/pytorch/pytorch/pull/134045.

Test Plan: existing

Differential Revision: D61646937

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134211
Approved by: https://github.com/angelayi
2024-08-23 04:13:03 +00:00
058302494c [AOTI][Tooling] Add a test case where config.debug_intermediate_value_printer=True to check codegen (#133326)
Summary:
As title.

Add a test case in test_aot_inductor to check for codegen (i.e. `aoti_torch_print_tensor_handle` is inserted as expected for debugging printer) for both cpu and cuda based on a simple `addmm` test model.

Test Plan:
```
AOT_INDUCTOR_DEBUG_INTERMEDIATE_VALUE_PRINTER=1 TORCHINDUCTOR_FORCE_DISABLE_CACHES=1  TORCHINDUCTOR_ABI_COMPATIBLE=1 TORCH_COMPILE_DEBUG=1 TORCH_LOGS="+graph, inductor, +schedule, output_code" buck2 run -c fbcode.enable_gpu_sections=true -c fbcode.nvcc_arch=h100 @//mode/opt fbcode//caffe2/test/inductor:test_aot_inductor -- -r test_aoti_debug_printer_codegen_abi_compatible_{cuda/cpu}
```

Differential Revision: D61169068

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133326
Approved by: https://github.com/ColinPeppler
2024-08-23 02:12:21 +00:00
d2c60749ac [Inductor][FlexAttention] Respect user's input kernel_options (#134065)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134065
Approved by: https://github.com/Chillee
2024-08-23 01:21:05 +00:00
8301add833 [4/N] Further refactor FR script to make it more modulized (#134196)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134196
Approved by: https://github.com/c-p-i-o
2024-08-23 01:15:29 +00:00
bcfc560aea [Profiler/CPU] Add Test for Dynamic Activity Toggling [4/n] (#134149)
Summary: Add tests that check function events for dynamic activity toggling for both GPU and CPU events. Also added comments from previous GH comments

Test Plan: Make sure all tests pass

Differential Revision: D61617514

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134149
Approved by: https://github.com/aaronenyeshi
2024-08-23 01:13:42 +00:00
bf5addb613 [FlexAttention] Enable different qk and v head-dims (#134043)
# Summary
Adds the option for the head dims to be different between QK and V tensors.

Fixes issue: https://github.com/pytorch/pytorch/issues/133674

V_DIM > QK_DIM is blocked by landing: https://github.com/triton-lang/triton/pull/4138 / https://github.com/triton-lang/triton/pull/4540

Into PyTorch's triton branch.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134043
Approved by: https://github.com/Chillee
2024-08-23 01:06:57 +00:00
7c93c4f8cf [CI][dashboard] Change aarch64 perf run (#134265)
Summary: Reduce the aarch64 dashboard run to only test the default config, until we solve the timeout issue. Also increase the frequency from nightly to 6 times a day, to see if we can reproduce the perf instability Nikita has observed.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134265
Approved by: https://github.com/malfet
2024-08-23 00:40:28 +00:00
b3821f1da1 [dynamo][guards][logs] Generate code_parts for debugging (#134181)
Fixes https://github.com/pytorch/pytorch/issues/132692

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134181
Approved by: https://github.com/youkaichao, https://github.com/jansel
ghstack dependencies: #133742, #134016, #134039
2024-08-22 23:40:37 +00:00
edbadc904b Do not broadcast uniqueId during a split (#133962)
When using split, we do not need to exchange the NCCL uniqueID at all.
This would avoid connecting to the TCPStore on each split operation.
@exported-using-ghexport

Differential Revision: [D60966980](https://our.internmc.facebook.com/intern/diff/D60966980/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133962
Approved by: https://github.com/shuqiangzhang
ghstack dependencies: #133960, #133961
2024-08-22 23:23:32 +00:00
b2eb0e8c6a docker: Use miniforge, install from pip (#134274)
Switch installation of the pytorch package to be installed from our download.pytorch.org sources which are better maintained.

As well, switching over the miniconda installation to a miniforge installation in order to ensure backwards compat for users expecting to have the conda package manager installed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134274
Approved by: https://github.com/malfet, https://github.com/atalman

Co-authored-by: atalman <atalman@fb.com>
2024-08-22 23:20:22 +00:00
30d7e7a1cd [XPU] Fix patch for old llvm package error for triton xpu (#134204)
Fixes #134199

The PR #133694 does a workaround to replace the str `"https://tritonlang.blob.core.windows.net/llvm-builds/"` with  `"https://oaitriton.blob.core.windows.net/public/llvm-builds/"` in `triton/python/setup.py`. However, in [newer version of Triton](06e6799f4e), it has already been changed to `"https://oaitriton.blob.core....` and don't need to be replaced.  But formerly, this will throw a runtime error.

This PR makes the `check_and_replace` logic won't fail in such a scenario. Both the old link and the newer link could work.

Also note that the `.ci/docker/common/install_triton.sh` does not need the fix, because its `sed` command won't be in effect if there is no such pattern.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134204
Approved by: https://github.com/chuanqi129, https://github.com/EikanWang, https://github.com/atalman
2024-08-22 23:18:44 +00:00
629bd6f718 Update FlexAttention with masking semantic (#133373)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133373
Approved by: https://github.com/yanboliang
2024-08-22 22:50:33 +00:00
e7929809f3 [c10d][ez] Add comments to CudaEventCache class (#134172)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134172
Approved by: https://github.com/d4l3k, https://github.com/kwen2501
2024-08-22 22:44:12 +00:00
b319fa3fd9 [ONNX] Opt into ruff fmt (#134120)
Add ONNX directory to use ruff format.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134120
Approved by: https://github.com/XuehaiPan, https://github.com/Skylion007
2024-08-22 22:44:03 +00:00
25499de814 Remove ncclIdToCommMap_. (#133961)
There is no purpose for this map structure, and it is incorrect in
some cases.  For example, when the uniqueID is not broadcasted to the
other processes.
@exported-using-ghexport

Differential Revision: [D60966882](https://our.internmc.facebook.com/intern/diff/D60966882/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133961
Approved by: https://github.com/shuqiangzhang
ghstack dependencies: #133960
2024-08-22 22:06:25 +00:00
b0cf287b46 [export][training ir migration] Fix getitem not exist (#134259)
Summary:
Make quantization tests compatible with the new training IR.

With the new batch norm node `torch.ops.aten.batch_norm.default`, we don't need an additional getitem node after the bn node, so tests need to be fixed to not check for the getitem node.

We added a capture_pre_autograd_graph_using_training_ir() function, which returns True when we are using the training ir, and False otherwise. This way, the code supports both training ir and the old ir.

For now, we are just rolling out the training ir for fbcode internal tests.

Test Plan:
```
buck2 run 'fbcode//mode/dev-nosan' fbcode//caffe2/test/quantization:test_quantization -- -r test_qat_preserve_source_fn_stack
buck2 run 'fbcode//mode/dev-nosan' fbcode//caffe2/test/quantization:test_quantization -- -r test_qat_update_shared_qspec
buck2 run 'fbcode//mode/dev-nosan' fbcode//caffe2/test/quantization:test_quantization -- -r test_conv2d
buck2 run 'fbcode//mode/dev-nosan' fbcode//caffe2/test/quantization:test_quantization -- -r test_qat_conv_bn_relu_fusion

buck2 run 'fbcode//mode/dev-nosan' fbcode//caffe2/test/quantization:test_quantization -- -r test_qat_conv_bn_fusion
buck2 run 'fbcode//mode/dev-nosan' fbcode//caffe2/test/quantization:test_quantization -- -r test_qat_conv_bn_fusion_literal_args
```

Reviewed By: andrewor14, tugsbayasgalan

Differential Revision: D61292102

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134259
Approved by: https://github.com/tugsbayasgalan
2024-08-22 22:00:14 +00:00
f0ba309d78 [CI][dashboard] Add jemalloc back for aarch64 (#134189)
Forward fix based on https://github.com/pytorch/pytorch/pull/133997#discussion_r1726004220
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134189
Approved by: https://github.com/malfet, https://github.com/huydhn
2024-08-22 21:08:39 +00:00
1b6bbaa016 Remove PMI dependencies in PyTorch (#133960)
This patch makes two changes:
1. Whenever ncclCommSplit accepts groupRanks in its config, we should
populate it.  This is independent of using PMI or not.  For example,
non-PMI NCCL can also use this information, if it chooses to.
2. Provide a user flag to decide when to do a uniqueId broadcast and
when to skip it.  This is a performance optimization, and not a
correctness requirement.  If the user forgets to set this, we will
do the uniqueId broadcast, which is wasteful (because it will be
ignored by NCCL), but not incorrect.
@exported-using-ghexport

Differential Revision: [D60966774](https://our.internmc.facebook.com/intern/diff/D60966774/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133960
Approved by: https://github.com/shuqiangzhang
2024-08-22 20:34:43 +00:00
ff61f55387 [Dynamo][autograd.Function] Supports ctx.set_materialize_grads (#133978)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133978
Approved by: https://github.com/zou3519
2024-08-22 20:06:17 +00:00
5633773188 Convert various jobs to be Linux Foundation fleet compatible (#134128)
Migrates a batch of workflows over to LF
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134128
Approved by: https://github.com/zxiiro, https://github.com/jeanschmidt
2024-08-22 19:23:07 +00:00
0eb9c870fd [reland][ROCm] TunableOp for gemm_and_bias (#128919)
Reland of #128143 but added `alpha` and `bias` initialization to `launchTunableGemmAndBias`

Thus far TunableOp was implemented for gemm, bgemm, and scaled_mm. gemm_and_bias was notably missing. This PR closes that gap.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128919
Approved by: https://github.com/malfet
2024-08-22 18:27:50 +00:00
978c5a80a0 [export][training ir migration] fix batch norm pattern match in quantization (#134157)
Summary:
In the new training ir, we produce `torch.ops.aten.batch_norm.default` instead of `torch.ops.aten._native_batch_norm_legit.default` or `torch.ops.aten._native_batch_norm_legit_no_training.default`.

So we need to change the pattern match to accomodate the new op.

- Add `torch.ops.aten.batch_norm.default` to pattern matcher list so it's identified as a batch norm node
- `torch.ops.aten.batch_norm.default` doesn't have a getitem user anymore, so when removing the bn norm,  we need to do `bn_node.replace_all_uses_with(conv_node)` instead of `getitem_node.replace_all_uses_with(conv_node)`

The behavior of capture_pre_autograd_graph is consistent for each run.

If the run is a fbcode test, then capture_pre_autograd_graph uses training IR. This means both _get_aten_graph_module_for_pattern and  replace_pattern_with_filters see the same training IR.

If the run is not a fbcode test, then both would see the old IR.

Test Plan:
```
buck2 run 'fbcode//mode/dev-nosan' fbcode//caffe2/test/quantization:test_quantization -- -r test_conv2d_binary2
buck2 run 'fbcode//mode/dev-nosan' fbcode//caffe2/test/quantization:test_quantization -- -r test_conv2d_unary
buck2 run 'fbcode//mode/dev-nosan' fbcode//caffe2/test/quantization:test_quantization -- -r test_linear_unary
buck2 run 'fbcode//mode/dev-nosan' fbcode//caffe2/test/quantization:test_quantization -- -r test_dynamic_quant_linear
buck2 run 'fbcode//mode/dev-nosan' fbcode//caffe2/test/quantization:test_quantization -- -r test_qat_dynamic_quant_linear
buck2 run 'fbcode//mode/dev-nosan' fbcode//caffe2/test/quantization:test_quantization -- -r test_flatten_recipe
buck2 run 'fbcode//mode/dev-nosan' fbcode//caffe2/test/quantization:test_quantization -- -r test_linear_unary
```

Reviewed By: andrewor14, tugsbayasgalan

Differential Revision: D61291077

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134157
Approved by: https://github.com/tugsbayasgalan
2024-08-22 18:25:45 +00:00
fee677eeb6 [fbode-testing][dynamo][reland][inline-inbuilt-nn-modules] Mark attri… (#134136)
Shuai wants to test this internally before https://github.com/pytorch/pytorch/pull/133713 can go in. Creating a separate PR for ghmport.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134136
Approved by: https://github.com/yanboliang
2024-08-22 17:54:58 +00:00
8f7d66f0c3 Enable dynamic rollout for Linux binary workflows (#131472)
Enables dynamic migration of jobs to the LF AWS account for binary workflows.

The new runners are only given to people specified in this issue: pytorch/test-infra#5132

This closes pytorch/ci-infra#251.

Depends-On: pytorch/pytorch#132870
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131472
Approved by: https://github.com/ZainRizvi
2024-08-22 17:12:50 +00:00
d95aedf5fd [BE] typing for decorators - fx/_compatibility (part 1) (#134202)
Part of #134054.

This corresponds to the pytorch mypy changes from D61493706. Updating takes so
long and touches so many files that it's impossible to land as a whole without conflicting with some other intermediate change.
So landing these 'type: ignore' for pytorch in advance of them actually being needed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134202
Approved by: https://github.com/Skylion007
2024-08-22 17:07:33 +00:00
44fa9f991c [NJT] add aten.to.dtype support (#134164)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134164
Approved by: https://github.com/davidberard98
2024-08-22 16:59:38 +00:00
b6abac68ec [BE][dynamo] reorganize polyfill module hierarchy (#133977)
Changes:

1. Move `polyfill.py` -> `polyfills/__init__.py`. It can be used as `polyfill.xxx` -> `polyfills.xxx`.
2. Move submodule loading from `polyfills/__init__.py` to `polyfills/loader.py`.

Merge `polyfill.py` and `polyfills/` packages. Each polyfill module have its own namespace for better code organization.

The ultimate goal is make `polyfills/__init__.py` empty and all polyfill functions move to its own namespace.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133977
Approved by: https://github.com/jansel
2024-08-22 16:42:29 +00:00
c95ddd4bf2 [dynamo] ensure polyfill function has the same signature as the original function in substitute_in_graph (#133813)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133813
Approved by: https://github.com/jansel
2024-08-22 16:38:06 +00:00
240467adfe [fx] Implement deepcopy for Proxy (#133706)
Summary: When deepcopy a proxy, we first try the default deepcopy behavior.

Test Plan: buck2 run 'fbcode//mode/dev-nosan' fbcode//caffe2/test:fx -- -r  proxy_deepcopy

Differential Revision: D61398418

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133706
Approved by: https://github.com/angelayi
2024-08-22 16:37:30 +00:00
b0171c3920 Revert "[ONNX] Opt into ruff fmt (#134120)"
This reverts commit 0870398fa8c3e097640f31cb8a8e2e2d3e522d33.

Reverted https://github.com/pytorch/pytorch/pull/134120 on behalf of https://github.com/albanD due to Breaks main branch lint ([comment](https://github.com/pytorch/pytorch/pull/134120#issuecomment-2305089756))
2024-08-22 15:48:14 +00:00
828ab84e19 Improve error msg on _lazy_init() error (#134159)
Reviewed By: hanzlfs

Differential Revision: D61627609

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134159
Approved by: https://github.com/hanzlfs
2024-08-22 15:10:50 +00:00
3c5485fb7f [Retry] Log chromium events to scuba (#134118)
Summary:
This diff implements a bunch of views for internal scuba viewing.

TODOS that I might punt to another diff:
- Saving cache stats via counter is definitely sus here, but there's not really a good way to track "fx graph cache hit for this compile phase" right now. Will think about this more.
- We should definitely log frame id, compile id, etc
- We should definitely be logging configs. That way, we can A/B test based on whether a config is turned on.
- idk what I'm doing with compile_uuid yet, but it's useful when you want to look at samples for a single run. I think if we had mast job info this field is not needed, but it's nice to be able to drill down to a single run and get its chrome trace view or icicle view, so idk

Test Plan:
All of the above views are run with nanogpt benchmark:

```
buck run mode/opt caffe2/benchmarks/dynamo:torchbench -- --training --backend=inductor --only nanogpt --performance
```

Differential Revision: D61603243

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134118
Approved by: https://github.com/oulgen
2024-08-22 14:59:45 +00:00
1b10a5c652 Allow SymInts and SymFloats as other in div_softmax_pattern (#133989)
Fixes https://github.com/pytorch/pytorch/issues/133759

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133989
Approved by: https://github.com/ezyang
2024-08-22 14:36:01 +00:00
afc2615d33 Add proper casting to fuse_linear_bn_weights (#134105)
As per title, this PR adds proper casting to fuse_linear_bn_weights in the same style as the conv case above. This previously caused numerical issues on my end, so that is why I am fixing it.

Also cleans up the docstring.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134105
Approved by: https://github.com/mikaylagawarecki
2024-08-22 14:26:12 +00:00
b459ca78eb [NJT]Add unit tests that cover the internal use cases using new NJT API (#133513)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133513
Approved by: https://github.com/davidberard98, https://github.com/soulitzer
2024-08-22 13:54:40 +00:00
1a7e8e5780 Revert "Update FlexAttention with masking semantic (#133373)"
This reverts commit 5a7b544e5c3e37bea62c6a231f6230c004a33d38.

Reverted https://github.com/pytorch/pytorch/pull/133373 on behalf of https://github.com/jeanschmidt due to Broke internal test/inductor signals, see D61611729 ([comment](https://github.com/pytorch/pytorch/pull/133373#issuecomment-2304714503))
2024-08-22 13:47:26 +00:00
88c973005d Revert "[FlexAttention] Enable different qk and v head-dims (#134043)"
This reverts commit e847b6bb9ba281b0db83fcdd79c328252403e9e8.

Reverted https://github.com/pytorch/pytorch/pull/134043 on behalf of https://github.com/jeanschmidt due to Need to revert, in order to be able to revert https://github.com/pytorch/pytorch/pull/133373, feel free to reland this after solving conflicts ([comment](https://github.com/pytorch/pytorch/pull/134043#issuecomment-2304708996))
2024-08-22 13:44:17 +00:00
83b5d449a3 Add full float16/bfloat16 support to MaxUnPool (#133774)
It already supported half so might as well add bfloat16 support for parity

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133774
Approved by: https://github.com/eqy, https://github.com/ezyang
2024-08-22 13:34:43 +00:00
c9c84ae3ee [BE][Ez]: Update CUDNN_frontend submodule to 1.6.1 (#134007)
Update cudnn_frontend submodule to 1.6.1 to patch some minor bugfixes and compiler fixes.
# Bug fix
* Fixed an issue where custom dropout mask was not correctly applied.
* Added -fvisibility=hidden for the pip wheels generated to avoid symbol conflicts with other modules that use cudnn frontend.
* Fixed an issue in sdpa operation which when deserialized will lead to numerical mismatches.
* Fixed an issue in sdpa fp8 fprop operation (in inference mode).
# Samples
* Added a new sample to showcase how a custom dropout mask can be applied to a sdpa operation.
* Added a sample to showcase convolutions on large (c * d * h * w > 2 **31) tensors.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134007
Approved by: https://github.com/eqy
2024-08-22 13:34:17 +00:00
108a75b454 [PP] Add ZeroBubble schedule (#133467)
Zero bubble can be expressed through `ScheduleFlexibleInterleaved1F1B` by setting `enable_zero_bubble=True`. But instead of having to include this flag in schedule initialization we should create a separate ZeroBubbleSchedule and also transition `Interleaved1F1B` to derive from `ScheduleFlexibleInterleaved1F1B`. Then we dont need to expose `ScheduleFlexibleInterleaved1F1B` since the naming is not obvious

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133467
Approved by: https://github.com/wconstab
ghstack dependencies: #132691
2024-08-22 13:32:15 +00:00
cedfac20c7 Revert "[SymmetricMemory] introduce multicast support, multimem_all_reduce_ and multimem_one_shot_all_reduce (#133424)"
This reverts commit 66d3eb783c3b3d7087988dd29bfb619b7f4306b7.

Reverted https://github.com/pytorch/pytorch/pull/133424 on behalf of https://github.com/jeanschmidt due to Broke internal ADS builds, see D61611517 ([comment](https://github.com/pytorch/pytorch/pull/133424#issuecomment-2304676328))
2024-08-22 13:29:27 +00:00
592a172910 [FSDP2] Resolved strided sharding todo in clipping tests (#134152)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134152
Approved by: https://github.com/XilunWu, https://github.com/weifengpy, https://github.com/wz337
2024-08-22 12:45:13 +00:00
4c645c04d8 Fix type of get_raw_stream (#134187)
Just something I noticed while implementing a new DeviceInterface

I had to add `# type: ignore[assignment]` because mypy thinks
DeviceInterface.get_raw_stream is a `Callable` and therefore
incompatible with a `staticmethod`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134187
Approved by: https://github.com/jansel
2024-08-22 12:00:08 +00:00
5fb8754434 [inductor] write cpp code with encoding utf-8 (#134027)
Windows is different to Linux, each Windows version with different language pack have different code page.
Inductor on Windows will write the genarated cpp code with its code page, and it should occured un-decode character failed.

For this situlation, Microsoft suggest to use Unicode to instead of a specific code page. Ref: https://learn.microsoft.com/en-us/windows/win32/intl/code-page-identifiers

Changes:
1. Use `utf-8` as encoder for cpp code.
2. It only change encode for cpp code, but not for binary type. binary type is for AoT binary context.

It works on https://github.com/pytorch/pytorch/issues/122094#issuecomment-2299592942.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134027
Approved by: https://github.com/desertfire, https://github.com/jgong5, https://github.com/jansel
2024-08-22 11:54:32 +00:00
aea1148d56 [fp8 rowwise] Clarify dtypes (#134114)
Disambiguate some of the dtypes (e.g., for the scales), move the "constant" ones out of the function, and use safe casting functions.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134114
Approved by: https://github.com/drisspg
ghstack dependencies: #134110, #134111, #134112, #134113
2024-08-22 11:07:39 +00:00
72586ccd14 [fp8 rowwise] Don't build separate kernel for no bias (#134113)
CUTLASS automatically skips a stage in the epilogue if we provide a nullptr. Thus, instead of building a special kernel for bias=None, we can reuse one of the other ones.

This also considerably simplifies the code.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134113
Approved by: https://github.com/drisspg
ghstack dependencies: #134110, #134111, #134112
2024-08-22 11:07:39 +00:00
d64fa11095 [fp8 rowwise] Fix bias calculation being done in low precision (#134112)
The compute dtype for the bias addition was set to ElementBias. Thus, for a bf16 bias, we would cast the fp32 accum to bf16 and _then_ add the bias. It is however (slightly?) more accurate to first add the bias in fp32 and only cast at the end.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134112
Approved by: https://github.com/drisspg
ghstack dependencies: #134110, #134111
2024-08-22 11:07:34 +00:00
15faed60ca [fp8 rowwise] Make schedule selection more readable (#134111)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134111
Approved by: https://github.com/drisspg
ghstack dependencies: #134110
2024-08-22 11:07:30 +00:00
b8ea5b01c9 [fp8 rowwise] Allocate workspace as a PyTorch Tensor (#134110)
This makes us pass through the CUDA caching allocator which is safer e.g. in case of CUDA graphs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134110
Approved by: https://github.com/drisspg
2024-08-22 11:07:26 +00:00
cyy
4c8193b8f0 [14/N] Fix clang-tidy warnings in aten/src/ATen (#132733)
Follows #133807

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132733
Approved by: https://github.com/ezyang
2024-08-22 10:09:15 +00:00
90c821814e SparseCsrCUDA: cuDSS backend for linalg.solve (#129856)
This PR switches to cuDSS library and has the same purpose of #127692, which is to add Sparse CSR tensor support to linalg.solve.
Fixes #69538

Minimum example of usage:
```
import torch

if __name__ == '__main__':
    spd = torch.rand(4, 3)
    A = spd.T @ spd
    b = torch.rand(3).to(torch.float64).cuda()
    A = A.to_sparse_csr().to(torch.float64).cuda()

    x = torch.linalg.solve(A, b)
    print((A @ x - b).norm())

```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129856
Approved by: https://github.com/amjames, https://github.com/lezcano, https://github.com/huydhn

Co-authored-by: Zihang Fang <zhfang1108@gmail.com>
Co-authored-by: Huy Do <huydhn@gmail.com>
2024-08-22 07:57:30 +00:00
64cfcbd8a3 Tune _int_bsr_dense_addmm for int8 inputs on A100 (#134035)
As in the title.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134035
Approved by: https://github.com/cpuhrsch
ghstack dependencies: #133855
2024-08-22 06:43:11 +00:00
b7baa062fc Update torch-xpu-ops pin (ATen XPU implementation) (#133850)
Bugfixings for PyTorch 2.5,
1. Using SYCL group algorithm API instead of old style for sub group shift utilities.
2. Add preprocess in reduction kernel for cases requiring data type cast.
3. Make group norm memory format compatible.
4. ZeroTensor: a. Remove unnecessary aten operators registration, or ZeroTensor process is bypassed. b. Align preprocess with intree implementation in aten::copy_.
5. Rebase checkIndexTensorTypes usage.
6. Align latest semantics of PyTorch foreach operators. Return multiple tensors with offset=0.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133850
Approved by: https://github.com/EikanWang
2024-08-22 06:27:03 +00:00
cdb9c7d228 Add support for using privateuse1 backend name in instantiate_device_type_tests() (#133082)
As you can see, 'privateuse1' appears many times in out-of-tree extension codebase. I think that everything about the device type should be as same as other in-tree backends after registering the privateuse1 backend.

For example, after registering a privateuse1 backend named "foo", you should allow "foo" to be passed in as a valid device type.

```diff
- instantiate_device_type_tests(TestIndexing, globals(), only_for='privateuse1')
- instantiate_device_type_tests(NumpyTests, globals(), only_for='privateuse1')
+ instantiate_device_type_tests(TestIndexing, globals(), only_for='foo')
+ instantiate_device_type_tests(NumpyTests, globals(), only_for='foo')
```

> https://github.com/Ascend/pytorch/blob/master/test/test_indexing.py#L1654-L1655

The change is to map privateuse1 backend name to 'privateuse1' when calling `filter_desired_device_types()`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133082
Approved by: https://github.com/albanD
2024-08-22 06:17:21 +00:00
24c2dd2002 Migrate fuse_chunk_reshape_concat_pass to PT2 (#134026)
Summary:
This is part of the work of dper pass migration https://fburl.com/gdoc/wxwykxns
This pass has ~2.4% perf impact for adfinder_reels_ctr_model

Test Plan: Still in test

Differential Revision: D60789747

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134026
Approved by: https://github.com/huxintong
2024-08-22 06:13:52 +00:00
938f37b745 Added batching rule for sdpa_math, sdpa_efficient_attention forward, cudnn, and flash attention (#133964)
Fixes https://github.com/pytorch/pytorch/issues/117016, https://github.com/pytorch/pytorch/issues/102457, https://github.com/pytorch/pytorch/issues/110525, https://github.com/pytorch/pytorch/issues/108065,

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133964
Approved by: https://github.com/Skylion007
2024-08-22 05:29:49 +00:00
e2ff094008 [inductor] calibration inductor windows uts (1/N) (#134033)
Changes:
1. Re-open fixed UTs.
2. Mark skiped reasons for failed UTs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134033
Approved by: https://github.com/jansel
2024-08-22 05:21:28 +00:00
0d7ac1966a kill sharing of constraints (#134045)
Summary:
Previously, reuse of the same `Dim` was encoded by "sharing" internal constraints among constraint targets. This kind of sharing, implemented using `shared` fields between `_Constraint`s, was originally motivated by `dynamic_dim`, specifically to support `==` between `dynamic_dim`s, but we no longer need to maintain this overcomplicated structure: we can simply use names of `Dims` to directly encode sharing information.

Thus this PR vastly simplifies the structure of `_Constraint` by removing `shared` fields. As a result, both `_Constraint` and its moral subclass, `_DerivedConstraint`, are 1-1 with `Dim` and its moral subclass, `DerivedDim`.

Note that this will break `==` over `dynamic_dim`, so an immediate follow-up will be to remove `dynamic_dim` entirely from our public API. (It's been more than 6 months since the deprecation warning anyway.) I just didn't want to deal with that process in the same PR.

Test Plan: existing

Differential Revision: D61559413

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134045
Approved by: https://github.com/pianpwk
2024-08-22 04:40:47 +00:00
de06345e9b Avoid Host & Device Sync In LR Scheduler (#133663)
Fixes #133662.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133663
Approved by: https://github.com/janeyx99, https://github.com/eqy

Co-authored-by: Jane (Yuan) Xu <31798555+janeyx99@users.noreply.github.com>
2024-08-22 03:52:43 +00:00
e847b6bb9b [FlexAttention] Enable different qk and v head-dims (#134043)
# Summary
Adds the option for the head dims to be different between QK and V tensors.

Fixes issue: https://github.com/pytorch/pytorch/issues/133674

V_DIM > QK_DIM is blocked by landing: https://github.com/triton-lang/triton/pull/4138 / https://github.com/triton-lang/triton/pull/4540

Into PyTorch's triton branch.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134043
Approved by: https://github.com/Chillee
2024-08-22 03:42:17 +00:00
7868b65c4d [Dynamo] Support dict.setdefault (#134083)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134083
Approved by: https://github.com/williamwen42
2024-08-22 01:57:33 +00:00
7b20514f8e [export] Device remapping in export (#133660)
Implemented `move_to_device_pass()` function in `torch._export.passes`.

The user has to explicitly call this method to move the exported program from one torch.device to another one.

Fixes https://github.com/pytorch/pytorch/issues/121761
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133660
Approved by: https://github.com/angelayi
2024-08-22 01:03:35 +00:00
df467f8746 [CI] Do not set Intel OMP for aarch64 (#133997)
As title

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133997
Approved by: https://github.com/angelayi
2024-08-22 00:55:46 +00:00
6bddfb9546 [FSDP2] Add cache for FSDP wrapper class (#134135)
Currently, `fully_shard` will create a new `FSDPMyModuleClass` class for each `MyModuleClass` module **object**, which causes Dynamo to guard-fail on every module object's type checking. This PR fixes the issue by caching and reusing previously created FSDP wrapper class.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134135
Approved by: https://github.com/awgu
2024-08-22 00:41:30 +00:00
2a73ba298c Upgrade submodule oneDNN to v3.5.3 (#131620)
This PR is to upgrad submodule oneDNN to v3.5.3.

## Improvements

- [experimental] Introduced [microkernel API](https://oneapi-src.github.io/oneDNN/ukernels.html) for Intel Architecture Processors. This API exposes internal mechanisms used in matmul and convolution implementation to expert users.
- Improved performance of matmul primitive with sum post-op for batched cases on processors with Intel AMX instruction set support.
- Introduced fp64 matmul support. This functionality is currently implemented on Intel GPUs with hardware acceleration for fp64 math only.

## Validation results on CPU
No regression was found.

1. NLP models accuracy/inference/training

Model Name | Mode Name | Precision | OneDNN | Baseline | OneDNN/Baseline
-- | -- | -- | -- | -- | --
bert-large | realtime | bf16 | 192.498 | 189.664 | 1.014942214
bert-large | throughput | bf16 | 202.424 | 202.156 | 1.001325709
bert-large | train_phase2 | bf16 | 15.955 | 16.029 | 0.995383368
LCM | throughput | bf16 | 1.01983 | 1.06632 | 0.956401455
stable-diffusion | throughput | bf16 | 0.10313 | 0.10184 | 1.012666929
ViT | realtime | bf16 | 1086.48 | 928.43 | 1.17023362
ViT | throughput | bf16 | 1419.07 | 1393.81 | 1.018122987
yolov7 | realtime | bf16 | 413.468682 | 415.16503 | 0.995914039
yolov7 | throughput | bf16 | 369.697 | 366.789 | 1.007928264
bert-large | realtime | fp32 | 46.685 | 46.652 | 1.000707365
bert-large | throughput | fp32 | 47.766 | 48.007 | 0.994979899
bert-large | train_phase2 | fp32 | 7.101 | 7.104 | 0.999577703
LCM | throughput | fp32 | 0.5501 | 0.55023 | 0.999763735
stable-diffusion | throughput | fp32 | 0.04012 | 0.04002 | 1.002498751
ViT | realtime | fp32 | 337.27 | 335.19 | 1.006205436
ViT | throughput | fp32 | 346.52 | 350.08 | 0.989830896
yolov7 | realtime | fp32 | 107.138054 | 107.242747 | 0.999023775
yolov7 | throughput | fp32 | 103.383 | 104.301 | 0.99119855
bert-large | realtime | int8 | 283.541 | 289.569 | 0.979182855
LCM | throughput | int8 | 1.09864 | 1.08998 | 1.0079451
stable-diffusion | throughput | int8 | 0.10617 | 0.10604 | 1.001225952
ViT | realtime | int8 | 1562.11 | 1554.68 | 1.004779119
ViT | throughput | int8 | 1904.38 | 1903.39 | 1.000520125
yolov7 | realtime | int8 | 540.489493 | 539.902488 | 1.001087243
yolov7 | throughput | int8 | 499.999 | 500.757 | 0.998486292

Device | Dtype | Geomean Higher is better
-- | -- | --
All | all | 101.17%
All | fp32 | 99.83%
All | bf16 | 102.24%
All | int8 | 99.91%
All | fp16 | 103.61%
SPR | all | 100.54%
SPR | fp32 | 99.82%
SPR |bf16 | 101.78%
SPR |int8 | 99.90%
GNR | all | 101.58%
GNR | fp32 | 99.85%
GNR | bf16 | 102.66%
GNR | int8 | 99.93%
GNR | fp16 | 103.61%

2. Torchbench cpu userbenchmark inference & training

Perf_Geomean | Ratio (oneDNN/baseline)
-- | --
eager_throughtput_bf16_infer | 1.00x
eager_throughtput_fp32_infer | 1.00x
jit_llga_throughtput_amp_bf16 | 1.00x
jit_llga_throughtput_fp32 | 1.00x
eager_throughtput_fx_int8 | 0.99x
eager_throughtput_bf16_train | 1.01x
eager_throughtput_fp32_train | 1.00x

3. Inductor quantization

Static quant:
Perf_Geomean | Ratio (oneDNN/baseline)
-- | --
PTQ | 1.00x
PTQ_CPP_WRAPPER | 1.00x
QAT | 1.00x

ACC_Geomean | Ratio (oneDNN/baseline)
-- | --
PTQ | 1.00x
PTQ_CPP_WRAPPER | 1.00x
QAT | 1.00x

Dynamic quant:

  | Ratio (oneDNN/baseline)
-- | --
Performance | 1.04x
Accuracy | 1.00x

4. Dynamo benchmarks
GEOMEAN summary
![image](https://github.com/user-attachments/assets/82fc4b76-50f6-4f06-9ba9-034b932f1158)

FP32 Static shape, default wrapper
![image](https://github.com/user-attachments/assets/9335268e-3e99-426b-91f8-f9df90a2007c)

FP32 Dynamic shape, default wrapper
![image](https://github.com/user-attachments/assets/e7cf3f4f-2a62-4b58-9461-5e5ba254d822)

AMP Static shape, default wrapper
![image](https://github.com/user-attachments/assets/12392c88-e44f-4c95-904a-4fa5fc9f34a2)

AMP Dynamic shape, default wrapper
![image](https://github.com/user-attachments/assets/13930b0d-9bb2-46de-9ecb-5d2585d5c2f6)

## Validation results on XPU
Category | Eager | Inductor
-- | -- | --
huggingface_amp_fp16_training | 1.002456 | 0.999998
huggingface_bfloat16_inference | 1.005386 | 1.003511
huggingface_float32_training | 1.002533 | 1.003098
torchbench_amp_fp16_training | 1.009065 | 1.01323
torchbench_bfloat16_inference | 1.003371 | 1.001534
torchbench_float32_training | 1.012102 | 1.011596
timm_models_amp_fp16_training | 1.005511 | 1.010329
timm_models_bfloat16_inference | 1.000935 | 1.000538
timm_models_float32_training | 0.991873 | 0.99721

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131620
Approved by: https://github.com/jgong5, https://github.com/malfet
2024-08-21 23:40:02 +00:00
5f0bd98767 Increase max total number of dynamo partitions to 15 (#134153)
Needed to be able to split some of the aarch64 workflows to 15 shards

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134153
Approved by: https://github.com/seemethere, https://github.com/kit1980, https://github.com/ZainRizvi
2024-08-21 23:10:12 +00:00
a5ef04a3b8 add relevant function (#133946)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133946
Approved by: https://github.com/ezyang
2024-08-21 23:04:59 +00:00
8604c0a150 [inductor] Fix needs_fixed_stride_order silent incorrectness (#133639)
Fixes #128084

The approach is option 2 of what Elias suggested in the comment
thread:
- We require tensors to have the correct stride at usage. This may
  involve a clone; if there was a clone and then a mutation into it
  then we copy_ back the result of the mutation.

The reason why I went this approach was because it was the easiest and
Inductor already works really hard to remove additional clones/copy_.

There are some cases that this doesn't generate efficient code for; for
example, if the tensor is a view, we don't change the base of the view
to have the right stride order, instead we do a clone.
The view case isn't very common so I'm ignoring it for now but we could
improve this in the future.

Test Plan:
- tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133639
Approved by: https://github.com/eellison
2024-08-21 22:54:16 +00:00
d2204d4f0f Remove skip ci recommendation (#134134)
Using `skip ci` is no longer a recommendation practices.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134134
Approved by: https://github.com/soulitzer
2024-08-21 22:42:25 +00:00
255cd75a97 [sparse] Add cuSPARSELt as a backend (#128534)
Summary:

This PR adds in cuSPARSELt as a backend to PyTorch.

It is now possible to see if cuSPARSELt is available and the version if
it is with
```
torch.backends.cusparselt.is_available()
torch.backends.cusparselt.version()
```

Test Plan:
```
python test/test_sparse_semi_structured.py -k test_cusparselt_backend
```

Reviewers:

Subscribers:

Tasks:

Tags:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128534
Approved by: https://github.com/cpuhrsch, https://github.com/eqy, https://github.com/syed-ahmed
2024-08-21 22:06:07 +00:00
0870398fa8 [ONNX] Opt into ruff fmt (#134120)
Add ONNX directory to use ruff format.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134120
Approved by: https://github.com/XuehaiPan, https://github.com/Skylion007
2024-08-21 21:43:55 +00:00
96dfe95ed0 Fix DDPLoadBalancingPlanner docstring (#134044)
Summary:
1. Indentation in chunk function was wrong.
1. The previous logic missed a level of zip.

This diff uses the idiom in python zip doc to do chunking https://docs.python.org/3/library/functions.html#zip

Test Plan: Run the docstring locally

Differential Revision: D61548758

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134044
Approved by: https://github.com/fegin
2024-08-21 21:28:22 +00:00
5d5a45dc85 [CI][dashboard] Collect Export pass rate separately (#134076)
Summary: Collect Export pass rate separately when running AOTInduction, so that we can have a better isolated signal.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134076
Approved by: https://github.com/angelayi
2024-08-21 21:18:55 +00:00
b3eef3deaf Triple number of shards for aarch64 cpu inductor tests (#134123)
Let's see if this will work.

Alas, other than linting I can only test it after it lands
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134123
Approved by: https://github.com/clee2000
2024-08-21 20:52:23 +00:00
345578afb4 Add int8 support to bsr_dense_addmm and bsr_dense_mm Triton kernels (#133855)
As in the title. In addition, the PR introduces `_int_bsr_dense_addmm` that is equivalent to `bsr_dense_addmm` except for int8 inputs the operation result is int32 tensor (similar to existing `_int_mm`).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133855
Approved by: https://github.com/cpuhrsch
2024-08-21 20:44:40 +00:00
a3e1416c05 Fix out_tensor device in diag_test.py (#134020)
This benchmark fails if device='cuda' but out_tensor is on cpu

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134020
Approved by: https://github.com/soulitzer
2024-08-21 20:43:39 +00:00
6c1e2d2462 [easy] Force inline_inbuilt_nn_modules to remove divergence (#134122)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134122
Approved by: https://github.com/williamwen42, https://github.com/mlazos
2024-08-21 20:42:15 +00:00
865facda44 [pytorch] Remove thread naming when torch is imported (#134066)
Fixes #133690

The naming was added in #121170 to allow performance debugging of latency critical threads. However the `pt_main_thread` name gets inherited every time a new process or thread is created from the parent one, which defeats the purpose. We need a better way to name the thread that launches kernels on accelerators but for the time being we can let users name the threads in the application code, using: `torch.multiprocessing._set_thread_name("insert_name")`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134066
Approved by: https://github.com/soulitzer, https://github.com/d4l3k
2024-08-21 20:34:35 +00:00
1491a61769 Revert "[hop] ban creating hop by directly instantiating HigherOrderOperator. (#133645)"
This reverts commit 696107efcb83f9359aa669ab343c2cfa2a111372.

Reverted https://github.com/pytorch/pytorch/pull/133645 on behalf of https://github.com/ydwu4 due to breaking ci. probably due to land race ([comment](https://github.com/pytorch/pytorch/pull/133645#issuecomment-2302866106))
2024-08-21 19:33:14 +00:00
5fcfccefc6 [export] Migrate capture_pre_autograd_graph to _export_for_training (#132815)
Summary: as title

Test Plan: CI

Differential Revision: D60860909

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132815
Approved by: https://github.com/tugsbayasgalan
2024-08-21 19:00:41 +00:00
18aaceb7be Update conda-env-iOS.txt (#134068)
Followup after https://github.com/pytorch/pytorch/pull/133814 To fix periodic build failures update `typing-extensions` to 4.11.0, as 4.10 is missing in conda
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134068
Approved by: https://github.com/wdvr, https://github.com/atalman, https://github.com/Skylion007
2024-08-21 18:47:14 +00:00
84b3f1900a C++ network flow implementation in c10 (#132188)
The functorch partitioners use network flow to split the joint graph into a forward and backward graph. Internally, we've found that upgrading to networkx 2.8.8 (from 2.5) results in some hard-to-debug failures (internal reference: https://fburl.com/workplace/jrqwagdm). And I'm told that there's interest to remove the python dependency.

So this PR introduces a C++ implementation that mirrors the API provided by networkx. We'll need to add python bindings and do some additional testing to verify correctness.

Differential Revision: [D61550977](https://our.internmc.facebook.com/intern/diff/D61550977)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132188
Approved by: https://github.com/Chillee
2024-08-21 18:40:54 +00:00
05304f59f0 [Doc] Fix typo in torch/fx/passes/README.md (#134078)
Fix typo, `utis` to `utils`, in the utility name.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134078
Approved by: https://github.com/soulitzer, https://github.com/malfet
2024-08-21 18:35:50 +00:00
32e057636c Enable scribe environment for compile-time benchmarks if requested. (#133891)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133891
Approved by: https://github.com/malfet
2024-08-21 18:02:54 +00:00
750d68ff70 Use amazon linux2 for Docker builds, fix build-docker-conda condition (#134116)
1. Switches failing jobs to amzon linux 2:
- CUDA, CPU, ROCM jobs are failing
3. Fix trigger condition for build-docker-conda to be same as manywheel and libtorch
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134116
Approved by: https://github.com/ZainRizvi, https://github.com/nWEIdia
2024-08-21 18:01:16 +00:00
696107efcb [hop] ban creating hop by directly instantiating HigherOrderOperator. (#133645)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133645
Approved by: https://github.com/zou3519
ghstack dependencies: #133521
2024-08-21 17:34:21 +00:00
6835f20d20 [HOP] support generating schema for hop (#133521)
Add a way of generating a FunctionSchema from example values because hop's schema varies even for the same hop.

We didn't use torch._C.FunctionSchema because we cannot construct the classes directly (e.g. "__init__" cannot be used for torch._C.FunctionSchema). Also extending the Basic types in c++ seems not that easy.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133521
Approved by: https://github.com/zou3519
2024-08-21 17:34:21 +00:00
dd5a7c8397 [PT2] Add a pass to convert stack to unsqueeze cat (#133966)
Summary: so that we can optimize with `fuse_chunk_reshape_unsqueeze_concat_pass`

Test Plan: new UT

Reviewed By: frank-wei

Differential Revision: D61220221

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133966
Approved by: https://github.com/frank-wei
2024-08-21 17:31:26 +00:00
1da3a049da [dynamo][super] Improve handling of getattr on super (#134039)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134039
Approved by: https://github.com/yanboliang
ghstack dependencies: #133742, #134016
2024-08-21 16:50:35 +00:00
3ef1cc8583 [export] Implement common_getitem_elimination pass. (#133618)
Summary:
In export, we will generate many redundant getitem nodes branching from the same source, inserted by runtime assertions or any passes. This is causing issues with any downstream system relying on any value being uniquely defined by a single node.

I don't think it hurt to remove a bunch of getitem nodes only, so I just added to the ctor.

Test Plan:
rebase on D61256937
```
buck2 run scripts/bearzx:pt2_export_playground
```

Differential Revision: D61351578

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133618
Approved by: https://github.com/tugsbayasgalan
2024-08-21 16:48:24 +00:00
2db28a9611 Revert "[BE]: Update Typeguard to TypeIs for better type inference (#133814)"
This reverts commit bce0caba7804b0787684dbf1f4e1c4d9e3acded5.

Reverted https://github.com/pytorch/pytorch/pull/133814 on behalf of https://github.com/ezyang due to root cause of internal failures not addressed ([comment](https://github.com/pytorch/pytorch/pull/133814#issuecomment-2302466444))
2024-08-21 16:13:34 +00:00
57625bacea [partitioner] Fix must_be_in_backward corner cases (#134002)
Preparation PR for https://github.com/pytorch/pytorch/pull/132638

"must_be_in_backward" fails the partitioner, if partitioner picks this node as saved_values.

The fix is to prevent partitioner to pick those nodes during nodes classification.

It's hard to make a test without making effectful ops in backward "must_be_in_backward", which will be testing this ( https://github.com/pytorch/pytorch/pull/132638 )

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134002
Approved by: https://github.com/bdhirsh
ghstack dependencies: #134003
2024-08-21 15:58:49 +00:00
68425e68fe Revert "[dynamo][reland][inline-inbuilt-nn-modules] Mark attributes of nn mod… (#133714)"
This reverts commit e8d3c4be3629582294b5944754009fae60f42f6d.

Reverted https://github.com/pytorch/pytorch/pull/133714 on behalf of https://github.com/anijain2305 due to fails internally ([comment](https://github.com/pytorch/pytorch/pull/133714#issuecomment-2302171472))
2024-08-21 14:21:06 +00:00
32e052e468 [docs] improve torch.stack example code to be reproducible (#133857)
Improve the sample code can produce the expected results after copying and executing it.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133857
Approved by: https://github.com/soulitzer
2024-08-21 14:07:02 +00:00
585c049fa3 Fix Extension attribute name in CppExtension example (#134046)
Hi! It seems there's a typo in `CppExtension` example. I think it should say `extra_link_args` instead of `extra_link_flags`. Not that I spent a few hours debugging missing kernels inside a library's fatbin or anything :D.

Please see `Extension` definition inside setuptools:
ebddeb36f7/setuptools/_distutils/extension.py (L62)

Thanks!
Błażej

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134046
Approved by: https://github.com/soulitzer
2024-08-21 13:58:16 +00:00
afaa5fcecb [BE][Ez]: FURB142,FURB92 misc preview fixes (#133880)
Fixes some miscellaneous code quality issues with some refurb rules that have not been enabled yet.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133880
Approved by: https://github.com/soulitzer, https://github.com/malfet
2024-08-21 13:54:51 +00:00
683609c631 Skip cpp_extension test internally (#134011)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134011
Approved by: https://github.com/masnesral
2024-08-21 13:51:05 +00:00
4b1fb3b0ed [PP] pt-native input/weight grad split (#132691)
Add `stage_backward_input` and `stage_backward_weight` functions to perform the weight updates for inputs and weights independently.

We still support `self.dw_builder` argument for a custom backward, but it has become optional. It takes a separate code path and cannot be used in conjuction with the native zero backward.

Added tests:
`python test/distributed/pipelining/test_schedule_multiproc.py -k test_schedule_with_native_zero_bubble`
`python test/distributed/pipelining/test_backward.py`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132691
Approved by: https://github.com/wconstab
2024-08-21 13:37:54 +00:00
2bffbe06bd [Inductor][CPP] Support vectorization of load_seed and randn (#130317)
**Summary**
Enable the vectorization of `load_seed` and `randn`. For now, `randn` is using the reference implementation.

**Test Plan**
```
python -u -m pytest -s -v test/inductor/test_cpu_repro.py -k test_vec_randn
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130317
Approved by: https://github.com/jgong5
ghstack dependencies: #122961
2024-08-21 13:20:43 +00:00
313bc11963 [inductor][cpp] complete vectorization for int32/int64 (#122961)
**Summary**
Implement the complete vectorization of `index_expr` functionally. We also add heuristic from performance perspective to resolve the regressions posted below: https://github.com/pytorch/pytorch/pull/122961#issuecomment-2041336265 by disabling vectorization of specific (Fused) scheduler Node:

- Heuristic 1: when the num of non-contiguous `index_expr/load/store` exceeds the threshold, we disable the vectorization.
- Heuristic 2: when the total number of elements along the vec dim is less than `tiling_factor/2`, we disable the vectorization.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122961
Approved by: https://github.com/jansel

Co-authored-by: leslie-fang-intel <leslie.fang@intel.com>
2024-08-21 13:12:38 +00:00
539be0a769 [dynamo] support ClassMethodDescriptorType (#133862)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133862
Approved by: https://github.com/jansel
2024-08-21 12:56:19 +00:00
0d79f67a25 [dynamo][exception] Support raise exception from None (#134028)
Fixes https://github.com/pytorch/pytorch/issues/132362

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134028
Approved by: https://github.com/yanboliang
2024-08-21 12:48:35 +00:00
bd0db490bf [dynamo][set] Fix EQUALS_MATCH guard for constant sets and lists (#134016)
Fixes https://github.com/pytorch/pytorch/issues/133509

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134016
Approved by: https://github.com/laithsakka, https://github.com/jansel
ghstack dependencies: #133742
2024-08-21 12:41:52 +00:00
c929e1e11f [dynamo] fix polyfill for user defined constructor __new__ (#133822)
In `cls->tp_call`, if `cls->tp_new` does not return an instance of class `cls`, then `cls->tp_init` is not called on the new instance.

Related PR:

- #132977

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133822
Approved by: https://github.com/jansel
2024-08-21 12:41:19 +00:00
695291be2f Fix test flakiness due to not resetting state (#134058)
Fixes https://github.com/pytorch/pytorch/issues/133994

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134058
Approved by: https://github.com/yanboliang
2024-08-21 11:54:08 +00:00
30dc6338c1 [effects] Prevent inductor dtype promotions for HOP effects tokens (#134003)
Preparation for https://github.com/pytorch/pytorch/pull/132638 and https://github.com/pytorch/pytorch/pull/132755

Inductor promotes arguments dtypes to the highest dtype, as a result additional token tensor argument wtih float32 dtype incurred dtype promotions for lower types, e.g. int32

The solution for that - to use the lowest dtype for tokens - torch.bool.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134003
Approved by: https://github.com/zou3519, https://github.com/bdhirsh
2024-08-21 11:42:10 +00:00
19eb14493a [Inductor] Moves intermediary tensors which are constructed on the cpu to XPU when safe, align with CUDA. (#132843)
[Inductor] Moves intermediary tensors which are constructed on the cpu to XPU when safe, align with CUDA.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132843
Approved by: https://github.com/EikanWang, https://github.com/eellison
ghstack dependencies: #132740, #132748
2024-08-21 11:28:09 +00:00
6535f11259 [Inductor] Support _check_triton_bf16_support on XPU. (#132748)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132748
Approved by: https://github.com/EikanWang, https://github.com/eellison
ghstack dependencies: #132740
2024-08-21 11:28:09 +00:00
c2e2602ecd [Inductor] Move GPU_TYPE(The runtime avaliable gpu type, cuda or xpu) from (#132740)
Move GPU_TYPE(The runtime avaliable gpu type, cuda or xpu) from `testing/_internal/inductor_utils.py` to `_inductor/utils.py`. So that we can use it in Inductor, not limited in test case.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132740
Approved by: https://github.com/EikanWang, https://github.com/jansel
2024-08-21 11:18:00 +00:00
3d8db41337 Add new op wrapped_quantized_linear (#134024)
Summary:
This diff adds a new operator wrapped_quantized_linear (torch.ops._quantized.wrapped_quantized_linear) and takes the following input argument: input (in fp32) , input_scale, input_zero_point, weight (in fp32), weight_scale, weight_zero_point, bias (in fp32), output_scale, output_zero_point, and out_channel. It does the following

1. Use quantize_per_tensor(input, input_scale, input_zero_point) to quantize the input tensor to int8
2. Use quantized::linear_prepack(weight, weight_scale, weight_zero_point, bias) to pack the weight and bias
3. Use quantized::linear to perform int8 quantized linear
4. dequantize

This new op is essentially a wrapper of mutiple ops. We do this as torch.export cannot handle models where it has old quantize apis.

Reviewed By: jerryzh168

Differential Revision: D61377266

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134024
Approved by: https://github.com/houseroad
2024-08-21 09:26:58 +00:00
022cd7c9aa [RFC][dynamo] add decorator to register polyfill for unsupported C++ function to avoid graph break (#133712)
Add decorator `torch.compiler.substitute_in_graph` to register polyfill for unsupported C++ function to avoid graph break. This API provides an official way to add support for dynamo for third-party C extensions. Also, it can be used to simplify our implementation for `torch._dynamo.polyfill`.

5ee070266f/torch/_dynamo/variables/builtin.py (L97-L107)

Example:

```python
>>> import operator
>>> operator.indexOf([1, 2, 3, 4, 5], 3)
2

>>> torch.compile(operator.indexOf, fullgraph=True)([1, 2, 3, 4, 5], 3)
Unsupported: ...

>>> @torch.compiler.substitute_in_graph(operator.indexOf)
... def indexOf(sequence, x):
...     for i, item in enumerate(sequence):
...         if item is x or item == x:
...             return i
...     raise ValueError("sequence.index(x): x not in sequence")

>>> torch.compile(operator.indexOf, fullgraph=True)([1, 2, 3, 4, 5], 3)
2
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133712
Approved by: https://github.com/jansel
2024-08-21 06:36:41 +00:00
843fdf81c2 Fix a getenv segfault due to a race (#133744)
Summary:
* TLDR:

`getenv` is not thread safe w.r.t `setenv`. Environment variables are kept as a per-process "dictionary" by libc. `setenv` can essentially realloc the whole thing move this list to a completely different location. If there is a concurrent `getenv` happening as the same time, it is possible that it might end up reading stale memory and segfault.
`getenv` is thread safe w.r.t other `getenv`.

* Details:

Inside PTD init:
```
ProcessGroupNCCL ctor
	...
	ncclCommWatchdogThread_ =
      std::thread(&ProcessGroupNCCL::ncclCommWatchdog, this); (https://fburl.com/code/terf9ai7)
```

Inside ncclCommWatchdog thread:
```
	...
	ncclHeartbeatMonitorThread_ =
        std::thread(&ProcessGroupNCCL::heartbeatMonitor, this);  (https://fburl.com/code/fv9camg2)
    ...
```

Inside heartbeatMonitor thread:
```
	...
	std::optional<DumpPipe> dumpPipe = std::nullopt; (https://fburl.com/code/qdvahzbu)
	dumpPipe.emplace(rank_);
	...
```

Inside DumpPipe ctor (https://fburl.com/code/wvixlqcz)
```
	getCvarString
		getenv <=== SIGSEGV
```

On the main thread:

We go on to initialize NCCL:

Inside getNCCLComm, we call: `getNcclVersion` -> `initEnv` (https://fburl.com/code/j312pccu)

`initEnv` inside NCCL does this: `initEnv` -> `setEnvFile`

This guy, reads the /etc/nccl.conf file, and sets values of env variables with "setenv" (https://fburl.com/code/cq4r0y0h)
This "setenv" can race with "getenv" in heartbeatMonitor thread.

Ideally, all `setenv` should be done by a single thread before launching other threads. This diff moves getNCCLVersion before launching watchdog thread to make sure all setenvs are done beforehand.

I think we are just getting lucky that we are not hitting it in production. IIRC in fact we saw getenv segfault once in one of the large scale runs, but now I dont remember the details.

Test Plan: A lot of testing done as part of D61411062 & CI

Differential Revision: D61421292

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133744
Approved by: https://github.com/wconstab, https://github.com/fduwjj
2024-08-21 06:27:31 +00:00
af664882dd Safely infer device type + docstrings + tests (#133668)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133668
Approved by: https://github.com/eellison
2024-08-21 05:27:31 +00:00
b39ec7fbe9 [1/N] Make NCCL PG error messages more accurate and simpler (#134017)
We did a thorough review on all the error messages we are logging inside PGNCCL, and we want to make log message simpler and more accurate, this is the first PR for this effort.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134017
Approved by: https://github.com/wconstab
2024-08-21 05:21:24 +00:00
66d3eb783c [SymmetricMemory] introduce multicast support, multimem_all_reduce_ and multimem_one_shot_all_reduce (#133424)
### Summary
- Added multicast support to SymmetricMemory. If the cuda runtime and cuda driver have multicast support, SymmetricMemory associate all peer buffers with a multicast object and exposes the multicast virtual address.
- Implemented `multimem_all_reduce_` and `multimem_one_shot_all_reduce` based on the multicast support. The two variants shows different performance characteristic for different message size. We plan to use Inductor for collective algo selection (and required symmetric memory buffer allocation).

### Benchmark

8xH100 (non-standard version with HBM2e at 650W). NVSwitch V3 with NVLS support.

![image](https://github.com/user-attachments/assets/4998a16b-c2c0-4797-9dd0-1da2303df947)

![image](https://github.com/user-attachments/assets/278ad361-52cb-4864-82c6-bb67e8d0a3fe)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133424
Approved by: https://github.com/yf225, https://github.com/weifengpy
2024-08-21 05:11:21 +00:00
8337b4d96e [training ir migration] Fix ReorderConvertTest (#134010)
Summary:
Change ReorderConvertTest to work with the new `capture_pre_autograd_graph` implementation using D61175223.

Note that now `ReorderConvertTest` doesn't work with the old `capture_pre_autograd_graph` anymore.

Test Plan:
```
buck2 run 'fbcode//mode/dev-nosan' fbcode//bolt/nn/executorch/passes/tests:optimize_test -- -r ReorderConvertTest
```

Differential Revision: D61507772

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134010
Approved by: https://github.com/tugsbayasgalan
2024-08-21 04:48:43 +00:00
e8fc1e0118 [ONNX] New export logic leveraging ExportedProgram and ONNX IR (#132530)
1/n PR to

- Move code from torch-onnx from commit 395495e566 into torch.onnx and fixes imports.
- Integrate the new export logic with the torch.onnx.export API and include basic set of tests.
- Refactor the API for the change.
- Improve documentation.

Next PRs will be more tests and docs.

Fix https://github.com/pytorch/pytorch/issues/129277
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132530
Approved by: https://github.com/titaiwangms, https://github.com/malfet
2024-08-21 01:08:42 +00:00
06cc2e83f0 Make optim.swa.util content accessible from the torch.optim doc (#133393)
Link various classes and functions of the `optim.swa.util` to make doc content accessible from the `torch.optim` doc.

Currently, if you click the link,
https://pytorch.org/docs/stable/optim.html#module-torch.optim.swa_utils it goes to a blank, bottom of the page section of `torch.optim`.
Also,
`torch.optim.swa_utils.AveragedModel` and `torch.optim.swa_utils.SWALR` classes as well as `torch.optim.swa_utils.update_bn()` and `optim.swa_utils.get_ema_multi_avg_fn` are not linked to doc.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133393
Approved by: https://github.com/janeyx99
2024-08-21 00:43:46 +00:00
d1abd6241a [CI][BE] Update retry action to v3.0.0 (#119403)
To reduce number of
```
 Node.js 16 actions are deprecated. Please update the following actions to use Node.js 20
```

Finally can land this one as all nodes has been migrated to AmazonLinux2023
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119403
Approved by: https://github.com/clee2000, https://github.com/Skylion007
2024-08-20 23:56:37 +00:00
c42ac54d9e [inductor] prune unused constants in graph scheduling (#132208)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132208
Approved by: https://github.com/leslie-fang-intel

Co-authored-by: leslie-fang-intel <leslie.fang@intel.com>
2024-08-20 23:40:11 +00:00
5f3d22a609 Avoid GPU syncs by reusing Pre-allocated Zero Tensor (#128069)
This commit improves the FullyShardedDataParallel (FSDP) class in PyTorch by reducing unnecessary GPU synchronizations by reusing a pre-allocated zero tensor.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128069
Approved by: https://github.com/awgu
2024-08-20 22:51:33 +00:00
5a7b544e5c Update FlexAttention with masking semantic (#133373)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133373
Approved by: https://github.com/yanboliang
2024-08-20 22:38:10 +00:00
bc785c2d9a [Inductor][FlexAttention] Don't trigger dynamic shape on building empty block mask (#133836)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133836
Approved by: https://github.com/Chillee
2024-08-20 22:36:53 +00:00
f7c1f32803 Fix partially initialized module error (#134019)
https://github.com/pytorch/pytorch/pull/132990 introduced dependency on `torch.version`, which might not be imported yet, and can result in  `AttributeError: partially initialized module 'torch' has no attribute 'version' (most likely due to a circular import)` if user starts its code with `import torch.cuda`

Fix it by importing `torch.version` explicitly

Test Plan: CI

Differential Revision: D61549284

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134019
Approved by: https://github.com/seemethere
2024-08-20 22:20:02 +00:00
41fab40be7 [report_exportability] Avoid re-exporting duplicated modules (#133930)
Summary:
Skip re-exporting modules with the duplicated types to speed up the exportability tests.

In real models, there are many duplicated modules, and mostly have the same export issues.

Test Plan: Existing CI

Differential Revision: D61504630

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133930
Approved by: https://github.com/angelayi
2024-08-20 22:11:57 +00:00
1ae5d5bb62 [dynamo][user-defined] Improve getattr_static for user_defined objects (#133742)
Fixes https://github.com/pytorch/pytorch/issues/133607

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133742
Approved by: https://github.com/Skylion007, https://github.com/jansel
2024-08-20 21:51:03 +00:00
a36739f36a Cherry-Picking don't resolve conflicts (#134047)
During cherry-picking we want to use default setting and fail if there is merge conflict
Here an example of invalid conflict resolution:
https://github.com/pytorch/pytorch/pull/131194
and cherry-pick
https://github.com/pytorch/pytorch/pull/133590

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134047
Approved by: https://github.com/kit1980
2024-08-20 21:48:05 +00:00
2e1830c7c8 Implement 2D version of masked_select for nestedtensors (#133889)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133889
Approved by: https://github.com/soulitzer
2024-08-20 21:46:32 +00:00
15b5a0b67f Revert "[RFC][dynamo] add decorator to register polyfill for unsupported C++ function to avoid graph break (#133712)"
This reverts commit 71dd52f51a05d110c06e83f74cef165f64627842.

Reverted https://github.com/pytorch/pytorch/pull/133712 on behalf of https://github.com/ZainRizvi due to breaking main windows cpu tests - this stack still causes that windows test to fail ([comment](https://github.com/pytorch/pytorch/pull/133712#issuecomment-2299776241))
2024-08-20 21:14:45 +00:00
88ead0afc6 Revert "[dynamo] simplify polyfill registration for builtins.all and builtins.any (#133769)"
This reverts commit 178e8563b8a44243a6f69f3d257d9a3aab71b2c5.

Reverted https://github.com/pytorch/pytorch/pull/133769 on behalf of https://github.com/ZainRizvi due to breaking main windows cpu tests - this stack still causes that windows test to fail ([comment](https://github.com/pytorch/pytorch/pull/133712#issuecomment-2299776241))
2024-08-20 21:14:45 +00:00
3fa874abbe Revert "[dynamo] simplify implementation for functools.reduce (#133778)"
This reverts commit 37b4bc60a4ec65858044983a36577912fb9b4651.

Reverted https://github.com/pytorch/pytorch/pull/133778 on behalf of https://github.com/ZainRizvi due to breaking main windows cpu tests - this stack still causes that windows test to fail ([comment](https://github.com/pytorch/pytorch/pull/133712#issuecomment-2299776241))
2024-08-20 21:14:45 +00:00
98e6a1d8ff Revert "[dynamo] simplify implementation for builtins.sum (#133779)"
This reverts commit 3f58a8051a92470dbd254859322a7eb085a8f243.

Reverted https://github.com/pytorch/pytorch/pull/133779 on behalf of https://github.com/ZainRizvi due to breaking main windows cpu tests - this stack still causes that windows test to fail ([comment](https://github.com/pytorch/pytorch/pull/133712#issuecomment-2299776241))
2024-08-20 21:14:44 +00:00
2540ee372a Revert "[dynamo][itertools] support itertools.tee (#133771)"
This reverts commit 28ce3c0227830c78c0b5d4ec592f5c3879bc61a3.

Reverted https://github.com/pytorch/pytorch/pull/133771 on behalf of https://github.com/ZainRizvi due to breaking main windows cpu tests - this stack still causes that windows test to fail ([comment](https://github.com/pytorch/pytorch/pull/133712#issuecomment-2299776241))
2024-08-20 21:14:44 +00:00
ccc0aa69ce [ONNX] Remove torch.onnx._export (#133824)
- Remove the deprecated torch.onnx._export function
- Remove test/onnx/test_export_modes.py because export modes are no longer supported
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133824
Approved by: https://github.com/titaiwangms
2024-08-20 20:54:48 +00:00
b03381cac2 [dynamo] support cls.__flags__ (#133970)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133970
Approved by: https://github.com/jansel
ghstack dependencies: #133969
2024-08-20 20:03:31 +00:00
5229b52bf2 [dynamo] support cls.__base__ (#133969)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133969
Approved by: https://github.com/jansel
2024-08-20 20:03:31 +00:00
bb0bf09aff [easy] skip test_sdpa_autocast on windows (#134009)
test is failing because torch.compile doesn't work on windows
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134009
Approved by: https://github.com/YuqingJ, https://github.com/Skylion007, https://github.com/ZainRizvi
2024-08-20 19:51:55 +00:00
28ce3c0227 [dynamo][itertools] support itertools.tee (#133771)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133771
Approved by: https://github.com/jansel
ghstack dependencies: #133712, #133769, #133778, #133779
2024-08-20 19:48:57 +00:00
3f58a8051a [dynamo] simplify implementation for builtins.sum (#133779)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133779
Approved by: https://github.com/jansel
ghstack dependencies: #133712, #133769, #133778
2024-08-20 19:48:57 +00:00
37b4bc60a4 [dynamo] simplify implementation for functools.reduce (#133778)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133778
Approved by: https://github.com/jansel
ghstack dependencies: #133712, #133769
2024-08-20 19:48:57 +00:00
178e8563b8 [dynamo] simplify polyfill registration for builtins.all and builtins.any (#133769)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133769
Approved by: https://github.com/jansel
ghstack dependencies: #133712
2024-08-20 19:48:57 +00:00
71dd52f51a [RFC][dynamo] add decorator to register polyfill for unsupported C++ function to avoid graph break (#133712)
Add decorator `torch.compiler.substitute_in_graph` to register polyfill for unsupported C++ function to avoid graph break. This API provides an official way to add support for dynamo for third-party C extensions. Also, it can be used to simplify our implementation for `torch._dynamo.polyfill`.

5ee070266f/torch/_dynamo/variables/builtin.py (L97-L107)

Example:

```python
>>> import operator
>>> operator.indexOf([1, 2, 3, 4, 5], 3)
2

>>> torch.compile(operator.indexOf, fullgraph=True)([1, 2, 3, 4, 5], 3)
Unsupported: ...

>>> @torch.compiler.substitute_in_graph(operator.indexOf)
... def indexOf(sequence, x):
...     for i, item in enumerate(sequence):
...         if item is x or item == x:
...             return i
...     raise ValueError("sequence.index(x): x not in sequence")

>>> torch.compile(operator.indexOf, fullgraph=True)([1, 2, 3, 4, 5], 3)
2
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133712
Approved by: https://github.com/jansel
2024-08-20 19:48:57 +00:00
49430bfd5c [DeviceMesh] Add a _MeshEnv attr to record the mapping of flatten mesh_dim_name to its mesh dim index in root mesh (#133838)
```
# supposed we have a 3d mesh
mesh_3d = init_device_mesh("cuda", (2,2,2), mesh_dim_names=("dp", "cp", "tp")
dp_cp_mesh = mesh_3d["dp", "cp"]._flatten()

"""
then we would have
flatten_name_to_root_dims[mesh_3d]: {
    "dp_cp": (0, 1)
}
"""
```

We need this information to validate the order mesh slice including flatten mesh dim.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133838
Approved by: https://github.com/fegin
2024-08-20 19:43:45 +00:00
c188d419db [BE] [EZ] Allow linux-build workflows to run on the default runner type (#133640)
Replace usage of `runner` with the new `runner_prefix` input, which allows the workflows to use the default runner type (linux.2xlarge) specified by the reusable workflow.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133640
Approved by: https://github.com/clee2000, https://github.com/jeanschmidt, https://github.com/malfet
2024-08-20 19:37:14 +00:00
81a822ddc9 Back out "[1/N] Fix clang-tidy warnings in inductor (#131979)" (#133922)
Summary:
Original commit changeset: cc9392e5fce2

Original Phabricator Diff: D60464909

Differential Revision: D61501052

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133922
Approved by: https://github.com/22quinn
2024-08-20 19:16:29 +00:00
49f6ea6dd9 Revert "[report_exportability] Avoid re-exporting duplicated modules (#133930)"
This reverts commit 278bc985d71f1ee09a499fba2ea5032b7baf2567.

Reverted https://github.com/pytorch/pytorch/pull/133930 on behalf of https://github.com/izaitsevfb due to breaks lint ([comment](https://github.com/pytorch/pytorch/pull/133930#issuecomment-2299513046))
2024-08-20 18:44:09 +00:00
43f78bf37a [MPS] Gather sliced inputs to batch norm (#133610)
This PR removes the `executeGatherOp` flag from batch norm in favor of relying on the logic in 4aa66f68a8/aten/src/ATen/native/mps/OperationUtils.mm (L372) to decide if gathering is necessary.

It's not the most efficient way to solve this issue, but it assures correctness for sliced inputs.

### Performance impact

#### With fix

```
python -m timeit -n 100 -s "import torch; import torch.nn as nn; bn = nn.BatchNorm2d(100, affine=False, device='mps');x = torch.randn(100, 100, 35, 45).to('mps')" "bn(x)"
100 loops, best of 5: 282 usec per loop

python -m timeit -n 100 -s "import torch; import torch.nn as nn; bn = nn.BatchNorm2d(100, affine=False, device='mps');x = torch.randn(100, 100, 35, 45).to('mps')" "bn(x[5:])"
100 loops, best of 5: 448 usec per loop

python -m timeit -n 1000 -s "import torch; import torch.nn as nn; bn = nn.BatchNorm2d(100, affine=False, device='mps');x = torch.randn(100, 100, 35, 45).to('mps')" "bn(x)"
1000 loops, best of 5: 705 usec per loop

python -m timeit -n 1000 -s "import torch; import torch.nn as nn; bn = nn.BatchNorm2d(100, affine=False, device='mps');x = torch.randn(100, 100, 35, 45).to('mps')" "bn(x[5:])"
1000 loops, best of 5: 1.11 msec per loop

python -m timeit -n 1000 -s "import torch; import torch.nn as nn; bn = nn.BatchNorm2d(100, affine=False, device='mps');x = torch.randn(1000, 100, 35, 45).to('mps')" "bn(x)"
1000 loops, best of 5: 7.16 msec per loop

python -m timeit -n 1000 -s "import torch; import torch.nn as nn; bn = nn.BatchNorm2d(100, affine=False, device='mps');x = torch.randn(1000, 100, 35, 45).to('mps')" "bn(x[5:])"
1000 loops, best of 5: 11.7 msec per loop
```

#### Without fix

```
python -m timeit -n 100 -s "import torch; import torch.nn as nn; bn = nn.BatchNorm2d(100, affine=False, device='mps');x = torch.randn(100, 100, 35, 45).to('mps')" "bn(x)"
100 loops, best of 5: 284 usec per loop

python -m timeit -n 100 -s "import torch; import torch.nn as nn; bn = nn.BatchNorm2d(100, affine=False, device='mps');x = torch.randn(100, 100, 35, 45).to('mps')" "bn(x[5:])"
100 loops, best of 5: 265 usec per loop

python -m timeit -n 1000 -s "import torch; import torch.nn as nn; bn = nn.BatchNorm2d(100, affine=False, device='mps');x = torch.randn(100, 100, 35, 45).to('mps')" "bn(x)"
1000 loops, best of 5: 715 usec per loop

python -m timeit -n 1000 -s "import torch; import torch.nn as nn; bn = nn.BatchNorm2d(100, affine=False, device='mps');x = torch.randn(100, 100, 35, 45).to('mps')" "bn(x[5:])"
1000 loops, best of 5: 675 usec per loop

python -m timeit -n 1000 -s "import torch; import torch.nn as nn; bn = nn.BatchNorm2d(100, affine=False, device='mps');x = torch.randn(1000, 100, 35, 45).to('mps')" "bn(x)"
1000 loops, best of 5: 7.19 msec per loop

python -m timeit -n 1000 -s "import torch; import torch.nn as nn; bn = nn.BatchNorm2d(100, affine=False, device='mps');x = torch.randn(1000, 100, 35, 45).to('mps')" "bn(x[5:])"
1000 loops, best of 5: 7.13 msec per loop
```

Please feel free to push back or request changes.

Fixes #133520
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133610
Approved by: https://github.com/malfet
2024-08-20 18:24:48 +00:00
278bc985d7 [report_exportability] Avoid re-exporting duplicated modules (#133930)
Summary:
Skip re-exporting modules with the duplicated types to speed up the exportability tests.

In real models, there are many duplicated modules, and mostly have the same export issues.

Test Plan: Existing CI

Differential Revision: D61504630

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133930
Approved by: https://github.com/angelayi

Co-authored-by: bearzx <bearzx@fb.com>
2024-08-20 18:20:49 +00:00
333890b701 Enable CUDA 12.4.1 (#132202)
Trying to keep a record of the steps before I lose track of it.

- 1st Commit: Similar to https://github.com/pytorch/builder/pull/1720
- 2nd Commit:  Update CUDA 12.4 CI CUDA versions from 12.4.0 to 12.4.1 mapping to changes in https://github.com/pytorch/pytorch/pull/125944/files
- 3rd Commit: update for aarch64 install_cuda_aarch64.sh docker step
- 4th Commit: aaa456e3e6 Related https://github.com/pytorch/pytorch/pull/121684
- Synchronization point: Meta helps uploading pypi cuda dependencies specified in .github/scripts/generate_binary_build_matrix.py
- The above pypi upload is done (thanks Andrey!), restarted jobs like https://github.com/pytorch/pytorch/actions/runs/10188203670/job/28369471321
- 77532344e3, use temporary docker containers (generated from a previous successful container build). If merged, these containers would be rebuilt, therefore testing them now.  (5th commit)
- 6th commit 5f93c625b5: revert the 5th commit. Update, done but have to debug seemingly irrelevant failures (rocm/xpu/mps)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132202
Approved by: https://github.com/Skylion007, https://github.com/eqy, https://github.com/atalman
2024-08-20 17:52:50 +00:00
e41b520ee3 [3/N] Refactor FR script - Add a processor module (#133933)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133933
Approved by: https://github.com/c-p-i-o
ghstack dependencies: #133927, #133929
2024-08-20 17:36:49 +00:00
bce0caba78 [BE]: Update Typeguard to TypeIs for better type inference (#133814)
Uses TypeIs instead of TypeGuard for better inference. See https://peps.python.org/pep-0742/

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133814
Approved by: https://github.com/ezyang
2024-08-20 17:19:57 +00:00
fbf3fc2a30 [inductor] Use int64_t as index type for all platfroms 4 (#133892)
It is parallel PR to https://github.com/pytorch/pytorch/pull/133819 , and it is append change for @jansel 's comments.
1. For `torch/_inductor/codegen/cpp_wrapper_cpu.py`, revert to origin code to append LL on MacOS and Windows: bdc14ad89a
2. For `torch/_inductor/codegen/cpp_utils.py`, append LL on MacOS and Windows forlarge constants. And fix its UTs: 3a56b76ce0

------------------------------
Another solution for https://github.com/pytorch/pytorch/pull/133615, use `int64_t` as index type for all plartform.

### Development notes:
The metioned PR( https://github.com/pytorch/pytorch/pull/133615) is fix the index type not match to parse_arg args types. As reviewed with @jansel , Jason think we need to unificate `INDEX_TYPE` for all platforms.
Current code is make code cumbersome:
```python
INDEX_TYPE = "int64_t" if _IS_WINDOWS else "long"
```

So, I have some attempts to unificate `INDEX_TYPE` as `long` or `int64_t`.
For use `long` as index type: https://github.com/pytorch/pytorch/pull/133768
For use `int64_t` as index type: https://github.com/pytorch/pytorch/pull/133782

Since that, we still discussed which type we will select as final solution.
![image](https://github.com/user-attachments/assets/b23fa577-2d40-4bd6-b934-fb7994fe0bb0)

`long` type is different define and size in different OSs and different compilers. So, @jansel make decision that, we need to select `int64_t` for all platforms. So, I would comtine my work based on https://github.com/pytorch/pytorch/pull/133782.

As https://github.com/pytorch/pytorch/pull/133782 still has two issues:
1. std::min/std::max could not match function instances by arg types. It as fixed and validated in PR: https://github.com/pytorch/pytorch/pull/133812
4. Cuda TestMemoryPlanning::test_cpp_wrapper issue by wrong index type. It is fixing in this PR.

So, we made final solution in this PR.

### Changes:
**1. Use `int64_t` type as index type for all OSs: `Windows`, `Linux` and `MacOS`.**
**2. Use static_cast<int64_t>(`constant`) to convert constant to `div_floor_integer` with args type(`int64_t`).**
**3. Update `parse_arg` function signature to `int64_t`, which follow the index type.**
**4. Append double L(`LL`) to constant on Windows and MacOS, because of their int64_t are are long long.**
**5. Fix `std::min/std::max` type miss match by static_cast to `INDEX_TYPE`.**
**6. Fix UTs, containts: cuda `TestMemoryPlanning::test_cpp_wrapper`, and `test_indexing.py`.**

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133892
Approved by: https://github.com/jansel
2024-08-20 16:54:12 +00:00
3caf3baabb [inductor] enable inductor backend for dynamo on Windows. (#133921)
Changes:
Enable inductor backend for dynamo on Windows.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133921
Approved by: https://github.com/jgong5, https://github.com/jansel
2024-08-20 16:46:19 +00:00
cyy
c3d02fa390 [Reland2] Update NVTX to NVTX3 (#109843)
Another attempt to update NVTX to NVTX3. We now avoid changing NVTX header inclusion of existing code.  The advantage of NVTX3 over NVTX is that it is a header-only library so that linking with NVTX3 can greatly simplify our CMake and other building scripts for finding libraries in user environments. In addition, NVTX are indeed still present in the latest CUDA versions, but they're no longer a compiled library: It's now a header-only library. That's why there isn't a .lib file anymore.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109843
Approved by: https://github.com/peterbell10, https://github.com/eqy

Co-authored-by: Ivan Zaitsev <108101595+izaitsevfb@users.noreply.github.com>
2024-08-20 16:33:26 +00:00
33f1ee036e [dynamo][user-defined] Simplify call_hasattr (#133935)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133935
Approved by: https://github.com/williamwen42, https://github.com/jansel
ghstack dependencies: #133745, #133747, #133746, #133799, #133800
2024-08-20 16:27:44 +00:00
cyy
8d93fe510e Remove NestedTensorFactories.h (#133809)
Since it has no code.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133809
Approved by: https://github.com/ezyang
2024-08-20 16:16:30 +00:00
187d55018a [BE] Fix MYPY issues (#133872)
Fix some mypy issues that have crept in to the trunk.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133872
Approved by: https://github.com/oulgen, https://github.com/Skylion007
2024-08-20 16:12:04 +00:00
52dfe99dbf Skip test_custom_op_add_abi_compatible_cpu_with_stack_allocation internally (#133704)
Summary: This test is segfaulting internally. Skip for now so we can get the internal tests green.

Differential Revision: D61399618

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133704
Approved by: https://github.com/desertfire
2024-08-20 16:01:39 +00:00
3a2f7192c3 Revert "return state dict without optimized module (#132626)"
This reverts commit e37eef8a7bd5915fa2961d688fd8b02df5cc5fd7.

Reverted https://github.com/pytorch/pytorch/pull/132626 on behalf of https://github.com/ZainRizvi due to Sorry but it seems like this PR broke trunk. distributed/checkpoint/test_state_dict.py::TestStateDict::test_fsdp2 [GH job link](https://github.com/pytorch/pytorch/actions/runs/10458281674/job/28969008325) [HUD commit link](da69a28c6f) ([comment](https://github.com/pytorch/pytorch/pull/132626#issuecomment-2299190664))
2024-08-20 15:54:54 +00:00
f2b57d8831 Fix torch._C submodules population (#133919)
This fixes regression introduced by https://github.com/pytorch/pytorch/pull/132216 that on some Python runtimes failed with
```
>   from torch._C._dynamo.guards import GlobalStateGuard
E   ModuleNotFoundError: No module named 'torch._C._dynamo.guards'; 'torch._C._dynamo' is not a package

c:\users\malfet\git\pytorch\torch\_dynamo\convert_frame.py:28: ModuleNotFoundError
```

Simplify it by always registering submodules by its primary name and do not try to add submodules which are not part of the same namespace as parent. Otherwise module can be registered by alias, rather than by primary name.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133919
Approved by: https://github.com/atalman, https://github.com/izaitsevfb, https://github.com/XuehaiPan, https://github.com/albanD, https://github.com/Skylion007
2024-08-20 15:38:32 +00:00
b02695d65f [export] training ir migration, fix export_rle_model (#133937)
Summary:
- exir.capture + to_edge is deprecated. We need to use the export + to_edge.
- Fix quantization pass to be compatible with the new export IR. In the quantization pass, some nodes might have side-effects, so they don't have users, but still are not removed by the DCE pass. We need to consider it.
- now export_rle_model works with the default `capture_pre_autograd_graph`, it should also work with the new training it

Test Plan:
```
buck2 run 'fbcode//mode/dev-nosan' fbcode//bolt/nn/executorch/export:export_rle_model  -- -r export_rle_model
```

Differential Revision: D61485834

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133937
Approved by: https://github.com/tugsbayasgalan
2024-08-20 15:35:25 +00:00
6590f4fb0e [CD] Enable python 3.13 for xpu nightly build (#133670)
Enable python 3.13 for XPU nightly build, it depends on https://github.com/pytorch/pytorch/pull/133454 land. Also update the xpu nightly wheel test env.

Works for https://github.com/pytorch/pytorch/issues/114850
Fixes #130543
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133670
Approved by: https://github.com/atalman, https://github.com/malfet
2024-08-20 15:05:20 +00:00
36376efd06 [2/N] Refactor FR script - add a loader module (#133929)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133929
Approved by: https://github.com/c-p-i-o
ghstack dependencies: #133927
2024-08-20 14:27:40 +00:00
2bd02e0c82 Revert "[RFC][dynamo] add decorator to register polyfill for unsupported C++ function to avoid graph break (#133712)"
This reverts commit 641724ed1daad1e6fc2525cc6858d199e576d5cd.

Reverted https://github.com/pytorch/pytorch/pull/133712 on behalf of https://github.com/jeanschmidt due to breaking main windows cpu tests - reverting them all, so we can identify the culprit with more calmness ([comment](https://github.com/pytorch/pytorch/pull/133712#issuecomment-2298528797))
2024-08-20 10:34:41 +00:00
91fd270535 Revert "[dynamo] simplify polyfill registration for builtins.all and builtins.any (#133769)"
This reverts commit 59ca56e56ca3e2f6dd80db57079725cf61f06810.

Reverted https://github.com/pytorch/pytorch/pull/133769 on behalf of https://github.com/jeanschmidt due to breaking main windows cpu tests - reverting them all, so we can identify the culprit with more calmness ([comment](https://github.com/pytorch/pytorch/pull/133712#issuecomment-2298528797))
2024-08-20 10:34:41 +00:00
5109c5ef23 Revert "[dynamo] simplify implementation for functools.reduce (#133778)"
This reverts commit ff9be0eda99c59cdbcc269853168657de93043c7.

Reverted https://github.com/pytorch/pytorch/pull/133778 on behalf of https://github.com/jeanschmidt due to breaking main windows cpu tests - reverting them all, so we can identify the culprit with more calmness ([comment](https://github.com/pytorch/pytorch/pull/133712#issuecomment-2298528797))
2024-08-20 10:34:41 +00:00
241df7e7f8 Add multi-cache autotune test (#133868)
Summary:
The existing tests didn't cover a case where we had multiple autotunes in a single graph.  Add a test to demonstrate that case.

Also added a test dependency on redis and removed the "fake redis" from the previous PR (#133579)

Test Plan: unit tests

Reviewed By: oulgen

Differential Revision: D61178861

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133868
Approved by: https://github.com/oulgen
2024-08-20 10:26:45 +00:00
11af423eca [SymmetricMemory] make buffer_ptrs_dev, signal_pad_ptrs_dev, buffer_size, and signal_pad_size accessible in python (#133680)
These allows us to experiment with creative applications with triton.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133680
Approved by: https://github.com/Chillee
2024-08-20 10:15:35 +00:00
08b5e07e6c Revert "[dynamo] simplify implementation for builtins.sum (#133779)"
This reverts commit 1fdeb4e32918017ee3a712e0bba86e8482fa293b.

Reverted https://github.com/pytorch/pytorch/pull/133779 on behalf of https://github.com/jeanschmidt due to breaking main windows cpu tests ([comment](https://github.com/pytorch/pytorch/pull/133779#issuecomment-2298285206))
2024-08-20 08:33:29 +00:00
68570fca69 Revert "Add MaskedTensor support to *_like API (#128637)"
This reverts commit 8de56e29581fa2706d44f8c4b0827830c9351470.

Reverted https://github.com/pytorch/pytorch/pull/128637 on behalf of https://github.com/jeanschmidt due to Introduced API linting errors ([comment](https://github.com/pytorch/pytorch/pull/128637#issuecomment-2298270307))
2024-08-20 08:26:28 +00:00
42097f0ec1 Revert "[BE]: Update Typeguard to TypeIs for better type inference (#133814)"
This reverts commit cf60fe53a83bafec0857d5b49c2054de6ba4cddc.

Reverted https://github.com/pytorch/pytorch/pull/133814 on behalf of https://github.com/jeanschmidt due to Broke 12k internal signals/jobs, @ezyang please help get those changes merged. More details check D61488368 ([comment](https://github.com/pytorch/pytorch/pull/133814#issuecomment-2298210309))
2024-08-20 08:02:49 +00:00
25d5a815f7 [Dynamo] Guard on torch function mode global state (#133135)
Adds guards checking whether torch function mode is in the all disabled state.

There are three torch function enablement states:
* All torch function disabled (modes + subclasses)
* Torch function subclass disabled
* All enabled

We now have guards checking if the state is All enabled and if state is All disabled.
All of the above ternary states are assigned to a unique pair of these two flags.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133135
Approved by: https://github.com/anijain2305
ghstack dependencies: #133130, #133729, #133131, #133132, #133133, #133134, #133136
2024-08-20 07:15:04 +00:00
48ee0984ac Add C API to return all torch function disablement status (#133136)
This PR adds a C function to check if all torch function is disabled.
Recall that there are three torch function enablement states:
* All disabled
* Torch Function Subclass disabled
* All enabled

The API before this change provides two functions:
* `_is_torch_function_enabled` - returns True iff the current TF state is All enabled
* `_is_torch_function_mode_enabled` - returns True iff the state is not All disabled and the torch function mode stack is non-empty.

The crux of why a new API is needed is the following: If dynamo enters a frame with the torch function mode stack empty, `_is_torch_function_enabled` == False, it is impossible to determine if after a new mode is pushed whether we should enter the mode or not. This is because we don't know if the enablement state is All disabled or only subclass disabled. Adding this API to check if All disabled is True allows us to disambiguate this case.

In the next PR, Dynamo InstructionTranslator will have clearer flags than the underlying C API:
* A flag to indicate if subclasses are disabled (ie All disabled or Subclass Disabled is the current state)
* A flag to indicate if modes are disabled (ie if All disabled is the current state)
* A symbolic stack which can be checked if any modes are present

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133136
Approved by: https://github.com/bdhirsh
ghstack dependencies: #133130, #133729, #133131, #133132, #133133, #133134
2024-08-20 07:15:04 +00:00
d97ca968cd [Dynamo] Test intermediate tf mode construction (#133134)
Ensures that constructing a torch function mode in the middle of a function is supported.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133134
Approved by: https://github.com/williamwen42
ghstack dependencies: #133130, #133729, #133131, #133132, #133133
2024-08-20 07:14:56 +00:00
626acaeb16 [Dynamo] Support torch function stack len (#133133)
Adds support for `torch._C._len_torch_function_stack()`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133133
Approved by: https://github.com/williamwen42
ghstack dependencies: #133130, #133729, #133131, #133132
2024-08-20 07:14:52 +00:00
d1fdf984c3 [Dynamo] Support push torch function mode stack (#133132)
This PR adds support `torch._C._push_on_torch_function_stack()` by updating `torch.py` to push onto the symbolic torch function mode stack when a push is encountered. The same side effects infra used in the previous PR is used to track the mutation of the torch function mode stack and add bytecode to update it if it is mutated.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133132
Approved by: https://github.com/williamwen42
ghstack dependencies: #133130, #133729, #133131
2024-08-20 07:14:47 +00:00
c0b4aaa8c5 [Dynamo] Support pop torch function mode stack (#133131)
This PR adds support for tracing `torch._C._pop_torch_function_stack()` without graph breaking and in order to verify the state change also adds replay of mutations to the torch function mode stack via side_effects appending supplemental bytecode as we do for other python mutable objects.

Details:
To represent the torch function mode stack symbolically a deque field is added to the instruction translator. When the InstructionTranslator is initialized, all modes are read from the current torch function mode stack, and stashed in a global weak ref for later access (using existing sources) without needing to push/pop the python/cpp torch function mode stack.

During tracing, when `_pop_torch_function_stack` is encountered a value is popped from this deque and the variable tracker representing the mode is returned. To ensure the true torch function mode stack matches this state, `TorchFunctionModeStackVariable`, a singleton, is marked as mutated, this adds it to side effects, where during final codegen, side effects will codegen a call to a python helper which will update the python torch function mode stack.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133131
Approved by: https://github.com/jansel
ghstack dependencies: #133130, #133729
2024-08-20 07:14:42 +00:00
f147349568 Fix DeviceContext bug (#133729)
Fixes https://github.com/pytorch/pytorch/issues/133666

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133729
Approved by: https://github.com/bdhirsh
ghstack dependencies: #133130
2024-08-20 07:14:37 +00:00
09e366cb57 [Dynamo] Add torch function mode stack guard to dynamo (#133130)
This PR adds a guard on the torch function mode stack state at the beginning of tracing. The way this is implemented is via a new leaf guard which is passed the initial stack state at construction and compares it to the stack state at the time the guard is run.

Details:
The stack state is extracted via popping all modes, appending them to a list, and pushing all modes back. This list is stored on the output graph and read during guard construction to pass to the stack mode guard. There the length and types of the modes are recorded. Next time the guard is run it compares this recorded state to the current mode stack state.

To implement this in python a helper function was added to utils.py and this is used if cpp guards are not enabled.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133130
Approved by: https://github.com/anijain2305
2024-08-20 07:14:33 +00:00
7492da804f Mark disabled tests as fixed (#133940)
Fixes #132552, #133900, #133901, #133902, #133903, #133904, #133905, #133906, #133908, #133910, #133911, #133912, #133913, #133914, #133915, #133916, #133917

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133940
Approved by: https://github.com/oulgen
2024-08-20 06:58:11 +00:00
e8d3c4be36 [dynamo][reland][inline-inbuilt-nn-modules] Mark attributes of nn mod… (#133714)
Relands https://github.com/pytorch/pytorch/pull/132539
Relands https://github.com/pytorch/pytorch/pull/132736

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133714
Approved by: https://github.com/jansel
2024-08-20 05:57:52 +00:00
f08d484702 Add itertools.islice support in dynamo (#133893)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133893
Approved by: https://github.com/oulgen
2024-08-20 05:55:53 +00:00
b6891f4002 [1/N] Refactor fr trace script to make it modulized - config (#133927)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133927
Approved by: https://github.com/c-p-i-o
2024-08-20 05:47:17 +00:00
15addb00e6 Update test_control_flow.py to device-agnostic. (#133843)
Fixes #133841

This PR makes the `test_pointwise_associative_scan_CUDA_flip` also work on XPU.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133843
Approved by: https://github.com/EikanWang, https://github.com/desertfire, https://github.com/malfet, https://github.com/jansel, https://github.com/atalman
2024-08-20 05:05:43 +00:00
994fcb9acd Killswitch based rollout for flight recorder (#133237)
Summary: Defaulting TORCH_NCCL_DUMP_ON_TIMEOUT to "true" and adding a kilswitch in case we need to kill this feature in production.

Test Plan: Tests pass manually but need futher testing before this is rolled out fully everywhere.

Differential Revision: D61136320

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133237
Approved by: https://github.com/c00w
2024-08-20 04:27:55 +00:00
32f57ac627 [BE] Fix lint issues in qlinear_prepack.cpp (#133797)
Summary: This diff fixed many lint issues in qlinear_prepack.cpp. I'am fixing them as I want to add more ops/funcs into this file later.

Test Plan: Sandcastle

Differential Revision: D61425436

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133797
Approved by: https://github.com/Skylion007
2024-08-20 04:23:25 +00:00
b0bafd2be5 remove tensor weak ref from constraint target (#133890)
Summary: `_ConstraintTarget` is an internal data structure that has some redundancy: tensors are identified by their id but also carry a weak reference. The weak reference was probably useful a year back but everything is done with ids right now, and the lifetime of these tensors ensures that using their ids is OK.

Test Plan: existing tests

Differential Revision: D61488816

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133890
Approved by: https://github.com/tugsbayasgalan
2024-08-20 03:03:05 +00:00
188cb5e67b Bump scikit-image to 0.22.0 (#133932)
Fixes: https://github.com/pytorch/pytorch/issues/133926

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133932
Approved by: https://github.com/malfet
2024-08-20 02:37:16 +00:00
6c82a1c68c [AOTI] Introduce DeferredCudaKernelLine for cuda cpp wrapper (#129135)
Summary: When generating CUDA kernel load and launch, certain Triton kernel meta data are needed, but those meta data only exist after kernel auto-tune is done. DeferredCudaKernelLine is a deferred line which can backfill a string template after kernel auto-tune. This is to prepare for one-pass AOTI codegen implementation.

Differential Revision: [D61018114](https://our.internmc.facebook.com/intern/diff/D61018114)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129135
Approved by: https://github.com/angelayi
2024-08-20 02:15:44 +00:00
cyy
c51fc7e98e Enable clang-tidy in aten/src/ATen/native/nested/ (#133829)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133829
Approved by: https://github.com/Skylion007
2024-08-20 01:52:15 +00:00
c6ea7b3f21 Update xpu CD used driver to rolling version (#133454)
The main purpose of this PR is change the XPU CD use rolling driver to support more clients GPU AOT build and enable Kineto. And also plan to enable python 3.13 for xpu CD.

Works for https://github.com/pytorch/pytorch/issues/114850
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133454
Approved by: https://github.com/atalman
2024-08-20 01:45:45 +00:00
c7af2728d3 Remove aten dispatch to empty in foreach_norm cuda kernel (#133897)
Saves significant time on aten dispatch. For 2k tensors, goes from 38ms to 58us.
Should shave some overhead mentioned in https://github.com/pytorch/pytorch/issues/133586

Before PR:
![image](https://github.com/user-attachments/assets/7813f059-0f7f-4d44-a9f0-1aaf94ae849f)

After:
![image](https://github.com/user-attachments/assets/ad0855b1-2743-432a-ad31-b574c620e2fd)

script:
```
import torch

# warm up caching allocator
a = torch.rand(200, 10, device="cuda")
b = torch.rand(200, 10, device="cuda")
c = a + b
del a, b, c

ts = [torch.rand(2, 3, device="cuda") for _ in range(2000)]

with torch.profiler.profile(
    activities=[
        torch.profiler.ProfilerActivity.CPU,
        torch.profiler.ProfilerActivity.CUDA,
    ]
) as p:
    torch._foreach_norm(ts)

print(p.key_averages().table(sort_by="cpu_time_total"))
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133897
Approved by: https://github.com/albanD, https://github.com/drisspg
2024-08-20 01:27:09 +00:00
874ae854eb [c10d] Land CudaEventCache with roll out flags (#133727)
@zdevito added a cache for CudaEvent in https://github.com/pytorch/pytorch/pull/122732. And we want to productionize it with a flag in this PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133727
Approved by: https://github.com/shuqiangzhang, https://github.com/eqy
2024-08-20 01:08:00 +00:00
cfcb9e388d [PT2][Optimus] Add move reshape out of split stack pass (#133710)
Summary: We observed a  new pattern in CMF where reshape nodes are in the middle of split stack patter, introducing massive triton_fused_stack_xxx kernels, leading to increased compilation time, we thus move it outside of the pattern, and elimate such split stack nodes.

Test Plan:
# unit test
```
CUDA_VISIBLE_DEVICES=3 OC_CAUSE=1 buck2 test //caffe2/test/inductor:split_cat_fx_passes
```

Buck UI: https://www.internalfb.com/buck2/2fb51ae7-832e-436b-b6b7-a81599390182
Test UI: https://www.internalfb.com/intern/testinfra/testrun/14918173811074971
Network: Up: 10MiB  Down: 5.4GiB  (reSessionID-96a20105-fdc6-4b4f-b465-813a84a71eba)
Jobs completed: 304618. Time elapsed: 25:24.7s.
Cache hits: 99%. Commands: 120772 (cached: 120410, remote: 357, local: 5)
Tests finished: Pass 9. Fail 0. Fatal 0. Skip 1. Build failure 0

# benchmark

```
CUDA_VISIBLE_DEVICES=3 OC_CAUSE=1 buck2 run mode/opt //scripts/jackiexu0313/pt2:local_model_with_pt2 -- --test_mode batch-split --model_type "cmf_shrink" --flow_id 587303213
```
P1529578588
graph diffing: https://www.internalfb.com/intern/diffing/?paste_number=1529577762

Counter({'pattern_matcher_nodes': 2123, 'pattern_matcher_count': 1715, 'normalization_pass': 404, 'remove_split_with_size_one_pass': 269, 'extern_calls': 193, 'merge_splits_pass': 74, 'normalization_aten_pass': 47, 'fxgraph_cache_miss': 9, 'batch_aten_mul': 6, 'scmerge_split_sections_removed': 5, 'scmerge_split_removed': 4, 'scmerge_cat_removed': 4, 'unbind_stack_pass': 4, 'batch_sigmoid': 2, 'batch_linear': 2, 'move_reshape_out_of_split_stack_pass': 2, 'batch_aten_sub': 2, 'batch_layernorm': 1, 'scmerge_split_added': 1, 'scmerge_cat_added': 1, 'split_stack_to_cats_pass': 1, 'split_cat_to_slices_pass': 1, 'batch_aten_add': 1, 'batch_relu': 1})

Trace link: https://www.internalfb.com/intern/perfdoctor/trace_view?filepath=tree%2Ftraces%2Ftest%2Fcmf_shrink.Aug_15_10_55_41_trace.json.gz&bucket=pyper_traces

The triton_fused_stack_xxx has been reduced significantly, we can see from the trace that the green part becomes smaller
{F1806406290}

# e2e
ads_dper3:68464f2dc5e849ba2670482079cecaaa
training_platform:8643db0c3453f2658aa7be7d73974ea0

baseline:
f588719502

proposal:
f592116164

Differential Revision: D61249205

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133710
Approved by: https://github.com/jackiexu1992
2024-08-20 00:50:07 +00:00
6f738d6434 Remove early exit in constant_pad_nd for export (#132679)
Summary:
Remove the early exit for padding when padding = [0, 0, 0, 0].

This prevents export from specializing when all padding=0, allowing export when all padding >= 0. Specialization will still happen for negative padding.

This change will be used to export image preprocess for multimodal models, where images of dynamic shape are padded. As images are of dynamic shape, we can't be sure if padding will be required or not. Padding is guaranteed to be non-negative.

Preprocess code: https://github.com/pytorch/torchtune/pull/1242

Note: the alternative is to wrap padding in a custom op, which isn't ideal given the custom op will contain the same impl as constant_pad_nd.

Test Plan: ci

Differential Revision: D60687727

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132679
Approved by: https://github.com/ezyang
2024-08-20 00:07:41 +00:00
9a998d98f1 Fix edge case in inductor triton clean script (#130837)
The regex in the script is too restrictive, as it excludes examples with parentheses in args, like the following:
```
triton_poi_fused_add_0.run(arg0_1.item(), arg1_1.item(), buf0, 1, grid=grid(1), stream=streamNone)
                                       ^
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130837
Approved by: https://github.com/Chillee
2024-08-19 23:46:11 +00:00
65b3e42074 Warn on fx graph cache bypass and log it to tlparse (#133826)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133826
Approved by: https://github.com/aorenste
2024-08-19 23:39:55 +00:00
2ec95ffe57 [cond] support unbacked symbool inputs (#133589)
Fixes https://github.com/pytorch/pytorch/issues/133577.

In dynamo, when received an unbacked symbool input, we create an unbacked symint to replace it.

The alternative approach of `not realizing the pred LazyVariable in cond` doesn't work because we need to get the proxy of the symbool input.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133589
Approved by: https://github.com/ezyang
2024-08-19 23:36:48 +00:00
3f525c9d5d Upgrade nightly wheels to rocm6.2 - 2 of 2 (binaries) (#133238)
Depends on https://github.com/pytorch/pytorch/pull/132875

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133238
Approved by: https://github.com/atalman
2024-08-19 22:35:33 +00:00
2b95007d12 [dynamo] support random.Random (#133725)
Fixes the observed graph breaks in https://github.com/pytorch/pytorch/issues/121349 and https://github.com/pytorch/pytorch/issues/121350.

But there are still graph breaks since a random output is being used as a seed, e.g.
```python
import random
import torch

def fn(x):
    seed = random.randint(0, 100)
    rand = random.Random(seed)
    return x + rand.randrange(10)

opt_fn = torch.compile(fn, backend="eager", fullgraph=True)
opt_fn(torch.ones(1))
```

fails with
```
torch._dynamo.exc.InternalTorchDynamoError: UnspecializedPythonVariable() is not a constant
```

when tracing the line
```
rand = random.Random(seed)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133725
Approved by: https://github.com/jansel
2024-08-19 22:34:44 +00:00
06faa15194 [pytorch][counters] add pytorch.wait_counter.fx_codgen_and_compile (#133107)
as titled

Differential Revision: [D60876629](https://our.internmc.facebook.com/intern/diff/D60876629/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133107
Approved by: https://github.com/asiab4
2024-08-19 22:29:16 +00:00
afb3e5ed6a Add onnx and onnxscript to CI requirements (#133647)
Add onnx and onnxscript to requirements-ci.txt to allow for `test_public_bindings` and mypy to function when checking `torch.onnx._internal` code as @malfet suggested.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133647
Approved by: https://github.com/titaiwangms, https://github.com/kit1980
2024-08-19 22:15:07 +00:00
1fdeb4e329 [dynamo] simplify implementation for builtins.sum (#133779)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133779
Approved by: https://github.com/jansel
ghstack dependencies: #133712, #133769, #133778
2024-08-19 22:14:34 +00:00
ff9be0eda9 [dynamo] simplify implementation for functools.reduce (#133778)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133778
Approved by: https://github.com/jansel
ghstack dependencies: #133712, #133769
2024-08-19 22:14:33 +00:00
59ca56e56c [dynamo] simplify polyfill registration for builtins.all and builtins.any (#133769)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133769
Approved by: https://github.com/jansel
ghstack dependencies: #133712
2024-08-19 22:14:33 +00:00
641724ed1d [RFC][dynamo] add decorator to register polyfill for unsupported C++ function to avoid graph break (#133712)
Add decorator `torch.compiler.substitute_in_graph` to register polyfill for unsupported C++ function to avoid graph break. This API provides an official way to add support for dynamo for third-party C extensions. Also, it can be used to simplify our implementation for `torch._dynamo.polyfill`.

5ee070266f/torch/_dynamo/variables/builtin.py (L97-L107)

Example:

```python
>>> import operator
>>> operator.indexOf([1, 2, 3, 4, 5], 3)
2

>>> torch.compile(operator.indexOf, fullgraph=True)([1, 2, 3, 4, 5], 3)
Unsupported: ...

>>> @torch.compiler.substitute_in_graph(operator.indexOf)
... def indexOf(sequence, x):
...     for i, item in enumerate(sequence):
...         if item is x or item == x:
...             return i
...     raise ValueError("sequence.index(x): x not in sequence")

>>> torch.compile(operator.indexOf, fullgraph=True)([1, 2, 3, 4, 5], 3)
2
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133712
Approved by: https://github.com/jansel
2024-08-19 22:14:33 +00:00
8de56e2958 Add MaskedTensor support to *_like API (#128637)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128637
Approved by: https://github.com/cpuhrsch
2024-08-19 22:13:59 +00:00
14ddd932fd Add MaskedTensor support to _is_any_true (#128574)
Fixes #128557

If there is a better way to detect autograd anomalies consistently, feel free to share your ideas. This is a dirty check.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128574
Approved by: https://github.com/cpuhrsch
2024-08-19 21:34:31 +00:00
432638f521 Remove useless environment in reusable workflow (#133659)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133659
Approved by: https://github.com/Skylion007
2024-08-19 20:44:17 +00:00
d131048056 Change install_triton to do git checkout, apply patch, pip install (#133878)
Fixes Docker builds: https://github.com/pytorch/pytorch/actions/runs/10458684809/job/28961048777

Follow up after https://github.com/pytorch/pytorch/pull/133694 to apply same patch to Docker build.

Change Rather then doing:
```
pip_install "git+${TRITON_REPO}@${TRITON_PINNED_COMMIT}#subdirectory=python"
```

We do using 4 step: git clone, git checkout, apply patch, pip install
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133878
Approved by: https://github.com/malfet, https://github.com/ZainRizvi
2024-08-19 20:42:50 +00:00
66d6d8b1b9 Support TORCH_COMPILER_COLLECTIVES envvar (#133696)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133696
Approved by: https://github.com/Skylion007, https://github.com/c-p-i-o
2024-08-19 20:13:04 +00:00
0d4eacb9d2 [fake tensor] unbacked symint support for binary op fast path (#133584)
Addreses https://github.com/pytorch/pytorch/issues/133525

We have an unbacked symint in `final_shape` and it's a tuple... So, add `guard_size_oblivious` to do size oblivious checks + `sym_eq` for list equality.

```
op.shape
> torch.Size([1])
final_shape
> (u0 + 1,)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133584
Approved by: https://github.com/ezyang
2024-08-19 20:03:05 +00:00
565e2ea019 Scale XBLOCK in triton for pointwise (#133300)
Adjust https://github.com/pytorch/pytorch/pull/128826 for also `triton_heuristics.pointwise`.

An example we encountered during training qwen-7b with rocm 6.1:

Note: this kernel also hit the limit of `TRITON_MAX_BLOCK['X']`, shall we increase it from 2048 to 4096?

```

import torch

aten = torch.ops.aten
inductor_ops = torch.ops.inductor
assert_size_stride = torch._C._dynamo.guards.assert_size_stride
empty_strided_cpu = torch._C._dynamo.guards._empty_strided_cpu
empty_strided_cuda = torch._C._dynamo.guards._empty_strided_cuda
reinterpret_tensor = torch._C._dynamo.guards._reinterpret_tensor
alloc_from_pool = torch.ops.inductor._alloc_from_pool

import triton
import triton.language as tl
from triton.compiler.compiler import AttrsDescriptor

from torch._inductor.runtime import triton_heuristics
from torch._inductor.runtime.hints import DeviceProperties

@triton_heuristics.pointwise(
    size_hints=[8589934592],
    filename=__file__,
    triton_meta={'signature': {0: '*bf16'}, 'device': DeviceProperties(type='hip', index=2, cc='gfx942', major=None, regs_per_multiprocessor=None, max_threads_per_multi_processor=None, multi_processor_count=None), 'constants': {}, 'configs': [AttrsDescriptor(divisible_by_16=(0, 1), equal_to_1=())]},
    inductor_meta={'autotune_hints': set(), 'kernel_name': 'triton_poi_fused_nll_loss_backward_0', 'mutated_arg_names': [], 'no_x_dim': False, 'num_load': 0, 'num_reduction': 0, 'backend_hash': None, 'are_deterministic_algorithms_enabled': False, 'assert_indirect_indexing': True, 'autotune_local_cache': True, 'autotune_pointwise': True, 'autotune_remote_cache': False, 'force_disable_caches': False, 'dynamic_scale_rblock': True, 'max_autotune': False, 'max_autotune_pointwise': False, 'min_split_scan_rblock': 256, 'spill_threshold': 16, 'store_cubin': False, 'is_hip': True},
    min_elem_per_thread=0
)
@triton.jit
def triton_(out_ptr0, XBLOCK : tl.constexpr):
    xoffset = tl.program_id(0).to(tl.int64) * XBLOCK
    xindex = xoffset + tl.arange(0, XBLOCK)[:].to(tl.int64)
    x0 = xindex
    tmp0 = 0.0
    tl.store(out_ptr0 + (x0), tmp0, None)

import triton
import triton.language as tl
from torch._inductor.runtime.triton_heuristics import grid
from torch._C import _cuda_getCurrentRawStream as get_raw_stream

if __name__ == "__main__":
    with torch.cuda._DeviceGuard(2):
        torch.cuda.set_device(2)
        buf0 = empty_strided_cuda((32752, 151936), (151936, 1), torch.bfloat16)
        stream2 = get_raw_stream(2)
        triton_.run(buf0, grid=grid(4976207872), stream=stream2)

```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133300
Approved by: https://github.com/jansel
2024-08-19 19:41:55 +00:00
fb26b84390 Update fused kernels and call _safe_softmax from SDPA (#133882)
# UPDATE:
This is  take 3 of https://github.com/pytorch/pytorch/pull/131863 which was landed via co dev but not applying correclty

# Summary
Changes the stance of SDPA on what to do for fully masked out rows

## Current Behavior
Several PyTorch users have expressed frustration over this issue:
- https://github.com/pytorch/pytorch/issues/41508
- https://github.com/pytorch/pytorch/issues/103749
- https://github.com/pytorch/pytorch/issues/103963

These are significant issues with extensive discussion but no satisfactory resolution. The PyTorch team's consensus, as stated here:
https://github.com/pytorch/pytorch/issues/24816#issuecomment-524415617

Can be paraphrased as follows:

When passing in fully masked out rows, attention becomes ambiguous. We have two main options:

1. Uniformly attend to all values:
   ```python
   scores[masked_out_rows] = 1 / len(row)
   out[masked_out_rows] = 1 / len(row) * value
   ```

2. Decide that attention between no queries (masked) and no keys (masked) is meaningless:
   ```python
   output[fully_masked_rows] = NaN
   ```

We went with option 2. Partially because it was easier to implement, but also people argued that users can slice the output to remove the NaNs:
``` Python
>fill_value = -float("inf")
>row0 = torch.randn(4)
>row1 = torch.tensor([(fill_value for _ in range(4)])
>matrix = torch.stack([row0, row1]).requires_grad_(True)
>out = torch.softmax(matrix, 1)
>out = out[0]
>print(out)
tensor([0.5377, 0.2729, 0.0692, 0.1201])
```
Cool, problem solved. But what happends when you call backwards..
```Python
>out.backward(torch.ones_like(out))
>print(matrix.grad)
tensor([[3.0957e-08, 1.4157e-08, 7.7802e-10, 1.3713e-08],
        [       nan,        nan,        nan,        nan]])
```
Those pesky NaNs are back!

## Why do we see NaNs today?

The core of the problem revolves around using softmax function in sdpa:

```python
> row = torch.tensor([(-float("inf")) for _ in range(4)])
> torch.softmax(row, 0)
tensor([nan, nan, nan, nan])
```

## Quick Aside: Masking in Attention

Attention itself doesn't have a concept of masking. The `sdpa` function has an argument called `attn_mask`, which would be more accurately named `attn_bias`. This is because we don't actually "mask" entries when computing attention. Instead, due to implementation details([performance](https://github.com/pytorch/pytorch/issues/25110#issuecomment-524519087)), we add a value to the masked-out query/key pairs.

We use a large negative number (typically -inf) to decrease the attention weight, as softmax assigns more weight to larger values.

## Alternative Approaches

If we use a very large negative number instead of -inf:

```python
> row = torch.tensor([(-1e6) for _ in range(4)])
> torch.softmax(row, 0)
tensor([0.2500, 0.2500, 0.2500, 0.2500])
```
However if users always remembered to "slice" out their outputs i.e.:
```Python
>fill_value = -1e6
>...
>out.backward(torch.ones_like(out))
>print(matrix.grad)
tensor([[-0.0563, -0.0564,  0.1613, -0.0486],
        [ 0.0000,  0.0000,  0.0000,  0.0000]])
```
This would bring us back into a better state.

## A Third Option

We don't necessarily need to alter the behavior of softmax for -inf or very large negative numbers. The fundamental goal is to exclude certain query/key pairs from attention, regardless of the underlying implementation.

This PR implements the new semantic for masking w/ attention in fully masked-out rows:
```python
out[masked_out_rows] = 0
```

**Important Note**: This idea isn't entirely new. The [MaskedTensor](https://pytorch.org/tutorials/prototype/maskedtensor_overview#safe-softmax) prototype, a tensor subclass, was designed to handle such cases. However, it remains a prototype feature and hasn't gained widespread adoption.

## Details
This PR stack does 3 things:
1. Adds a PRIVATE _safe_softmax op
2. Updates semantic for flash_cpu fused kernel
3. Updates semantic for efficient_cuda fused kernel

_safe_softmax is not supposed to be used generically and is only meant to be used within the context of SDPA. Due to this fact instead of decomposing softmax and checking for -inf rows we instead "cheat" and use nan_to_num.

Why I think this is okay? (please find a counter point if avail)
There are multiple ways NaNs can emerge. For the fully masked out rows case nan_to_num works. But what if there were other NaNs, wouldn't this silently remove them?

The only case that this can happen is if the input itself had a NaN or an Inf
For example:
```Python
a = torch.ones([4], requires_grad=False, dtype=torch.float16)
a[1] = torch.finfo(torch.float16).max
print(a.softmax(-1))
```
Will return
`tensor([0., 1., 0., 0.], dtype=torch.float16)`

Where
```Python
a = torch.ones([4], requires_grad=False, dtype=torch.float16)
a[1] = float("inf")
a.softmax(-1)
```
returns:
`tensor([nan, nan, nan, nan], dtype=torch.float16)`

If we dont want to even allow for the possibility of "inf" or "NaN" attention scores to be converted to 0 then we can implemented it something like this

```Python
max = torch.max(a, dim=-1, keepdim=True)
exp = torch.exp(a - max.values)
denom = torch.sum(exp, dim=-1, keepdim=True)
softmax = exp / denom
softmax = torch.where(max.values == float('-inf'), 0.0, softmax)
```
however we would be paying for this in math performance.

## Why Now
I think one point that has substantially changed where PyTorch should lie on this argument is the fact that we have fused implementations for SDPA now. And these fused implementations allow us to easily and performantly support this new semantic.

Differential Revision: [D61418679](https://our.internmc.facebook.com/intern/diff/D61418679)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133882
Approved by: https://github.com/soulitzer
2024-08-19 18:53:11 +00:00
f1dc3b108a Back out "[export] fix test for training ir migration" (#133697)
Summary:
Original commit changeset: 0a1cb57e0338

Original Phabricator Diff: D61223356

Test Plan: buck2 run 'fbcode//mode/dev-nosan' fbcode//bolt/nn/executorch/export:export_rle_model -- -r  test_export_rle_model

Reviewed By: tugsbayasgalan

Differential Revision: D61395818

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133697
Approved by: https://github.com/tugsbayasgalan
2024-08-19 18:30:42 +00:00
a8619c9a1d Add nitpicker, which allows adding comments to PRs when they match a file pattern (#133861)
This message would have helped avoid https://www.internalfb.com/sevmanager/view/440895

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133861
Approved by: https://github.com/albanD, https://github.com/izaitsevfb
2024-08-19 18:29:59 +00:00
64d9afd8a7 Register nll_loss2d decompositions for core aten (#133534)
When exporting a training model for Executorch (which requires all ops to be core aten) with cross entropy loss (`torch.nn.CrossEntropyLoss`), we ran into the following error from the fx verifier in `to_edge`:

```
torch._export.verifier.SpecViolationError: Operator torch._ops.aten.nll_loss2d_forward.default is not Aten Canonical.
```
The aten [implementation](https://github.com/pytorch/pytorch/blob/main/aten/src/ATen/native/LossNLL.cpp#L624) of `torch.nn.CrossEntropyLoss` uses `nll_loss2d_forward` for inference and `nll_loss2d_backward` for training, so we need to add the decompositions for both (which already exist) to the list of core aten decompositions.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133534
Approved by: https://github.com/JacobSzwejbka
2024-08-19 18:26:48 +00:00
ad7dda7b32 [CI] Bump up TIMM pin (#133528)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133528
Approved by: https://github.com/angelayi
2024-08-19 18:13:57 +00:00
773a782249 Decompose _unsafe_index_put into index_put (#133365)
## Description
Create decomposition of _unsafe_index_put (non-core aten) that turns it into index_put (core aten)

## Testing
Phi3 mini + LoRA model successfully passed `to_edge` after failing due to a non-core aten `unsafe_index_put` getting introduced in a decomposition during joint graph calculations.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133365
Approved by: https://github.com/pianpwk
2024-08-19 18:07:23 +00:00
517aee5369 [torchscript] Add a sampled logging integration point. (#133484)
Test Plan:
test script:
```
    def test_zhxchen17(self):
        from libfb.py.pyinit import initFacebook

        initFacebook()

        class M(torch.nn.Module):
            def forward(self, x):
                return torch.add(x, x)

        def tmptmp(x, y):
            return torch.mul(x, y)

        m = M()
        n = torch.jit.script(m)
        print(n(torch.tensor(1)))
        print(torch.jit.script(tmptmp)(torch.tensor(1), torch.tensor(2)))
```

```
I0802 12:01:23.932929 4079081 init.cc:407] Logging to scuba: run __torch__.caffe2.test.export.test_export.M.forward sample rate: 1000000
```

Differential Revision: D60920867

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133484
Approved by: https://github.com/davidberard98
2024-08-19 18:04:45 +00:00
6564e746ed [PT2] Port remove_noop to PT2 pre_grad passes (#132183)
Summary: migrate to aten IR, `reshape` -> `view.default`, not covering `flatten` as there are already optimazation done in PT2, see the example here P1506057533

Differential Revision: D60476525

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132183
Approved by: https://github.com/frank-wei
2024-08-19 17:46:51 +00:00
da69a28c6f [pipelining] Add schedule runtime for lowered schedule (#130488)
Creates a new runtime that shifts complexity from runtime to
ahead-of-time.

The existing runtime (PipelineScheduleMulti) accepts a
compute-only schedule (forward, backward, weight) actions only are
specified, and it infers the communication operations at runtime.
Compared to that runtime, PipelineScheduleRuntime has less logic that
happens at runtime and relies on lowering passes to transform the
compute-only schedule to add communications.

Advantages include
- easier to verify the correctness by dumping a compute+comm schedule
- posible to manually edit the compute+comm schedule if the lowering
  heuristics are insufficient

Functionality included inside the PipelineScheduleRuntime is limited to
- accepting a compute-only schedule and lowering it to add comms
- executing the compute or comm operations specified by the given
  schedule
- handling work.wait() automatically by calling it just before the
  matching compute operation (for RECV ops) or at the end of step (for
  SEND ops)

Follow ups for later PRs
- Some refactoring should be done to replace PipelineScheduleMulti with
  this runtime
- Optimizer execution is not considered (e.g. for zero-bubble cases)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130488
Approved by: https://github.com/H-Huang
2024-08-19 17:44:24 +00:00
f31404ba6f Revert "Update xpu CD used driver to rolling version (#133454)"
This reverts commit 32ed4a3beb746c94c702c80c79c812e45ab3b2f4.

Reverted https://github.com/pytorch/pytorch/pull/133454 on behalf of https://github.com/ZainRizvi due to Sorry, there's [an outage](https://github.com/triton-lang/triton/issues/4527) that's preventing triton from being installed correctly, which has the side effect of breaking our docker builds. Reverting this PR since it requires a docker rebuild (which now fails) to give us more time to properly fix the docker builds. ([comment](https://github.com/pytorch/pytorch/pull/133454#issuecomment-2297073937))
2024-08-19 17:28:50 +00:00
6ca68357b3 [dynamo] Save class vt in UserDefinedObjectVariable (#133800)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133800
Approved by: https://github.com/jansel
ghstack dependencies: #133745, #133747, #133746, #133799
2024-08-19 17:21:48 +00:00
08f14d5492 [refactor][dynamo][side-effects] Helper function for __new__ for user defined class (#133799)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133799
Approved by: https://github.com/jansel
ghstack dependencies: #133745, #133747, #133746
2024-08-19 17:21:48 +00:00
d6f30b91e5 Add a smaller default config option for decode (#133646)
## Before
A100
| Type    |   Speedup | score_mod   | mask_mod   | dtype          | shape(B,Hq,M,Hkv,N,D)     |
|---------|-----------|-------------|------------|----------------|---------------------------|
| Average |     0.461 |             |            |                |                           |
| Max     |     0.996 | None        | causal     | torch.bfloat16 | (16, 16, 1, 16, 1024, 64) |
| Min     |     0.188 | None        | causal     | torch.bfloat16 | (2, 16, 1, 16, 512, 128)  |

H100
| Type    |   Speedup | score_mod   | mask_mod   | dtype          | shape(B,Hq,M,Hkv,N,D)     |
|---------|-----------|-------------|------------|----------------|---------------------------|
| Average |     4.528 |             |            |                |                           |
| Max     |    16.710 | None        | offset     | torch.bfloat16 | (2, 16, 1, 2, 4096, 64)   |
| Min     |     1.612 | None        | offset     | torch.bfloat16 | (16, 16, 1, 16, 512, 128) |

## After

A100:
| Type    |   Speedup | score_mod   | mask_mod   | dtype          | shape(B,Hq,M,Hkv,N,D)     |
|---------|-----------|-------------|------------|----------------|---------------------------|
| Average |     0.472 |             |            |                |                           |
| Max     |     1.110 | None        | causal     | torch.bfloat16 | (16, 16, 1, 16, 1024, 64) |
| Min     |     0.182 | None        | causal     | torch.bfloat16 | (2, 16, 1, 16, 4096, 128) |

H100:
| Type    |   Speedup | score_mod   | mask_mod   | dtype          | shape(B,Hq,M,Hkv,N,D)     |
|---------|-----------|-------------|------------|----------------|---------------------------|
| Average |     4.535 |             |            |                |                           |
| Max     |    16.691 | None        | offset     | torch.bfloat16 | (2, 16, 1, 2, 4096, 64)   |
| Min     |     1.607 | None        | offset     | torch.bfloat16 | (16, 16, 1, 16, 512, 128) |

### Failing example code

``` Python
import torch
import torch.nn as nn
import functools
from torch.nn.attention.flex_attention import flex_attention, create_block_mask

class AttentionModel(nn.Module):
    def __init__(self, initial_kv_len):
        super().__init__()
        self.kv_len = initial_kv_len
        self.q_len = 1

    def causal_mask_decode(self, b, h, q_idx, kv_idx):
        offset = self.kv_len - self.q_len
        return offset + q_idx >= kv_idx

    def forward(self, queries, keys, values, mask):
        self.kv_len = keys.shape[-2]
        bs, nh, seq_len, _ = queries.shape

        attention = functools.partial(flex_attention, block_mask=mask, enable_gqa=True)
        attention = torch.compile(attention)
        attn_output = attention(queries, keys, values)

        return attn_output

# Driver code
def main():
    # Set up parameters
    d_model = 256
    q_heads = 32
    kv_heads = 8
    kv_len = 128
    q_len = 1
    batch_size = 1

    # Initialize the model
    model = AttentionModel(kv_len)
    mask = create_block_mask(
        lambda a, b, c, d: model.causal_mask_decode(a, b, c, d), 1, 1, q_len, kv_len
    )

    # Create sample input tensors
    queries = torch.randn(batch_size, q_heads, q_len, d_model, device="cuda")
    keys = torch.randn(batch_size, kv_heads, kv_len, d_model, device="cuda")
    values = torch.randn(batch_size, kv_heads, kv_len, d_model, device="cuda")

    # Forward pass
    output = model(queries, keys, values, mask)

    print(f"Input shapes:")
    print(f"  Queries: {queries.shape}")
    print(f"  Keys: {keys.shape}")
    print(f"  Values: {values.shape}")
    print(f"Output shape: {output.shape}")

if __name__ == "__main__":
    main()

```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133646
Approved by: https://github.com/Chillee, https://github.com/joydddd
2024-08-19 17:13:26 +00:00
e37eef8a7b return state dict without optimized module (#132626)
Fixes #123625

We should consider changing the current behaviour and make it similar to 1fb498d6e3/torch/distributed/algorithms/_checkpoint/checkpoint_wrapper.py (L69-L101)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132626
Approved by: https://github.com/williamwen42
2024-08-19 16:58:41 +00:00
8d404581fc Revert "[ONNX] New export logic leveraging ExportedProgram and ONNX IR (#132530)"
This reverts commit 5fab35d77c7d1db7dbb9d5c516254a510b4f4f64.

Reverted https://github.com/pytorch/pytorch/pull/132530 on behalf of https://github.com/ZainRizvi due to Sorry but it seems like Dr. CI incorrectly flagged the [pull / linux-docs / build-docs-python-false](https://hud.pytorch.org/pr/pytorch/pytorch/132530#28918577682) failure as being flaky. The job started failing consistently on CI once your PR was merged. [GH job link](https://github.com/pytorch/pytorch/actions/runs/10454830880/job/28949386844) [HUD commit link](5fab35d77c) ([comment](https://github.com/pytorch/pytorch/pull/132530#issuecomment-2297001423))
2024-08-19 16:47:15 +00:00
68fcd54226 Lower cache mocking to test more pytorch code (#133579)
Summary: Previously we were mocking out FbRemoteFxGraphCacheBackend which meant that we were missing testing a whole bunch of the cache code. Cache at a lower level (CacheClient, LocalAutotuneCacheBackend, ManifoldClient, Redis) so we cover a larger amount of the caching code.

Test Plan: unit tests

Reviewed By: oulgen

Differential Revision: D60937966

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133579
Approved by: https://github.com/oulgen
2024-08-19 16:32:36 +00:00
32ed4a3beb Update xpu CD used driver to rolling version (#133454)
The main purpose of this PR is change the XPU CD use rolling driver to support more clients GPU AOT build and enable Kineto. And also plan to enable python 3.13 for xpu CD.

Works for https://github.com/pytorch/pytorch/issues/114850
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133454
Approved by: https://github.com/atalman
2024-08-19 16:01:47 +00:00
df6831562c [Flight Recorder] Add more basic analysis to the script (#133412)
This is the first step to make sure we have a basic function of analyzer for FR in production.

- We want to use this script to find out abnormalities in collectives and report it to users.
- We also fixed some type errors.

- [Ongoing] Also we will add more unit tests to this script and make it modularized so that we can better maintain it.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133412
Approved by: https://github.com/c-p-i-o, https://github.com/atalman
2024-08-19 15:55:00 +00:00
76b0284744 Revert "[inductor][cpp] complete vectorization for int32/int64 (#122961)"
This reverts commit 99b3b58f39507bb8ad5b4bb1b9bedf7f47b64fa3.

Reverted https://github.com/pytorch/pytorch/pull/122961 on behalf of https://github.com/atalman due to Breaks slow jobs: inductor/test_cpu_repro.py::CPUReproTests::test__adaptive_avg_pool2d [GH job link](https://github.com/pytorch/pytorch/actions/runs/10432403692/job/28893704833) [HUD commit link](a0ef8888e6) ([comment](https://github.com/pytorch/pytorch/pull/122961#issuecomment-2296852418))
2024-08-19 15:29:15 +00:00
318d3b39c4 Revert "[Inductor][CPP] Support vectorization of load_seed and randn (#130317)"
This reverts commit a0ef8888e60d934ae7e4ddaec1c1274b12d0d39d.

Reverted https://github.com/pytorch/pytorch/pull/130317 on behalf of https://github.com/atalman due to Breaks slow jobs: inductor/test_cpu_repro.py::CPUReproTests::test__adaptive_avg_pool2d [GH job link](https://github.com/pytorch/pytorch/actions/runs/10432403692/job/28893704833) [HUD commit link](a0ef8888e6) ([comment](https://github.com/pytorch/pytorch/pull/130317#issuecomment-2296819045))
2024-08-19 15:13:39 +00:00
5153550e4b [CI] Add FP32 dynamic, AMP static, AMP dynamic for AOT inductor accuracy CPU CI test (#132836)
This PR added 3 more accuracy test for AOT inductor CPU side.
1. FP32 dynamic shape accuracy test, torchbench suite
2. AMP static shape accuracy test, torchbench suite
3. AMP dynamic shape accuracy test, torchbench suite

**Test Time cost:**
| Precision 	| Shape Type 	| Suite      	| Time cost 	|
|-----------	|------------	|------------	|-----------	|
| FP32      	|    dynamic 	| Torchbench 	|  1h40m         	|
| AMP       	|     Static 	| Torchbench 	|  1h38m        	|
| AMP       	|    dynamic 	| Torchbench 	|  1h48m        	|

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132836
Approved by: https://github.com/desertfire
2024-08-19 14:26:48 +00:00
5fab35d77c [ONNX] New export logic leveraging ExportedProgram and ONNX IR (#132530)
1/n PR to

- Move code from torch-onnx from commit 395495e566 into torch.onnx and fixes imports.
- Integrate the new export logic with the torch.onnx.export API and include basic set of tests.
- Refactor the API for the change.
- Improve documentation.

Next PRs will be more tests and docs.

Fix https://github.com/pytorch/pytorch/issues/129277
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132530
Approved by: https://github.com/titaiwangms, https://github.com/malfet
2024-08-19 14:01:07 +00:00
92151c814b [ROCm] Set _HAS_PYNVML to false if amdsmi not installed (#132990)
This is a bugfix that was recently encountered in ROCm/Deepspeed. Currently if a library installs pynvml and runs on ROCm pytorch will break as _HAS_PYNVML is set to true and it will attempt to use amdsmi library for the device_count call which will not be installed.

This fix will set _HAS_PYNVML to false on ROCm if amdsmi is not installed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132990
Approved by: https://github.com/pruthvistony, https://github.com/eqy, https://github.com/malfet
2024-08-19 09:45:58 +00:00
0a976b8899 Enable bf16 float32 mkldnn matmul when float32 precision is 'medium' (#130919)
This fixes an issue on AArch64 cpus supporting BF16, caused when torch.set_float32_matmul_precision("highest") does not disable the bf16 downconversion in mkldnn_matmul.

This was discovered from a unit test failure where the decorator `torch.testing._internal.common_mkldnn.bf32_on_and_off`, which internally switches the float32_matmul_precision between "medium" and "highest" was not having the desired effect.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130919
Approved by: https://github.com/jgong5
2024-08-19 09:18:12 +00:00
8b6b1721c8 remove StrobelightCompileTimeProfiler.profile_compile_time from stacktrace when strobelight profiling not enabled (#133831)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133831
Approved by: https://github.com/oulgen
2024-08-19 09:14:52 +00:00
4bae7ae3d9 [DeviceMesh][Easy] Fix typo (#133790)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133790
Approved by: https://github.com/Skylion007
2024-08-19 05:20:22 +00:00
35f36363ec Revert "[dtensor] move DTensor to public namespace (#133113)"
This reverts commit 2ee6b97464d17fcf4c1fc67c29868fa30d0c16e1.

Reverted https://github.com/pytorch/pytorch/pull/133113 on behalf of https://github.com/wanchaol due to looks like it break some internal type imports ([comment](https://github.com/pytorch/pytorch/pull/133113#issuecomment-2295670911))
2024-08-19 05:00:19 +00:00
42e61c783c [Inductor][CPP] Align Half load with BFloat16 load (#132011)
Remove `static_cast<float>` for Half load to align with BFloat16.
Before:
```
extern "C"  void kernel(const half* in_ptr0,
                       half* out_ptr0)
{
    {
        #pragma GCC ivdep
        for(long x0=static_cast<long>(0L); x0<static_cast<long>(20L); x0+=static_cast<long>(1L))
        {
            auto tmp0 = static_cast<float>(in_ptr0[static_cast<long>(x0)]);
            out_ptr0[static_cast<long>(x0)] = tmp0;
        }
    }
}
```

After:
```
extern "C"  void kernel(const half* in_ptr0,
                       half* out_ptr0)
{
    {
        #pragma GCC ivdep
        for(long x0=static_cast<long>(0L); x0<static_cast<long>(20L); x0+=static_cast<long>(1L))
        {
            auto tmp0 = in_ptr0[static_cast<long>(x0)];
            out_ptr0[static_cast<long>(x0)] = tmp0;
        }
    }
}

```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132011
Approved by: https://github.com/jgong5, https://github.com/peterbell10
2024-08-19 04:52:39 +00:00
ae00063570 Change default runner's AMI to Amazon 2023 AMI - Part 1 (#133641)
Upgrades the LF scale configs to change the default AMI in accordance with the Amazon 2023 rollout plan.

This PR will be merged on Monday Aug 19 in the morning, and over the next 2-3 days as new linux runners are spun up (and old ones spun down) they'll start using this new AMI

This PR will be paired with https://github.com/pytorch/test-infra/pull/5558, which will be merged after this one
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133641
Approved by: https://github.com/jeanschmidt
2024-08-19 01:32:25 +00:00
e72e924eb5 Add correct typing annotations to rsample() for all distributions (#133516)
Fixes #133514
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133516
Approved by: https://github.com/Skylion007
2024-08-18 20:31:54 +00:00
eqy
c0c82a5f6a [CUDA][SDPA] Bump tolerances for test_mem_efficient_attention_attn_mask_vs (#133738)
Same thing as #133051 but for efficient attention

CC @drisspg @nWEIdia

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133738
Approved by: https://github.com/drisspg, https://github.com/nWEIdia, https://github.com/Skylion007
2024-08-18 19:14:29 +00:00
cf60fe53a8 [BE]: Update Typeguard to TypeIs for better type inference (#133814)
Uses TypeIs instead of TypeGuard for better inference. See https://peps.python.org/pep-0742/

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133814
Approved by: https://github.com/ezyang
2024-08-18 19:10:16 +00:00
cyy
0d4cedaa47 [13/N] Fix clang-tidy warnings in aten/src/ATen (#133807)
Follows #133425

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133807
Approved by: https://github.com/ezyang, https://github.com/Skylion007
2024-08-18 17:54:12 +00:00
cyy
47ed5f57b0 [12/N] Fix clang-tidy warnings in aten/src/ATen (#133425)
Follows  #133758

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133425
Approved by: https://github.com/ezyang
2024-08-18 11:03:55 +00:00
fbd020fce6 Add new prop to _XpuDevicePropertie for triton gemm optimization (#131738)
# Motivation
This PR aims to add new properties to `_XpuDevicePropertie` for triton gemm optimization.

# Additional Context
`ext_oneapi_supports_cl_extension` is not a ABI-neutral API. It depends on compiler 2025.0. For more details, see https://github.com/intel/llvm/pull/13212

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131738
Approved by: https://github.com/gujinghui
2024-08-18 08:32:30 +00:00
fed6096e73 [dynamo] Support object.__new__ call (#133746)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133746
Approved by: https://github.com/Skylion007, https://github.com/jansel
ghstack dependencies: #133745, #133747
2024-08-18 07:18:52 +00:00
d56a395971 [dynamo] Support os.fspath (#133747)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133747
Approved by: https://github.com/yanboliang, https://github.com/Skylion007, https://github.com/jansel
ghstack dependencies: #133745
2024-08-18 07:18:52 +00:00
27dfd63ee8 remove unnecessary slicing in EffectTokensWrapper (#133737)
In the cases that `outs ` is a tensor, `[0:]` will cause a nadditional slicing ops that's unnecessary and failed some of XLA's unit test.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133737
Approved by: https://github.com/IvanKobzarev
2024-08-18 05:52:48 +00:00
d717df2071 [compiled autograd] fix flaky tests due to torch.cuda.memory_allocated() != 0 (#133733)
FIXES https://github.com/pytorch/pytorch/issues/123949 https://github.com/pytorch/pytorch/issues/124376
torch.cuda.memory_allocated returns the amount of memory allocated in the current process, so if it isn't 0 it means another test didn't properly clean up after itself. I'm keeping the memory check and isolating these tests in subprocess as we don't have a good way to test for activation refcount

e.g. https://github.com/pytorch/pytorch/runs/28838386083
```
_______________ TestCompiledAutograd.test_free_activation_memory _______________
Traceback (most recent call last):
  File "/var/lib/jenkins/workspace/test/inductor/test_compiled_autograd.py", line 1892, in test_free_activation_memory
    self.assertTrue(torch.cuda.memory_allocated() == 0)
  File "/opt/conda/envs/py_3.10/lib/python3.10/unittest/case.py", line 687, in assertTrue
    raise self.failureException(msg)
AssertionError: False is not true
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133733
Approved by: https://github.com/jansel
2024-08-18 05:43:35 +00:00
cyy
fb9d2dc641 Remove Wno-invalid-partial-specialization from CMake (#133398)
The code base is clean enough that Winvalid-partial-specialization can be enabled.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133398
Approved by: https://github.com/ezyang
2024-08-18 04:06:21 +00:00
cyy
f8cf1829b5 [Reland] [11/N] Fix clang-tidy warnings in aten/src/ATen (#133758)
Reland of #133298. Remove possible changes that may increase the build time.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133758
Approved by: https://github.com/Skylion007, https://github.com/ezyang
2024-08-17 23:09:44 +00:00
0bde3c4f2f Run cudagraphs on AOTAutograd cache hit (#132294)
This threads through all of the necessary parts into aot autograd from the FXGraphCache changes so that we can run cudagraphs properly on a AOTAutograd cache hit.

Specifics:
- AOTAutograd needs access to the `cudagraphs` boxedbool in order to properly set the backward to not use cudagraphs on a cache hit from the forward.
- We have lots of tests that test this already from the previous PR, so I just added an extra test and made the previous test work with both AOTAutogradCache and FXGraphCache at the same time.

```
TORCH_LOGS=torch._functorch._aot_autograd.autograd_cache,cudagraphs ENABLE_AOT_AUTOGRAD_CACHE=1 TORCHINDUCTOR_FX_GRAPH_CACHE=1 tlp python benchmarks/gpt_fast/benchmark.py --output ~/gpt_fast_benchmark.csv
```
Twice, once on cache miss and once and cache hit.

Here is the perfetto trace for each(FB only link):

**Cache Miss:**
Logs:
```
Loading model Llama-2-7b-chat-hf
Time to load model: 0.66 seconds
I0813 10:53:34.416000 911030 torch/_functorch/_aot_autograd/autograd_cache.py:479] [0/0] AOTAutograd cache miss for key alqchc7zw6ynsxj2bzktcsngu4cajwcb3tmhvwlyqkuinx3zhmey
I0813 10:53:51.395000 911030 torch/_functorch/_aot_autograd/autograd_cache.py:558] [0/0] Writing AOTAutograd cache entry to /tmp/torchinductor_jjwu/aotautograd/alqchc7zw6ynsxj2bzktcsngu4cajwcb3tmhvwlyqkuinx3zhmey/entry
I0813 10:54:17.579000 911030 torch/_functorch/_aot_autograd/autograd_cache.py:479] [1/0] AOTAutograd cache miss for key a3nq2ywjxku342c6ag7rsqkalnxfshlcgve3tb2bigg7a45uz6pt
I0813 10:54:38.636000 911030 torch/_functorch/_aot_autograd/autograd_cache.py:558] [1/0] Writing AOTAutograd cache entry to /tmp/torchinductor_jjwu/aotautograd/a3nq2ywjxku342c6ag7rsqkalnxfshlcgve3tb2bigg7a45uz6pt/entry
I0813 10:54:39.228000 911030 torch/_inductor/cudagraph_trees.py:385] [__cudagraphs] recording cudagraph tree for graph without symints
V0813 10:54:39.939000 911030 torch/_inductor/cudagraph_trees.py:2160] [__cudagraphs] Running warmup of function 0
V0813 10:55:10.615000 911030 torch/_inductor/cudagraph_trees.py:2119] [__cudagraphs] Recording function 0 of graph recording id 0
Compilation time: 101.24 seconds
Average tokens/sec: 147.96 tokens/sec
Average bandwidth achieved: 1955.22 GB/s
Memory used: 14.51 GB
```

Chromium Event(fb only):
https://interncache-all.fbcdn.net/manifold/perfetto-artifacts/tree/ui/index.html?url=https%3A%2F%2Finterncache-all.fbcdn.net%2Fmanifold%2Ftlparse_reports%2Ftree%2Flogs%2Fjjwu%2Fcustom%2Fchromium_events.json#!/viewer?url=https%3A%2F%2Finterncache-all.fbcdn.net%2Fmanifold%2Ftlparse_reports%2Ftree%2Flogs%2Fjjwu%2Fcustom%2Fchromium_events.json&local_cache_key

![image](https://github.com/user-attachments/assets/47fdd77e-3cc1-437e-8e68-7901646269bb)

**Cache Hit:**
Logs:
```
Loading model Llama-2-7b-chat-hf
Time to load model: 0.67 seconds
I0813 10:55:51.821000 944420 torch/_functorch/_aot_autograd/autograd_cache.py:474] [0/0] AOTAutograd cache hit for key alqchc7zw6ynsxj2bzktcsngu4cajwcb3tmhvwlyqkuinx3zhmey
I0813 10:55:55.465000 944420 torch/_functorch/_aot_autograd/autograd_cache.py:474] [1/0] AOTAutograd cache hit for key a3nq2ywjxku342c6ag7rsqkalnxfshlcgve3tb2bigg7a45uz6pt
I0813 10:55:56.030000 944420 torch/_inductor/cudagraph_trees.py:385] [__cudagraphs] recording cudagraph tree for graph without symints
V0813 10:55:56.192000 944420 torch/_inductor/cudagraph_trees.py:2160] [__cudagraphs] Running warmup of function 0
V0813 10:55:56.426000 944420 torch/_inductor/cudagraph_trees.py:2119] [__cudagraphs] Recording function 0 of graph recording id 0
Compilation time: 9.40 seconds
Average tokens/sec: 147.94 tokens/sec
Average bandwidth achieved: 1954.98 GB/s
Memory used: 14.51 GB
```
Chromium Event(fb only):
https://interncache-all.fbcdn.net/manifold/perfetto-artifacts/tree/ui/index.html?url=https%3A%2F%2Finterncache-all.fbcdn.net%2Fmanifold%2Ftlparse_reports%2Ftree%2Flogs%2Fjjwu%2Fcustom2%2Fchromium_events.json#!/viewer?url=https%3A%2F%2Finterncache-all.fbcdn.net%2Fmanifold%2Ftlparse_reports%2Ftree%2Flogs%2Fjjwu%2Fcustom2%2Fchromium_events.json&local_cache_key

![image](https://github.com/user-attachments/assets/9bdd14ec-d12a-4c89-8705-135c999ac746)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132294
Approved by: https://github.com/eellison
2024-08-17 21:24:54 +00:00
d6368985af [BE]: Fix setuptools not installed with Python 3.12 (#133561)
setuptools is not installed correctly for Python 3.12.
See https://github.com/python-poetry/poetry/issues/9630#issuecomment-2291114885

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133561
Approved by: https://github.com/Skylion007

Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com>
2024-08-17 17:42:04 +00:00
b4a1673a67 profiler/unwind: include <dlfcn.h> for dladdr (#133582)
This fixes a compilation error on linux systems using the musl c library.

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133582
Approved by: https://github.com/Skylion007, https://github.com/aaronenyeshi
2024-08-17 16:15:18 +00:00
215b14530a Add Half for sparse.mm reduce (#133672)
This PR is to add Half support for sparse.mm reduce in CPU backend.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133672
Approved by: https://github.com/Skylion007
2024-08-17 15:20:39 +00:00
1c6fbae579 [Easy][dynamo] fix builtin function names for itertools (#133711)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133711
Approved by: https://github.com/Skylion007
2024-08-17 15:12:01 +00:00
a0ef8888e6 [Inductor][CPP] Support vectorization of load_seed and randn (#130317)
**Summary**
Enable the vectorization of `load_seed` and `randn`. For now, `randn` is using the reference implementation.

**Test Plan**
```
python -u -m pytest -s -v test/inductor/test_cpu_repro.py -k test_vec_randn
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130317
Approved by: https://github.com/jgong5
ghstack dependencies: #122961
2024-08-17 07:15:57 +00:00
99b3b58f39 [inductor][cpp] complete vectorization for int32/int64 (#122961)
**Summary**
Implement the complete vectorization of `index_expr` functionally. We also add heuristic from performance perspective to resolve the regressions posted below: https://github.com/pytorch/pytorch/pull/122961#issuecomment-2041336265 by disabling vectorization of specific (Fused) scheduler Node:

- Heuristic 1: when the num of non-contiguous `index_expr/load/store` exceeds the threshold, we disable the vectorization.
- Heuristic 2: when the total number of elements along the vec dim is less than `tiling_factor/2`, we disable the vectorization.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122961
Approved by: https://github.com/jansel

Co-authored-by: leslie-fang-intel <leslie.fang@intel.com>
2024-08-17 07:07:49 +00:00
d5f6d68d68 [PT2] Resolve PT2 compatility issue in slice and diff (#133740)
Summary:
# context
* when running an IG FM training with PT2 we found there are a few graph break due to torch.diff call in [jagged_tensor.py](https://fburl.com/code/cwssxabc)
```
_length: List[int] = (
    _length_per_key_from_stride_per_key(torch.diff(offsets), stride_per_key)
    if variable_stride_per_key
    else torch.sum(torch.diff(offsets).view(-1, stride), dim=1).tolist()
)
```
* look into the failure, we found the TORCH_CHECK in diff should be TORCH_SYM_CHECK
* slice_forward error: df3d7729e, [tlparse](https://interncache-all.fbcdn.net/manifold/tlparse_reports/tree/logs/.tmpxXZ2em/index.html)
```
RestartAnalysis
Tried to use data-dependent value in the subsequent computation. This can happen when we encounter unbounded dynamic value that is unknown during tracing time.  You will need to explicitly give hint to the compiler. Please take a look at torch._check OR torch._check_is_size APIs.  Could not guard on data-dependent expression ((5*u37 + u38)//(u37 + u38)) < 0 (unhinted: ((5*u37 + u38)//(u37 + u38)) < 0).  (Size-like symbols: u38, u37)

ATTENTION: guard_size_oblivious would fix the error, evaluating expression to False.
Maybe you need to add guard_size_oblivious to framework code, see doc below for more guidance.

Potential framework code culprit (scroll up for full backtrace):
  File "/data/users/hhy/fbsource/buck-out/v2/gen/fbcode/e99934938a0abe90/aps_models/ads/icvr/__icvr_launcher_live__/icvr_launcher_live#link-tree/torch/_decomp/decompositions.py", line 771, in slice_forward
    if end_val < 0:
```
* after this diff: [tlparse](https://interncache-all.fbcdn.net/manifold/tlparse_reports/tree/logs/.tmpAhv2Sh/failures_and_restarts.html)

Test Plan:
# command
* run model
```
TORCH_SHOW_CPP_STACKTRACES=1 TORCHDYNAMO_EXTENDED_DEBUG_CPP=1 TORCH_LOGS="+graph_code,output_code,dynamic,aot,guards,verbose_guards,recompiles,graph_breaks" TORCH_TRACE=/var/tmp/tt buck2 run fbcode//mode/opt fbcode//aps_models/ads/icvr:icvr_launcher_live -- mode=fmc/local_ig_fm_v4_mini training.pipeline_type=pt2
```
* generate tlparse
```
tlparse `ls -t /var/tmp/tt/* | head -1`
```

Reviewed By: ezyang

Differential Revision: D56339251

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133740
Approved by: https://github.com/ezyang
2024-08-17 06:07:21 +00:00
cd89bf77c8 [inductor][cpp][gemm] easy: adjust indentation of template, var renaming etc. (#133312)
Indent the template instructions separately from the generated code, for readability. Also, renaming M0,N0,K0 to Mr,Nr,Kr ("r" meaning "register") to consistent naming.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133312
Approved by: https://github.com/Skylion007, https://github.com/leslie-fang-intel
ghstack dependencies: #132729, #132730
2024-08-17 05:49:14 +00:00
4dc9795ebf [refactor][easy] Directly call var_getattr method for PythonModuleVariable (#133745)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133745
Approved by: https://github.com/yanboliang
2024-08-17 05:30:01 +00:00
2ee6b97464 [dtensor] move DTensor to public namespace (#133113)
Moving DTensor to be in the public namespace, to formally add the
documentation page that includes all the public APIs. This includes:

* many path renames and path import fixes
* a dedicated doc page without too much content yet (adding in the next
  PRs)
* To preserve the BC for users still using the `torch.distributed._tensor`,
  I added a shim script to redirect old path calls to the new module

The BC preserving is evidented by the fact that all DTensor tests are still
working without changing the public imports. So it's safe to land the
changes

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133113
Approved by: https://github.com/XilunWu
ghstack dependencies: #133305, #133306
2024-08-17 05:09:52 +00:00
1a4709cef5 [dtensor] add more documentations (#133306)
This PR adds more documentations to the DTensor APIs, to prepare for the
module be public

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133306
Approved by: https://github.com/XilunWu, https://github.com/tianyu-l, https://github.com/wz337
ghstack dependencies: #133305
2024-08-17 05:09:52 +00:00
addee9f4d1 [dtensor] add missing __all__ to public modules (#133305)
as titled, some submodules are missing __all__ for API exposures, this
PR adds necessary __all__ to those modules, and private some non public
APIs explicitly together in this PR

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133305
Approved by: https://github.com/XilunWu, https://github.com/tianyu-l, https://github.com/wz337
2024-08-17 05:09:48 +00:00
702c810780 move param's device check to _init_group for fused (#131153)
There could be some cases where the params have the meta device when calling optimizer's dunder init and those params are materialized in the first computation. This change would allow such situation.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131153
Approved by: https://github.com/mlazos, https://github.com/janeyx99

Co-authored-by: Jane (Yuan) Xu <31798555+janeyx99@users.noreply.github.com>
2024-08-17 04:49:47 +00:00
12b8e29203 Add a fudge factor to ephemeral NCCL timeout increase (#133722)
Differential Revision: [D61422431](https://our.internmc.facebook.com/intern/diff/D61422431)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133722
Approved by: https://github.com/c00w, https://github.com/aorenste
ghstack dependencies: #133504
2024-08-17 03:08:40 +00:00
695d7db2d6 remove dead code for suggesting legacy dynamic shapes fixes (#133700)
Summary: `dynamic_dim` based dynamic shapes are long gone, so pretty-printing suggested fixes for them is dead code.

Test Plan: existing tests

Differential Revision: D61398303

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133700
Approved by: https://github.com/zhxchen17
2024-08-17 01:59:34 +00:00
455f6bda56 Add cache timings info to tlparse (#133504)
https://interncache-all.fbcdn.net/manifold/tlparse_reports/tree/logs/.tmpLR1T85/rank_1/0_0_0/fx_graph_cache_hash_11.json

Differential Revision: [D61422432](https://our.internmc.facebook.com/intern/diff/D61422432)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133504
Approved by: https://github.com/jamesjwu
2024-08-17 01:37:53 +00:00
dcfa415e6e [Inductor UT] Reuse inductor UT for intel GPU test/inductor/test_compiled_optimizers.py (#133083)
[Inductor UT] Reuse Inductor test case for Intel GPU.
Reuse `test/inductor/test_compiled_optimizers.py`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133083
Approved by: https://github.com/etaf, https://github.com/jansel, https://github.com/mlazos
2024-08-17 01:15:26 +00:00
983bea399d [compiled autograd] move non-hot path logs into default logger (#133541)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133541
Approved by: https://github.com/yf225, https://github.com/bdhirsh
ghstack dependencies: #133115, #133148
2024-08-17 00:46:52 +00:00
0a6cc15079 [compiled autograd] use same graph node names as AOTDispatcher (#133148)
FIXES https://github.com/pytorch/pytorch/issues/132939

Compiled autograd's trace of the AOT backward may result in some additional ops e.g. clone to make contiguous, trace_wrapped HOPs, so the graphs may be slightly offset from each other

hf_Whisper example: https://interncache-all.fbcdn.net/manifold/tlparse_reports/tree/logs/.tmpNv89Pu/index.html
fsdp2 example: https://interncache-all.fbcdn.net/manifold/tlparse_reports/tree/logs/.tmpPdKssS/rank_0/index.html
Unit test example: https://interncache-all.fbcdn.net/manifold/tlparse_reports/tree/logs/.tmpvoQsnl/index.html
```python
 ===== Compiled autograd graph =====
 <eval_with_key>.14 class CompiledAutograd(torch.nn.Module):
    def forward(self, inputs, sizes, scalars, hooks):
        # No stacktrace found for following nodes
        getitem: "f32[]cpu" = inputs[0]
        aot1_primals_1: "f32[4]cpu" = inputs[1]
        aot1_primals_2: "f32[4]cpu" = inputs[2]
        aot0_sin: "f32[4]cpu" = inputs[3]
        aot0_cos: "f32[4]cpu" = inputs[4]
        getitem_5: "f32[4]cpu" = inputs[5];  inputs = None

         # File: /data/users/xmfan/a/pytorch/torch/_dynamo/compiled_autograd.py:444 in set_node_origin, code: SumBackward0 (NodeCall 1)
        expand: "f32[4]cpu" = torch.ops.aten.expand.default(getitem, [4]);  getitem = None

         # File: /data/users/xmfan/a/pytorch/torch/_dynamo/compiled_autograd.py:444 in set_node_origin, code: CompiledFunctionBackward1 (NodeCall 2)
        aot1_tangents_1: "f32[4]cpu" = torch.ops.aten.clone.default(expand, memory_format = torch.contiguous_format);  expand = None
        aot1_sin_1: "f32[4]cpu" = torch.ops.aten.sin.default(aot1_primals_2);  aot1_primals_2 = None
        aot1_neg: "f32[4]cpu" = torch.ops.aten.neg.default(aot1_sin_1);  aot1_sin_1 = None
        aot0_tangents_2: "f32[4]cpu" = torch.ops.aten.mul.Tensor(aot1_tangents_1, aot1_neg);  aot1_neg = None
        aot1_cos_1: "f32[4]cpu" = torch.ops.aten.cos.default(aot1_primals_1);  aot1_primals_1 = None
        aot0_tangents_1: "f32[4]cpu" = torch.ops.aten.mul.Tensor(aot1_tangents_1, aot1_cos_1);  aot1_tangents_1 = aot1_cos_1 = None

         # File: /data/users/xmfan/a/pytorch/torch/_dynamo/compiled_autograd.py:444 in set_node_origin, code: CompiledFunctionBackward0 (NodeCall 3)
        aot0_neg: "f32[4]cpu" = torch.ops.aten.neg.default(aot0_sin);  aot0_sin = None
        aot0_mul: "f32[4]cpu" = torch.ops.aten.mul.Tensor(aot0_tangents_2, aot0_neg);  aot0_tangents_2 = aot0_neg = None
        aot0_mul_1: "f32[4]cpu" = torch.ops.aten.mul.Tensor(aot0_tangents_1, aot0_cos);  aot0_tangents_1 = aot0_cos = None
        aot0_add: "f32[4]cpu" = torch.ops.aten.add.Tensor(aot0_mul, aot0_mul_1);  aot0_mul = aot0_mul_1 = None

         # File: /data/users/xmfan/a/pytorch/torch/_dynamo/compiled_autograd.py:444 in set_node_origin, code: torch::autograd::AccumulateGrad (NodeCall 4)
        accumulate_grad_ = torch.ops.inductor.accumulate_grad_.default(getitem_5, aot0_add);  getitem_5 = aot0_add = accumulate_grad_ = None
        _exec_final_callbacks_stub = torch__dynamo_external_utils__exec_final_callbacks_stub();  _exec_final_callbacks_stub = None
        return []
```

where aot1 is
```python
class GraphModule(torch.nn.Module):
    def forward(self, primals_1: "f32[4][1]cpu", primals_2: "f32[4][1]cpu", tangents_1: "f32[4][1]cpu"):
         # File: /data/users/xmfan/a/pytorch/test/inductor/test_compiled_autograd.py:2233 in torch_dynamo_resume_in_f_at_2232, code: return tmp1.sin() + tmp2.cos()
        sin_1: "f32[4][1]cpu" = torch.ops.aten.sin.default(primals_2);  primals_2 = None
        neg: "f32[4][1]cpu" = torch.ops.aten.neg.default(sin_1);  sin_1 = None
        mul: "f32[4][1]cpu" = torch.ops.aten.mul.Tensor(tangents_1, neg);  neg = None
        cos_1: "f32[4][1]cpu" = torch.ops.aten.cos.default(primals_1);  primals_1 = None
        mul_1: "f32[4][1]cpu" = torch.ops.aten.mul.Tensor(tangents_1, cos_1);  tangents_1 = cos_1 = None
        return (mul_1, mul)
```

and aot0 is
```python
class GraphModule(torch.nn.Module):
    def forward(self, sin: "f32[4][1]cpu", cos: "f32[4][1]cpu", tangents_1: "f32[4][1]cpu", tangents_2: "f32[4][1]cpu"):
         # File: /data/users/xmfan/a/pytorch/test/inductor/test_compiled_autograd.py:2231 in f, code: tmp2 = x.cos()
        neg: "f32[4][1]cpu" = torch.ops.aten.neg.default(sin);  sin = None
        mul: "f32[4][1]cpu" = torch.ops.aten.mul.Tensor(tangents_2, neg);  tangents_2 = neg = None

         # File: /data/users/xmfan/a/pytorch/test/inductor/test_compiled_autograd.py:2230 in f, code: tmp1 = x.sin()
        mul_1: "f32[4][1]cpu" = torch.ops.aten.mul.Tensor(tangents_1, cos);  tangents_1 = cos = None

         # File: /data/users/xmfan/a/pytorch/test/inductor/test_compiled_autograd.py:2230 in f, code: tmp1 = x.sin()
        add: "f32[4][1]cpu" = torch.ops.aten.add.Tensor(mul, mul_1);  mul = mul_1 = None
        return (add,)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133148
Approved by: https://github.com/jansel
ghstack dependencies: #133115
2024-08-17 00:46:52 +00:00
4b3ed8bc52 [compiled autograd] log aot id for CompiledFunctionBackward (#133115)
Partially addresses https://github.com/pytorch/pytorch/issues/132939. Adds the AOT ID after the CompiledFunctionBackward annotation in verbose compiled autograd logging

default (no change):
https://interncache-all.fbcdn.net/manifold/tlparse_reports/tree/logs/.tmp8WCSLf/dedicated_log_torch_trace_xw3ktsi_.log/index.html

TORCH_LOGS="compiled_autograd_verbose":
https://interncache-all.fbcdn.net/manifold/tlparse_reports/tree/logs/.tmp8WCSLf/dedicated_log_torch_trace_gsc9q_43.log/index.html

```python
# File: /data/users/xmfan/a/pytorch/torch/_dynamo/compiled_autograd.py:361 in set_node_origin, code: CompiledFunctionBackward1 (NodeCall 2)
clone: "f32[4]" = torch.ops.aten.clone.default(expand, memory_format = torch.contiguous_format);  expand = None
cos: "f32[4]" = torch.ops.aten.cos.default(getitem_1);  getitem_1 = None
mul: "f32[4]" = torch.ops.aten.mul.Tensor(clone, cos);  clone = cos = None

# File: /data/users/xmfan/a/pytorch/torch/_dynamo/compiled_autograd.py:361 in set_node_origin, code: CompiledFunctionBackward0 (NodeCall 3)
cos_1: "f32[4]" = torch.ops.aten.cos.default(getitem_2)
mul_1: "f32[4]" = torch.ops.aten.mul.Tensor(mul, cos_1);  mul = cos_1 = None
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133115
Approved by: https://github.com/jansel
2024-08-17 00:46:52 +00:00
b0803129e8 Added meta registration for _fused_adamw_ (#133728)
See https://github.com/pytorch/pytorch/issues/123461#issuecomment-2294335273

<img width="1463" alt="Screenshot 2024-08-16 at 5 38 25 PM" src="https://github.com/user-attachments/assets/fe940c0e-775f-4047-bf69-34a3677d539b">
same signature so should be ok to just add the op to the decorator
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133728
Approved by: https://github.com/janeyx99, https://github.com/fegin
2024-08-17 00:28:31 +00:00
ec28121017 [inductor] Fix test_cudagraph_trees_expandable_segments.py for internal (#133698)
Summary:
These tests aren't running internally because the outer test harness is crashing without listing the tests. To fix we need:
* Add a target for the tools/stats/ folder since this test imports it
* Add a dependence to that target so it's included in the par
* Fix up the relative import syntax, which is somehow different internally vs. fbcode (not sure why this works, but many other tests are doing it)

Test Plan: `buck2 test 'fbcode//mode/opt' fbcode//caffe2/test/inductor:cudagraph_trees_expandable_segments -- --run-disabled`

Differential Revision: D61396711

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133698
Approved by: https://github.com/xuzhao9
2024-08-17 00:09:32 +00:00
648fc6c9c1 [Inductor][CPP] Refactor the tiling select into a standalone module to enhance its extensibility (#130892)
**Summary**
After enabling more vectorization, we found that vectorization does not always bring performance benefits. For example, a kernel with several non-contiguous index computations or non-contiguous buffer load/store operations can experience performance regression. A typical case is what we observed in the next PR: after fully enabling vectorization of `index_expr`, we saw a performance regression of `hf_BigBird`.

In this PR, we refactor the tiling select into a standalone module to enhance its extensibility for further advanced tiling select heuristic. A standalone class `TilingSelect` with its method `select_tiling` has been added. `select_tiling` accepts the inputs of `fn_list`, `var_sizes_list` and return `tiling_factors`, `tiling_indices`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130892
Approved by: https://github.com/jgong5
2024-08-16 23:55:38 +00:00
d04cd7f3ba Improvements for associative_scan - Reverse feature (#133011)
This is part of a series of PRs to improve the functionality of the `associatve_scan` functionality. This specific PR introduces a `reverse` flag to the `associative_scan` to establish a similar interface as for `jax.associative_scan`. This PR has been derived from https://github.com/pytorch/pytorch/pull/129307.

@ydwu4 @Chillee @zou3519

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133011
Approved by: https://github.com/ydwu4
2024-08-16 23:06:31 +00:00
19ff9059eb Revert "[Inductor][CPP] Support vectorization of remainder (#129849)"
This reverts commit 8624a571b4eecd11547867591d70992843265e97.

Reverted https://github.com/pytorch/pytorch/pull/129849 on behalf of https://github.com/izaitsevfb due to ptedge_executorch_benchmark build failed again with LLVM crash ([comment](https://github.com/pytorch/pytorch/pull/129849#issuecomment-2294408526))
2024-08-16 22:41:05 +00:00
98d6a6eb7d [inductor] clean up TODO comments. (#133718)
clean up TODO comments.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133718
Approved by: https://github.com/henrylhtsang
2024-08-16 22:12:01 +00:00
271ee90851 [easy] Fix type annotation for ExportedProgram.run_decompositions (#133720)
Fix the tuple type annotation.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133720
Approved by: https://github.com/Skylion007
2024-08-16 22:11:42 +00:00
99e789b52b [Fix 1/n] GPU Test skips - fbcode/ caffe2/test/quantization (#133158)
Summary:
This diff aims to fix the GPU Test skips in the quantization tests under the `caffe2/test/quantization` directory. The changes made in the `TARGETS` files include adding the `should_use_remote_gpu` flag to enable remote GPU testing. This should help to resolve the skipped tests and improve the overall test coverage.

[This diff] Fixed skip count: 4
[Running total] Fixed skip count: 4

Note: Creating separate diffs for each test-group.

Test Plan:
**281475054644766**: buck2 test 'fbcode//mode/opt' fbcode//caffe2/test/quantization:test_quantization -- --exact 'caffe2/test/quantization:test_quantization - test_compare_per_channel_device_numerics (caffe2.test.quantization.core.test_quantized_tensor.TestQuantizedTensor)'
https://www.internalfb.com/intern/testinfra/testrun/5629499773981783

**281475054644780**: buck2 test 'fbcode//mode/opt' fbcode//caffe2/test/quantization:test_quantization -- --exact 'caffe2/test/quantization:test_quantization - test_compare_per_tensor_device_numerics (caffe2.test.quantization.core.test_quantized_tensor.TestQuantizedTensor)'
https://www.internalfb.com/intern/testinfra/testrun/11540474087422107

**281475054644853**: buck2 test 'fbcode//mode/opt' fbcode//caffe2/test/quantization:test_quantization -- --exact 'caffe2/test/quantization:test_quantization - test_quant_pin_memory (caffe2.test.quantization.core.test_quantized_tensor.TestQuantizedTensor)'
https://www.internalfb.com/intern/testinfra/testrun/11540474087422477

**844425008078016**: buck2 test 'fbcode//mode/opt' fbcode//caffe2/test/quantization:test_quantization -- --exact 'caffe2/test/quantization:test_quantization - test_cuda_quantization_does_not_pin_memory (caffe2.test.quantization.core.test_quantized_tensor.TestQuantizedTensor)'
https://www.internalfb.com/intern/testinfra/testrun/1407375259845199

Differential Revision: D60055277

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133158
Approved by: https://github.com/jovianjaison
2024-08-16 22:00:57 +00:00
fd33499b0c [PT2][Optimus] Fix mixed precison training problem in decompose mem bound (#133626)
Summary: Recently we observed in AI CMF, enabling decompose_mm pass will lead to mixed dtype for aten.mm and aten.addmm errors. By investigation, we figure out that the error comes from torch.sum, which has an implicit type casting to avoid the possible overflow (a similar discussion in github: https://github.com/pytorch/pytorch/issues/115832). Thus we do the output cast to avoid the error.

Test Plan:
# unit test
```
buck2 test mode/dev-nosan //caffe2/test/inductor:decompose_mem_bound_mm -- test_decompose_mm_mixed_precision
```
Buck UI: https://www.internalfb.com/buck2/00dc168e-4d65-40f8-b169-f4a58206f641
Test UI: https://www.internalfb.com/intern/testinfra/testrun/17169973624867151
Network: Up: 25KiB  Down: 44KiB  (reSessionID-b7e2ecc7-16ca-476d-95b2-09ea74645eb0)
Jobs completed: 19. Time elapsed: 1:07.6s.
Cache hits: 0%. Commands: 2 (cached: 0, remote: 0, local: 2)
Tests finished: Pass 6. Fail 0. Fatal 0. Skip 0. Build failure 0

# e2e
ads_dper3:68464f2dc5e849ba2670482079cecaaa
training_platform:2c41d916ad5dd82f196372a8c7bd37a0
### build training_platform
```
buck2 run fbcode//fblearner/flow/projects/training_platform:training_platform
```

### register training_platform
```
buck2 run mode/opt fblearner/flow/projects/training_platform:workflow -- register-workflows --project-name training_platform --flow_version training_platform:2c41d916ad5dd82f196372a8c7bd37a0
```

### build ads_dper 3

```
fbpkg build -E ads_dper3 --yes --expire 14d
```

### register ads_dper 3
```
 buck2 run //pyper/core/eval_app_utils:flow_utils_script -- register --pkg-version ads_dper3:68464f2dc5e849ba2670482079cecaaa
```

### extend package (optional)
```
fbpkg expire --extend-only training_platform:2c41d916ad5dd82f196372a8c7bd37a0 30d
```

### before fix
f591360990

### after fix

baseline
f591395056
proposal

Differential Revision: D61351815

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133626
Approved by: https://github.com/jackiexu1992
2024-08-16 21:53:12 +00:00
be207af6e1 Disable unwrapping scalar tensors when used as outputs (#132859)
If the scalar tensor is an output tensor, it shouldn't be unwrapped (i.e. `.item()` called) since `tl.store` requires a pointer type for outputs. This issue only occurs for mutated buffers: the input tensor is also used as an output tensor.

Fixes #ISSUE_NUMBER

@yanboliang @jansel @ngimel

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132859
Approved by: https://github.com/jansel
2024-08-16 21:40:45 +00:00
861bdf96f4 [MPS] Add native strided API for MPSNDArray starting with macOS 15 (#128393)
Add support for native strides in MPS starting with macOS Sequoia. This will get rid of the additional gather and scatter operations needed to solve the strides or storage offsets of the tensors.

Summary of changes (starting with macOS 15):
- Add support for **MPS strided API** (strides/storage offsets etc):
   - [initWithBuffer:offset:descriptor:](https://developer.apple.com/documentation/metalperformanceshaders/mpsndarray/4391636-initwithbuffer?language=objc)
   - [arrayViewWithCommandBuffer:descriptor:aliasing:](https://developer.apple.com/documentation/metalperformanceshaders/mpsndarray/3114040-arrayviewwithcommandbuffer?language=objc)
   - [arrayViewWithShape:strides:](https://developer.apple.com/documentation/metalperformanceshaders/mpsndarray/4408694-arrayviewwithshape?language=objc)
   - [reshapeWithCommandBuffer:sourceArray:shape:destinationArray:](https://developer.apple.com/documentation/metalperformanceshaders/mpsndarrayidentity/4438557-reshapewithcommandbuffer?language=objc)
- Add native support for NHWC convolutions (without incurring any extra copy from NCHW -> NHWC -> NCHW).
- Add support for strided output buffers (previously we would create a contiguous buffer

OSes older than macOS 15 will run the old gather/scatter code path to solve strides/storage offsets.

---

Couple performance stats collected from torchbench comparing macOS 15 vs macOS 14:
```
- test_train[functorch_maml_omniglot-mps]: 27% faster
- test_train[timm_vision_transformer-mps]: 12% faster
- test_train[hf_T5-mps]: 9.46% faster
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128393
Approved by: https://github.com/albanD

Co-authored-by: Siddharth Kotapati <skotapati@apple.com>
2024-08-16 21:07:50 +00:00
447f428d6d [ROCm] Fix text_export cudnn_attention UT (#133234)
On ROCm we should decompose to flash_attention for sdpa instead of cudnn_attention. Need additional conditionalisation in this code.

Issue observed: https://hud.pytorch.org/failure?name=rocm%20%2F%20linux-focal-rocm6.1-py3.8%20%2F%20test%20(default%2C%203%2C%206%2C%20linux.rocm.gpu.2)&jobName=undefined&failureCaptures=%5B%22export%2Ftest_export.py%3A%3ATestOneOffModelExportResult%3A%3Atest_scaled_dot_product_attention_cuda%22%5D

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133234
Approved by: https://github.com/malfet
2024-08-16 20:49:13 +00:00
f57b00704e [Traceable FSDP2][Dynamo] Support reconstructing CUDA event object within Dynamo graph (#133635)
`torch.cuda.Event` objects are different from `torch.cuda.Stream` in that events are not pooled, meaning we can't look up a previously created CUDA event object by ID. This prevents CUDA event object created outside of the Dynamo graph from being used within the graph (since Dynamo needs a way to emit a `call_function` line in the graph that does the retrieval of the event object for downstream op use). This PR adds a simple object pool within Dynamo utility, to support looking up CUDA event object by ID from within the Dynamo graph.

After this PR, if a user creates a CUDA event object outside of the graph and use that event within the graph, the behavior will exactly match eager.

Test commands:
- `pytest -rA test/dynamo/test_ctx_manager.py::CtxManagerTests::test_cuda_event_created_outside_of_graph`
- `pytest -rA test/dynamo/test_ctx_manager.py::CtxManagerTests::test_cuda_event_across_graph_break`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133635
Approved by: https://github.com/yifuwang
ghstack dependencies: #133532, #133531, #133636
2024-08-16 20:40:46 +00:00
bc9e20b927 Move the layout constraint registration of aten._scaled_mm.default to module scope (#133669)
During Inductor lowering, layout constraints for an op is applied before the op's lowering is called. Currently `add_layout_constraint(aten._scaled_mm.default, constrain_to_fx_strides)` is called inside `aten._scaled_mm.default`'s lowering. This means that if the first `_scaled_mm` to be lowered relies on the layout constraint, it won't be applied and the generated code would fail. The issue won't manifest if the first `_scaled_mm` doesn't rely on the layout constraint.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133669
Approved by: https://github.com/drisspg, https://github.com/yangsiyu007
2024-08-16 20:30:13 +00:00
88ba50279c Consolidate the format for --max-acc-splits flag (#133724)
fixes the partial export of [lowering] Add max_acc_splits (#133041) ([D60133589](https://www.internalfb.com/diff/D60133589))

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133724
Approved by: https://github.com/kit1980
2024-08-16 20:28:55 +00:00
3ac527ac5f [BE][Ez]: Update cudnn_frontend submodule to 1.6.0 (#133687)
Updates CUDNN_frontend header only library to make the most of the newest CUDNN features and decrease the overhead of the library.

Copied from commit:
New API
- Graph Slice Operation: Introduced the graph.slice operation for slicing input tensors. Refer to docs/operations/Slice.md for detailed documentation and samples/cpp/misc/slice.cpp for a C++ sample. Pybinds for this operation have also been added.
- SM Carveout Feature: Added the set_sm_count(int32_t type) graph property to support the SM Carveout feature introduced in Ampere and Hopper GPUs. Engines that do not support SM_COUNT will return NOT_SUPPORTED.
Bug Fixes
- Convolution Mode Attribute: Added the missing set_convolution_mode attribute to convolution attributes in forward propagation (fprop), data gradient (dgrad), and weight gradient (wgrad). Previously, this was hardcoded to CUDNN_CROSS_CORRELATION in the 1.x API.
- SDPA FP8 Backward Node: Fixed an issue with the deserialization of the sdpa_fp8_backward node.
Enhancements
- Graph Execution Overhead: Reduced the overhead of graph.execute() by optimizing sub-node tree traversal, collected UIDs, workspace modifications, and workspace size.
- Graph Validation Performance: Significantly improved (~10x) the performance of graph.validate() by deferring graph expansion to a later stage (build_operation_graph).
- Optional Running Stats for BatchNorm: Made the running statistics for the batch normalization operation optional, supported by cuDNN backend version 9.3.0 and later.
- Shape and Stride Inferencing: Enhanced shape and stride inferencing to preserve the stride order of the input.
- Diagnostic Error Message: Added a diagnostic error message to create_execution_plans if called without the preceding build_operation_graph.
- JSON Schema and Deserialization: Improved the JSON schema and deserialization logic with additional checks.
- Logging Overhead: Reduced logging overhead, resulting in faster graph.build() calls.
- CMake Integration: Replaced CMAKE_SOURCE_DIR with PROJECT_SOURCE_DIR in CMake files for better integration. See the relevant pull request for more details.
Samples
- Jupyter Notebooks: Added Jupyter notebooks for RMSNorm, InstanceNorm, and LayerNorm. Refer to the samples/python folder for more information.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133687
Approved by: https://github.com/eqy, https://github.com/malfet
2024-08-16 20:27:23 +00:00
41e6619509 [codemod] Del un at::native::metal @ MPSCNNFullyConnectedOp.h:6 (export D59157302) (#133515)
Manual export of D59157302

Original description:
Removes a using namespace from the global namespace in pursuit of enabling -Wheader-hygiene. Qualifies instances that relied on the using namespace.

@diff-train-skip-merge
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133515
Approved by: https://github.com/kit1980, https://github.com/malfet
2024-08-16 19:59:07 +00:00
a0cb54ab46 Revert "C++ network flow implementation in c10 (#132188)"
This reverts commit e6272acaec63c960486b3ac558d0199cd65d7b97.

Reverted https://github.com/pytorch/pytorch/pull/132188 on behalf of https://github.com/izaitsevfb due to breaks aps models and builds internally ([comment](https://github.com/pytorch/pytorch/pull/132188#issuecomment-2294120234))
2024-08-16 19:48:54 +00:00
fb59440791 Use dedicated docker-build environment for manywheel, libtorch and conda Docker builds - 2 (#133709)
Follow up after https://github.com/pytorch/pytorch/pull/133699. 2 more placed where we need to pass these env vars.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133709
Approved by: https://github.com/Skylion007, https://github.com/seemethere
2024-08-16 19:41:11 +00:00
678a8f9e66 [Inductor][FlexAttention] Small cleanup for FlexAttention kernel template (#133664)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133664
Approved by: https://github.com/drisspg
2024-08-16 19:33:36 +00:00
611c104370 [MPS] Add workaround for nonzero with large/complex inputs (#126188)
Fixes Issue #122916

Resolves correctness issue seen with large inputs to the mps nonzero op by using a different scatter mode. Native nonzero op is still used with smaller inputs for better performance.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126188
Approved by: https://github.com/kulinseth, https://github.com/malfet
2024-08-16 19:04:04 +00:00
0063e56949 Make FX Graph Cache work with distributed training (#133374)
During distributed training if all ranks except one hit the cache, the rank that did not hit the cache will cause a NCCL timeout since rest of the ranks will enter the collective and start the timer. This PR uses the new PTD API to increase timeout for the ranks that hit the cache by the amount of time the cache would save.

Differential Revision: [D61363722](https://our.internmc.facebook.com/intern/diff/D61363722)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133374
Approved by: https://github.com/ezyang
2024-08-16 18:51:14 +00:00
5ee070266f Workaround ASAN failure (#133623)
Summary:
ASAN in llvm 17.x and newer reads 8 bytes in front of every function called. This means the JIT must not place a function immediately at the beginning of a freshly `mmap`ed page. This adds an 8 byte sized dummy variable as the first thing to work around the problem.

See also:
- https://reviews.llvm.org/D148665
- https://github.com/llvm/llvm-project/issues/65253

Test Plan:
- `servicelab create cogwheel_adfinder_ubsan_multi_trial_test --local-commit`: https://www.internalfb.com/servicelab/experiment/3701354882
- sandcastle

Differential Revision: D61348865

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133623
Approved by: https://github.com/Skylion007
2024-08-16 18:48:10 +00:00
cyy
90c3669cd9 Make sure T::is_traceable is bool (#133673)
Add static_assert to C++ templates in custom_function
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133673
Approved by: https://github.com/Skylion007
2024-08-16 18:28:02 +00:00
eb3d517605 [Test] Add SkipIfRocm to test_grad_acc_cpu_offload (#132975)
Fixes #123726

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132975
Approved by: https://github.com/malfet
2024-08-16 18:26:20 +00:00
e5baf43b61 [Inductor] short-term fix for needs_fixed_stride_order silent incorrectness (#133452)
This is a low-risk short-term fix for
https://github.com/pytorch/pytorch/issues/128084, for the purposes of
2.4.1. The actual fix for that issue is more risky and we'll target 2.5.

needs_fixed_stride_order is silently incorrect with args that are
mutable because it creates clones of those args, writes into them, and
doesn't update the original args.

This PR makes it so that needs_fixed_stride_order doesn't apply to
inputs that are being mutated.

This PR doesn't completely fix the problem, but it makes it less
incorrect: most of the time the input already has the correct strides
but inductor fails to recognize it, and in those cases writing directly
to the input is fine.

Test Plan:
- new test

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133452
Approved by: https://github.com/eellison
2024-08-16 18:14:57 +00:00
caaa339e0f Use dedicated docker-build environment for manywheel, libtorch and conda Docker builds (#133699)
BE change. Apply logic simiar to: https://github.com/pytorch/pytorch/blob/main/.github/workflows/docker-builds.yml

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133699
Approved by: https://github.com/seemethere
2024-08-16 18:10:43 +00:00
b833990a8f Revert "[CUDA][CUTLASS][submodule] Fixes for CUTLASS upgrade (#131493)"
This reverts commit 4aa66f68a803927ddd127ceaaa1521b8d6e90e5f.

Reverted https://github.com/pytorch/pytorch/pull/131493 on behalf of https://github.com/izaitsevfb due to breaks internal builds with identifier "std::numeric_limits< ::cutlass::half_t> ::infinity" is undefined in device code ([comment](https://github.com/pytorch/pytorch/pull/131493#issuecomment-2293939390))
2024-08-16 18:09:33 +00:00
4ee65c7e4e Add message text to BypassFxGraphCache exceptions. (#133505)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133505
Approved by: https://github.com/oulgen
2024-08-16 18:02:59 +00:00
1df1d00ffc [Traceable FSDP2] Remove usage of tuple() generator and simplify code (#133636)
Dynamo doesn't support `tuple()` generator, and this change also simplifies code a bit.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133636
Approved by: https://github.com/awgu
ghstack dependencies: #133532, #133531
2024-08-16 17:47:28 +00:00
374c61cc82 [inductor] make conv template work with symbolic stride/padding (#132938)
Fix https://github.com/pytorch/pytorch/issues/132716

The triton template for convolution does not work when the stride or padding contains dynamic shape. Use the hint and add guards to handle that. An alternative is to fallback to eager, but since I've seen the lowering rule for convolution use the hint in other cases, I'll just follow the convention.

I don't really know how to add a unit test here since I need create symbolic strides (not strides of a tensor but the stride parameter for convolution) and paddings. I can try harder if reviewer swants me to add unit tests.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132938
Approved by: https://github.com/jansel, https://github.com/eellison
ghstack dependencies: #132952
2024-08-16 17:45:12 +00:00
2cffe82dea Fix triton build failure due to tritonlang.blob.core.windows.net not available (#133694)
This should mitigate https://github.com/triton-lang/triton/issues/4527
We should also remove this once our triton pin moves past: https://github.com/triton-lang/triton/pull/4216

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133694
Approved by: https://github.com/Skylion007, https://github.com/kit1980, https://github.com/malfet
2024-08-16 17:34:30 +00:00
f735038c8f [PT2][Optimus] Add unbind_stack_to_slices pass (#133420)
Summary: We find another pattern to be optimized in AI CMF, thus we add the new pattern

Test Plan:
# unit test

```
CUDA_VISIBLE_DEVICES=3 OC_CAUSE=1 buck2 test //caffe2/test/inductor:split_cat_fx_passes
```

Buck UI: https://www.internalfb.com/buck2/b0b9bdf6-1bd1-45db-ba2c-a6892d9d557e
Test UI: https://www.internalfb.com/intern/testinfra/testrun/1125900285323964
Network: Up: 595KiB           Down: 1.7MiB           (reSessionID-e527c3b3-03ac-45f8-bd08-3eb9a28b7dc0)
Tests finished: Pass 9. Fail 0. Fatal 0. Skip 1. Build failure 0

# benchmark

```
CUDA_VISIBLE_DEVICES=3 OC_CAUSE=1 buck2 run mode/opt //scripts/jackiexu0313/pt2:local_model_with_pt2 -- --test_mode batch-split --model_type "ai_cmf" --flow_id 558295195 -n
```
P1520513078

Counter({'pattern_matcher_nodes': 1756, 'pattern_matcher_count': 936, 'normalization_pass': 280, 'merge_splits_pass': 250, 'scmerge_cat_removed': 14, 'scmerge_cat_added': 12, 'scmerge_split_removed': 7, 'unbind_stack_pass': 7, 'split_stack_to_cats_pass': 4, 'scmerge_split_sections_removed': 3, 'split_cat_pass': 2, 'scmerge_split_added': 2, 'split_cat_to_slices_pass': 2, 'unbind_stack_to_slices_pass': 1}

# e2e (OBA AFOC)

baseline
f590253290
proposal
f591051921

### QPS and NE
{F1804187079}

### trace analysis
baseline trace link: https://www.internalfb.com/intern/perfdoctor/trace_view?filepath=tree%2Ftraces%2Fdynocli%2Ff590283096-TrainingApplication%2F4%2Frank-1.Aug_12_08_52_03.3628.pt.trace.json.gz&bucket=pyper_traces

proposal trace link:
https://www.internalfb.com/intern/perfdoctor/trace_view?filepath=tree%2Ftraces%2Fdynocli%2Ff591081210-TrainingApplication%2F0%2Frank-1.Aug_12_22_23_35.3401.pt.trace.json.gz&bucket=pyper_traces

{F1804227687}{F1804227675}
Based on the traces, the green part has been shrinked due to optimus transformation.

Differential Revision: D61039466

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133420
Approved by: https://github.com/jackiexu1992
2024-08-16 17:30:35 +00:00
6790eb52f9 [Traceable FSDP2] Set torch._dynamo.config.skip_fsdp_hooks to True by default (#133531)
Setting `torch._dynamo.config.skip_fsdp_hooks = True` is required for graph-break compiled FSDP2, thus setting it to default will make this adoption easier. If users want to use Traceable FSDP2, they can set this to False manually (which will allow FSDP2 hooks to be traced through).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133531
Approved by: https://github.com/awgu
ghstack dependencies: #133532
2024-08-16 17:18:42 +00:00
6d85077168 [Traceable FSDPS] Allow tracing through FSDP2 impl in trace_rules.py (#133532)
Test commands:
- `python test/distributed/_composable/fsdp/test_fully_shard_training.py TestFullyShard1DTrainingCompose.test_train_parity_with_activation_checkpointing`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133532
Approved by: https://github.com/yanboliang
2024-08-16 17:13:47 +00:00
18705e371d S390x nightly binaries for python 3.13 (#132984)
Enable building python 3.13 nightly binaries for s390x
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132984
Approved by: https://github.com/malfet
2024-08-16 17:07:27 +00:00
770086fe39 [Dynamo] Support torch.cuda.device ctx manager (#133385)
Fixes #128059

I'm not sure if this is the right way, since Inductor doesn't always respect the device id set by users, so probably we should just wrap it as null context manager and print a warning. cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @chenyang78 @kadeng @chauhang @amjames @jansel @anijain2305 @mlazos @williamwen42

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133385
Approved by: https://github.com/jansel
2024-08-16 17:05:55 +00:00
38e5ee1a34 mixed_mm: add more extensive dtype testing (#133292)
This PR adds a test that tests more combinations of dtypes. The bfloat16 and uint8 combination causes a crash somewhere in triton during the generation of LLVM code. Tests like these would have also prevented segfaults like this one https://github.com/pytorch/pytorch/pull/133173.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133292
Approved by: https://github.com/shunting314
2024-08-16 16:49:27 +00:00
9c2d119194 [Profiler/CPU] Add API for Dynamic Activity Toggling [3/n] (#133353)
Summary:
In this diff, we add the CPU activity implementation of being able to dynamically toggle profiling in between steps. To do this we remove the callbacks for Torch Ops and add them back in when an enable call is made.

This diff also adds some support code for doing the same in python; however, the python stack comes with its own set of compilcations when enabling this feature. For one, we get into a scenario where the python stack during the toggle never gets an exit as it the tracing gets turned off which makes for some tricky post processing. For this reason, we can leave the python dynamic toggling off for now and revisit if there is enough demand.

Test Plan: Got the following tracing by disabling torch and cuda ops: https://www.internalfb.com/intern/perfdoctor/trace_view?filepath=tree/traces/dynocli/devvm2185.cco0.facebook.com/rank-0.Aug_13_13_03_02.606577.pt.trace.json.gz&bucket=gpu_traces

Differential Revision: D61221497

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133353
Approved by: https://github.com/sanrise, https://github.com/aaronenyeshi
2024-08-16 16:36:57 +00:00
46af996ce7 [c10d] Do not call ncclCommAbort if comm is not initialized (#133630)
Summary:
We saw ncclCommAbort was called and hang during the NCCLComm:create.
If NCCL comm is not properly initialized, ncclCommAbort behavior is
'undefined', avoid calling it would allow the process to properly throw
exception
Test Plan:

Reviewers:

Subscribers:

Tasks:

Tags:

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133630
Approved by: https://github.com/wconstab
2024-08-16 16:25:07 +00:00
8b8b4e5ae9 AutoHeuristic: documentation for mm (#133611)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133611
Approved by: https://github.com/eellison
ghstack dependencies: #131705, #131710, #131714, #133608
2024-08-16 16:20:38 +00:00
0e0077f3b6 AutoHeuristic: mm ranking heuristic h100 (#133608)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133608
Approved by: https://github.com/eellison
ghstack dependencies: #131705, #131710, #131714
2024-08-16 16:20:38 +00:00
e51c8ad369 AutoHeuristic: Heuristic that ranks choices for mm (#131714)
This PR adds a heuristic for tuned_mm that predicts the top 10 best choices. To be safe, aten.mm is always included.

Perf run: https://hud.pytorch.org/benchmark/compilers?dashboard=torchinductor&startTime=Thu%2C%2008%20Aug%202024%2020%3A20%3A28%20GMT&stopTime=Thu%2C%2015%20Aug%202024%2020%3A20%3A28%20GMT&granularity=hour&suite=torchbench&mode=inference&dtype=bfloat16&deviceName=cuda%20(a100)&lBranch=gh/AlnisM/22/head&lCommit=905826f4ab5344efb0bcaa87e3b27a25299927ab&rBranch=main&rCommit=79ca596dc6ea16b6cdd0f2517451e19840717d37

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131714
Approved by: https://github.com/eellison
ghstack dependencies: #131705, #131710
2024-08-16 16:20:38 +00:00
51e13745be [BE]: Update ruff to 0.6.0 (#133609)
Updates ruff and fixes a couple false negatives it discovered.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133609
Approved by: https://github.com/malfet
2024-08-16 14:11:01 +00:00
eca8b4220f [inductor][cpp][gemm] fix k-slicing bug and add thread blocking config (#132730)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132730
Approved by: https://github.com/leslie-fang-intel, https://github.com/jansel
ghstack dependencies: #132729
2024-08-16 13:50:19 +00:00
a6aa451bde Move python 3.8 to 3.9 for linux-binary-manywheel workflow (#133621)
Part of Deprecation of python 3.8 and moving to 3.9. Related to: https://github.com/pytorch/pytorch/issues/120718
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133621
Approved by: https://github.com/Skylion007, https://github.com/kit1980, https://github.com/malfet
2024-08-16 13:49:26 +00:00
e1b9b89d94 Revert "[Flight Recorder] Add more basic analysis to the script (#133412)"
This reverts commit fcc2fc1a70c35628939611b496b209fa0a1d19bf.

Reverted https://github.com/pytorch/pytorch/pull/133412 on behalf of https://github.com/atalman due to New test: distributed/flight_recorder/test_fr_analysis is constantly failing ([comment](https://github.com/pytorch/pytorch/pull/133412#issuecomment-2293506539))
2024-08-16 13:26:25 +00:00
b444343087 Fix printing symfloat pow in triton (#133614)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133614
Approved by: https://github.com/Skylion007
2024-08-16 13:08:29 +00:00
762b1b4c17 [inductor] [cpp] fix accuracy when template_buffer has users other than the epilogue nodes (#133073)
This PR fixes the accuracy issues when template_buffer has users other than the epilogue nodes. This will fix the accuracy failure of the below models using max-autotune:

- MobileBertForMaskedLM
- MobileBertForQuestionAnswering
- convnext_base
- swin_base_patch4_window7_224

## Issue 1:
Previously we always add `template_buffer` as an alias of `Y`. In case the `template_buffer` has users other than the epilogue nodes, we shouldn't set it as an alias of `Y`. This PR adds the check in such case.

Wrong code before the fix where `tmp4` and `tmp9` are both stored to `Y` while we need 2 different buffers for them since `tmp4` will be used by nodes other than the epilogue node:
```cpp
Y[static_cast<long>(n_start + x1 + (32L*m_start) + (32L*x0))] = tmp4; // tmp4 is the output of the template
Y[static_cast<long>(n_start + x1 + (32L*m_start) + (32L*x0))] = tmp9; // tmp9 is the output of the epilogue node
```

Correct code after the fix:
```cpp
out_ptr2[static_cast<long>(n_start + x1 + (32L*m_start) + (32L*x0))] = tmp4;
Y[static_cast<long>(n_start + x1 + (32L*m_start) + (32L*x0))] = tmp9;
```

## Issue 2:
When fixing the above issue, we found that there's correctness issue when `bias` is `False`. The root cause is that in the case where `bias` is `False`, the `template_buffer` has users other than the epilogue nodes and the GEMM output buffer is localized, we need to add an extra copy epilogue to ensure that the GEMM output (a local buffer) is stored to the `template_buffer` that will be used later by other nodes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133073
Approved by: https://github.com/jgong5
ghstack dependencies: #133070
2024-08-16 12:13:10 +00:00
dd69013c7a deprecate search_autotune_cache (#133628)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133628
Approved by: https://github.com/oulgen
2024-08-16 09:29:39 +00:00
15183f5ebf overestimate time_taken_ns for autotuning (#133633)
tldr; in `autotune_to_one_config` we now include the precompile time, and in coordesc tuning we include the time from `autotune_to_one_config`, since this is a precursor

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133633
Approved by: https://github.com/oulgen, https://github.com/eellison
2024-08-16 09:28:49 +00:00
30fbf5b19c Remove AMD restrictions on triton hashing (#133616)
Summary: When we added these functions, AMD's triton checkout was very old, it appears to have caught up. Remove restrictions.

Test Plan: unit tests

Differential Revision: D61351473

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133616
Approved by: https://github.com/mxz297, https://github.com/nmacchioni, https://github.com/eellison
2024-08-16 08:02:48 +00:00
5ed3b70d09 remove redundant upper bound check at runtime (#133627)
Summary: Some symbols (unbacked symints?) can have upper bound that is `sys.maxsize - 1` but our code for runtime assertions assumes that such upper bounds would come in as `sympy.oo` (like backed symints?) in order to drop them. So we weren't dropping them, which this PR fixes.

Test Plan: added test

Differential Revision: D61352056

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133627
Approved by: https://github.com/SherlockNoMad
2024-08-16 06:57:12 +00:00
f64146aff0 Update source matcher to use torch_fn (#133642)
Updating the source matcher to also accept pattern matching on the torch_fn metadata, which exists in both strict and non-strict export. We want to replace the use of source_fn_stack with torch_fn, as it's not possible for us to get source_fn_stack in non-strict export.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133642
Approved by: https://github.com/ydwu4
2024-08-16 06:42:52 +00:00
d12bbcd785 Add auto-tuning for sparse semi-structured MM operator (#123742)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123742
Approved by: https://github.com/kadeng
2024-08-16 06:40:24 +00:00
3d45717219 [ROCm][CK][Inductor] enable dynamic shapes for CK backend to gemm max autotune (#133285)
This PR enables dynamic shapes for the CK backend for gemm max autotune (see #125453).

This is achieved via unhardcoding the problem sizes from the template body and passing them as parameters instead.

We handle passing the problem sizes for the kernel call as well as for the benchmark call.

# Testing

`pytest test/inductor/test_ck_backend.py [-k dynamic]`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133285
Approved by: https://github.com/ColinPeppler
2024-08-16 06:05:23 +00:00
8ea5b572a6 [PT2][Optimus] Add missing example value for the nodes introduced in group batch fusion (#133414)
Summary: Recently we observed more missing example values in nodes introduced in Optimus, which causes problem to have further optimization when this node info needs to be used. Thus we add the meta for these nodes in the diff.

Test Plan:
# unit test

```
CUDA_VISIBLE_DEVICES=3 OC_CAUSE=1 buck2 test //caffe2/test/inductor:split_cat_fx_passes
```

Buck UI: https://www.internalfb.com/buck2/c0ad506f-ce9d-4b80-947a-cb79074b72f0
Test UI: https://www.internalfb.com/intern/testinfra/testrun/2251800058834808
Network: Up: 1.4GiB  Down: 2.0GiB  (reSessionID-fb781425-f29b-44b5-8a5b-daffe7274f86)
Jobs completed: 300289. Time elapsed: 13:19.5s.
Cache hits: 99%. Commands: 119360 (cached: 118494, remote: 824, local: 42)
Tests finished: Pass 9. Fail 0. Fatal 0. Skip 1. Build failure 0

# benchmark

```
CUDA_VISIBLE_DEVICES=3 OC_CAUSE=1 buck2 run mode/opt //scripts/jackiexu0313/pt2:local_model_with_pt2 -- --test_mode batch-split --model_type "cmf_shrink" --flow_id 587303213
```

P1520691492

Differential Revision: D61039772

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133414
Approved by: https://github.com/jackiexu1992
2024-08-16 04:52:16 +00:00
8a2b064236 [dynamo][user_defined][stable-diffusion] Raise ObservedAttributeError on UserDefinedObject var_getattr (#132806)
Fixes https://github.com/pytorch/pytorch/issues/132551

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132806
Approved by: https://github.com/williamwen42
2024-08-16 04:30:06 +00:00
fcc2fc1a70 [Flight Recorder] Add more basic analysis to the script (#133412)
This is the first step to make sure we have a basic function of analyzer for FR in production.

- We want to use this script to find out abnormalities in collectives and report it to users.
- We also fixed some type errors.

- [Ongoing] Also we will add more unit tests to this script and make it modularized so that we can better maintain it.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133412
Approved by: https://github.com/c-p-i-o
2024-08-16 03:53:12 +00:00
d9f17cf4e4 [fx] Do not add Proxy on Tensor (#133470)
Summary: Switch to set_proxy_slot instead of set the proxy directly on the Tensor. We do not want to add Proxy to tensor objects, because Proxy cannot be deepcopied or pickeled and can cause problems when users want to deepcopy or pickle models.

Test Plan: CI

Differential Revision: D61277650

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133470
Approved by: https://github.com/zou3519
2024-08-16 03:39:50 +00:00
8a5708ba3d [dynamo] Support object creation of classes with custom __new__ (#132977)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132977
Approved by: https://github.com/jansel
2024-08-16 03:09:23 +00:00
a1a869f2f5 [ts_converter][reland] Add support for LinearOpContext and Conv2dOpContext in quantization pass (#133622)
Summary: Reland of D60871242

Test Plan: CI

Differential Revision: D61352600

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133622
Approved by: https://github.com/SherlockNoMad
2024-08-16 01:55:45 +00:00
1653f7786d Fix type promotion for ldexp (#133519)
According to the documentation, ldexp of half and int should return half tensor and ldexp of double should not overflow for 64-bit exponent

Introduce `_pow2` helper function that does not follow scalar to float32 promotion pattern if `self` is reduced precision float or double

Add regression tests to `test_ldexp` and enable it to run on both CPU and GPU

Fixes https://github.com/pytorch/pytorch/issues/133267

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133519
Approved by: https://github.com/janeyx99, https://github.com/Skylion007
2024-08-16 01:26:26 +00:00
3a904d1163 AutoHeuristic: Enable explicit support for ranking (#131710)
This PR adds support for heuristics that rank choices in AutoHeuristic.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131710
Approved by: https://github.com/eellison
ghstack dependencies: #131705
2024-08-16 01:20:52 +00:00
add0f0085c AutoHeuristic: Support ranking/pruning choices (#131705)
This PR adds support in train_decision if one wants to learn a heuristic for ranking. The main idea is that the user has to provide a number of choices the heuristic should return. I added a way to prune the learned decision tree such that it always returns the number of choices provided by the user.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131705
Approved by: https://github.com/eellison
2024-08-16 01:20:52 +00:00
cyy
929d2f8253 [3/N] Fix clang-tidy warnings in torch/csrc/autograd (#133389)
Follows #133295
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133389
Approved by: https://github.com/Skylion007
2024-08-16 00:57:54 +00:00
c22f51ce7c [inductor][cpp][gemm] improve large bs perf with better cache blocking (#132729)
Improve the cache blocking by reducing Mc_blocks to make A reside in L2 and reused by B as much as possible. This improves large bs perf for both scenarios: 1) N is large and K is of medium sizes; 2) K is large. Different strategies are used to handle these scenarios. Check the notes in `get_cache_blocking` in the changes.

Measured with 56-core Intel (R) Xeon (R) CPU Max 9480, jemalloc 5.1 and intel omp, bf16. Run with code cache of B matrix (weights).

Model Shapes | Before Optimization | After Optimization | Speedup | onednn linear | Speedup over onednn
-- | -- | -- | -- | -- | --
M=1024, N=12288, K=4096 (Llama2-8b) | 5.69 ms | 3.71 ms | 1.53 | 4.53 ms | 1.22
M=1024, N=4096, K=4096 (Llama2-8b) | 1.69 ms | 1.63 ms | 1.04 | 2.05 ms | 1.26
M=1024, N=22016, K=4096 (Llama2-8b) | 10.32 ms | 6.57 ms | 1.57 | 8.46 ms | 1.29
M=1024, N=4096, K=11008 (Llama2-8b) | 5.21 ms | 3.26 ms | 1.60 | 4.65 ms | 1.43
M=1024, N=5120, K=4096 (Llama3-8b) | 1.99 ms | 1.78 ms | 1.12 | 2.31 ms | 1.30
M=1024, N=28672, K=4096 (Llama3-8b) | 13.41 ms | 8.56 ms | 1.57 | 10.96 ms | 1.28
M=1024, N=4096, K=14336 (Llama3-8b) | 6.93 ms | 4.31 ms | 1.61 | 6.24 ms | 1.45

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132729
Approved by: https://github.com/leslie-fang-intel, https://github.com/chunyuan-w, https://github.com/jansel
2024-08-16 00:57:51 +00:00
cyy
8f7cf796ea [14/N] Use std::optional (#133417)
Follows #132527
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133417
Approved by: https://github.com/ezyang
2024-08-16 00:48:34 +00:00
d9576c9440 Fix failures when default is flipped for weights_only (#127627)
Tests on XLA shard not fixed yet but there is an issue here https://github.com/pytorch/xla/issues/7799

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127627
Approved by: https://github.com/albanD
ghstack dependencies: #132349
2024-08-16 00:22:43 +00:00
c8ad5e37e8 Fix all RuntimeErrors during weights_only load from being erroneously reported with the weights_only message (#132349)
Caught in above PR #127627

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132349
Approved by: https://github.com/albanD
2024-08-16 00:22:43 +00:00
0d2be06d94 [export] fix test for training ir migration (#133587)
Summary:
Fix quantization pass to be compatible with the new export IR.

Some nodes might have side-effects, so they don't have users, but still are not removed by the DCE pass.

Test Plan:
CI

buck2 run 'fbcode//mode/dev-nosan' fbcode//bolt/nn/executorch/export:export_rle_model  -- -r export_rle_model

Differential Revision: D61223356

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133587
Approved by: https://github.com/tugsbayasgalan
2024-08-15 23:55:09 +00:00
eqy
7ad3108ef2 [CUTLASS][FP8] Skip scaled_mm rowwise test on sm89 (#133612)
Rowwise implementation currently uses sm90-specific features incl. TMA
CC @drisspg

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133612
Approved by: https://github.com/Skylion007
2024-08-15 23:43:30 +00:00
413416cf33 [PT2] Consolidate args and kwargs usage in pre_grad passes (#133518)
Summary: with acc_tracer disabled, the nodes generated use `args` instead of `kwargs` like before, in the current passes there are a mixed usage of `args` and `kwargs` and normalize nodes to switch between them can cause following passes to work/not work, in this diff we create a pass to normalize all the nodes to use `kwargs` at the beginning and changed all the passes to follow the same

Reviewed By: frank-wei

Differential Revision: D61049898

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133518
Approved by: https://github.com/frank-wei
2024-08-15 23:41:39 +00:00
f347174d61 Hipify Pytorch3D (#133343)
Summary:
X-link: https://github.com/fairinternal/pytorch3d/pull/45

X-link: https://github.com/facebookresearch/pytorch3d/pull/1851

Very minor change to extend hipification to a missing hipcub constant. This is needed to hipify some of the kernels in pytorch3d.

Differential Revision: D61171993

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133343
Approved by: https://github.com/houseroad
2024-08-15 23:39:07 +00:00
29c4b4ea5a [executorch] Refactor delegation code (#132773)
Summary: Refactoring partitioner-based delegation to prepare for allowing buffer mutations in the delegate (following diff).

Test Plan: CI

Differential Revision: D60813405

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132773
Approved by: https://github.com/ydwu4, https://github.com/cccclai
2024-08-15 22:52:12 +00:00
86aa327e4a [FSDP2] Added eager fast-path for fp32->bf16 param cast (#133369)
Some recommendation models have a high number of `nn.Parameter`s. This exacerbates per-tensor CPU overheads in FSDP2 compared to FSDP1.

This PR adds a fast-path for the common bf16/fp32 mixed precision case for the casting the parameters from fp32 to bf16 to reduce CPU overhead and possibly have more efficient copy.
- Old: `for` loop + `.to(torch.bfloat16)`, incurring dispatcher overhead per parameter
- New: `torch.empty` + `torch.split` + `torch._foreach_copy_`, incurring three dispatches

---

Example on Llama3-8B which does not have many `nn.Parameter`s (compared to recommendation models):

(Old) on Llama3-8B (0.46 ms CPU overhead for all-gather):
![Screenshot 2024-08-13 at 6 19 39 PM](https://github.com/user-attachments/assets/e6390e9f-ee54-4208-9d60-9451a4142efa)

(New) on Llama3-8B (0.37 ms CPU overhead for all-gather):
![Screenshot 2024-08-13 at 6 20 32 PM](https://github.com/user-attachments/assets/a5dc1d38-53d2-4984-b3cc-85ce5a538ede)

---

Same example as above but now with float8 all-gather:

(Old) on Llama3-8B with float8 (0.996 ms CPU overhead for all-gather):
![Screenshot 2024-08-15 at 11 27 46 AM](https://github.com/user-attachments/assets/2b7e9c9c-56ea-4375-851e-a2a704689d8d)

(New) on Llama3-8B with float8 (1.014 ms CPU overhead for all-gather):
![Screenshot 2024-08-15 at 11 26 33 AM](https://github.com/user-attachments/assets/160cf8f6-bb97-4633-b802-baeae74e3262)

The times are relatively comparable for float8 with the new one possibly slightly slower, but this is mainly because for Llama's transformer blocks, there are only two norm weights that need to cast to bf16. These screenshots are mainly to show that the optimization still works in the mixed case.

Differential Revision: [D61236983](https://our.internmc.facebook.com/intern/diff/D61236983)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133369
Approved by: https://github.com/weifengpy
ghstack dependencies: #133498
2024-08-15 22:27:20 +00:00
90d2593b3e Revert #132806, #132736, #132539, #132487 (#133570)
This reverts commit 25df063f044202899ab92d6f3d77950af5de482f.
This reverts commit de00c7958301ce81b9716bdef5731ed40d4d14ca.
This reverts commit 419b76c4ac80c8b1c95120cd52db622333a3a688.
This reverts commit bc57d5b6ff8725bbe93f0e67db72459720c750cf.

Differential Revision: [D61335013](https://our.internmc.facebook.com/intern/diff/D61335013)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133570
Approved by: https://github.com/albanD, https://github.com/jansel, https://github.com/anijain2305
2024-08-15 20:54:21 +00:00
5f1470d45d [export] Add InterpreterModule to trace_rules (#132949)
Summary: Added InterpreterModule to trace_rules so that it can be torch.compiled. Fixes https://github.com/pytorch/pytorch/issues/132921

Test Plan: CI

Differential Revision: D60426372

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132949
Approved by: https://github.com/zhxchen17
2024-08-15 20:46:13 +00:00
09a489b177 Fix serialization for tensor list output (#133539)
Summary: Some element of tensor list output doesn't not have a user. In such case, create a name as `{node_name}_unused_{index}` for it.

Test Plan: OSS CI

Differential Revision: D61309011

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133539
Approved by: https://github.com/zhxchen17
2024-08-15 20:31:44 +00:00
cdf217cda1 Disable distributed nccl tests to unblock Amazon2023 ami upgrade (#133355)
These tests keep failing on the Linux Amazon 2023 AMI.  The distributed team is looking into them, but until then, disabling the tests in order to unblock the AMI upgrade

Examples of the failures:
Failure 1: https://github.com/pytorch/pytorch/actions/runs/10047579686/job/27770963175
```
FAILED [90.0880s] distributed/test_c10d_nccl.py::NCCLTraceTestDumpOnTimeout::test_timeout_dumps_timing_enabled_False - AssertionError: None mismatch: None is not -6
```

Failure 2: https://github.com/pytorch/pytorch/actions/runs/10047579686/job/27770963494
```
____ NCCLTraceTestTimeoutDumpOnStuckRanks.test_timeout_dumps_on_stuck_ranks ____
Traceback (most recent call last):
  File "/var/lib/jenkins/workspace/test/distributed/test_c10d_nccl.py", line 4214, in test_timeout_dumps_on_stuck_ranks
    self.assertEqual(self._wait_process(0, timeout=90), -6)
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 3721, in assertEqual
    raise error_metas.pop()[0].to_error(
AssertionError: None mismatch: None is not -6
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133355
Approved by: https://github.com/kit1980, https://github.com/wconstab
2024-08-15 20:15:00 +00:00
161cc137d2 [DTensor] Add naive replicate strategy for aten.triu.default and aten.tril.default (#133545)
Shampoo uses triu and tril [here](https://github.com/facebookresearch/optimizers/blob/main/matrix_functions.py#L63). As the matrix input is replicated, we register the naive replicate strategy to unblock.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133545
Approved by: https://github.com/awgu
2024-08-15 20:05:03 +00:00
99cf567714 Make SCRIBE_GRAPHQL_ACCESS_TOKEN available to test jobs running on main (#133536)
It is possible to write to Meta's internal in-memory database Scuba via the Scribe Graph API: https://www.internalfb.com/intern/wiki/Scribe/users/Knowledge_Base/Interacting_with_Scribe_categories/Graph_API/ This is currently being used by pytorch/benchmark repo to upload torchbench performance results.

I want to make this API generally available to all jobs running on CI in a semi-trusted context. To talk to Scribe, you need a secret access token. I have initially configured an environment prod-branch-main which contains `SCRIBE_GRAPHQL_ACCESS_TOKEN`, and switched a single class of jobs (linux-test) to use this environment when they are running on the main branch. Because we require approvals for running CI on untrusted contributions, we could potentially allow all jobs to run in this environment, including jobs on PRs, but I don't need this for my use case (per-PR benchmark result reporting, and miscellaneous statistics on main.)

If this works, I'll push out this environment to the rest of our test jobs.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133536
Approved by: https://github.com/xuzhao9, https://github.com/malfet, https://github.com/albanD
2024-08-15 19:53:17 +00:00
5dfb22d4c8 AutoHeuristic: tests (#133496)
This PR adds tests to AutoHeuristic that ensure that when existing heuristics are re-generated, the generated code stays the same.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133496
Approved by: https://github.com/eellison
2024-08-15 19:22:44 +00:00
7673ee5456 remove benchmarks/__init__.py (#133390)
trying to address https://github.com/pytorch/pytorch/issues/133377

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133390
Approved by: https://github.com/kit1980, https://github.com/malfet, https://github.com/ezyang
2024-08-15 19:08:10 +00:00
dff388491b Fix docs for L1Loss and MSELoss (#133501)
The total number of elements is `N` not `n`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133501
Approved by: https://github.com/mikaylagawarecki
2024-08-15 18:56:55 +00:00
cyy
27538671ae Enable clang-tidy coverage on torch/*.h (#133422)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133422
Approved by: https://github.com/albanD, https://github.com/Skylion007
2024-08-15 18:52:08 +00:00
4aa66f68a8 [CUDA][CUTLASS][submodule] Fixes for CUTLASS upgrade (#131493)
Unblocks/unbreaks against newer CUTLASS (3.5+)

CC @nWEIdia @xwang233 @ptrblck @thakkarV

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131493
Approved by: https://github.com/Skylion007
2024-08-15 18:33:22 +00:00
41d6cabca1 [c10d]Control logging c++ traces with a flag (#133490)
Summary:
Logging C++ stack traces occasionally races with shutdown processes on exception. It isn't safe and we've seen SIGSEGVs in the field.
These crashes prevent flight recorder dumps from completing.

For now, default this dumping to `true` and provide a knob if we need to control things in production.

Test Plan:
Tested locally on a job named `torchx-chirag_test_run` to make sure that the JK was honored by the code.
It was correctly disabled on my test job.
see (TORCH_NCCL_LOG_CPP_STACK_ON_EXCEPTION: 0) below.

```
] [trainer2]:I0814 11:21:20.152419  3708 ProcessGroupNCCL.cpp:874] [PG ID 0PG GUID 0 Rank 10] ProcessGroupNCCL environments: NCCL version: 2.20.3, TORCH_NCCL_ASYNC_ERROR_HANDLING: 1, TORCH_NCCL_DUMP_ON_TIMEOUT: 1, TORCH_NCCL_WAIT_TIMEOUT_DUMP_MILSEC: 60000, TORCH_NCCL_DESYNC_DEBUG: 0, TORCH_NCCL_ENABLE_TIMING: 0, TORCH_NCCL_BLOCKING_WAIT: 0, TORCH_DISTRIBUTED_DEBUG: OFF, TORCH_NCCL_USE_TENSOR_REGISTER_ALLOCATOR_HOOK: 0, TORCH_NCCL_ENABLE_MONITORING: 0, TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC: 480, TORCH_NCCL_TRACE_BUFFER_SIZE: 2000, TORCH_NCCL_COORD_CHECK_MILSEC: 1000, TORCH_NCCL_NAN_CHECK: 0, TORCH_NCCL_LOG_CPP_STACK_ON_EXCEPTION: 0
```

Differential Revision: D61283335

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133490
Approved by: https://github.com/fduwjj
2024-08-15 18:25:02 +00:00
546c53b784 Bump max runners for linux.24xlarge to 500 (#133569)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133569
Approved by: https://github.com/ZainRizvi
2024-08-15 18:02:46 +00:00
59b3f5911d [sigmoid] Support custom obj deserialization. (#133463)
Summary:
It seems we have multiple places deserializing torchbind objects. Moving the code around so that every load essentially share the same implementation.

Also added a test case "package_reader_testing" which load back the archive file in Python and eagerly validate the numerical result.

Test Plan: buck test mode/opt sigmoid/inference/test:e2e_test_cpu

Reviewed By: SherlockNoMad

Differential Revision: D61235770

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133463
Approved by: https://github.com/ydwu4
2024-08-15 17:58:44 +00:00
5ec9c0bc4a Fix linearize(grad(...)) call (#133364)
Fixes #124550

Also moves `graph.eliminate_dead_code()` call to a few lines after
`_inline_module(...)` in `const_fold.py`

* Test plan:

Add a new test on `test_eager_transforms.py` to ensure the reported
issue was indeed fixed

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133364
Approved by: https://github.com/zou3519
2024-08-15 17:55:36 +00:00
cfec69e2a1 Revert "Update fused kernels and call _safe_softmax from SDPA (#131863)"
This reverts commit caba37e99b03d2199848197de4e452b78c8c2a23.

Reverted https://github.com/pytorch/pytorch/pull/131863 on behalf of https://github.com/izaitsevfb due to breaks executorch test executorch/backends/apple/coreml:test - test_vit_skip_conv (executorch.backends.apple.coreml.test.test_coreml_partitioner.TestCoreMLPartitioner) ([comment](https://github.com/pytorch/pytorch/pull/131863#issuecomment-2291855634))
2024-08-15 17:55:07 +00:00
d3b458e603 [export] Do not use export.export for capture_pre_autograd_graph (#133370)
Summary:
Do not use export.export for `capture_pre_autograd_graph` in unittests anymore.

#buildall

Test Plan: CI

Reviewed By: tugsbayasgalan

Differential Revision: D60996041

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133370
Approved by: https://github.com/tugsbayasgalan
2024-08-15 17:37:45 +00:00
2236194c6b [traced-graph][sparse] cleanup test guards (#133375)
Rather than repeating the same guard for every test, simply express it once on the test fixture instead.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133375
Approved by: https://github.com/ezyang
2024-08-15 17:32:06 +00:00
a7c6e30a3f [c10d][ez] Add space between PG ID and PG UID (#133497)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133497
Approved by: https://github.com/shengbao-zheng, https://github.com/wz337
2024-08-15 17:20:12 +00:00
018e48c337 [Reland] Add wrappers for synchronous GPUDirect Storage APIs (#133489)
Reland #130633

USE_CUFILE turned off by default in this version
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133489
Approved by: https://github.com/albanD
2024-08-15 17:11:52 +00:00
c23dceb8f1 Add Adafactor foreach impl (#132336)
This PR adds the foreach impl for Adafactor knowing that there are many ways to improve its runtime perf today (by adding more foreach support). After this PR:
- we have a foreach flag for Adafactor
- It is NOT the default. Why not? It is only slightly faster + uses O(n) more memory where n is the number of params in your max param group. People tend to use Adafactor for memory efficiency.

Next steps:
- make torch.compile possible on it
- make it faster (by adding more foreach apis)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132336
Approved by: https://github.com/albanD
ghstack dependencies: #133360
2024-08-15 17:00:33 +00:00
3434a54fba [CP] Rewrite ring attention backward algorithm and enablement APIs (#131351)
**What does this PR achieve**
1. This PR rewrite ring attention backward algorithm to fuse the alltoall and overlap the gradient communication with computation.

2. Enables memory efficient attention with CP by templating the ring attention backward to verify the accuracy as fp32 gives us higher confident about the implementation correctness.

3. Provides some experimental APIs to enable context parallelism.

4. Ensures CP work with torch.compiler. The combination of causal masking and torch.compiler has not
yet worked.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131351
Approved by: https://github.com/wanchaol
2024-08-15 16:41:51 +00:00
7470ae85e4 Fix triton codegen with math.trunc (#133354)
Fixes https://github.com/pytorch/pytorch/issues/133172

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133354
Approved by: https://github.com/ezyang, https://github.com/jansel
2024-08-15 16:38:26 +00:00
fc5aa24a6e Rewording doc string for clip_grad_norm_ (#133406)
The doc string for `torch.nn.utils.clip_grad_norm_` needed some clarity, it was earlier unclear that the norm was being computed over the norms of individual gradient parameters.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133406
Approved by: https://github.com/mikaylagawarecki
2024-08-15 16:21:27 +00:00
a75248528f [export] refactor _process_dynamic_shapes (#133391)
Sorryyyyy for another refactor. This splits `_process_dynamic_shapes` into 3 parts:
1. `_combine_args` - mostly the same thing
2. `_check_dynamic_shapes`, which is responsible for raising 99% of UserErrors if the dynamic shapes spec is invalid (minus 1 UserError with DerivedDims)
3.  `_process_dynamic_shapes`, which for now, is the same thing, minus the stuff in 2.

This refactor is helpful for incoming automatic dynamic shapes work, because, we're switching to `assume_static_by_default=False`, which is what `_dynamo.export` currently does. This means any unspecified dims are allocated a symbol, in contrast to export today which keeps unspecified dims static. Historically this has been desirable - export users don't want too much dynamism. So we want to change how the spec is translated into constraints.

This means when we switch over to automatic dynamic shapes, we want to plug in something in between steps 2. and 3. which patches up the spec for `assume_static_by_default=False`, filling in static shapes for any unspecified dims, and potentially clearing out the auto-dynamic dims (since they're no-ops). We would do this in-between 2. and 3. to keep `_process_dynamic_shapes` semantically the same, since it's used with `_dynamo.export`.

We could do this without a refactor, plugging in this transform before `_process_dynamic_shapes`, but since that function's responsible for both spec checking + constraint production, moving spec checking to before we transform the specs helps guarantee we're raising errors on what the user's specified, and not an internal export bug.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133391
Approved by: https://github.com/avikchaudhuri
2024-08-15 16:21:21 +00:00
dd6ce2fe7c Restore mixed dtypes GEMM auto-tuning for Ampere (#129058)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129058
Approved by: https://github.com/kadeng
2024-08-15 15:56:09 +00:00
758a0a88a2 [BE][Easy] enable ruff rule PIE790: unnecessary pass statement (#133200)
This PR removes unnecessary `pass` statement. This is semanticly safe because the bytecode for the Python code does not change.

Note that if there is a docstring in the function, a empty function does not need a `pass` statement as placeholder.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133200
Approved by: https://github.com/malfet, https://github.com/eqy, https://github.com/kit1980
2024-08-15 15:50:19 +00:00
57d1ffc512 Ignore torch.onnx._internal in test_circular_dependencies (#133110)
Ignore the whole `_internal` module as code will depend on onnxscript and onnx.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133110
Approved by: https://github.com/titaiwangms, https://github.com/malfet
2024-08-15 15:37:24 +00:00
a6ad834fa8 Fix counting execution time in run_test.py (#133199)
Counting `elapsed_time` immediately after `start_time`, not reflect real execution time of `test_batch`.

Move `elapsed_time` and print method after `run_tests` method call to fix it.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133199
Approved by: https://github.com/clee2000
2024-08-15 15:29:44 +00:00
ec49ce5f8e [CUDA]: Add frexp CUDA bfloat16 support (#133313)
Fixes #133263 Add CUDA bfloat16 support to cuda_frexp

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133313
Approved by: https://github.com/ezyang, https://github.com/eqy
2024-08-15 15:20:00 +00:00
59e33cd1f4 [FSDP2] Set ctx.set_materialize_grads(False) for post-backward (#133498)
https://pytorch.org/docs/stable/generated/torch.autograd.function.FunctionCtx.set_materialize_grads.html
This avoids unnecessarily `aten::zeros` for the inputs in the post-backward custom autograd backward. We do not need the gradient values for the post-backward logic.

Differential Revision: [D61291210](https://our.internmc.facebook.com/intern/diff/D61291210)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133498
Approved by: https://github.com/weifengpy
2024-08-15 14:58:26 +00:00
07adae3dac Revert "Make FX Graph Cache work with distributed training (#133374)"
This reverts commit dcdb25453e0ddc6a83e0052fffc544d4d03cdffd.

Reverted https://github.com/pytorch/pytorch/pull/133374 on behalf of https://github.com/albanD due to Broke trunk ([comment](https://github.com/pytorch/pytorch/pull/133374#issuecomment-2291289260))
2024-08-15 13:43:16 +00:00
32d890745d Revert "Add cache timings info to tlparse (#133504)"
This reverts commit 7eb31e5023fa16c51a984257ee7ee4e17fb3c682.

Reverted https://github.com/pytorch/pytorch/pull/133504 on behalf of https://github.com/albanD due to Broke trunk ([comment](https://github.com/pytorch/pytorch/pull/133374#issuecomment-2291289260))
2024-08-15 13:43:16 +00:00
bbddde311a Migrate inductor jobs to runner determinator (#133457)
Updates inductor jobs to use the runner determinator script.

Depends-On: pytorch/pytorch#133352
Closes: pytorch/ci-infra#257
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133457
Approved by: https://github.com/ZainRizvi
2024-08-15 12:16:39 +00:00
9876aa39c0 AutoHeuristic: pad_mm documentation (#133411)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133411
Approved by: https://github.com/Chillee
ghstack dependencies: #133409, #133410
2024-08-15 10:49:56 +00:00
f32a9e953f AutoHeuristic: mixed_mm documentation (#133410)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133410
Approved by: https://github.com/Chillee
ghstack dependencies: #133409
2024-08-15 10:49:56 +00:00
142353eca3 AutoHeuristic: util scripts (#133409)
This PR introduces scripts that make it easier to use autoheuristic:
- `collect_data.sh`: The user can specify things like the number of GPUs to be used and the number of training samples to collect. This script will open one tmux pane per GPU and collect num_training_samples/num_gpus samples per GPU.
- `merge_data.py`: This script can be used to merge multiple training data files into a single file.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133409
Approved by: https://github.com/Chillee
2024-08-15 10:49:56 +00:00
b0fc6aa412 fix a typo in the householder_product docs (#124279)
The function argument is A, not V.

Remaining inconsistency is the matrix $A$ with columns $v_i$.
It seems, a better solution would be to rename the argument $A \rightarrow V$, but this might lead to backward compatibility issues.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124279
Approved by: https://github.com/lezcano
2024-08-15 09:34:17 +00:00
b6335cfeab Add an option to use do_bench_using_profiling in TORCHINDUCTOR_PROFILE (#133523)
When I did profiling using the "TORCHINDUCTOR_PROFILE" option, some kernel shows less bandwidth than expected. So, added the option to exclude the CPU overheads from the profiling time:

```
# With the option:
(pytorch-3.10) [shuqiyangdevgpu001.lla3 ~/local/pytorch (gh/shunting314/144/head)]$ TORCHINDUCTOR_PROFILE=1 TORCHINDUCTOR_PROFILE_WITH_DO_BENCH_USING_PROFILING=1 TORCHINDUCTOR_PROFILE_OUTPUT=/tmp/profile.txt python ../test_pt/a.py
0.038ms         0.067 GB         1777.11GB/s     triton_poi_fused__to_copy_clamp_clone_mul_0
SUMMARY (/tmp/torchinductor_shuqiyang/tmp03wdg8e4/m6/cm6vdqp62ofwsone3u3fmb42vs3fti5omseo3qn4ddh2bhalsvbn.py)
0.04ms           0.07 GB         1777.11GB/s

# Without the option:
(pytorch-3.10) [shuqiyangdevgpu001.lla3 ~/local/pytorch (gh/shunting314/144/head)]$ TORCHINDUCTOR_PROFILE=1 TORCHINDUCTOR_PROFILE_OUTPUT=/tmp/profile.txt python ../test_pt/a.py
0.040ms         0.067 GB         1663.09GB/s     triton_poi_fused__to_copy_clamp_clone_mul_0
SUMMARY (/tmp/torchinductor_shuqiyang/tmpwr6rraao/s4/cs4npkh77myatwpcmsizyduyfm6ne6o4pg4n3eodejdvvg2j3xzd.py)
0.04ms           0.07 GB         1663.09GB/s
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133523
Approved by: https://github.com/nmacchioni
2024-08-15 09:27:11 +00:00
cf1fc07bd4 [DTensor][Easy] Minor fix to Partial Placement Docstring (#133149)
Minor doc fix: The reduce op string for product should be "product" instead of "prod".
https://github.com/pytorch/pytorch/blob/main/torch/distributed/_functional_collectives.py#L1045

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133149
Approved by: https://github.com/XilunWu, https://github.com/tianyu-l
2024-08-15 08:09:30 +00:00
e6272acaec C++ network flow implementation in c10 (#132188)
The functorch partitioners use network flow to split the joint graph into a forward and backward graph. Internally, we've found that upgrading to networkx 2.8.8 (from 2.5) results in some hard-to-debug failures (internal reference: https://fburl.com/workplace/jrqwagdm). And I'm told that there's interest to remove the python dependency.

So this PR introduces a C++ implementation that mirrors the API provided by networkx. We'll need to add python bindings and do some additional testing to verify correctness.

Differential Revision: [D61284135](https://our.internmc.facebook.com/intern/diff/D61284135)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132188
Approved by: https://github.com/Chillee
2024-08-15 07:32:51 +00:00
c88174df95 typing for remote_cache (#133446)
Summary:
typing annotations for remote_cache
Redo of #133299 with fixed annotations.

Test Plan: unit tests

Differential Revision: D61271883

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133446
Approved by: https://github.com/oulgen
2024-08-15 06:36:13 +00:00
7eb31e5023 Add cache timings info to tlparse (#133504)
https://interncache-all.fbcdn.net/manifold/tlparse_reports/tree/logs/.tmpLR1T85/rank_1/0_0_0/fx_graph_cache_hash_11.json

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133504
Approved by: https://github.com/jamesjwu
ghstack dependencies: #133362, #133363, #133374
2024-08-15 05:53:00 +00:00
448d54ee92 AutoHeuristic: instructions (#132894)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132894
Approved by: https://github.com/Chillee
2024-08-15 04:54:54 +00:00
8624a571b4 [Inductor][CPP] Support vectorization of remainder (#129849)
**Summary**
When check the vectorization status among 3 test suit, we found some operators disabled vectorization with message `Disabled vectorization: op: remainder`. In this PR, we add vectorization support of this op.

**Test Plan**
```
python -u -m pytest -s -v test/inductor/test_cpu_repro.py -k test_vec_remainder
python -u -m pytest -s -v test/inductor/test_cpu_repro.py -k test_int_div_vec
```

Differential Revision: [D61147014](https://our.internmc.facebook.com/intern/diff/D61147014)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129849
Approved by: https://github.com/jgong5, https://github.com/lezcano
2024-08-15 02:06:30 +00:00
1120b5ab55 Revert "[CI] Change inductor-perf-test-nightly naming (#131476)"
This reverts commit 86cb24e6ebf1b85840568fbc62d22629abaf5739.

Reverted https://github.com/pytorch/pytorch/pull/131476 on behalf of https://github.com/desertfire due to manually trigged dashboard run failed ([comment](https://github.com/pytorch/pytorch/pull/131476#issuecomment-2290224084))
2024-08-15 01:18:06 +00:00
c2b2969b5d made some args optional in create_mask (#133413)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133413
Approved by: https://github.com/yanboliang, https://github.com/drisspg
2024-08-15 00:34:55 +00:00
8676401707 [MPS] Enable MPS mm from macOS >= 14.4 (#133494)
Summary of changes:
- [MPS] Enable MPS `mm` op from macOS >= 14.4. Previously this was disabled in https://github.com/pytorch/pytorch/pull/117549 as it was causing crashes with large matrices.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133494
Approved by: https://github.com/malfet
2024-08-15 00:25:22 +00:00
dcdb25453e Make FX Graph Cache work with distributed training (#133374)
During distributed training if all ranks except one hit the cache, the rank that did not hit the cache will cause a NCCL timeout since rest of the ranks will enter the collective and start the timer. This PR uses the new PTD API to increase timeout for the ranks that hit the cache by the amount of time the cache would save.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133374
Approved by: https://github.com/ezyang
ghstack dependencies: #133362, #133363
2024-08-14 22:58:48 +00:00
6d4287419c [ONNX] Disable op_level_debug tests (#133485)
op_level_debug is being deprecated. So we disable the tests.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133485
Approved by: https://github.com/titaiwangms
2024-08-14 22:02:12 +00:00
7a74294786 [sparse] enable meta tests (#133379)
The skip for dynamo is no longer needed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133379
Approved by: https://github.com/ezyang
2024-08-14 21:58:23 +00:00
3965f11837 Minor type annotation updates following up D60954888 (#133382)
Summary: As title.

Test Plan:
CI

Ran lintrunner locally but might have to continue to keep an eye on more oss linting issue if comes up.

Differential Revision: D61240900

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133382
Approved by: https://github.com/ColinPeppler
2024-08-14 21:36:42 +00:00
d8c494910b [EZ] Enable explicitly opting into the old Linux Amazon 2 ami - Pt 1 (#133469)
For the next phase of the Amazon 2023 migration we'll be bulk migrating the remaining jobs over to the new AMI by changing the default AMI that we use.

In preparation for that, we're adding the old Linux Amazon 2 ami as a fixed variant for runners, so that if any of the less frequently jobs breaks on Amazon 2023 AMI then they can shift to explicitly using the Amazon 2 AMI temporarily while the underlying problem is debugged and fixed.

This PR is part 1, and there's a corresponding scale config PR in test-infra: https://github.com/pytorch/test-infra/pull/5551
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133469
Approved by: https://github.com/clee2000
2024-08-14 21:33:02 +00:00
3fc9ee5a31 [DeviceMesh] Directly retrieve flattened mesh if already created (#133195)
Add mapping to keep track of root_to_flatten relationship and directly retrieve the flattened mesh if already created (no pg creation).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133195
Approved by: https://github.com/fegin, https://github.com/wanchaol
ghstack dependencies: #133193
2024-08-14 21:11:04 +00:00
44eaef46d0 [DCP] Fix meta tensor loading (#133256)
We realized the fix for (https://github.com/pytorch/pytorch/pull/129683) loading the learning rate in place actually broke the meta tensor initialization. After the PR #129683, the learning rate is loading correctly, the param with meta tensors are still un-initialized.

We cannot use `tree_map_only_` to iterate over state_dict for initialization in-place,  as `empty_like` and `to("cuda")` are both not in-place option. More context in https://github.com/pytorch/pytorch/issues/130709 Therefore, with changes in (https://github.com/pytorch/pytorch/pull/129683), the tensor after loading are still meta tensors. We previously did not catch that since `self.assertEqual()` does not distinguish a DTensor with meta DTensor.

In this PR, we added a _iterate_state_dict() function to implement in-place update for state_dict and updated the test to make sure that the params are no longer meta tensors after loading.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133256
Approved by: https://github.com/fegin
2024-08-14 21:07:11 +00:00
c0be0105c7 [aarch64] Replace OpenBLAS with NVPL in cuda arm docker (#132811)
Add NVPL to CUDA ARM docker build

original https://github.com/pytorch/builder/pull/1928 moving to pytorch/pytorch repo now

Need to go with builder repo change https://github.com/pytorch/builder/pull/1950

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132811
Approved by: https://github.com/atalman
2024-08-14 21:01:50 +00:00
2e8c1be947 Update date for 2.5 in RELEASE.md (#133503)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133503
Approved by: https://github.com/atalman
2024-08-14 20:45:58 +00:00
86cb24e6eb [CI] Change inductor-perf-test-nightly naming (#131476)
Summary: To make it consistent with inductor-perf-test-nightly-x86
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131476
Approved by: https://github.com/huydhn, https://github.com/zou3519
2024-08-14 20:42:15 +00:00
bedf96d7ff [AOTI] Switch fbcode HIP to C shim version v2 (#133105)
Summary: Completely switch over the default value of c_shim_version to 2

Test Plan: CI

Differential Revision: D60674018

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133105
Approved by: https://github.com/ColinPeppler, https://github.com/zoranzhao
2024-08-14 19:39:10 +00:00
6980e9e569 [AOTI] Disable split_cat_aten passes (#133014)
Summary: disable passes with negative performance impact

Test Plan: run UT

Differential Revision: D60970288

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133014
Approved by: https://github.com/frank-wei
2024-08-14 19:36:17 +00:00
63e5b09218 Add unit test for asymmetric compilation (#133363)
Unit test for asymmetric compilation

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133363
Approved by: https://github.com/jamesjwu
ghstack dependencies: #133362
2024-08-14 19:32:18 +00:00
6f51782a59 Add comptime.sleep (#133362)
Add comp time sleep for NCCL timeout testing. The unit test is not great..

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133362
Approved by: https://github.com/jamesjwu
2024-08-14 19:32:18 +00:00
cf81180007 allow SubConfigProxy of arbitrary depth (#133418)
Before, having arbitrary depth nested configs like

```
class Foo:
    foo: List[int] = [1, 2, 3]
    class Bar:
        bar: str = "1"
        class Baz:
            baz: int = 1
```

would cause problems beyond the first layer. For example, if we tried

```
from torch._inductor import config as inductor_config

print(inductor_config.Foo)
print(repr(inductor_config.Foo.foo))
print(inductor_config.Foo.Bar)
print(repr(inductor_config.Foo.Bar.bar))
print(inductor_config.Foo.Bar.Baz)
print(repr(inductor_config.Foo.Bar.Baz.baz))
```

we would get some output like

```
<torch.utils._config_module.SubConfigProxy object at 0x7fac65de00a0>
[1, 2, 3]
...
AttributeError: torch._inductor.config.Foo.Bar does not exist
```

Obviously, this is not what we want. With these changes, we get the right values

```
<torch.utils._config_module.SubConfigProxy object at 0x7f840d05bf40>
[1, 2, 3]
<torch.utils._config_module.SubConfigProxy object at 0x7f840cedc940>
'1'
<torch.utils._config_module.SubConfigProxy object at 0x7f840cedc100>
1
```

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133418
Approved by: https://github.com/oulgen
2024-08-14 18:43:00 +00:00
d46e0761ca Revert "[11/N] Fix clang-tidy warnings in aten/src/ATen (#133298)"
This reverts commit 35785984013a74469de8c1d29eaecb25aa0c141e.

Reverted https://github.com/pytorch/pytorch/pull/133298 on behalf of https://github.com/izaitsevfb due to causes build time regression in aten/src/ATen/native/cpu/ReduceOpsKernel.cpp ([comment](https://github.com/pytorch/pytorch/pull/133298#issuecomment-2289453440))
2024-08-14 17:47:12 +00:00
07c73a964b [MPS][BE] Delete MacOS-12.3 specific checks (#133141)
And make MPS device unavailable on Sonoma releases As lots of those checks 2 years old, are no longer validated in CI and probably much more such checks are missing

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133141
Approved by: https://github.com/kulinseth, https://github.com/clee2000, https://github.com/atalman
2024-08-14 17:42:40 +00:00
7b269cc484 [TD] llm retrieval to not use bash -l {0} (#133464)
https://github.com/pytorch/pytorch/pull/129720 swapped the action used to setup miniconda from [conda incubator](https://github.com/conda-incubator/setup-miniconda) to the [custom action](2aba8f107a/.github/actions/setup-miniconda/action.yml (L1)) we have in test-infra that comes with caching.

The original miniconda [relies on bash profiles](e5293c8fd2/README.md (L746)) to set the environment variables needed to run conda, but the test infra version relies on the user using the env vars that are set during the step.

This PR changes the job to not use `bash -l {0}` to see if not activating bash profile has an effect on the run.  Unfortunately this failure happens rarely on main so I'm not sure I will be able see if this has an effect.  On the plus side, changing this doesn't seem to have a negative effect on the job, so it should be a noop at worst.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133464
Approved by: https://github.com/kit1980
2024-08-14 16:53:41 +00:00
4bb1650ca3 Bump maxinum num warps (#132458)
Fix for https://github.com/pytorch/pytorch/issues/129104

Our heuristic for num_warps was giving the optimal number, but we were capping maximum num_warps at 8. Gives 1% speedup on HF and TIMM in inference, 2% speedup in TIMM training, neutral otherwise.

ultimately, I think we want live var analysis for register usage.. still worth landing this now.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132458
Approved by: https://github.com/Chillee, https://github.com/shunting314
2024-08-14 16:51:05 +00:00
d114fd78bd [FSDP2] Enable HSDP + TP (#133335)
This PR enables HSDP + TP

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133335
Approved by: https://github.com/awgu
2024-08-14 16:34:04 +00:00
7f40ac9be2 Migrate periodic jobs to use runner determinator (#133124)
This updates the Linux & Windows jobs in periodic.yml to use the runner determinator script.

Closes: pytorch/ci-infra#261

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133124
Approved by: https://github.com/ZainRizvi
2024-08-14 16:04:15 +00:00
118b2a4139 Convert inductor jobs to Linux Amazon 2023 (#133352)
A continuation of the migration started in
- https://github.com/pytorch/pytorch/pull/131250
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133352
Approved by: https://github.com/zxiiro, https://github.com/seemethere
2024-08-14 15:59:33 +00:00
62cd065de2 Validate that node TK_ASSIGN have field initialized (#127878)
Fixes segmentation fault during model load via C++ API.

An `Assign` statement (`TK_ASSIGN` type) have 3 fields: `lhs`, `rhs` and `type`. Field `type` is of type `Maybe`, which means it could be not presented. During model load in `import_source.cpp` field `type` is dereferenced without validation.

It is similar error that have been fixed in #106041.

Fixes #127877

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127878
Approved by: https://github.com/malfet
2024-08-14 15:27:58 +00:00
e554f71d7e Implement filter in dynamo (#131674)
Fixes https://github.com/pytorch/pytorch/issues/128944

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131674
Approved by: https://github.com/amjames, https://github.com/jansel
2024-08-14 14:54:13 +00:00
854a5ba958 [lint] fix lint broken by #131912 (#133428)
lint

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133428
Approved by: https://github.com/aaronenyeshi
2024-08-14 14:50:18 +00:00
378b12f3ad Improve namespace for c10::MemoryFormat::Contiguous in torchgen/api/cpp.py (#131622)
Top-level namespaces are more convenient for out-of-tree device extensions.

For example, now we have a patch for it in `torch_npu`:

98c50ced16/codegen/gen_backend_stubs.py (L772-L778)

```python
JIT_TO_CPP_DEFAULT["contiguous_format"] = "c10::MemoryFormat::Contiguous"
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131622
Approved by: https://github.com/zou3519
2024-08-14 14:41:01 +00:00
efc6e8457a [inductor] [cpp] fix the reindexer from template_buffer to Y (#133070)
This PR fixes the accuracy of jx_nest_base and part of the accuracy issue of convnext_base of the max-autotune path. Another fix (https://github.com/pytorch/pytorch/pull/133073 in this ghstack) is needed to make convnext_base fully pass the accuracy check.

The index calculated via the reindexer was wrong before this PR. Both the shape of the reshape reindexer and the stride order of the stride reindexer needs to be fixed.

Index calculated before this PR:
```
# in_ptr4 points to arg4_1: size = (1, 32, 18, 18), stride = (10368, 1, 576, 32))
auto tmp7 = in_ptr4[static_cast<long>((32L*(static_cast<long>((n_start + x1 + (32L*m_start) + (32L*x0))) % static_cast<long>(18L))) + (576L*(static_cast<long>(c10::div_floor_integer((n_start + x1 + (32L*m_start) + (32L*x0)), 324L)) % static_cast<long>(32L))))];
```

The correct one after the fix is:
```
auto tmp7 = in_ptr4[static_cast<long>(n_start + x1 + (32L*(static_cast<long>((m_start + x0)) % static_cast<long>(324L))))];
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133070
Approved by: https://github.com/jgong5
2024-08-14 11:42:03 +00:00
52741043e7 [Inductor][FlexAttention] Support non-divisible sequence lengths (#133019)
Perf benchmark script: https://gist.github.com/yanboliang/7c34a82df611d4ea8869cb9e041bfbfc
* Update ```Q_LEN``` and ```KV_LEN``` to ```8192-9``` for testing non divisible cases.

Run ```python perf_bench.py --partial-mask```.

* Before this PR

| Seqence length        | Forward | Backward |
|---------------------|-----------------|------------------|
| **Divisible(8192)**       | 0.87            | 0.85             |
| **Non-divisible(8192-9)**   | N/A            | N/A             |

* After this PR

| Seqence length        | Forward | Backward |
|---------------------|-----------------|------------------|
| **Divisible(8192)**       | 0.87            | 0.85             |
| **Non-divisible(8192-9)**   | 0.83            | 0.78             |

Memory out of bounds check passed:
* ```PYTORCH_NO_CUDA_MEMORY_CACHING=1 compute-sanitizer --tool memcheck python perf_bench.py --partial-mask```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133019
Approved by: https://github.com/Chillee
2024-08-14 10:27:39 +00:00
b5711297a0 Add support for SetVariable.discard (#133317)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133317
Approved by: https://github.com/Skylion007
2024-08-14 09:10:36 +00:00
ef580a0e5c [DeviceMesh] Restrict slicing to be a contiguous or non-contiguous subsequence of the root mesh_dim_names (#133193)
This PR adds restriction for DeviceMesh slicing. No out-of-order subsequence slicing is allowed. To create a flatten mesh_dim_names, only the in-order slicing is allowed.

```
mesh_3d = init_device_mesh(
    self.device_type, (2,2,2), mesh_dim_names=("dp", "cp", "tp"),
)

# valid 2d slicing
mesh_2d = mesh_3d["dp", "cp"]
mesh_2d = mesh_3d["dp", "tp"]
mesh_2d = mesh_3d["cp", "tp"]

# invalid 2d slicing
mesh_2d = mesh_3d["cp", "dp"]
mesh_2d = mesh_3d["tp", "cp"]
mesh_2d = mesh_3d["tp", "dp"]

# valid way to create dp_cp flatten slice
dp_cp_mesh = mesh_3d["dp", "cp"]._flatten()
# invalid way to create dp_cp flatten slice
dp_cp_mesh = mesh_3d["cp", "dp"]._flatten()
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133193
Approved by: https://github.com/fegin, https://github.com/wanchaol
2024-08-14 07:18:41 +00:00
d143f879e2 [DTensor] Add more aten._foreach ops to _pointwise_ops.py (#133271)
Fixes #ISSUE_NUMBER

Follow up for https://github.com/pytorch/pytorch/pull/132056. Added the missing foreach ops pointed out by @ad8e.

```
_foreach_sub.Scalar
_foreach_exp
_foreach_exp_
_foreach_cos_
_foreach_log_
```

As @ad8e mentioned, since the list of _foreach ops at https://pytorch.org/cppdocs/api/library_root.html is long and overload-heavy, it could be annoying to manually keep this file updated. We might need to come up with a way to update the list and add associated tests systematically.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133271
Approved by: https://github.com/awgu
2024-08-14 07:14:29 +00:00
a6413d2924 Regression test for S429861 (#133376)
Adds repro test to verify that https://www.internalfb.com/sevmanager/view/429861 does not occur again.

I haven't been able to reduce the size of the repro further, if I remove any buffers the error disappears!

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133376
Approved by: https://github.com/eellison
2024-08-14 06:55:05 +00:00
a30504b2a2 fix silly error when printing diff (#133345)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/133336

When we fail to suggest fixes for a data dependent error because some symbols couldn't be mapped to sources, we print out those symbols but there was a silly bug in the printing code.

New error:
```
...
    raise self._make_data_dependent_error(
torch.fx.experimental.symbolic_shapes.GuardOnDataDependentSymNode: Could not guard on data-dependent expression Eq(u0 + 1, CeilToInt(IntTrueDiv(u0 + 1, 1))) (unhinted: Eq(u0 + 1, CeilToInt(IntTrueDiv(u0 + 1, 1)))).  (Size-like symbols: u0)

Potential framework code culprit (scroll up for full backtrace):
  File "/data/users/avik/fbsource/buck-out/v2/gen/fbcode/6ef5f323b6193f0f/pyspeech/fb/tools/__export_speech_llama__/export_speech_llama#link-tree/torch/_refs/__init__.py", line 2972, in expand
    guard_size_oblivious(requested_length == x)

For more information, run with TORCH_LOGS="dynamic"
For extended logs when we create symbols, also add TORCHDYNAMO_EXTENDED_DEBUG_CREATE_SYMBOL="u0"
If you suspect the guard was triggered from C++, add TORCHDYNAMO_EXTENDED_DEBUG_CPP=1
For more debugging help, see https://docs.google.com/document/d/1HSuTTVvYH1pTew89Rtpeu84Ht3nQEFTYhAX3Ypa_xJs/edit?usp=sharing

For C++ stack trace, run with TORCHDYNAMO_EXTENDED_DEBUG_CPP=1

The following call raised this error:
  File "/data/users/avik/fbsource/buck-out/v2/gen/fbcode/6ef5f323b6193f0f/pyspeech/fb/tools/__export_speech_llama__/export_speech_llama#link-tree/pyspeech/nn/utils.py", line 271, in lengths_to_padding_mask
    ).expand(batch_size, max_length)
```

Test Plan: Repro gets past reported error, hits new error

Differential Revision: D61221994

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133345
Approved by: https://github.com/ezyang
2024-08-14 06:52:55 +00:00
4d11a9b783 [CI] Fix rowwise scaling tests on h100 (#133384)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133384
Approved by: https://github.com/malfet, https://github.com/nWEIdia
2024-08-14 05:58:33 +00:00
7aee3376e2 [aotd] HOP effect tokens wrapper above SubclassWrapper (#131672)
Original issue:
https://github.com/pytorch/pytorch/issues/129486

Before subclass_wrapper() got inputs containing additional effect tokens and failed as this did not match SubclassMeta indexes.

This happened as functionalization was responsible to add / remove those tokens.

Functionalization can not be run above Subclasses, as args/outs are duplicated in case of mutations.

The main design thought is to  keep logic of EffectTokens, Subclasses, Functionalization to know as less as possible about each others transformations.

For that extracting EffectTokens manipulation to a separate wrapper, which will be processed above SubclassWrapper, while functionalization will happen below SubclassWrapper as before.

In that case subclass wrap/unwrap works without information of additional arguments.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131672
Approved by: https://github.com/bdhirsh, https://github.com/zou3519
2024-08-14 05:57:17 +00:00
2a4304329b [wip][lowering] Add max_acc_splits (#133041)
Summary: Model owners can set the lower_settings with max_acc_splits=2, and lowering will fail during model iteration, to alert them of possible performance degradation from increased fragmentation.

Test Plan: Added unit tests

Differential Revision: D60133589

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133041
Approved by: https://github.com/hl475
2024-08-14 03:50:31 +00:00
f951fcd1d7 Inductor-CPU WoQ int8 GEMM micro-kernel with scale epilogue (#131887)
## Summary

As part of #125683, this PR modifies existing CPU GEMM cpp template & micro-kernel template to enable int8 WoQ GEMM auto-tuning with AVX2, AVX512 & AMX ISAs (the latter is only available on Xeon 4th generation & beyond).

WoQ GEMM takes FP16/BF16 activations, int8 weights, and scale of the same dtype as activations.
The operation is equivalent to `torch.nn.functional.linear(x, w.to(x.dtype)) * scale`, which is essentially what the ATen op `torch.ops.aten._weight_int8pack_mm` currently does (except that weights are not cached by it). Weights will be considered constant & cached, so this implementation is suitable for inference, and not QAT. `scale` is supported as a `mul` epilogue.

Only BF16 activations have been supported in this PR because for FP16 & FP32, weight is dequantized during constant-folding pass of freezing, and then after auto-tuning, performance with a large `M` dimension may be better than either torch.ops.aten._weight_int8pack_mm, or the WoQ micro-kernel support introduced in this PR, which dequantizes `w` within the micro-kernel.
While even BF16 activations with a large `M` dimension may benefit from dequantizing `w` beforehand, for now, they would  use WoQ support in GEMM templates for auto-tuning, and then a subsequent PR would add logic for deciding whether or not to dequantize weights beforehand.

### Performance
#### AMX
Op-level speedup due to AMX micro-kernel (selected during auto-tuning) on 32 physical cores of Intel(R) Xeon(R) Platinum 8468H (of Xeon 4th generation series, codenamed Sapphire Rapids) vs. ATen kernel `torch.ops.aten._weight_int8pack_mm`. Intel OpenMP & tcmalloc were preloaded.

In a few cases with an odd `K`, the implementation being added in this PR may not perform as well as the ATen kernel, which is unrelated to this PR, though, since `test_linear_amx` also exhibits similar datapoints. In those cases, the AMX micro-kernel might be slower than AVX512 micro-kernel, so if such sets of shapes are used for auto-tuning, either the AVX512 micro-kernel implementation, or the ATen kernel would be chosen instead.

Benchmarked with unit-tests.

Tabular data at https://gist.github.com/sanchitintel/294811a86c8ff6b867c668ae2107c405?permalink_comment_id=5142442#gistcomment-5142442

The AVX512 micro-kernel was disabled to collect data for AMX micro-kernel.

#### AVX2/AVX512 micro-kernels

Tabular data at at https://gist.github.com/sanchitintel/52b5fa9c66f791be19e48e2aa6423dc4?permalink_comment_id=5142437#gistcomment-5142437

### Follow-up
1. int4 WoQ GEMM micro-kernel will also be added in a separate PR.
2. A subsequent PR would add logic for deciding whether or not to dequantize weights beforehand.

E2E perf measurement should be done with #131310.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131887
Approved by: https://github.com/jgong5, https://github.com/leslie-fang-intel, https://github.com/jansel
2024-08-14 03:14:45 +00:00
918367ebb0 Add new runner: G4DN Extra Large with T4 for windows binary builds (#133229)
Prep for #103104

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133229
Approved by: https://github.com/ZainRizvi
2024-08-14 03:08:49 +00:00
1206958d89 [Dynamo] add EventVariable reconstruct (#133236)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133236
Approved by: https://github.com/yifuwang
2024-08-14 02:56:11 +00:00
d1d6b370ce Upgrade nightly wheels to rocm6.2 - 1 of 2 (docker images) (#132875)
Fixes https://github.com/pytorch/pytorch/issues/132570

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132875
Approved by: https://github.com/atalman
2024-08-14 02:46:48 +00:00
14750dd737 Correct return type of grouping helper function in Optimizer (#133360)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133360
Approved by: https://github.com/albanD
2024-08-14 01:56:02 +00:00
5fff960004 [PT2][Optimus] Extend split_stack_to_cats when split and stack have different dims (#133060)
Summary: We observed a special case in AI CMF where the split and stack nodes have different dims, thus we extend our current implementation to include the special case.

Test Plan:
# unit test

```
CUDA_VISIBLE_DEVICES=3 OC_CAUSE=1 buck2 test //caffe2/test/inductor:split_cat_fx_passes
```

Buck UI: https://www.internalfb.com/buck2/6d0502bc-c840-425e-b686-b00b0b7da5f5
Test UI: https://www.internalfb.com/intern/testinfra/testrun/17732923577411786
Network: Up: 353KiB  Down: 611KiB  (reSessionID-1f80d74b-543f-4856-b3bf-181283c0f7e3)
Jobs completed: 29. Time elapsed: 5:36.7s.
Cache hits: 0%. Commands: 4 (cached: 0, remote: 1, local: 3)
Tests finished: Pass 9. Fail 0. Fatal 0. Skip 1. Build failure 0

# benchmark

```
CUDA_VISIBLE_DEVICES=3 OC_CAUSE=1 buck2 run mode/opt //scripts/jackiexu0313/pt2:local_model_with_pt2 -- --test_mode batch-split --model_type "ai_cmf" --flow_id 558295195 -n
```

 Counter({'pattern_matcher_nodes': 2321, 'pattern_matcher_count': 1320, 'normalization_pass': 280, 'merge_splits_pass': 250, 'extern_calls': 95, 'normalization_aten_pass': 28, 'scmerge_cat_removed': 14, 'scmerge_cat_added': 12, 'scmerge_split_removed': 7, 'unbind_stack_pass': 7, 'split_stack_to_cats_pass': 4, 'scmerge_split_sections_removed': 3, 'batch_aten_add': 3, 'batch_aten_mul': 3, 'split_cat_pass': 2, 'scmerge_split_added': 2, 'split_cat_to_slices_pass': 2, 'fxgraph_cache_miss': 2, 'batch_linear_post_grad': 1})

torch graph
https://www.internalfb.com/intern/everpaste/?color=0&handle=GK5kwRZRtEMCZTAJAJlRpekhPhp0br0LAAAz

# e2e

Differential Revision: D60998945

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133060
Approved by: https://github.com/jackiexu1992
2024-08-14 01:45:12 +00:00
4af4910b1a Reland "Construct NJT without graph breaks" (#133196)
This reverts commit 154d40ca488e6979ce9c2de89d8a35b53129ebea.

and adds changes from https://github.com/pytorch/pytorch/pull/133061

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133196
Approved by: https://github.com/ezyang
ghstack dependencies: #133145
2024-08-14 01:11:13 +00:00
f23dbefe52 [export] Support "custom" metadata field. (#131912)
Summary:
Add a special field in Graph and Node level metadata called "custom" which should be mapped to a json-serializable object, and we guarantee this field should be always preversed across the following transformations:
1. copy/deepcopy
2. run_decompositions()
3. serialization
4. re-exporting

Test Plan: :test_export -- -r custom_tag

Reviewed By: angelayi

Differential Revision: D60291839

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131912
Approved by: https://github.com/angelayi
2024-08-14 01:09:01 +00:00
cyy
c2eeda5da0 [structural binding][12/N] Replace std::tie with structural binding (#131031)
Follows #130830
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131031
Approved by: https://github.com/ezyang
2024-08-14 00:51:34 +00:00
7666ef9d9b [GHF] Fix co-authors attribution (#133372)
Acording to https://docs.github.com/en/pull-requests/committing-changes-to-your-project/creating-and-editing-commits/creating-a-commit-with-multiple-authors Co-authors must be mentioned at the very end of commit message and separated by 2 newlines

Test plan:
```python
from trymerge import GitHubPR
pr = GitHubPR("pytorch", "pytorch", 133189)
print(pr.gen_commit_message())
```

Fixes https://github.com/pytorch/pytorch/issues/133310

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133372
Approved by: https://github.com/kit1980
2024-08-14 00:48:24 +00:00
cyy
3578598401 [11/N] Fix clang-tidy warnings in aten/src/ATen (#133298)
Follows #133155

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133298
Approved by: https://github.com/ezyang
2024-08-14 00:29:38 +00:00
fbb0adbc32 [TunableOp] lazy init TuningContext singleton (#133347)
Forward fix after #132464 because TuningContext had been created during static library init, which creates the TuningResultsValidator, which tries to query HIP device properties before the HIP runtime has initialized.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133347
Approved by: https://github.com/zixi-qi
2024-08-14 00:20:11 +00:00
5947169c9d Add missing endline in exception message (#133240)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133240
Approved by: https://github.com/Skylion007
2024-08-14 00:11:39 +00:00
c91bc499f7 [CI] Do not emit color escape sequence during testing (#133350)
By forcing term to vt100

Fixes problem reported in  https://github.com/pytorch/pytorch/issues/133330 but more broadly it should be fixed on Nova/Infra side

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133350
Approved by: https://github.com/zou3519
2024-08-13 23:39:16 +00:00
caba37e99b Update fused kernels and call _safe_softmax from SDPA (#131863)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131863
Approved by: https://github.com/jbschlosser, https://github.com/Chillee
2024-08-13 23:37:50 +00:00
9de023d44d [Dynamo] Make torch.Size can be reconstructed by LOAD_CONST (#133342)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133342
Approved by: https://github.com/mlazos, https://github.com/jansel
2024-08-13 23:18:38 +00:00
c17d26c3c1 [AOTI][Tooling] A couple fixes / minor updates for initial debug printer (#133016)
Summary:
Follow up small diff to fix a couple issues:
-  add condition for cuda/gpu case to only print kernel name list in the second pass i.e. when we do the cpp wrapper codegen

- other minor fixes around `AOT_INDUCTOR_FILTERED_KERNELS_TO_PRINT` option

Test Plan:
```
AOT_INDUCTOR_FILTERED_KERNELS_TO_PRINT="triton_poi_fused_0" AOT_INDUCTOR_DEBUG_INTERMEDIATE_VALUE_PRINTER=1 TORCHINDUCTOR_FORCE_DISABLE_CACHES=1  TORCHINDUCTOR_ABI_COMPATIBLE=1 TORCH_COMPILE_DEBUG=1 TORCH_LOGS="+graph, inductor, +schedule, output_code" buck2 run -c fbcode.enable_gpu_sections=true -c fbcode.nvcc_arch=h100 @//mode/opt fbcode//caffe2/test/inductor:test_aot_inductor -- -r test_addmm_abi_compatible_cuda
```

Differential Revision: D60954888

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133016
Approved by: https://github.com/ColinPeppler
2024-08-13 23:00:29 +00:00
41da528378 [BE] Skip inductor+profiler test for templates if we didn't run select_autotune (#133344)
Sometimes we don't have enough SMs to do autotuning and then we fall back to aten, in which case we won't run the template kernel and it won't show up in the profile trace.

Differential Revision: [D61222101](https://our.internmc.facebook.com/intern/diff/D61222101/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133344
Approved by: https://github.com/masnesral
2024-08-13 22:58:24 +00:00
8e074c4583 [ROCm] skip SymmetricMemory related UTs for ROCm (#133241)
This features is not yet supported on ROCm.
Skipping:
distributed/test_symmetric_memory.py::SymmetricMemoryTest::test_low_contention_all_gather_symm_mem_input_False
With the errors:
RuntimeError: CUDASymmetricMemory requires PYTORCH_C10_DRIVER_API_SUPPORTED

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133241
Approved by: https://github.com/pruthvistony, https://github.com/malfet
2024-08-13 22:33:51 +00:00
5a1d4f7ddc Migrate lint.yml to runner determinator (#133320)
Update the jobs in lint.yml to use the runner determinator.

Closes: pytorch/ci-infra#258

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133320
Approved by: https://github.com/Skylion007
2024-08-13 22:16:32 +00:00
a9d34138df [traced-graph][sparse] add to_dense() operation to sparse export test (#133175)
This works for sparse COO but surprisingly still fails for the other compressed sparse cases. I filed the following bug report:

https://github.com/pytorch/pytorch/issues/133174
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133175
Approved by: https://github.com/ezyang
2024-08-13 20:36:40 +00:00
69de9e78e9 Revert "typing for remote_cache (#133299)"
This reverts commit 2fde1934f9efc418cc5a398bd0b09b29551cc091.

Reverted https://github.com/pytorch/pytorch/pull/133299 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/133299#issuecomment-2287067434))
2024-08-13 20:26:24 +00:00
fa7ae6cdbc can't infer device on benchmarked function with no args or kwargs (#133290)
when we call benchmarker.benchmark(fn, (), {}) it attempts to infer the device from the args and kwargs, which are both empty. in this case the default behavior is to assume CPU, since `is_cpu_device` is implemented as `all([x.device == "cpu" for x in ... if x is Tensor])`, and `all([]) == True`. I've added a PR that makes this raise an error, but we should just fix this one callsite first

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133290
Approved by: https://github.com/eellison
2024-08-13 20:13:44 +00:00
dfc7c860e4 Allow SymInt input for torch.fx reinplace pass (#133178)
Fixes #133176

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133178
Approved by: https://github.com/ezyang
2024-08-13 20:07:17 +00:00
61625a18ef [profiler] Only parse kineto requests and build tree when required (#132713)
To avoid high overheads of constructing datastructure in python when the user is simply saving trace to a file, we only process things lazily.

## Details
1. Delay function event parsing, add a flag to denote when needed.
1. Make profiler.function_events a computed property so code using `prof.function_events` does not need to change.
1. Fix coverage for `str(prof)` in profiler tests.

## Test run
Test program
```
import torch
from torch.profiler import profile, record_function, ProfilerActivity

def payload(use_cuda=False):
    x = torch.randn(10, 10)
    if use_cuda:
        x = x.cuda()
    y = torch.randn(10, 10)
    if use_cuda:
        y = y.cuda()
    z = torch.mm(x, y)
    z = z + y
    if use_cuda:
        z = z.cpu()

with profile(activities=[ProfilerActivity.CPU], record_shapes=True) as prof:
    with record_function("model_inference"):
        payload()

prof.export_chrome_trace("/tmp/test_trace.json")
#print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=10))
```

The print "this is computing events" will happen lazily.

```
>]$ python3 profiler_test.py
Brian: this is computing function events
----------------------  ------------  ------------  ------------  ------------  ------------  ------------
                  Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg    # of Calls
----------------------  ------------  ------------  ------------  ------------  ------------  ------------
       model_inference         6.77%     441.628us       100.00%       6.523ms       6.523ms             1
           aten::randn         1.86%     121.108us        46.93%       3.061ms       1.530ms             2
              aten::mm        45.36%       2.959ms        45.44%       2.964ms       2.964ms             1
         aten::normal_        44.72%       2.917ms        44.72%       2.917ms       1.458ms             2
             aten::add         0.87%      56.646us         0.87%      56.646us      56.646us             1
           aten::empty         0.35%      22.808us         0.35%      22.808us      11.404us             2
    aten::resolve_conj         0.08%       5.173us         0.08%       5.173us       1.724us             3
----------------------  ------------  ------------  ------------  ------------  ------------  ------------
Self CPU time total: 6.523ms

$> python3 profiler_test.py
(pytorch) [bcoutinho@devgpu038.ftw6 /data/users/bcoutinho/pytorch (profiler_optimize_parsing)]$
$>ls -a profiler_test.py
$> ls -l /tmp/test_trace.json
-rw-r--r-- 1 bcoutinho users 16471 Aug  5 16:10 /tmp/test_trace.json
```
## Unit test
Updates some tests and they all pass now.
`pytest test/profiler/test_profiler.py`

Also
`python test/test_autograd.py TestAutogradWithCompiledAutograd.test_record_function`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132713
Approved by: https://github.com/sraikund16
2024-08-13 18:58:20 +00:00
657d58bbd8 [inductor] Fix test_codecache::test_inductor_counters (#133244)
Summary: This test is flakey internally, but it's not a great test in the first place since it's relying on the max-autotune step to bump a related counter. Instead of doing that, directly install a mock that bumps a counter specifically for this test. Additionally, test that the caching logic correctly accommodates an arbitrary counter delta (previously the relevant counter is only bumped by +1).

Differential Revision: [D61141164](https://our.internmc.facebook.com/intern/diff/D61141164)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133244
Approved by: https://github.com/eellison
2024-08-13 18:52:27 +00:00
2fde1934f9 typing for remote_cache (#133299)
Summary: typing annotations for remote_cache

Test Plan: unit tests

Reviewed By: oulgen

Differential Revision: D60937968

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133299
Approved by: https://github.com/Skylion007
2024-08-13 18:28:41 +00:00
a1ca4dfe0b [ONNX] Fix onnx conversion scaled_dot_product_attention (#133314)
Fixes error message raised by the torch>=2.5: A mismatch between the number of arguments (8) and their descriptors (7) was found at symbolic function 'scaled_dot_product_attention' by adding the newly introduced use_gqa parameter.

From https://github.com/pytorch/pytorch/pull/132689
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133314
Approved by: https://github.com/Skylion007, https://github.com/justinchuby
2024-08-13 18:22:24 +00:00
19416bf38b Reland "[2/2] PT2 Inductor ComboKernels - automatic horizontal fusing (#131675)" (#133291)
Reland by reverting commit 844103197d3e8cf6b4b59176e473365113f4f962. #131675 failed a few internal tests because it imported a diff version which wasn't rebased on the proper dependent diffs. Reland from OSS only to avoid the out-of-sync issue.

Original description from #131675
Summary:
A ComboKernel combines independent Inductor Triton kernels into a single one.
This is part 2 pull request which 1) adds automatic horizontal fusion in the end of the inductor operator fusion process, 2) adds type annotation for trition_combo_kernel.py

ComboKernel is used in two cases: 1) for existing foreach kernels, combo kernels are used as the backend kernel. the front-end kernel generation logic remains the same. 2) Added an extra optimization phase to the end of the scheduler to generate extra combo kernels if combo_kernels is True in config.py

This is part 2 pull request which deals with the 2nd case above:

The combo kernel generation in the added optimization phase is done in two steps: 1) in the front end inside the scheduler, it topologically sort the schedule nodes to find all the nodes with no data dependency and create a frond end schedule node for them. We currently limit the maximal number of sub-nodes for each combo kernel to 8 (but we still need to find what is the optimal number). 2) then, these sub-nodes are combined in the codegen phase to generate the combo kernel code for them based on a few rules. For example, 1d and 2d kernels are separated into different combo kernels, as mixing them is not supported yet. Note these algorithms we provide are very basic, and the users can register their customized combo kernel generation algorithms for both steps.

Performance wise, combining small kernels is about always to see performance gain. however, combining very large kernels may not see any perf gain, sometimes even regression possibly due to improper block sizes. Thus, a benchmark function is implemented to avoid such perf regression, and it is recommended to turn it on by setting benchmark_combo_kernels to True whenever combo_kernels is True.

Please refer to part 1 pull request https://github.com/pytorch/pytorch/pull/124969 for more details.

Test Plan: buck2 test mode/dev-nosan caffe2/test/inductor:combo_kernels

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133291
Approved by: https://github.com/wdvr
2024-08-13 18:18:12 +00:00
dadb20a9d6 [Memory Snapshot][Viz] Add Allocator Settings Tab (#132518)
Summary: Since we are storing the allocator settings in the snapshot files for awhile now (since https://github.com/pytorch/pytorch/pull/119404), we can expose this to users with a new tab in the visualizer.

Test Plan:
Ran it locally:
![image](https://github.com/user-attachments/assets/5f79ccd0-fe1c-4e42-bb58-106d8f3cccd6)

Differential Revision: D60673548

Pulled By: aaronenyeshi

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132518
Approved by: https://github.com/tianfengfrank, https://github.com/zdevito
2024-08-13 17:35:12 +00:00
7172c732d9 [Memory Snapshot] Skip C++ warmup unwind() call if context is not set (#133038)
Summary: Should skip C++ warmup `unwind::unwind();` if there is no context set. This call is sometimes causing hanging issues since C++ stack collection is not robust.

Test Plan: CI

Differential Revision: D60965985

Pulled By: aaronenyeshi

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133038
Approved by: https://github.com/eqy
2024-08-13 17:25:24 +00:00
be400ee2b4 [inductor][test] Fix test_vertical_pointwise_reduction_fusion (#133276)
Summary: Fix after https://github.com/pytorch/pytorch/pull/131649 changes behavior for fusion.

Test Plan: ci

Differential Revision: D61165949

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133276
Approved by: https://github.com/ColinPeppler
2024-08-13 17:18:43 +00:00
89795da5e3 [inductor] process compile_only case in all build options class. (#129975)
Optimize `compile_only` logical. Origin code only apply for `CppTorchCudaOptions`, this PR make it apply for all build option classes.
Changes:
1. Remove `libraries_dirs` and `libraries` settings, when `compile_only`.
2. Remove compile_only from CppTorchCudaOptions.
3. Make the `compile_only` apply for all classes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129975
Approved by: https://github.com/henrylhtsang
2024-08-13 16:45:27 +00:00
19270cff61 Add a reference for the LRScheduler class (#133243)
The `LRScheduler` class provides methods to adjusts the learning rate during optimization (as updated in this PR). Also, as a note, all the classes of lr_scheduluer are already provided in the `How to adjust learning rate` section.

Fixes #127884

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133243
Approved by: https://github.com/janeyx99
2024-08-13 16:20:22 +00:00
aa4fbba42d Make q info optional in prep for inference (#133261)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133261
Approved by: https://github.com/Chillee
ghstack dependencies: #132969
2024-08-13 16:09:39 +00:00
660436d843 Convert Periodic to use Amazon2023 runners (#133036)
Fixes #ISSUE_NUMBER

Co-authored-by: clee2000 <44682903+clee2000@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133036
Approved by: https://github.com/clee2000, https://github.com/zxiiro
2024-08-13 15:59:50 +00:00
cyy
2f30473fba [19/N] Fix clang-tidy warnings in jit (#133067)
Follows  #132963
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133067
Approved by: https://github.com/Skylion007
2024-08-13 15:59:43 +00:00
2e7d67e6af Migrate slow.yml jobs to use runner determinator (#133232)
Update the jobs in slow.yml to use the runner determinator script.

Closes: pytorch/ci-infra#259

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133232
Approved by: https://github.com/ZainRizvi
2024-08-13 15:44:55 +00:00
c518b50c4c Remove functorch dispatch keys in legacyExtractDispatchKey (#133018)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133018
Approved by: https://github.com/zou3519
2024-08-13 15:32:01 +00:00
cd565bc455 Refactor process_inputs outside of create_aot_dispatcher_function (#130962)
This PR refactors process_inputs so that it occurs earlier outside of create_aot_dispatcher_function for the purpose of calculating a cache key with the inputs after they have been processed.

This way, if tensors have symint sizes/strides, we successfully factor that into the cache key instead of specializing on every possible size and stride. Test that utilizes this incoming.

# Guard behavior
Note that it's technically possible for tensors with symint arguments to introduce guards in aot_dispatch, if they trace through decompositions that branch on tensor size/stride. This can result in multiple graph modules with differing guards having the same key in the cache.

FXGraphCache has this same issue, and the remote FXGraphCache intentionally does not handle this: instead it only saves the first result in the cache, and cache misses if guards miss. The local FXGraphCache does handle this by storing multiple files and iterating through them, but we opt not to introduce that complexity just yet for AOTAutogradCache until we deem it necessary (i.e., models appear where saving multiple cache results with different guards but the same cache key becomes important). Instead, AOTAutogradCache will save a single entry per result, overriding it if it cache misses due to guards.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130962
Approved by: https://github.com/bdhirsh
2024-08-13 14:56:00 +00:00
4cca18d5b6 Revert "Update fused kernels and call _safe_softmax from SDPA (#131863)"
This reverts commit e61def65d5c6268e79f52776f75277ee60f01462.

Reverted https://github.com/pytorch/pytorch/pull/131863 on behalf of https://github.com/albanD due to Broke forward AD tests in main ([comment](https://github.com/pytorch/pytorch/pull/131863#issuecomment-2286432628))
2024-08-13 14:44:08 +00:00
095c5ccf9f [CD] Change XPU nightly build back to ABI=0 (#132854)
Works for https://github.com/pytorch/pytorch/issues/114850
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132854
Approved by: https://github.com/atalman
2024-08-13 13:46:29 +00:00
cyy
e0a5536cc9 [2/N] Fix clang-tidy warnings in torch/csrc/autograd (#133295)
Follows #133180
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133295
Approved by: https://github.com/Skylion007
2024-08-13 13:23:46 +00:00
7756175273 Add Sleef Implementation for maximum Kernel for ARM (#131642)
The NEON Vectorized<float> implementation does not use SLEEF functions for maximum Implementation. So updated maximum function with sleef calls for better performance on graviton3.It showed good performance improvement in LLM models.
The results are taken in graviton3 machine as follows:
<img width="268" alt="perf_result" src="https://github.com/user-attachments/assets/8c575873-b985-44e1-ba8e-880fe6494c5f">

This maximum kernel is used in softmax. The performance timing of softmax with default and sleef change is as below:(graviton3 machine)
<img width="265" alt="softmax" src="https://github.com/user-attachments/assets/3be22c0e-7c99-407e-a8d1-891cb1e035ad">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131642
Approved by: https://github.com/snadampal, https://github.com/jgong5
2024-08-13 11:08:14 +00:00
40061bd61e [export] overwrite placeholder names when deepcopying (#133269)
In joint-graph export we have a `copy.deepcopy(ep.graph_module)` call. This turns out to be an imperfect deepcopy, because deepcopy allows objects to overwrite their `__deepcopy__` methods. For fx.Graph, this ends up deferring to `Graph.create_node()`, which checks the graph namespace, and can avoiding copying the exact name in niche examples, like where the name is a Python keyword (e.g. `input` gets renamed to `input_1`).

Names like `input` happen because export's placeholder naming pass overwrites what the namespace creates, based on the model's `forward()` signature. So we can either 1) avoid overwriting such cases, which requires rewriting the naming pass logic, or 2) force another overwrite after deepcopying. This goes with 2).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133269
Approved by: https://github.com/zhxchen17, https://github.com/dvorjackz, https://github.com/ydwu4
2024-08-13 10:20:43 +00:00
947a446be4 [executorch hash update] update the pinned executorch hash (#131420)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned executorch hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131420
Approved by: https://github.com/pytorchbot
2024-08-13 08:30:51 +00:00
9f17037e8b [dtensor] move tensor constructors to the api module (#133129)
This is to ensure __init__.py only contain public APIs

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133129
Approved by: https://github.com/awgu, https://github.com/tianyu-l
2024-08-13 06:09:56 +00:00
cyy
50e837d9c2 [10/N] Fix clang-tidy warnings in aten/src/ATen (#133155)
Follows  #132842

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133155
Approved by: https://github.com/janeyx99, https://github.com/ezyang
2024-08-13 03:48:58 +00:00
cyy
af7830e353 [1/N] Fix clang-tidy warnings in torch/csrc/autograd (#133180)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133180
Approved by: https://github.com/albanD
2024-08-13 03:36:10 +00:00
4671e98656 [export] fix node.users when inlining HOOs (#133144)
The process of inlining HOO subgraphs (e.g. set_grad_enabled) seems to break node.users when a node is present in multiple subgraphs, for example:
```
class SetGradCase(torch.nn.Module):
    def forward(self, x):
        _x = x.shape[0] + 2
        _xx = _x + 2
        with torch.no_grad():
            y = _x * 4
        return _xx, y
```

The `_x` node contains 2 users (_xx and y) after being inlined, but with inspection it seems to only contain y as a user.

Previously we were completely clearing node.users for output nodes in HOO subgraphs before inlining them - we should just be deleting the subgraph output nodes
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133144
Approved by: https://github.com/larryliu0820, https://github.com/ydwu4
2024-08-13 03:21:42 +00:00
fa36eba77d Turn off remote caching in unit tests unless explicitly on (#133258)
Summary: This PR turns off remote caching in unit tests unless the unit test explicitly turns it on.

Test Plan: existing tests

Differential Revision: D61152154

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133258
Approved by: https://github.com/masnesral
2024-08-13 02:49:43 +00:00
1e9bedf688 Add _codecs.encode and builtins.bytearray to _get_allowed_globals to support bytes and bytearray serialization (#133189)
Fixes #133163

Debugged in collaboration with @hariveliki

The `byte` type is demanding the global `_codecs.encode`. That means, the following currently works:
```python
import torch

torch.save(b'hello', '/tmp/dummy.pth')

torch.serialization.add_safe_globals([_codecs.encode])
torch.load('/tmp/dummy.pth', weights_only=True)
```

Similarly, `bytearray` needs `builtins.bytearray`.

Following the `torch.loads` docs promise, both types should be supported without `add_safe_globals` as they are both primitive types:
>         weights_only: Indicates whether unpickler should be restricted to
>            loading only tensors, primitive types, dictionaries
>           and any types added via :func:`torch.serialization.add_safe_globals`.

This PR adds both `_codecs.encode` and `builtins.bytearray` to `_get_allowed_globals` and test for saving and loading of both types with and without `weights_only`.

Co-authored-by: hariveliki <98284163+hariveliki@users.noreply.github.com>
Co-authored-by: mikaylagawarecki <mikaylagawarecki@gmail.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133189
Approved by: https://github.com/mikaylagawarecki
2024-08-13 02:20:28 +00:00
f1c439cbed AutoHeuristic: refactoring (#133170)
This PR refactors train_decision.py and adds some basic logging, which I'll extend in another PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133170
Approved by: https://github.com/Chillee
2024-08-13 01:46:53 +00:00
cyy
e76f0e0646 Remove QNNPACK reference from setup.py (#133177)
QNNPACK has been removed from third party
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133177
Approved by: https://github.com/albanD
2024-08-13 01:19:12 +00:00
7be77658e9 [Inductor] support masked vectorization for the tail_loop for INT8 datatype (#131155)
This PR supports masked vectorization for the tail_loop for torch.uint8 and torch.int8 datatype to improve performance.
BTW, I fixed the UT of `byte` by setting the range of the sample inputs  to [0, 255] since the range of `torch.uint8` is [0, 255].

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131155
Approved by: https://github.com/jgong5, https://github.com/leslie-fang-intel, https://github.com/jansel
ghstack dependencies: #130724
2024-08-13 01:12:05 +00:00
370b072d8d [Inductor] support masked vectorization for the tail_loop of the 2d tiles kernel (#130724)
This PR supports masked vectorization for the tail_loop of the 2d tiles kernel to improve the performance.

Example:
```
import torch

def fn(a):
    return torch.permute(a, (2, 0, 1)).contiguous()

input = torch.randn(2, 20, 40)
compiled_fn = torch.compile(fn)

with torch.no_grad():
    for _ in range(3):
        compiled_fn(input)
```

Generated code:
- Before:
```
cpp_fused_clone_0 = async_compile.cpp_pybinding(['const float*', 'float*'], '''
#include "/tmp/torchinductor_jiayisun/z2/cz2ry4ghylembzwx7hkbanur76fi3mkiu7s6jm3zdi2amy5egq4b.h"
extern "C"  void kernel(const float* in_ptr0,
                       float* out_ptr0)
{
    {
        #pragma GCC ivdep
        for(long x0=static_cast<long>(0L); x0<static_cast<long>(32L); x0+=static_cast<long>(16L))
        {
            #pragma GCC ivdep
            for(long x1=static_cast<long>(0L); x1<static_cast<long>(32L); x1+=static_cast<long>(16L))
            {
                float tmp0[16*16] __attribute__ ((aligned (16)));
                at::vec::transpose_mxn<float,16,16>(in_ptr0 + static_cast<long>(x0 + (40L*x1)), static_cast<long>(40L), tmp0, 16);
                for (long x0_inner = 0; x0_inner < 16; x0_inner++)
                {
                    auto tmp1 = at::vec::Vectorized<float>::loadu(tmp0 + static_cast<long>(16L*x0_inner), 16);
                    tmp1.store(out_ptr0 + static_cast<long>(x1 + (40L*x0) + (40L*x0_inner)));
                }
            }
            #pragma GCC ivdep
            for(long x1=static_cast<long>(32L); x1<static_cast<long>(40L); x1+=static_cast<long>(1L))
            {
                auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + static_cast<long>(x0 + (40L*x1)), 16);
                [&]
                {
                    __at_align__ std::array<float, 16> tmpbuf;
                    tmp0.store(tmpbuf.data(), 16);
                    #pragma GCC unroll 16
                    for (long x0_inner = 0; x0_inner < 16; x0_inner++)
                    {
                        out_ptr0[static_cast<long>(x1 + (40L*x0) + (40L*x0_inner))] = tmpbuf[x0_inner];
                    }
                }
                ()
                ;
            }
        }
        #pragma GCC ivdep
        for(long x0=static_cast<long>(32L); x0<static_cast<long>(40L); x0+=static_cast<long>(1L))
        {
            #pragma GCC ivdep
            for(long x1=static_cast<long>(0L); x1<static_cast<long>(40L); x1+=static_cast<long>(1L))
            {
                auto tmp0 = in_ptr0[static_cast<long>(x0 + (40L*x1))];
                out_ptr0[static_cast<long>(x1 + (40L*x0))] = tmp0;
            }
        }
    }
}
''')

async_compile.wait(globals())
del async_compile

def call(args):
    arg0_1, = args
    args.clear()
    assert_size_stride(arg0_1, (2, 20, 40), (800, 40, 1))
    buf0 = empty_strided_cpu((40, 2, 20), (40, 20, 1), torch.float32)
    cpp_fused_clone_0(arg0_1, buf0)
    del arg0_1
    return (buf0, )
```

- After:
```
cpp_fused_clone_0 = async_compile.cpp_pybinding(['const float*', 'float*'], '''
#include "/tmp/torchinductor_jiayisun/z2/cz2ry4ghylembzwx7hkbanur76fi3mkiu7s6jm3zdi2amy5egq4b.h"
extern "C"  void kernel(const float* in_ptr0,
                       float* out_ptr0)
{
    {
        #pragma GCC ivdep
        for(long x0=static_cast<long>(0L); x0<static_cast<long>(32L); x0+=static_cast<long>(16L))
        {
            #pragma GCC ivdep
            for(long x1=static_cast<long>(0L); x1<static_cast<long>(32L); x1+=static_cast<long>(16L))
            {
                float tmp0[16*16] __attribute__ ((aligned (16)));
                at::vec::transpose_mxn<float,16,16>(in_ptr0 + static_cast<long>(x0 + (40L*x1)), static_cast<long>(40L), tmp0, 16);
                for (long x0_inner = 0; x0_inner < 16; x0_inner++)
                {
                    auto tmp1 = at::vec::Vectorized<float>::loadu(tmp0 + static_cast<long>(16L*x0_inner), 16);
                    tmp1.store(out_ptr0 + static_cast<long>(x1 + (40L*x0) + (40L*x0_inner)));
                }
            }
            #pragma GCC ivdep
            for(long x1=static_cast<long>(32L); x1<static_cast<long>(40L); x1+=static_cast<long>(8L))
            {
                float tmp0[16*8] __attribute__ ((aligned (16)));
                at::vec::transpose_mxn<float,8,16>(in_ptr0 + static_cast<long>(x0 + (40L*x1)), static_cast<long>(40L), tmp0, 8);
                for (long x0_inner = 0; x0_inner < 16; x0_inner++)
                {
                    auto tmp1 = at::vec::Vectorized<float>::loadu(tmp0 + static_cast<long>(8L*x0_inner), 8);
                    tmp1.store(out_ptr0 + static_cast<long>(x1 + (40L*x0) + (40L*x0_inner)), 8);
                }
            }
        }
        #pragma GCC ivdep
        for(long x0=static_cast<long>(32L); x0<static_cast<long>(40L); x0+=static_cast<long>(8L))
        {
            #pragma GCC ivdep
            for(long x1=static_cast<long>(0L); x1<static_cast<long>(32L); x1+=static_cast<long>(16L))
            {
                float tmp0[8*16] __attribute__ ((aligned (16)));
                at::vec::transpose_mxn<float,16,8>(in_ptr0 + static_cast<long>(x0 + (40L*x1)), static_cast<long>(40L), tmp0, 16);
                for (long x0_inner = 0; x0_inner < 8; x0_inner++)
                {
                    auto tmp1 = at::vec::Vectorized<float>::loadu(tmp0 + static_cast<long>(16L*x0_inner), 16);
                    tmp1.store(out_ptr0 + static_cast<long>(x1 + (40L*x0) + (40L*x0_inner)));
                }
            }
            #pragma GCC ivdep
            for(long x1=static_cast<long>(32L); x1<static_cast<long>(40L); x1+=static_cast<long>(8L))
            {
                float tmp0[8*8] __attribute__ ((aligned (16)));
                at::vec::transpose_mxn<float,8,8>(in_ptr0 + static_cast<long>(x0 + (40L*x1)), static_cast<long>(40L), tmp0, 8);
                for (long x0_inner = 0; x0_inner < 8; x0_inner++)
                {
                    auto tmp1 = at::vec::Vectorized<float>::loadu(tmp0 + static_cast<long>(8L*x0_inner), 8);
                    tmp1.store(out_ptr0 + static_cast<long>(x1 + (40L*x0) + (40L*x0_inner)), 8);
                }
            }
        }
    }
}
''')

async_compile.wait(globals())
del async_compile

def call(args):
    arg0_1, = args
    args.clear()
    assert_size_stride(arg0_1, (2, 20, 40), (800, 40, 1))
    buf0 = empty_strided_cpu((40, 2, 20), (40, 20, 1), torch.float32)
    cpp_fused_clone_0(arg0_1, buf0)
    del arg0_1
    return (buf0, )
```

Co-authored-by: CaoE <e.cao@intel.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130724
Approved by: https://github.com/jgong5, https://github.com/leslie-fang-intel, https://github.com/jansel
2024-08-13 01:02:24 +00:00
e61def65d5 Update fused kernels and call _safe_softmax from SDPA (#131863)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131863
Approved by: https://github.com/jbschlosser
2024-08-13 00:51:55 +00:00
00aa086298 Revert "[dtensor] move tensor constructors to a separate module (#133129)"
This reverts commit e890d888d916b4f38b383a59e0e9445513c67313.

Reverted https://github.com/pytorch/pytorch/pull/133129 on behalf of https://github.com/fbgheith due to breaking internal tests ([comment](https://github.com/pytorch/pytorch/pull/133129#issuecomment-2285090400))
2024-08-12 23:55:08 +00:00
89670d5bdd Revert "Inductor-CPU WoQ int8 GEMM micro-kernel with scale epilogue (#131887)"
This reverts commit 8fbd7d92a81b61d41363edb1b3902ba7701d5a27.

Reverted https://github.com/pytorch/pytorch/pull/131887 on behalf of https://github.com/fbgheith due to breaking internal builds ([comment](https://github.com/pytorch/pytorch/pull/131887#issuecomment-2285082401))
2024-08-12 23:45:46 +00:00
844103197d Revert "[2/2] PT2 Inductor ComboKernels - automatic horizontal fusing (#131675)"
This reverts commit bb6eef8ed1de0eb48bde10a07da57b6acc82fb05.

Reverted https://github.com/pytorch/pytorch/pull/131675 on behalf of https://github.com/fbgheith due to breaking internal tests ([comment](https://github.com/pytorch/pytorch/pull/131675#issuecomment-2285069508))
2024-08-12 23:31:16 +00:00
656465fc77 Revert "Conversions between strided and jagged layouts for Nested Tensors (#115749)"
This reverts commit ed97fb77f9a9d9d815f4975caccbc961ebbcb714.

Reverted https://github.com/pytorch/pytorch/pull/115749 on behalf of https://github.com/izaitsevfb due to fails internal jobs, see [S440348](https://www.internalfb.com/sevmanager/view/440348) ([comment](https://github.com/pytorch/pytorch/pull/115749#issuecomment-2285051164))
2024-08-12 23:14:19 +00:00
d4b31f7bcf Refactor BlockMask constructorr and add Factory func (#132969)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132969
Approved by: https://github.com/Chillee
2024-08-12 22:38:42 +00:00
e553ef69d0 [BE] Fix typo (#133247)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133247
Approved by: https://github.com/c-p-i-o, https://github.com/zxiiro
2024-08-12 21:58:55 +00:00
8585dee85d [inductor] Add some more reinplacing tests (#132839)
Also add some tests around the counters we added in a previous PR.

Test Plan:
- new tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132839
Approved by: https://github.com/eellison
2024-08-12 21:34:45 +00:00
592682fe22 Migrate nightly.yml to use runner determinator (#133225)
Updates the nightly.yml jobs to use the runner determinator script.

Closes: pytorch/ci-infra#260

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133225
Approved by: https://github.com/ZainRizvi
2024-08-12 21:25:55 +00:00
80ed3e9ccd s/dipatch/dispatch/g (#133192)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133192
Approved by: https://github.com/albanD
2024-08-12 20:26:58 +00:00
4f0d5f6551 Pin sympy to 1.13.1 (#133235)
Sympy 1.13.2 release yesterday, and it results in test failures on windows and mac

454713fe9d/1

Hopefully these are the places it needs to get pinned
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133235
Approved by: https://github.com/atalman, https://github.com/ZainRizvi
2024-08-12 20:10:09 +00:00
36c4ed8e49 [inductor] add FreeLibrary to DLLWrapper for Windows. (#133184)
For previous PR https://github.com/pytorch/pytorch/pull/132630 . We found `DLLWrapper` class doesn't have `_dlclose` implemention for Windows.

I write a small test project to figure out how to make it works on Windows: https://github.com/xuhancn/ctypes_all_lifecycle/blob/main/pysrc/module_manage.py#L30-L61
Test result: https://github.com/xuhancn/ctypes_all_lifecycle/tree/main?tab=readme-ov-file#ctypes_cyclepy

So, I have port the Windows FreeLibrary implemention to pytorch DLLWrapper in this PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133184
Approved by: https://github.com/jgong5, https://github.com/jansel
2024-08-12 19:55:48 +00:00
cdcc7dc891 update comit pin for xla (#133120)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133120
Approved by: https://github.com/janeyx99
2024-08-12 19:38:37 +00:00
cc1cc71c46 [MPS] Fix relu for 0-element input case (#133191)
Fixes #133182

Should already be tested by `test/test_mps.py::MPSReluTest::testNumbersGPU`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133191
Approved by: https://github.com/albanD
2024-08-12 19:24:17 +00:00
666362865c [test/profiler] Make test_profiler_pattern_matcher_json_report write … (#133009)
Makes it possible to run `test/profiler/test_profiler.py#test_profiler_pattern_matcher_json_report` on CI environments where the test runner doesn't have write permissions to the current-working-directory.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133009
Approved by: https://github.com/zou3519
2024-08-12 18:56:50 +00:00
fa1d7b0262 Revert "Remove unused Caffe2 macros (#132979)"
This reverts commit da65cfbdea4f1f2176f6242004bda940a24f9ddb.

Reverted https://github.com/pytorch/pytorch/pull/132979 on behalf of https://github.com/ezyang due to these are apparently load bearing internally ([comment](https://github.com/pytorch/pytorch/pull/132979#issuecomment-2284666332))
2024-08-12 18:34:56 +00:00
afb73d253c [custom_ops] torch.library.{custom_op, register_kernel} disable Dynamo (#133125)
We promise the user that these custom ops (and their kernels) are black
boxes w.r.t. torch.compile. Unfortunately Dynamo can turn itself back
on in the implementation of the custom operator, so we force it off by
disabling Dynamo

Test Plan:
- new tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133125
Approved by: https://github.com/ezyang
2024-08-12 18:29:18 +00:00
d53dfa4680 [BE] Raise when the target model has scalar parameters (#132934)
Address the issue, https://github.com/pytorch/pytorch/issues/130810.

Both FSDP1 and FSDP2 do not support scalar parameters. For FSDP1, the issue happens during state_dict operations while FSDP2 fails during the initialization. This PR adds exceptions to help users debug the issue and change the scalar parameters to 1D parameters.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132934
Approved by: https://github.com/awgu, https://github.com/wz337
2024-08-12 18:28:02 +00:00
0e4c0ef29f fix type of eta_min parameter in CosineAnnealing (int -> float) (#132482)
This fixes errors with type checkers such as `pyright`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132482
Approved by: https://github.com/janeyx99
2024-08-12 18:22:26 +00:00
e7d8d73582 [minor] Correct in-code documentation for complex numbers and LBFGS (#133020)
To @lezcano's credit, this should be associative, as floating point add is actually commutative per IEEE754.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133020
Approved by: https://github.com/soulitzer, https://github.com/lezcano
2024-08-12 18:04:11 +00:00
d51e5467fd TunableOp unconditionally add all validators (#132464)
For workloads that only exercised scaled_mm, the csv result file would not contain the same set of validators as a gemm workload.  Trying to reuse the same csv file between workloads would discard the file.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132464
Approved by: https://github.com/zixi-qi
2024-08-12 17:35:00 +00:00
d61815cb7d [torch][ao] Use returned model from Quantizer.transform_for_annotation in prepare_pt2e (#132893)
Summary:
The Quantizer subclass can return a new model from `transform_for_annotation`,
and this is common if it uses any ExportPass subclass which does not mutate in-place.

Use the returned model instead of assuming its the same.

Differential Revision: D60869676

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132893
Approved by: https://github.com/jerryzh168
2024-08-12 17:23:19 +00:00
1371c420c3 Migrate binary builds to use Amazon2023 runners (#131826)
A continuation of the migration started in
- https://github.com/pytorch/pytorch/pull/131250

Migrates all linux binary builds.

The failures are windows jobs which aren't touched by this PR

prev runs (for tracking):
- https://hud.pytorch.org/pytorch/pytorch/pull/131826?sha=e1ee074b1e7b17008e3f3774e4842b5e1d4c1355
- https://hud.pytorch.org/pytorch/pytorch/pull/131826?sha=50a3488ae776f86bd6bead8b048b051c49a25ec7
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131826
Approved by: https://github.com/malfet
2024-08-12 17:18:55 +00:00
b06959e614 [export] change deepcopy to copy in _replace_with_hop passes (#133142)
Summary:
Add back the change in 19897a1647.

The change was lost in refactoring due to a bad rebase.

Test Plan:
CI

```
buck2 run 'fbcode//mode/dev-nosan'  fbcode//torchrec/distributed/tests:test_pt2 -- --filter-text test_sharded_quant_fpebc_non_strict_export
```

Differential Revision: D61052687

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133142
Approved by: https://github.com/ydwu4
2024-08-12 17:15:04 +00:00
3128640c31 [Memory Snapshot][Viz] Show event timestamps if collected (#132523)
Summary: Since we've been capturing timestamps for awhile (since https://github.com/pytorch/pytorch/pull/112266), we can surface this into the UI. This can be useful to correlate with timing of other events.

Test Plan:
Ran it locally.

![image](https://github.com/user-attachments/assets/8b3922e8-1ae2-4b09-aa13-20b2b8237064)

Differential Revision: D60673800

Pulled By: aaronenyeshi

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132523
Approved by: https://github.com/tianfengfrank, https://github.com/zdevito
2024-08-12 16:12:04 +00:00
454713fe9d Add inductor-cu124, inductor-rocm to upload test stats (#133143)
Forgot to add them in https://github.com/pytorch/pytorch/issues/128250 and https://github.com/pytorch/pytorch/issues/131637

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133143
Approved by: https://github.com/huydhn
2024-08-12 15:51:51 +00:00
9641abe97a Revert "[export] change deepcopy to copy in _replace_with_hop passes (#133142)"
This reverts commit 2d71f03db124bd1517627d34896dd2d9248227af.

Reverted https://github.com/pytorch/pytorch/pull/133142 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/133142#issuecomment-2284327241))
2024-08-12 15:48:11 +00:00
e9eb8795bb Revert "[Memory Snapshot][Viz] Show event timestamps if collected (#132523)"
This reverts commit 27c44c884e28c9378677fb295a528c36c429c3f7.

Reverted https://github.com/pytorch/pytorch/pull/132523 on behalf of https://github.com/clee2000 due to broke some tests on mac ex export/test_retraceability.py::RetraceExportTestExport::test_disable_forced_specializations_ok_retraceability [GH job link](https://github.com/pytorch/pytorch/actions/runs/10344621336/job/28630686528) [HUD commit link](27c44c884e) Possibly a landrace since I see that some of the failing tests ran on the PR ([comment](https://github.com/pytorch/pytorch/pull/132523#issuecomment-2284312426))
2024-08-12 15:42:07 +00:00
26b0a0c2f3 Fix fsdp_state_dict_type_without_warnings (#132621)
Do actually ignore the warnings. Otherwise this is a no-op.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132621
Approved by: https://github.com/fegin
2024-08-12 10:33:09 +00:00
f5e704a6f2 Add instruction count benchmark to run on pull requests (#131475)
This PR only adds the execution of the benchmarks on this PR and print results, following diffs will add checking out head~1 and running it and comparing.

to access results goto test pr_time_benchmarks and inspect logs:
you should see
```
+ echo 'benchmark results on current PR: '
benchmark results on current PR:
+ cat /var/lib/jenkins/workspace/test/test-reports/pr_time_benchmarks_before.txt
update_hint_regression,instruction_count,27971461254
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131475
Approved by: https://github.com/ezyang
2024-08-12 05:20:26 +00:00
27c44c884e [Memory Snapshot][Viz] Show event timestamps if collected (#132523)
Summary: Since we've been capturing timestamps for awhile (since https://github.com/pytorch/pytorch/pull/112266), we can surface this into the UI. This can be useful to correlate with timing of other events.

Test Plan:
Ran it locally.

![image](https://github.com/user-attachments/assets/8b3922e8-1ae2-4b09-aa13-20b2b8237064)

Differential Revision: D60673800

Pulled By: aaronenyeshi

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132523
Approved by: https://github.com/tianfengfrank, https://github.com/zdevito
2024-08-12 01:48:23 +00:00
7f08b73980 Revert "[Memory Snapshot][Viz] Show event timestamps if collected (#132523)"
This reverts commit 456909e5d350810e941290ee61c1dfc3315a9a69.

Reverted https://github.com/pytorch/pytorch/pull/132523 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/132523#issuecomment-2282925079))
2024-08-11 23:33:37 +00:00
456909e5d3 [Memory Snapshot][Viz] Show event timestamps if collected (#132523)
Summary: Since we've been capturing timestamps for awhile (since https://github.com/pytorch/pytorch/pull/112266), we can surface this into the UI. This can be useful to correlate with timing of other events.

Test Plan:
Ran it locally.

![image](https://github.com/user-attachments/assets/8b3922e8-1ae2-4b09-aa13-20b2b8237064)

Differential Revision: D60673800

Pulled By: aaronenyeshi

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132523
Approved by: https://github.com/tianfengfrank, https://github.com/zdevito
2024-08-11 23:27:48 +00:00
2d71f03db1 [export] change deepcopy to copy in _replace_with_hop passes (#133142)
Summary:
Add back the change in 19897a1647.

The change was lost in refactoring due to a bad rebase.

Test Plan:
CI

```
buck2 run 'fbcode//mode/dev-nosan'  fbcode//torchrec/distributed/tests:test_pt2 -- --filter-text test_sharded_quant_fpebc_non_strict_export
```

Differential Revision: D61052687

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133142
Approved by: https://github.com/ydwu4
2024-08-11 21:47:52 +00:00
e7b870c88b mixed_mm: fix segfault when allow_tf32=True (#133173)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133173
Approved by: https://github.com/Chillee
2024-08-11 15:02:24 +00:00
04f37ed57d Add support for returning LSE from FlexAttention (and also differentiating through it) (#133159)
This PR changes the "contract" of `flex_attention_hop` to return LSE in base 2. However, we undo that and return LSE in base e from the `flex_attention` frontend.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133159
Approved by: https://github.com/yanboliang
2024-08-11 10:29:16 +00:00
78ccbad678 [inductor] remove dtype check/assert for reduction vec and support bool for min/max (#132473)
This PR is to remove the dtype check/assert for vectorized reduction. And support bool for min/max reduction.

After removing dtype check and assertion, failed on UT.
```
PYTORCH_OPINFO_SAMPLE_INPUT_INDEX=0 python test/inductor/test_torchinductor_opinfo.py -k TestInductorOpInfoCPU.test_comprehensive_max_reduction_no_dim_cpu_bool
```
Now it is supported, generated code:
```
cpp_fused_max_0 = async_compile.cpp_pybinding(['const bool*', 'bool*'], '''
#include "/tmp/torchinductor_root/xf/cxf75ftbahznonqovnsugw7v6sldrabizgtx3j4rhgdmu3r36wlu.h"
extern "C"  void kernel(const bool* in_ptr0,
                       bool* out_ptr0)
{
    {
        {
            bool tmp_acc0 = std::numeric_limits<bool>::min();
            at::vec::VecMask<float,1> tmp_acc0_vec = at::vec::VecMask<float,1>::from(std::numeric_limits<bool>::min());
            for(long x0=static_cast<long>(0L); x0<static_cast<long>(112L); x0+=static_cast<long>(16L))
            {
                auto tmp0 = at::vec::VecMask<float,1>::from(in_ptr0 + static_cast<long>(x0));
                tmp_acc0_vec = tmp_acc0_vec | tmp0;
            }
            #pragma omp simd simdlen(8)
            for(long x0=static_cast<long>(112L); x0<static_cast<long>(125L); x0+=static_cast<long>(1L))
            {
                auto tmp0 = in_ptr0[static_cast<long>(x0)];
                tmp_acc0 = max_propagate_nan(tmp_acc0, tmp0);
            }
            tmp_acc0 = max_propagate_nan(tmp_acc0, tmp_acc0_vec.all_zero());
            out_ptr0[static_cast<long>(0L)] = static_cast<bool>(tmp_acc0);
        }
    }
}
''')
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132473
Approved by: https://github.com/jgong5
2024-08-11 08:37:54 +00:00
79ca596dc6 Optimize test_transformers.py (#133049)
- Reduced number of skipped test cases
- Merged redundant test cases

**Benchmark:**

| | Original | New |
| ----- | ----- | ----- |
| Run time | 60 mins | 35 mins |
| Total tests | 75k | 18k |
| Skipped tests | 20k | 4k |

_These are approximate numbers from running test_transformers.py on a single H100, and can change based on the device._

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133049
Approved by: https://github.com/drisspg
2024-08-11 05:20:58 +00:00
a7912bf9dc Make step != 0 test in slice irrefutable (#133091)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133091
Approved by: https://github.com/bdhirsh
2024-08-10 23:56:45 +00:00
cyy
5b7b3e4af0 Fix some issues detected by static analyzer (#132970)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132970
Approved by: https://github.com/ezyang
2024-08-10 16:02:46 +00:00
92f650c5b3 [Inductor][Intel GPU] Support codegen empty_strided_xpu, align with #118255. (#126678)
[Inductor][Intel GPU] Support codegen empty_strided_xpu, align with #118255.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126678
Approved by: https://github.com/EikanWang, https://github.com/jansel, https://github.com/eellison
2024-08-10 14:33:39 +00:00
4a3a30c36e [inductor] remove deprecated cpp_builder implementation. (#133161)
I have worked with @henrylhtsang to switch the cpp_builder to new one. We have removed the dependency to the old implementation.
So, it is time to remove the old implementation now. This PR is done the change.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133161
Approved by: https://github.com/ezyang
2024-08-10 14:21:22 +00:00
cyy
32be3e942c Remove -Wno-error=pedantic from CMake (#133074)
The codebase is largely clean so that we can turn it on.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133074
Approved by: https://github.com/ezyang
2024-08-10 13:11:21 +00:00
b9922f7a5a [compiled autograd][cpp node] No recaptures from saved float scalars (#133048)
Partially addresses https://github.com/pytorch/pytorch/issues/130170 for float scalars saved from forward pass of a custom c++ autograd function. With this PR, compiled autograd no longer recaptures when the float value changes, but downstream support isn't there yet: 4bdb4bbd86/torch/_dynamo/config.py (L58-L61)

Currently, any non-tensors passed in ctx->saved_data is specialized on by compiled autograd. To stop specializing on float values, we lift the float. We also require user code to use IValue::toSymFloat instead of IValue::toDouble in order to swap the SymFloat to proxy during compiled autograd tracing

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133048
Approved by: https://github.com/jansel
ghstack dependencies: #132771
2024-08-10 11:05:44 +00:00
c860889a65 [compiled autograd][cpp node] No recompiles from saved int scalars (#132771)
Addresses https://github.com/pytorch/pytorch/issues/130170 for int scalars saved from forward pass of a custom c++ autograd function

Currently, any non-tensors passed in ctx->saved_data is specialized on by compiled autograd. To stop specializing on int values, we lift the ints. We also require user code to use IValue::toSymInt instead of IValue::toInt in order to swap the SymInt to proxy during compiled autograd tracing

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132771
Approved by: https://github.com/jansel
2024-08-10 11:05:44 +00:00
2ad011ca73 [inductor] remove debug code of AotCodeCompiler (#132823)
Since we switch AotCodeCompiler to new cpp_builder: https://github.com/pytorch/pytorch/pull/132766
We can remove debug code of AotCodeCompiler.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132823
Approved by: https://github.com/henrylhtsang
2024-08-10 08:04:48 +00:00
343071cd96 Fix privateuse1 backend name case (#132980)
### Problem

`get_privateuse1_backend(bool lower_case)` always returns a lower case name and `lower_case` is not used.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132980
Approved by: https://github.com/albanD
2024-08-10 07:39:54 +00:00
c8275e25a7 fix requirement for error classification (#133122)
Test Plan: none

Differential Revision: D61039300

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133122
Approved by: https://github.com/yushangdi
2024-08-10 04:59:09 +00:00
9f0d90655d [inductor] cpp_builder add dynamo time trace for compile_file (#133103)
trace `compile_file` time for cpp_builder.
Ref: https://github.com/pytorch/pytorch/pull/132328/files#diff-c9b517f8db609ffa866804dfa2689188a4fee20abacaa0b0dca91625c1b5cb8dR2224

<img width="994" alt="image" src="https://github.com/user-attachments/assets/862c7943-79dc-4d06-b398-a09595ad1295">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133103
Approved by: https://github.com/ezyang
2024-08-10 04:55:02 +00:00
cc5a57d185 Return from monitoring thread on TCPStore failure (#133150)
Summary: Pessimisticly assume that things are being torn down if TCPStore is not available and do not attempt to dump stack traces.

Test Plan:
Seeing crashes in production when Flight Recorder is enabled.
Here's the relevant mast link: https://fburl.com/mlhub/qia257xh

Reviewed By: fduwjj

Differential Revision: D61055124

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133150
Approved by: https://github.com/fduwjj
2024-08-10 03:45:00 +00:00
e888f401c5 Fix autotuning for flex_decoding (#132157)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132157
Approved by: https://github.com/drisspg, https://github.com/yanboliang
ghstack dependencies: #131559
2024-08-10 03:39:48 +00:00
05de2b2d0f Revert "Construct NJT without graph breaks" (#133145)
This reverts commit 911154271309667b55dfb963ec6384bd0048019b.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133145
Approved by: https://github.com/YuqingJ
2024-08-10 03:11:16 +00:00
e890d888d9 [dtensor] move tensor constructors to a separate module (#133129)
This is to ensure __init__.py only contain public APIs

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133129
Approved by: https://github.com/awgu, https://github.com/tianyu-l
2024-08-10 02:51:42 +00:00
8fbd7d92a8 Inductor-CPU WoQ int8 GEMM micro-kernel with scale epilogue (#131887)
## Summary

As part of #125683, this PR modifies existing CPU GEMM cpp template & micro-kernel template to enable int8 WoQ GEMM auto-tuning with AVX2, AVX512 & AMX ISAs (the latter is only available on Xeon 4th generation & beyond).

WoQ GEMM takes FP16/BF16 activations, int8 weights, and scale of the same dtype as activations.
The operation is equivalent to `torch.nn.functional.linear(x, w.to(x.dtype)) * scale`, which is essentially what the ATen op `torch.ops.aten._weight_int8pack_mm` currently does (except that weights are not cached by it). Weights will be considered constant & cached, so this implementation is suitable for inference, and not QAT. `scale` is supported as a `mul` epilogue.

Only BF16 activations have been supported in this PR because for FP16 & FP32, weight is dequantized during constant-folding pass of freezing, and then after auto-tuning, performance with a large `M` dimension may be better than either torch.ops.aten._weight_int8pack_mm, or the WoQ micro-kernel support introduced in this PR, which dequantizes `w` within the micro-kernel.
While even BF16 activations with a large `M` dimension may benefit from dequantizing `w` beforehand, for now, they would  use WoQ support in GEMM templates for auto-tuning, and then a subsequent PR would add logic for deciding whether or not to dequantize weights beforehand.

### Performance
#### AMX
Op-level speedup due to AMX micro-kernel (selected during auto-tuning) on 32 physical cores of Intel(R) Xeon(R) Platinum 8468H (of Xeon 4th generation series, codenamed Sapphire Rapids) vs. ATen kernel `torch.ops.aten._weight_int8pack_mm`. Intel OpenMP & tcmalloc were preloaded.

In a few cases with an odd `K`, the implementation being added in this PR may not perform as well as the ATen kernel, which is unrelated to this PR, though, since `test_linear_amx` also exhibits similar datapoints. In those cases, the AMX micro-kernel might be slower than AVX512 micro-kernel, so if such sets of shapes are used for auto-tuning, either the AVX512 micro-kernel implementation, or the ATen kernel would be chosen instead.

Benchmarked with unit-tests.

Tabular data at https://gist.github.com/sanchitintel/294811a86c8ff6b867c668ae2107c405?permalink_comment_id=5142442#gistcomment-5142442

The AVX512 micro-kernel was disabled to collect data for AMX micro-kernel.

#### AVX2/AVX512 micro-kernels

Tabular data at at https://gist.github.com/sanchitintel/52b5fa9c66f791be19e48e2aa6423dc4?permalink_comment_id=5142437#gistcomment-5142437

### Follow-up
1. int4 WoQ GEMM micro-kernel will also be added in a separate PR.
2. A subsequent PR would add logic for deciding whether or not to dequantize weights beforehand.

E2E perf measurement should be done with #131310.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131887
Approved by: https://github.com/jgong5, https://github.com/leslie-fang-intel, https://github.com/jansel
2024-08-10 02:01:04 +00:00
eqy
c89936eaa0 [CUDA][SDPA] Bump grad_key fudge factor in test_flash_attention_vs_math_ref_grads (#133051)
Abates failures like `ValueError: grad_key Test error 1.592235639691353e-05 is greater than threshold 1.5236437320709229e-05!` that we've seen when bringing up newer versions of CUDA

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133051
Approved by: https://github.com/drisspg, https://github.com/Skylion007
2024-08-10 01:49:30 +00:00
f037803290 Add ChromiumEventLogger, log FXGraphCache and AOTAutogradCache (#132864)
This PR implements ChromiumEventLogger in all @dynamo_timed events. For each dynamo timed call, we log:
- A start event before starting the function execution
- An end event after finishing the function execution
- An extra pair of start/end events for any phase names included in dynamo.

Separately, this also gives us the ability to log instant events. I use them to log cache hits/misses as a first step. The little arrows on the bottom of the UI are cache hits/misses, and you can look at cache details by clicking each triangle.

The outputted chromium trace events can be viewed in perfetto for a timeline of an execution. Here's what it looks like for a run of nanogpt:
![image](https://github.com/user-attachments/assets/cb9e6c7a-1acf-45e6-8a27-6651d9ae6132)

And another with warm start:
![image](https://github.com/user-attachments/assets/cd9709bc-59ef-4da1-a7dd-10b1a0ab9b8f)

Trace events are based around the JSON Event format: https://docs.google.com/document/d/1CvAClvFfyA5R-PhYUmn5OOQtYMH4h6I0nSsKchNAySU/preview

We may want to switch to the less deprecated Protobuf format later, but so far I don't see any features we care about supported there.

Internal FB employees can see a link to this in the tlparse output:
https://interncache-all.fbcdn.net/manifold/tlparse_reports/tree/logs/.tmpVi1FIl/dedicated_log_torch_trace_bb4zl_bc.log/index.html

I'll also work on logging these

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132864
Approved by: https://github.com/aorenste
2024-08-10 01:15:53 +00:00
de48d54042 [TorchRec] Add Support for FakeProcessGroup (#133039)
Summary:
# context
* use FakeProcessGroup to mimic the multi-process tests
* can use `_test_compile_fake_pg_fn` as the single-process VB compile test
```
from torchrec.distributed.tests.test_pt2_multiprocess import _test_compile_fake_pg_fn
_test_compile_fake_pg_fn(
    rank=0,
    world_size=2,
)
```

reference: D59637444

Test Plan:
# run test
* run command and results: P1519228952, [tlparse](https://interncache-all.fbcdn.net/manifold/tlparse_reports/tree/logs/.tmpwMCK1E/index.html)
```
TORCH_TRACE=/var/tmp/tt TORCH_SHOW_CPP_STACKTRACES=1 TORCH_LOGS="+all" buck2 run fbcode//mode/opt fbcode//torchrec/distributed/tests:test_pt2_multiprocess
```

Differential Revision: D56124045

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133039
Approved by: https://github.com/ezyang
2024-08-10 01:10:47 +00:00
3899465268 relax unification checks when size-like symbols can be 0 (#133112)
Test Plan: Fixes test failure in https://www.internalfb.com/diff/D51127481

Differential Revision: D61031307

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133112
Approved by: https://github.com/angelayi
2024-08-10 00:57:49 +00:00
72f2b29bb0 [CI] disable xpu kineto build (#133069)
Due to the xpu kineto support PR https://github.com/pytorch/pytorch/pull/130811 landed, but the xpu ci infra not ready for now. Disable kineto build as a temp WA
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133069
Approved by: https://github.com/seemethere
2024-08-09 23:58:50 +00:00
21302d5891 AutoHeuristic: script to generate data for mm (#131617)
This PR introduces a script that can be used to generate training data for tuned_mm in order to learn a heuristic with AutoHeuristic.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131617
Approved by: https://github.com/eellison
ghstack dependencies: #131615, #131616
2024-08-09 23:49:29 +00:00
e7512ab752 inductor mm autotuning: add back previously pruned configs (#131616)
This PR adds back 10 configs for tuned_mm that were previously removed in https://github.com/pytorch/pytorch/pull/126570. The main idea is that we use 30 configs to autotune only when data is collected with AutoHeuristic. The learned heuristic will prune these 30 configs down to 10 configs, which reduces compilation time and at the same time might improve performance.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131616
Approved by: https://github.com/eellison
ghstack dependencies: #131615
2024-08-09 23:49:29 +00:00
e5fa190e01 AutoHeuristic: tuned_mm (#131615)
This PR enables AutoHeuristic to be used for `tuned_mm`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131615
Approved by: https://github.com/eellison
2024-08-09 23:49:29 +00:00
3b440f358c [elastic collectives API] add missing rank tracing support (#132818)
Optional option to detect missing ranks (that can be mapped to host info via `rank_tracing_decoder` lambda argument) in store barrier operation.

This approach has been used in some form already, moving it to collectives API.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132818
Approved by: https://github.com/d4l3k
2024-08-09 22:55:04 +00:00
6beb2be2ed Fix _dynamo.variables.torch_function.global_mangled_class_name (#132744)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132744
Approved by: https://github.com/zou3519
2024-08-09 22:19:01 +00:00
d2ecdcb2f7 [Profiler] Add API for Dynamic Activity Toggling [2/n] (#133035)
Summary: During PT2 there are many GPU/CPU events that are unneccessary to profile in between a given step. To remedy this, we can add an API that takes in a list of activities and an arg to toggle said activies or not. For this diff we are adding the profiler API to propogate down to kineto (and in the future the collection.cpp logic). Subsequent diffs will be added for CPU toggling and e2e testing.

Test Plan: Tested by toggling backward gpu traces off and got following trace: https://www.internalfb.com/intern/perfdoctor/trace_view?filepath=tree/traces/dynocli/devvm2185.cco0.facebook.com/rank-0.Jul_31_13_40_55.3251726.pt.trace.json.gz&bucket=gpu_traces

Reviewed By: aaronenyeshi

Differential Revision: D60541767

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133035
Approved by: https://github.com/aaronenyeshi
2024-08-09 21:54:54 +00:00
b0b4723062 [c10d] Rename PG name and PG ID attribute (#132915)
As discussed in https://github.com/pytorch/pytorch/pull/132058. we think pg_uid and local_uid might be a better name for pg_name and pg_id. So this PR is trying to rename it. More PRs are needed to change on the logging and other places.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132915
Approved by: https://github.com/fegin
ghstack dependencies: #132058
2024-08-09 21:26:56 +00:00
4110cb6ba7 Add explicit GQA support. (#131559)
### tl;dr
This PR adds GQA support to higher order op `flex_attention`.

## Details
When `enable_gqa` is set to True, HOP `flex_attention(score_mod, query, key, value, block_mask, enable_gqa)` runs Group Query Attention(GQA), where the number of query heads (Hq) is a multiple of number of key/value heads (Hkv). Each group of query heads (`Hq//Hkv` heads) attends to a shared kv head.
Otherwise, `flex_attention` assumes Multi Head Attention (MHA) where the number of query heads is equal the number of kv heads.

The `score_mod` and `mask_mod` API are adapted accordingly to take `q_head` as head index.
```
def score_mod(score: torch.Tensor, batch: torch.Tensor, q_head: torch.Tensor, token_q: torch.Tensor, token_kv: torch.Tensor) -> torch.Tensor

def mask_mod(batch: torch.Tensor, q_head: torch.Tensor, token_q: torch.Tensor, token_kv: torch.Tensor) -> torch.Tensor
```

## Example
```python
import torch
from torch.nn.attention.flex_attention import flex_attention
from torch.nn.attention.flex_attention import create_block_mask

torch.manual_seed(0)

def query_key_value_clones(
    query: torch.Tensor,
    key: torch.Tensor,
    value: torch.Tensor,
    dtype: torch.dtype = None,
):
    """Clones the query, key, and value tensors and moves them to the specified dtype."""
    if dtype is None:
        dtype = query.dtype
    query_ref = query.clone().detach().to(dtype).requires_grad_(query.requires_grad)
    key_ref = key.clone().detach().to(dtype).requires_grad_(key.requires_grad)
    value_ref = value.clone().detach().to(dtype).requires_grad_(value.requires_grad)
    return query_ref, key_ref, value_ref

# Lets create some input tensors
# The input tensor has shape (batch_size, num_heads, seq_len, head_dim).
# query and key/value can have different num_heads and seq_len
# Here 8 query heads share one KV head.
query = torch.randn(2, 8, 2048, 64, device="cuda", dtype=torch.float32, requires_grad=True)
key = torch.randn(2, 2, 2048, 64, device="cuda", dtype=torch.float32, requires_grad=True)
value = torch.randn(2, 2, 2048, 64, device="cuda", dtype=torch.float32, requires_grad=True)

query1, key1, value1 = query_key_value_clones(query, key, value)

# Lets create a score_modification. We take alibi_bias as an example.
# score_mod takes batch index, query head index, query index, and key/value index.
def _generate_alibi_bias(num_kv_heads: int, num_q_heads: int):
    def _alibi_bias(
        score: torch.Tensor,
        b: torch.Tensor,
        hq: torch.Tensor,
        token_q: torch.Tensor,
        token_kv: torch.Tensor,
    ) -> torch.Tensor:
        # Let's calculate kv head from query head index
        group = num_q_heads // num_kv_heads
        hkv = hq // group

        scale = torch.exp2(-((hkv + 1) * 8.0 / num_kv_heads))
        return score + (token_kv - token_q) * scale

    return _alibi_bias

# Let's apply a casual mask on top of it
def causal_mask(b, h, q, kv):
    return q >= kv

# Generate a block mask for our new mask_mod function.
# The mask is broadcasted long head & batch dimensions.
block_mask = create_block_mask(causal_mask, B=1, H=1, Q_LEN=2048, KV_LEN=2048)

# Lets call flex_attention with our new score modification and block mask under eager mode.
output = flex_attention(query, key, value, score_mod=_generate_alibi_bias(2, 8), block_mask=block_mask, enable_gqa=True)

# Now lets compile flex_attention and run the flex_attention kernel.
compiled_flex_attention = torch.compile(flex_attention)
out_compiled = compiled_flex_attention(query1, key1, value1, score_mod=_generate_alibi_bias(2, 8), block_mask=block_mask, enable_gqa=True)

torch.testing.assert_close(output, out_compiled, atol=5e-2, rtol=2e-2)

```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131559
Approved by: https://github.com/drisspg
2024-08-09 21:25:35 +00:00
dc8bb2636c [c10d][doc] Add docs for ENV variables TORCH_NCCL_ASYNC_ERROR_HANDLING TORCH_NCCL_TRACE_CPP_STACK and TORCH_NCCL_COORD_CHECK_MILSEC (#132920)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132920
Approved by: https://github.com/fegin, https://github.com/wconstab
2024-08-09 21:08:20 +00:00
78fa32a77b Turn off Function Event Accumulation by Default (#133095)
Summary: D56956245 added the ability to accumulate FunctionEvents across multiple cycles in order to perform statistical analysis on them all together. Although this can be useful, it uses too many CPU resources especially for long running jobs. For this reason, lets add a flag to the profiler to turn off this behavior by default, but still allow users to turn it on if they wish.

Test Plan: Changed function count test to have acc_events passed in and check the amount of function events based on if flag is true or not

Differential Revision: D61021490

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133095
Approved by: https://github.com/briancoutinho, https://github.com/LucasLLC, https://github.com/aaronenyeshi
2024-08-09 20:47:20 +00:00
c44cb89e06 [export] detach constant tensors when they're not registered as buffer or parameter in unlift (#133031)
Summary:
Fixes T198245910.

In  previous diff D60532628 that causes the test failure, we fix the  in-consistency caused by constant tensors is accidentally reigistered as buffer by deleting the buffer and re assign them as constant.

However, this broke several existing tests in pyspeech when the exported program is re-traced with torch.jit.trace (which is an anti-pattern we probably should have some alignment), the jit tracer finds this constant tensor requiring grad and errors out.

This PR force constant attr not requiring grad, which is the correct behavior. A better fix is finding out where the constants are created in user code and why it requires grad. But this has low roi so we warn user about it.

Test Plan: See failures in T198245910.

Differential Revision: D60974869

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133031
Approved by: https://github.com/angelayi
2024-08-09 20:33:52 +00:00
cd307fb0b1 [FSDP2] reset FSDPParam.sharded_param in lazy_init (#132954)
motivated by FSDP2 + DoRA https://github.com/pytorch/pytorch/issues/132721

after meta init, we need a user-defined function to move DoRALinear.magnitude from device=meta to device=cuda
The problem is how to trigger reset_sharded_param or _apply to update FSDPParam. Otherwise lazy_init complains that DoRALinear.magnitude are still on device=meta

credit to @awgu for chasing after DDP lazy_init to unblock the PR

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132954
Approved by: https://github.com/awgu
ghstack dependencies: #133059
2024-08-09 20:26:10 +00:00
78cf8df4a0 [aoti] forward fix of [inductor] switch AotCodeCompiler to new cpp_builder. (take 3) (#133042)
Summary:
Forward fix of a test failure caused by D60773405.

The idea of D60773405 is that we need to use absolute path. So we will want to use the older version of path for output_so and output_o.

However, when I was copying the older definitions of output_so and output_o, I thought it was okay to simplify it a bit. See https://github.com/pytorch/pytorch/pull/131304#issuecomment-2270016609

Turns out I was wrong.

Test Plan: ci

Differential Revision: D60990594

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133042
Approved by: https://github.com/hl475, https://github.com/desertfire
2024-08-09 18:53:27 +00:00
472b0daeaa [DDP][FSDP2] keep DTensor params for replicate(fully_shard) (#133059)
current status: for `replicate(fully_shard)`, DDP lazy_init will convert DTensor into local tensor, and that breaks FSDP unshard

this PR keeps FSDP params untouched during DDP lazy_init
I came across it because of a CI error in FSDP2's unit test #132978
thanks @awgu for fix proposal

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133059
Approved by: https://github.com/Skylion007, https://github.com/fegin
2024-08-09 18:38:05 +00:00
e66084f9bf [BUG FIX] Refactor _scale_attn_mask_fusion_kernel to Use Runtime Argument Instead of Template Parameter (#132434)
**Description**

**_[BUG FIX]_**
This PR fixes a bug which happens during compilation with GCC 11.4 compiler in the FlashAttentionKernel.cpp file. This issue doesn't seem to be with PyTorch main branch but gets introduced with our SVE PR changes (https://github.com/pytorch/pytorch/pull/119571 ) + PyTorch main.

See the CI Pipeline failing in our PR:
https://github.com/pytorch/pytorch/actions/runs/9895714768/job/27336251795?pr=119571

```
/var/lib/jenkins/workspace/build/aten/src/ATen/native/cpu/FlashAttentionKernel.cpp.SVE256.cpp
during RTL pass: expand
In file included from /var/lib/jenkins/workspace/build/aten/src/ATen/native/cpu/FlashAttentionKernel.cpp.SVE256.cpp:1:
/var/lib/jenkins/workspace/aten/src/ATen/native/cpu/FlashAttentionKernel.cpp: In lambda function:
/var/lib/jenkins/workspace/aten/src/ATen/native/cpu/FlashAttentionKernel.cpp:290:57: internal compiler error: in emit_move_insn, at expr.c:3821
  290 |   at::parallel_for(0, batchSize * num_head * qSlice, 1, [&](int64_t begin, int64_t end) {
      |                                                         ^
0xffffb03f73fb __libc_start_call_main
	../sysdeps/nptl/libc_start_call_main.h:58
0xffffb03f74cb __libc_start_main_impl
	../csu/libc-start.c:392
Please submit a full bug report,
with preprocessed source if appropriate.
Please include the complete backtrace with any bug report.
See <file:///usr/share/doc/gcc-11/README.Bugs> for instructions.

[5731/6839] Building CXX object caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/native/cpu/CatKernel.cpp.SVE256.cpp.o
[5732/6839] Building CXX object caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/native/cpu/GridSamplerKernel.cpp.SVE256.cpp.o
```

This issue with compilation only happens with GCC 11.4 and works well with the latest GCC 12.3 compiler and also the Clang compiler. The issue is related to the check for `is_b_stride_zero` introduced as a template parameter (compile time check complexity) in the following commit: 5da428d9eb  which was added recently into FlashAttentionKernel.cpp file.

This PR fixes the above compilation failure with GCC 11.4 compiler.

cc : @Valentine233 @yanbing-j @mingfeima @malfet @jgong5 @r-barnes

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132434
Approved by: https://github.com/jgong5
2024-08-09 18:34:42 +00:00
b41d62a3a2 Fix typo in docs of all_gather (#133066)
Fix a typo of docs:
```
def all_gather(tensor_list, tensor, group=None, async_op=False):
...
        [tensor([0, 0], device='cuda:0'), tensor([0, 0], device='cuda:1')] # Rank 1
```
`cuda:0` should be `cuda:1`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133066
Approved by: https://github.com/awgu
2024-08-09 18:21:26 +00:00
f3eab23c42 Fix typo in mypy.ini (#133097)
A missing comma in the file list currently leads to errors when running mypy, introduced in #113745

Fixes #133096

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133097
Approved by: https://github.com/Skylion007, https://github.com/malfet
2024-08-09 18:19:51 +00:00
31ef900a65 Revert "added persistent option to buffers and namedbuffers (#132994)"
This reverts commit 8707c6dfacaed293ddc40cbb5ecf5841568df0e6.

Reverted https://github.com/pytorch/pytorch/pull/132994 on behalf of https://github.com/PaliC due to breaking internal pyre tests ([comment](https://github.com/pytorch/pytorch/pull/132994#issuecomment-2278487672))
2024-08-09 18:14:53 +00:00
6c012f7217 [c10d][Log] Use pg_id instead of pg_name for logging prefix (#132058)
When checking the logs of c10d, I found it showed that "[PG 7 rank 7]" which it actually means "[PG 1 rank 7]". So we need to use pg_id(aka, uid_) rather than pg_name_ because when creating subpgs, currently we need to call it multiple times, so this makes PG names are based on bumped up numbers (e.g, 7 rather than 1). Using pg_id is more accurate and consistent with other logging tools.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132058
Approved by: https://github.com/shengbao-zheng, https://github.com/shuqiangzhang
2024-08-09 18:14:10 +00:00
655ec07525 [ROCm] TunableOp logging improvements (#132173)
Summary:
TunableOp logging improvements:
1. PYTORCH_TUNABLEOP_VERBOSE=1: print out the expected value vs actual value for TunableOp validators, so that if validation fails, we know exactly how to fix it
2. PYTORCH_TUNABLEOP_VERBOSE=3: print out the exact kernel signature for both successful and failure cases in kernel lookup

Test Plan:
> PYTORCH_TUNABLEOP_VERBOSE=3 buck
2 run mode/{opt,amd-gpu} -c fbcode.enable_gpu_sections=true //scripts/xdwang/example:fc_llama -- --enab
le-tuning

```
reading tuning results from hipblas_tuning_pt_llama0.csv
Validator PT_VERSION=2.5.0
Validator ROCBLAS_VERSION=4.0.0-72e57364-dirty
Validator HIPBLASLT_VERSION=800-a15e4178
Validator ROCM_VERSION=6.0.0.0-12969-1544e39
Validator GCN_ARCH_NAME=gfx942:sramecc+:xnack-
GCN_ARCH_NAME validation: expect gfx942:sramecc+:xnack- to match gfx942:sramecc+:xnack-
ROCM_VERSION validation: expect 6.0.0.0-12969-1544e39 to match 6.0.0.0-12969-1544e39
HIPBLASLT_VERSION validation: expect 800-a15e4178 to match 800-a15e4178
ROCBLAS_VERSION validation: expect 4.0.0-72e57364-dirty to match 4.0.0-72e57364-dirty
PT_VERSION validation: expect 2.5.0 to match 2.5.0
Loading results
GemmTunableOp_BFloat16_TN(tn_8192_2_1024) -> Gemm_Hipblaslt_TN_61169,0.0171694
GemmTunableOp_BFloat16_TN(tn_7168_2_8192) -> Gemm_Hipblaslt_TN_61089,0.036138
GemmTunableOp_BFloat16_TN(tn_8192_2_3584) -> Gemm_Hipblaslt_TN_61169,0.0240673
missing params_signature, returning null ResultEntry for GemmTunableOp_BFloat16_TN,tn_1280_2_8192
finding fastest for GemmTunableOp_BFloat16_TN(tn_1280_2_8192) out of 2818 candidates
Rotating buffer 4 MiB. Needed Size: 20 MiB. Needed number of param copies: 1
├──tuning using warmup iters 0 [0 ms] and tuning iters 1 [0.208254 ms] instance id=0, GemmTunableOp_BFloat16_TN(tn_1280_2_8192) Default
├──offset at 3
......
ResultEntry found for GemmTunableOp_BFloat16_TN,tn_8192_2_3584
ResultEntry found for GemmTunableOp_BFloat16_TN,tn_8192_2_3584
ResultEntry found for GemmTunableOp_BFloat16_TN,tn_8192_2_3584
Avg time: 16.42832040786743 us, Achieved 7.15 TFLOPS, 3578.07 GB/s

2x1280x8192-torch.bfloat16,16.260499954223633,2.5794434438103107,1294.0669757533708
2x8192x1024-torch.bfloat16,16.15394949913025,2.0771658350056508,1041.11852032876
2x7168x8192-torch.bfloat16,25.691540241241455,9.14234887416194,4574.841325057144
2x8192x3584-torch.bfloat16,16.42832040786743,7.1486621324818085,3578.0709494714856
```

Differential Revision: D60468273

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132173
Approved by: https://github.com/mxz297, https://github.com/jeffdaily, https://github.com/eqy
2024-08-09 17:55:21 +00:00
d13e72fd6a [c10d] set a shorter heartbeat detect timeout to avoid race with NCCL timeout (#133028)
What we found recently is that:
1. Monitoring detect watchdog hang(no heartbeat) at same time as nccl timeout. This race leads to less useful debug info gets dumped to logs (such as CudaEventDestroy and GIL checker)
2. We don't kill the program if monitoring thread has not enabled but somehow still silently run the monitoring thread. Plus for users who feel this is too short, they should config TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC themselves.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133028
Approved by: https://github.com/shuqiangzhang, https://github.com/wconstab
2024-08-09 17:48:34 +00:00
574cdf1232 [export] Merge functions in replace set_grad/autocast with HOO (#132724)
Summary: as title

Test Plan: CI

Differential Revision: D60701648

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132724
Approved by: https://github.com/ydwu4
2024-08-09 17:25:07 +00:00
2dbe5cb979 [C10D] Clarify warning for concurrent PG usage (#131895)
Addresses a common misconception about safety of using multiple NCCL
process groups from PyTorch.

Notably, it IS safe to use multiple process groups, so long as
communication operations from different groups are not allowed to
overlap.  (Overlap of communication operations from one group with
compute operations IS ok).

TODO: after getting feedback on the text, update other copies of the warning on other APIs

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131895
Approved by: https://github.com/fduwjj
2024-08-09 17:06:46 +00:00
bc57d5b6ff [Inductor][CPP] Turns on inline_inbuilt_nn_modules for CPP GEMM template testing (#132487)
**Summary**
The CPP GEMM template testing has been skipped with turning on `inline_inbuilt_nn_modules ` as in https://github.com/pytorch/pytorch/issues/131929.  Since https://github.com/pytorch/pytorch/pull/132334 has landed to fix the issues. Turn on this flag back since it's default.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132487
Approved by: https://github.com/anijain2305, https://github.com/jgong5
2024-08-09 16:56:57 +00:00
23b877cb54 [inductor]a less ambitious way to slove the scalar tensor (#132702)
Fixes #121374

The previous https://github.com/pytorch/pytorch/pull/131775 was trying to convert the 0dim cpu tensor to a DynamicScalar in lowering stage. But there are so many lowering rules uncompatible with that way. So, this PR is trying to do the conversion in codegen stage.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132702
Approved by: https://github.com/eellison
2024-08-09 16:29:36 +00:00
50595ecef4 Revert "[BE] Raise when the target model has scalar parameters (#132934)"
This reverts commit ea00036841b225330396f8d8f6ecf796f4826786.

Reverted https://github.com/pytorch/pytorch/pull/132934 on behalf of https://github.com/clee2000 due to I think this broke distributed/_composable/fsdp/test_fully_shard_init.py::TestFullyShardShardedParameterTensor::test_raise_scalar_parameter [GH job link](https://github.com/pytorch/pytorch/actions/runs/10314920655/job/28563430905) [HUD commit link](ea00036841).  Dr CI is wrong, it is not flaky ([comment](https://github.com/pytorch/pytorch/pull/132934#issuecomment-2278208789))
2024-08-09 15:30:34 +00:00
065f7aa44b [inductor] tensor_is_align fallbacking False if unbacked expr not comptime evaled (#132423)
Currently if storage_offset is unbacked symbol and is_align can not be computed compiletime - it hard fails.

Doing the best we can: adding guard_size_oblivious and fallback on False if can not be evaluated compiletime

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132423
Approved by: https://github.com/ezyang
2024-08-09 15:07:42 +00:00
4bdb4bbd86 Fix fbcode AOTI GPU lowering for ARM64 hosts (#133017)
Summary: Fix fbcode AOTI GPU lowering for ARM64 hosts

Reviewed By: hl475

Differential Revision: D60969898

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133017
Approved by: https://github.com/hl475
2024-08-09 14:05:13 +00:00
f2bacd908a [BE] Move function definitions to .cpp (#132927)
Summary:
Non-functional change.

Move function definitions for NCCLTraceBuffer to .cpp files.

Test Plan:
Unit tests.

Reviewers:

Subscribers:

Tasks:

Tags:

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132927
Approved by: https://github.com/Skylion007, https://github.com/d4l3k
ghstack dependencies: #132916
2024-08-09 13:52:29 +00:00
465e071898 Revert "[CUDA][CUTLASS][submodule] Fixes for CUTLASS upgrade (#131493)"
This reverts commit 927b4c11143e047eb6e3430e4c7c912064572f1b.

Reverted https://github.com/pytorch/pytorch/pull/131493 on behalf of https://github.com/nmacchioni due to breaking many tests ([comment](https://github.com/pytorch/pytorch/pull/131493#issuecomment-2277738114))
2024-08-09 11:30:23 +00:00
f565d16acb Fix work-around item non-sync issue on AMD (#133054)
Summary: Otherwise it will break FSDP code paths

Test Plan:
unit test

see next diff for print message
```
sh ./scripts/lufang/amd/small_repro.sh
ROCM_GET_SCALAR_ITEM_SYNC=1 sh ./scripts/lufang/amd/small_repro.sh
```

It will log "====== Async mode ======" or "====== Sync mode ======" correspondingly

Differential Revision: D60995134

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133054
Approved by: https://github.com/houseroad
2024-08-09 09:22:29 +00:00
927b4c1114 [CUDA][CUTLASS][submodule] Fixes for CUTLASS upgrade (#131493)
Unblocks/unbreaks against newer CUTLASS (3.5+)

CC @nWEIdia @xwang233 @ptrblck @thakkarV

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131493
Approved by: https://github.com/Skylion007
2024-08-09 07:35:38 +00:00
7b8ab7eb3e [dynamo] Partially support random.Random class (#133037)
This partially fixes the graph break issue when instantiating a `random.Random` class in Python.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133037
Approved by: https://github.com/anijain2305
2024-08-09 07:15:42 +00:00
ea00036841 [BE] Raise when the target model has scalar parameters (#132934)
Address the issue, https://github.com/pytorch/pytorch/issues/130810.

Both FSDP1 and FSDP2 do not support scalar parameters. For FSDP1, the issue happens during state_dict operations while FSDP2 fails during the initialization. This PR adds exceptions to help users debug the issue and change the scalar parameters to 1D parameters.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132934
Approved by: https://github.com/awgu
ghstack dependencies: #132908, #132933
2024-08-09 06:45:48 +00:00
5707c6e952 [Fake tensor] Align the appearance of device_put op in fx_graph generated for CUDA and XPU, which is exposed in the issue #130823 (#132479)
[Fake tensor] Align the appearance of device_put op in fx_graph generated for CUDA and XPU, which is exposed in the issue #130823
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132479
Approved by: https://github.com/EikanWang, https://github.com/zou3519, https://github.com/eellison
2024-08-09 05:31:00 +00:00
cyy
da65cfbdea Remove unused Caffe2 macros (#132979)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132979
Approved by: https://github.com/ezyang
2024-08-09 04:48:20 +00:00
cyy
05e8e87a69 [Submodule] Remove foxi (#132976)
It is not used after removal of Caffe2 code.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132976
Approved by: https://github.com/ezyang
2024-08-09 03:46:52 +00:00
bb6eef8ed1 [2/2] PT2 Inductor ComboKernels - automatic horizontal fusing (#131675)
Summary:
A ComboKernel combines independent Inductor Triton kernels into a single one.
This is part 2 pull request which 1) adds automatic horizontal fusion in the end of the inductor operator fusion process, 2) adds type annotation for trition_combo_kernel.py

ComboKernel is used in two cases: 1) for existing foreach kernels, combo kernels are used as the backend kernel. the front-end kernel generation logic remains the same. 2) Added an extra optimization phase to the end of the scheduler to generate extra combo kernels if combo_kernels is True in config.py

This is part 2 pull request which deals with the 2nd case above:

- The combo kernel generation in the added optimization phase is done in two steps: 1) in the front end inside the scheduler, it topologically sort the schedule nodes to find all the nodes with no data dependency and create a frond end schedule node for them. We currently limit the maximal number of sub-nodes for each combo kernel to 8 (but we still need to find what is the optimal number). 2) then, these sub-nodes are combined in the codegen phase to generate the combo kernel code for them based on a few rules. For example, 1d and 2d kernels are separated into different combo kernels, as mixing them is not supported yet. Note these algorithms we provide are very basic, and the users can register their customized combo kernel generation algorithms for both steps.

- Performance wise, combining small kernels is about always to see performance gain. however, combining very large kernels may not see any perf gain, sometimes even regression possibly due to improper block sizes. Thus, a benchmark function is implemented to avoid such perf regression, and it is recommended to turn it on by setting benchmark_combo_kernels to True whenever combo_kernels is True.

Please refer to part 1 pull request https://github.com/pytorch/pytorch/pull/124969 for more details.

Test Plan: buck2 test mode/dev-nosan caffe2/test/inductor:combo_kernels

Differential Revision: D60067757

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131675
Approved by: https://github.com/mlazos
2024-08-09 03:14:16 +00:00
8875226d62 [dtensor] multi-dim mesh redistribute follow up (#133023)
follow up from https://github.com/pytorch/pytorch/pull/131210

and added one test case from user in

https://github.com/pytorch/pytorch/issues/132751

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133023
Approved by: https://github.com/tianyu-l
ghstack dependencies: #133022
2024-08-09 02:26:23 +00:00
3b7edc12c6 [dtensor] more refactor to imports/paths (#133022)
as titled

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133022
Approved by: https://github.com/XilunWu, https://github.com/wz337
2024-08-09 02:26:23 +00:00
22ea248aa8 dynamic shapes mismatch errors (#132982)
Summary: When PyTree detects a structural mismatch between inputs and dynamic shapes, the error messages are quite horrible. This PR fixes these error messages by adding, for each kind of error, the path to the point where the error happens and an actionable reason for the error.

Test Plan: added test with several cases

Differential Revision: D60956976

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132982
Approved by: https://github.com/yushangdi
2024-08-09 02:22:32 +00:00
cyy
8967d55b01 [18/N] Fix clang-tidy warnings in jit (#132963)
Follows #132753

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132963
Approved by: https://github.com/Skylion007
2024-08-09 01:27:32 +00:00
313aa151da Revert "[ROCm] TunableOp logging improvements (#132173)"
This reverts commit 9cca0494b9d5c89c0a1100aee9477ed8ca7d527b.

Reverted https://github.com/pytorch/pytorch/pull/132173 on behalf of https://github.com/PaliC due to reverted internally ([comment](https://github.com/pytorch/pytorch/pull/132173#issuecomment-2276966242))
2024-08-09 01:04:57 +00:00
4101dd14c2 Make debugging backends accept and ignore options kwargs from torch.compile (#132892)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132892
Approved by: https://github.com/anijain2305, https://github.com/jansel
2024-08-09 00:49:45 +00:00
0ff0bf3d31 [Replicate] Fix replicate with DeviceMesh initialization (#133024)
A follow up on https://github.com/pytorch/pytorch/pull/132339.

`get_parent_mesh` is replaced by `get_root_mesh`. In addition, modify a few places that parent mesh is mentioned in test.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133024
Approved by: https://github.com/Skylion007, https://github.com/fegin
2024-08-09 00:45:47 +00:00
10c2168b31 [pt2-bench] use larger multiplier for smaller tensors for a few models (#132952)
Fix https://github.com/pytorch/pytorch/issues/132922  and https://github.com/pytorch/pytorch/issues/132924

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132952
Approved by: https://github.com/eellison, https://github.com/jansel
2024-08-09 00:09:21 +00:00
3c5b246d3c [export] Remove Proxy from exported programs and modules (#132956)
Summary: Remove Proxy from exported programs and modules because they cannot be deepcopied or pickeled.

Test Plan:
CI

```
buck2 run 'fbcode//mode/dev-nosan'  fbcode//caffe2/test/quantization:test_quantization -- -r  qat_conv2d
buck2 run 'fbcode//mode/dev-nosan' fbcode//modai/test:test_modai -- -r test_qat_stinson_htp_export
buck2 run 'fbcode//mode/dev-nosan' fbcode//vizard_projects/ml_depth/tests:test_model -- -r test_qat_model_et
buck2 run 'fbcode//mode/dev-nosan' fbcode//bolt/nn/executorch/backends/tests:qnn_test -- -r test_qat_bias=False,use_3d_input=False
buck2 run 'fbcode//mode/dev-nosan' fbcode//bolt/nn/executorch/backends/tests:qnn_test -- -r test_qat_bias=True,use_3d_input=False
buck2 run 'fbcode//mode/dev-nosan' fbcode//caffe2/test/quantization:test_quantization -- -r  test_fold_bn_erases_bn_node
```

Differential Revision: D60940832

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132956
Approved by: https://github.com/angelayi
2024-08-09 00:00:20 +00:00
e2b94923ba [PyTorch] Speed up decomposed quantize_per_channel (#133029)
Similar to D60871396 (#132828).

Differential Revision: [D60978385](https://our.internmc.facebook.com/intern/diff/D60978385/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133029
Approved by: https://github.com/cccclai
2024-08-08 23:48:34 +00:00
fa8c34301a [ts-migration]: Quantized ops to standard ops pass. (#133026)
#### Description
Transform quantized operation properly. Add de/quantization before and after the quantized operation.

#### Test Plan
`pytest test/export/test_converter.py -s -k test_ts2ep_convert_quantized_model`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133026
Approved by: https://github.com/angelayi
2024-08-08 23:10:17 +00:00
45cf8ef557 add impls for required for nt ops (#132710)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132710
Approved by: https://github.com/jbschlosser
ghstack dependencies: #131060
2024-08-08 23:09:38 +00:00
1434e0b121 Add a private _safe_softmax (#131060)
# Summary
Changes the stance of SDPA on what to do for fully masked out rows

## Current Behavior
Several PyTorch users have expressed frustration over this issue:
- https://github.com/pytorch/pytorch/issues/41508
- https://github.com/pytorch/pytorch/issues/103749
- https://github.com/pytorch/pytorch/issues/103963

These are significant issues with extensive discussion but no satisfactory resolution. The PyTorch team's consensus, as stated here:
https://github.com/pytorch/pytorch/issues/24816#issuecomment-524415617

Can be paraphrased as follows:

When passing in fully masked out rows, attention becomes ambiguous. We have two main options:

1. Uniformly attend to all values:
   ```python
   scores[masked_out_rows] = 1 / len(row)
   out[masked_out_rows] = 1 / len(row) * value
   ```

2. Decide that attention between no queries (masked) and no keys (masked) is meaningless:
   ```python
   output[fully_masked_rows] = NaN
   ```

We went with option 2. Partially because it was easier to implement, but also people argued that users can slice the output to remove the NaNs:
``` Python
>fill_value = -float("inf")
>row0 = torch.randn(4)
>row1 = torch.tensor([(fill_value for _ in range(4)])
>matrix = torch.stack([row0, row1]).requires_grad_(True)
>out = torch.softmax(matrix, 1)
>out = out[0]
>print(out)
tensor([0.5377, 0.2729, 0.0692, 0.1201])
```
Cool, problem solved. But what happends when you call backwards..
```Python
>out.backward(torch.ones_like(out))
>print(matrix.grad)
tensor([[3.0957e-08, 1.4157e-08, 7.7802e-10, 1.3713e-08],
        [       nan,        nan,        nan,        nan]])
```
Those pesky NaNs are back!

## Why do we see NaNs today?

The core of the problem revolves around using softmax function in sdpa:

```python
> row = torch.tensor([(-float("inf")) for _ in range(4)])
> torch.softmax(row, 0)
tensor([nan, nan, nan, nan])
```

## Quick Aside: Masking in Attention

Attention itself doesn't have a concept of masking. The `sdpa` function has an argument called `attn_mask`, which would be more accurately named `attn_bias`. This is because we don't actually "mask" entries when computing attention. Instead, due to implementation details([performance](https://github.com/pytorch/pytorch/issues/25110#issuecomment-524519087)), we add a value to the masked-out query/key pairs.

We use a large negative number (typically -inf) to decrease the attention weight, as softmax assigns more weight to larger values.

## Alternative Approaches

If we use a very large negative number instead of -inf:

```python
> row = torch.tensor([(-1e6) for _ in range(4)])
> torch.softmax(row, 0)
tensor([0.2500, 0.2500, 0.2500, 0.2500])
```
However if users always remembered to "slice" out their outputs i.e.:
```Python
>fill_value = -1e6
>...
>out.backward(torch.ones_like(out))
>print(matrix.grad)
tensor([[-0.0563, -0.0564,  0.1613, -0.0486],
        [ 0.0000,  0.0000,  0.0000,  0.0000]])
```
This would bring us back into a better state.

## A Third Option

We don't necessarily need to alter the behavior of softmax for -inf or very large negative numbers. The fundamental goal is to exclude certain query/key pairs from attention, regardless of the underlying implementation.

This PR implements the new semantic for masking w/ attention in fully masked-out rows:
```python
out[masked_out_rows] = 0
```

**Important Note**: This idea isn't entirely new. The [MaskedTensor](https://pytorch.org/tutorials/prototype/maskedtensor_overview#safe-softmax) prototype, a tensor subclass, was designed to handle such cases. However, it remains a prototype feature and hasn't gained widespread adoption.

## Details
This PR stack does 3 things:
1. Adds a PRIVATE _safe_softmax op
2. Updates semantic for flash_cpu fused kernel
3. Updates semantic for efficient_cuda fused kernel

_safe_softmax is not supposed to be used generically and is only meant to be used within the context of SDPA. Due to this fact instead of decomposing softmax and checking for -inf rows we instead "cheat" and use nan_to_num.

Why I think this is okay? (please find a counter point if avail)
There are multiple ways NaNs can emerge. For the fully masked out rows case nan_to_num works. But what if there were other NaNs, wouldn't this silently remove them?

The only case that this can happen is if the input itself had a NaN or an Inf
For example:
```Python
a = torch.ones([4], requires_grad=False, dtype=torch.float16)
a[1] = torch.finfo(torch.float16).max
print(a.softmax(-1))
```
Will return
`tensor([0., 1., 0., 0.], dtype=torch.float16)`

Where
```Python
a = torch.ones([4], requires_grad=False, dtype=torch.float16)
a[1] = float("inf")
a.softmax(-1)
```
returns:
`tensor([nan, nan, nan, nan], dtype=torch.float16)`

If we dont want to even allow for the possibility of "inf" or "NaN" attention scores to be converted to 0 then we can implemented it something like this

```Python
max = torch.max(a, dim=-1, keepdim=True)
exp = torch.exp(a - max.values)
denom = torch.sum(exp, dim=-1, keepdim=True)
softmax = exp / denom
softmax = torch.where(max.values == float('-inf'), 0.0, softmax)
```
however we would be paying for this in math performance.

## Why Now
I think one point that has substantially changed where PyTorch should lie on this argument is the fact that we have fused implementations for SDPA now. And these fused implementations allow us to easily and performantly support this new semantic.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131060
Approved by: https://github.com/jbschlosser
2024-08-08 23:09:38 +00:00
1f66487c69 [BE] Reroute all uses of proxy_tensor.maybe_disable_fake_tensor_mode to fake_tensor.unset_fake_temporarily (#132770)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132770
Approved by: https://github.com/bdhirsh
2024-08-08 23:07:23 +00:00
f25df31008 TunableOp more unit test follow-up (#130065)
More unit tests for preventing TunableOp regressions.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130065
Approved by: https://github.com/jeffdaily, https://github.com/malfet
2024-08-08 22:42:16 +00:00
3d0de6e1cd [Inductor] Add config option to force higher-dimensional tiling (#132937)
Fixes #125077

**Feature**

This PR creates a new Inductor config, `config.triton.prefer_nd_tiling`, which is disabled by default. When enabled, this encourages the Triton code to use as many tiling dimensions as possible. This simplifies indexing expressions for discontiguous tensors, resulting in expressions like `5 * x + 8 * y` as opposed to `5 * (x // 7) + 8 * (y % 9)`. This allows us to find more block pointers than we normally would. We should now see simplified indexing expressions as long as:
 1. All discontiguous reads/writes have the same shape.
 2. The number of discontiguous dimensions is less than `config.triton.max_tiles`.

 Here's an example kernel (elementwise add of views) with ND tiling disabled:
 ```
 @triton.jit
def triton_(in_ptr0, in_ptr1, out_ptr0, xnumel, XBLOCK : tl.constexpr):
    xnumel = 21
    xoffset = tl.program_id(0) * XBLOCK
    xindex = xoffset + tl.arange(0, XBLOCK)[:]
    xmask = xindex < xnumel
    x0 = xindex % 7
    x1 = (xindex // 7)
    x2 = xindex
    tmp0 = tl.load(in_ptr0 + (x0 + (9*x1)), xmask)
    tmp1 = tl.load(in_ptr1 + (x0 + (9*x1)), xmask)
    tmp2 = tmp0 + tmp1
    tl.store(tl.make_block_ptr(out_ptr0, shape=[21], strides=[1], block_shape=[XBLOCK], order=[0], offsets=[xoffset]), tl.broadcast_to(tmp2, [XBLOCK]).to(tl.float32), boundary_check=[0])
''', device_str='cuda')
 ```

 And here's the version with it enabled:
 ```
 @triton.jit
def triton_(in_ptr0, in_ptr1, out_ptr0, ynumel, xnumel, YBLOCK : tl.constexpr, XBLOCK : tl.constexpr):
    ynumel = 3
    xnumel = 7
    yoffset = tl.program_id(1) * YBLOCK
    yindex = yoffset + tl.arange(0, YBLOCK)[None, :]
    ymask = yindex < ynumel
    xoffset = tl.program_id(0) * XBLOCK
    xindex = xoffset + tl.arange(0, XBLOCK)[:, None]
    xmask = xindex < xnumel
    x1 = xindex
    y0 = yindex
    tmp0 = tl.load(tl.make_block_ptr(in_ptr0, shape=[7, 3], strides=[1, 9], block_shape=[XBLOCK, YBLOCK], order=[1, 0], offsets=[xoffset, yoffset]), boundary_check=[0, 1], eviction_policy='evict_last')
    tmp1 = tl.load(tl.make_block_ptr(in_ptr1, shape=[7, 3], strides=[1, 9], block_shape=[XBLOCK, YBLOCK], order=[1, 0], offsets=[xoffset, yoffset]), boundary_check=[0, 1], eviction_policy='evict_last')
    tmp2 = tmp0 + tmp1
    tl.store(tl.make_block_ptr(out_ptr0, shape=[7, 3], strides=[1, 7], block_shape=[XBLOCK, YBLOCK], order=[1, 0], offsets=[xoffset, yoffset]), tl.broadcast_to(tmp2, [XBLOCK, YBLOCK]).to(tl.float32), boundary_check=[0, 1])
''', device_str='cuda')
 ```

 With this feature enabled, we get a discontiguous strided block pointer. Previously, this would only have worked for specific shapes, like powers of 2 or multiples of the maximum block size. With this PR, we can support arbitrary shapes so long as we have enough tiles to cover all discontiguous dimensions.

**Test plan**

This PR adds some tests for pointwise ops with discontiguous tensors.
 - Test that we can generate block pointers for views with odd shapes like `(5,7)`, `(9,3,5)`, etc.
 - Test that we can generate block pointers for a single discontiguous dim in 3D and 4D tensors.
 - Test that we generate a 2D tiling for a 5D tensor with two discontiguous dims. This case doesn't generate a block pointer, but it checks that the output code is at least correct.

This PR also parametrizes some existing tests to run with and without `triton.prefer_nd_tiling`. That way, we ensure this feature doesn't break existing usage.

Since this setting isn't enabled on most tests, I also created https://github.com/pytorch/pytorch/pull/132935 to test what happens when `triton.prefer_nd_tiling=True` by default. None of the failures seem related to invalid tiling, so I think this feature is safe to merge.

**Limitations and follow-ups**

I can see two main improvements which would expand the usefulness of this feature:

1. This feature currently only works for pointwise kernels, since reductions are never tiled. As a follow-up, we could enable tiled reductions to extend these benefits to reduction kernels.

2. The usefulness of this feature depends on `triton.config.max_tiles`. This is currently restricted to 2 by default, although it can be increased to 3 in certain cases. To support more discontiguous dims, we might consider expanding support for 3D tiling, or even supporting ND tiling, by mapping an ND "virtual" launch grid onto Triton's 3D launch grid.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132937
Approved by: https://github.com/jansel, https://github.com/eellison
2024-08-08 22:11:56 +00:00
8707c6dfac added persistent option to buffers and namedbuffers (#132994)
Fixes #85235

Alternative to PR https://github.com/pytorch/pytorch/pull/129655, implements 3-valued option (None or bool).

- adds keyword only argument `persistent: Optional[bool] = None` to `nn.Module.buffers`
- updated docstrings slightly.
- added test.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132994
Approved by: https://github.com/mikaylagawarecki
2024-08-08 21:39:01 +00:00
9cca0494b9 [ROCm] TunableOp logging improvements (#132173)
Summary:
TunableOp logging improvements:
1. PYTORCH_TUNABLEOP_VERBOSE=1: print out the expected value vs actual value for TunableOp validators, so that if validation fails, we know exactly how to fix it
2. PYTORCH_TUNABLEOP_VERBOSE=3: print out the exact kernel signature for both successful and failure cases in kernel lookup

Test Plan:
> PYTORCH_TUNABLEOP_VERBOSE=3 buck
2 run mode/{opt,amd-gpu} -c fbcode.enable_gpu_sections=true //scripts/xdwang/example:fc_llama -- --enab
le-tuning

```
reading tuning results from hipblas_tuning_pt_llama0.csv
Validator PT_VERSION=2.5.0
Validator ROCBLAS_VERSION=4.0.0-72e57364-dirty
Validator HIPBLASLT_VERSION=800-a15e4178
Validator ROCM_VERSION=6.0.0.0-12969-1544e39
Validator GCN_ARCH_NAME=gfx942:sramecc+:xnack-
GCN_ARCH_NAME validation: expect gfx942:sramecc+:xnack- to match gfx942:sramecc+:xnack-
ROCM_VERSION validation: expect 6.0.0.0-12969-1544e39 to match 6.0.0.0-12969-1544e39
HIPBLASLT_VERSION validation: expect 800-a15e4178 to match 800-a15e4178
ROCBLAS_VERSION validation: expect 4.0.0-72e57364-dirty to match 4.0.0-72e57364-dirty
PT_VERSION validation: expect 2.5.0 to match 2.5.0
Loading results
GemmTunableOp_BFloat16_TN(tn_8192_2_1024) -> Gemm_Hipblaslt_TN_61169,0.0171694
GemmTunableOp_BFloat16_TN(tn_7168_2_8192) -> Gemm_Hipblaslt_TN_61089,0.036138
GemmTunableOp_BFloat16_TN(tn_8192_2_3584) -> Gemm_Hipblaslt_TN_61169,0.0240673
missing params_signature, returning null ResultEntry for GemmTunableOp_BFloat16_TN,tn_1280_2_8192
finding fastest for GemmTunableOp_BFloat16_TN(tn_1280_2_8192) out of 2818 candidates
Rotating buffer 4 MiB. Needed Size: 20 MiB. Needed number of param copies: 1
├──tuning using warmup iters 0 [0 ms] and tuning iters 1 [0.208254 ms] instance id=0, GemmTunableOp_BFloat16_TN(tn_1280_2_8192) Default
├──offset at 3
......
ResultEntry found for GemmTunableOp_BFloat16_TN,tn_8192_2_3584
ResultEntry found for GemmTunableOp_BFloat16_TN,tn_8192_2_3584
ResultEntry found for GemmTunableOp_BFloat16_TN,tn_8192_2_3584
Avg time: 16.42832040786743 us, Achieved 7.15 TFLOPS, 3578.07 GB/s

2x1280x8192-torch.bfloat16,16.260499954223633,2.5794434438103107,1294.0669757533708
2x8192x1024-torch.bfloat16,16.15394949913025,2.0771658350056508,1041.11852032876
2x7168x8192-torch.bfloat16,25.691540241241455,9.14234887416194,4574.841325057144
2x8192x3584-torch.bfloat16,16.42832040786743,7.1486621324818085,3578.0709494714856
```

Differential Revision: D60468273

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132173
Approved by: https://github.com/mxz297, https://github.com/jeffdaily
2024-08-08 21:24:16 +00:00
cd30861857 [PT2][Optimus] Update unbind_cat_to_view pass to include more complicated cases (#132831)
Summary: We found recent CMF and IGCTR has more complicated patterns to optimize in order to remove as many stack/cat nodes as possible, we thus design such patterns

Test Plan:
# unit test

```
CUDA_VISIBLE_DEVICES=3 OC_CAUSE=1 buck2 test //caffe2/test/inductor:split_cat_fx_passes
```
Test UI: https://www.internalfb.com/intern/testinfra/testrun/3659174939423652
Network: Up: 113KiB  Down: 112KiB  (reSessionID-11c9b598-af3a-4727-8f02-ccb1471d092b)
Jobs completed: 27. Time elapsed: 5:45.8s.
Cache hits: 0%. Commands: 2 (cached: 0, remote: 0, local: 2)
Tests finished: Pass 9. Fail 0. Fatal 0. Skip 1. Build failure 0

# benchmark

### cmf
```
CUDA_VISIBLE_DEVICES=3 OC_CAUSE=1 buck2 run mode/opt //scripts/jackiexu0313/pt2:local_model_with_pt2 -- --test_mode batch-split --model_type "cmf_shrink" --flow_id 587303213 -n
```
P1515072258

Counter({'pattern_matcher_nodes': 2170, 'pattern_matcher_count': 1766, 'normalization_pass': 402, 'remove_split_with_size_one_pass': 269, 'extern_calls': 193, 'merge_splits_pass': 74, 'normalization_aten_pass': 51, 'fxgraph_cache_miss': 9, 'batch_aten_mul': 6, 'scmerge_split_sections_removed': 5, 'scmerge_split_removed': 3, 'scmerge_cat_removed': 3, 'unbind_stack_pass': 3, 'batch_sigmoid': 2, 'batch_linear': 2, 'batch_aten_sub': 2, 'batch_layernorm': 1, 'scmerge_split_added': 1, 'scmerge_cat_added': 1, 'split_stack_to_cats_pass': 1, 'split_cat_to_slices_pass': 1, 'batch_aten_add': 1, 'batch_relu': 1})

### ig_ctr

```
CUDA_VISIBLE_DEVICES=3 OC_CAUSE=1 buck2 run mode/opt //scripts/jackiexu0313/pt2:local_model_with_pt2 -- --test_mode batch-split --model_type "ig_ctr" --flow_id 584880697 -n
```
P1515087739

Counter({'pattern_matcher_nodes': 1832, 'pattern_matcher_count': 1564, 'extern_calls': 378, 'normalization_pass': 345, 'normalization_aten_pass': 49, 'fxgraph_cache_miss': 18, 'batch_aten_mul': 6, 'scmerge_cat_removed': 5, 'scmerge_cat_added': 4, 'batch_linear_post_grad': 4, 'scmerge_split_removed': 3, 'unbind_stack_pass': 3, 'unbind_cat_to_view_pass': 3, 'batch_tanh': 2, 'scmerge_split_sections_removed': 2, 'scmerge_split_added': 2, 'split_stack_to_cats_pass': 2, 'split_cat_to_slices_pass': 1})

# e2e

testing the following new patterns
```
                "split_stack_to_cats_pass": {},
                "split_cat_to_slices_pass": {},
                "unbind_cat_to_view_pass": {},
```
Note that you can tune the hyper-parameter "threshold_to_cat " for these patterns, and the minimum value you give should be at least 2. The larger the value, the less aggressive to do the node slicing but to keep the cat, and the default value is 10. You can tune the parameters by setting threshold_to_cat. For example

```
"split_stack_to_cats_pass": {"threshold_to_cat": 10},
"split_cat_to_slices_pass": {"threshold_to_cat": 10},
"unbind_cat_to_view_pass": {"threshold_to_cat": 10},
```

Note that the default value may not be optimal, it's based on my experiments on CMF and IGCTR, you are more than welcome to tune the value to find the best threashold for you. For example, in the cmf local run,
- when "threshold_to_cat" is 2
P1515072258
=============Print full analysis for cmf_shrink================
| Metric             | Value           |
|:-------------------|:----------------|
| Batch size         | 10              |
| Latency            | 156.07 ms       |
| Model size         | 844357184 bytes |
| Flops/example      | 583.53 G        |
| TFLOPS             | 37.39           |
| MFU                | 4.67%           |
| Activation/example | 1707.49 MB      |

- when "threshold_to_cat" is 10
P1515912635
=============Print full analysis for cmf_shrink================
| Metric             | Value           |
|:-------------------|:----------------|
| Batch size         | 10              |
| Latency            | 155.09 ms       |
| Model size         | 844357184 bytes |
| Flops/example      | 583.53 G        |
| TFLOPS             | 37.63           |
| MFU                | 4.70%           |
| Activation/example | 1707.49 MB      |

ads_dper3:164562cbe29f6c5aea4546cf3d463b87
training_platform:5e455c643c52940bb4567017f4c7ba83

## cmf
baseline
f588717948
proposal
f588719502

### QPS and NE results
{F1793304642}
{F1793304664}
{F1793304689}
{F1793304683}

### Compilation time reduction

zoomer link: https://www.internalfb.com/intern/zoomer/?profiling_run_fbid=1045728747213538&tab=pt2_metrics

Compile time for that frame is reduced to 1 min from 9 min.

### trace analysis
baseline trace link
https://www.internalfb.com/intern/perfdoctor/trace_view?filepath=tree%2Ftraces%2Fdynocli%2Ff588722004-TrainingApplication%2F0%2Frank-1.Aug_06_00_03_46.3617.pt.trace.json.gz&bucket=pyper_traces

proposal trace link
https://www.internalfb.com/intern/perfdoctor/trace_view?filepath=tree%2Ftraces%2Fdynocli%2Ff588723545-TrainingApplication%2F0%2Frank-1.Aug_05_23_54_56.3647.pt.trace.json.gz&bucket=pyper_traces

{F1793312804} {F1793312867}

From the trace, we can see that the green part (introduced by split cat) has been reduced significantly with our new patterns.

Differential Revision: D60750275

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132831
Approved by: https://github.com/jackiexu1992
2024-08-08 21:18:01 +00:00
40767e8468 [BE] rename testHelperPrefix test (#132916)
Summary:
Re-enable testHelperPrefix test that was erroneously disabled in CI.
Fixes #50701

Test Plan:
Test passes locally:
```
❯ ./TCPStoreTest --gtest_filter=TCPStoreTest.testHelperPrefix
Running main() from
/data/users/cpio/pytorch/third_party/googletest/googletest/src/gtest_main.cc
Note: Google Test filter = TCPStoreTest.testHelperPrefix
[==========] Running 1 test from 1 test suite.
[----------] Global test environment set-up.
[----------] 1 test from TCPStoreTest
[ RUN      ] TCPStoreTest.testHelperPrefix
[W807 12:01:31.531576727 socket.cpp:462] [c10d] waitForInput: poll for
socket SocketImpl(fd=6, addr=[localhost]:37984,
remote=[localhost]:37171) returned 0, likely a timeout
[W807 12:01:31.531663710 socket.cpp:487] [c10d] waitForInput: socket
SocketImpl(fd=6, addr=[localhost]:37984, remote=[localhost]:37171) timed
out after 100ms
[       OK ] TCPStoreTest.testHelperPrefix (314 ms)
[----------] 1 test from TCPStoreTest (314 ms total)

[----------] Global test environment tear-down
[==========] 1 test from 1 test suite ran. (314 ms total)
[  PASSED  ] 1 test.
╭─ ~/local/pytorch/build/bin  main *1 +1 ···················· ✔
/home/cpio/local/a/pytorch-env   cpio@devgpu011 ─╮
╰─
```
Reviewers:

Subscribers:

Tasks:

Tags:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132916
Approved by: https://github.com/Skylion007
2024-08-08 20:54:52 +00:00
7bd0732cbd Fix flaky internal mixed_mm tests (#133015)
This PR fixes flaky internal tests:
- The AutoHeuristic test was sometimes failing because it required autotuning to happen for mixed_mm which didn't end up happening when there was a fx graph cache hit.
- The tests inside pattern_matcher failed because in some cases pad_mm decided to pad which made the mixed_mm pattern not match anymore (instead of cast -> mm, it was cast -> pad -> mm), and the tests also fail when is_big_gpu is false (which I haven't found an explanation for).

Differential Revision: [D60972176](https://our.internmc.facebook.com/intern/diff/D60972176)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133015
Approved by: https://github.com/Chillee, https://github.com/eellison
2024-08-08 20:32:12 +00:00
a9954d22f8 Raise exception if torch.func.* calls torch.compile functions (#128736)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128736
Approved by: https://github.com/zou3519
2024-08-08 20:21:44 +00:00
b845068db2 [dtensor] refactor examples folder (#132914)
as titled:

1. remove checkpoint example as it's not maintained
2. refactor convnext example to use torchrun
3. refactor comm mode feature example to sit in one file

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132914
Approved by: https://github.com/wz337
2024-08-08 20:03:14 +00:00
c326533999 [ROCm][Inductor] Enable AOT Inductor CPP UTs for ROCm (#131521)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131521
Approved by: https://github.com/jataylo, https://github.com/pruthvistony, https://github.com/malfet
2024-08-08 19:49:56 +00:00
de288e2203 Fix inf value reduction in non persistent reduction for scans (#132293)
Fixes https://github.com/pytorch/pytorch/issues/132107

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132293
Approved by: https://github.com/peterbell10
2024-08-08 19:02:32 +00:00
322c9d03a0 [FSDP][dtensor] use _StridedShard to represent nested sharding for correct full_tensor() result (#130760)
Fixes issue #129229 #129206
**Summary**

1. Have `FSDP` choose `_StridedShard` placement for FSDP+TP sharding
2. Added a parity test to FSDP to ensure that FSDP+TP sharding (i.e. strided) and simply TP sharding (i.e. non-strided) has the same `full_tensor()` result
3. Re-enabled the tests that were disabled in #129519

**test**
`pytest test/distributed/_composable/fsdp/`
`pytest test/distributed/_composable/test_composability/test_2d_composability.py`
`pytest test/distributed/checkpoint/fsdp/test_fsdp_dsd.py`

Differential Revision: [D60606114](https://our.internmc.facebook.com/intern/diff/D60606114)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130760
Approved by: https://github.com/wanchaol, https://github.com/fegin, https://github.com/wz337
ghstack dependencies: #126697, #130239, #132391, #131408
2024-08-08 18:15:29 +00:00
21906ddaba [AOTI] Fix complex64 not defined (#132810)
Partially fixes #122980

- change cpp type mapping for complex64 to std::complex<float>
- add `aoti_torch_item_complex64` and `aoti_torch_scalar_to_tensor_complex64`.
- add `expensiveCopyToTensor()` to convert `ArrayRefTensor<T>` type to `AtenTensorHandle` type.

- if we want to fully fix #122980, we still need to let ArrayRef and MiniArrayRef to consider underlying storage number of elements. See more details in https://github.com/pytorch/pytorch/pull/132347 (#132347  broke some internal tests, so we need more work before landing it).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132810
Approved by: https://github.com/desertfire
2024-08-08 18:08:23 +00:00
ac95b2a2f2 Migrate slow self-hosted jobs to Amazon2023 AMI (#131771)
A continuation of the migration started in
- https://github.com/pytorch/pytorch/pull/131250

(for tracking: signal on Aug 6: https://hud.pytorch.org/pytorch/pytorch/pull/131771?sha=38bc4755567527fad5279203ddef534ac132ea94)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131771
Approved by: https://github.com/seemethere
2024-08-08 17:33:57 +00:00
75eb66afc0 Support 'non-contiguous with holes' NJTs for contiguous clone() (#132776)
It's possible to construct an NJT with "holes" by specifying both `offsets` and `lengths` metadata. When `nt.clone(memory_format=torch.contiguous_format)` is called on such an NJT, the result should be an NJT without holes.

This PR fixes this in simplistic way using `unbind()`, which isn't really supported in `torch.compile`. The longer term solution involves writing a proper kernel to support this.

NB: Another limitation is that the returned NJT does not have the same ragged structure as the input. While we could manually hack the nested int registry (or update the union find when that lands), this is the first instance where a NJT with holes and an NJT without holes could have the same ragged structure, and getting those to play nicely together requires some fairly involved updates. For now, this PR punts on these updates until we can clean this up.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132776
Approved by: https://github.com/ani300, https://github.com/soulitzer
ghstack dependencies: #131898, #131704, #131937
2024-08-08 17:08:11 +00:00
3ec9ec03a8 Revert "[pipelining] Add schedule runtime for lowered schedule (#130488)"
This reverts commit b73d4b6555dd6b5a39d70d741099b83190eb31f0.

Reverted https://github.com/pytorch/pytorch/pull/130488 on behalf of https://github.com/PaliC due to breaking distributed tests internally (that should be running in OSS) ([comment](https://github.com/pytorch/pytorch/pull/130488#issuecomment-2276266909))
2024-08-08 16:57:50 +00:00
942ffd1b2d Make the __module__ name of HOO to be always "torch.ops.higher_order" (#132775)
Summary: It seems that we can just make this the default so that in the future all the ops printed in the graph should be like torch.ops.higher_order

Test Plan: CI

Differential Revision: D60530900

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132775
Approved by: https://github.com/ydwu4, https://github.com/zou3519
2024-08-08 16:55:09 +00:00
eeb6ad0744 [quant] Speed up dequantize_per_channel (#132828)
Tensor-wise operations are much faster than looping over tensor elements. Rewrite loop in dequantize_per_channel to use whole-Tensor operations accordingly.

Differential Revision: [D60871396](https://our.internmc.facebook.com/intern/diff/D60871396/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132828
Approved by: https://github.com/cccclai
2024-08-08 16:44:41 +00:00
dfc5bb0099 Login to Meta's ECR when using non-meta runner (#132870)
The project depends on fetching container images from Meta's ECR repo so when run on non-meta runners we need to ensure that we also login to Meta's ECR too.

Closes pytorch/ci-infra#252.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132870
Approved by: https://github.com/ZainRizvi
2024-08-08 16:34:46 +00:00
4a4dc9d6d9 [inductor] Disable remote caching in failing test_cpu_repro tests (#132955)
Summary: These tests are failing stress tests internally because of remote caching. Most already have local cache disabled; disable remote cache as well

Test Plan: Ran stress tests locally for each of the affected tests

Differential Revision: D60940081

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132955
Approved by: https://github.com/leslie-fang-intel
2024-08-08 16:20:56 +00:00
9d5c85c499 Move exir.delegate to PyTorch core to enforce no out-of-tree HOPs (#132525)
Summary: When HOPs live out of tree, it makes it impossible to make breaking changes to the HOP API. But HOP implementations are deeply entwined with PyTorch internals. Move the HOP into PyTorch tree so that changes are possible.

Test Plan: sandcastle, ossci

Differential Revision: D60674615

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132525
Approved by: https://github.com/zou3519, https://github.com/Skylion007
2024-08-08 16:06:56 +00:00
4ee5547b37 [triton_op] Skip HOP dispatch when possible (#132822)
The capture_triton decorator returns a function that goes through the
triton kernel wrapper HOP. This is useful for make_fx tracing and
non-strict export. However, the HOP dispatch is slow (~1ms) and not
necessary in certain situations.

This PR skips going through the HOP dispatch for any
capture_triton-wrapped triton kernels that are registered as
implementations to a `@triton_op` custom operator. We do this by
creating a new thread-local flag that controls if the
capture_trition-wrapped triton kernel goes through HOP dispatch or not.

Test Plan:
- new test and existing tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132822
Approved by: https://github.com/SherlockNoMad
2024-08-08 15:56:40 +00:00
b885ad8fce Revert "[Inductor][CPP] Turns on inline_inbuilt_nn_modules for CPP GEMM template testing (#132487)"
This reverts commit 73c083e02cb6093bb3adf06b7ccdf5c4a2e7591c.

Reverted https://github.com/pytorch/pytorch/pull/132487 on behalf of https://github.com/PaliC due to this pr is breaking inductor tests internally ([comment](https://github.com/pytorch/pytorch/pull/132487#issuecomment-2276142742))
2024-08-08 15:47:04 +00:00
0ca8f66e3a [NestedTensor] Modify softmax on ragged dimension to allow for 2D nested tensors (#132812)
Summary:
Modify `softmax` on the ragged dimension, where `ragged_idx == 1`, to allow for 2D nested tensors. This diff now enables a `softmax` operation on tensors of shape `(B, *)`, where `*` is the ragged dimension.

Extend existing `softmax` unit tests to include 2D nested tensors using the `include_2d_tensor=True` keyword argument.

Test Plan:
Verify that existing and modified unit tests pass using the following commands:

```
buck2 run mode/{opt,inplace} //caffe2/test:nested -- --regex test_softmax
```

```
buck2 run mode/{opt,inplace} //caffe2/test:nested -- --regex test_jagged_op
```

Reviewed By: davidberard98

Differential Revision: D60780975

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132812
Approved by: https://github.com/davidberard98
2024-08-08 15:41:28 +00:00
c4071c4707 Remove noqa: G004 warnings (#132917)
Remove logging messages with f-strings (G004), https://docs.astral.sh/ruff/rules/logging-f-string/

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132917
Approved by: https://github.com/Skylion007, https://github.com/c-p-i-o, https://github.com/fduwjj, https://github.com/fegin
ghstack dependencies: #132888
2024-08-08 15:18:53 +00:00
9db5bfccdc [inductor] disable test_torchinductor failed UTs on Windows (#132973)
Disable failed UTs of `test/inductor/test_torchinductor.py` on Windows.

**TODO:**
Debug and enable these UTs, after CI ready.

Local test:
<img width="857" alt="image" src="https://github.com/user-attachments/assets/3d9da274-f147-474e-92f1-a6d3ed8aa003">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132973
Approved by: https://github.com/jgong5, https://github.com/desertfire
2024-08-08 14:56:10 +00:00
51ddcde110 [BE] Introduces runner variants for amzn2023 to simplify lf-scale-config.yml and lf-canary-scale-config.yml (#132918)
Depends on https://github.com/pytorch/test-infra/pull/5541 to be deployed on LF and Meta infra

Test for this changes are in this PR: https://github.com/pytorch/test-infra/pull/5542
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132918
Approved by: https://github.com/zxiiro, https://github.com/ZainRizvi
2024-08-08 14:38:34 +00:00
6f99e97f0a Revert "[ts-migration]: Support quantized operation transformation (#131915)"
This reverts commit 0e8541766fe5ed58c54aa530eee8e34832539199.

Reverted https://github.com/pytorch/pytorch/pull/131915 on behalf of https://github.com/ezyang due to test broken on windows 0e8541766f ([comment](https://github.com/pytorch/pytorch/pull/131915#issuecomment-2275974907))
2024-08-08 14:30:35 +00:00
42cd397a0e Loads .pyd instead of .so in MemPool test for windows (#132749)
Fixes #132650

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132749
Approved by: https://github.com/albanD
2024-08-08 14:29:56 +00:00
d1f73fd844 Revert "[BE] Reroute all uses of proxy_tensor.maybe_disable_fake_tensor_mode to fake_tensor.unset_fake_temporarily (#132770)"
This reverts commit 902c6f3a191fb2ecb1976895b3e9eaae4b257b89.

Reverted https://github.com/pytorch/pytorch/pull/132770 on behalf of https://github.com/ezyang due to Removed API was recommitted ([comment](https://github.com/pytorch/pytorch/pull/132770#issuecomment-2275749689))
2024-08-08 12:54:34 +00:00
902c6f3a19 [BE] Reroute all uses of proxy_tensor.maybe_disable_fake_tensor_mode to fake_tensor.unset_fake_temporarily (#132770)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132770
Approved by: https://github.com/bdhirsh
ghstack dependencies: #132674, #132675, #132421, #132062, #132767, #132769
2024-08-08 12:03:25 +00:00
0e43175e22 [BE] Get rid of unnecessary inner_torch_dispatch method (#132769)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132769
Approved by: https://github.com/Skylion007, https://github.com/bdhirsh
ghstack dependencies: #132674, #132675, #132421, #132062, #132767
2024-08-08 12:03:25 +00:00
35fd4391bc Format torch.fx.experimental.proxy_tensor.py (#132767)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132767
Approved by: https://github.com/bdhirsh
ghstack dependencies: #132674, #132675, #132421, #132062
2024-08-08 12:03:18 +00:00
b4e2411f6f Big enough count to trigger stack overflow (#132062)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132062
Approved by: https://github.com/bdhirsh
ghstack dependencies: #132674, #132675, #132421
2024-08-08 12:03:12 +00:00
aec6332356 Only thunkify proxies in some situations (#132421)
The goal of this PR is to avoid stack overflow when we create extremely long chains of thunks, and then evaluate them (e.g., as occurs if you sum(long list of symint)). The basic idea behind this PR is to only thunkify proxies if they're being created in places where they may or may not be used--crucially, symint operations that occur in user code we are tracing are eagerly placed into the graph, even if they may eventually be dead.

I annotated the PR with explanation of changes.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132421
Approved by: https://github.com/Skylion007, https://github.com/zou3519
ghstack dependencies: #132674, #132675
2024-08-08 12:03:06 +00:00
54efd43022 [BE] Simplify code interacting with get_proxy_mode/enable_tracing (#132675)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132675
Approved by: https://github.com/Skylion007, https://github.com/ydwu4, https://github.com/zou3519
ghstack dependencies: #132674
2024-08-08 12:03:00 +00:00
361db32d47 Consolidate SymDispatchMode into ProxyTensorMode (#132674)
Instead of having a separate context variable for SymDispatchMode, we
now simply delegate to the current active proxy tensor mode when we
need to trace a SymInt.  We maintain a separate `__sym_dispatch__` magic
method as the calling convention is different than `__torch_dispatch__`.

Consolidating the modes in this ways means that we can consistently
disable both of these modes in tandem simply by removing the mode
from the proxy mode infra slot.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132674
Approved by: https://github.com/zou3519, https://github.com/bdhirsh
2024-08-08 12:02:54 +00:00
0f19d4150b Revert "[inductor]a less ambitious way to slove the scalar tensor (#132702)"
This reverts commit b483ca05a91f2876b0f1f5a435fa264f5467762d.

Reverted https://github.com/pytorch/pytorch/pull/132702 on behalf of https://github.com/ezyang due to breaks trunk jobs ([comment](https://github.com/pytorch/pytorch/pull/132702#issuecomment-2275642109))
2024-08-08 11:59:38 +00:00
ec49796b8f [Inductor] Support use_libdevice_for_f64 for pointwise ops on XPU, align with CUDA. (#132739)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132739
Approved by: https://github.com/malfet, https://github.com/EikanWang
2024-08-08 11:50:10 +00:00
24dee99cb7 Populate submodules of torch._C to sys.modules recursively (#132216)
See comment:

e9d1c26275/torch/__init__.py (L938-L950)

This PR recursively sets the submodules in the C extension to `sys.modules` (e.g., `_C._dynamo.eval_frame`).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132216
Approved by: https://github.com/ezyang
2024-08-08 10:20:25 +00:00
7f71f2a997 [dtensor] improve docs and comments (#132683)
as titled, fixed typos in various comments and improve the
public documentations

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132683
Approved by: https://github.com/XilunWu
ghstack dependencies: #131210, #132682
2024-08-08 09:24:58 +00:00
9e37e73e01 [dtensor] refactor and improve readability of _dispatch.py (#132682)
as titled. It also changes some comments of _op_schema.py to make them
update to date

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132682
Approved by: https://github.com/XilunWu
ghstack dependencies: #131210
2024-08-08 09:24:58 +00:00
ac960dced1 Skip Reformer for Dynamic size testing (#132468)
**Summary**

As discussed in https://github.com/pytorch/pytorch/issues/132286, `Reformer` has specialized the batch size dim which will fails the API  `mark_dynamic` 3a355c1891/torch/_dynamo/decorators.py (L228-L230)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132468
Approved by: https://github.com/ezyang
2024-08-08 08:25:53 +00:00
9c5e0d47fe Add xpu_cmake_macros.h to xpu build (#132847)
# Motivation

fix https://github.com/pytorch/pytorch/issues/132971

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132847
Approved by: https://github.com/EikanWang
2024-08-08 08:06:49 +00:00
751c744ad0 Optimize sort kernel for contiguous tensors (#132236)
Introduces enhancement for SortingKernel.cpp for cases where both the values and indices tensors have a stride 1, indicating contiguous memory layouts.

The changes include:
1. A new function `sort_kernel_impl`, encapsulating the core sorting logic for distinct types of tensor accessors.
2. Modifications to the `sort_kernel` function to utilize `sort_kernel_impl`. It now checks for tensor strides and optimally handles contiguous and non-contiguous tensor scenarios.
3. The optimization aims to improve cache locality and efficiency in memory access for contiguous tensor sorts.
4. Enhanced Code Readability and Structure: The restructuring of the sorting process improves clarity and maintenance by clearly defining how different stride scenarios are handled, making the code more transparent and easier to understand.

Tests have been conducted across various tensor sizes and shapes to ensure stability and reliability of the change.

The result of running the `test/test_sort_and_select.py` test suite is consistent between the main branch, and this modified branch.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132236
Approved by: https://github.com/jgong5
2024-08-08 07:01:25 +00:00
83e4af203d [dtensor] rewrite redistribute algorithm for multi-dim mesh (#131210)
As titled, this PR rewrite the current redistribute algorithm to make
the multi-mesh dim redistribute logic more sound. The previous algorithm
works numerically but it could incur additional non-necessary steps
when transforming shardings in the multi-dimesnion device mesh, i.e.

Let's say we want to transform from (S(1), S(1)) -> (S(1), S(2)). The
previous algorithm yield the following steps:

* mesh_dim 1: S(1) -> R, mesh_dim 0: S(1) -> R
* mesh_dim 0: R -> S(1), mesh_dim 1: R -> S(2)

Although it works semantically but it incurs two allgather
transformations, where it should really only incur a S(1) -> S(2) on the
mesh dim 1.

The rewrite algorithm basically take it in a more principled way:

1. we check if src_spec have sharding, if not, we don't need to worry about nested sharding case, as sharding would always be in order, so we just go from left to right in the placements and add the transform steps
2. if src_spec have sharding, this potentially means that there would be either nested or mis-aligned shardings. So we first tranverse from right to left to check if there's mis-aligned sharding as the above example showed, if there is, we replicate that mesh dimension so that it unshard the nested sharding
3. we tranverse again from left to right to generate the transformation
   after we unshard the nested sharding

should also fix https://github.com/pytorch/pytorch/issues/132751

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131210
Approved by: https://github.com/tianyu-l
2024-08-08 06:50:30 +00:00
479d460471 [DeviceMesh] Add a private _flatten() API for device_mesh (#132632)
Adds a new private API to flatten a DeviceMesh to a 1D DeviceMesh such that:
```
mesh_3d = init_device_mesh(
    self.device_type, (2, 2, 2), mesh_dim_names=("dp", "cp", "tp"),
)

dp_cp_mesh = mesh_3d["dp", "cp"]
# flattened_mesh on rank 0, 2, 4, 6 is DeviceMesh([0, 2, 4, 6], mesh_dim_names=('dp_cp',))
# flattened_mesh on rank 1, 3, 5, 7 is DeviceMesh([1, 3, 5, 7], mesh_dim_names=('dp_cp',))
flattened_dp_cp_mesh = dp_cp_mesh._flatten()
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132632
Approved by: https://github.com/fegin, https://github.com/wanchaol
ghstack dependencies: #132310, #132311, #132339
2024-08-08 06:46:42 +00:00
0e8541766f [ts-migration]: Support quantized operation transformation (#131915)
#### Description
Transform quantized operation properly. Add de/quantization before and after the quantized operation.

#### Test Plan
`pytest test/export/test_converter.py -s -k test_ts2ep_convert_quantized_model`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131915
Approved by: https://github.com/angelayi
2024-08-08 06:34:53 +00:00
9e584d0c05 [BE] Test foreach optimizer for FSDP1 optimizer state_dict (#132933)
Summary:
When fixing https://github.com/pytorch/pytorch/issues/130810, we suspected FSDP1 optimizer state_dict cannot handle foreach optimizer, which is not correct. For FSDP1, whether optimizer uses foreach or not does not matter. Since we already have tests for non-foreach version optimizer, this PR changes the distributed state_dict tests for FSDP1 to use foreach optimizer.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132933
Approved by: https://github.com/c-p-i-o
ghstack dependencies: #132908
2024-08-08 06:13:10 +00:00
a270800f0b [export][reland] Add print_readable to unflattened module (#132817)
Reland https://github.com/pytorch/pytorch/pull/128617

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132817
Approved by: https://github.com/pianpwk
2024-08-08 06:05:30 +00:00
745665d8b5 [BE] Using with_temp_dir for test_distributed_checkpoint (#132908)
Fixes https://github.com/pytorch/pytorch/issues/113936
Fixes https://github.com/pytorch/pytorch/issues/113937

The original way to broadcast the path seems to cause desync issues.  `with_temp_dir` has been used for other checkpoint related tests without problems. Change the tests to use `with_temp_dir`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132908
Approved by: https://github.com/awgu, https://github.com/Skylion007
2024-08-08 05:42:19 +00:00
aff48f7378 Autoselect default device in FSDP construction. (#127609)
There are still some differences between CUDA and non-CUDA custom devices when
construct FSDP because CUDA is selected as the default device. For example,
when construct FSDP from CPU model and device_id is not passed, device_handle
will choose CUDA as default device. This PR will autoselect the real device
as the default device.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127609
Approved by: https://github.com/awgu
2024-08-08 05:25:17 +00:00
4a1edbe475 Disable SymDispatchMode when torch.compile'ing (#132433)
Partially addresses https://github.com/pytorch/pytorch/issues/132417

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132433
Approved by: https://github.com/ydwu4
2024-08-08 05:02:43 +00:00
5ae979ab10 [Dynamo] Support torch.autograd._is_checkpoint_valid (#132611)
Hi, we got `torch._dynamo.exc.Unsupported: torch.* op returned non-Tensor bool call_function <function _is_checkpoint_valid at 0x7f0b0d22e290>` while tracing activation [checkpointing function in deepspeed](324ee65cb0/deepspeed/runtime/activation_checkpointing/checkpointing.py (L630)). Consider to add it to constant_folding list which is similar with https://github.com/pytorch/pytorch/pull/126196

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132611
Approved by: https://github.com/anijain2305, https://github.com/williamwen42
2024-08-08 04:05:08 +00:00
4fd0d594a1 [sym_shapes] Not eval sym expression for printing storage_offset (#132911)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132911
Approved by: https://github.com/ezyang
2024-08-08 03:49:29 +00:00
b483ca05a9 [inductor]a less ambitious way to slove the scalar tensor (#132702)
Fixes #121374

The previous https://github.com/pytorch/pytorch/pull/131775 was trying to convert the 0dim cpu tensor to a DynamicScalar in lowering stage. But there are so many lowering rules uncompatible with that way. So, this PR is trying to do the conversion in codegen stage.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132702
Approved by: https://github.com/eellison
2024-08-08 03:42:21 +00:00
ac6398b630 [FSDP2] Follow-up fix to correct relaxed overlap test (#132953)
The previous PR forgot to include dummy all-gathers before backward, so the reference time was too short, causing the test to still fail.

I verified the test passes locally.

This should close https://github.com/pytorch/pytorch/issues/120961 (again).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132953
Approved by: https://github.com/weifengpy
ghstack dependencies: #132869
2024-08-08 03:24:46 +00:00
636a7c4859 [13/N] Use std::optional (#132527)
Follows #132361

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132527
Approved by: https://github.com/ezyang
2024-08-08 03:16:28 +00:00
fd874b799f [AOTI][refactor] Update MKLDNN ops cpp wrapper support (#132367)
Summary: Set op_overload for MKLDNN ops so that cpp_kernel_name and python_kernel_name are constructed from there. This is an important step towards support those MKLDNN ops in the ABI-compatible mode, because we will need to read schema from op_overload for generating correct fallback op call in C++.

Differential Revision: [D60909798](https://our.internmc.facebook.com/intern/diff/D60909798)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132367
Approved by: https://github.com/leslie-fang-intel, https://github.com/angelayi
2024-08-08 03:02:29 +00:00
c69b2d24e3 [dynamo] Support remove method of set (#132943)
Fixes https://github.com/pytorch/pytorch/issues/132800

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132943
Approved by: https://github.com/anijain2305
2024-08-08 02:43:19 +00:00
194ec49d27 [dynamo][lists][stable diffusion] Do not add source on list slice (#132912)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132912
Approved by: https://github.com/williamwen42
ghstack dependencies: #132806, #132899
2024-08-08 02:23:07 +00:00
45d0e90bd3 [export] Allow str outputs (#132808)
Summary: Fixes https://fb.workplace.com/groups/1075192433118967/permalink/1478413606130179/

Test Plan: CI

Differential Revision: D60850712

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132808
Approved by: https://github.com/ydwu4
2024-08-08 02:20:59 +00:00
4ca616e6d4 Disable sparse tests in export (#132824)
Summary: Dynamo doesn't trace through sparse tensors in fbcode. So we should disable tests that run sparse tensors in export. We should do this to make the CI green internally.

Test Plan:
Before:
Tests finished: Pass 1409. Fail 71. Fatal 0. Skip 90. Build failure 0
After:
Tests finished: Pass 1408. Fail 0. Fatal 0. Skip 162. Build failure 0

Differential Revision: D60870543

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132824
Approved by: https://github.com/BoyuanFeng
2024-08-08 01:45:12 +00:00
fb6b001cde Disable expandable segments IPC in fbcode, because some jobs
seem to be failing. (#132890)

seem to be failing.

https://fb.workplace.com/groups/1405155842844877/permalink/8867182216642165/

Differential Revision: [D60912371](https://our.internmc.facebook.com/intern/diff/D60912371/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132890
Approved by: https://github.com/eqy, https://github.com/ezyang
2024-08-08 01:42:32 +00:00
5709375d56 [AOTI][tooling][1/n] Add intermediate value debug printer (#132323)
Summary:
**Context:**

Currently we have a helper to print out AtenTensor in [shim_common.cpp](https://github.com/pytorch/pytorch/blob/v2.4.0-rc4/torch/csrc/inductor/aoti_torch/shim_common.cpp#L866)

The way we were using this function was a “manual” process. We inject this function into the generated output.cpp file, and recompile and reload the file. This diff automates the printing value process.

**Changes:**

1. Added a simple initial debug printer helper to print out tensor values

2. Added a filter option to selectively dump tensor values.

**Usage:**

Sample cmd :

```
AOT_INDUCTOR_DEBUG_INTERMEDIATE_VALUE_PRINTER=1 TORCHINDUCTOR_FORCE_DISABLE_CACHES=1  TORCHINDUCTOR_ABI_COMPATIBLE=1 TORCH_COMPILE_DEBUG=1 TORCH_LOGS="+inductor, +schedule, output_code"  python test/inductor/test_aot_inductor.py -k test_addmm_abi_compatible_cuda
```

Sample outputs :
```
[  before_launch - triton_poi_fused_0 - buf0  ]:
 0.6331
 1.6358
-0.3459
 1.0196
-0.4122
 1.4279
[ CUDAFloatType{6} ]
Min value: -0.412198
Max value: 1.63582
Device: cuda:0
Size: [6]
Stride: [1]
Dtype: float
Layout: Strided
Number of elements: 6
Is contiguous: 1
Requires grad: 0

[  after_launch - triton_poi_fused_0 - buf0  ]:
 0.6331
 1.6358
-0.3459
 1.0196
-0.4122
 1.4279
[ CUDAFloatType{6} ]
Min value: -0.412198
Max value: 1.63582
Device: cuda:0
Size: [6]
Stride: [1]
Dtype: float
Layout: Strided
Number of elements: 6
Is contiguous: 1
Requires grad: 0

[ before_launch - aoti_torch_cuda_addmm_out - buf1  ]:
Min value: -2.25655
Max value: 2.32996
Device: cuda:0
Size: [16, 6]
Stride: [6, 1]
Dtype: float
Layout: Strided
Number of elements: 96
Is contiguous: 1
Requires grad: 0

[  before_launch - aoti_torch_cuda_addmm_out - buf0  ]:
 0.6331
 1.6358
-0.3459
 1.0196
-0.4122
 1.4279
[ CUDAFloatType{6} ]
Min value: -0.412198
Max value: 1.63582
Device: cuda:0
Size: [6]
Stride: [1]
Dtype: float
Layout: Strided
Number of elements: 6
Is contiguous: 1
Requires grad: 0

[  after_launch - aoti_torch_cuda_addmm_out - buf1  ]:
Min value: -12.0839
Max value: 11.6878
Device: cuda:0
Size: [16, 6]
Stride: [6, 1]
Dtype: float
Layout: Strided
Number of elements: 96
Is contiguous: 1
Requires grad: 0

[  after_launch - aoti_torch_cuda_addmm_out - buf0  ]:
 0.6331
 1.6358
-0.3459
 1.0196
-0.4122
 1.4279
[ CUDAFloatType{6} ]
Min value: -0.412198
Max value: 1.63582
Device: cuda:0
Size: [6]
Stride: [1]
Dtype: float
Layout: Strided
Number of elements: 6
Is contiguous: 1
Requires grad: 0

stats [('calls_captured', 1), ('unique_graphs', 1)]
inductor [('pattern_matcher_count', 2), ('pattern_matcher_nodes', 2), ('extern_calls', 2)]
.
----------------------------------------------------------------------
Ran 1 test in 10.867s

OK

```

The user is able to filter kernel names to print out values by specifying env var `AOT_INDUCTOR_FILTERED_KERNELS_TO_PRINT` and see choices of kernel names in a log message like below:
```
torch/_inductor/graph.py:1642] Finished codegen for all nodes. The list of kernel names available: ['triton_poi_fused_0', 'aoti_torch_cuda_addmm_out']

```

In the follow-up diff, will add `torch.save()` to dump/save the intermediate tensors into individual `.pt` files that can be further  `torch.load()`.

Test Plan:
Run Unit Tests in OSS: (similar cmd as mentioned above in the usage part)

 `AOT_INDUCTOR_DEBUG_INTERMEDIATE_VALUE_PRINTER=1 TORCHINDUCTOR_FORCE_DISABLE_CACHES=1  TORCHINDUCTOR_ABI_COMPATIBLE=1 TORCH_COMPILE_DEBUG=1 TORCH_LOGS="+inductor, output_code"  python test/inductor/test_aot_inductor.py -k test_addmm_abi_compatible_cuda`

Differential Revision: D60538496

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132323
Approved by: https://github.com/ColinPeppler
2024-08-08 01:39:59 +00:00
59f4725b49 [NJT] manually autocast in SDPA handling (#132835)
When autocasting is turned on, right now SDPA w/ NJT won't be autocasted. This PR adds manual "autocasting" logic in sdpa.py - at the beginning, it just checks if autocasting is enabled, and if so, it casts the inputs in the way you would expect if autocasting was actually running.

Why normal autocasting won't work:
* NJT intercepts the `__torch_function__` call for scaled_dot_product_attention, which, AFAIK, happens before we get to any dispatcher logic, and then calls efficient attention or flash attention. So autocasting the scaled_dot_product_attention op won't work; we never call the aten op for scaled_dot_product_attention, so we won't ever run autocasting for it.
* If we try to add autocasting handling for `_flash_attention_forward` or `_efficient_attention_forward`, then autocasting will _run_, but it will have the wrong semantics: sdpa.py's handling will run first, and it will do backend selection based on the uncasted inputs to SDPA. This also means that if the inputs to the SDPA call don't have uniform types, the sdpa.py implementation will fail checks (this is the specific issue we're targeting).

Alternative: "just change the backend selection logic for NJT to be autocast aware, but don't actually do the autocast; then, add `_(flash|efficient)_attention_forward` to autocasting rules". I think this would work too. But it's arguably better to make the backend-selection logic and actual-autocast-behavior use the same implementation, in case the implementations are different.

Differential Revision: [D60879916](https://our.internmc.facebook.com/intern/diff/D60879916)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132835
Approved by: https://github.com/soulitzer
2024-08-08 01:36:57 +00:00
bbf568aac8 Split of "[reland] [export] fix zero arg export in training_ir and constant tensor handling" (#132307)
Summary:
A re-land of D60006710.
Fixed TrainingIRToRunDecomp failures for test_tensor_attribute_zero_args and also a few re-tracability failures because run_decomposition does a retracing.

edit: also remove the eliminate_dead_code() in _unlift because of one onnx test failure:
a constant tensor attr was lifted as constant_tensor input but it's not used in the graph after aot_autograd due to a short cut in its decomposition. This causes the setattr to be removed by eliminate_dead_code but the graph signature still contains the name of that buffer, which causes an inconsitency between the transformed graph and ep's original signature after _unlift. And it seems that this has happened a few times where some nodes are accidentally removed and we're in an inconsistent state.
The alternative of removing it would be: every time we call elimiate_dead_code, we verify the consistency of the graph with 1. the graph before transformation and 2. all the meta datas but i think this deserves a complete design

edit 2: Also fix the inconsistency of graph signatures when param_constant is marked as lifted_tensor_constants but it's registered as parameters in the output of ep.module().

Differential Revision: D60532628

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132307
Approved by: https://github.com/zhxchen17
2024-08-08 01:36:16 +00:00
0f90ffe94a Remove ProcessGroupRoundRobin (#132888)
`_round_robin_process_groups` is deprecated and should be removed.

258f47fc0b/torch/csrc/distributed/c10d/ProcessGroupRoundRobin.cpp (L10-L12)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132888
Approved by: https://github.com/Skylion007, https://github.com/wanchaol, https://github.com/c-p-i-o, https://github.com/fduwjj
2024-08-08 01:07:40 +00:00
5cb05a82b4 [BC breaking] move benchmarking + prefer inductor path (#132827)
move benchmarking out of `torch._inductor.runtime.runtime_utils` and into `torch._inductor.runtime.benchmarking`, and prefer this path over directly accessing Triton's benchmarking

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132827
Approved by: https://github.com/eellison
2024-08-08 00:47:45 +00:00
a9036e1cf8 [inductor] raise unsupport msg in capture_pre_autograd_graph on Windows (#132841)
Debuged with @leslie-fang-intel , and we found that: https://github.com/pytorch/pytorch/issues/132561 and https://github.com/pytorch/pytorch/issues/132569 are all failed by `capture_pre_autograd_graph` not work well on Windows.

So, we added some code to raise message and let end user known that.

Detailed:
For https://github.com/pytorch/pytorch/issues/132561
```cmd
Traceback (most recent call last):
  File "C:\Users\Xuhan\.conda\envs\win_mkl_static\lib\unittest\case.py", line 59, in testPartExecutor
    yield
  File "C:\Users\Xuhan\.conda\envs\win_mkl_static\lib\unittest\case.py", line 591, in run
    self._callTestMethod(testMethod)
  File "C:\Users\Xuhan\.conda\envs\win_mkl_static\lib\unittest\case.py", line 549, in _callTestMethod
    method()
  File "C:\Users\Xuhan\.conda\envs\win_mkl_static\lib\site-packages\torch\testing\_internal\common_utils.py", line 2918, in wrapper
    method(*args, **kwargs)
  File "C:\Users\Xuhan\.conda\envs\win_mkl_static\lib\site-packages\torch\testing\_internal\common_utils.py", line 1515, in wrapper
    fn(*args, **kwargs)
  File "C:\Users\Xuhan\.conda\envs\win_mkl_static\lib\site-packages\torch\testing\_internal\common_quantization.py", line 399, in wrapper
    fn(*args, **kwargs)
  File "D:\xu_git\dnnl_cb\pytorch\test\quantization\pt2e\test_x86inductor_quantizer.py", line 1737, in test_qat_conv2d
    self._test_quantizer(
  File "D:\xu_git\dnnl_cb\pytorch\test\quantization\pt2e\test_x86inductor_quantizer.py", line 553, in _test_quantizer
    m = capture_pre_autograd_graph(
  File "C:\Users\Xuhan\.conda\envs\win_mkl_static\lib\site-packages\torch\_export\__init__.py", line 121, in capture_pre_autograd_graph
    raise RuntimeError("capture_pre_autograd_graph not yet supported on Windows")
RuntimeError: capture_pre_autograd_graph not yet supported on Windows

To execute this test, run the following from the base repo dir:
    python test\quantization\pt2e\test_x86inductor_quantizer.py -k TestQuantizePT2EX86Inductor.test_qat_conv2d

This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
```

For https://github.com/pytorch/pytorch/issues/132569
```cmd
Traceback (most recent call last):
  File "C:\Users\Xuhan\.conda\envs\win_mkl_static\lib\unittest\case.py", line 59, in testPartExecutor
    yield
  File "C:\Users\Xuhan\.conda\envs\win_mkl_static\lib\unittest\case.py", line 591, in run
    self._callTestMethod(testMethod)
  File "C:\Users\Xuhan\.conda\envs\win_mkl_static\lib\unittest\case.py", line 549, in _callTestMethod
    method()
  File "C:\Users\Xuhan\.conda\envs\win_mkl_static\lib\site-packages\torch\testing\_internal\common_utils.py", line 2918, in wrapper
    method(*args, **kwargs)
  File "D:\xu_git\dnnl_cb\pytorch\test\inductor\test_torchinductor.py", line 11218, in new_test
    return value(self)
  File "C:\Users\Xuhan\.conda\envs\win_mkl_static\lib\site-packages\torch\_dynamo\testing.py", line 312, in _fn
    return fn(*args, **kwargs)
  File "C:\Users\Xuhan\.conda\envs\win_mkl_static\lib\contextlib.py", line 79, in inner
    return func(*args, **kwds)
  File "D:\xu_git\dnnl_cb\pytorch\test\inductor\test_cpu_cpp_wrapper.py", line 155, in fn
    _, code = test_torchinductor.run_and_get_cpp_code(
  File "C:\Users\Xuhan\.conda\envs\win_mkl_static\lib\site-packages\torch\_inductor\utils.py", line 1863, in run_and_get_cpp_code
    result = fn(*args, **kwargs)
  File "C:\Users\Xuhan\.conda\envs\win_mkl_static\lib\site-packages\torch\testing\_internal\common_quantization.py", line 415, in wrapper
    fn(*args, **kwargs)
  File "C:\Users\Xuhan\.conda\envs\win_mkl_static\lib\site-packages\torch\testing\_internal\common_quantization.py", line 367, in wrapper
    fn(*args, **kwargs)
  File "D:\xu_git\dnnl_cb\pytorch\test\inductor\test_mkldnn_pattern_matcher.py", line 1668, in test_qlinear_gelu_cpu
    self._qlinear_unary_cpu_test_helper((torch.randn((2, 4)),), gelu)
  File "D:\xu_git\dnnl_cb\pytorch\test\inductor\test_mkldnn_pattern_matcher.py", line 1615, in _qlinear_unary_cpu_test_helper
    self._test_common(
  File "D:\xu_git\dnnl_cb\pytorch\test\inductor\test_mkldnn_pattern_matcher.py", line 165, in _test_common
    convert_model = _generate_qdq_quantized_model(
  File "C:\Users\Xuhan\.conda\envs\win_mkl_static\lib\site-packages\torch\testing\_internal\common_quantization.py", line 2949, in _generate_qdq_quantized_model
    export_model = capture_pre_autograd_graph(
  File "C:\Users\Xuhan\.conda\envs\win_mkl_static\lib\site-packages\torch\_export\__init__.py", line 121, in capture_pre_autograd_graph
    raise RuntimeError("capture_pre_autograd_graph not yet supported on Windows")
RuntimeError: capture_pre_autograd_graph not yet supported on Windows

To execute this test, run the following from the base repo dir:
    python test\inductor\test_cpu_cpp_wrapper.py -k DynamicShapesCppWrapperCpuTests.test_qlinear_gelu_cpu_dynamic_shapes_cpp_wrapper

This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
--------------------------------------------------------------------------------------------------------------------------- Captured stderr call ----------------------------------------------------------------------------------------------------------------------------
W0807 13:24:34.291000 11228 torch\_export\__init__.py:64] +============================+
W0807 13:24:34.291000 11228 torch\_export\__init__.py:65] |     !!!   WARNING   !!!    |
W0807 13:24:34.291000 11228 torch\_export\__init__.py:66] +============================+
W0807 13:24:34.291000 11228 torch\_export\__init__.py:67] capture_pre_autograd_graph() is deprecated and doesn't provide any function guarantee moving forward.
W0807 13:24:34.291000 11228 torch\_export\__init__.py:68] Please switch to use torch.export instead.
```

Co-authored-by: Jiong Gong <jiong.gong@intel.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132841
Approved by: https://github.com/jgong5, https://github.com/ezyang
2024-08-08 00:28:07 +00:00
441c1c03d5 Prevent an unnecessary device -> host copy for CuPy arrays when not explicitly setting a device in torch.as_tensor. (#132595)
See title. Until now, calling `torch.as_tensor` on a CuPy array would return a CPU tensor, when not providing a device. This is most likely not desired.

Fixes #132553

```python3
import torch
import cupy as cp

cupy_arr = cp.asarray([1, 2, 3])

# Default case
t = torch.as_tensor(cupy_arr)
# New behavior, same device as cupy_arr now, was cpu before
print(t.device)  # cuda:0

# Explicitly set device
t = torch.as_tensor(cupy_arr, device='cpu')
print(t.device)  # cpu

# Implicit default device
torch.set_default_device('cpu')
t = torch.as_tensor(cupy_arr)
print(t.device)  # cpu

# Default device via context manager
torch.set_default_device('cuda')
with torch.device('cpu'):
    t = torch.as_tensor(cupy_arr)
    print(t.device)  # cpu

# Unset default device
torch.set_default_device(None)
t = torch.as_tensor(cupy_arr)
# New behavior, same device as cupy_arr now, was cpu before
print(t.device)  # cuda:0
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132595
Approved by: https://github.com/ezyang
2024-08-08 00:26:58 +00:00
374747818d Run performance test non-alternately (#131935)
Summary:
By default, performance tests (speedup experiments) will run the baseline and test backend alternately.

However, this does not work for the torchao backend, which will change the model in-place, therefore the baseline run will also run with torchao backend since the model has already been quantized.

Add a new experiment "latency_experiment" to run performance tests non-alternately (first run baseline for a few iterations, then run the test backend).

other changes:

need to add torch.compiler.cudagraph_mark_step_begin() to avoid the
slowdown from             # Unable to hit fast path of CUDAGraphs because of pending, uninvoked backwards

also updated the torchao APIs to the current versions

X-link: https://github.com/pytorch/benchmark/pull/2394

Test Plan:
python run_benchmark.py torchao --only AlbertForMaskedLM --quantization noquant --performance --inference --bfloat16 --inductor-compile-mode max-autotune python run_benchmark.py torchao --only BartForCausalLM --quantization noquant --performance --inference --bfloat16 --inductor-compile-mode max-autotune python run_benchmark.py torchao --only timm_efficientnet --quantization noquant --performance --inference --bfloat16 --inductor-compile-mode max-autotune

(should all be ~1.0
0.997x
1.006x
0.994x

Reviewed By: xuzhao9

Differential Revision: D60252821

Pulled By: HDCharles

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131935
Approved by: https://github.com/xuzhao9
2024-08-08 00:23:20 +00:00
f16d87eeff Print where raw cprofile lives (#132866)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132866
Approved by: https://github.com/albanD
2024-08-08 00:13:29 +00:00
b73d4b6555 [pipelining] Add schedule runtime for lowered schedule (#130488)
Creates a new runtime that shifts complexity from runtime to
ahead-of-time.

The existing runtime (PipelineScheduleMulti) accepts a
compute-only schedule (forward, backward, weight) actions only are
specified, and it infers the communication operations at runtime.
Compared to that runtime, PipelineScheduleRuntime has less logic that
happens at runtime and relies on lowering passes to transform the
compute-only schedule to add communications.

Advantages include
- easier to verify the correctness by dumping a compute+comm schedule
- posible to manually edit the compute+comm schedule if the lowering
  heuristics are insufficient

Functionality included inside the PipelineScheduleRuntime is limited to
- accepting a compute-only schedule and lowering it to add comms
- executing the compute or comm operations specified by the given
  schedule
- handling work.wait() automatically by calling it just before the
  matching compute operation (for RECV ops) or at the end of step (for
  SEND ops)

Follow ups for later PRs
- Some refactoring should be done to replace PipelineScheduleMulti with
  this runtime
- Optimizer execution is not considered (e.g. for zero-bubble cases)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130488
Approved by: https://github.com/H-Huang
2024-08-08 00:08:03 +00:00
9282e6ca78 Don't use _disable_current_modes as decorator (#132809)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132809
Approved by: https://github.com/albanD
ghstack dependencies: #132801, #132802, #132804
2024-08-07 23:59:46 +00:00
42226ca3a3 Don't use use_lazy_graph_module as decorator (#132804)
See https://github.com/pytorch/pytorch/pull/132073 for motivation

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132804
Approved by: https://github.com/albanD
ghstack dependencies: #132801, #132802
2024-08-07 23:59:46 +00:00
5e4d8eb831 Don't generate stack entry for DebugContext.wrap (#132802)
See https://github.com/pytorch/pytorch/pull/132073 for motivation

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132802
Approved by: https://github.com/albanD
ghstack dependencies: #132801
2024-08-07 23:59:38 +00:00
708a99e52a Stop using with_fresh_cache_if_config as decorator (#132801)
See https://github.com/pytorch/pytorch/pull/132073 for motivation

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132801
Approved by: https://github.com/albanD
2024-08-07 23:59:32 +00:00
c3e51c09ed [PP] Add get_schedule_class util (#132768)
Add a function to map a string to a class instance for schedules. This allows users to select a schedule based on a string command line argument and removes the need for glue code (e.g. in torchtitan)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132768
Approved by: https://github.com/c-p-i-o, https://github.com/fduwjj
2024-08-07 23:51:03 +00:00
383f2ac914 AutoHeuristic: mixed_mm H100 heuristic (#132685)
H100 heuristic for mixed_mm. Performance looks similar to A100 heuristic.
```
  set     crit  max_depth  min_samples_leaf  correct  wrong  unsure  total  wrong_max_spdup  wrong_gman_spdup  max_spdup_default  gman_spdup_default  max_slowdown_default  non_default_preds  default_better
train  entropy          5              0.01     1562    604     145   2311         1.522201          1.077722          10.399141            3.134170              1.034802               2061               2
 test  entropy          5              0.01      361    164      24    549         1.443590          1.079169           8.159173            3.105360              1.197973                500               2
```

gpt-fast speedups
|batch size|prompt length| fallback    |  heuristic  | speedup |
|----------|-------------|------------:|------------:|--------:|
|     1    |      7      |      109.95  |       220.63|  2      |
|     1    |     11      |      109.65  | 	    210.92|  1.92   |
|     4    |      7      |       149.04 |       625.80|  4.19   |
|     4    |     11      |       149.56 |       494.64|  3.30   |
|     8    |      7      |       293.68 |       956.72|  3.25   |
|     8    |     11      |       294.48 |       925.60|  3.14   |

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132685
Approved by: https://github.com/eellison
2024-08-07 23:48:01 +00:00
c327710a87 [export] Publicize validate function (#132777)
as titled

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132777
Approved by: https://github.com/zhxchen17
2024-08-07 23:10:05 +00:00
21d4c48059 Allow distributed breakpoint to skip the first few calls (#129511)
Summary:
PDB allows to do conditional breakpoint but the ability won't work in the distributed environment. We can still do conditional breakpoint by doing the following:

```
counter = 0

global counter
count += 1
if counter > 100:
  dist.breakpoint()
```

This PR makes dist.breakpoint() support this feature as a syntax sugar.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129511
Approved by: https://github.com/wconstab, https://github.com/c-p-i-o
2024-08-07 21:57:37 +00:00
acad2050c1 [easy][dynamo] Add tx as an arg in getitem_const (#132899)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132899
Approved by: https://github.com/yanboliang
ghstack dependencies: #132806
2024-08-07 21:35:41 +00:00
700a11fdd4 Make inductor kernel metadata comments more descriptive (#126698)
Summary:

A couple of improvements to the generated comments in inductor kernels:

1. Makes the nodes in the comment topologically sorted, I think having them
   alphabetically sorted is a gotcha. I was always confused on why the
   sorting in the comments did not match the code.
2. Adds a printout of the aten graph fragment corresponding to the
   current inductor kernel, to make it easier to map from aten
   code to inductor code

Example float8-overhead-related inductor kernel comment after this PR:

```
# kernel path: /tmp/torchinductor_vasiliy/27/c27ts3rdw56ns7od5j6ovdnhxphished2lcu3adclzzixoo7khg5.py
# Source Nodes: [weight_fp8], Original ATen: [aten.mul, aten.clamp, aten._to_copy]
# Source node to ATen node mapping:
#   weight_fp8 => clamp_max_1, clamp_min_3, convert_element_type_10, convert_element_type_11, convert_element_type_9, mul_3
# Graph fragment:
#   %mul_3 : [num_users=1] = call_function[target=torch.ops.aten.mul.Tensor](args = (%primals_2, %convert_element_type_8), kwargs = {})
#   %convert_element_type_9 : [num_users=1] = call_function[target=torch.ops.prims.convert_element_type.default](args = (%mul_3, torch.float32), kwargs = {})
#   %clamp_min_3 : [num_users=1] = call_function[target=torch.ops.aten.clamp_min.default](args = (%convert_element_type_9, -448.0), kwargs = {})
#   %clamp_max_1 : [num_users=1] = call_function[target=torch.ops.aten.clamp_max.default](args = (%clamp_min_3, 448.0), kwargs = {})
#   %convert_element_type_10 : [num_users=1] = call_function[target=torch.ops.prims.convert_element_type.default](args = (%clamp_max_1, torch.bfloat16), kwargs = {})
#   %convert_element_type_11 : [num_users=1] = call_function[target=torch.ops.prims.convert_element_type.default](args = (%convert_element_type_10, torch.float8_e4m3fn), kwargs = {})
triton_poi_fused__to_copy_clamp_mul_5 = async_compile.triton('triton_', '''
```

Test Plan:

Reviewers:

Subscribers:

Tasks:

Tags:

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126698
Approved by: https://github.com/ezyang
ghstack dependencies: #126573
2024-08-07 21:25:09 +00:00
48f7bdbbe1 aot_autograd: copy metadata from fw to bw nodes (#126573)
Summary:

Uses the `seq_nr` field (introduced to aot_autograd nodes in
https://github.com/pytorch/pytorch/pull/103129) to map the aot_autograd
fx bw nodes to the corresponding fw nodes, and copy the metadata over.

I am trusting the `seq_nr` mapping in the linked PR here. I did
some validation with a toy LLaMa 3 8b training run and the mapping seemed
correct.

I am also trusting that the forward is single threaded, since `seq_nr` is thread local.  If this isn't always true, we'll need to also plumb `thread_id` through the same machinery which is populating `seq_nr`.

I'd like to use this data in a future PR to make inductor kernels easily
attributable to the nn.Module path in modeling land, to make it easier
to do performance debugging.

Test Plan:

```
// 1. unit test
python test/dynamo/test_aot_autograd.py -k test_aot_sequence_nr

// 2. manual test
// run LLaMa 3 8B fw + bw with torch.compile, print out the inductor graphs
// seen in `torch/_inductor/utils.py::get_kernel_metadata`, they seemed
// right to me.
```

Reviewers:

Subscribers:

Tasks:

Tags:

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126573
Approved by: https://github.com/ezyang, https://github.com/bdhirsh
2024-08-07 21:25:09 +00:00
260e7cb143 Make CUDA device properties's __repr__ output actually printable (#132863)
Previously we would write the UUID bytes directly, leading to 'invalid
UTF-8 sequence' errors.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132863
Approved by: https://github.com/Skylion007, https://github.com/eqy
2024-08-07 21:08:43 +00:00
525fdc0f95 [docs] fix incorrect example in convert_conv3d_weight_memory_format (#129318)
The current example fails when using `torch.channels_last`, and the docs are slightly incorrect for the 3d case.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129318
Approved by: https://github.com/albanD
2024-08-07 20:06:59 +00:00
6a348e5e57 [CUDAGraph] Warn once if too many distinct sizes (#132832)
Warn once if there are too many distinct sizes for cudagraph, so we can avoid spamming logs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132832
Approved by: https://github.com/eellison
2024-08-07 19:48:06 +00:00
e76bd0b603 [BE] put "show_dispatch_trace()" print logic in .cpp file (#132717)
I find myself occasionally trying to modify this to get additional debug info. Recompiling takes forever after modifying these lines, because the .h file is depended on by a huge number of files.

If we move this logic into a helper function and put it in the .cpp file, recompilation will be a lot faster when adding debug here.

Tested with a local DEBUG=1 build (which is needed to use `TORCH_SHOW_DISPATCH_TRACE=1`) and verified basic sanity - i.e. it still prints `[call]`, etc.

Differential Revision: [D60804331](https://our.internmc.facebook.com/intern/diff/D60804331)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132717
Approved by: https://github.com/soulitzer, https://github.com/bdhirsh
2024-08-07 19:43:29 +00:00
7830373662 Update owner for BC test (#132891)
Add @larryliu0820 to `/test/forward_backward_compatibility/check_forward_backward_compatibility.py`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132891
Approved by: https://github.com/albanD
2024-08-07 19:42:04 +00:00
59bbaea3a7 [inductor] disable capture_pre_autograd_graph related UTs on Windows (#132848)
Contined to https://github.com/pytorch/pytorch/pull/132841

We disable `capture_pre_autograd_graph` related UT on Windows.
Disable `test_lstm_packed_change_input_sizes` and `test_multihead_attention` UTs on Windows.

**TODO:**
Turn on them after fix `capture_pre_autograd_graph` issue on Windows.

## Local Test:
Linux is not skiped:
<img width="1387" alt="image" src="https://github.com/user-attachments/assets/28dfbb4b-d9c0-4d5b-be84-d7b3697bcd3f">

And we can skiped them on Windows:
<img width="853" alt="image" src="https://github.com/user-attachments/assets/e96ebcf8-9bf3-43aa-93fd-fb33d3743573">

Co-authored-by: Jiong Gong <jiong.gong@intel.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132848
Approved by: https://github.com/jgong5, https://github.com/desertfire
2024-08-07 19:38:03 +00:00
7ea8374c0e nn.ModuleList.__getitem__ overloads (#132834)
Overloads so that you can get more specific type info based on how you are indexing.

```python
from torch import nn

module_list = nn.ModuleList(32 * [nn.Linear(2, 2)])

# before:
reveal_type(module_list[0])  # Type of "module_list[0]" is "Module | ModuleList"
reveal_type(module_list[:1])  # Type of "module_list[: 1]" is "Module | ModuleList"

# now:
reveal_type(module_list[0])  # Type of "module_list[0]" is "Module"
reveal_type(module_list[:1])  # Type of "module_list[: 1]" is "ModuleList"
```
Co-authored-by: Skylion007 <Skylion007@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132834
Approved by: https://github.com/Skylion007, https://github.com/albanD
2024-08-07 19:25:23 +00:00
83fa7f871f Work around item non-sync issue on AMD (#132772)
Differential Revision: D59669714

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132772
Approved by: https://github.com/ZhengkaiZ, https://github.com/izaitsevfb
2024-08-07 18:58:11 +00:00
ff81ca8e0c Revert "Populate submodules of torch._C to sys.modules recursively (#132216)"
This reverts commit 672ce4610e41386da9763e07375b0879dc351905.

Reverted https://github.com/pytorch/pytorch/pull/132216 on behalf of https://github.com/PaliC due to was breaking internal builds ([comment](https://github.com/pytorch/pytorch/pull/132216#issuecomment-2274112397))
2024-08-07 18:45:00 +00:00
4fe6a5dc34 Move slow tests to be in repo (#132379)
Move the slow test json to be in the pytorch/pytorch repo and make a job that will update it weekly.  The job uses the same environment as the commit hash.  It uses similar code to the hash updates, but the hash update contains a lot of code that is specific to the hash update, so I chose to pick out the parts that are relevant

Remove references to the old file and set up testing to read from the new file instead

The old update cadence was every day, the new one is every week

The auto slow test infra + the lack of pinning between pytorch and test-infra makes it really hard to tell if a test started failing because of a change or because of the slow test json changing.  While this can have benefits, like disable test issues being effective everywhere immediately, it can also be very confusing, especially since we don't have the same insight into slow tests like we do for disable issues.

Example PR made: https://github.com/pytorch/pytorch/pull/132383 (with all the changes from this PR because it was working on top of this)

We should just get rid of this at some point in favor of the slowTest decorator, but there are some tests that take 5+ minutes to run and I don't want to track them down right now
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132379
Approved by: https://github.com/huydhn
2024-08-07 18:42:56 +00:00
26b0011fb8 [XPU][Kineto Submodule] Introduce kineto-based XPU profiler (#130811)
As XPU became a PyTorch built-in device, the profiler support is indispensable part of functionality completeness. This PR is associated with the PR to introduce XPU profiler plugin into the kineto. When USE_XPU is enabled, the LIBKINETO_NOXPUPTI option will be suppressed accordingly, which allows kineto to build with XPU profiler plugin.

Associated PR to introduce kineto-based XPU profiler into kineto:
https://github.com/pytorch/kineto/pull/961

Also updates the Kineto Submodule to include XPU changes.

Co-authored-by: Aaron Enye Shi <enye.shi@gmail.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130811
Approved by: https://github.com/aaronenyeshi
2024-08-07 18:41:37 +00:00
07551887b8 Revert "Disable SymDispatchMode when torch.compile'ing (#132433)"
This reverts commit 63eb06c0512b636a34caf041eab6fbc0726fc7ee.

Reverted https://github.com/pytorch/pytorch/pull/132433 on behalf of https://github.com/PaliC due to We need to now revert https://github.com/pytorch/pytorch/pull/132216 in OSS and there is a dependency on this pr ([comment](https://github.com/pytorch/pytorch/pull/132433#issuecomment-2274105080))
2024-08-07 18:41:28 +00:00
ca713b8393 llvm update for backward-breaking APIs in 18 and 19 (#132825)
Related to #130661, #129797.  Based on the LLVM tagged releases, these LLVM_VERSION_MAJOR guards are accurate.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132825
Approved by: https://github.com/dcci, https://github.com/Skylion007
2024-08-07 18:31:40 +00:00
a9ff190867 Revert "Consolidate SymDispatchMode into ProxyTensorMode (#132674)"
This reverts commit ffdf48e63b94930c81f05b06444721109d0b243d.

Reverted https://github.com/pytorch/pytorch/pull/132674 on behalf of https://github.com/PaliC due to We need to now revert https://github.com/pytorch/pytorch/pull/132216 in OSS and there is a dependency on this pr ([comment](https://github.com/pytorch/pytorch/pull/132674#issuecomment-2274062785))
2024-08-07 18:25:33 +00:00
9d476fee53 Revert "[BE] Simplify code interacting with get_proxy_mode/enable_tracing (#132675)"
This reverts commit c2bccfd4311fe905ff78c0977281b8e642bb10d6.

Reverted https://github.com/pytorch/pytorch/pull/132675 on behalf of https://github.com/PaliC due to We need to now revert https://github.com/pytorch/pytorch/pull/132216 in OSS and there is a dependency on this pr ([comment](https://github.com/pytorch/pytorch/pull/132674#issuecomment-2274062785))
2024-08-07 18:25:33 +00:00
f2ad3c89b0 fix dtype mismatch in lobpcg eigen solver (#132762)
Fixes #132761

If rerr value is_complex, test against the real part. Since the rerr variable holds a norm calculation, the imaginary part will be 0.0.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132762
Approved by: https://github.com/albanD
2024-08-07 18:20:46 +00:00
1749025081 Revert "Fix infinite recursion while walking to submodules (#132763)"
This reverts commit 063a45ed27c3001bba44ea2161d188ec2314d428.

Reverted https://github.com/pytorch/pytorch/pull/132763 on behalf of https://github.com/PaliC due to We need to now revert https://github.com/pytorch/pytorch/pull/132216 in OSS and there is a dependency on this pr ([comment](https://github.com/pytorch/pytorch/pull/132763#issuecomment-2274059792))
2024-08-07 18:20:27 +00:00
25df063f04 [dynamo][user_defined][stable-diffusion] Raise ObservedAttributeError on UserDefinedObject var_getattr (#132806)
Fixes https://github.com/pytorch/pytorch/issues/132551

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132806
Approved by: https://github.com/williamwen42
2024-08-07 18:19:49 +00:00
40ce0a53bb [FSDP][dtensor] add FSDP2+TP distributed state dict test (#131408)
**Test**
`pytest test/distributed/_composable/fsdp/test_fully_shard_training.py`
`pytest test/distributed/_composable/fsdp/test_fully_shard_state_dict.py`
`pytest test/distributed/checkpoint/fsdp/test_fsdp_dsd.py`
`pytest test/distributed/_composable/fsdp/test_fully_shard_init.py`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131408
Approved by: https://github.com/fegin
ghstack dependencies: #126697, #130239, #132391
2024-08-07 18:17:12 +00:00
ad0ce89050 [3/N][dtensor] Strided Sharding offset calculation util (#132391)
**Summary**
1. change `compute_local_shape_and_global_offset` to correctly compute shape and offset for strided sharding placement (currently it only handles 2D and some 3D+ sharding).
2. Add a new property `num_shards_map` to `DTensorSpec` denoting how many shards each tensor dimension has. This is necessary for constructing `_StridedShard` placement when we call `distribute_tensor(dtensor_tp, dp_device_mesh, [Shard(0)])` and the `split_factor` argument will just be the number of shards on that sharding tensor dim.

**Test**
`test/distributed/_tensor/test_utils.py`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132391
Approved by: https://github.com/wanchaol
ghstack dependencies: #126697, #130239
2024-08-07 18:17:12 +00:00
0b0c660c02 [2/N][dtensor] Strided Sharding shard_to_replicate (#130239)
** Summary **
This PR adds the necessary util function to `_StridedShard` for correct shard-to-replicate resharding.

**Test**
`pytest test/distributed/_tensor/test_utils.py -s -k strided_sharding`
`pytest test/distributed/_tensor/test_utils.py -s -k test_fsdp2_tp_2d_dtensor_local_shards_and_offsets`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130239
Approved by: https://github.com/wanchaol
ghstack dependencies: #126697
2024-08-07 18:17:06 +00:00
92a17f454a [1/N][dtensor] introduce StridedShard placement type and _split_tensor() logic (#126697)
**Summary**
This PR adds a new private placement type `_StridedShard` for FSDP2 + TP style tensor sharding. The previously used `Shard` placement type cannot produce correct `full_tensor()` result because it assumes the tensor to be first sharded over `dp` mesh dimension then `tp` mesh dimension which does not hold true in FSDP2 + TP case.

**Test**
`pytest test/distributed/_tensor/test_utils.py -s -k strided_sharding`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126697
Approved by: https://github.com/wanchaol
2024-08-07 18:17:02 +00:00
123d9ec5bf Revert "Loads .pyd instead of .so in MemPool test for windows (#132749)"
This reverts commit 37ab0f33854fafdf9bf4f575260329ffcd960d13.

Reverted https://github.com/pytorch/pytorch/pull/132749 on behalf of https://github.com/syed-ahmed due to Seems like periodic is still failing: 7c79e89bc5 ([comment](https://github.com/pytorch/pytorch/pull/132749#issuecomment-2274041302))
2024-08-07 18:08:44 +00:00
a62710c820 [FSDP2] Relaxed overlap test to address CI flakiness (#132869)
This tries to fix https://github.com/pytorch/pytorch/issues/120961.

This is a similar situation as https://github.com/pytorch/pytorch/pull/132116. The overlap tests were written strictly based on a precise calculation of what compute/communication should be non-overlapped vs. overlapped. This is done via `torch.cuda._sleep()`, which takes inputs in cycles, so we must convert from milliseconds to cycles via `get_cycles_per_ms()`, which is computed once and cached. Variation in CI can cause this `get_cycles_per_ms()` value to be inaccurate when the FSDP overlap tests run. Thus, we decide to relax the overlap tests to just make sure the overlapped runs are faster than a baseline without overlap.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132869
Approved by: https://github.com/weifengpy
2024-08-07 17:37:03 +00:00
cyy
32a284c275 [9/N] Fix clang-tidy warnings in aten/src/ATen (#132842)
Follows #132728

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132842
Approved by: https://github.com/Skylion007
2024-08-07 16:54:21 +00:00
ffd0d92c18 fix autotuning init issues (#132837)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132837
Approved by: https://github.com/yanboliang
2024-08-07 16:36:47 +00:00
8b50d5398f [DeviceMesh] Create new group for 1D mesh when default backend is 'gloo' and 'cuda' is available (#132709)
More context in [#132471](https://github.com/pytorch/pytorch/issues/132471) and https://github.com/pytorch/pytorch/issues/132366.

TLDR:
When cuda is available and users move tensors to cuda, we cannot really reuse the default pg if default pg is gloo, as lots of collectives are not supported on gloo for cuda tensors. For example, `dtensor.full_tensor()` would result in a mysterious SIGTERM when all_gather a cuda tensor using gloo. Without the change in this PR, users would have to know the context and explicitly move the cuda tensor to cpu before invoking most collectives, which I think is not so ideal UX.

Therefore, given most collectives are not supported on gloo for cuda tensors, we should init a new pg if the default pg is gloo when torch.cuda.is_available() and device_type is cuda.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132709
Approved by: https://github.com/awgu, https://github.com/wanchaol
2024-08-07 16:13:11 +00:00
258f47fc0b Add padding_side to pad_sequence with "left" and "right" options ("right" as default) (#131884)
Fixes #10536

Reattempt of #61467. Thank you so much to @mskoh52 for your excellent work!

As I was trying to create a more efficient LLM data collator, I realized that `pad_sequence` only supports right padding, even though left padding is a very common format for LLMs, like Llama and Mistral.

The proposed alternative implementation was to use multiple flips, which tends to be 1.5x-2x slower. Instead we can add a [`padding_side` parameter as there is for for Hugging Face tokenizers](9d6c0641c4/src/transformers/tokenization_utils_base.py (L1565)), which requires only a very small change in the C++ code.

Here are the benchmarks of the new implementation!

`float32`:

![eaaa95ef-9384-45d2-be56-6898bc1d3514](https://github.com/user-attachments/assets/3b0eb309-e5a0-4a4d-97bb-4e3298783dbb)

`bool`:

![892f32da-8d9a-492b-9507-18d3f0a41e8e](https://github.com/user-attachments/assets/6824ea15-7d4e-4b89-95f0-8546635f0c2e)

Code:

```python
from __future__ import annotations

import random
import time
from typing import Literal

import numpy as np
import torch

def pad_sequence_with_flips(
    sequences: list[torch.Tensor],
    batch_first: bool = False,
    padding_value: int | float | bool = 0.0,
    padding_side: Literal["left", "right"] | str = "left",
) -> torch.Tensor:
    if padding_side == 'right':
        padded_sequence = torch._C._nn.pad_sequence([t.flatten() for t in sequences], batch_first=batch_first, padding_value=padding_value)
    elif padding_side=='left':
        padded_sequence = torch._C._nn.pad_sequence([t.flatten().flip(0) for t in sequences], batch_first=batch_first, padding_value=padding_value)  # pyright: ignore[reportArgumentType]
        padded_sequence = padded_sequence.flip(int(batch_first))
    else:
        raise ValueError(f"padding_side should be either 'right' or 'left', but got {padding_side}")

    return padded_sequence

sequence_lengths: list[int] = []

flip_left_pad_times: list[float] = []
flip_left_pad_times_std: list[float] = []

left_pad_times: list[float] = []
left_pad_times_std: list[float] = []

RUNS_PER_LOOP: int = 100

for i in range(1, 7):
    sequence_length = i * int(1e6) // 6
    sequence_lengths.append(sequence_length)

    sequences = [torch.randint(0, 2, (random.randint(1, sequence_length),), dtype=torch.bool) for _ in range(64)]

    inner_left_pad_times: list[float] = []
    inner_right_pad_times: list[float] = []

    inner_flip_left_pad_times: list[float] = []
    inner_flip_right_pad_times: list[float] = []

    for _ in range(RUNS_PER_LOOP):

        start = time.perf_counter()
        torch._C._nn.pad_sequence(sequences, batch_first=True, padding_value=False, padding_side="left")
        end = time.perf_counter()
        inner_left_pad_times.append(end - start)

        start = time.perf_counter()
        pad_sequence_with_flips(sequences, batch_first=True, padding_value=False, padding_side="left")
        end = time.perf_counter()
        inner_flip_left_pad_times.append(end - start)

    left_pad_times.append(sum(inner_left_pad_times) / len(inner_left_pad_times))
    left_pad_times_std.append(np.std(inner_left_pad_times))

    flip_left_pad_times.append(sum(inner_flip_left_pad_times) / len(inner_flip_left_pad_times))
    flip_left_pad_times_std.append(np.std(inner_flip_left_pad_times))

    print(f"Sequence Length: {sequence_length}, Left Pad Time: {left_pad_times[-1]}, Left with Flips Pad Time: {flip_left_pad_times[-1]}")

import matplotlib.pyplot as plt

plt.plot(sequence_lengths, left_pad_times, label="new pad_sequence left")
plt.scatter(sequence_lengths, left_pad_times)
plt.errorbar(sequence_lengths, left_pad_times, yerr=left_pad_times_std, linestyle='None', marker='^')

plt.plot(sequence_lengths, flip_left_pad_times, label="old pad_sequence left (2 flips)")
plt.scatter(sequence_lengths, flip_left_pad_times)
plt.errorbar(sequence_lengths, flip_left_pad_times, yerr=flip_left_pad_times_std, linestyle='None', marker='^')

plt.xlabel("Sequence Length")
plt.ylabel("Time (s)")
plt.legend(loc="upper right")

# Sequence Length: 166666, Left Pad Time: 0.06147645162009212, Left with Flips Pad Time: 0.09842291727001794
# Sequence Length: 333333, Left Pad Time: 0.08933195920990329, Left with Flips Pad Time: 0.15597836187991562
# Sequence Length: 500000, Left Pad Time: 0.08863158334006585, Left with Flips Pad Time: 0.15224887342999863
# Sequence Length: 666666, Left Pad Time: 0.10524682551997103, Left with Flips Pad Time: 0.18177212480995877
# Sequence Length: 833333, Left Pad Time: 0.11801802741003485, Left with Flips Pad Time: 0.20821274195001024
# Sequence Length: 1000000, Left Pad Time: 0.131894061660023, Left with Flips Pad Time: 0.23223503091008751
```

Co-authored-by: mskoh52 <mskoh52@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131884
Approved by: https://github.com/ezyang
2024-08-07 15:53:07 +00:00
780310fed7 Revert "Only thunkify proxies in some situations (#132421)"
This reverts commit bb99008c9e7c357b88047bcd6971dc2078341484.

Reverted https://github.com/pytorch/pytorch/pull/132421 on behalf of https://github.com/clee2000 due to I think this broke dynamo/test_subclasses.py::TestNestedTensor::test_in_graph_construction_from_input [GH job link](https://github.com/pytorch/pytorch/actions/runs/10283744685/job/28459340678) [HUD commit link](bb99008c9e).  Test got added in f50621989b which is before your merge base ([comment](https://github.com/pytorch/pytorch/pull/132421#issuecomment-2273742960))
2024-08-07 15:29:54 +00:00
de9b8a42c1 Revert "Add support for other backends in get_preferred_device (#132118)"
This reverts commit c184ac0f6b6d2482cf300d852fde6370a1c1e086.

Reverted https://github.com/pytorch/pytorch/pull/132118 on behalf of https://github.com/clee2000 due to I think this broke distributed/checkpoint/test_file_system_checkpoint_cpu.py::TestDistributedReshardOnLoad::test_load_rowwise_to_colwise_thread_count_1 [GH job link](https://github.com/pytorch/pytorch/actions/runs/10279901233/job/28456599072) [HUD commit link](c184ac0f6b).  Dr CI classification is wrong, the failure is not flaky ([comment](https://github.com/pytorch/pytorch/pull/132118#issuecomment-2273729288))
2024-08-07 15:22:42 +00:00
cyy
13fa59580e Enable clang-tidy on aten/src/ATen/cpu (#132830)
Expands code coverage of clang-tidy to aten/src/ATen/cpu

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132830
Approved by: https://github.com/Skylion007
2024-08-07 14:44:17 +00:00
ed97fb77f9 Conversions between strided and jagged layouts for Nested Tensors (#115749)
This PR does 3 things:
1. Adds a copy-free strided->jagged layout conversion for NT
2. Adds a copy-free jagged->strided layout conversion for NT
3. Modifies and expands the .to() API to support the layout argument for the specific case of NT layout conversion.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115749
Approved by: https://github.com/jbschlosser
2024-08-07 14:18:53 +00:00
fb146fc3c6 Only store necessary tensor_dict fields in node meta (#132805)
Fixes #132290

This PR attempts a more invasive / complete solution than the one from #132338, which removes immediate tensor fields from the `tensor_dict` copy stored in node meta. The approach taken here is to store only those fields of the `tensor_dict` which are absolutely utilized somewhere else.

So far, this appears to be limited to:
* `_dynamo_static_input_type`
* `tag` (at least in the tests). Discussion at #94080 appears to indicate this is depended on for export

(CI may point out more)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132805
Approved by: https://github.com/mlazos
2024-08-07 13:35:16 +00:00
7c79e89bc5 Stop using clear_frame as decorator (#132778)
See https://github.com/pytorch/pytorch/pull/132073 for motivation

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132778
Approved by: https://github.com/albanD
ghstack dependencies: #132774
2024-08-07 11:53:18 +00:00
bb99008c9e Only thunkify proxies in some situations (#132421)
The goal of this PR is to avoid stack overflow when we create extremely long chains of thunks, and then evaluate them (e.g., as occurs if you sum(long list of symint)). The basic idea behind this PR is to only thunkify proxies if they're being created in places where they may or may not be used--crucially, symint operations that occur in user code we are tracing are eagerly placed into the graph, even if they may eventually be dead.

I annotated the PR with explanation of changes.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132421
Approved by: https://github.com/Skylion007, https://github.com/zou3519
ghstack dependencies: #132674, #132675
2024-08-07 11:51:17 +00:00
32f9a809c7 Replace [[unlikely]] with unlikely(x) (#130816)
Do not use `[[unlikely]]` as its c++20 language features, see https://en.cppreference.com/w/cpp/language/attributes/likely

Fixes https://github.com/pytorch/pytorch/issues/130815

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130816
Approved by: https://github.com/jgong5, https://github.com/jansel, https://github.com/malfet
2024-08-07 10:38:13 +00:00
8c8eb9670a [CI] Enable inductor UT test on avx512 (#132645)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132645
Approved by: https://github.com/desertfire
2024-08-07 10:22:40 +00:00
37ab0f3385 Loads .pyd instead of .so in MemPool test for windows (#132749)
Fixes #132650

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132749
Approved by: https://github.com/albanD
2024-08-07 09:58:52 +00:00
8333ecf085 Support hasattr tracing for more PythonModuleVariable (#132731)
Fixes #132237

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132731
Approved by: https://github.com/EikanWang, https://github.com/yanboliang
2024-08-07 09:15:17 +00:00
c8c964f950 [inductor] check best templates first for fusions (#132829)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132829
Approved by: https://github.com/eellison
2024-08-07 07:48:00 +00:00
c184ac0f6b Add support for other backends in get_preferred_device (#132118)
Currenlty get_preferred_device supports only cuda and cpu. Add support for other backends using backend config.

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132118
Approved by: https://github.com/awgu
2024-08-07 07:19:20 +00:00
87053132ea [DeviceMesh] Remove parent mesh concept from _MeshEnv and replace by root mesh (#132339)
Previously, when we slice out a submesh from a mesh, we assign the mesh as the parent mesh of the submesh. In this case, when we have a 3D mesh topology, the parent mesh of a 1D mesh sliced out from the 3D mesh is different from the parent mesh of the same 1D mesh sliced out from the 2D submesh of the 3D mesh. For example:
```
mesh_3d = init_device_mesh("cuda", (2,2,2), ("dim0", "dim1", "dim2"))
mesh_dim0 = mesh_3d["dim0"]

mesh_2d = mesh_2d["dim0", "dim1"]
mesh_dim0_2 =  mesh_2d["dim0_2"]

# This would evaluate to be True
print(_mesh_resources.get_parent_mesh(mesh_dim0) != _mesh_resources.get_parent_mesh(mesh_dim0))
```

We can always reconstruct the mesh needed from the mesh dim names, as long as two dims come from the same root. For simplicity, we do not see the necessity of building a tree structure to represent child-parent relationship. Therefore, we are replacing the parent mesh concept with a root mesh concept in `_MeshEnv` so we would have:

```
mesh_3d = init_device_mesh("cuda", (2,2,2), ("dim0", "dim1", "dim2"))
mesh_dim0 = mesh_3d["dim0"]

mesh_2d = mesh_2d["dim0", "dim1"]
mesh_dim0_2 =  mesh_2d["dim0_2"]

# This would evaluate to be True
print(_mesh_resources.get_root_mesh(mesh_dim0) == _mesh_resources.get_root_mesh(mesh_dim0))
```
With this change, we will have two types of meshes in an environment.
1. `device_mesh != _mesh_resources.get_root_mesh(device_mesh)` means that the device_mesh is created by slicing.
2. `device_mesh == _mesh_resources.get_root_mesh(device_mesh)` means that the device_mesh is a root mesh not created through slicing.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132339
Approved by: https://github.com/wanchaol
ghstack dependencies: #132310, #132311
2024-08-07 07:01:12 +00:00
dc00eeb0f4 [Dynamo] fix incorrect kwargs in create_proxy (#132723)
## Summary
Fix https://github.com/pytorch/pytorch/issues/132642, the implementation of `create_proxy` requires to pass-in `kwargs` explicitly.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132723
Approved by: https://github.com/aorenste
2024-08-07 06:26:24 +00:00
2206a3de00 [Compile] Speedup int8-to-float conversion on aarch64 (#132676)
With this change following snippet:
```cpp
#include <ATen/cpu/vec/vec.h>

void int8tofloat(int8_t* in, float* out) {
        auto tmp0 = at::vec::Vectorized<int8_t>::loadu(in, 8);
        auto tmp1 = at::vec::convert<float>(tmp0);
        tmp1.store(out);
}
```, which is core of the algorithm generated by cpu_inductor for the following compiled function:
```python
@torch.compile
def to_float(x):
  return x.to(torch.float)
```

changes from
```assembly
int8tofloat(signed char*, float*):
0000000000000000	stp	x29, x30, [sp, #-0x10]!
0000000000000004	mov	x29, sp
0000000000000008	sub	x9, sp, #0x30
000000000000000c	and	sp, x9, #0xffffffffffffffe0
0000000000000010	adrp	x8, 0 ; 0x0
0000000000000014	ldr	x8, [x8]
0000000000000018	ldr	x8, [x8]
000000000000001c	str	x8, [sp, #0x28]
0000000000000020	ldr	s0, [x0]
0000000000000024	sshll.8h	v0, v0, #0x0
0000000000000028	sshll.4s	v0, v0, #0x0
000000000000002c	scvtf.4s	v0, v0
0000000000000030	str	q0, [sp]
0000000000000034	ldr	s0, [x0, #0x4]
0000000000000038	sshll.8h	v0, v0, #0x0
000000000000003c	sshll.4s	v0, v0, #0x0
0000000000000040	scvtf.4s	v0, v0
0000000000000044	str	q0, [sp, #0x10]
0000000000000048	mov	x8, sp
000000000000004c	ld1.4s	{ v0, v1 }, [x8]
0000000000000050	st1.4s	{ v0, v1 }, [x1]
0000000000000054	ldr	x8, [sp, #0x28]
0000000000000058	adrp	x9, 0 ; 0x0
000000000000005c	ldr	x9, [x9]
0000000000000060	ldr	x9, [x9]
0000000000000064	cmp	x9, x8
0000000000000068	b.ne	0x78
000000000000006c	mov	sp, x29
0000000000000070	ldp	x29, x30, [sp], #0x10
0000000000000074	ret
0000000000000078	bl	0x78
```
to
```assembly
0000000000000000	ldr	d0, [x0]
0000000000000004	sshll.8h	v0, v0, #0x0
0000000000000008	sshll.4s	v1, v0, #0x0
000000000000000c	scvtf.4s	v1, v1
0000000000000010	sshll2.4s	v0, v0, #0x0
0000000000000014	scvtf.4s	v2, v0
0000000000000018	st1.4s	{ v1, v2 }, [x1]
000000000000001c	ret
```

and improves perf of `python3 torchchat.py generate stories110M --num-samples 3 --quantize '{"linear:int8" : {"groupsize" : 0}}' --compile --device cpu` from 56 to 98 tokens per sec on MacBook M1 Pro

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132676
Approved by: https://github.com/desertfire
2024-08-07 06:26:05 +00:00
4faa0e3efb [Inductor] support masked vectorization for the tail_loop (#126526)
Currently the tail_loop always uses the scalar kernel. This PR supports masked vectorization for the tail_loop to improve the performance.

Example:
```
import torch
import torch.nn as nn

class GN(nn.Module):
    def __init__(self, num_groups, num_channels):
        super(GN, self).__init__()
        self.gn = nn.GroupNorm(num_groups, num_channels)

    def forward(self, x):
        return self.gn(x)

input = torch.randn(2, 960, 96, 96).to(memory_format=torch.channels_last)
m = GN(32, 960).eval()
compiled_m = torch.compile(m)

with torch.no_grad():
    for _ in range(3):
        compiled_m(input)

```

Generated code:
- Before:
```
cpp_fused_native_group_norm_0 = async_compile.cpp_pybinding(['const float*', 'const float*', 'const float*', 'float*', 'float*', 'float*'], '''
#include "/tmp/torchinductor_jiayisun/ky/cky2bufythacofebk7ujv36e4pxyqcqbpsy5r4vojoprjiwcwfxf.h"
extern "C"  void kernel(const float* in_ptr0,
                       const float* in_ptr1,
                       const float* in_ptr2,
                       float* out_ptr0,
                       float* out_ptr1,
                       float* out_ptr2)
{
    #pragma omp parallel num_threads(112)
    {
        int tid = omp_get_thread_num();
        {
            #pragma omp for collapse(2)
            for(long x0=static_cast<long>(0L); x0<static_cast<long>(2L); x0+=static_cast<long>(1L))
            {
                for(long x1=static_cast<long>(0L); x1<static_cast<long>(32L); x1+=static_cast<long>(1L))
                {
                    {
                        Welford<float> tmp_acc0 = Welford<float>();
                        Welford<at::vec::Vectorized<float>> tmp_acc0_vec = Welford<at::vec::Vectorized<float>>();
                        static WeightRecp<at::vec::Vectorized<float>> weight_recps(static_cast<long>(17280L));
                        for(long x2=static_cast<long>(0L); x2<static_cast<long>(9216L); x2+=static_cast<long>(1L))
                        {
                            for(long x3=static_cast<long>(0L); x3<static_cast<long>(16L); x3+=static_cast<long>(16L))
                            {
                                auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + static_cast<long>(x3 + (30L*x1) + (960L*x2) + (8847360L*x0)), 16);
                                tmp_acc0_vec = welford_combine(tmp_acc0_vec, tmp0, &weight_recps);
                            }
                            #pragma omp simd simdlen(8)
                            for(long x3=static_cast<long>(16L); x3<static_cast<long>(30L); x3+=static_cast<long>(1L))
                            {
                                auto tmp0 = in_ptr0[static_cast<long>(x3 + (30L*x1) + (960L*x2) + (8847360L*x0))];
                                tmp_acc0 = welford_combine(tmp_acc0, tmp0);
                            }
                        }
                        tmp_acc0 = welford_combine(tmp_acc0, welford_vec_reduce_all(tmp_acc0_vec));
                        out_ptr0[static_cast<long>(x1 + (32L*x0))] = static_cast<float>(tmp_acc0.mean);
                        out_ptr1[static_cast<long>(x1 + (32L*x0))] = static_cast<float>(tmp_acc0.m2);
                    }
                }
            }
        }
        {
            #pragma omp for collapse(2)
            for(long x0=static_cast<long>(0L); x0<static_cast<long>(2L); x0+=static_cast<long>(1L))
            {
                for(long x1=static_cast<long>(0L); x1<static_cast<long>(9216L); x1+=static_cast<long>(1L))
                {
                    for(long x2=static_cast<long>(0L); x2<static_cast<long>(960L); x2+=static_cast<long>(16L))
                    {
                        auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + static_cast<long>(x2 + (960L*x1) + (8847360L*x0)), 16);
                        auto tmp1 =
                        [&]
                        {
                            __at_align__ std::array<float, 16> tmpbuf;
                            #pragma GCC unroll 16
                            for (long x2_inner = 0; x2_inner < 16; x2_inner++)
                            {
                                tmpbuf[x2_inner] = out_ptr0[static_cast<long>((32L*x0) + (c10::div_floor_integer((x2 + x2_inner), 30L)))];
                            }
                            return at::vec::Vectorized<float>::loadu(tmpbuf.data(), 16);
                        }
                        ()
                        ;
                        auto tmp3 =
                        [&]
                        {
                            __at_align__ std::array<float, 16> tmpbuf;
                            #pragma GCC unroll 16
                            for (long x2_inner = 0; x2_inner < 16; x2_inner++)
                            {
                                tmpbuf[x2_inner] = out_ptr1[static_cast<long>((32L*x0) + (c10::div_floor_integer((x2 + x2_inner), 30L)))];
                            }
                            return at::vec::Vectorized<float>::loadu(tmpbuf.data(), 16);
                        }
                        ()
                        ;
                        auto tmp12 = at::vec::Vectorized<float>::loadu(in_ptr1 + static_cast<long>(x2), 16);
                        auto tmp14 = at::vec::Vectorized<float>::loadu(in_ptr2 + static_cast<long>(x2), 16);
                        auto tmp2 = tmp0 - tmp1;
                        auto tmp4 = static_cast<float>(276480.0);
                        auto tmp5 = at::vec::Vectorized<float>(tmp4);
                        auto tmp6 = tmp3 / tmp5;
                        auto tmp7 = static_cast<float>(1e-05);
                        auto tmp8 = at::vec::Vectorized<float>(tmp7);
                        auto tmp9 = tmp6 + tmp8;
                        auto tmp10 = tmp9.rsqrt();
                        auto tmp11 = tmp2 * tmp10;
                        auto tmp13 = tmp11 * tmp12;
                        auto tmp15 = tmp13 + tmp14;
                        tmp15.store(out_ptr2 + static_cast<long>(x2 + (960L*x1) + (8847360L*x0)));
                    }
                }
            }
        }
    }
}
''')

async_compile.wait(globals())
del async_compile

def call(args):
    arg0_1, arg1_1, arg2_1 = args
    args.clear()
    assert_size_stride(arg0_1, (960, ), (1, ))
    assert_size_stride(arg1_1, (960, ), (1, ))
    assert_size_stride(arg2_1, (2, 960, 96, 96), (8847360, 1, 92160, 960))
    buf0 = empty_strided_cpu((2, 32, 1, 1), (32, 1, 64, 64), torch.float32)
    buf1 = empty_strided_cpu((2, 32, 1, 1), (32, 1, 64, 64), torch.float32)
    buf3 = empty_strided_cpu((2, 960, 96, 96), (8847360, 1, 92160, 960), torch.float32)
    cpp_fused_native_group_norm_0(arg2_1, arg0_1, arg1_1, buf0, buf1, buf3)
    del arg0_1
    del arg1_1
    del arg2_1
    return (buf3, )
```

- After:
```
cpp_fused_native_group_norm_0 = async_compile.cpp_pybinding(['const float*', 'const float*', 'const float*', 'float*', 'float*', 'float*'], '''
#include "/tmp/torchinductor_jiayisun/em/cemtujj65j5txpqlxc7w4pcunpmvz3qtiudkc5ocxxhcmdlknw2m.h"
extern "C"  void kernel(const float* in_ptr0,
                       const float* in_ptr1,
                       const float* in_ptr2,
                       float* out_ptr0,
                       float* out_ptr1,
                       float* out_ptr2)
{
    #pragma omp parallel num_threads(112)
    {
        int tid = omp_get_thread_num();
        {
            #pragma omp for collapse(2)
            for(long x0=static_cast<long>(0L); x0<static_cast<long>(2L); x0+=static_cast<long>(1L))
            {
                for(long x1=static_cast<long>(0L); x1<static_cast<long>(32L); x1+=static_cast<long>(1L))
                {
                    {
                        Welford<float> tmp_acc0 = Welford<float>();
                        Welford<at::vec::Vectorized<float>> tmp_acc0_vec = Welford<at::vec::Vectorized<float>>();
                        Welford<at::vec::Vectorized<float>> masked_tmp_acc0_vec = Welford<at::vec::Vectorized<float>>();
                        static WeightRecp<at::vec::Vectorized<float>> wrecps0(static_cast<long>(17280L));
                        for(long x2=static_cast<long>(0L); x2<static_cast<long>(9216L); x2+=static_cast<long>(1L))
                        {
                            for(long x3=static_cast<long>(0L); x3<static_cast<long>(16L); x3+=static_cast<long>(16L))
                            {
                                auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + static_cast<long>(x3 + (30L*x1) + (960L*x2) + (8847360L*x0)), 16);
                                tmp_acc0_vec = welford_combine(tmp_acc0_vec, tmp0, &wrecps0);
                            }
                            for(long x3=static_cast<long>(16L); x3<static_cast<long>(30L); x3+=static_cast<long>(14L))
                            {
                                auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + static_cast<long>(x3 + (30L*x1) + (960L*x2) + (8847360L*x0)), 14);
                                masked_tmp_acc0_vec = welford_combine(masked_tmp_acc0_vec, tmp0, 14, &wrecps0);
                            }
                        }
                        tmp_acc0 = welford_combine(tmp_acc0, welford_vec_reduce_all(masked_tmp_acc0_vec));
                        tmp_acc0 = welford_combine(tmp_acc0, welford_vec_reduce_all(tmp_acc0_vec));
                        out_ptr0[static_cast<long>(x1 + (32L*x0))] = static_cast<float>(tmp_acc0.mean);
                        out_ptr1[static_cast<long>(x1 + (32L*x0))] = static_cast<float>(tmp_acc0.m2);
                    }
                }
            }
        }
        {
            #pragma omp for collapse(2)
            for(long x0=static_cast<long>(0L); x0<static_cast<long>(2L); x0+=static_cast<long>(1L))
            {
                for(long x1=static_cast<long>(0L); x1<static_cast<long>(9216L); x1+=static_cast<long>(1L))
                {
                    for(long x2=static_cast<long>(0L); x2<static_cast<long>(960L); x2+=static_cast<long>(16L))
                    {
                        auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + static_cast<long>(x2 + (960L*x1) + (8847360L*x0)), 16);
                        auto tmp1 =
                        [&]
                        {
                            __at_align__ std::array<float, 16> tmpbuf;
                            #pragma GCC unroll 16
                            for (long x2_inner = 0; x2_inner < 16; x2_inner++)
                            {
                                tmpbuf[x2_inner] = out_ptr0[static_cast<long>((32L*x0) + (c10::div_floor_integer((x2 + x2_inner), 30L)))];
                            }
                            return at::vec::Vectorized<float>::loadu(tmpbuf.data(), 16);
                        }
                        ()
                        ;
                        auto tmp3 =
                        [&]
                        {
                            __at_align__ std::array<float, 16> tmpbuf;
                            #pragma GCC unroll 16
                            for (long x2_inner = 0; x2_inner < 16; x2_inner++)
                            {
                                tmpbuf[x2_inner] = out_ptr1[static_cast<long>((32L*x0) + (c10::div_floor_integer((x2 + x2_inner), 30L)))];
                            }
                            return at::vec::Vectorized<float>::loadu(tmpbuf.data(), 16);
                        }
                        ()
                        ;
                        auto tmp12 = at::vec::Vectorized<float>::loadu(in_ptr1 + static_cast<long>(x2), 16);
                        auto tmp14 = at::vec::Vectorized<float>::loadu(in_ptr2 + static_cast<long>(x2), 16);
                        auto tmp2 = tmp0 - tmp1;
                        auto tmp4 = static_cast<float>(276480.0);
                        auto tmp5 = at::vec::Vectorized<float>(tmp4);
                        auto tmp6 = tmp3 / tmp5;
                        auto tmp7 = static_cast<float>(1e-05);
                        auto tmp8 = at::vec::Vectorized<float>(tmp7);
                        auto tmp9 = tmp6 + tmp8;
                        auto tmp10 = tmp9.rsqrt();
                        auto tmp11 = tmp2 * tmp10;
                        auto tmp13 = tmp11 * tmp12;
                        auto tmp15 = tmp13 + tmp14;
                        tmp15.store(out_ptr2 + static_cast<long>(x2 + (960L*x1) + (8847360L*x0)));
                    }
                }
            }
        }
    }
}
''')

async_compile.wait(globals())
del async_compile

def call(args):
    arg0_1, arg1_1, arg2_1 = args
    args.clear()
    assert_size_stride(arg0_1, (960, ), (1, ))
    assert_size_stride(arg1_1, (960, ), (1, ))
    assert_size_stride(arg2_1, (2, 960, 96, 96), (8847360, 1, 92160, 960))
    buf0 = empty_strided_cpu((2, 32, 1, 1), (32, 1, 64, 64), torch.float32)
    buf1 = empty_strided_cpu((2, 32, 1, 1), (32, 1, 64, 64), torch.float32)
    buf3 = empty_strided_cpu((2, 960, 96, 96), (8847360, 1, 92160, 960), torch.float32)
    cpp_fused_native_group_norm_0(arg2_1, arg0_1, arg1_1, buf0, buf1, buf3)
    del arg0_1
    del arg1_1
    del arg2_1
    return (buf3, )
```

Co-authored-by: CaoE <e.cao@intel.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126526
Approved by: https://github.com/jgong5, https://github.com/leslie-fang-intel, https://github.com/jansel
2024-08-07 06:00:12 +00:00
8bc5ef563e Grouped Query Attention (#132689)
### Approach: Using the current function declaration

**Constraint:** Q_Heads % KV_Heads == 0

**Major change:**
- Added a new argument enable_gqa: bool to sdpa function call
- It adds a meaning to the last third dimension.

Sample use cases this would enable:
LLama3

```
# LLama3 8b call to SDPA
query = torch.rand(batch, 32, seq_len_q, D)
key = torch.rand(batch, 8, seq_len_kv, D)
value = torch.rand(batch, 8, seq_len_kv, D)

output = scaled_dot_product_attention(query, key, value, is_causal=True, enable_gqa=True)

# Output Shape
(batch, 32, seq_len_q, D)
```

### Design Choice:

- Check if Query.size(-3) == Key.size(-3) == Value.size(-3) or, Query.size(-3) % Key.size(-3) == 0
- The function adjusts the key and value tensors to match the query tensor's head dimension by using repeat_interleave if their number of heads are not equal, facilitating correct and efficient computation in attention mechanisms.
- By default the enable_gqa flag is set to False, which ensures that regular sdpa functionality remains unchanged.

### Benchmarks:

- **sdpa.py: #130634**
For different batch sizes enable_gqa=True shows a substansial improvement in the run_time of sdpa

 | batch_size | q_num_heads | kv_num_heads | q_seq_len | kv_seq_len | embed_dim | forward_time when enable_gqa=True   |   forward_time when enable_gqa=False    |
| ------------ | ------------- | -------------- | ----------- | ------------ | ----------- | ----------- | ---------------- |
|     1      |     32      |      8       |   2048    |    2048    |   2048    |   100.71  |  119.70  |
|     8      |     32      |      8       |   2048    |    2048    |   2048    |   539.78  |  628.83  |
|     16     |     32      |      8       |   2048    |    2048    |   2048    |   1056.81  |  1225.48  |
|     32      |     32      |      8       |   2048    |    2048    |   2048    |   2099.54  |  2440.45  |

![Screenshot 2024-07-25 at 9 07 40 PM](https://github.com/user-attachments/assets/a3e5f716-c39f-4096-9e6c-82a735e57b7b)

- **TorchTitan: https://github.com/pytorch/torchtitan/pull/458**

Differential Revision: D60772086

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132689
Approved by: https://github.com/drisspg
2024-08-07 05:35:36 +00:00
527f104a69 add L2 cache size to device properties (#132819)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132819
Approved by: https://github.com/eellison
2024-08-07 04:55:06 +00:00
cyy
bfeb45e46b [17/N] Fix clang-tidy warnings in jit (#132753)
Follows #132604
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132753
Approved by: https://github.com/Skylion007
2024-08-07 03:47:54 +00:00
cyy
03480213de [8/N] Fix clang-tidy warnings in aten/src/ATen (#132728)
Follows  #132727
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132728
Approved by: https://github.com/ezyang
2024-08-07 02:44:17 +00:00
919e384247 [PT2][Optimus] Add unbind_stack_to_cat_pass (#132542)
Summary: We observe the stack mpde can be transformed to cat node to elimiate split nodes, which could further enable the unbind cat optimization, thus we add a more advanced pattern to do the graph transformation

Test Plan:
# unit test

```
CUDA_VISIBLE_DEVICES=3 OC_CAUSE=1 buck2 test //caffe2/test/inductor:split_cat_fx_passes
```
Buck UI: https://www.internalfb.com/buck2/de6c1cda-3d74-4a30-8980-7b209b6fe5dc
Test UI: https://www.internalfb.com/intern/testinfra/testrun/12103424042268125
Network: Up: 485KiB  Down: 728KiB  (reSessionID-2f2c01c3-79bb-4e37-b5be-fb77ec09b264)
Jobs completed: 29. Time elapsed: 5:19.8s.
Cache hits: 0%. Commands: 4 (cached: 0, remote: 0, local: 4)
Tests finished: Pass 9. Fail 0. Fatal 0. Skip 1. Build failure 0

# benchmark

```
CUDA_VISIBLE_DEVICES=3 OC_CAUSE=1 buck2 run mode/opt //scripts/jackiexu0313/pt2:local_model_with_pt2 -- --test_mode batch-split --model_type "ig_ctr" --flow_id 584880697
```
P1503698962

before and after graph transformation
https://www.internalfb.com/intern/diffing/?paste_number=1504050718

Differential Revision: D60411560

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132542
Approved by: https://github.com/jackiexu1992
2024-08-07 02:26:40 +00:00
063a45ed27 Fix infinite recursion while walking to submodules (#132763)
Fixes https://github.com/pytorch/pytorch/pull/132216#issuecomment-2271555873

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132763
Approved by: https://github.com/ezyang
2024-08-07 02:20:17 +00:00
73c083e02c [Inductor][CPP] Turns on inline_inbuilt_nn_modules for CPP GEMM template testing (#132487)
**Summary**
The CPP GEMM template testing has been skipped with turning on `inline_inbuilt_nn_modules ` as in https://github.com/pytorch/pytorch/issues/131929.  Since https://github.com/pytorch/pytorch/pull/132334 has landed to fix the issues. Turn on this flag back since it's default.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132487
Approved by: https://github.com/anijain2305, https://github.com/jgong5
2024-08-07 02:18:51 +00:00
ed224554eb [BE] Don't unnecessarily suggest -k for rerunning tests locally (#132807)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132807
Approved by: https://github.com/malfet
2024-08-07 02:15:18 +00:00
837898d9c8 Stop using preserve_rng_state as decorator (#132774)
See https://github.com/pytorch/pytorch/pull/132073 for motivation

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132774
Approved by: https://github.com/albanD
2024-08-07 01:07:12 +00:00
cyy
b01402b0a4 [7/N] Fix clang-tidy warnings in aten/src/ATen (#132727)
Follows  #132620
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132727
Approved by: https://github.com/Skylion007
2024-08-07 00:29:03 +00:00
178dc0c9c7 various doc fixes (#132803)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132803
Approved by: https://github.com/Chillee, https://github.com/joydddd, https://github.com/BoyuanFeng
ghstack dependencies: #132799
2024-08-07 00:19:42 +00:00
cb4d1bfb71 Clean up some tflop calc and add option for saving (#132799)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132799
Approved by: https://github.com/BoyuanFeng
2024-08-07 00:19:42 +00:00
cbee9c1fd2 Revert "Deprecate torch._utils.is_compiling() and torch._dynamo.external_utils.is_compiling() (#127690)"
This reverts commit 0e7e61f7cec82a43f2de52b83eff152d703be7a3.

Reverted https://github.com/pytorch/pytorch/pull/127690 on behalf of https://github.com/kit1980 due to breaking internal builds ([comment](https://github.com/pytorch/pytorch/pull/127690#issuecomment-2272370386))
2024-08-07 00:05:20 +00:00
e98eac76b3 [inductor] switch AotCodeCompiler to new cpp_builder. (take 3) (#132766)
Summary: This is basically https://github.com/pytorch/pytorch/pull/131304 together with https://github.com/pytorch/pytorch/pull/132594 and absolute path fix for fbcode.

Test Plan: ci

Differential Revision: D60773405

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132766
Approved by: https://github.com/xuhancn, https://github.com/chenyang78, https://github.com/desertfire
2024-08-06 23:56:34 +00:00
c7113a6186 Revert "[DeviceMesh] Create new group for 1D mesh when default backend is 'gloo' and 'cuda' is available (#132709)"
This reverts commit 1a23ef2ece1c667ee46cd34deb70df2b91bffa32.

Reverted https://github.com/pytorch/pytorch/pull/132709 on behalf of https://github.com/clee2000 due to I think this broke distributed/test_distributed_spawn.py::TestDistBackendWithSpawn::test_ddp_device_mesh_initialization [GH job link](https://github.com/pytorch/pytorch/actions/runs/10274519791/job/28432469987) [HUD commit link](1a23ef2ece).  Test not run due to bad TD ([comment](https://github.com/pytorch/pytorch/pull/132709#issuecomment-2272350923))
2024-08-06 23:47:53 +00:00
0d6caeb259 Add logging + counter for missed reinplacing opportunities (#132758)
Summary:
- We add Inductor logs for what tensors we tried to reinplace, what
  tensors we were unable to reinplace, and of those tensors, which of
  those might be bugs (the "missed reinplacing opportunities"). You can
  tell this by reading the Inductor output graph but the logs make it
  easier to figure out.
- Add a dynamo_compile counter for missed reinplacing opportunities. The
  goal is to see how widespread existing problems (if any) are. We've had
  trouble getting all of the edge cases for the reinplacing pass; the
  counter will help us hunt down issues.

Test Plan:
- tested locally

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132758
Approved by: https://github.com/eellison
2024-08-06 23:44:24 +00:00
cd7f527c59 [3/3] 3D Composability - move tp dp tests (#129802)
pytorch (fsdp, tp, pp) -> pytorch (composable)
Move (fsdp, tp, pp) tests under pytorch into a composable folder

FSDP:
test/distributed/_composable/fsdp/test_fully_shard_trainin.py
-TestFullyShard2DTraining
**DP:
test/distributed/tensor/parallel/test_ddp_2d_parallel.py
TP:
test/distributed/tensor/parallel/test_fsdp_2d_parallel.py**
PP:
test/distributed/pipelining/test_composability.py

=>
**distributed/_composable/test_composability/test_2d_composability.py**
distributed/_composable/test_composability/test_pp_composability.py

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129802
Approved by: https://github.com/fduwjj
ghstack dependencies: #129801
2024-08-06 23:07:07 +00:00
179b572fd9 [2/3] 3D Composability - move pp tests (#129801)
pytorch (fsdp, tp, pp) -> pytorch (composable)
Move (fsdp, tp, pp) tests under pytorch into a composable folder

FSDP:
test/distributed/_composable/fsdp/test_fully_shard_trainin.py
-TestFullyShard2DTraining
DP:
test/distributed/tensor/parallel/test_ddp_2d_parallel.py
TP:
test/distributed/tensor/parallel/test_fsdp_2d_parallel.py
**PP:
test/distributed/pipelining/test_composability.py**

=>
distributed/_composable/test_composability/test_2d_composability.py
**distributed/_composable/test_composability/test_pp_composability.py**

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129801
Approved by: https://github.com/wconstab, https://github.com/atalman
2024-08-06 23:07:07 +00:00
825002c9c6 [export][fx] More robust DCE pass (#132764)
Summary:
- make default DCE pass check schema,
- need to rebase onto https://github.com/pytorch/pytorch/pull/131651 after it's in phabricator (for now the change is manually added).

- mark Proxy dump as NotImplemented for better error msg

- Remove Proxy from tensors when dumping models, as Proxy cannot be dumped.

More details in https://docs.google.com/document/d/1G5vmTXjzxoyVGRI2kpA1gQukK_Glyg2NrE0Oh6Nlg9A/edit?usp=sharing.

Test Plan:
CI
```
- buck2 run 'fbcode//mode/dev-nosan'  fbcode//caffe2/test/quantization:test_quantization -- -r  qat_conv2d
- test_export.py
- buck2 run 'fbcode//mode/dev-nosan' fbcode//modai/test:test_modai -- -r test_qat_stinson_htp_export
- buck2 run 'fbcode//mode/dev-nosan' fbcode//vizard_projects/ml_depth/tests:test_model -- -r test_qat_model_et
- buck2 run 'fbcode//mode/dev-nosan'  fbcode//caffe2/test:fx -- -r dce
- buck2 run 'fbcode//mode/dev-nosan' fbcode//bolt/nn/executorch/backends/tests:qnn_test -- -r test_qat_bias=False,use_3d_input=False
- buck2 run 'fbcode//mode/dev-nosan' fbcode//bolt/nn/executorch/backends/tests:qnn_test -- -r test_qat_bias=True,use_3d_input=False
- buck2 run 'fbcode//mode/dev-nosan' fbcode//caffe2/test/quantization:test_quantization -- -r  test_fold_bn_erases_bn_node
```

Reviewed By: angelayi

Differential Revision: D60319175

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132764
Approved by: https://github.com/angelayi
2024-08-06 22:27:22 +00:00
073cee531c [Test][Easy] Remove print in test_device_mesh.py (#132780)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132780
Approved by: https://github.com/XilunWu
2024-08-06 22:04:39 +00:00
1a23ef2ece [DeviceMesh] Create new group for 1D mesh when default backend is 'gloo' and 'cuda' is available (#132709)
More context in [#132471](https://github.com/pytorch/pytorch/issues/132471) and https://github.com/pytorch/pytorch/issues/132366.

TLDR:
When cuda is available and users move tensors to cuda, we cannot really reuse the default pg if default pg is gloo, as lots of collectives are not supported on gloo for cuda tensors. For example, `dtensor.full_tensor()` would result in a mysterious SIGTERM when all_gather a cuda tensor using gloo. Without the change in this PR, users would have to know the context and explicitly move the cuda tensor to cpu before invoking most collectives, which I think is not so ideal UX.

Therefore, given most collectives are not supported on gloo for cuda tensors, we should init a new pg if the default pg is gloo when torch.cuda.is_available() and device_type is cuda.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132709
Approved by: https://github.com/awgu, https://github.com/wanchaol
2024-08-06 22:00:09 +00:00
18b678082e [Easy] log output code path on cache hit (#132718)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132718
Approved by: https://github.com/oulgen, https://github.com/masnesral
2024-08-06 21:59:30 +00:00
3c1033eeb0 Don't auto request review for reopened PRs (#132681)
This will clobber previous approves.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132681
Approved by: https://github.com/albanD, https://github.com/malfet
2024-08-06 21:36:18 +00:00
2073ddfd1c Actually report the HOP and subclass/mode when there isn't a registration (#132550)
Test Plan:
- tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132550
Approved by: https://github.com/ydwu4
2024-08-06 21:33:10 +00:00
623d0204f0 [NJT] Support Chunk backward for simple cases (#132193)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132193
Approved by: https://github.com/soulitzer
2024-08-06 21:20:09 +00:00
2f908ffa4a [traced-graph][sparse] sparsity propagation for all current tests (#132690)
This PR makes sure all current tests in the sparsity export test suite pass. Note that there will probably be anecdotal cases that need fixing after this, but the general idea of preserving sparsity metadata has been completed.

Fixes: https://github.com/pytorch/pytorch/issues/117188

```
$ PYTORCH_TEST_WITH_DYNAMO=0 python test/export/test_sparse.py ........................................................................................................................................................
 ----------------------------------------------------------------------
Ran 152 tests
OK
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132690
Approved by: https://github.com/ezyang
2024-08-06 21:18:13 +00:00
029f8fc701 Bump rexml from 3.2.8 to 3.3.3 in /ios/TestApp (#132469)
Bumps [rexml](https://github.com/ruby/rexml) from 3.2.8 to 3.3.3.
<details>
<summary>Release notes</summary>
<p><em>Sourced from <a href="https://github.com/ruby/rexml/releases">rexml's releases</a>.</em></p>
<blockquote>
<h2>REXML 3.3.3 - 2024-08-01</h2>
<h3>Improvements</h3>
<ul>
<li>
<p>Added support for detecting invalid XML that has unsupported
content before root element</p>
<ul>
<li><a href="https://redirect.github.com/ruby/rexml/issues/184">GH-184</a></li>
<li>Patch by NAITOH Jun.</li>
</ul>
</li>
<li>
<p>Added support for <code>REXML::Security.entity_expansion_limit=</code> and
<code>REXML::Security.entity_expansion_text_limit=</code> in SAX2 and pull
parsers</p>
<ul>
<li><a href="https://redirect.github.com/ruby/rexml/issues/187">GH-187</a></li>
<li>Patch by NAITOH Jun.</li>
</ul>
</li>
<li>
<p>Added more tests for invalid XMLs.</p>
<ul>
<li><a href="https://redirect.github.com/ruby/rexml/issues/183">GH-183</a></li>
<li>Patch by Watson.</li>
</ul>
</li>
<li>
<p>Added more performance tests.</p>
<ul>
<li>Patch by Watson.</li>
</ul>
</li>
<li>
<p>Improved parse performance.</p>
<ul>
<li><a href="https://redirect.github.com/ruby/rexml/issues/186">GH-186</a></li>
<li>Patch by tomoya ishida.</li>
</ul>
</li>
</ul>
<h3>Thanks</h3>
<ul>
<li>
<p>NAITOH Jun</p>
</li>
<li>
<p>Watson</p>
</li>
<li>
<p>tomoya ishida</p>
</li>
</ul>
<h2>REXML 3.3.2 - 2024-07-16</h2>
<h3>Improvements</h3>
<ul>
<li>
<p>Improved parse performance.</p>
<ul>
<li><a href="https://redirect.github.com/ruby/rexml/issues/160">GH-160</a></li>
<li>Patch by NAITOH Jun.</li>
</ul>
</li>
<li>
<p>Improved parse performance.</p>
<ul>
<li><a href="https://redirect.github.com/ruby/rexml/issues/169">GH-169</a></li>
<li><a href="https://redirect.github.com/ruby/rexml/issues/170">GH-170</a></li>
<li><a href="https://redirect.github.com/ruby/rexml/issues/171">GH-171</a></li>
<li><a href="https://redirect.github.com/ruby/rexml/issues/172">GH-172</a></li>
<li><a href="https://redirect.github.com/ruby/rexml/issues/173">GH-173</a></li>
<li><a href="https://redirect.github.com/ruby/rexml/issues/174">GH-174</a></li>
<li><a href="https://redirect.github.com/ruby/rexml/issues/175">GH-175</a></li>
<li><a href="https://redirect.github.com/ruby/rexml/issues/176">GH-176</a></li>
</ul>
</li>
</ul>
<!-- raw HTML omitted -->
</blockquote>
<p>... (truncated)</p>
</details>
<details>
<summary>Changelog</summary>
<p><em>Sourced from <a href="https://github.com/ruby/rexml/blob/master/NEWS.md">rexml's changelog</a>.</em></p>
<blockquote>
<h2>3.3.3 - 2024-08-01 {#version-3-3-3}</h2>
<h3>Improvements</h3>
<ul>
<li>
<p>Added support for detecting invalid XML that has unsupported
content before root element</p>
<ul>
<li><a href="https://redirect.github.com/ruby/rexml/issues/184">GH-184</a></li>
<li>Patch by NAITOH Jun.</li>
</ul>
</li>
<li>
<p>Added support for <code>REXML::Security.entity_expansion_limit=</code> and
<code>REXML::Security.entity_expansion_text_limit=</code> in SAX2 and pull
parsers</p>
<ul>
<li><a href="https://redirect.github.com/ruby/rexml/issues/187">GH-187</a></li>
<li>Patch by NAITOH Jun.</li>
</ul>
</li>
<li>
<p>Added more tests for invalid XMLs.</p>
<ul>
<li><a href="https://redirect.github.com/ruby/rexml/issues/183">GH-183</a></li>
<li>Patch by Watson.</li>
</ul>
</li>
<li>
<p>Added more performance tests.</p>
<ul>
<li>Patch by Watson.</li>
</ul>
</li>
<li>
<p>Improved parse performance.</p>
<ul>
<li><a href="https://redirect.github.com/ruby/rexml/issues/186">GH-186</a></li>
<li>Patch by tomoya ishida.</li>
</ul>
</li>
</ul>
<h3>Thanks</h3>
<ul>
<li>
<p>NAITOH Jun</p>
</li>
<li>
<p>Watson</p>
</li>
<li>
<p>tomoya ishida</p>
</li>
</ul>
<h2>3.3.2 - 2024-07-16 {#version-3-3-2}</h2>
<h3>Improvements</h3>
<ul>
<li>
<p>Improved parse performance.</p>
<ul>
<li><a href="https://redirect.github.com/ruby/rexml/issues/160">GH-160</a></li>
<li>Patch by NAITOH Jun.</li>
</ul>
</li>
<li>
<p>Improved parse performance.</p>
<ul>
<li><a href="https://redirect.github.com/ruby/rexml/issues/169">GH-169</a></li>
<li><a href="https://redirect.github.com/ruby/rexml/issues/170">GH-170</a></li>
<li><a href="https://redirect.github.com/ruby/rexml/issues/171">GH-171</a></li>
<li><a href="https://redirect.github.com/ruby/rexml/issues/172">GH-172</a></li>
<li><a href="https://redirect.github.com/ruby/rexml/issues/173">GH-173</a></li>
<li><a href="https://redirect.github.com/ruby/rexml/issues/174">GH-174</a></li>
<li><a href="https://redirect.github.com/ruby/rexml/issues/175">GH-175</a></li>
</ul>
</li>
</ul>
<!-- raw HTML omitted -->
</blockquote>
<p>... (truncated)</p>
</details>
<details>
<summary>Commits</summary>
<ul>
<li><a href="e4a067e112"><code>e4a067e</code></a> Add 3.3.3 entry</li>
<li><a href="17ff3e7874"><code>17ff3e7</code></a> test: add a performance test for attribute list declaration</li>
<li><a href="be86b3de0a"><code>be86b3d</code></a> test: fix wrong test name</li>
<li><a href="b93d790b36"><code>b93d790</code></a> test: use double quote for string literal</li>
<li><a href="0fbe7d5a0e"><code>0fbe7d5</code></a> test: don't use abbreviated name</li>
<li><a href="1599e8785f"><code>1599e87</code></a> test: add a performance test for PI with many tabs</li>
<li><a href="e2546e6eca"><code>e2546e6</code></a> parse pi: improve invalid case detection</li>
<li><a href="73661ef281"><code>73661ef</code></a> test: fix a typo</li>
<li><a href="850488abf2"><code>850488a</code></a> test: use double quote for string literal</li>
<li><a href="46c6397d5c"><code>46c6397</code></a> test: add performance tests for entity declaration</li>
<li>Additional commits viewable in <a href="https://github.com/ruby/rexml/compare/v3.2.8...v3.3.3">compare view</a></li>
</ul>
</details>
<br />

[![Dependabot compatibility score](https://dependabot-badges.githubapp.com/badges/compatibility_score?dependency-name=rexml&package-manager=bundler&previous-version=3.2.8&new-version=3.3.3)](https://docs.github.com/en/github/managing-security-vulnerabilities/about-dependabot-security-updates#about-compatibility-scores)

Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting `@dependabot rebase`.

[//]: # (dependabot-automerge-start)
[//]: # (dependabot-automerge-end)

---

<details>
<summary>Dependabot commands and options</summary>
<br />

You can trigger Dependabot actions by commenting on this PR:
- `@dependabot rebase` will rebase this PR
- `@dependabot recreate` will recreate this PR, overwriting any edits that have been made to it
- `@dependabot merge` will merge this PR after your CI passes on it
- `@dependabot squash and merge` will squash and merge this PR after your CI passes on it
- `@dependabot cancel merge` will cancel a previously requested merge and block automerging
- `@dependabot reopen` will reopen this PR if it is closed
- `@dependabot close` will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually
- `@dependabot show <dependency name> ignore conditions` will show all of the ignore conditions of the specified dependency
- `@dependabot ignore this major version` will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)
- `@dependabot ignore this minor version` will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself)
- `@dependabot ignore this dependency` will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)
You can disable automated security fix PRs for this repo from the [Security Alerts page](https://github.com/pytorch/pytorch/network/alerts).

</details>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132469
Approved by: https://github.com/ezyang
2024-08-06 21:17:24 +00:00
e47b684c33 Revert "Temp disable MKL in DistributionKernels.cpp (#132532)"
This reverts commit 7b2664ece6a961ce9e4557be913c2cead09c7390.

Reverted https://github.com/pytorch/pytorch/pull/132532 on behalf of https://github.com/PaliC due to causing numerical instability issues internally ([comment](https://github.com/pytorch/pytorch/pull/132532#issuecomment-2272136210))
2024-08-06 20:57:09 +00:00
94155ce31b [Torch] Support meta device in checkpoint (#132684)
Summary:
## Why
utils.checkpoint doesn't support meta device:

```
  File "/Users/lyu1/torchdev/lib/python3.9/site-packages/torch/utils/checkpoint.py", line 490, in checkpoint
    next(gen)
  File "/Users/lyu1/torchdev/lib/python3.9/site-packages/torch/utils/checkpoint.py", line 1359, in _checkpoint_without_reentrant_generator
    device_module = _get_device_module(device)
  File "/Users/lyu1/torchdev/lib/python3.9/site-packages/torch/utils/checkpoint.py", line 98, in _get_device_module
    device_module = getattr(torch, device)
  File "/Users/lyu1/torchdev/lib/python3.9/site-packages/torch/__init__.py", line 1938, in __getattr__
    raise AttributeError(f"module '{__name__}' has no attribute '{name}'")
AttributeError: module 'torch' has no attribute 'meta'
```

This blocks us from running model with checkpoint enabled in meta mode.

## What
This diff handles the case of meta device in checkpoint.py.

(in checkpoint.py, device module is manily used when preserve_rng_state=true, which doesn't apply to meta case. So a more elgant fix might be set preserve_rng_state=false when detecting args are on meta device. But I didn't find where to do this check in the minimum way. Let me know if you have ideas.)

Test Plan: Tested with toy model which has checkpoint on its module: P1513716944

Differential Revision: D60749427

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132684
Approved by: https://github.com/kit1980
2024-08-06 20:45:50 +00:00
de00c79583 [dynamo][inline_inbuilt_nn_modules] Mark nn module tensor static for cudagraphs (#132736)
Fixes https://github.com/pytorch/pytorch/issues/132714

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132736
Approved by: https://github.com/mlazos
ghstack dependencies: #132538
2024-08-06 20:13:28 +00:00
1954bfacda [Inductor] Small performance, precision, and dependency updates to B2B-GEMM (#132354)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132354
Approved by: https://github.com/masnesral
2024-08-06 20:01:27 +00:00
775c310c0c Preserve source_fn_stack in the training IR decomp (#132033)
Title

Differential Revision: [D60377712](https://our.internmc.facebook.com/intern/diff/D60377712/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132033
Approved by: https://github.com/angelayi
ghstack dependencies: #131988, #131995, #131999
2024-08-06 19:45:40 +00:00
4faa5804f6 [c10d] Used float tensor for PG NCCL barrier all-reduce (#132701)
This helps avoid a CUDA illegal memory access in the NCCL all-reduce part of `barrier()` when the CUDA caching allocator is disabled. NCCL all-reduce seems to assume reading at least 4 bytes. See https://github.com/pytorch/pytorch/issues/132640 for more context.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132701
Approved by: https://github.com/wanchaol, https://github.com/fegin
2024-08-06 19:35:37 +00:00
1e65ccc3de [inductor] export kernel for gemm template. (#132580)
Changes:
1. Move `get_export_declaration` to global scope.
2. Export kernel for gemm template.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132580
Approved by: https://github.com/ezyang
2024-08-06 18:52:22 +00:00
81a5a7a30a [Quantizer] Fix getattr for quantizing constants (#132705)
Mobilebert quantization was failing because there were embedding constants that could not be accessed through getattr().

It seems that we have to search the submodule for the embeddings. Which we do here. This is just to help get around looking at unlifted attrs to check if they are large scalars

Differential Revision: [D60492338](https://our.internmc.facebook.com/intern/diff/D60492338/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132705
Approved by: https://github.com/jerryzh168
ghstack dependencies: #132704
2024-08-06 18:16:27 +00:00
c2bccfd431 [BE] Simplify code interacting with get_proxy_mode/enable_tracing (#132675)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132675
Approved by: https://github.com/Skylion007, https://github.com/ydwu4, https://github.com/zou3519
ghstack dependencies: #132674
2024-08-06 18:13:22 +00:00
1de4ebc85d [Quantizer] Fix Maxpool2d share q params (#132704)
There seems to be a bug in the code for sharing q params for maxpool2d. This case occurs when output_node = maxpool_node. When this happens we overwrite the node's "quantization_annotation" metadata. This fix ensures that qparams are indeed shared across input and output

Differential Revision: [D60492341](https://our.internmc.facebook.com/intern/diff/D60492341/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132704
Approved by: https://github.com/jerryzh168
2024-08-06 18:13:16 +00:00
db0bd04151 [AOTI] Switch to use shim v2 for fbcode (#132750)
Summary: As title

Test Plan: CI

Reviewed By: hl475, ColinPeppler

Differential Revision: D57899065

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132750
Approved by: https://github.com/angelayi
2024-08-06 17:57:32 +00:00
8d2c272e5a properly register conjugate/neg fallthroughs to prim ops (#132699)
A few aten ops (like `clone` and `copy_` get fallthrough registrations to the Conjugate/Negative keys. We haven't been giving the same treatment to their corresponding `prims` variants, which can cause infinite loops in some cases.

Fixes an infinite loop that showed up in tests from https://github.com/pytorch/pytorch/pull/132563

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132699
Approved by: https://github.com/albanD
2024-08-06 17:57:04 +00:00
c6582f11cd Add get_optin_feature() to allow opt-in to amz2023 (#131792)
This extends the runner determinator to be able to opt-in to keywords
to provide additional options when determining which systems to run
jobs on. This enables us to support opt-in users to Amazon Linux 2023.

This change creates a generic get_optin_feature() which hopefully will
be useful to handle additional future features that we might want to
experiment with.

This change has kept backwards compatability with the existing issue
userlist format and adds support for the comma-separated list of users
in a backwards compatible way.

The user list has the following rules:

- Users are GitHub usernames with the @ prefix
- If the first line is a "*" then all users will use the new runners
- If the first line is a "!" then all users will use the old runners
- Each user is also a comma-separated list of features/experiments to enable
- A "#" prefix indicates the user is opted out of the new runners but is opting
  into features/experiments.

Example user list:

```
@User1
@User2,amz2023
#@UserOptOutOfNewRunner,amz2023
```

This closes pytorch/ci-infra#249.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131792
Approved by: https://github.com/jeanschmidt, https://github.com/ZainRizvi
2024-08-06 17:54:20 +00:00
e3394e5548 torch.autograd.graph.increment_version: accept List[Tensor], use in AOTDispatcher (#132652)
The regression from https://github.com/pytorch/pytorch/issues/132281 pinpoints e4ace1a396 as the cause. The main delta that commit introduces is that we now manually check `is_inference()` and call `increment_version()` (a pybind call) on every mutated input tensor to the graph.

This PR attempts to reduce overhead a bit by bundling up all of those checks into a single pybind call, by:

(1) updating `torch.autograd.graph.increment_version()` to accept a `Union[Tensor, List[Tensor]]`

(2) updating its semantics to no-op if you pass in a tensor with no version counter, instead of erroring

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132652
Approved by: https://github.com/albanD
2024-08-06 17:46:48 +00:00
af67b8df6d [export] Fix exportdb test (#132678)
Summary:
FIx exportdb test  for tensor_setattr.

copy.deepcopy(deepcopy) can fail if tensor inputs have attribute (i.e. __dict__).

We remove it before deepcopy.

Before the fix, we have

```
inputs[0].__dict__
{'attr': FakeTensor(..., size=(3, 2))}
```

the test errors out with

```
======================================================================
ERROR: test_exportdb_supported_case_tensor_setattr (caffe2.test.export.test_serialize.TestDeserialize)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/data/users/shangdiy/fbsource/buck-out/v2/gen/fbcode/a915c8ae5cba5b70/caffe2/test/__test_export__/test_export#link-tree/torch/testing/_internal/common_utils.py", line 529, in instantiated_test
    test(self, **param_kwargs)
  File "/data/users/shangdiy/fbsource/buck-out/v2/gen/fbcode/a915c8ae5cba5b70/caffe2/test/__test_export__/test_export#link-tree/caffe2/test/export/test_serialize.py", line 878, in test_exportdb_supported
    self.check_graph(model, case.example_args, _check_meta=_check_meta)
  File "/data/users/shangdiy/fbsource/buck-out/v2/gen/fbcode/a915c8ae5cba5b70/caffe2/test/__test_export__/test_export#link-tree/caffe2/test/export/test_serialize.py", line 548, in check_graph
    _check_graph(pre_dispatch=True)
  File "/data/users/shangdiy/fbsource/buck-out/v2/gen/fbcode/a915c8ae5cba5b70/caffe2/test/__test_export__/test_export#link-tree/caffe2/test/export/test_serialize.py", line 506, in _check_graph
    copy.deepcopy(inputs),
  File "/usr/local/fbcode/platform010/lib/python3.10/copy.py", line 146, in deepcopy
    y = copier(x, memo)
  File "/usr/local/fbcode/platform010/lib/python3.10/copy.py", line 211, in _deepcopy_tuple
    y = [deepcopy(a, memo) for a in x]
  File "/usr/local/fbcode/platform010/lib/python3.10/copy.py", line 211, in <listcomp>
    y = [deepcopy(a, memo) for a in x]
  File "/usr/local/fbcode/platform010/lib/python3.10/copy.py", line 153, in deepcopy
    y = copier(memo)
  File "/data/users/shangdiy/fbsource/buck-out/v2/gen/fbcode/a915c8ae5cba5b70/caffe2/test/__test_export__/test_export#link-tree/torch/_tensor.py", line 206, in __deepcopy__
    new_tensor.__dict__ = deepcopy(self.__dict__, memo)
  File "/usr/local/fbcode/platform010/lib/python3.10/copy.py", line 146, in deepcopy
    y = copier(x, memo)
  File "/usr/local/fbcode/platform010/lib/python3.10/copy.py", line 231, in _deepcopy_dict
    y[deepcopy(key, memo)] = deepcopy(value, memo)
  File "/usr/local/fbcode/platform010/lib/python3.10/copy.py", line 153, in deepcopy
    y = copier(memo)
  File "/data/users/shangdiy/fbsource/buck-out/v2/gen/fbcode/a915c8ae5cba5b70/caffe2/test/__test_export__/test_export#link-tree/torch/_tensor.py", line 108, in __deepcopy__
    or (type(self) is not Tensor and self.data_ptr() == 0)
RuntimeError: Cannot access data pointer of Tensor (e.g. FakeTensor, FunctionalTensor). If you're using torch.compile/export/fx, it is likely that we are erroneously tracing into a custom kernel. To fix this, please wrap the custom kernel into an opaque custom op. Please see the following for details: https://pytorch.org/tutorials/advanced/custom_ops_landing_page.html
```

Test Plan:
```
buck2 run 'fbcode//mode/dev-nosan' fbcode//caffe2/test:test_export -- -r  test_exportdb_supported_case_tensor_setattr
```

Differential Revision: D60610860

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132678
Approved by: https://github.com/zhxchen17
2024-08-06 17:45:10 +00:00
e6eee04875 dynamo: use equality guards instead of id guards for Placement/DeviceMesh (#124401)
After talking to @anijain2305, we probably can't land this since it won't work for C++ guards. But we should still be able to do better than ID_MATCH

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124401
Approved by: https://github.com/anijain2305
2024-08-06 17:14:44 +00:00
f50621989b Construct NJT without graph breaks (#130292)
Combines contributions from https://github.com/pytorch/pytorch/pull/130505

Some context can be found in this large comment block:

a5b64d39fd/test/dynamo/test_subclasses.py (L1667-L1681)

Changes in this PR
- For each tensor fakified, check the nested int registry in eager, and eagerly symbolicize if that tensor has already been associated with nested int in eager.
- Adds a separate counter stored on FakeTensorMode as a fake analog to _tensor_id_counter (which keeps track of unique tensors). This counter is initialized to the global eager tensor id counter upon creation of the FakeTensorMode, and needs to be reset when the same FakeTensorMode is reused to trace again (in this PR, we piggyback on the epoch incrementing logic).
- (refactor) Today, we store FakeTensor -> symbolic nested int in the global registry. With this PR, symbolic nested int is stored directly on the FakeTensor. (Eager still caches nested int in the registry, though we should avoid this at some point.)

Basically unchanged, but worth noting:
- `__tensor_unflatten__` is still responsible for determining whether we should cache for now. The logic is somewhat simplified.
- to_copy is still using the trick of updating two different tensors in the registry to point to the same nested int. This is kind of broken, but we try to leave it as is, and plan a better fix with the UnionFind stack.

Differential Revision: [D60406772](https://our.internmc.facebook.com/intern/diff/D60406772)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130292
Approved by: https://github.com/bdhirsh
ghstack dependencies: #131916, #131803
2024-08-06 17:03:39 +00:00
406b50835b Use FakeTensor cache for subclass inner tensors (#131803)
Rewrite of original PR in https://github.com/pytorch/pytorch/pull/130291

To answer review comments from https://github.com/pytorch/pytorch/pull/130291#pullrequestreview-2166671953:

> At a higher level, do we need this?

Today, this should not change the behavior of anything. But an invariant of "same tensor always corresponds to the same FakeTensor" is nice (from discussion with @bdhirsh).

> Why does this happen?

Today, both dynamo and meta_utils do some recursion when it comes to FakeTensors. So whenever we fakify a subclass, the process would roughly like:

```
wrap_to_fake (subclass)
   meta_utils (subclass)
      meta_utils (values) -> not cached because we use callback
      meta_utils(offsets) -> not cached because we use callback
  wrap_to_fake (values)
  wrap_to_fake (offsets) -> cached because we rely on top-level meta_utils
```

However, we know that:
- Caching only occurs at the top-level of meta_utils.
- The return value of the top-level wrap_to_fake is returned.

This means that after all of this:
- The fakified subclass holds inner FakeTensors that are NOT part of the cache
- values/offsets are Fakified a second time, and those instances are cached.

Differential Revision: [D60406773](https://our.internmc.facebook.com/intern/diff/D60406773)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131803
Approved by: https://github.com/ezyang
ghstack dependencies: #131916
2024-08-06 17:03:39 +00:00
a94c441e48 Fix symbolic nested int printing (#131916)
Differential Revision: [D60406775](https://our.internmc.facebook.com/intern/diff/D60406775)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131916
Approved by: https://github.com/Skylion007, https://github.com/jbschlosser
2024-08-06 17:03:39 +00:00
ffdf48e63b Consolidate SymDispatchMode into ProxyTensorMode (#132674)
Instead of having a separate context variable for SymDispatchMode, we
now simply delegate to the current active proxy tensor mode when we
need to trace a SymInt.  We maintain a separate `__sym_dispatch__` magic
method as the calling convention is different than `__torch_dispatch__`.

Consolidating the modes in this ways means that we can consistently
disable both of these modes in tandem simply by removing the mode
from the proxy mode infra slot.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132674
Approved by: https://github.com/zou3519, https://github.com/bdhirsh
2024-08-06 17:03:17 +00:00
7045bc5a77 [export] change error message for specializations (#132698)
https://github.com/pytorch/pytorch/pull/130775 recently killed forced specializations for export on complex guards, so the only way we now get a specialized value is if we're able to solve for it. For example, if we have guards `s0 * 2 = s1`, `s0 + 6 = s1`, we specialize `s0 = 6; s1 = 12`.

That might look like this:
```
class Foo(torch.nn.Module):
    def forward(self, x, y):
        return x.reshape([-1]) + y

dy = Dim("dy", min=6)
x, y = torch.randn(6, 2), torch.randn(12)
dynamic_shapes = {
    "x": (dy - 6, 2),
    "y": (dy,),
}
```

Our current error message is:
`{symbol} must be specialized to {value} because the guards generated for it are too complex`
This is now misleading, so we change it to:
`solving the guards generated for {symbol} resulted in a specialized value of {value}`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132698
Approved by: https://github.com/avikchaudhuri
2024-08-06 16:59:53 +00:00
ca7ce2fca1 [ts-migration][1/N]: Add prim::Loop for constant number of iterations and condition (#131418)
#### Description
This PR adds prim::Loop support for the simplest case where the number of iteration is constant and the loop termination condition is also a constant.

[PR by stages](https://docs.google.com/document/d/1q6OprW3HBHbYPwEyE_DikBn-uzmhnN284Cmen_CnlhI/edit?usp=sharing)

#### Test Plan
Add reprod example.
* `pytest test/export/test_converter.py -s -k test_ts2ep_with_loop`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131418
Approved by: https://github.com/angelayi
2024-08-06 16:51:08 +00:00
C
c803e35c4b Reduce number of guards introduced by check_cudnn_tensor_shapes when cudnn version is higher enough (#132384)
I found that when using TorchDynamo (torch.compile) with dynamic shape on H100, there are some extra guards added to check the sequence length of inputs of `scaled_dot_product_attention` to be divisible by 64. These guards cause unwanted recompilations when the input shape changes.

In fact these guards are not necessary if our CUDNN version is higher enough, So I change the order of those checks to use short-circuit rules to skip those checks and avoid unnecessary guards.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132384
Approved by: https://github.com/eqy, https://github.com/Skylion007
2024-08-06 16:48:13 +00:00
fc7849b93f [pt2e][quant] Ensure BN node is erased after convert (#131651)
Summary: Previously, when folding BN into conv, we rely on DCE
to clean up the unused BN node from the graph. This works if
the model is already in eval mode, but fails if the model is
still in train mode because DCE doesn't remove nodes with
potential side effects (in this case `_native_batch_norm_legit`).
This required users to move the model to eval mode before calling
convert in order to get a properly DCE'd graph.

To solve this, we manually erase the BN node after folding
instead of relying on DCE. This relaxes the ordering constraints
between `move_exported_model_to_eval` and `convert_pt2e`.

Test Plan:
python test/test_quantization.py TestQuantizePT2EQAT_ConvBn1d.test_fold_bn_erases_bn_node
python test/test_quantization.py TestQuantizePT2EQAT_ConvBn2d.test_fold_bn_erases_bn_node

Reviewers: jerryzh168, yushangdi

Subscribers: jerryzh168, yushangdi, supriyar

Differential Revision: [D60520149](https://our.internmc.facebook.com/intern/diff/D60520149)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131651
Approved by: https://github.com/yushangdi, https://github.com/leslie-fang-intel
2024-08-06 16:37:39 +00:00
679cdf606a Converted __all__ literal tuple to literal list. (#132404)
Partial Fix for #131765.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132404
Approved by: https://github.com/soulitzer
2024-08-06 15:12:32 +00:00
6753ee127c Allow torch.cuda.memory.mem_get_info to take a device str argument with an unspecified device index. (#132616)
`torch.cuda.memory.mem_get_info` allows device strings given the current type hints. However, `device = torch.device('cuda')` leads to `device.index = None`, which results in downstream problems. Setting `optional=True` will insert the default device index in such cases.

Fixes #132583

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132616
Approved by: https://github.com/soulitzer
2024-08-06 13:19:46 +00:00
7100c36c8a Revert "[inductor] export kernel for gemm template. (#132580)"
This reverts commit 87d46d70d7754e32eb0e6689688f4336e4e7c955.

Reverted https://github.com/pytorch/pytorch/pull/132580 on behalf of https://github.com/PaliC due to sys is not defined in torch/_inductor/codegen/cpp_utils.py ([comment](https://github.com/pytorch/pytorch/pull/132580#issuecomment-2271264974))
2024-08-06 13:15:15 +00:00
cyy
656a4d1408 [6/N] Fix clang-tidy warnings in aten/src/ATen (#132620)
Follows #132565

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132620
Approved by: https://github.com/Skylion007
2024-08-06 13:07:16 +00:00
a8f0979962 Add cudagraph static inputs logging (#132726)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132726
Approved by: https://github.com/anijain2305
2024-08-06 12:01:20 +00:00
da320214e6 Format tensor (#127992)
Align tensor display
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127992
Approved by: https://github.com/janeyx99
2024-08-06 07:10:16 +00:00
728374d7f7 Changed create_block_mask to just accept BLOCK_SIZE (#132697)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132697
Approved by: https://github.com/drisspg
2024-08-06 04:37:15 +00:00
91df66ee74 [caffe2] Wrap constexpr with preprocessor statements (#132582)
Summary: When the preprocessor check we leave an unused constexpr around, so when `-Wunused-const-variable` is enabled we get an error. Let's inline these values since they're not used anywhere else in order to avoid this.

Test Plan: CI

Differential Revision: D60723823

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132582
Approved by: https://github.com/houseroad
2024-08-06 04:35:06 +00:00
4260f365ba [inductor] Replace torch.allclose with torch.testing.assert_close in test_fx_fusion (#130618)
Preventative fix of a test failure with oneDNN v3.5 upgrade where order of float32 arithmetic may change in torch.admm ( bias term can be at the start or end of the arithmetic ) resulting in slightly different output due to float32 precision loss.

Replaced occurrences of torch.allclose with ~~torch._dynamo.testing.same~~  torch.testing.assert_close which is the recommended approach as per this issue https://github.com/pytorch/pytorch/issues/56544 ,the default tolerance is more relaxed than torch.allclose which satisfies the test with upcoming oneDNN change.

This should fix aarch64 ci failures in #129932

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130618
Approved by: https://github.com/jgong5, https://github.com/malfet
2024-08-06 03:58:43 +00:00
4e610924d4 [c10d] Add a new API for adding ephemeral timeout for one local rank and the timeout will reset when the first collective finishes (#130905)
We provide an API for user to add ephemeral timeout across all PGs within one rank and the timeout will reset when the first collective issued after the timeout added finishes.

Each extension only covers collectives after the issue and before the first collective finished. The diagram below shows how the timeout changes:

<img width="1174" alt="image" src="https://github.com/user-attachments/assets/354923b7-581c-40de-ae0f-1cd3da273ccc">

While this feature provides flexibility in specific scenarios, it introduces statefulness to timeout setting. Therefore, it is advisable to use this API sparingly and consider alternative approaches, such as directly setting the timeout or utilizing a barrier collective (one can set any timeout to the barrier), whenever feasible.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130905
Approved by: https://github.com/ezyang
2024-08-06 03:47:58 +00:00
39c9b75a68 Add registration mechanism for aoti model runner (#131638)
Current AOTI model runner has supported CUDA and CPU. However, in terms of a particular out-of-tree backend, it is not easier to support the feature.

This PR intends to provide a registration mechanism to support this case by providing two: `RegisterAOTIModelRunner` and `getAOTIModelRunnerRegistry`.

- `RegisterAOTIModelRunner` is used to register a function(`AOTIModelRunnerABC`) to create a `AOTIModelContainerRunner`. The function signature is as follows.

    ```C++
    using AOTIModelRunnerABC = std::shared_ptr<AOTIModelContainerRunner> (*)(
        const std::string& model_so_path,
        size_t num_models,
        const std::string& device_str,
        const std::string& bin_dir);
    ```
- `getAOTIModelRunnerRegistry` is used to get all the registered backends.

In terms of a new backend, it needs to define its `AOTIModelContainerRunner` class and then register a `AOTIModelRunnerABC` function to `aoti` to create its `AOTIModelContainerRunner`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131638
Approved by: https://github.com/desertfire, https://github.com/jansel
2024-08-06 02:47:35 +00:00
345bea01dc Refactor thunkify to return proper thunk abstraction (#132407)
This is superior to lru_cache because (1) it's more explicit and (2) it
doesn't leak the original function after it's been forced.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132407
Approved by: https://github.com/albanD
2024-08-06 02:35:45 +00:00
93fad2f0f2 [export] Fix import in D60427208 (#132707)
Summary:
D60427208 broke APS release by failing our NE  deterministric test. https://www.internalfb.com/intern/test/562950111197340/

This Diff fixes it.

Test Plan:
```
buck2 run 'fbcode//mode/dev-nosan' fbcode//aps_models/ads/gmp/tests/ne/e2e_deterministic_tests:gmp_e2e_ne_tests -- --filter-text test_mtml_instagram_model_474023725_single_gpu_with_ir
```

Differential Revision: D60790203

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132707
Approved by: https://github.com/ydwu4
2024-08-06 02:35:17 +00:00
2f16e68cab [Intel GPU] Allow XPU device in copy, cdist, index_put_impl (#130088)
# Motivation
`copy`, `cdist`, `index_put_impl` operators use `op_stub` for runtime dispatching inside operators.  Extra device list is inside them to assure the accuracy, while XPU is not in them. This PRs make them allow XPU as a supported device.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130088
Approved by: https://github.com/EikanWang, https://github.com/gujinghui, https://github.com/albanD
ghstack dependencies: #130019, #130082
2024-08-06 01:55:50 +00:00
38674bcb45 Revert "Conversions between strided and jagged layouts for Nested Tensors (#115749)"
This reverts commit eca0cb0fbe84bb0a34fa94afe261bceecd52c436.

Reverted https://github.com/pytorch/pytorch/pull/115749 on behalf of https://github.com/izaitsevfb due to breaks test_overrides.py::TestTorchFunctionWarning::test_warn_on_invalid_torch_function_tensor_subclass ([comment](https://github.com/pytorch/pytorch/pull/115749#issuecomment-2270213988))
2024-08-06 01:55:41 +00:00
d6a24b3b92 Removed duplicate __all__ declarations. (#132405)
Partial Fix for #131765.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132405
Approved by: https://github.com/soulitzer
2024-08-06 01:17:44 +00:00
96471ea47c [inductor] support vectorization for torch.any(bool) -> bool (#132472)
Support reduction `any` by from `bool` to `bool`.
TestPlan:
```
python test/inductor/test_cpu_repro.py -k test_any_bool_vec
```

Generated code for `test_any_bool_vec`
```
cpp_fused_any_0 = async_compile.cpp_pybinding(['const float*', 'const float*', 'bool*', 'bool*'], '''
#include "/tmp/torchinductor_root/ky/cky2bufythacofebk7ujv36e4pxyqcqbpsy5r4vojoprjiwcwfxf.h"
extern "C"  void kernel(const float* in_ptr0,
                       const float* in_ptr1,
                       bool* out_ptr0,
                       bool* out_ptr1)
{
    {
        {
            bool tmp_acc0 = 0;
            at::vec::VecMask<float,1> tmp_acc0_vec = at::vec::VecMask<float,1>::from(0);
            bool tmp_acc0_arr[64];
            for (int tid = 0; tid < 64; tid++)
            {
                tmp_acc0_arr[tid] = 0;
            }
            at::vec::VecMask<float,1> tmp_acc0_vec_arr[64];
            for (int tid = 0; tid < 64; tid++)
            {
                tmp_acc0_vec_arr[tid] = at::vec::VecMask<float,1>::from(0);
            }
            #pragma omp parallel num_threads(64)
            {
                int tid = omp_get_thread_num();
                bool tmp_acc0_local = 0;
                at::vec::VecMask<float,1> tmp_acc0_vec_local = at::vec::VecMask<float,1>::from(0);
                #pragma omp for
                for(long x0=static_cast<long>(0L); x0<static_cast<long>(64L); x0+=static_cast<long>(16L))
                {
                    auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + static_cast<long>(x0), 16);
                    auto tmp1 = at::vec::VecMask<float,1>::from<float,1>(tmp0);
                    tmp_acc0_vec_local = tmp_acc0_vec_local | tmp1;
                }
                tmp_acc0_arr[tid] = tmp_acc0_local;
                tmp_acc0_vec_arr[tid] = tmp_acc0_vec_local;
            }
            for (int tid = 0; tid < 64; tid++)
            {
                tmp_acc0 = tmp_acc0 || tmp_acc0_arr[tid];
            }
            for (int tid = 0; tid < 64; tid++)
            {
                tmp_acc0_vec = tmp_acc0_vec | tmp_acc0_vec_arr[tid];
            }
            tmp_acc0 = tmp_acc0 || at::vec::vec_reduce_all<bool>([](at::vec::Vectorized<bool>& x, at::vec::Vectorized<bool>& y) { return x | y; }, tmp_acc0_vec.to<bool, 1>());
            out_ptr0[static_cast<long>(0L)] = static_cast<bool>(tmp_acc0);
        }
    }
    {
        {
            bool tmp_acc0 = 0;
            at::vec::VecMask<float,1> tmp_acc0_vec = at::vec::VecMask<float,1>::from(0);
            bool tmp_acc0_arr[64];
            for (int tid = 0; tid < 64; tid++)
            {
                tmp_acc0_arr[tid] = 0;
            }
            at::vec::VecMask<float,1> tmp_acc0_vec_arr[64];
            for (int tid = 0; tid < 64; tid++)
            {
                tmp_acc0_vec_arr[tid] = at::vec::VecMask<float,1>::from(0);
            }
            #pragma omp parallel num_threads(64)
            {
                int tid = omp_get_thread_num();
                bool tmp_acc0_local = 0;
                at::vec::VecMask<float,1> tmp_acc0_vec_local = at::vec::VecMask<float,1>::from(0);
                #pragma omp for
                for(long x0=static_cast<long>(0L); x0<static_cast<long>(64L); x0+=static_cast<long>(16L))
                {
                    auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr1 + static_cast<long>(x0), 16);
                    auto tmp1 = at::vec::VecMask<float,1>::from<float,1>(tmp0);
                    tmp_acc0_vec_local = tmp_acc0_vec_local | tmp1;
                }
                tmp_acc0_arr[tid] = tmp_acc0_local;
                tmp_acc0_vec_arr[tid] = tmp_acc0_vec_local;
            }
            for (int tid = 0; tid < 64; tid++)
            {
                tmp_acc0 = tmp_acc0 || tmp_acc0_arr[tid];
            }
            for (int tid = 0; tid < 64; tid++)
            {
                tmp_acc0_vec = tmp_acc0_vec | tmp_acc0_vec_arr[tid];
            }
            tmp_acc0 = tmp_acc0 || at::vec::vec_reduce_all<bool>([](at::vec::Vectorized<bool>& x, at::vec::Vectorized<bool>& y) { return x | y; }, tmp_acc0_vec.to<bool, 1>());
            out_ptr1[static_cast<long>(0L)] = static_cast<bool>(tmp_acc0);
        }
    }
}
''')
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132472
Approved by: https://github.com/jgong5
2024-08-06 01:03:51 +00:00
26c6786109 return_and_correct_aliasing: skip dispatcher when swapping storage (#132524)
`return_and_correct_aliasing` is used by FunctionalTensor today to ensure that when we call view/inplace ops, the input and output `FunctionalTensors` share the same storage.

This was previously done with a dispatcher call to `aten.set_`. In this PR I swap it out with a util that just manually does the storage swap. Benefits:

(1) we know this is safe in the specific way it is used by FunctionalTensor: avoiding the extra assertions in `aten.set_` is necessary to avoid some unbacked symint errors

(2) this should improve compile times a bit

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132524
Approved by: https://github.com/ezyang
ghstack dependencies: #132243, #132337, #132322
2024-08-06 00:44:35 +00:00
2702 changed files with 124437 additions and 64078 deletions

View File

@ -1,5 +1,5 @@
0.6b
0.7b
manylinux_2_17
rocm6.1
7f07e8a1cb1f99627eb6d77f5c0e9295c775f3c7
77c29fa3f3b614e187d7213d745e989a92708cee2bc6020419ab49019af399d1
rocm6.2
9be04068c3c0857a4cfd17d7e39e71d0423ebac2
3e9e1959d23b93d78a08fcc5f868125dc3854dece32fd9458be9ef4467982291

View File

@ -92,7 +92,7 @@ _UCC_COMMIT=20eae37090a4ce1b32bcce6144ccad0b49943e0b
# from scratch
case "$image" in
pytorch-linux-focal-cuda12.4-cudnn9-py3-gcc9)
CUDA_VERSION=12.4.0
CUDA_VERSION=12.4.1
CUDNN_VERSION=9
ANACONDA_PYTHON_VERSION=3.10
GCC_VERSION=9
@ -120,7 +120,7 @@ case "$image" in
TRITON=yes
;;
pytorch-linux-focal-cuda12.4-cudnn9-py3-gcc9-inductor-benchmarks)
CUDA_VERSION=12.4.0
CUDA_VERSION=12.4.1
CUDNN_VERSION=9
ANACONDA_PYTHON_VERSION=3.10
GCC_VERSION=9
@ -165,7 +165,7 @@ case "$image" in
INDUCTOR_BENCHMARKS=yes
;;
pytorch-linux-focal-cuda12.4-cudnn9-py3.12-gcc9-inductor-benchmarks)
CUDA_VERSION=12.4.0
CUDA_VERSION=12.4.1
CUDNN_VERSION=9
ANACONDA_PYTHON_VERSION=3.12
GCC_VERSION=9
@ -194,7 +194,7 @@ case "$image" in
TRITON=yes
;;
pytorch-linux-focal-cuda12.4-cudnn9-py3-gcc9)
CUDA_VERSION=12.4.0
CUDA_VERSION=12.4.1
CUDNN_VERSION=9
ANACONDA_PYTHON_VERSION=3.10
GCC_VERSION=9
@ -222,7 +222,7 @@ case "$image" in
TRITON=yes
;;
pytorch-linux-focal-cuda12.4-cudnn9-py3-gcc9)
CUDA_VERSION=12.4.0
CUDA_VERSION=12.4.1
CUDNN_VERSION=9
ANACONDA_PYTHON_VERSION=3.10
GCC_VERSION=9
@ -236,7 +236,7 @@ case "$image" in
TRITON=yes
;;
pytorch-linux-focal-py3-clang10-onnx)
ANACONDA_PYTHON_VERSION=3.8
ANACONDA_PYTHON_VERSION=3.9
CLANG_VERSION=10
PROTOBUF=yes
DB=yes
@ -245,7 +245,7 @@ case "$image" in
ONNX=yes
;;
pytorch-linux-focal-py3-clang9-android-ndk-r21e)
ANACONDA_PYTHON_VERSION=3.8
ANACONDA_PYTHON_VERSION=3.9
CLANG_VERSION=9
LLVMDEV=yes
PROTOBUF=yes
@ -254,8 +254,8 @@ case "$image" in
GRADLE_VERSION=6.8.3
NINJA_VERSION=1.9.0
;;
pytorch-linux-focal-py3.8-clang10)
ANACONDA_PYTHON_VERSION=3.8
pytorch-linux-focal-py3.9-clang10)
ANACONDA_PYTHON_VERSION=3.9
CLANG_VERSION=10
PROTOBUF=yes
DB=yes
@ -276,8 +276,8 @@ case "$image" in
CONDA_CMAKE=yes
TRITON=yes
;;
pytorch-linux-focal-py3.8-gcc9)
ANACONDA_PYTHON_VERSION=3.8
pytorch-linux-focal-py3.9-gcc9)
ANACONDA_PYTHON_VERSION=3.9
GCC_VERSION=9
PROTOBUF=yes
DB=yes
@ -286,18 +286,7 @@ case "$image" in
TRITON=yes
;;
pytorch-linux-focal-rocm-n-1-py3)
ANACONDA_PYTHON_VERSION=3.8
GCC_VERSION=9
PROTOBUF=yes
DB=yes
VISION=yes
ROCM_VERSION=6.0
NINJA_VERSION=1.9.0
CONDA_CMAKE=yes
TRITON=yes
;;
pytorch-linux-focal-rocm-n-py3)
ANACONDA_PYTHON_VERSION=3.8
ANACONDA_PYTHON_VERSION=3.10
GCC_VERSION=9
PROTOBUF=yes
DB=yes
@ -307,8 +296,19 @@ case "$image" in
CONDA_CMAKE=yes
TRITON=yes
;;
pytorch-linux-focal-rocm-n-py3)
ANACONDA_PYTHON_VERSION=3.10
GCC_VERSION=9
PROTOBUF=yes
DB=yes
VISION=yes
ROCM_VERSION=6.2
NINJA_VERSION=1.9.0
CONDA_CMAKE=yes
TRITON=yes
;;
pytorch-linux-jammy-xpu-2024.0-py3)
ANACONDA_PYTHON_VERSION=3.8
ANACONDA_PYTHON_VERSION=3.9
GCC_VERSION=11
PROTOBUF=yes
DB=yes
@ -318,8 +318,8 @@ case "$image" in
CONDA_CMAKE=yes
TRITON=yes
;;
pytorch-linux-jammy-py3.8-gcc11-inductor-benchmarks)
ANACONDA_PYTHON_VERSION=3.8
pytorch-linux-jammy-py3.9-gcc11-inductor-benchmarks)
ANACONDA_PYTHON_VERSION=3.9
GCC_VERSION=11
PROTOBUF=yes
DB=yes
@ -330,8 +330,8 @@ case "$image" in
DOCS=yes
INDUCTOR_BENCHMARKS=yes
;;
pytorch-linux-jammy-cuda11.8-cudnn9-py3.8-clang12)
ANACONDA_PYTHON_VERSION=3.8
pytorch-linux-jammy-cuda11.8-cudnn9-py3.9-clang12)
ANACONDA_PYTHON_VERSION=3.9
CUDA_VERSION=11.8
CUDNN_VERSION=9
CLANG_VERSION=12
@ -355,8 +355,8 @@ case "$image" in
CONDA_CMAKE=yes
VISION=yes
;;
pytorch-linux-jammy-py3.8-gcc11)
ANACONDA_PYTHON_VERSION=3.8
pytorch-linux-jammy-py3.9-gcc11)
ANACONDA_PYTHON_VERSION=3.9
GCC_VERSION=11
PROTOBUF=yes
DB=yes
@ -379,6 +379,7 @@ case "$image" in
GCC_VERSION=11
CONDA_CMAKE=yes
HALIDE=yes
TRITON=yes
;;
pytorch-linux-focal-linter)
# TODO: Use 3.9 here because of this issue https://github.com/python/mypy/issues/13627.

View File

@ -108,10 +108,10 @@ ENV CMAKE_C_COMPILER cc
ENV CMAKE_CXX_COMPILER c++
COPY ./common/install_triton.sh install_triton.sh
COPY ./common/common_utils.sh common_utils.sh
COPY ci_commit_pins/triton-rocm.txt triton-rocm.txt
COPY ci_commit_pins/triton.txt triton.txt
COPY triton_version.txt triton_version.txt
RUN if [ -n "${TRITON}" ]; then bash ./install_triton.sh; fi
RUN rm install_triton.sh common_utils.sh triton-rocm.txt triton_version.txt
RUN rm install_triton.sh common_utils.sh triton.txt triton_version.txt
# Install AOTriton (Early fail)
COPY ./aotriton_version.txt aotriton_version.txt

View File

@ -1 +1 @@
91298923a0076c1b41059efb6dad2876426e4b03
cd1c833b079adb324871dcbbe75b43d42ffc0ade

View File

@ -1 +1 @@
340136fec6d3ebc73e7a19eba1663e9b0ba8ab2d
461c12871f336fe6f57b55d6a297f13ef209161b

View File

@ -1 +1 @@
730b907b4d45a4713cbc425cbf224c46089fd514
ac3470188b914c5d7a5058a7e28b9eb685a62427

View File

@ -1 +0,0 @@
21eae954efa5bf584da70324b640288c3ee7aede

View File

@ -1 +1 @@
1b2f15840e0d70eec50d84c7a0575cb835524def
91b14bf5593cf58a8541f3e6b9125600a867d4ef

View File

@ -1 +1 @@
dedb7bdf339a3546896d4820366ca562c586bfa0
5fe38ffd73c2ac6ed6323b554205186696631c6f

View File

@ -1,5 +0,0 @@
0.6b
manylinux_2_17
rocm6.1
04b5df8c8123f90cba3ede7e971e6fbc6040d506
77c29fa3f3b614e187d7213d745e989a92708cee2bc6020419ab49019af399d1

View File

@ -4,12 +4,12 @@ set -ex
source "$(dirname "${BASH_SOURCE[0]}")/common_utils.sh"
TARBALL='aotriton.tar.bz2'
TARBALL='aotriton.tar.gz'
# This read command alwasy returns with exit code 1
read -d "\n" VER MANYLINUX ROCMBASE PINNED_COMMIT SHA256 < aotriton_version.txt || true
ARCH=$(uname -m)
AOTRITON_INSTALL_PREFIX="$1"
AOTRITON_URL="https://github.com/ROCm/aotriton/releases/download/${VER}/aotriton-${VER}-${MANYLINUX}_${ARCH}-${ROCMBASE}-shared.tar.bz2"
AOTRITON_URL="https://github.com/ROCm/aotriton/releases/download/${VER}/aotriton-${VER}-${MANYLINUX}_${ARCH}-${ROCMBASE}-shared.tar.gz"
cd "${AOTRITON_INSTALL_PREFIX}"
# Must use -L to follow redirects

View File

@ -5,32 +5,22 @@ set -ex
# Optionally install conda
if [ -n "$ANACONDA_PYTHON_VERSION" ]; then
BASE_URL="https://repo.anaconda.com/miniconda"
CONDA_FILE="Miniconda3-latest-Linux-x86_64.sh"
if [[ $(uname -m) == "aarch64" ]] || [[ "$BUILD_ENVIRONMENT" == *xpu* ]]; then
BASE_URL="https://github.com/conda-forge/miniforge/releases/latest/download"
CONDA_FILE="Miniforge3-Linux-$(uname -m).sh"
fi
MAJOR_PYTHON_VERSION=$(echo "$ANACONDA_PYTHON_VERSION" | cut -d . -f 1)
MINOR_PYTHON_VERSION=$(echo "$ANACONDA_PYTHON_VERSION" | cut -d . -f 2)
if [[ $(uname -m) == "aarch64" ]]; then
BASE_URL="https://github.com/conda-forge/miniforge/releases/latest/download"
case "$MAJOR_PYTHON_VERSION" in
3)
CONDA_FILE="Miniforge3-Linux-aarch64.sh"
;;
3);;
*)
echo "Unsupported ANACONDA_PYTHON_VERSION: $ANACONDA_PYTHON_VERSION"
exit 1
;;
esac
else
case "$MAJOR_PYTHON_VERSION" in
3)
CONDA_FILE="Miniconda3-latest-Linux-x86_64.sh"
;;
*)
echo "Unsupported ANACONDA_PYTHON_VERSION: $ANACONDA_PYTHON_VERSION"
exit 1
;;
esac
fi
mkdir -p /opt/conda
chown jenkins:jenkins /opt/conda
@ -78,19 +68,20 @@ fi
CONDA_COMMON_DEPS="astunparse pyyaml setuptools openblas==0.3.25=*openmp* ninja==1.11.1 scons==4.5.2"
if [ "$ANACONDA_PYTHON_VERSION" = "3.8" ]; then
conda_install numpy=1.24.4 ${CONDA_COMMON_DEPS}
NUMPY_VERSION=1.24.4
else
conda_install numpy=1.26.2 ${CONDA_COMMON_DEPS}
NUMPY_VERSION=1.26.2
fi
else
CONDA_COMMON_DEPS="astunparse pyyaml mkl=2021.4.0 mkl-include=2021.4.0 setuptools"
if [ "$ANACONDA_PYTHON_VERSION" = "3.11" ] || [ "$ANACONDA_PYTHON_VERSION" = "3.12" ] || [ "$ANACONDA_PYTHON_VERSION" = "3.13" ]; then
conda_install numpy=1.26.0 ${CONDA_COMMON_DEPS}
NUMPY_VERSION=1.26.0
else
conda_install numpy=1.21.2 ${CONDA_COMMON_DEPS}
NUMPY_VERSION=1.21.2
fi
fi
conda_install ${CONDA_COMMON_DEPS}
# Install llvm-8 as it is required to compile llvmlite-0.30.0 from source
# and libpython-static for torch deploy
@ -112,7 +103,7 @@ fi
# Install some other packages, including those needed for Python test reporting
pip_install -r /opt/conda/requirements-ci.txt
pip_install numpy=="$NUMPY_VERSION"
pip_install -U scikit-learn
if [ -n "$DOCS" ]; then

View File

@ -7,7 +7,7 @@ PYTHON_DOWNLOAD_GITHUB_BRANCH=https://github.com/python/cpython/archive/refs/hea
GET_PIP_URL=https://bootstrap.pypa.io/get-pip.py
# Python versions to be installed in /opt/$VERSION_NO
CPYTHON_VERSIONS=${CPYTHON_VERSIONS:-"3.8.1 3.9.0 3.10.1 3.11.0 3.12.0 3.13.0"}
CPYTHON_VERSIONS=${CPYTHON_VERSIONS:-"3.8.1 3.9.0 3.10.1 3.11.0 3.12.0 3.13.0 3.13.0t"}
function check_var {
if [ -z "$1" ]; then
@ -22,6 +22,13 @@ function do_cpython_build {
check_var $py_ver
check_var $py_folder
tar -xzf Python-$py_ver.tgz
local additional_flags=""
if [ "$py_ver" == "3.13.0t" ]; then
additional_flags=" --disable-gil"
mv cpython-3.13/ cpython-3.13t/
fi
pushd $py_folder
local prefix="/opt/_internal/cpython-${py_ver}"
@ -37,8 +44,10 @@ function do_cpython_build {
local openssl_flags="--with-openssl=${WITH_OPENSSL} --with-openssl-rpath=auto"
fi
# -Wformat added for https://bugs.python.org/issue17547 on Python 2.6
CFLAGS="-Wformat" ./configure --prefix=${prefix} ${openssl_flags} ${shared_flags} > /dev/null
CFLAGS="-Wformat" ./configure --prefix=${prefix} ${openssl_flags} ${shared_flags} ${additional_flags} > /dev/null
make -j40 > /dev/null
make install > /dev/null
@ -58,7 +67,8 @@ function do_cpython_build {
if [ -e ${prefix}/bin/pip3 ] && [ ! -e ${prefix}/bin/pip ]; then
ln -s pip3 ${prefix}/bin/pip
fi
${prefix}/bin/pip install wheel==0.34.2
# install setuptools since python 3.12 is required to use distutils
${prefix}/bin/pip install wheel==0.34.2 setuptools==68.2.2
local abi_tag=$(${prefix}/bin/python -c "from wheel.pep425tags import get_abbr_impl, get_impl_ver, get_abi_tag; print('{0}{1}-{2}'.format(get_abbr_impl(), get_impl_ver(), get_abi_tag()))")
ln -s ${prefix} /opt/python/${abi_tag}
}
@ -68,7 +78,14 @@ function build_cpython {
check_var $py_ver
check_var $PYTHON_DOWNLOAD_URL
local py_ver_folder=$py_ver
if [ "$py_ver" = "3.13.0" ]; then
if [ "$py_ver" = "3.13.0t" ]; then
PY_VER_SHORT="3.13"
PYT_VER_SHORT="3.13t"
check_var $PYTHON_DOWNLOAD_GITHUB_BRANCH
wget $PYTHON_DOWNLOAD_GITHUB_BRANCH/$PY_VER_SHORT.tar.gz -O Python-$py_ver.tgz
do_cpython_build $py_ver cpython-$PYT_VER_SHORT
elif [ "$py_ver" = "3.13.0" ]; then
PY_VER_SHORT="3.13"
check_var $PYTHON_DOWNLOAD_GITHUB_BRANCH
wget $PYTHON_DOWNLOAD_GITHUB_BRANCH/$PY_VER_SHORT.tar.gz -O Python-$py_ver.tgz

View File

@ -27,6 +27,17 @@ function install_cusparselt_052 {
rm -rf tmp_cusparselt
}
function install_cusparselt_062 {
# cuSparseLt license: https://docs.nvidia.com/cuda/cusparselt/license.html
mkdir tmp_cusparselt && pushd tmp_cusparselt
wget -q https://developer.download.nvidia.com/compute/cusparselt/redist/libcusparse_lt/linux-x86_64/libcusparse_lt-linux-x86_64-0.6.2.3-archive.tar.xz
tar xf libcusparse_lt-linux-x86_64-0.6.2.3-archive.tar.xz
cp -a libcusparse_lt-linux-x86_64-0.6.2.3-archive/include/* /usr/local/cuda/include/
cp -a libcusparse_lt-linux-x86_64-0.6.2.3-archive/lib/* /usr/local/cuda/lib64/
popd
rm -rf tmp_cusparselt
}
function install_118 {
echo "Installing CUDA 11.8 and cuDNN ${CUDNN_VERSION} and NCCL ${NCCL_VERSION} and cuSparseLt-0.4.0"
rm -rf /usr/local/cuda-11.8 /usr/local/cuda
@ -94,13 +105,13 @@ function install_121 {
}
function install_124 {
echo "Installing CUDA 12.4 and cuDNN ${CUDNN_VERSION} and NCCL ${NCCL_VERSION} and cuSparseLt-0.5.2"
echo "Installing CUDA 12.4.1 and cuDNN ${CUDNN_VERSION} and NCCL ${NCCL_VERSION} and cuSparseLt-0.5.2"
rm -rf /usr/local/cuda-12.4 /usr/local/cuda
# install CUDA 12.4.0 in the same container
wget -q https://developer.download.nvidia.com/compute/cuda/12.4.0/local_installers/cuda_12.4.0_550.54.14_linux.run
chmod +x cuda_12.4.0_550.54.14_linux.run
./cuda_12.4.0_550.54.14_linux.run --toolkit --silent
rm -f cuda_12.4.0_550.54.14_linux.run
# install CUDA 12.4.1 in the same container
wget -q https://developer.download.nvidia.com/compute/cuda/12.4.1/local_installers/cuda_12.4.1_550.54.15_linux.run
chmod +x cuda_12.4.1_550.54.15_linux.run
./cuda_12.4.1_550.54.15_linux.run --toolkit --silent
rm -f cuda_12.4.1_550.54.15_linux.run
rm -f /usr/local/cuda && ln -s /usr/local/cuda-12.4 /usr/local/cuda
# cuDNN license: https://developer.nvidia.com/cudnn/license_agreement
@ -121,7 +132,7 @@ function install_124 {
cd ..
rm -rf nccl
install_cusparselt_052
install_cusparselt_062
ldconfig
}

View File

@ -17,13 +17,13 @@ function install_cusparselt_052 {
}
function install_124 {
echo "Installing CUDA 12.4 and cuDNN 9.1 and NCCL ${NCCL_VERSION} and cuSparseLt-0.5.2"
echo "Installing CUDA 12.4.1 and cuDNN 9.1 and NCCL ${NCCL_VERSION} and cuSparseLt-0.5.2"
rm -rf /usr/local/cuda-12.4 /usr/local/cuda
# install CUDA 12.4.0 in the same container
wget -q https://developer.download.nvidia.com/compute/cuda/12.4.0/local_installers/cuda_12.4.0_550.54.14_linux_sbsa.run
chmod +x cuda_12.4.0_550.54.14_linux_sbsa.run
./cuda_12.4.0_550.54.14_linux_sbsa.run --toolkit --silent
rm -f cuda_12.4.0_550.54.14_linux_sbsa.run
# install CUDA 12.4.1 in the same container
wget -q https://developer.download.nvidia.com/compute/cuda/12.4.1/local_installers/cuda_12.4.1_550.54.15_linux_sbsa.run
chmod +x cuda_12.4.1_550.54.15_linux_sbsa.run
./cuda_12.4.1_550.54.15_linux_sbsa.run --toolkit --silent
rm -f cuda_12.4.1_550.54.15_linux_sbsa.run
rm -f /usr/local/cuda && ln -s /usr/local/cuda-12.4 /usr/local/cuda
# cuDNN license: https://developer.nvidia.com/cudnn/license_agreement

View File

@ -0,0 +1,25 @@
#!/bin/bash
set -ex
# cudss license: https://docs.nvidia.com/cuda/cudss/license.html
mkdir tmp_cudss && cd tmp_cudss
if [[ ${CUDA_VERSION:0:4} =~ ^12\.[1-4]$ ]]; then
arch_path='sbsa'
export TARGETARCH=${TARGETARCH:-$(uname -m)}
if [ ${TARGETARCH} = 'amd64' ] || [ "${TARGETARCH}" = 'x86_64' ]; then
arch_path='x86_64'
fi
CUDSS_NAME="libcudss-linux-${arch_path}-0.3.0.9_cuda12-archive"
curl --retry 3 -OLs https://developer.download.nvidia.com/compute/cudss/redist/libcudss/linux-${arch_path}/${CUDSS_NAME}.tar.xz
# only for cuda 12
tar xf ${CUDSS_NAME}.tar.xz
cp -a ${CUDSS_NAME}/include/* /usr/local/cuda/include/
cp -a ${CUDSS_NAME}/lib/* /usr/local/cuda/lib64/
fi
cd ..
rm -rf tmp_cudss
ldconfig

View File

@ -5,7 +5,15 @@ set -ex
# cuSPARSELt license: https://docs.nvidia.com/cuda/cusparselt/license.html
mkdir tmp_cusparselt && cd tmp_cusparselt
if [[ ${CUDA_VERSION:0:4} =~ ^12\.[1-4]$ ]]; then
if [[ ${CUDA_VERSION:0:4} =~ ^12\.[2-6]$ ]]; then
arch_path='sbsa'
export TARGETARCH=${TARGETARCH:-$(uname -m)}
if [ ${TARGETARCH} = 'amd64' ] || [ "${TARGETARCH}" = 'x86_64' ]; then
arch_path='x86_64'
fi
CUSPARSELT_NAME="libcusparse_lt-linux-${arch_path}-0.6.2.3-archive"
curl --retry 3 -OLs https://developer.download.nvidia.com/compute/cusparselt/redist/libcusparse_lt/linux-${arch_path}/${CUSPARSELT_NAME}.tar.xz
elif [[ ${CUDA_VERSION:0:4} == "12.1" ]]; then
arch_path='sbsa'
export TARGETARCH=${TARGETARCH:-$(uname -m)}
if [ ${TARGETARCH} = 'amd64' ] || [ "${TARGETARCH}" = 'x86_64' ]; then

View File

@ -10,6 +10,21 @@ if [[ -z $ROCM_VERSION ]]; then
exit 1;
fi
IS_UBUNTU=0
ID=$(grep -oP '(?<=^ID=).+' /etc/os-release | tr -d '"')
case "$ID" in
ubuntu)
IS_UBUNTU=1
;;
centos)
IS_UBUNTU=0
;;
*)
echo "Unable to determine OS..."
exit 1
;;
esac
# To make version comparison easier, create an integer representation.
save_IFS="$IFS"
IFS=. ROCM_VERSION_ARRAY=(${ROCM_VERSION})
@ -57,7 +72,12 @@ MIOPEN_CMAKE_COMMON_FLAGS="
-DMIOPEN_BUILD_DRIVER=OFF
"
# Pull MIOpen repo and set DMIOPEN_EMBED_DB based on ROCm version
if [[ $ROCM_INT -ge 60100 ]] && [[ $ROCM_INT -lt 60200 ]]; then
if [[ $ROCM_INT -ge 60300 ]]; then
echo "ROCm 6.3+ MIOpen does not need any patches, do not build from source"
exit 0
elif [[ $ROCM_INT -ge 60200 ]] && [[ $ROCM_INT -lt 60300 ]]; then
MIOPEN_BRANCH="release/rocm-rel-6.2-staging"
elif [[ $ROCM_INT -ge 60100 ]] && [[ $ROCM_INT -lt 60200 ]]; then
echo "ROCm 6.1 MIOpen does not need any patches, do not build from source"
exit 0
elif [[ $ROCM_INT -ge 60000 ]] && [[ $ROCM_INT -lt 60100 ]]; then
@ -90,12 +110,21 @@ else
exit 1
fi
yum remove -y miopen-hip
if [[ ${IS_UBUNTU} == 1 ]]; then
apt-get remove -y miopen-hip
else
yum remove -y miopen-hip
fi
git clone https://github.com/ROCm/MIOpen -b ${MIOPEN_BRANCH}
pushd MIOpen
# remove .git to save disk space since CI runner was running out
rm -rf .git
# Don't build CK to save docker build time
if [[ $ROCM_INT -ge 60200 ]]; then
sed -i '/composable_kernel/d' requirements.txt
fi
# Don't build MLIR to save docker build time
# since we are disabling MLIR backend for MIOpen anyway
if [[ $ROCM_INT -ge 50400 ]] && [[ $ROCM_INT -lt 50500 ]]; then
@ -108,10 +137,15 @@ cmake -P install_deps.cmake --minimum
# clean up since CI runner was running out of disk space
rm -rf /tmp/*
yum clean all
rm -rf /var/cache/yum
rm -rf /var/lib/yum/yumdb
rm -rf /var/lib/yum/history
if [[ ${IS_UBUNTU} == 1 ]]; then
apt-get autoclean && apt-get clean
rm -rf /var/lib/apt/lists/* /tmp/* /var/tmp/*
else
yum clean all
rm -rf /var/cache/yum
rm -rf /var/lib/yum/yumdb
rm -rf /var/lib/yum/history
fi
## Build MIOpen
mkdir -p build
@ -128,7 +162,11 @@ make -j $(nproc) package
# clean up since CI runner was running out of disk space
rm -rf /usr/local/cget
yum install -y miopen-*.rpm
if [[ ${IS_UBUNTU} == 1 ]]; then
sudo dpkg -i miopen-hip*.deb
else
yum install -y miopen-*.rpm
fi
popd
rm -rf MIOpen

View File

@ -0,0 +1,20 @@
#!/bin/bash
set -ex
function install_nvpl {
mkdir -p /opt/nvpl/lib /opt/nvpl/include
wget https://developer.download.nvidia.com/compute/nvpl/redist/nvpl_blas/linux-sbsa/nvpl_blas-linux-sbsa-0.3.0-archive.tar.xz
tar xf nvpl_blas-linux-sbsa-0.3.0-archive.tar.xz
cp -r nvpl_blas-linux-sbsa-0.3.0-archive/lib/* /opt/nvpl/lib/
cp -r nvpl_blas-linux-sbsa-0.3.0-archive/include/* /opt/nvpl/include/
wget https://developer.download.nvidia.com/compute/nvpl/redist/nvpl_lapack/linux-sbsa/nvpl_lapack-linux-sbsa-0.2.3.1-archive.tar.xz
tar xf nvpl_lapack-linux-sbsa-0.2.3.1-archive.tar.xz
cp -r nvpl_lapack-linux-sbsa-0.2.3.1-archive/lib/* /opt/nvpl/lib/
cp -r nvpl_lapack-linux-sbsa-0.2.3.1-archive/include/* /opt/nvpl/include/
}
install_nvpl

View File

@ -15,7 +15,7 @@ pip_install \
flatbuffers==2.0 \
mock==5.0.1 \
ninja==1.10.2 \
networkx==2.0 \
networkx==2.5 \
numpy==1.24.2
# ONNXRuntime should be installed before installing
@ -30,10 +30,9 @@ pip_install \
pip_install coloredlogs packaging
pip_install onnxruntime==1.18
pip_install onnx==1.16.0
# pip_install "onnxscript@git+https://github.com/microsoft/onnxscript@3e869ef8ccf19b5ebd21c10d3e9c267c9a9fa729" --no-deps
pip_install onnxscript==0.1.0.dev20240613 --no-deps
pip_install onnxruntime==1.18.1
pip_install onnx==1.16.2
pip_install onnxscript==0.1.0.dev20240831 --no-deps
# required by onnxscript
pip_install ml_dtypes

View File

@ -12,10 +12,7 @@ conda_reinstall() {
as_jenkins conda install -q -n py_$ANACONDA_PYTHON_VERSION -y --force-reinstall $*
}
if [ -n "${ROCM_VERSION}" ]; then
TRITON_REPO="https://github.com/openai/triton"
TRITON_TEXT_FILE="triton-rocm"
elif [ -n "${XPU_VERSION}" ]; then
if [ -n "${XPU_VERSION}" ]; then
TRITON_REPO="https://github.com/intel/intel-xpu-backend-for-triton"
TRITON_TEXT_FILE="triton-xpu"
else
@ -41,19 +38,33 @@ if [ -z "${MAX_JOBS}" ]; then
export MAX_JOBS=$(nproc)
fi
# Git checkout triton
mkdir /var/lib/jenkins/triton
chown -R jenkins /var/lib/jenkins/triton
chgrp -R jenkins /var/lib/jenkins/triton
pushd /var/lib/jenkins/
as_jenkins git clone ${TRITON_REPO} triton
cd triton
as_jenkins git checkout ${TRITON_PINNED_COMMIT}
cd python
# TODO: remove patch setup.py once we have a proper fix for https://github.com/triton-lang/triton/issues/4527
as_jenkins sed -i -e 's/https:\/\/tritonlang.blob.core.windows.net\/llvm-builds/https:\/\/oaitriton.blob.core.windows.net\/public\/llvm-builds/g' setup.py
if [ -n "${UBUNTU_VERSION}" ] && [ -n "${GCC_VERSION}" ] && [[ "${GCC_VERSION}" == "7" ]]; then
# Triton needs at least gcc-9 to build
apt-get install -y g++-9
CXX=g++-9 pip_install "git+${TRITON_REPO}@${TRITON_PINNED_COMMIT}#subdirectory=python"
CXX=g++-9 pip_install -e .
elif [ -n "${UBUNTU_VERSION}" ] && [ -n "${CLANG_VERSION}" ]; then
# Triton needs <filesystem> which surprisingly is not available with clang-9 toolchain
add-apt-repository -y ppa:ubuntu-toolchain-r/test
apt-get install -y g++-9
CXX=g++-9 pip_install "git+${TRITON_REPO}@${TRITON_PINNED_COMMIT}#subdirectory=python"
CXX=g++-9 pip_install -e .
else
pip_install "git+${TRITON_REPO}@${TRITON_PINNED_COMMIT}#subdirectory=python"
pip_install -e .
fi
if [ -n "${CONDA_CMAKE}" ]; then

View File

@ -16,11 +16,11 @@ function install_ubuntu() {
apt-get update -y
apt-get install -y gpg-agent wget
# To add the online network package repository for the GPU Driver LTS releases
# To add the online network package repository for the GPU Driver
wget -qO - https://repositories.intel.com/gpu/intel-graphics.key \
| gpg --yes --dearmor --output /usr/share/keyrings/intel-graphics.gpg
echo "deb [arch=amd64 signed-by=/usr/share/keyrings/intel-graphics.gpg] \
https://repositories.intel.com/gpu/ubuntu ${VERSION_CODENAME}/lts/2350 unified" \
https://repositories.intel.com/gpu/ubuntu ${VERSION_CODENAME}${XPU_DRIVER_VERSION} unified" \
| tee /etc/apt/sources.list.d/intel-gpu-${VERSION_CODENAME}.list
# To add the online network network package repository for the Intel Support Packages
wget -O- https://apt.repos.intel.com/intel-gpg-keys/GPG-PUB-KEY-INTEL-SW-PRODUCTS.PUB \
@ -45,9 +45,9 @@ function install_ubuntu() {
apt-get install -y libigc-dev intel-igc-cm libigdfcl-dev libigfxcmrt-dev level-zero-dev
# Install Intel Support Packages
if [ -n "$XPU_VERSION" ]; then
apt-get install -y intel-for-pytorch-gpu-dev-${XPU_VERSION}
apt-get install -y intel-for-pytorch-gpu-dev-${XPU_VERSION} intel-pti-dev
else
apt-get install -y intel-for-pytorch-gpu-dev
apt-get install -y intel-for-pytorch-gpu-dev intel-pti-dev
fi
# Cleanup
@ -55,52 +55,6 @@ function install_ubuntu() {
rm -rf /var/lib/apt/lists/* /tmp/* /var/tmp/*
}
function install_centos() {
dnf install -y 'dnf-command(config-manager)'
dnf config-manager --add-repo \
https://repositories.intel.com/gpu/rhel/8.6/production/2328/unified/intel-gpu-8.6.repo
# To add the EPEL repository needed for DKMS
dnf -y install https://dl.fedoraproject.org/pub/epel/epel-release-latest-8.noarch.rpm
# https://dl.fedoraproject.org/pub/epel/epel-release-latest-9.noarch.rpm
# Create the YUM repository file in the /temp directory as a normal user
tee > /tmp/oneAPI.repo << EOF
[oneAPI]
name=Intel® oneAPI repository
baseurl=https://yum.repos.intel.com/oneapi
enabled=1
gpgcheck=1
repo_gpgcheck=1
gpgkey=https://yum.repos.intel.com/intel-gpg-keys/GPG-PUB-KEY-INTEL-SW-PRODUCTS.PUB
EOF
# Move the newly created oneAPI.repo file to the YUM configuration directory /etc/yum.repos.d
mv /tmp/oneAPI.repo /etc/yum.repos.d
# The xpu-smi packages
dnf install -y flex bison xpu-smi
# Compute and Media Runtimes
dnf install -y \
intel-opencl intel-media intel-mediasdk libmfxgen1 libvpl2\
level-zero intel-level-zero-gpu mesa-dri-drivers mesa-vulkan-drivers \
mesa-vdpau-drivers libdrm mesa-libEGL mesa-libgbm mesa-libGL \
mesa-libxatracker libvpl-tools intel-metrics-discovery \
intel-metrics-library intel-igc-core intel-igc-cm \
libva libva-utils intel-gmmlib libmetee intel-gsc intel-ocloc hwinfo clinfo
# Development packages
dnf install -y --refresh \
intel-igc-opencl-devel level-zero-devel intel-gsc-devel libmetee-devel \
level-zero-devel
# Install Intel® oneAPI Base Toolkit
dnf install intel-basekit -y
# Cleanup
dnf clean all
rm -rf /var/cache/yum
rm -rf /var/lib/yum/yumdb
rm -rf /var/lib/yum/history
}
function install_rhel() {
. /etc/os-release
if [[ "${ID}" == "rhel" ]]; then
@ -114,9 +68,9 @@ function install_rhel() {
fi
dnf install -y 'dnf-command(config-manager)'
# To add the online network package repository for the GPU Driver LTS releases
# To add the online network package repository for the GPU Driver
dnf config-manager --add-repo \
https://repositories.intel.com/gpu/rhel/${VERSION_ID}/lts/2350/unified/intel-gpu-${VERSION_ID}.repo
https://repositories.intel.com/gpu/rhel/${VERSION_ID}${XPU_DRIVER_VERSION}/unified/intel-gpu-${VERSION_ID}.repo
# To add the online network network package repository for the Intel Support Packages
tee > /etc/yum.repos.d/intel-for-pytorch-gpu-dev.repo << EOF
[intel-for-pytorch-gpu-dev]
@ -131,7 +85,7 @@ EOF
# The xpu-smi packages
dnf install -y xpu-smi
# Compute and Media Runtimes
dnf install -y \
dnf install --skip-broken -y \
intel-opencl intel-media intel-mediasdk libmfxgen1 libvpl2\
level-zero intel-level-zero-gpu mesa-dri-drivers mesa-vulkan-drivers \
mesa-vdpau-drivers libdrm mesa-libEGL mesa-libgbm mesa-libGL \
@ -160,9 +114,9 @@ function install_sles() {
exit
fi
# To add the online network package repository for the GPU Driver LTS releases
# To add the online network package repository for the GPU Driver
zypper addrepo -f -r \
https://repositories.intel.com/gpu/sles/${VERSION_SP}/lts/2350/unified/intel-gpu-${VERSION_SP}.repo
https://repositories.intel.com/gpu/sles/${VERSION_SP}${XPU_DRIVER_VERSION}/unified/intel-gpu-${VERSION_SP}.repo
rpm --import https://repositories.intel.com/gpu/intel-graphics.key
# To add the online network network package repository for the Intel Support Packages
zypper addrepo https://yum.repos.intel.com/intel-for-pytorch-gpu-dev intel-for-pytorch-gpu-dev
@ -181,6 +135,12 @@ function install_sles() {
}
# Default use GPU driver LTS releases
XPU_DRIVER_VERSION="/lts/2350"
if [[ "${XPU_DRIVER_TYPE,,}" == "rolling" ]]; then
# Use GPU driver rolling releases
XPU_DRIVER_VERSION=""
fi
# The installation depends on the base OS
ID=$(grep -oP '(?<=^ID=).+' /etc/os-release | tr -d '"')
@ -188,9 +148,6 @@ case "$ID" in
ubuntu)
install_ubuntu
;;
centos)
install_centos
;;
rhel|almalinux)
install_rhel
;;

View File

@ -37,6 +37,12 @@ esac
(
set -x
# TODO: Remove LimitNOFILE=1048576 patch once https://github.com/pytorch/test-infra/issues/5712
# is resolved. This patch is required in order to fix timing out of Docker build on Amazon Linux 2023.
sudo sed -i s/LimitNOFILE=infinity/LimitNOFILE=1048576/ /usr/lib/systemd/system/docker.service
sudo systemctl daemon-reload
sudo systemctl restart docker
docker build \
--target final \
--progress plain \

View File

@ -89,7 +89,7 @@ RUN bash ./install_rocm_magma.sh && rm install_rocm_magma.sh
# Install AOTriton
COPY ./common/common_utils.sh common_utils.sh
COPY ./common/aotriton_version.txt aotriton_version.txt
COPY ./aotriton_version.txt aotriton_version.txt
COPY ./common/install_aotriton.sh install_aotriton.sh
RUN bash ./install_aotriton.sh /opt/rocm && rm install_aotriton.sh aotriton_version.txt
ENV AOTRITON_INSTALLED_PREFIX /opt/rocm/aotriton

View File

@ -10,6 +10,7 @@ ENV LANG en_US.UTF-8
ENV LANGUAGE en_US.UTF-8
ARG DEVTOOLSET_VERSION=9
# Note: This is required patch since CentOS have reached EOL
# otherwise any yum install setp will fail
RUN sed -i s/mirror.centos.org/vault.centos.org/g /etc/yum.repos.d/*.repo
@ -196,7 +197,7 @@ RUN bash ./install_miopen.sh ${ROCM_VERSION} && rm install_miopen.sh
# Install AOTriton
COPY ./common/common_utils.sh common_utils.sh
COPY ./common/aotriton_version.txt aotriton_version.txt
COPY ./aotriton_version.txt aotriton_version.txt
COPY ./common/install_aotriton.sh install_aotriton.sh
RUN bash ./install_aotriton.sh /opt/rocm && rm install_aotriton.sh aotriton_version.txt
ENV AOTRITON_INSTALLED_PREFIX /opt/rocm/aotriton

View File

@ -145,9 +145,13 @@ ADD ./common/install_miopen.sh install_miopen.sh
RUN bash ./install_miopen.sh ${ROCM_VERSION} && rm install_miopen.sh
FROM cpu_final as xpu_final
# XPU CD use rolling driver
ENV XPU_DRIVER_TYPE ROLLING
# cmake-3.28.4 from pip
RUN python3 -m pip install --upgrade pip && \
python3 -mpip install cmake==3.28.4
# Install setuptools and wheel for python 3.13
RUN /opt/python/cp313-cp313/bin/python -m pip install setuptools wheel
ADD ./common/install_xpu.sh install_xpu.sh
RUN bash ./install_xpu.sh && rm install_xpu.sh
RUN pushd /opt/_internal && tar -xJf static-libs-for-embedding-only.tar.xz && popd

View File

@ -75,17 +75,17 @@ ARG BASE_CUDA_VERSION
ADD ./common/install_magma.sh install_magma.sh
RUN bash ./install_magma.sh ${BASE_CUDA_VERSION} && rm install_magma.sh
FROM base as openblas
# Install openblas
ADD ./common/install_openblas.sh install_openblas.sh
RUN bash ./install_openblas.sh && rm install_openblas.sh
FROM base as nvpl
# Install nvpl
ADD ./common/install_nvpl.sh install_nvpl.sh
RUN bash ./install_nvpl.sh && rm install_nvpl.sh
FROM final as cuda_final
ARG BASE_CUDA_VERSION
RUN rm -rf /usr/local/cuda-${BASE_CUDA_VERSION}
COPY --from=cuda /usr/local/cuda-${BASE_CUDA_VERSION} /usr/local/cuda-${BASE_CUDA_VERSION}
COPY --from=magma /usr/local/cuda-${BASE_CUDA_VERSION} /usr/local/cuda-${BASE_CUDA_VERSION}
COPY --from=openblas /opt/OpenBLAS/ /opt/OpenBLAS/
COPY --from=nvpl /opt/nvpl/lib/ /usr/local/lib/
COPY --from=nvpl /opt/nvpl/include/ /usr/local/include/
RUN ln -sf /usr/local/cuda-${BASE_CUDA_VERSION} /usr/local/cuda
ENV PATH=/usr/local/cuda/bin:$PATH
ENV LD_LIBRARY_PATH=/opt/OpenBLAS/lib:$LD_LIBRARY_PATH

View File

@ -124,7 +124,14 @@ if [[ -n ${MANY_LINUX_VERSION} && -z ${DOCKERFILE_SUFFIX} ]]; then
fi
(
set -x
DOCKER_BUILDKIT=1 docker build \
# TODO: Remove LimitNOFILE=1048576 patch once https://github.com/pytorch/test-infra/issues/5712
# is resolved. This patch is required in order to fix timing out of Docker build on Amazon Linux 2023.
sudo sed -i s/LimitNOFILE=infinity/LimitNOFILE=1048576/ /usr/lib/systemd/system/docker.service
sudo systemctl daemon-reload
sudo systemctl restart docker
DOCKER_BUILDKIT=1 docker build \
${DOCKER_GPU_BUILD_ARG} \
--build-arg "GPU_IMAGE=${GPU_IMAGE}" \
--target "${TARGET}" \

View File

@ -30,9 +30,14 @@ dill==0.3.7
#Pinned versions: 0.3.7
#test that import: dynamo/test_replay_record.py test_dataloader.py test_datapipe.py test_serialization.py
expecttest==0.1.6
expecttest==0.2.1
#Description: method for writing tests where test framework auto populates
# the expected output based on previous runs
#Pinned versions: 0.2.1
#test that import:
fbscribelogger==0.1.6
#Description: write to scribe from authenticated jobs on CI
#Pinned versions: 0.1.6
#test that import:
@ -85,7 +90,7 @@ librosa>=0.6.2 ; python_version < "3.11"
#Pinned versions:
#test that import:
mypy==1.10.0
mypy==1.11.2
# Pin MyPy version because new errors are likely to appear with each release
#Description: linter
#Pinned versions: 1.10.0
@ -104,7 +109,7 @@ networkx==2.8.8
#test that import: run_test.py, test_cpp_extensions_aot.py,test_determination.py
numba==0.49.0 ; python_version < "3.9"
numba==0.54.1 ; python_version == "3.9"
numba==0.55.2 ; python_version == "3.9"
numba==0.55.2 ; python_version == "3.10"
#Description: Just-In-Time Compiler for Numerical Functions
#Pinned versions: 0.54.1, 0.49.0, <=0.49.1
@ -218,7 +223,7 @@ pygments==2.15.0
#test that import:
scikit-image==0.19.3 ; python_version < "3.10"
scikit-image==0.20.0 ; python_version >= "3.10"
scikit-image==0.22.0 ; python_version >= "3.10"
#Description: image processing routines
#Pinned versions:
#test that import: test_nn.py
@ -269,6 +274,10 @@ lintrunner==0.12.5
#Pinned versions: 0.12.5
#test that import:
redis>=4.0.0
#Description: redis database
#test that import: anything that tests OSS caching/mocking (inductor/test_codecache.py, inductor/test_max_autotune.py)
rockset==1.0.3
#Description: queries Rockset
#Pinned versions: 1.0.3
@ -312,3 +321,24 @@ lxml==5.0.0
# Python-3.9 binaries
PyGithub==2.3.0
sympy==1.12.1 ; python_version == "3.8"
sympy==1.13.1 ; python_version >= "3.9"
#Description: Required by coremltools, also pinned in .github/requirements/pip-requirements-macOS.txt
#Pinned versions:
#test that import:
onnx==1.16.1
#Description: Required by mypy and test_public_bindings.py when checking torch.onnx._internal
#Pinned versions:
#test that import:
onnxscript==0.1.0.dev20240817
#Description: Required by mypy and test_public_bindings.py when checking torch.onnx._internal
#Pinned versions:
#test that import:
parameterized==0.8.1
#Description: Parameterizes unittests, both the tests themselves and the entire testing class
#Pinned versions:
#test that import:

View File

@ -1 +1 @@
3.0.0
3.1.0

View File

@ -156,6 +156,12 @@ COPY ./common/install_cusparselt.sh install_cusparselt.sh
RUN bash install_cusparselt.sh
RUN rm install_cusparselt.sh
# Install CUDSS
ARG CUDA_VERSION
COPY ./common/install_cudss.sh install_cudss.sh
RUN bash install_cudss.sh
RUN rm install_cudss.sh
# Delete /usr/local/cuda-11.X/cuda-11.X symlinks
RUN if [ -h /usr/local/cuda-11.6/cuda-11.6 ]; then rm /usr/local/cuda-11.6/cuda-11.6; fi
RUN if [ -h /usr/local/cuda-11.7/cuda-11.7 ]; then rm /usr/local/cuda-11.7/cuda-11.7; fi

View File

@ -68,6 +68,8 @@ RUN rm install_rocm.sh
COPY ./common/install_rocm_magma.sh install_rocm_magma.sh
RUN bash ./install_rocm_magma.sh
RUN rm install_rocm_magma.sh
ADD ./common/install_miopen.sh install_miopen.sh
RUN bash ./install_miopen.sh ${ROCM_VERSION} && rm install_miopen.sh
ENV ROCM_PATH /opt/rocm
ENV PATH /opt/rocm/bin:$PATH
ENV PATH /opt/rocm/hcc/bin:$PATH
@ -100,10 +102,10 @@ ARG TRITON
# try to reach out to S3, which docker build runners don't have access
COPY ./common/install_triton.sh install_triton.sh
COPY ./common/common_utils.sh common_utils.sh
COPY ci_commit_pins/triton-rocm.txt triton-rocm.txt
COPY ci_commit_pins/triton.txt triton.txt
COPY triton_version.txt triton_version.txt
RUN if [ -n "${TRITON}" ]; then bash ./install_triton.sh; fi
RUN rm install_triton.sh common_utils.sh triton-rocm.txt triton_version.txt
RUN rm install_triton.sh common_utils.sh triton.txt triton_version.txt
# Install AOTriton
COPY ./aotriton_version.txt aotriton_version.txt
@ -121,5 +123,8 @@ RUN bash ./install_cache.sh && rm install_cache.sh
ARG BUILD_ENVIRONMENT
ENV BUILD_ENVIRONMENT ${BUILD_ENVIRONMENT}
# Install LLVM dev version (Defined in the pytorch/builder github repository)
COPY --from=pytorch/llvm:9.0.1 /opt/llvm /opt/llvm
USER jenkins
CMD ["bash"]

View File

@ -30,6 +30,7 @@ RUN bash ./install_docs_reqs.sh && rm install_docs_reqs.sh
ARG ANACONDA_PYTHON_VERSION
ARG CONDA_CMAKE
ARG DOCS
ARG BUILD_ENVIRONMENT
ENV ANACONDA_PYTHON_VERSION=$ANACONDA_PYTHON_VERSION
ENV PATH /opt/conda/envs/py_$ANACONDA_PYTHON_VERSION/bin:/opt/conda/bin:$PATH
ENV DOCS=$DOCS

View File

@ -49,13 +49,8 @@ if [[ ${BUILD_ENVIRONMENT} == *"parallelnative"* ]]; then
fi
# Enable LLVM dependency for TensorExpr testing
if [[ "$BUILD_ENVIRONMENT" == *rocm* ]]; then
export USE_LLVM=/opt/rocm/llvm
export LLVM_DIR=/opt/rocm/llvm/lib/cmake/llvm
else
export USE_LLVM=/opt/llvm
export LLVM_DIR=/opt/llvm/lib/cmake/llvm
fi
export USE_LLVM=/opt/llvm
export LLVM_DIR=/opt/llvm/lib/cmake/llvm
if [[ "$BUILD_ENVIRONMENT" == *executorch* ]]; then
# To build test_edge_op_registration
@ -176,7 +171,8 @@ fi
if [[ "$BUILD_ENVIRONMENT" == *xpu* ]]; then
# shellcheck disable=SC1091
source /opt/intel/oneapi/compiler/latest/env/vars.sh
export USE_XPU=1
# XPU kineto feature dependencies are not fully ready, disable kineto build as temp WA
export USE_KINETO=0
fi
# sccache will fail for CUDA builds if all cores are used for compiling
@ -236,7 +232,7 @@ fi
# Do not change workspace permissions for ROCm CI jobs
# as it can leave workspace with bad permissions for cancelled jobs
if [[ "$BUILD_ENVIRONMENT" != *rocm* ]]; then
if [[ "$BUILD_ENVIRONMENT" != *rocm* && "$BUILD_ENVIRONMENT" != *s390x* ]]; then
# Workaround for dind-rootless userid mapping (https://github.com/pytorch/ci-infra/issues/96)
WORKSPACE_ORIGINAL_OWNER_ID=$(stat -c '%u' "/var/lib/jenkins/workspace")
cleanup_workspace() {
@ -282,11 +278,11 @@ else
# set only when building other architectures
# or building non-XLA tests.
if [[ "$BUILD_ENVIRONMENT" != *rocm* &&
"$BUILD_ENVIRONMENT" != *s390x* &&
"$BUILD_ENVIRONMENT" != *xla* ]]; then
if [[ "$BUILD_ENVIRONMENT" != *py3.8* ]]; then
# Install numpy-2.0 release candidate for builds
# Which should be backward compatible with Numpy-1.X
python -mpip install --pre numpy==2.0.0rc1
# Install numpy-2.0.2 for builds which are backward compatible with 1.X
python -mpip install --pre numpy==2.0.2
fi
WERROR=1 python setup.py clean
@ -345,11 +341,11 @@ else
CUSTOM_OP_BUILD="${CUSTOM_TEST_ARTIFACT_BUILD_DIR}/custom-op-build"
CUSTOM_OP_TEST="$PWD/test/custom_operator"
python --version
SITE_PACKAGES="$(python -c 'from distutils.sysconfig import get_python_lib; print(get_python_lib())')"
SITE_PACKAGES="$(python -c 'import site; print(";".join([x for x in site.getsitepackages()] + [x + "/torch" for x in site.getsitepackages()]))')"
mkdir -p "$CUSTOM_OP_BUILD"
pushd "$CUSTOM_OP_BUILD"
cmake "$CUSTOM_OP_TEST" -DCMAKE_PREFIX_PATH="$SITE_PACKAGES/torch;$SITE_PACKAGES" -DPython_EXECUTABLE="$(which python)" \
cmake "$CUSTOM_OP_TEST" -DCMAKE_PREFIX_PATH="$SITE_PACKAGES" -DPython_EXECUTABLE="$(which python)" \
-DCMAKE_MODULE_PATH="$CUSTOM_TEST_MODULE_PATH" -DUSE_ROCM="$CUSTOM_TEST_USE_ROCM"
make VERBOSE=1
popd
@ -359,10 +355,10 @@ else
JIT_HOOK_BUILD="${CUSTOM_TEST_ARTIFACT_BUILD_DIR}/jit-hook-build"
JIT_HOOK_TEST="$PWD/test/jit_hooks"
python --version
SITE_PACKAGES="$(python -c 'from distutils.sysconfig import get_python_lib; print(get_python_lib())')"
SITE_PACKAGES="$(python -c 'import site; print(";".join([x for x in site.getsitepackages()] + [x + "/torch" for x in site.getsitepackages()]))')"
mkdir -p "$JIT_HOOK_BUILD"
pushd "$JIT_HOOK_BUILD"
cmake "$JIT_HOOK_TEST" -DCMAKE_PREFIX_PATH="$SITE_PACKAGES/torch;$SITE_PACKAGES" -DPython_EXECUTABLE="$(which python)" \
cmake "$JIT_HOOK_TEST" -DCMAKE_PREFIX_PATH="$SITE_PACKAGES" -DPython_EXECUTABLE="$(which python)" \
-DCMAKE_MODULE_PATH="$CUSTOM_TEST_MODULE_PATH" -DUSE_ROCM="$CUSTOM_TEST_USE_ROCM"
make VERBOSE=1
popd
@ -374,7 +370,7 @@ else
python --version
mkdir -p "$CUSTOM_BACKEND_BUILD"
pushd "$CUSTOM_BACKEND_BUILD"
cmake "$CUSTOM_BACKEND_TEST" -DCMAKE_PREFIX_PATH="$SITE_PACKAGES/torch;$SITE_PACKAGES" -DPython_EXECUTABLE="$(which python)" \
cmake "$CUSTOM_BACKEND_TEST" -DCMAKE_PREFIX_PATH="$SITE_PACKAGES" -DPython_EXECUTABLE="$(which python)" \
-DCMAKE_MODULE_PATH="$CUSTOM_TEST_MODULE_PATH" -DUSE_ROCM="$CUSTOM_TEST_USE_ROCM"
make VERBOSE=1
popd
@ -407,6 +403,6 @@ fi
# snadampal: skipping it till sccache support added for aarch64
# https://github.com/pytorch/pytorch/issues/121559
if [[ "$BUILD_ENVIRONMENT" != *aarch64* ]]; then
if [[ "$BUILD_ENVIRONMENT" != *aarch64* && "$BUILD_ENVIRONMENT" != *s390x* ]]; then
print_sccache_stats
fi

View File

@ -179,7 +179,7 @@ function install_torchvision() {
}
function install_tlparse() {
pip_install --user "tlparse==0.3.7"
pip_install --user "tlparse==0.3.25"
PATH="$(python -m site --user-base)/bin:$PATH"
}

View File

@ -1,4 +1,4 @@
from datetime import datetime, timedelta
from datetime import datetime, timedelta, timezone
from tempfile import mkdtemp
from cryptography import x509
@ -42,10 +42,10 @@ def create_cert(path, C, ST, L, O, key):
.issuer_name(issuer)
.public_key(key.public_key())
.serial_number(x509.random_serial_number())
.not_valid_before(datetime.utcnow())
.not_valid_before(datetime.now(timezone.utc))
.not_valid_after(
# Our certificate will be valid for 10 days
datetime.utcnow()
datetime.now(timezone.utc)
+ timedelta(days=10)
)
.add_extension(
@ -88,10 +88,10 @@ def sign_certificate_request(path, csr_cert, ca_cert, private_ca_key):
.issuer_name(ca_cert.subject)
.public_key(csr_cert.public_key())
.serial_number(x509.random_serial_number())
.not_valid_before(datetime.utcnow())
.not_valid_before(datetime.now(timezone.utc))
.not_valid_after(
# Our certificate will be valid for 10 days
datetime.utcnow()
datetime.now(timezone.utc)
+ timedelta(days=10)
# Sign our certificate with our private key
)

View File

@ -9,15 +9,13 @@ if [[ -n "$CONDA_ENV" ]]; then
export PATH="$CONDA_ENV/bin":$PATH
fi
# Test that OpenMP is enabled for non-arm64 build
if [[ ${BUILD_ENVIRONMENT} != *arm64* ]]; then
pushd test
if [[ ! $(python -c "import torch; print(int(torch.backends.openmp.is_available()))") == "1" ]]; then
echo "Build should have OpenMP enabled, but torch.backends.openmp.is_available() is False"
exit 1
fi
popd
# Test that OpenMP is enabled
pushd test
if [[ ! $(python -c "import torch; print(int(torch.backends.openmp.is_available()))") == "1" ]]; then
echo "Build should have OpenMP enabled, but torch.backends.openmp.is_available() is False"
exit 1
fi
popd
setup_test_python() {
# The CircleCI worker hostname doesn't resolve to an address.
@ -27,8 +25,9 @@ setup_test_python() {
echo "Ninja version: $(ninja --version)"
echo "Python version: $(which python) ($(python --version))"
# Increase default limit on open file handles from 256 to 1024
ulimit -n 1024
# Set the limit on open file handles to 16384
# might help with intermittent compiler test failures
ulimit -n 16384
}
test_python_all() {

View File

@ -44,19 +44,15 @@ time python test/run_test.py --verbose -i distributed/_tensor/test_dtensor_compi
time python test/run_test.py --verbose -i distributed/test_device_mesh
# DTensor/TP tests
time python test/run_test.py --verbose -i distributed/tensor/parallel/test_ddp_2d_parallel
time python test/run_test.py --verbose -i distributed/tensor/parallel/test_fsdp_2d_parallel
time python test/run_test.py --verbose -i distributed/tensor/parallel/test_tp_examples
time python test/run_test.py --verbose -i distributed/tensor/parallel/test_tp_random_state
# FSDP2 tests
time python test/run_test.py --verbose -i distributed/_composable/fsdp/test_fully_shard_training -- -k test_2d_mlp_with_nd_mesh
# Pipelining composability tests
time python test/run_test.py --verbose -i distributed/pipelining/test_composability.py
# ND composability tests
time python test/run_test.py --verbose -i distributed/_composable/test_composability/test_2d_composability
time python test/run_test.py --verbose -i distributed/_composable/test_composability/test_pp_composability
# Other tests
time python test/run_test.py --verbose -i test_cuda_primary_ctx

View File

@ -6,6 +6,9 @@
set -ex
# Suppress ANSI color escape sequences
export TERM=vt100
# shellcheck source=./common.sh
source "$(dirname "${BASH_SOURCE[0]}")/common.sh"
@ -166,7 +169,7 @@ fi
if [[ "$BUILD_ENVIRONMENT" == *xpu* ]]; then
# Source Intel oneAPI envrioment script to enable xpu runtime related libraries
# refer to https://www.intel.com/content/www/us/en/docs/oneapi/programming-guide/2024-0/use-the-setvars-and-oneapi-vars-scripts-with-linux.html
# refer to https://www.intel.com/content/www/us/en/developer/articles/tool/pytorch-prerequisites-for-intel-gpu/2-5.html
# shellcheck disable=SC1091
source /opt/intel/oneapi/compiler/latest/env/vars.sh
# Check XPU status before testing
@ -316,7 +319,6 @@ test_inductor_distributed() {
python test/run_test.py -i inductor/test_aot_inductor.py -k test_replicate_on_devices --verbose
python test/run_test.py -i distributed/test_c10d_functional_native.py --verbose
python test/run_test.py -i distributed/_tensor/test_dtensor_compile.py --verbose
python test/run_test.py -i distributed/tensor/parallel/test_fsdp_2d_parallel.py --verbose
python test/run_test.py -i distributed/tensor/parallel/test_micro_pipeline_tp.py --verbose
python test/run_test.py -i distributed/_composable/fsdp/test_fully_shard_comm.py --verbose
python test/run_test.py -i distributed/_composable/fsdp/test_fully_shard_training.py -k test_train_parity_multi_group --verbose
@ -359,10 +361,12 @@ test_inductor_shard() {
test_inductor_aoti() {
# docker build uses bdist_wheel which does not work with test_aot_inductor
# TODO: need a faster way to build
if [[ "$BUILD_ENVIRONMENT" != *rocm* ]]; then
BUILD_AOT_INDUCTOR_TEST=1 python setup.py develop
CPP_TESTS_DIR="${BUILD_BIN_DIR}" LD_LIBRARY_PATH="${TORCH_LIB_DIR}" python test/run_test.py --cpp --verbose -i cpp/test_aoti_abi_check cpp/test_aoti_inference
if [[ "$BUILD_ENVIRONMENT" == *rocm* ]]; then
# We need to hipify before building again
python3 tools/amd_build/build_amd.py
fi
BUILD_AOT_INDUCTOR_TEST=1 python setup.py develop
CPP_TESTS_DIR="${BUILD_BIN_DIR}" LD_LIBRARY_PATH="${TORCH_LIB_DIR}" python test/run_test.py --cpp --verbose -i cpp/test_aoti_abi_check cpp/test_aoti_inference
}
test_inductor_cpp_wrapper_abi_compatible() {
@ -371,9 +375,8 @@ test_inductor_cpp_wrapper_abi_compatible() {
mkdir -p "$TEST_REPORTS_DIR"
echo "Testing Inductor cpp wrapper mode with TORCHINDUCTOR_ABI_COMPATIBLE=1"
# cpu stack allocation causes segfault and needs more investigation
PYTORCH_TESTING_DEVICE_ONLY_FOR="" python test/run_test.py --include inductor/test_cpu_cpp_wrapper
python test/run_test.py --include inductor/test_cuda_cpp_wrapper
python test/run_test.py --include inductor/test_cuda_cpp_wrapper inductor/test_cpu_repro
TORCHINDUCTOR_CPP_WRAPPER=1 python benchmarks/dynamo/timm_models.py --device cuda --accuracy --amp \
--training --inductor --disable-cudagraphs --only vit_base_patch16_224 \
@ -391,7 +394,22 @@ test_inductor_cpp_wrapper_abi_compatible() {
# .github/workflows/inductor-perf-test-nightly.yml
DYNAMO_BENCHMARK_FLAGS=()
if [[ "${TEST_CONFIG}" == *dynamo_eager* ]]; then
pr_time_benchmarks() {
pip_install --user "fbscribelogger"
TEST_REPORTS_DIR=$(pwd)/test/test-reports
mkdir -p "$TEST_REPORTS_DIR"
PYTHONPATH=$(pwd)/benchmarks/dynamo/pr_time_benchmarks source benchmarks/dynamo/pr_time_benchmarks/benchmark_runner.sh "$TEST_REPORTS_DIR/pr_time_benchmarks_results.csv" "benchmarks/dynamo/pr_time_benchmarks/benchmarks"
echo "benchmark results on current PR: "
cat "$TEST_REPORTS_DIR/pr_time_benchmarks_results.csv"
}
if [[ "${TEST_CONFIG}" == *pr_time_benchmarks* ]]; then
pr_time_benchmarks
exit 0
elif [[ "${TEST_CONFIG}" == *dynamo_eager* ]]; then
DYNAMO_BENCHMARK_FLAGS+=(--backend eager)
elif [[ "${TEST_CONFIG}" == *aot_eager* ]]; then
DYNAMO_BENCHMARK_FLAGS+=(--backend aot_eager)
@ -487,6 +505,12 @@ test_perf_for_dashboard() {
--output "$TEST_REPORTS_DIR/${backend}_with_cudagraphs_freezing_autotune_${suite}_${dtype}_${mode}_${device}_${target}.csv"
fi
if [[ "$DASHBOARD_TAG" == *aotinductor-true* ]] && [[ "$mode" == "inference" ]]; then
if [[ "$target" == "accuracy" ]]; then
# Also collect Export pass rate and display as a separate row
$TASKSET python "benchmarks/dynamo/$suite.py" \
"${target_flag[@]}" --"$mode" --"$dtype" --export --disable-cudagraphs "$@" \
--output "$TEST_REPORTS_DIR/${backend}_export_${suite}_${dtype}_${mode}_${device}_${target}.csv"
fi
TORCHINDUCTOR_ABI_COMPATIBLE=1 $TASKSET python "benchmarks/dynamo/$suite.py" \
"${target_flag[@]}" --"$mode" --"$dtype" --export-aot-inductor --disable-cudagraphs "$@" \
--output "$TEST_REPORTS_DIR/${backend}_aot_inductor_${suite}_${dtype}_${mode}_${device}_${target}.csv"
@ -550,7 +574,10 @@ test_single_dynamo_benchmark() {
fi
if [[ "${TEST_CONFIG}" == *_avx2* ]]; then
TEST_CONFIG=${TEST_CONFIG::-5}
TEST_CONFIG=${TEST_CONFIG//_avx2/}
fi
if [[ "${TEST_CONFIG}" == *_avx512* ]]; then
TEST_CONFIG=${TEST_CONFIG//_avx512/}
fi
python "benchmarks/dynamo/$suite.py" \
--ci --accuracy --timing --explain \
@ -568,6 +595,9 @@ test_single_dynamo_benchmark() {
test_inductor_micro_benchmark() {
TEST_REPORTS_DIR=$(pwd)/test/test-reports
if [[ "${TEST_CONFIG}" == *cpu* ]]; then
test_inductor_set_cpu_affinity
fi
python benchmarks/gpt_fast/benchmark.py --output "${TEST_REPORTS_DIR}/gpt_fast_benchmark.csv"
}
@ -637,8 +667,7 @@ test_inductor_torchbench_smoketest_perf() {
# https://github.com/pytorch/pytorch/actions/runs/7158691360/job/19491437314,
# and thus we lower its threshold to reduce flakiness. If this continues to be a problem,
# we switch to use some other model.
# lowering threshold from 4.9 to 4.7 for cu124. Will bump it up after cuda 12.4.0->12.4.1 update
python benchmarks/dynamo/check_perf_csv.py -f "$TEST_REPORTS_DIR/inductor_inference_smoketest.csv" -t 4.7
python benchmarks/dynamo/check_perf_csv.py -f "$TEST_REPORTS_DIR/inductor_inference_smoketest.csv" -t 4.9
# Check memory compression ratio for a few models
for test in hf_Albert timm_vision_transformer; do
@ -662,7 +691,7 @@ test_inductor_torchbench_smoketest_perf() {
}
test_inductor_get_core_number() {
if [[ "${TEST_CONFIG}" == *aarch64 ]]; then
if [[ "${TEST_CONFIG}" == *aarch64* ]]; then
echo "$(($(lscpu | grep 'Cluster(s):' | awk '{print $2}') * $(lscpu | grep 'Core(s) per cluster:' | awk '{print $4}')))"
else
echo "$(($(lscpu | grep 'Socket(s):' | awk '{print $2}') * $(lscpu | grep 'Core(s) per socket:' | awk '{print $4}')))"
@ -672,11 +701,16 @@ test_inductor_get_core_number() {
test_inductor_set_cpu_affinity(){
#set jemalloc
JEMALLOC_LIB="$(find /usr/lib -name libjemalloc.so.2)"
IOMP_LIB="$(dirname "$(which python)")/../lib/libiomp5.so"
export LD_PRELOAD="$JEMALLOC_LIB":"$IOMP_LIB":"$LD_PRELOAD"
export LD_PRELOAD="$JEMALLOC_LIB":"$LD_PRELOAD"
export MALLOC_CONF="oversize_threshold:1,background_thread:true,metadata_thp:auto,dirty_decay_ms:-1,muzzy_decay_ms:-1"
export KMP_AFFINITY=granularity=fine,compact,1,0
export KMP_BLOCKTIME=1
if [[ "${TEST_CONFIG}" != *aarch64* ]]; then
# Use Intel OpenMP for x86
IOMP_LIB="$(dirname "$(which python)")/../lib/libiomp5.so"
export LD_PRELOAD="$IOMP_LIB":"$LD_PRELOAD"
export KMP_AFFINITY=granularity=fine,compact,1,0
export KMP_BLOCKTIME=1
fi
cores=$(test_inductor_get_core_number)
export OMP_NUM_THREADS=$cores
end_core=$((cores-1))
@ -1348,14 +1382,16 @@ test_executorch() {
assert_git_not_dirty
}
test_linux_aarch64(){
test_linux_aarch64() {
python test/run_test.py --include test_modules test_mkldnn test_mkldnn_fusion test_openmp test_torch test_dynamic_shapes \
test_transformers test_multiprocessing test_numpy_interop --verbose
test_transformers test_multiprocessing test_numpy_interop \
--shard "$SHARD_NUMBER" "$NUM_TEST_SHARDS" --verbose
# Dynamo tests
python test/run_test.py --include dynamo/test_compile dynamo/test_backends dynamo/test_comptime dynamo/test_config \
dynamo/test_functions dynamo/test_fx_passes_pre_grad dynamo/test_interop dynamo/test_model_output dynamo/test_modules \
dynamo/test_optimizers dynamo/test_recompile_ux dynamo/test_recompiles --verbose
dynamo/test_optimizers dynamo/test_recompile_ux dynamo/test_recompiles \
--shard "$SHARD_NUMBER" "$NUM_TEST_SHARDS" --verbose
# Inductor tests
python test/run_test.py --include inductor/test_torchinductor inductor/test_benchmark_fusion inductor/test_codecache \
@ -1365,7 +1401,8 @@ test_linux_aarch64(){
inductor/test_max_autotune inductor/test_memory_planning inductor/test_metrics inductor/test_multi_kernel inductor/test_pad_mm \
inductor/test_pattern_matcher inductor/test_perf inductor/test_profiler inductor/test_select_algorithm inductor/test_smoke \
inductor/test_split_cat_fx_passes inductor/test_standalone_compile inductor/test_torchinductor \
inductor/test_torchinductor_codegen_dynamic_shapes inductor/test_torchinductor_dynamic_shapes --verbose
inductor/test_torchinductor_codegen_dynamic_shapes inductor/test_torchinductor_dynamic_shapes inductor/test_memory \
--shard "$SHARD_NUMBER" "$NUM_TEST_SHARDS" --verbose
}
if ! [[ "${BUILD_ENVIRONMENT}" == *libtorch* || "${BUILD_ENVIRONMENT}" == *-bazel-* ]]; then
@ -1447,9 +1484,7 @@ elif [[ "${TEST_CONFIG}" == *inductor* ]]; then
install_torchvision
test_inductor_shard "${SHARD_NUMBER}"
if [[ "${SHARD_NUMBER}" == 1 ]]; then
if [[ "${BUILD_ENVIRONMENT}" != linux-jammy-py3.8-gcc11-build ]]; then
# Temporarily skip test_inductor_aoti due to https://github.com/pytorch/pytorch/issues/130311
test_inductor_aoti
if [[ "${BUILD_ENVIRONMENT}" != linux-jammy-py3.9-gcc11-build ]]; then
test_inductor_distributed
fi
fi

View File

@ -24,6 +24,12 @@ call %INSTALLER_DIR%\install_sccache.bat
if errorlevel 1 goto fail
if not errorlevel 0 goto fail
if "%USE_XPU%"=="1" (
:: Install xpu support packages
call %INSTALLER_DIR%\install_xpu.bat
if errorlevel 1 exit /b 1
)
:: Miniconda has been installed as part of the Windows AMI with all the dependencies.
:: We just need to activate it here
call %INSTALLER_DIR%\activate_miniconda3.bat
@ -43,6 +49,16 @@ if "%VC_VERSION%" == "" (
)
if errorlevel 1 goto fail
if not errorlevel 0 goto fail
if "%USE_XPU%"=="1" (
:: Activate xpu environment - VS env is required for xpu
call "C:\Program Files (x86)\Intel\oneAPI\setvars.bat"
if errorlevel 1 exit /b 1
:: Reduce build time. Only have MTL self-hosted runner now
SET TORCH_XPU_ARCH_LIST=xe-lpg
SET USE_KINETO=0
)
@echo on
popd
@ -65,13 +81,6 @@ set CUDA_PATH_V%VERSION_SUFFIX%=%CUDA_PATH%
set CUDNN_LIB_DIR=%CUDA_PATH%\lib\x64
set CUDA_TOOLKIT_ROOT_DIR=%CUDA_PATH%
set CUDNN_ROOT_DIR=%CUDA_PATH%
set NVTOOLSEXT_PATH=C:\Program Files\NVIDIA Corporation\NvToolsExt
set PATH=%CUDA_PATH%\bin;%CUDA_PATH%\libnvvp;%PATH%
set CUDNN_LIB_DIR=%CUDA_PATH%\lib\x64
set CUDA_TOOLKIT_ROOT_DIR=%CUDA_PATH%
set CUDNN_ROOT_DIR=%CUDA_PATH%
set NVTOOLSEXT_PATH=C:\Program Files\NVIDIA Corporation\NvToolsExt
set PATH=%CUDA_PATH%\bin;%CUDA_PATH%\libnvvp;%PATH%
:cuda_build_end

View File

@ -0,0 +1,91 @@
@echo on
REM Description: Install Intel Support Packages on Windows
REM BKM reference: https://www.intel.com/content/www/us/en/developer/articles/tool/pytorch-prerequisites-for-intel-gpu/2-5.html
set XPU_INSTALL_MODE=%~1
if "%XPU_INSTALL_MODE%"=="" goto xpu_bundle_install_start
if "%XPU_INSTALL_MODE%"=="bundle" goto xpu_bundle_install_start
if "%XPU_INSTALL_MODE%"=="driver" goto xpu_driver_install_start
if "%XPU_INSTALL_MODE%"=="all" goto xpu_driver_install_start
:arg_error
echo Illegal XPU installation mode. The value can be "bundle"/"driver"/"all"
echo If keep the value as space, will use default "bundle" mode
exit /b 1
:xpu_driver_install_start
:: TODO Need more testing for driver installation
set XPU_DRIVER_LINK=https://downloadmirror.intel.com/830975/gfx_win_101.5972.exe
curl -o xpu_driver.exe --retry 3 --retry-all-errors -k %XPU_DRIVER_LINK%
echo "XPU Driver installing..."
start /wait "Intel XPU Driver Installer" "xpu_driver.exe"
if errorlevel 1 exit /b 1
del xpu_driver.exe
if "%XPU_INSTALL_MODE%"=="driver" goto xpu_install_end
:xpu_bundle_install_start
set XPU_BUNDLE_PARENT_DIR=C:\Program Files (x86)\Intel\oneAPI
set XPU_BUNDLE_URL=https://registrationcenter-download.intel.com/akdlm/IRC_NAS/9d1a91e2-e8b8-40a5-8c7f-5db768a6a60c/w_intel-for-pytorch-gpu-dev_p_0.5.3.37_offline.exe
set XPU_PTI_URL=https://registrationcenter-download.intel.com/akdlm/IRC_NAS/9d1a91e2-e8b8-40a5-8c7f-5db768a6a60c/w_intel-pti-dev_p_0.9.0.37_offline.exe
set XPU_BUNDLE_VERSION=0.5.3+31
set XPU_PTI_VERSION=0.9.0+36
set XPU_BUNDLE_PRODUCT_NAME=intel.oneapi.win.intel-for-pytorch-gpu-dev.product
set XPU_PTI_PRODUCT_NAME=intel.oneapi.win.intel-pti-dev.product
set XPU_BUNDLE_INSTALLED=0
set XPU_PTI_INSTALLED=0
set XPU_BUNDLE_UNINSTALL=0
set XPU_PTI_UNINSTALL=0
:: Check if XPU bundle is target version or already installed
if exist "%XPU_BUNDLE_PARENT_DIR%\Installer\installer.exe" goto xpu_bundle_ver_check
goto xpu_bundle_install
:xpu_bundle_ver_check
"%XPU_BUNDLE_PARENT_DIR%\Installer\installer.exe" --list-products > xpu_bundle_installed_ver.log
for /f "tokens=1,2" %%a in (xpu_bundle_installed_ver.log) do (
if "%%a"=="%XPU_BUNDLE_PRODUCT_NAME%" (
echo %%a Installed Version: %%b
set XPU_BUNDLE_INSTALLED=1
if not "%XPU_BUNDLE_VERSION%"=="%%b" (
start /wait "Installer Title" "%XPU_BUNDLE_PARENT_DIR%\Installer\installer.exe" --action=remove --eula=accept --silent --product-id %XPU_BUNDLE_PRODUCT_NAME% --product-ver %%b --log-dir uninstall_bundle
set XPU_BUNDLE_UNINSTALL=1
)
)
if "%%a"=="%XPU_PTI_PRODUCT_NAME%" (
echo %%a Installed Version: %%b
set XPU_PTI_INSTALLED=1
if not "%XPU_PTI_VERSION%"=="%%b" (
start /wait "Installer Title" "%XPU_BUNDLE_PARENT_DIR%\Installer\installer.exe" --action=remove --eula=accept --silent --product-id %XPU_PTI_PRODUCT_NAME% --product-ver %%b --log-dir uninstall_bundle
set XPU_PTI_UNINSTALL=1
)
)
)
if errorlevel 1 exit /b 1
if exist xpu_bundle_installed_ver.log del xpu_bundle_installed_ver.log
if "%XPU_BUNDLE_INSTALLED%"=="0" goto xpu_bundle_install
if "%XPU_BUNDLE_UNINSTALL%"=="1" goto xpu_bundle_install
if "%XPU_PTI_INSTALLED%"=="0" goto xpu_pti_install
if "%XPU_PTI_UNINSTALL%"=="1" goto xpu_pti_install
goto xpu_install_end
:xpu_bundle_install
curl -o xpu_bundle.exe --retry 3 --retry-all-errors -k %XPU_BUNDLE_URL%
echo "XPU Bundle installing..."
start /wait "Intel Pytorch Bundle Installer" "xpu_bundle.exe" --action=install --eula=accept --silent --log-dir install_bundle
if errorlevel 1 exit /b 1
del xpu_bundle.exe
:xpu_pti_install
curl -o xpu_pti.exe --retry 3 --retry-all-errors -k %XPU_PTI_URL%
echo "XPU PTI installing..."
start /wait "Intel PTI Installer" "xpu_pti.exe" --action=install --eula=accept --silent --log-dir install_bundle
if errorlevel 1 exit /b 1
del xpu_pti.exe
:xpu_install_end

View File

@ -40,7 +40,6 @@ set CUDA_PATH_V%VERSION_SUFFIX%=%CUDA_PATH%
set CUDNN_LIB_DIR=%CUDA_PATH%\lib\x64
set CUDA_TOOLKIT_ROOT_DIR=%CUDA_PATH%
set CUDNN_ROOT_DIR=%CUDA_PATH%
set NVTOOLSEXT_PATH=C:\Program Files\NVIDIA Corporation\NvToolsExt
set PATH=%CUDA_PATH%\bin;%CUDA_PATH%\libnvvp;%PATH%
set NUMBAPRO_CUDALIB=%CUDA_PATH%\bin
set NUMBAPRO_LIBDEVICE=%CUDA_PATH%\nvvm\libdevice

View File

@ -31,6 +31,6 @@ if ERRORLEVEL 1 exit /b 1
:: Run tests C++-side and load the exported script module.
cd build
set PATH=C:\Program Files\NVIDIA Corporation\NvToolsExt\bin\x64;%TMP_DIR_WIN%\build\torch\lib;%PATH%
set PATH=%TMP_DIR_WIN%\build\torch\lib;%PATH%
test_custom_backend.exe model.pt
if ERRORLEVEL 1 exit /b 1

View File

@ -31,6 +31,6 @@ if ERRORLEVEL 1 exit /b 1
:: Run tests C++-side and load the exported script module.
cd build
set PATH=C:\Program Files\NVIDIA Corporation\NvToolsExt\bin\x64;%TMP_DIR_WIN%\build\torch\lib;%PATH%
set PATH=%TMP_DIR_WIN%\build\torch\lib;%PATH%
test_custom_ops.exe model.pt
if ERRORLEVEL 1 exit /b 1

View File

@ -5,7 +5,7 @@ if errorlevel 1 exit /b 1
set CWD=%cd%
set CPP_TESTS_DIR=%TMP_DIR_WIN%\build\torch\bin
set PATH=C:\Program Files\NVIDIA Corporation\NvToolsExt\bin\x64;%TMP_DIR_WIN%\build\torch\lib;%PATH%
set PATH=%TMP_DIR_WIN%\build\torch\lib;%PATH%
set TORCH_CPP_TEST_MNIST_PATH=%CWD%\test\cpp\api\mnist
python tools\download_mnist.py --quiet -d %TORCH_CPP_TEST_MNIST_PATH%

View File

@ -40,6 +40,12 @@ python -m pip install pytest-rerunfailures==10.3 pytest-cpp==2.3.0 tensorboard==
# Install Z3 optional dependency for Windows builds.
python -m pip install z3-solver==4.12.2.0
# Install tlparse for test\dynamo\test_structured_trace.py UTs.
python -m pip install tlparse==0.3.25
# Install parameterized
python -m pip install parameterized==0.8.1
run_tests() {
# Run nvidia-smi if available
for path in '/c/Program Files/NVIDIA Corporation/NVSMI/nvidia-smi.exe' /c/Windows/System32/nvidia-smi.exe; do

View File

@ -116,15 +116,14 @@ if [[ "$PACKAGE_TYPE" == libtorch ]]; then
cd /tmp/libtorch
fi
if [[ "$GPU_ARCH_TYPE" == xpu ]]; then
# Workaround for __mkl_tmp_MOD unbound variable issue, refer https://github.com/pytorch/pytorch/issues/130543
set +u
source /opt/intel/oneapi/pytorch-gpu-dev-0.5/oneapi-vars.sh
fi
# Test the package
/builder/check_binary.sh
if [[ "\$GPU_ARCH_TYPE" != *s390x* && "\$GPU_ARCH_TYPE" != *xpu* && "\$GPU_ARCH_TYPE" != *rocm* && "$PACKAGE_TYPE" != libtorch ]]; then
# Exclude s390, xpu, rocm and libtorch builds from smoke testing
python /builder/test/smoke_test/smoke_test.py --package=torchonly --torch-compile-check disabled
fi
# Clean temp files
cd /builder && git clean -ffdx

View File

@ -90,7 +90,7 @@ fi
if [[ "$PACKAGE_TYPE" =~ .*wheel.* && -n "$PYTORCH_BUILD_VERSION" && "$PYTORCH_BUILD_VERSION" =~ .*rocm.* && $(uname) == "Linux" ]]; then
TRITON_REQUIREMENT="pytorch-triton-rocm==${TRITON_VERSION}; ${TRITON_CONSTRAINT}"
if [[ -n "$PYTORCH_BUILD_VERSION" && "$PYTORCH_BUILD_VERSION" =~ .*dev.* ]]; then
TRITON_SHORTHASH=$(cut -c1-10 $PYTORCH_ROOT/.ci/docker/ci_commit_pins/triton-rocm.txt)
TRITON_SHORTHASH=$(cut -c1-10 $PYTORCH_ROOT/.ci/docker/ci_commit_pins/triton.txt)
TRITON_REQUIREMENT="pytorch-triton-rocm==${TRITON_VERSION}+${TRITON_SHORTHASH}; ${TRITON_CONSTRAINT}"
fi
if [[ -z "${PYTORCH_EXTRA_INSTALL_REQUIREMENTS:-}" ]]; then
@ -102,10 +102,10 @@ fi
# Set triton via PYTORCH_EXTRA_INSTALL_REQUIREMENTS for triton xpu package
if [[ "$PACKAGE_TYPE" =~ .*wheel.* && -n "$PYTORCH_BUILD_VERSION" && "$PYTORCH_BUILD_VERSION" =~ .*xpu.* && $(uname) == "Linux" ]]; then
TRITON_REQUIREMENT="pytorch-triton-xpu==${TRITON_VERSION}"
TRITON_REQUIREMENT="pytorch-triton-xpu==${TRITON_VERSION}; ${TRITON_CONSTRAINT}"
if [[ -n "$PYTORCH_BUILD_VERSION" && "$PYTORCH_BUILD_VERSION" =~ .*dev.* ]]; then
TRITON_SHORTHASH=$(cut -c1-10 $PYTORCH_ROOT/.ci/docker/ci_commit_pins/triton-xpu.txt)
TRITON_REQUIREMENT="pytorch-triton-xpu==${TRITON_VERSION}+${TRITON_SHORTHASH}"
TRITON_REQUIREMENT="pytorch-triton-xpu==${TRITON_VERSION}+${TRITON_SHORTHASH}; ${TRITON_CONSTRAINT}"
fi
if [[ -z "${PYTORCH_EXTRA_INSTALL_REQUIREMENTS:-}" ]]; then
export PYTORCH_EXTRA_INSTALL_REQUIREMENTS="${TRITON_REQUIREMENT}"

View File

@ -10,6 +10,11 @@ export SCCACHE_BUCKET=ossci-compiler-cache
export SCCACHE_IGNORE_SERVER_IO_ERROR=1
export VC_YEAR=2019
if [[ "$DESIRED_CUDA" == 'xpu' ]]; then
export VC_YEAR=2022
export USE_SCCACHE=0
fi
echo "Free space on filesystem before build:"
df -h

View File

@ -6,6 +6,10 @@ source "${BINARY_ENV_FILE:-/c/w/env}"
export CUDA_VERSION="${DESIRED_CUDA/cu/}"
export VC_YEAR=2019
if [[ "$DESIRED_CUDA" == 'xpu' ]]; then
export VC_YEAR=2022
fi
pushd "$BUILDER_ROOT"
./windows/internal/smoke_test.bat

View File

@ -57,7 +57,7 @@ per-file-ignores =
torch/distributed/_tensor/_collective_utils.py: TOR901
# This is a full package that happen to live within the test
# folder, so ok to skip
test/cpp_extensions/open_registration_extension/pytorch_openreg/__init__.py: TOR901
test/cpp_extensions/open_registration_extension/pytorch_openreg/_aten_impl.py: TOR901
optional-ascii-coding = True
exclude =
./.git,

View File

@ -3,18 +3,20 @@ self-hosted-runner:
# GitHub hosted x86 Linux runners
- linux.20_04.4x
- linux.20_04.16x
# Repo-specific LF hosted ARC runners
- linux.large.arc
# Organization-wide AWS Linux Runners
- linux.large
- linux.2xlarge
- linux.4xlarge
- linux.9xlarge.ephemeral
- am2.linux.9xlarge.ephemeral
- linux.12xlarge
- linux.12xlarge.ephemeral
- linux.24xlarge
- linux.24xlarge.ephemeral
- linux.arm64.2xlarge
- linux.arm64.2xlarge.ephemeral
- linux.arm64.m7g.4xlarge
- linux.arm64.m7g.4xlarge.ephemeral
- linux.4xlarge.nvidia.gpu
- linux.8xlarge.nvidia.gpu
- linux.16xlarge.nvidia.gpu
@ -30,32 +32,12 @@ self-hosted-runner:
- lf.linux.8xlarge.nvidia.gpu
- lf.linux.16xlarge.nvidia.gpu
- lf.linux.g5.4xlarge.nvidia.gpu
# Organization-wide AWS Linux Runners with new Amazon 2023 AMI
- amz2023.linux.large
- amz2023.linux.2xlarge
- amz2023.linux.4xlarge
- amz2023.linux.12xlarge
- amz2023.linux.24xlarge
- amz2023.linux.arm64.2xlarge
- amz2023.linux.arm64.m7g.4xlarge
- amz2023.linux.4xlarge.nvidia.gpu
- amz2023.linux.8xlarge.nvidia.gpu
- amz2023.linux.16xlarge.nvidia.gpu
- amz2023.linux.g5.4xlarge.nvidia.gpu
# Pytorch/pytorch AWS Linux Runners with the new Amazon 2023 AMI on Linux Foundation account
- amz2023.lf.linux.large
- amz2023.lf.linux.2xlarge
- amz2023.lf.linux.4xlarge
- amz2023.lf.linux.12xlarge
- amz2023.lf.linux.24xlarge
- amz2023.lf.linux.arm64.2xlarge
- amz2023.lf.linux.4xlarge.nvidia.gpu
- amz2023.lf.linux.8xlarge.nvidia.gpu
- amz2023.lf.linux.16xlarge.nvidia.gpu
- amz2023.lf.linux.g5.4xlarge.nvidia.gpu
# Repo-specific IBM hosted S390x runner
- linux.s390x
# Organization wide AWS Windows runners
- windows.g4dn.xlarge
- windows.g4dn.xlarge.nonephemeral
- windows.4xlarge
- windows.4xlarge.nonephemeral
- windows.8xlarge.nvidia.gpu
- windows.8xlarge.nvidia.gpu.nonephemeral

View File

@ -57,7 +57,7 @@ outputs:
runs:
using: composite
steps:
- uses: nick-fields/retry@3e91a01664abd3c5cd539100d10d33b9c5b68482
- uses: nick-fields/retry@v3.0.0
name: Setup dependencies
env:
GITHUB_TOKEN: ${{ inputs.github-token }}

View File

@ -17,7 +17,7 @@ inputs:
runs:
using: composite
steps:
- uses: nick-fields/retry@3e91a01664abd3c5cd539100d10d33b9c5b68482
- uses: nick-fields/retry@v3.0.0
name: Setup dependencies
with:
shell: bash

View File

@ -24,7 +24,7 @@ inputs:
runs:
using: composite
steps:
- uses: nick-fields/retry@3e91a01664abd3c5cd539100d10d33b9c5b68482
- uses: nick-fields/retry@v3.0.0
name: Setup dependencies
with:
shell: bash

View File

@ -44,7 +44,7 @@ runs:
fi
- name: Log in to ECR
uses: nick-fields/retry@3e91a01664abd3c5cd539100d10d33b9c5b68482
uses: nick-fields/retry@v3.0.0
env:
AWS_RETRY_MODE: standard
AWS_MAX_ATTEMPTS: "5"
@ -59,6 +59,13 @@ runs:
aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \
--password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com"
# For LF Runners we need to make sure we also login to Meta's ECR docker registry too.
META_AWS_ACCOUNT_ID=308535385114
if [ "$AWS_ACCOUNT_ID" != "$META_AWS_ACCOUNT_ID" ] ; then
aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \
--password-stdin "$META_AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com"
fi
- name: Preserve github env variables for use in docker
shell: bash
run: |

View File

@ -31,7 +31,7 @@ runs:
# retry this step several time similar to how checkout-pytorch GHA does
- name: Cleanup workspace
if: always()
uses: nick-fields/retry@v2.8.2
uses: nick-fields/retry@v3.0.0
env:
EXTRA_DELETE_DIR: ${{ inputs.extra-delete-dir }}
with:

View File

@ -1 +1 @@
b3f6f511f2a1082bd56b13a3f6794e7fc3ba4862
ba696ea3dfec4cbe693bf06a84c75dc196077f5b

View File

@ -1 +1 @@
5ea4535f0699f366adb554183a65ebf7dc34a8be
2eb4a60ed14a38260b85b0c765161f0ce45be6d1

View File

@ -1,13 +1,50 @@
# Use this to auto apply labels based on other labels. Applies to both PRs and
# issues. Currently only supports any and all
- any:
- "module: custom operators"
- "module: opcheck"
then:
- "module: custom-operators"
- any:
- "module: custom-operators"
- "module: functionalization"
- "module: aotdispatch"
- "module: higher order operators"
- "module: fakeTensor"
- "module: ProxyTensor"
- "module: library"
- "module: reinplacing"
then:
- "module: pt2-dispatcher"
- any:
- "module: vmap"
then:
- "module: functorch"
- any:
- "module: reinplacing"
then:
- "module: inductor"
- any:
- "module: pt2 optimizer"
then:
- "module: dynamo"
- any:
- "module: flex attention"
then:
- "module: higher order operators"
- any:
- "module: aotinductor"
then:
- "oncall: export"
- any:
- "module: dynamo"
- "module: pt2-dispatcher"
- "module: inductor"
- "module: aotinductor"
- "module: cudagraphs"
- "oncall: export"
- "module: startup-tracing-compile"
- "module: compiled autograd"
- "module: flex attention"
- "module: dynamic shapes"
then:
- "oncall: pt2"

1
.github/labeler.yml vendored
View File

@ -29,7 +29,6 @@
- torch/fx/experimental/recording.py
- torch/fx/experimental/sym_node.py
- torch/fx/experimental/validator.py
- torch/fx/experimental/_sym_dispatch_mode.py
- torch/fx/experimental/proxy_tensor.py
- test/distributed/_tensor/test_dtensor_compile.py
- test/distributed/tensor/parallel/test_fsdp_2d_parallel.py

View File

@ -7,10 +7,14 @@
# runners. Runners listed here will be available as self hosted
# runners, configuration is directly pulled from the main branch.
#
# NOTE (Apr, 5, 2021): Linux runners are currently all an amazonlinux2
#
# NOTE (Jan 5, 2021): Linux runners are all non-ephemeral to reduce the amount of CreateInstaces calls
# to avoid RequestLimitExceeded issues
# NOTES:
# - Linux runners are by default non-ephemeral to reduce the amount of CreateInstaces calls
# to avoid RequestLimitExceeded issues
# - When updating this file, run the following command to validate the YAML and to generate
# corresponding versions of scale-config for the pytorch/pytorch repo and merge the
# pytorch/pytorch changes before merging these changes.
# `python .github/scripts/validate_scale_config.py --test-infra-repo-root [path_to_test-infra_root] --pytorch-repo-root [path_to_pytorch_root]``
#
# TODO: Add some documentation on how the auto-scaling works
#
@ -31,132 +35,190 @@ runner_types:
is_ephemeral: false
max_available: 1000
os: linux
ami: al2023-ami-2023.5.202*-kernel-6.1-x86_64
lf.c.linux.10xlarge.avx2:
disk_size: 200
instance_type: m4.10xlarge
is_ephemeral: false
max_available: 450
os: linux
ami: al2023-ami-2023.5.202*-kernel-6.1-x86_64
lf.c.linux.24xl.spr-metal:
disk_size: 200
instance_type: c7i.metal-24xl
is_ephemeral: false
max_available: 150
os: linux
ami: al2023-ami-2023.5.202*-kernel-6.1-x86_64
lf.c.linux.16xlarge.spr:
disk_size: 200
instance_type: c7i.16xlarge
is_ephemeral: false
max_available: 150
os: linux
ami: al2023-ami-2023.5.202*-kernel-6.1-x86_64
lf.c.linux.9xlarge.ephemeral:
disk_size: 200
instance_type: c5.9xlarge
is_ephemeral: true
max_available: 20
max_available: 50
os: linux
ami: al2023-ami-2023.5.202*-kernel-6.1-x86_64
variants:
am2:
ami: amzn2-ami-hvm-2.0.20240306.2-x86_64-ebs
lf.c.linux.12xlarge.ephemeral:
disk_size: 200
instance_type: c5.12xlarge
is_ephemeral: true
max_available: 300
os: linux
ami: al2023-ami-2023.5.202*-kernel-6.1-x86_64
lf.c.linux.16xlarge.nvidia.gpu:
disk_size: 150
instance_type: g3.16xlarge
is_ephemeral: false
max_available: 150
os: linux
ami: al2023-ami-2023.5.202*-kernel-6.1-x86_64
lf.c.linux.24xlarge:
disk_size: 150
instance_type: c5.24xlarge
is_ephemeral: false
max_available: 250
max_available: 500
os: linux
ami: al2023-ami-2023.5.202*-kernel-6.1-x86_64
lf.c.linux.24xlarge.ephemeral:
disk_size: 150
instance_type: c5.24xlarge
is_ephemeral: true
max_available: 200
os: linux
ami: al2023-ami-2023.5.202*-kernel-6.1-x86_64
lf.c.linux.2xlarge:
disk_size: 150
instance_type: c5.2xlarge
is_ephemeral: false
max_available: 3120
os: linux
ami: al2023-ami-2023.5.202*-kernel-6.1-x86_64
lf.c.linux.4xlarge:
disk_size: 150
instance_type: c5.4xlarge
is_ephemeral: false
max_available: 1000
os: linux
ami: al2023-ami-2023.5.202*-kernel-6.1-x86_64
lf.c.linux.4xlarge.nvidia.gpu:
disk_size: 150
instance_type: g3.4xlarge
is_ephemeral: false
max_available: 1000
os: linux
ami: al2023-ami-2023.5.202*-kernel-6.1-x86_64
lf.c.linux.8xlarge.nvidia.gpu:
disk_size: 150
instance_type: g3.8xlarge
is_ephemeral: false
max_available: 400
os: linux
ami: al2023-ami-2023.5.202*-kernel-6.1-x86_64
lf.c.linux.g4dn.12xlarge.nvidia.gpu:
disk_size: 150
instance_type: g4dn.12xlarge
is_ephemeral: false
max_available: 250
os: linux
ami: al2023-ami-2023.5.202*-kernel-6.1-x86_64
lf.c.linux.g4dn.metal.nvidia.gpu:
disk_size: 150
instance_type: g4dn.metal
is_ephemeral: false
max_available: 300
os: linux
ami: al2023-ami-2023.5.202*-kernel-6.1-x86_64
lf.c.linux.g5.48xlarge.nvidia.gpu:
disk_size: 150
instance_type: g5.48xlarge
is_ephemeral: false
max_available: 200
os: linux
ami: al2023-ami-2023.5.202*-kernel-6.1-x86_64
lf.c.linux.g5.12xlarge.nvidia.gpu:
disk_size: 150
instance_type: g5.12xlarge
is_ephemeral: false
max_available: 150
os: linux
ami: al2023-ami-2023.5.202*-kernel-6.1-x86_64
lf.c.linux.g5.4xlarge.nvidia.gpu:
disk_size: 150
instance_type: g5.4xlarge
is_ephemeral: false
max_available: 2400
os: linux
ami: al2023-ami-2023.5.202*-kernel-6.1-x86_64
lf.c.linux.g6.4xlarge.experimental.nvidia.gpu:
disk_size: 150
instance_type: g6.4xlarge
is_ephemeral: false
max_available: 30
max_available: 50
os: linux
ami: al2023-ami-2023.5.202*-kernel-6.1-x86_64
lf.c.linux.large:
max_available: 1200
disk_size: 15
instance_type: c5.large
is_ephemeral: false
os: linux
ami: al2023-ami-2023.5.202*-kernel-6.1-x86_64
lf.c.linux.arm64.2xlarge:
disk_size: 256
instance_type: t4g.2xlarge
is_ephemeral: false
max_available: 200
os: linux
ami: al2023-ami-2023.5.202*-kernel-6.1-arm64
lf.c.linux.arm64.m7g.4xlarge:
disk_size: 256
instance_type: m7g.4xlarge
is_ephemeral: false
max_available: 200
os: linux
ami: al2023-ami-2023.5.202*-kernel-6.1-arm64
lf.c.linux.arm64.2xlarge.ephemeral:
disk_size: 256
instance_type: t4g.2xlarge
is_ephemeral: true
max_available: 200
os: linux
ami: al2023-ami-2023.5.202*-kernel-6.1-arm64
lf.c.linux.arm64.m7g.4xlarge.ephemeral:
disk_size: 256
instance_type: m7g.4xlarge
is_ephemeral: true
max_available: 200
os: linux
ami: al2023-ami-2023.5.202*-kernel-6.1-arm64
lf.c.linux.arm64.m7g.metal:
disk_size: 256
instance_type: m7g.metal
is_ephemeral: false
max_available: 100
os: linux
ami: al2023-ami-2023.5.202*-kernel-6.1-arm64
lf.c.windows.g4dn.xlarge:
disk_size: 256
instance_type: g4dn.xlarge
is_ephemeral: true
max_available: 100
os: windows
lf.c.windows.g4dn.xlarge.nonephemeral:
disk_size: 256
instance_type: g4dn.xlarge
is_ephemeral: false
max_available: 100
os: windows
lf.c.windows.4xlarge:
disk_size: 256
instance_type: c5d.4xlarge
@ -187,159 +249,3 @@ runner_types:
is_ephemeral: false
max_available: 250
os: windows
### Setup runner types to test the Amazon Linux 2023 AMI
lf.c.amz2023.linux.12xlarge:
disk_size: 200
instance_type: c5.12xlarge
is_ephemeral: false
max_available: 1000
os: linux
ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64
lf.c.amz2023.linux.10xlarge.avx2:
disk_size: 200
instance_type: m4.10xlarge
is_ephemeral: false
max_available: 450
os: linux
ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64
lf.c.amz2023.linux.24xl.spr-metal:
disk_size: 200
instance_type: c7i.metal-24xl
is_ephemeral: false
max_available: 150
os: linux
ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64
lf.c.amz2023.linux.16xlarge.spr:
disk_size: 200
instance_type: c7i.16xlarge
is_ephemeral: false
max_available: 150
os: linux
ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64
lf.c.amz2023.linux.9xlarge.ephemeral:
disk_size: 200
instance_type: c5.9xlarge
is_ephemeral: true
max_available: 20
os: linux
ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64
lf.c.amz2023.linux.12xlarge.ephemeral:
disk_size: 200
instance_type: c5.12xlarge
is_ephemeral: true
max_available: 300
os: linux
ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64
lf.c.amz2023.linux.16xlarge.nvidia.gpu:
disk_size: 150
instance_type: g3.16xlarge
is_ephemeral: false
max_available: 150
os: linux
ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64
lf.c.amz2023.linux.24xlarge:
disk_size: 150
instance_type: c5.24xlarge
is_ephemeral: false
max_available: 250
os: linux
ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64
lf.c.amz2023.linux.2xlarge:
disk_size: 150
instance_type: c5.2xlarge
is_ephemeral: false
max_available: 3120
os: linux
ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64
lf.c.amz2023.linux.4xlarge:
disk_size: 150
instance_type: c5.4xlarge
is_ephemeral: false
max_available: 1000
os: linux
ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64
lf.c.amz2023.linux.4xlarge.nvidia.gpu:
disk_size: 150
instance_type: g3.4xlarge
is_ephemeral: false
max_available: 1000
os: linux
ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64
lf.c.amz2023.linux.8xlarge.nvidia.gpu:
disk_size: 150
instance_type: g3.8xlarge
is_ephemeral: false
max_available: 400
os: linux
ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64
lf.c.amz2023.linux.g4dn.12xlarge.nvidia.gpu:
disk_size: 150
instance_type: g4dn.12xlarge
is_ephemeral: false
max_available: 250
os: linux
ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64
lf.c.amz2023.linux.g4dn.metal.nvidia.gpu:
disk_size: 150
instance_type: g4dn.metal
is_ephemeral: false
max_available: 300
os: linux
ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64
lf.c.amz2023.linux.g5.48xlarge.nvidia.gpu:
disk_size: 150
instance_type: g5.48xlarge
is_ephemeral: false
max_available: 200
os: linux
ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64
lf.c.amz2023.linux.g5.12xlarge.nvidia.gpu:
disk_size: 150
instance_type: g5.12xlarge
is_ephemeral: false
max_available: 150
os: linux
ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64
lf.c.amz2023.linux.g5.4xlarge.nvidia.gpu:
disk_size: 150
instance_type: g5.4xlarge
is_ephemeral: false
max_available: 2400
os: linux
ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64
lf.c.amz2023.linux.g6.4xlarge.experimental.nvidia.gpu:
disk_size: 150
instance_type: g6.4xlarge
is_ephemeral: false
max_available: 30
os: linux
ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64
lf.c.amz2023.linux.large:
max_available: 1200
disk_size: 15
instance_type: c5.large
is_ephemeral: false
os: linux
ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64
lf.c.amz2023.linux.arm64.2xlarge:
disk_size: 256
instance_type: t4g.2xlarge
is_ephemeral: false
max_available: 200
os: linux
ami: al2023-ami-2023.5.20240701.0-kernel-6.1-arm64
lf.c.amz2023.linux.arm64.m7g.4xlarge:
disk_size: 256
instance_type: m7g.4xlarge
is_ephemeral: false
max_available: 200
os: linux
ami: al2023-ami-2023.5.20240701.0-kernel-6.1-arm64
lf.c.amz2023.linux.arm64.m7g.metal:
disk_size: 256
instance_type: m7g.metal
is_ephemeral: false
max_available: 100
os: linux
ami: al2023-ami-2023.5.20240701.0-kernel-6.1-arm64

View File

@ -7,10 +7,14 @@
# runners. Runners listed here will be available as self hosted
# runners, configuration is directly pulled from the main branch.
#
# NOTE (Apr, 5, 2021): Linux runners are currently all an amazonlinux2
#
# NOTE (Jan 5, 2021): Linux runners are all non-ephemeral to reduce the amount of CreateInstaces calls
# to avoid RequestLimitExceeded issues
# NOTES:
# - Linux runners are by default non-ephemeral to reduce the amount of CreateInstaces calls
# to avoid RequestLimitExceeded issues
# - When updating this file, run the following command to validate the YAML and to generate
# corresponding versions of scale-config for the pytorch/pytorch repo and merge the
# pytorch/pytorch changes before merging these changes.
# `python .github/scripts/validate_scale_config.py --test-infra-repo-root [path_to_test-infra_root] --pytorch-repo-root [path_to_pytorch_root]``
#
# TODO: Add some documentation on how the auto-scaling works
#
@ -31,132 +35,190 @@ runner_types:
is_ephemeral: false
max_available: 1000
os: linux
ami: al2023-ami-2023.5.202*-kernel-6.1-x86_64
lf.linux.10xlarge.avx2:
disk_size: 200
instance_type: m4.10xlarge
is_ephemeral: false
max_available: 450
os: linux
ami: al2023-ami-2023.5.202*-kernel-6.1-x86_64
lf.linux.24xl.spr-metal:
disk_size: 200
instance_type: c7i.metal-24xl
is_ephemeral: false
max_available: 150
os: linux
ami: al2023-ami-2023.5.202*-kernel-6.1-x86_64
lf.linux.16xlarge.spr:
disk_size: 200
instance_type: c7i.16xlarge
is_ephemeral: false
max_available: 150
os: linux
ami: al2023-ami-2023.5.202*-kernel-6.1-x86_64
lf.linux.9xlarge.ephemeral:
disk_size: 200
instance_type: c5.9xlarge
is_ephemeral: true
max_available: 20
max_available: 50
os: linux
ami: al2023-ami-2023.5.202*-kernel-6.1-x86_64
variants:
am2:
ami: amzn2-ami-hvm-2.0.20240306.2-x86_64-ebs
lf.linux.12xlarge.ephemeral:
disk_size: 200
instance_type: c5.12xlarge
is_ephemeral: true
max_available: 300
os: linux
ami: al2023-ami-2023.5.202*-kernel-6.1-x86_64
lf.linux.16xlarge.nvidia.gpu:
disk_size: 150
instance_type: g3.16xlarge
is_ephemeral: false
max_available: 150
os: linux
ami: al2023-ami-2023.5.202*-kernel-6.1-x86_64
lf.linux.24xlarge:
disk_size: 150
instance_type: c5.24xlarge
is_ephemeral: false
max_available: 250
max_available: 500
os: linux
ami: al2023-ami-2023.5.202*-kernel-6.1-x86_64
lf.linux.24xlarge.ephemeral:
disk_size: 150
instance_type: c5.24xlarge
is_ephemeral: true
max_available: 200
os: linux
ami: al2023-ami-2023.5.202*-kernel-6.1-x86_64
lf.linux.2xlarge:
disk_size: 150
instance_type: c5.2xlarge
is_ephemeral: false
max_available: 3120
os: linux
ami: al2023-ami-2023.5.202*-kernel-6.1-x86_64
lf.linux.4xlarge:
disk_size: 150
instance_type: c5.4xlarge
is_ephemeral: false
max_available: 1000
os: linux
ami: al2023-ami-2023.5.202*-kernel-6.1-x86_64
lf.linux.4xlarge.nvidia.gpu:
disk_size: 150
instance_type: g3.4xlarge
is_ephemeral: false
max_available: 1000
os: linux
ami: al2023-ami-2023.5.202*-kernel-6.1-x86_64
lf.linux.8xlarge.nvidia.gpu:
disk_size: 150
instance_type: g3.8xlarge
is_ephemeral: false
max_available: 400
os: linux
ami: al2023-ami-2023.5.202*-kernel-6.1-x86_64
lf.linux.g4dn.12xlarge.nvidia.gpu:
disk_size: 150
instance_type: g4dn.12xlarge
is_ephemeral: false
max_available: 250
os: linux
ami: al2023-ami-2023.5.202*-kernel-6.1-x86_64
lf.linux.g4dn.metal.nvidia.gpu:
disk_size: 150
instance_type: g4dn.metal
is_ephemeral: false
max_available: 300
os: linux
ami: al2023-ami-2023.5.202*-kernel-6.1-x86_64
lf.linux.g5.48xlarge.nvidia.gpu:
disk_size: 150
instance_type: g5.48xlarge
is_ephemeral: false
max_available: 200
os: linux
ami: al2023-ami-2023.5.202*-kernel-6.1-x86_64
lf.linux.g5.12xlarge.nvidia.gpu:
disk_size: 150
instance_type: g5.12xlarge
is_ephemeral: false
max_available: 150
os: linux
ami: al2023-ami-2023.5.202*-kernel-6.1-x86_64
lf.linux.g5.4xlarge.nvidia.gpu:
disk_size: 150
instance_type: g5.4xlarge
is_ephemeral: false
max_available: 2400
os: linux
ami: al2023-ami-2023.5.202*-kernel-6.1-x86_64
lf.linux.g6.4xlarge.experimental.nvidia.gpu:
disk_size: 150
instance_type: g6.4xlarge
is_ephemeral: false
max_available: 30
max_available: 50
os: linux
ami: al2023-ami-2023.5.202*-kernel-6.1-x86_64
lf.linux.large:
max_available: 1200
disk_size: 15
instance_type: c5.large
is_ephemeral: false
os: linux
ami: al2023-ami-2023.5.202*-kernel-6.1-x86_64
lf.linux.arm64.2xlarge:
disk_size: 256
instance_type: t4g.2xlarge
is_ephemeral: false
max_available: 200
os: linux
ami: al2023-ami-2023.5.202*-kernel-6.1-arm64
lf.linux.arm64.m7g.4xlarge:
disk_size: 256
instance_type: m7g.4xlarge
is_ephemeral: false
max_available: 200
os: linux
ami: al2023-ami-2023.5.202*-kernel-6.1-arm64
lf.linux.arm64.2xlarge.ephemeral:
disk_size: 256
instance_type: t4g.2xlarge
is_ephemeral: true
max_available: 200
os: linux
ami: al2023-ami-2023.5.202*-kernel-6.1-arm64
lf.linux.arm64.m7g.4xlarge.ephemeral:
disk_size: 256
instance_type: m7g.4xlarge
is_ephemeral: true
max_available: 200
os: linux
ami: al2023-ami-2023.5.202*-kernel-6.1-arm64
lf.linux.arm64.m7g.metal:
disk_size: 256
instance_type: m7g.metal
is_ephemeral: false
max_available: 100
os: linux
ami: al2023-ami-2023.5.202*-kernel-6.1-arm64
lf.windows.g4dn.xlarge:
disk_size: 256
instance_type: g4dn.xlarge
is_ephemeral: true
max_available: 100
os: windows
lf.windows.g4dn.xlarge.nonephemeral:
disk_size: 256
instance_type: g4dn.xlarge
is_ephemeral: false
max_available: 100
os: windows
lf.windows.4xlarge:
disk_size: 256
instance_type: c5d.4xlarge
@ -187,159 +249,3 @@ runner_types:
is_ephemeral: false
max_available: 250
os: windows
### Setup runner types to test the Amazon Linux 2023 AMI
lf.amz2023.linux.12xlarge:
disk_size: 200
instance_type: c5.12xlarge
is_ephemeral: false
max_available: 1000
os: linux
ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64
lf.amz2023.linux.10xlarge.avx2:
disk_size: 200
instance_type: m4.10xlarge
is_ephemeral: false
max_available: 450
os: linux
ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64
lf.amz2023.linux.24xl.spr-metal:
disk_size: 200
instance_type: c7i.metal-24xl
is_ephemeral: false
max_available: 150
os: linux
ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64
lf.amz2023.linux.16xlarge.spr:
disk_size: 200
instance_type: c7i.16xlarge
is_ephemeral: false
max_available: 150
os: linux
ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64
lf.amz2023.linux.9xlarge.ephemeral:
disk_size: 200
instance_type: c5.9xlarge
is_ephemeral: true
max_available: 20
os: linux
ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64
lf.amz2023.linux.12xlarge.ephemeral:
disk_size: 200
instance_type: c5.12xlarge
is_ephemeral: true
max_available: 300
os: linux
ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64
lf.amz2023.linux.16xlarge.nvidia.gpu:
disk_size: 150
instance_type: g3.16xlarge
is_ephemeral: false
max_available: 150
os: linux
ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64
lf.amz2023.linux.24xlarge:
disk_size: 150
instance_type: c5.24xlarge
is_ephemeral: false
max_available: 250
os: linux
ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64
lf.amz2023.linux.2xlarge:
disk_size: 150
instance_type: c5.2xlarge
is_ephemeral: false
max_available: 3120
os: linux
ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64
lf.amz2023.linux.4xlarge:
disk_size: 150
instance_type: c5.4xlarge
is_ephemeral: false
max_available: 1000
os: linux
ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64
lf.amz2023.linux.4xlarge.nvidia.gpu:
disk_size: 150
instance_type: g3.4xlarge
is_ephemeral: false
max_available: 1000
os: linux
ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64
lf.amz2023.linux.8xlarge.nvidia.gpu:
disk_size: 150
instance_type: g3.8xlarge
is_ephemeral: false
max_available: 400
os: linux
ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64
lf.amz2023.linux.g4dn.12xlarge.nvidia.gpu:
disk_size: 150
instance_type: g4dn.12xlarge
is_ephemeral: false
max_available: 250
os: linux
ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64
lf.amz2023.linux.g4dn.metal.nvidia.gpu:
disk_size: 150
instance_type: g4dn.metal
is_ephemeral: false
max_available: 300
os: linux
ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64
lf.amz2023.linux.g5.48xlarge.nvidia.gpu:
disk_size: 150
instance_type: g5.48xlarge
is_ephemeral: false
max_available: 200
os: linux
ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64
lf.amz2023.linux.g5.12xlarge.nvidia.gpu:
disk_size: 150
instance_type: g5.12xlarge
is_ephemeral: false
max_available: 150
os: linux
ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64
lf.amz2023.linux.g5.4xlarge.nvidia.gpu:
disk_size: 150
instance_type: g5.4xlarge
is_ephemeral: false
max_available: 2400
os: linux
ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64
lf.amz2023.linux.g6.4xlarge.experimental.nvidia.gpu:
disk_size: 150
instance_type: g6.4xlarge
is_ephemeral: false
max_available: 30
os: linux
ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64
lf.amz2023.linux.large:
max_available: 1200
disk_size: 15
instance_type: c5.large
is_ephemeral: false
os: linux
ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64
lf.amz2023.linux.arm64.2xlarge:
disk_size: 256
instance_type: t4g.2xlarge
is_ephemeral: false
max_available: 200
os: linux
ami: al2023-ami-2023.5.20240701.0-kernel-6.1-arm64
lf.amz2023.linux.arm64.m7g.4xlarge:
disk_size: 256
instance_type: m7g.4xlarge
is_ephemeral: false
max_available: 200
os: linux
ami: al2023-ami-2023.5.20240701.0-kernel-6.1-arm64
lf.amz2023.linux.arm64.m7g.metal:
disk_size: 256
instance_type: m7g.metal
is_ephemeral: false
max_available: 100
os: linux
ami: al2023-ami-2023.5.20240701.0-kernel-6.1-arm64

View File

@ -86,6 +86,18 @@
- pull
- inductor
- name: OSS CI / pytorchbot / slow tests
patterns:
- test/slow_tests.json
approved_by:
- pytorchbot
ignore_flaky_failures: false
mandatory_checks_name:
- EasyCLA
- Lint
- pull
- slow
- name: OSS CI /pytorchbot / Executorch
patterns:
- .ci/docker/ci_commit_pins/executorch.txt
@ -107,8 +119,8 @@
mandatory_checks_name:
- EasyCLA
- Lint
- pull / linux-focal-py3_8-clang9-xla / build
- pull / linux-focal-py3_8-clang9-xla / test (xla, 1, 1, linux.12xlarge)
- pull / linux-focal-py3_9-clang9-xla / build
- pull / linux-focal-py3_9-clang9-xla / test (xla, 1, 1, linux.12xlarge)
- name: Documentation
patterns:
@ -282,9 +294,11 @@
- torch/_C/_distributed*
- torch/csrc/distributed/**
- torch/testing/_internal/distributed/**
- torch/multiprocessing/**
- test/distributed/**
- test/cpp/dist_autograd/**
- test/cpp/rpc/**
- test/*multiprocessing*
approved_by:
- wconstab
- mrshenli
@ -530,6 +544,7 @@
- anijain2305
- bdhirsh
- zou3519
- isuruf
mandatory_checks_name:
- EasyCLA
- Lint

5
.github/nitpicks.yml vendored Normal file
View File

@ -0,0 +1,5 @@
- markdown: |
## Attention! native_functions.yaml was changed
If you are adding a new function or defaulted argument to native_functions.yaml, you cannot use it from pre-existing Python frontend code until our FC window passes (two weeks). Split your PR into two PRs, one which adds the new C++ functionality, and one that makes use of it from Python, and land them two weeks apart. See https://github.com/pytorch/pytorch/wiki/PyTorch's-Python-Frontend-Backward-and-Forward-Compatibility-Policy#forwards-compatibility-fc for more info.
pathFilter:
- 'aten/src/ATen/native/native_functions.yaml'

View File

@ -9,6 +9,7 @@ ciflow_push_tags:
- ciflow/inductor-rocm
- ciflow/inductor-perf-compare
- ciflow/inductor-micro-benchmark
- ciflow/inductor-micro-benchmark-cpu-x86
- ciflow/inductor-cu124
- ciflow/linux-aarch64
- ciflow/mps

View File

@ -4,4 +4,4 @@ ninja=1.10.2
numpy=1.23.3
pyyaml=6.0
setuptools=68.2.2
typing-extensions=4.9.0
typing-extensions=4.11.0

View File

@ -1,6 +1,7 @@
boto3==1.19.12
hypothesis==6.56.4
expecttest==0.1.6
expecttest==0.2.1
fbscribelogger==0.1.6
librosa>=0.6.2
mpmath==1.3.0
networkx==2.8.7
@ -18,7 +19,7 @@ pytest-rerunfailures==10.3
pytest-flakefinder==1.1.0
scipy==1.10.1
sympy==1.12.1 ; python_version == "3.8"
sympy>=1.13.0 ; python_version >= "3.9"
sympy==1.13.1 ; python_version >= "3.9"
unittest-xml-reporting<=3.2.0,>=2.0.0
xdoctest==1.1.0
filelock==3.6.0
@ -30,3 +31,4 @@ optree==0.12.1
# NB: test_hparams_* from test_tensorboard is failing with protobuf 5.26.0 in
# which the stringify metadata is wrong when escaping double quote
protobuf==3.20.2
parameterized==0.8.1

View File

@ -15,9 +15,7 @@ REPO_DIR = SCRIPT_DIR.parent.parent
def read_triton_pin(device: str = "cuda") -> str:
triton_file = "triton.txt"
if device == "rocm":
triton_file = "triton-rocm.txt"
elif device == "xpu":
if device == "xpu":
triton_file = "triton-xpu.txt"
with open(REPO_DIR / ".ci" / "docker" / "ci_commit_pins" / triton_file) as f:
return f.read().strip()
@ -50,6 +48,25 @@ def patch_init_py(
f.write(orig)
# TODO: remove patch_setup_py() once we have a proper fix for https://github.com/triton-lang/triton/issues/4527
def patch_setup_py(path: Path) -> None:
with open(path) as f:
orig = f.read()
try:
orig = check_and_replace(
orig,
"https://tritonlang.blob.core.windows.net/llvm-builds/",
"https://oaitriton.blob.core.windows.net/public/llvm-builds/",
)
with open(path, "w") as f:
f.write(orig)
except RuntimeError as e:
print(
f"Applying patch_setup_py() for llvm-build package failed: {e}.",
"If you are trying to build a newer version of Triton, you can ignore this.",
)
def build_triton(
*,
version: str,
@ -91,6 +108,9 @@ def build_triton(
else:
check_call(["git", "checkout", commit_hash], cwd=triton_basedir)
# TODO: remove this and patch_setup_py() once we have a proper fix for https://github.com/triton-lang/triton/issues/4527
patch_setup_py(triton_pythondir / "setup.py")
if build_conda:
with open(triton_basedir / "meta.yaml", "w") as meta:
print(

View File

@ -27,6 +27,12 @@ def parse_args() -> Any:
parser = ArgumentParser("Check PR labels")
parser.add_argument("pr_num", type=int)
# add a flag to return a non-zero exit code if the PR does not have the required labels
parser.add_argument(
"--exit-non-zero",
action="store_true",
help="Return a non-zero exit code if the PR does not have the required labels",
)
return parser.parse_args()
@ -41,10 +47,13 @@ def main() -> None:
if not has_required_labels(pr):
print(LABEL_ERR_MSG)
add_label_err_comment(pr)
if args.exit_non_zero:
sys.exit(1)
else:
delete_all_label_err_comments(pr)
except Exception as e:
pass
if args.exit_non_zero:
sys.exit(1)
sys.exit(0)

View File

@ -169,7 +169,8 @@ def create_cherry_pick_branch(
repo.create_branch_and_checkout(branch=cherry_pick_branch)
# We might want to support ghstack later
repo._run_git("cherry-pick", "-x", "-X", "theirs", commit_sha)
# We don't want to resolve conflicts here.
repo._run_git("cherry-pick", "-x", commit_sha)
repo.push(branch=cherry_pick_branch, dry_run=False)
return cherry_pick_branch

View File

@ -18,13 +18,13 @@ from typing import Dict, List, Optional, Tuple
CUDA_ARCHES = ["11.8", "12.1", "12.4"]
CUDA_ARCHES_FULL_VERSION = {"11.8": "11.8.0", "12.1": "12.1.1", "12.4": "12.4.0"}
CUDA_ARCHES_FULL_VERSION = {"11.8": "11.8.0", "12.1": "12.1.1", "12.4": "12.4.1"}
CUDA_ARCHES_CUDNN_VERSION = {"11.8": "9", "12.1": "9", "12.4": "9"}
ROCM_ARCHES = ["6.0", "6.1"]
ROCM_ARCHES = ["6.1", "6.2"]
XPU_ARCHES = ["xpu"]
@ -68,18 +68,18 @@ PYTORCH_EXTRA_INSTALL_REQUIREMENTS = {
"nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'"
),
"12.4": (
"nvidia-cuda-nvrtc-cu12==12.4.99; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-cuda-runtime-cu12==12.4.99; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-cuda-cupti-cu12==12.4.99; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-cuda-nvrtc-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-cuda-runtime-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-cuda-cupti-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-cublas-cu12==12.4.2.65; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-cufft-cu12==11.2.0.44; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-curand-cu12==10.3.5.119; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-cusolver-cu12==11.6.0.99; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-cusparse-cu12==12.3.0.142; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-cublas-cu12==12.4.5.8; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-cufft-cu12==11.2.1.3; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-curand-cu12==10.3.5.147; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-cusolver-cu12==11.6.1.9; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-cusparse-cu12==12.3.1.170; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-nccl-cu12==2.21.5; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-nvtx-cu12==12.4.99; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-nvjitlink-cu12==12.4.99; platform_system == 'Linux' and platform_machine == 'x86_64'"
"nvidia-nvtx-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-nvjitlink-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64'"
),
}
@ -325,6 +325,7 @@ def generate_wheels_matrix(
os: str,
arches: Optional[List[str]] = None,
python_versions: Optional[List[str]] = None,
use_split_build: bool = False,
) -> List[Dict[str, str]]:
package_type = "wheel"
if os == "linux" or os == "linux-aarch64" or os == "linux-s390x":
@ -340,7 +341,7 @@ def generate_wheels_matrix(
if os == "linux":
arches += CPU_CXX11_ABI_ARCH + CUDA_ARCHES + ROCM_ARCHES + XPU_ARCHES
elif os == "windows":
arches += CUDA_ARCHES
arches += CUDA_ARCHES + XPU_ARCHES
elif os == "linux-aarch64":
# Only want the one arch as the CPU type is different and
# uses different build/test scripts
@ -365,13 +366,23 @@ def generate_wheels_matrix(
else arch_version
)
# TODO: Enable python 3.13 on rocm, xpu, aarch64, windows
# TODO: Enable python 3.13 on rocm, aarch64, windows
if (
gpu_arch_type in ["rocm", "xpu"] or os != "linux"
gpu_arch_type == "rocm" or (os != "linux" and os != "linux-s390x")
) and python_version == "3.13":
continue
if use_split_build and (
arch_version not in ["12.4", "12.1", "11.8", "cpu"] or os != "linux"
):
raise RuntimeError(
"Split build is only supported on linux with cuda 12.4, 12.1, 11.8, and cpu.\n"
f"Currently attempting to build on arch version {arch_version} and os {os}.\n"
"Please modify the matrix generation to exclude this combination."
)
# 12.1 linux wheels require PYTORCH_EXTRA_INSTALL_REQUIREMENTS to install
if (
arch_version in ["12.4", "12.1", "11.8"]
and os == "linux"
@ -385,6 +396,7 @@ def generate_wheels_matrix(
"desired_cuda": translate_desired_cuda(
gpu_arch_type, gpu_arch_version
),
"use_split_build": "True" if use_split_build else "False",
"devtoolset": (
"cxx11-abi" if arch_version == "cuda-aarch64" else ""
),
@ -400,7 +412,8 @@ def generate_wheels_matrix(
),
}
)
if arch_version != "cuda-aarch64":
# Special build building to use on Colab. Python 3.11 for 12.1 CUDA
if python_version == "3.11" and arch_version == "12.1":
ret.append(
{
"python_version": python_version,
@ -409,40 +422,16 @@ def generate_wheels_matrix(
"desired_cuda": translate_desired_cuda(
gpu_arch_type, gpu_arch_version
),
"use_split_build": "True",
"use_split_build": "True" if use_split_build else "False",
"devtoolset": "",
"container_image": WHEEL_CONTAINER_IMAGES[arch_version],
"package_type": package_type,
"pytorch_extra_install_requirements": (
PYTORCH_EXTRA_INSTALL_REQUIREMENTS[arch_version] # fmt: skip
if os != "linux-aarch64"
else ""
),
"build_name": f"{package_type}-py{python_version}-{gpu_arch_type}{gpu_arch_version}-split".replace( # noqa: B950
"pytorch_extra_install_requirements": "",
"build_name": f"{package_type}-py{python_version}-{gpu_arch_type}{gpu_arch_version}-full".replace( # noqa: B950
".", "_"
),
}
)
# Special build building to use on Colab. PyThon 3.10 for 12.1 CUDA
if python_version == "3.10" and arch_version == "12.1":
ret.append(
{
"python_version": python_version,
"gpu_arch_type": gpu_arch_type,
"gpu_arch_version": gpu_arch_version,
"desired_cuda": translate_desired_cuda(
gpu_arch_type, gpu_arch_version
),
"use_split_build": "False",
"devtoolset": "",
"container_image": WHEEL_CONTAINER_IMAGES[arch_version],
"package_type": package_type,
"pytorch_extra_install_requirements": "",
"build_name": f"{package_type}-py{python_version}-{gpu_arch_type}{gpu_arch_version}-full".replace( # noqa: B950
".", "_"
),
}
)
else:
ret.append(
{
@ -452,10 +441,9 @@ def generate_wheels_matrix(
"desired_cuda": translate_desired_cuda(
gpu_arch_type, gpu_arch_version
),
"use_split_build": "True" if use_split_build else "False",
"devtoolset": (
"cxx11-abi"
if arch_version in ["cpu-cxx11-abi", "xpu"]
else ""
"cxx11-abi" if arch_version == "cpu-cxx11-abi" else ""
),
"container_image": WHEEL_CONTAINER_IMAGES[arch_version],
"package_type": package_type,
@ -464,11 +452,12 @@ def generate_wheels_matrix(
),
"pytorch_extra_install_requirements": (
PYTORCH_EXTRA_INSTALL_REQUIREMENTS["12.1"] # fmt: skip
if os != "linux"
if os != "linux" and gpu_arch_type != "xpu"
else ""
),
}
)
return ret

View File

@ -61,6 +61,7 @@ class BinaryBuildWorkflow:
# Mainly for macos
cross_compile_arm64: bool = False
macos_runner: str = "macos-14-xlarge"
use_split_build: bool = False
def __post_init__(self) -> None:
if self.abi_version:
@ -69,6 +70,9 @@ class BinaryBuildWorkflow:
)
else:
self.build_environment = f"{self.os}-binary-{self.package_type}"
if self.use_split_build:
# added to distinguish concurrency groups
self.build_environment += "-split"
def generate_workflow_file(self, workflow_template: jinja2.Template) -> None:
output_file_path = (
@ -110,6 +114,20 @@ LINUX_BINARY_BUILD_WORFKLOWS = [
isolated_workflow=True,
),
),
BinaryBuildWorkflow(
os=OperatingSystem.LINUX,
package_type="manywheel",
build_configs=generate_binary_build_matrix.generate_wheels_matrix(
OperatingSystem.LINUX,
use_split_build=True,
arches=["11.8", "12.1", "12.4", "cpu"],
),
ciflow_config=CIFlowConfig(
labels={LABEL_CIFLOW_BINARIES, LABEL_CIFLOW_BINARIES_WHEEL},
isolated_workflow=True,
),
use_split_build=True,
),
BinaryBuildWorkflow(
os=OperatingSystem.LINUX,
package_type="conda",
@ -158,10 +176,25 @@ LINUX_BINARY_SMOKE_WORKFLOWS = [
build_configs=generate_binary_build_matrix.generate_wheels_matrix(
OperatingSystem.LINUX,
arches=["11.8", "12.1", "12.4"],
python_versions=["3.8"],
python_versions=["3.9"],
),
branches="main",
),
BinaryBuildWorkflow(
os=OperatingSystem.LINUX,
package_type="manywheel",
build_configs=generate_binary_build_matrix.generate_wheels_matrix(
OperatingSystem.LINUX,
arches=["11.8", "12.1", "12.4"],
python_versions=["3.9"],
use_split_build=True,
),
ciflow_config=CIFlowConfig(
labels={LABEL_CIFLOW_PERIODIC},
),
branches="main",
use_split_build=True,
),
BinaryBuildWorkflow(
os=OperatingSystem.LINUX,
package_type="libtorch",

View File

@ -46,16 +46,24 @@ def gh_fetch_url_and_headers(
with urlopen(Request(url, headers=headers, data=data_, method=method)) as conn:
return conn.headers, reader(conn)
except HTTPError as err:
if err.code == 403 and all(
key in err.headers for key in ["X-RateLimit-Limit", "X-RateLimit-Used"]
if (
err.code == 403
and all(
key in err.headers
for key in ["X-RateLimit-Limit", "X-RateLimit-Remaining"]
)
and int(err.headers["X-RateLimit-Remaining"]) == 0
):
print(
f"""Rate limit exceeded:
f"""{url}
Rate limit exceeded:
Used: {err.headers['X-RateLimit-Used']}
Limit: {err.headers['X-RateLimit-Limit']}
Remaining: {err.headers['X-RateLimit-Remaining']}
Resets at: {err.headers['x-RateLimit-Reset']}"""
)
else:
print(f"Error fetching {url} {err}")
raise
@ -160,6 +168,14 @@ def gh_post_commit_comment(
)
def gh_close_pr(org: str, repo: str, pr_num: int, dry_run: bool = False) -> None:
url = f"{GITHUB_API_URL}/repos/{org}/{repo}/pulls/{pr_num}"
if dry_run:
print(f"Dry run closing PR {pr_num}")
else:
gh_fetch_url(url, method="PATCH", data={"state": "closed"})
def gh_delete_comment(org: str, repo: str, comment_id: int) -> None:
url = f"{GITHUB_API_URL}/repos/{org}/{repo}/issues/comments/{comment_id}"
gh_fetch_url(url, method="DELETE")

View File

@ -445,7 +445,6 @@ def retries_decorator(
print(
f'Attempt {idx} of {num_retries} to call {f.__name__} failed with "{e}"'
)
pass
return cast(T, rc)
return wrapper

View File

@ -17,6 +17,11 @@ if [[ -d "${CACHE_DIRECTORY}" ]]; then
cp -r "${CACHE_DIRECTORY}" . || true
fi
# if lintrunner is not installed, install it
if ! command -v lintrunner &> /dev/null; then
python3 -m pip install lintrunner==0.12.5
fi
# This has already been cached in the docker image
lintrunner init 2> /dev/null
@ -33,7 +38,7 @@ python3 torch/utils/data/datapipes/gen_pyi.py
RC=0
# Run lintrunner on all files
if ! lintrunner --force-color --all-files --tee-json=lint.json ${ADDITIONAL_LINTRUNNER_ARGS} 2> /dev/null; then
if ! lintrunner --force-color --tee-json=lint.json ${ADDITIONAL_LINTRUNNER_ARGS} 2> /dev/null; then
echo ""
echo -e "\e[1m\e[36mYou can reproduce these results locally by using \`lintrunner -m origin/main\`. (If you don't get the same results, run \'lintrunner init\' to update your local linter)\e[0m"
echo -e "\e[1m\e[36mSee https://github.com/pytorch/pytorch/wiki/lintrunner for setup instructions.\e[0m"

View File

@ -1,23 +1,96 @@
# flake8: noqa: G004
"""
This runner determinator is used to determine which set of runners to run a
GitHub job on. It uses the first comment of a GitHub issue (by default
https://github.com/pytorch/test-infra/issues/5132) to define the configuration
of which runners should be used to run which job.
The configuration has two parts, the settings and a list of opted-in users,
separated by a line containing "---". If the line is not present, the
settings are considered to be empty with only the second part, the user
list, defined.
The first part is a YAML block that defines the rollout settings. This can be
used to define any settings that are needed to determine which runners to use.
It's fields are defined by the RolloutSettings class below.
The second part is a list of users who are explicitly opted in to the LF fleet.
The user list is also a comma separated list of additional features or
experiments which the user could be opted in to.
The user list has the following rules:
- Users are GitHub usernames, which must start with the @ prefix
- Each user is also a comma-separated list of features/experiments to enable
- A "#" prefix opts the user out of all experiments
Example config:
# A list of experiments that can be opted into.
# This defines the behavior they'll induce when opted into.
# Expected syntax is:
# [experiment_name]: # Name of the experiment. Also used for the label prefix.
# rollout_perc: [int] # % of workflows to run with this experiment when users are not opted in.
experiments:
lf:
rollout_percent: 25
---
# Opt-ins:
# Users can opt into the LF fleet by adding their GitHub username to this list
# and specifying experiments to enable in a comma-separated list.
# Experiments should be from the above list.
@User1,lf,split_build
@User2,lf
@User3,split_build
"""
import logging
import os
import random
from argparse import ArgumentParser
from logging import LogRecord
from typing import Any, Iterable
from typing import Any, Dict, Iterable, List, NamedTuple, Tuple
import yaml
from github import Auth, Github
from github.Issue import Issue
WORKFLOW_LABEL_META = "" # use meta runners
DEFAULT_LABEL_PREFIX = "" # use meta runners
WORKFLOW_LABEL_LF = "lf." # use runners from the linux foundation
WORKFLOW_LABEL_LF_CANARY = "lf.c." # use canary runners from the linux foundation
GITHUB_OUTPUT = os.getenv("GITHUB_OUTPUT", "")
GH_OUTPUT_KEY_AMI = "runner-ami"
GH_OUTPUT_KEY_LABEL_TYPE = "label-type"
SETTING_EXPERIMENTS = "experiments"
LF_FLEET_EXPERIMENT = "lf"
CANARY_FLEET_SUFFIX = ".c"
class Experiment(NamedTuple):
rollout_perc: float = (
0 # Percentage of workflows to experiment on when user is not opted-in.
)
# Add more fields as needed
class Settings(NamedTuple):
"""
Settings for the experiments that can be opted into.
"""
experiments: Dict[str, Experiment] = {}
class ColorFormatter(logging.Formatter):
"""Color codes the log messages based on the log level"""
@ -109,11 +182,14 @@ def get_issue(gh: Github, repo: str, issue_num: int) -> Issue:
def get_potential_pr_author(
gh: Github, repo: str, username: str, ref_type: str, ref_name: str
github_token: str, repo: str, username: str, ref_type: str, ref_name: str
) -> str:
# If the trigger was a new tag added by a bot, this is a ciflow case
# Fetch the actual username from the original PR. The PR number is
# embedded in the tag name: ciflow/<name>/<pr-number>
gh = get_gh_client(github_token)
if username == "pytorch-bot[bot]" and ref_type == "tag":
split_tag = ref_name.split("/")
if (
@ -135,80 +211,233 @@ def get_potential_pr_author(
def is_exception_branch(branch: str) -> bool:
"""
Branches that get opted out of all experiments and should always use Meta runners
"""
return branch.split("/")[0] in {"main", "nightly", "release", "landchecks"}
def get_workflow_type(issue: Issue, workflow_requestors: Iterable[str]) -> str:
def load_yaml(yaml_text: str) -> Any:
try:
first_comment = issue.get_comments()[0].body.strip("\n\t ")
data = yaml.safe_load(yaml_text)
return data
except yaml.YAMLError as exc:
log.exception("Error loading YAML")
raise
if first_comment[0] == "!":
log.info("LF Workflows are disabled for everyone. Using meta runners.")
return WORKFLOW_LABEL_META
elif first_comment[0] == "*":
log.info("LF Workflows are enabled for everyone. Using LF runners.")
return WORKFLOW_LABEL_LF
else:
all_opted_in_users = {
usr_raw.strip("\n\t@ ") for usr_raw in first_comment.split()
}
opted_in_requestors = {
usr for usr in workflow_requestors if usr in all_opted_in_users
}
if opted_in_requestors:
def extract_settings_user_opt_in_from_text(rollout_state: str) -> Tuple[str, str]:
"""
Extracts the text with settings, if any, and the opted in users from the rollout state.
If the issue body contains "---" then the text above that is the settings
and the text below is the list of opted in users.
If it doesn't contain "---" then the settings are empty and the rest is the users.
"""
rollout_state_parts = rollout_state.split("---")
if len(rollout_state_parts) >= 2:
return rollout_state_parts[0], rollout_state_parts[1]
else:
return "", rollout_state
class UserOptins(Dict[str, List[str]]):
"""
Dictionary of users with a list of features they have opted into
"""
def parse_user_opt_in_from_text(user_optin_text: str) -> UserOptins:
"""
Parse the user opt-in text into a key value pair of username and the list of features they have opted into
Users are GitHub usernames with the @ prefix. Each user is also a comma-separated list of features/experiments to enable.
- Example line: "@User1,lf,split_build"
- A "#" prefix indicates the user is opted out of all experiments
"""
optins = UserOptins()
for user in user_optin_text.split("\n"):
user = user.strip("\r\n\t -")
if not user or not user.startswith("@"):
# Not a valid user. Skip
continue
if user:
usr_name = user.split(",")[0].strip("@")
optins[usr_name] = [exp.strip(" ") for exp in user.split(",")[1:]]
return optins
def parse_settings_from_text(settings_text: str) -> Settings:
"""
Parse the experiments from the issue body into a list of ExperimentSettings
"""
try:
if settings_text:
# Escape the backtick as well so that we can have the settings in a code block on the GH issue
# for easy reading
# Note: Using ascii for the backtick so that the cat step in _runner-determinator.yml doesn't choke on
# the backtick character in shell commands.
backtick = chr(96) # backtick character
settings_text = settings_text.strip(f"\r\n\t{backtick} ")
settings = load_yaml(settings_text)
# For now we just load experiments. We can expand this if/when we add more settings
experiments = {}
for exp_name, exp_settings in settings.get(SETTING_EXPERIMENTS).items():
valid_settings = {}
for setting in exp_settings:
if setting not in Experiment._fields:
log.warning(
f"Unexpected setting in experiment: {setting} = {exp_settings[setting]}"
)
else:
valid_settings[setting] = exp_settings[setting]
experiments[exp_name] = Experiment(**valid_settings)
return Settings(experiments)
except Exception:
log.exception("Failed to parse settings")
return Settings()
def parse_settings(rollout_state: str) -> Settings:
"""
Parse settings, if any, from the rollout state.
If the issue body contains "---" then the text above that is the settings
and the text below is the list of opted in users.
If it doesn't contain "---" then the settings are empty and the default values are used.
"""
settings_text, _ = extract_settings_user_opt_in_from_text(rollout_state)
return parse_settings_from_text(settings_text)
def parse_users(rollout_state: str) -> UserOptins:
"""
Parse users from the rollout state.
"""
_, users_text = extract_settings_user_opt_in_from_text(rollout_state)
return parse_user_opt_in_from_text(users_text)
def is_user_opted_in(user: str, user_optins: UserOptins, experiment_name: str) -> bool:
"""
Check if a user is opted into an experiment
"""
return experiment_name in user_optins.get(user, [])
def get_runner_prefix(
rollout_state: str, workflow_requestors: Iterable[str], is_canary: bool = False
) -> str:
settings = parse_settings(rollout_state)
user_optins = parse_users(rollout_state)
fleet_prefix = ""
prefixes = []
for experiment_name, experiment_settings in settings.experiments.items():
enabled = False
# Is any workflow_requestor opted in to this experiment?
opted_in_users = [
requestor
for requestor in workflow_requestors
if is_user_opted_in(requestor, user_optins, experiment_name)
]
if opted_in_users:
log.info(
f"{', '.join(opted_in_users)} have opted into experiment {experiment_name}."
)
enabled = True
elif experiment_settings.rollout_perc:
# If no user is opted in, then we randomly enable the experiment based on the rollout percentage
if random.uniform(0, 100) <= experiment_settings.rollout_perc:
log.info(
f"LF Workflows are enabled for {', '.join(opted_in_requestors)}. Using LF runners."
f"Based on rollout percentage of {experiment_settings.rollout_perc}%, enabling experiment {experiment_name}."
)
return WORKFLOW_LABEL_LF
enabled = True
if enabled:
label = experiment_name
if experiment_name == LF_FLEET_EXPERIMENT:
# We give some special treatment to the "lf" experiment since determines the fleet we use
# - If it's enabled, then we always list it's prefix first
# - If we're in the canary branch, then we append ".c" to the lf prefix
if is_canary:
label += CANARY_FLEET_SUFFIX
fleet_prefix = label
else:
log.info(
f"LF Workflows are disabled for {', '.join(workflow_requestors)}. Using meta runners."
)
return WORKFLOW_LABEL_META
prefixes.append(label)
except Exception as e:
if len(prefixes) > 1:
log.error(
f"Failed to get determine workflow type. Falling back to meta runners. Exception: {e}"
f"Only a fleet and one other experiment can be enabled for a job at any time. Enabling {prefixes[0]} and ignoring the rest, which are {', '.join(prefixes[1:])}"
)
return WORKFLOW_LABEL_META
prefixes = prefixes[:1]
# Fleet always comes first
if fleet_prefix:
prefixes.insert(0, fleet_prefix)
return ".".join(prefixes) + "." if prefixes else ""
def get_rollout_state_from_issue(github_token: str, repo: str, issue_num: int) -> str:
"""
Gets the first comment of the issue, which contains the desired rollout state.
The default issue we use - https://github.com/pytorch/test-infra/issues/5132
"""
gh = get_gh_client(github_token)
issue = get_issue(gh, repo, issue_num)
return str(issue.get_comments()[0].body.strip("\n\t "))
def main() -> None:
args = parse_args()
if args.github_ref_type == "branch" and is_exception_branch(args.github_branch):
log.info(f"Exception branch: '{args.github_branch}', using meta runners")
label_type = WORKFLOW_LABEL_META
log.info(
f"Exception branch: '{args.github_branch}', using Meta runners and no experiments."
)
runner_label_prefix = DEFAULT_LABEL_PREFIX
else:
try:
gh = get_gh_client(args.github_token)
# The default issue we use - https://github.com/pytorch/test-infra/issues/5132
issue = get_issue(gh, args.github_issue_repo, args.github_issue)
rollout_state = get_rollout_state_from_issue(
args.github_token, args.github_issue_repo, args.github_issue
)
username = get_potential_pr_author(
gh,
args.github_token,
args.github_repo,
args.github_actor,
args.github_ref_type,
args.github_branch,
)
label_type = get_workflow_type(
issue,
(
args.github_issue_owner,
username,
),
is_canary = args.github_repo == "pytorch/pytorch-canary"
runner_label_prefix = get_runner_prefix(
rollout_state, (args.github_issue_owner, username), is_canary
)
except Exception as e:
log.error(
f"Failed to get issue. Falling back to meta runners. Exception: {e}"
f"Failed to get issue. Defaulting to Meta runners and no experiments. Exception: {e}"
)
label_type = WORKFLOW_LABEL_META
# For Canary builds use canary runners
if args.github_repo == "pytorch/pytorch-canary" and label_type == WORKFLOW_LABEL_LF:
label_type = WORKFLOW_LABEL_LF_CANARY
set_github_output(GH_OUTPUT_KEY_LABEL_TYPE, label_type)
set_github_output(GH_OUTPUT_KEY_LABEL_TYPE, runner_label_prefix)
if __name__ == "__main__":

View File

@ -3,7 +3,7 @@
## Install prerequisites.
```
$ sudo dnf install docker
$ sudo dnf install podman podman-docker jq
```
## Add services.
@ -27,23 +27,48 @@ $ sudo systemctl enable --now qemu-user-static
## Rebuild the image
In order to build or update the `iiilinuxibmcom/actions-runner` image, e.g. to get the
latest OS security fixes, use the following commands:
First build s390x builder image `docker.io/pytorch/manylinuxs390x-builder`,
using following commands:
```
$ cd ~
$ git clone https://github.com/pytorch/pytorch
$ cd pytorch
$ git submodule update --init --recursive
$ GPU_ARCH_TYPE=cpu-s390x "$(pwd)/.ci/docker/manywheel/build.sh" manylinuxs390x-builder
$ docker image tag localhost/pytorch/manylinuxs390x-builder docker.io/pytorch/manylinuxs390x-builder:cpu-s390x
$ docker image save -o ~/manywheel-s390x.tar docker.io/pytorch/manylinuxs390x-builder:cpu-s390x
```
Next step is to build `actions-runner` image using:
```
$ cd self-hosted-builder
$ sudo docker build \
--build-arg repo=<owner>/<name> \
--build-arg token=<***> \
--pull \
-f actions-runner.Dockerfile \
-t iiilinuxibmcom/actions-runner \
-t iiilinuxibmcom/actions-runner.<name> \
.
```
If it fails, ensure that selinux doesn't prevent it from working.
If there are failures, ensure that selinux doesn't prevent it from working.
In worst case, selinux can be disabled with `setenforce 0`.
Now prepare all necessary files for runner registration:
```
$ sudo mkdir -p /etc/actions-runner/<name>
$ sudo chmod 700 /etc/actions-runner/<name>
$ sudo /bin/cp <github_app_private_key_file> /etc/actions-runner/<name>/key_private.pem
$ sudo echo <github_app_id> | sudo tee /etc/actions-runner/<name>/appid.env
$ sudo echo <github_app_install_id> | sudo tee /etc/actions-runner/<name>/installid.env
$ sudo echo NAME=<worker_name> | sudo tee /etc/actions-runner/<name>/env
$ sudo echo ORG=<github_org> | sudo tee -a /etc/actions-runner/<name>/env
$ cd self-hosted-builder
$ sudo /bin/cp helpers/*.sh /usr/local/bin/
$ sudo chmod 755 /usr/local/bin/app_token.sh /usr/local/bin/gh_token_generator.sh
```
## Autostart the runner.
```

View File

@ -1,12 +1,12 @@
# Self-Hosted IBM Z Github Actions Runner.
# Temporary image: amd64 dependencies.
FROM docker.io/amd64/ubuntu:22.04 as ld-prefix
FROM docker.io/amd64/ubuntu:23.10 as ld-prefix
ENV DEBIAN_FRONTEND=noninteractive
RUN apt-get update && apt-get -y install ca-certificates libicu70 libssl3
RUN apt-get update && apt-get -y install ca-certificates libicu72 libssl3
# Main image.
FROM docker.io/s390x/ubuntu:22.04
FROM docker.io/s390x/ubuntu:23.10
# Packages for pytorch building and testing.
ENV DEBIAN_FRONTEND=noninteractive
@ -16,6 +16,7 @@ RUN apt-get update && apt-get -y install \
gcc \
git \
jq \
zip \
libxml2-dev \
libxslt-dev \
ninja-build \
@ -43,24 +44,28 @@ COPY fs/ /
RUN chmod +x /usr/bin/actions-runner /usr/bin/entrypoint
# install podman
RUN apt -y install podman podman-docker
# amd64 Github Actions Runner.
RUN useradd -m actions-runner
USER actions-runner
WORKDIR /home/actions-runner
RUN curl -L https://github.com/actions/runner/releases/download/v2.309.0/actions-runner-linux-x64-2.309.0.tar.gz | tar -xz
# repository
ARG repo
# set up python virtual environment which is later used by runner.
# build workflows use "python -m pip install ...",
# and it doesn't work for non-root user
RUN virtualenv --system-site-packages venv
# repository token
ARG token
# copy prebuilt manywheel docker image for builds and tests
# build command is:
# GPU_ARCH_TYPE=cpu-s390x "$(pwd)/manywheel/build_docker.sh"
# and save command is:
# docker image save -o manywheel-s390x.tar pytorch/manylinuxs390x-builder:cpu-s390x
#
COPY --chown=actions-runner:actions-runner manywheel-s390x.tar /home/actions-runner/manywheel-s390x.tar
RUN ./config.sh \
--unattended \
--url "https://github.com/${repo}" \
--token "${token}" \
--no-default-labels \
--labels self-hosted,linux.s390x
RUN curl -L https://github.com/actions/runner/releases/download/v2.317.0/actions-runner-linux-x64-2.317.0.tar.gz | tar -xz
ENTRYPOINT ["/usr/bin/entrypoint"]
CMD ["/usr/bin/actions-runner"]

View File

@ -8,12 +8,16 @@ StartLimitIntervalSec=0
Type=simple
Restart=always
ExecStartPre=-/usr/bin/docker rm --force actions-runner.%i
ExecStartPre=-/usr/local/bin/gh_token_generator.sh /etc/actions-runner/%i/appid.env /etc/actions-runner/%i/installid.env /etc/actions-runner/%i/key_private.pem /etc/actions-runner/%i/ghtoken.env
ExecStart=/usr/bin/docker run \
--env-file=/etc/actions-runner/%i/env \
--env-file=/etc/actions-runner/%i/ghtoken.env \
--init \
--interactive \
--name=actions-runner.%i \
--rm \
iiilinuxibmcom/actions-runner
--privileged \
iiilinuxibmcom/actions-runner.%i
ExecStop=/bin/sh -c "docker exec actions-runner.%i kill -INT -- -1"
ExecStop=/bin/sh -c "docker wait actions-runner.%i"
ExecStop=/bin/sh -c "docker rm actions-runner.%i"

View File

@ -2,5 +2,45 @@
set -e -u
# first import docker image
if [ -f ./manywheel-s390x.tar ] ; then
docker image load --input manywheel-s390x.tar
docker image tag docker.io/pytorch/manylinuxs390x-builder:cpu-s390x docker.io/pytorch/manylinuxs390x-builder:cpu-s390x-main
rm -f manywheel-s390x.tar
fi
token_file=registration-token.json
# Generate registration token
curl \
-X POST \
-H "Accept: application/vnd.github.v3+json" \
-H "Authorization: Bearer ${ACCESS_TOKEN}" \
"https://api.github.com/orgs/${ORG}/actions/runners/registration-token" \
-o "$token_file"
unset ACCESS_TOKEN
# register runner as ephemeral runner
# it does one job, stops and unregisters
registration_token=$(jq --raw-output .token "$token_file")
./config.sh \
--unattended \
--ephemeral \
--url "https://github.com/${ORG}" \
--token "${registration_token}" \
--name "${NAME}" \
--no-default-labels \
--labels self-hosted,linux.s390x
unset registration_token
rm -f "$token_file"
# enter into python virtual environment.
# build workflows use "python -m pip install ...",
# and it doesn't work for non-root user
source venv/bin/activate
# Run one job.
./run.sh --once
./run.sh

View File

@ -0,0 +1,84 @@
#!/usr/bin/env bash
#
# Request an ACCESS_TOKEN to be used by a GitHub APP
# Environment variable that need to be set up:
# * APP_ID, the GitHub's app ID
# * INSTALL_ID, the Github's app's installation ID
# * APP_PRIVATE_KEY, the content of GitHub app's private key in PEM format.
#
# https://github.com/orgs/community/discussions/24743#discussioncomment-3245300
#
set -o pipefail
_GITHUB_HOST=${GITHUB_HOST:="github.com"}
# If URL is not github.com then use the enterprise api endpoint
if [[ ${GITHUB_HOST} = "github.com" ]]; then
URI="https://api.${_GITHUB_HOST}"
else
URI="https://${_GITHUB_HOST}/api/v3"
fi
API_VERSION=v3
API_HEADER="Accept: application/vnd.github.${API_VERSION}+json"
CONTENT_LENGTH_HEADER="Content-Length: 0"
APP_INSTALLATIONS_URI="${URI}/app/installations"
# JWT parameters based off
# https://docs.github.com/en/developers/apps/building-github-apps/authenticating-with-github-apps#authenticating-as-a-github-app
#
# JWT token issuance and expiration parameters
JWT_IAT_DRIFT=60
JWT_EXP_DELTA=600
JWT_JOSE_HEADER='{
"alg": "RS256",
"typ": "JWT"
}'
build_jwt_payload() {
now=$(date +%s)
iat=$((now - JWT_IAT_DRIFT))
jq -c \
--arg iat_str "${iat}" \
--arg exp_delta_str "${JWT_EXP_DELTA}" \
--arg app_id_str "${APP_ID}" \
'
($iat_str | tonumber) as $iat
| ($exp_delta_str | tonumber) as $exp_delta
| ($app_id_str | tonumber) as $app_id
| .iat = $iat
| .exp = ($iat + $exp_delta)
| .iss = $app_id
' <<< "{}" | tr -d '\n'
}
base64url() {
base64 | tr '+/' '-_' | tr -d '=\n'
}
rs256_sign() {
openssl dgst -binary -sha256 -sign <(echo "$1")
}
request_access_token() {
jwt_payload=$(build_jwt_payload)
encoded_jwt_parts=$(base64url <<<"${JWT_JOSE_HEADER}").$(base64url <<<"${jwt_payload}")
encoded_mac=$(echo -n "$encoded_jwt_parts" | rs256_sign "${APP_PRIVATE_KEY}" | base64url)
generated_jwt="${encoded_jwt_parts}.${encoded_mac}"
auth_header="Authorization: Bearer ${generated_jwt}"
app_installations_response=$(curl -sX POST \
-H "${auth_header}" \
-H "${API_HEADER}" \
--header "X-GitHub-Api-Version: 2022-11-28" \
--url "https://api.github.com/app/installations/${INSTALL_ID}/access_tokens" \
)
echo "$app_installations_response" | jq --raw-output '.token'
}
request_access_token

View File

@ -0,0 +1,10 @@
#!/usr/bin/env bash
SCRIPT_DIR=$(dirname "$0")
APP_ID=$1
INSTALL_ID=$2
APP_PRIVATE_KEY=$3
DST_FILE="$4"
ACCESS_TOKEN="$(APP_ID="$(<"${APP_ID}")" INSTALL_ID="$(<"${INSTALL_ID}")" APP_PRIVATE_KEY="$(<"${APP_PRIVATE_KEY}")" "${SCRIPT_DIR}/app_token.sh")"
echo "ACCESS_TOKEN=${ACCESS_TOKEN}" > "${DST_FILE}"

View File

@ -1,35 +0,0 @@
#!/bin/bash
set -eoux pipefail
SYNC_BRANCH=pytorch-stable-prototype
git config user.email "fake@example.com"
git config user.name "PyTorch Stable Bot"
git fetch origin main
git fetch origin "$SYNC_BRANCH"
git checkout "$SYNC_BRANCH"
# Using a hardcoded SHA here is a massive speedup as we can skip the entire history of the pytorch GitHub repo.
# This specific SHA was chosen as it was before the "branch point" of the stable branch
for SHA in $(git log ba3b05fdf37ddbc3c301294d6a560a816335e717..origin/main --pretty="%h" -- torch/distributed torch/csrc/distributed test/distributed test/cpp/c10d benchmarks/distributed)
do
# `git merge-base --is-ancestor` exits with code 0 if the given SHA is an ancestor, and non-0 otherwise
if git merge-base --is-ancestor $SHA HEAD || [[ $(git log --grep="(cherry picked from commit $SHA") ]]
then
echo "Skipping $SHA"
continue
fi
echo "Copying $SHA"
git cherry-pick -x "$SHA" -X theirs
git reset --soft HEAD~1
git add torch/distributed torch/csrc/distributed test/distributed test/cpp/c10d benchmarks/distributed
git checkout .
git commit --reuse-message=HEAD@{1}
git clean -f
done
if [[ "${WITH_PUSH}" == true ]]; then
git push
fi

View File

@ -51,6 +51,8 @@ def main() -> None:
for platform_image in platform_images: # type: ignore[attr-defined]
for arch in platform_image.keys(): # type: ignore[attr-defined]
if arch == "cpu-s390x":
continue
tag_image(
platform_image[arch], # type: ignore[index]
default_tag,

View File

@ -18,6 +18,7 @@ def mock_parse_args() -> object:
class Object:
def __init__(self) -> None:
self.pr_num = 76123
self.exit_non_zero = False
return Object()

View File

@ -0,0 +1,237 @@
from unittest import main, TestCase
from unittest.mock import Mock, patch
import runner_determinator as rd
class TestRunnerDeterminatorIssueParser(TestCase):
def test_parse_settings(self) -> None:
settings_text = """
experiments:
lf:
rollout_perc: 25
otherExp:
rollout_perc: 0
---
Users:
@User1,lf
@User2,lf,otherExp
"""
settings = rd.parse_settings(settings_text)
self.assertTupleEqual(
rd.Experiment(rollout_perc=25),
settings.experiments["lf"],
"lf settings not parsed correctly",
)
self.assertTupleEqual(
rd.Experiment(rollout_perc=0),
settings.experiments["otherExp"],
"otherExp settings not parsed correctly",
)
def test_parse_settings_in_code_block(self) -> None:
settings_text = """
```
experiments:
lf:
rollout_perc: 25
otherExp:
rollout_perc: 0
```
---
Users:
@User1,lf
@User2,lf,otherExp
"""
settings = rd.parse_settings(settings_text)
self.assertTupleEqual(
rd.Experiment(rollout_perc=25),
settings.experiments["lf"],
"lf settings not parsed correctly",
)
self.assertTupleEqual(
rd.Experiment(rollout_perc=0),
settings.experiments["otherExp"],
"otherExp settings not parsed correctly",
)
def test_parse_users(self) -> None:
settings_text = """
experiments:
lf:
rollout_perc: 0
otherExp:
rollout_perc: 0
---
Users:
@User1,lf
@User2,lf,otherExp
"""
users = rd.parse_users(settings_text)
self.assertDictEqual(
{"User1": ["lf"], "User2": ["lf", "otherExp"]},
users,
"Users not parsed correctly",
)
def test_parse_users_without_settings(self) -> None:
settings_text = """
@User1,lf
@User2,lf,otherExp
"""
users = rd.parse_users(settings_text)
self.assertDictEqual(
{"User1": ["lf"], "User2": ["lf", "otherExp"]},
users,
"Users not parsed correctly",
)
class TestRunnerDeterminatorGetRunnerPrefix(TestCase):
def test_opted_in_user(self) -> None:
settings_text = """
experiments:
lf:
rollout_perc: 0
otherExp:
rollout_perc: 0
---
Users:
@User1,lf
@User2,lf,otherExp
"""
prefix = rd.get_runner_prefix(settings_text, ["User1"])
self.assertEqual("lf.", prefix, "Runner prefix not correct for User1")
def test_opted_in_user_two_experiments(self) -> None:
settings_text = """
experiments:
lf:
rollout_perc: 0
otherExp:
rollout_perc: 0
---
Users:
@User1,lf
@User2,lf,otherExp
"""
prefix = rd.get_runner_prefix(settings_text, ["User2"])
self.assertEqual("lf.otherExp.", prefix, "Runner prefix not correct for User2")
@patch("random.uniform", return_value=50)
def test_opted_out_user(self, mock_uniform: Mock) -> None:
settings_text = """
experiments:
lf:
rollout_perc: 25
otherExp:
rollout_perc: 25
---
Users:
@User1,lf
@User2,lf,otherExp
"""
prefix = rd.get_runner_prefix(settings_text, ["User3"])
self.assertEqual("", prefix, "Runner prefix not correct for user")
@patch("random.uniform", return_value=10)
def test_opted_out_user_was_pulled_in_by_rollout(self, mock_uniform: Mock) -> None:
settings_text = """
experiments:
lf:
rollout_perc: 25
otherExp:
rollout_perc: 25
---
Users:
@User1,lf
@User2,lf,otherExp
"""
# User3 is opted out, but is pulled into both experiments by the 10% rollout
prefix = rd.get_runner_prefix(settings_text, ["User3"])
self.assertEqual("lf.otherExp.", prefix, "Runner prefix not correct for user")
def test_lf_prefix_always_comes_first(self) -> None:
settings_text = """
experiments:
otherExp:
rollout_perc: 0
lf:
rollout_perc: 0
---
Users:
@User1,lf
@User2,otherExp,lf
"""
prefix = rd.get_runner_prefix(settings_text, ["User2"])
self.assertEqual("lf.otherExp.", prefix, "Runner prefix not correct for user")
def test_ignores_commented_users(self) -> None:
settings_text = """
experiments:
lf:
rollout_perc: 0
otherExp:
rollout_perc: 0
---
Users:
#@User1,lf
@User2,lf,otherExp
"""
prefix = rd.get_runner_prefix(settings_text, ["User1"])
self.assertEqual("", prefix, "Runner prefix not correct for user")
def test_ignores_extra_experiments(self) -> None:
settings_text = """
experiments:
lf:
rollout_perc: 0
otherExp:
rollout_perc: 0
foo:
rollout_perc: 0
---
Users:
@User1,lf,otherExp,foo
"""
prefix = rd.get_runner_prefix(settings_text, ["User1"])
self.assertEqual("lf.otherExp.", prefix, "Runner prefix not correct for user")
if __name__ == "__main__":
main()

View File

@ -36,6 +36,7 @@ from warnings import warn
import yaml
from github_utils import (
gh_close_pr,
gh_fetch_json_list,
gh_fetch_merge_base,
gh_fetch_url,
@ -1116,15 +1117,20 @@ class GitHubPR:
msg = self.get_title() + f" (#{self.pr_num})\n\n"
msg += msg_body
# Mention PR co-authors
for author_login, author_name in self.get_authors().items():
if author_login != self.get_pr_creator_login():
msg += f"\nCo-authored-by: {author_name}"
msg += f"\nPull Request resolved: {self.get_pr_url()}\n"
msg += f"Approved by: {approved_by_urls}\n"
if ghstack_deps:
msg += f"ghstack dependencies: {', '.join([f'#{pr.pr_num}' for pr in ghstack_deps])}\n"
# Mention PR co-authors, which should be at the end of the message
# And separated from the body by two newlines
first_coauthor = True
for author_login, author_name in self.get_authors().items():
if author_login != self.get_pr_creator_login():
if first_coauthor:
msg, first_coauthor = (msg + "\n", False)
msg += f"\nCo-authored-by: {author_name}"
return msg
def add_numbered_label(self, label_base: str, dry_run: bool) -> None:
@ -1169,11 +1175,11 @@ class GitHubPR:
for pr in additional_merged_prs:
pr.add_numbered_label(MERGE_COMPLETE_LABEL, dry_run)
if comment_id and self.pr_num:
# When the merge process reaches this part, we can assume that the commit
# has been successfully pushed to trunk
merge_commit_sha = repo.rev_parse(name=REMOTE_MAIN_BRANCH)
# When the merge process reaches this part, we can assume that the commit
# has been successfully pushed to trunk
merge_commit_sha = repo.rev_parse(name=self.default_branch())
if comment_id and self.pr_num:
# Finally, upload the record to Rockset. The list of pending and failed
# checks are at the time of the merge
save_merge_record(
@ -1198,6 +1204,17 @@ class GitHubPR:
else:
print("Missing comment ID or PR number, couldn't upload to Rockset")
# Usually Github will see that the commit has "resolves <pr_num>" in the
# commit message and close the PR, but sometimes it doesn't, leading to
# confusion. When it doesn't, we close it manually.
time.sleep(60) # Give Github some time to close the PR
manually_close_merged_pr(
pr=self,
additional_merged_prs=additional_merged_prs,
merge_commit_sha=merge_commit_sha,
dry_run=dry_run,
)
def merge_changes(
self,
repo: GitRepo,
@ -1498,6 +1515,34 @@ def checks_to_markdown_bullets(
]
def manually_close_merged_pr(
pr: GitHubPR,
additional_merged_prs: List[GitHubPR],
merge_commit_sha: str,
dry_run: bool,
) -> None:
def _comment_and_close(pr: GitHubPR, comment: str) -> None:
pr = GitHubPR(pr.org, pr.project, pr.pr_num) # Refresh the PR
if not pr.is_closed():
gh_post_pr_comment(pr.org, pr.project, pr.pr_num, comment, dry_run)
gh_close_pr(pr.org, pr.project, pr.pr_num, dry_run)
message = (
f"This PR (#{pr.pr_num}) was merged in {merge_commit_sha} but it is still open, likely due to a Github bug, "
"so mergebot is closing it manually. If you think this is a mistake, please feel free to reopen and contact Dev Infra."
)
_comment_and_close(pr, message)
for additional_pr in additional_merged_prs:
message = (
f"This PR (#{additional_pr.pr_num}) was merged as part of PR #{pr.pr_num} in the stack under {merge_commit_sha} "
"but it is still open, likely due to a Github bug, so mergebot is closing it manually. "
"If you think this is a mistake, please feel free to reopen and contact Dev Infra."
)
_comment_and_close(additional_pr, message)
print(f"PR {pr.pr_num} and all additional PRs in the stack have been closed.")
@retries_decorator()
def save_merge_record(
comment_id: int,

View File

@ -1,7 +1,7 @@
{%- set upload_artifact_s3_action = "seemethere/upload-artifact-s3@v5" -%}
{%- set download_artifact_s3_action = "seemethere/download-artifact-s3@v4" -%}
{%- set upload_artifact_action = "actions/upload-artifact@v3" -%}
{%- set download_artifact_action = "actions/download-artifact@v3" -%}
{%- set upload_artifact_action = "actions/upload-artifact@v4.4.0" -%}
{%- set download_artifact_action = "actions/download-artifact@v4.1.7" -%}
{%- set timeout_minutes = 240 -%}

View File

@ -52,19 +52,32 @@ env:
!{{ common.concurrency(build_environment) }}
jobs:
get-label-type:
name: get-label-type
uses: ./.github/workflows/_runner-determinator.yml
with:
triggering_actor: ${{ github.triggering_actor }}
issue_owner: ${{ github.event.pull_request.user.login || github.event.issue.user.login }}
curr_branch: ${{ github.head_ref || github.ref_name }}
curr_ref_type: ${{ github.ref_type }}
{%- for config in build_configs %}
!{{ config["build_name"] }}-build:
if: ${{ github.repository_owner == 'pytorch' }}
uses: ./.github/workflows/_binary-build-linux.yml
needs: get-label-type
with:!{{ upload.binary_env_as_input(config) }}
{%- if "aarch64" in build_environment %}
runs_on: linux.arm64.m7g.4xlarge
runs_on: linux.arm64.m7g.4xlarge.ephemeral
ALPINE_IMAGE: "arm64v8/alpine"
{%- elif "s390x" in build_environment %}
runs_on: linux.s390x
ALPINE_IMAGE: "docker.io/s390x/alpine"
{%- elif "conda" in build_environment and config["gpu_arch_type"] == "cuda" %}
runs_on: linux.24xlarge
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
runs_on: linux.24xlarge.ephemeral
{%- else %}
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
{%- endif %}
build_name: !{{ config["build_name"] }}
build_environment: !{{ build_environment }}
@ -80,7 +93,9 @@ jobs:
{%- if config["gpu_arch_type"] != "cuda-aarch64" %}
!{{ config["build_name"] }}-test: # Testing
if: ${{ github.repository_owner == 'pytorch' }}
needs: !{{ config["build_name"] }}-build
needs:
- !{{ config["build_name"] }}-build
- get-label-type
{%- if config["gpu_arch_type"] not in ["rocm", "xpu"] %}
uses: ./.github/workflows/_binary-test-linux.yml
with:!{{ upload.binary_env_as_input(config) }}
@ -95,8 +110,10 @@ jobs:
{%- elif config["gpu_arch_type"] == "rocm" %}
runs_on: linux.rocm.gpu
{%- elif config["gpu_arch_type"] == "cuda" %}
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
runs_on: linux.4xlarge.nvidia.gpu
{%- else %}
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
runs_on: linux.4xlarge
{%- endif %}
secrets:

View File

@ -64,9 +64,6 @@ jobs:
{%- if config.pytorch_extra_install_requirements is defined and config.pytorch_extra_install_requirements|d('')|length > 0 %}
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: !{{ config.pytorch_extra_install_requirements }}
{%- endif %}
# For sccache access (only on non-forked PRs)
AWS_ACCESS_KEY_ID: ${{ secrets.MACOS_SCCACHE_S3_ACCESS_KEY_ID }}
AWS_SECRET_ACCESS_KEY: ${{ secrets.MACOS_SCCACHE_S3_SECRET_ACCESS_KEY }}
steps:
!{{ set_runner_specific_vars() }}
- name: Install conda and dependencies
@ -84,7 +81,7 @@ jobs:
!{{ common.checkout(deep_clone=False, directory="pytorch") }}
!{{ common.checkout(deep_clone=False, directory="builder", repository=common.builder_repo, branch=common.builder_branch) }}
- name: Install sccache (only for non-forked PRs, and pushes to trunk)
uses: nick-fields/retry@v2.8.2
uses: nick-fields/retry@v3.0.0
if: ${{ github.event_name == 'push' || github.event.pull_request.head.repo.full_name == github.repository }}
with:
timeout_minutes: 5
@ -104,7 +101,7 @@ jobs:
# shellcheck disable=SC1091
source "${RUNNER_TEMP}/anaconda/bin/activate"
"${PYTORCH_ROOT}/.circleci/scripts/binary_macos_build.sh"
- uses: actions/upload-artifact@v3
- uses: actions/upload-artifact@v4.4.0
if: always()
with:
name: !{{ config["build_name"] }}

View File

@ -45,7 +45,7 @@
{%- if is_windows %}
# This is a dummy value for libtorch to work correctly with our batch scripts
# without this value pip does not get installed for some reason
DESIRED_PYTHON: "3.8"
DESIRED_PYTHON: "3.9"
{%- endif %}
{%- else %}

View File

@ -53,10 +53,24 @@ env:
!{{ common.concurrency(build_environment) }}
jobs:
get-label-type:
name: get-label-type
uses: ./.github/workflows/_runner-determinator.yml
with:
triggering_actor: ${{ github.triggering_actor }}
issue_owner: ${{ github.event.pull_request.user.login || github.event.issue.user.login }}
curr_branch: ${{ github.head_ref || github.ref_name }}
curr_ref_type: ${{ github.ref_type }}
{%- for config in build_configs %}
!{{ config["build_name"] }}-build:
if: ${{ github.repository_owner == 'pytorch' }}
runs-on: windows.4xlarge.nonephemeral
needs: get-label-type
{%- if branches == "nightly" %}
runs-on: "${{ needs.get-label-type.outputs.label-type }}windows.4xlarge"
{%- else %}
runs-on: "${{ needs.get-label-type.outputs.label-type }}windows.4xlarge.nonephemeral"
{%- endif %}
timeout-minutes: !{{ common.timeout_minutes }}
!{{ upload.binary_env(config, True) }}
{%- if config.pytorch_extra_install_requirements is defined and config.pytorch_extra_install_requirements|d('')|length > 0 %}
@ -85,15 +99,17 @@ jobs:
!{{ common.wait_and_kill_ssh_windows('pytorch') }}
!{{ config["build_name"] }}-test: # Testing
if: ${{ github.repository_owner == 'pytorch' }}
needs: !{{ config["build_name"] }}-build
needs:
- !{{ config["build_name"] }}-build
- get-label-type
{%- if config["gpu_arch_type"] == "cuda" %}
{%- if branches == "nightly" %}
runs-on: windows.8xlarge.nvidia.gpu
runs-on: "${{ needs.get-label-type.outputs.label-type }}windows.g4dn.xlarge"
{%- else %}
runs-on: windows.8xlarge.nvidia.gpu.nonephemeral
runs-on: "${{ needs.get-label-type.outputs.label-type }}windows.g4dn.xlarge.nonephemeral"
{%- endif %}
{%- else %}
runs-on: windows.4xlarge.nonephemeral
runs-on: "${{ needs.get-label-type.outputs.label-type }}windows.4xlarge.nonephemeral"
{%- endif %}
timeout-minutes: !{{ common.timeout_minutes }}
!{{ upload.binary_env(config, True) }}

View File

@ -11,11 +11,16 @@ on:
required: true
type: string
description: The build environment
runner_prefix:
required: false
default: ""
type: string
description: prefix for runner label
runs_on:
required: false
default: linux.12xlarge
default: linux.12xlarge.ephemeral
type: string
description: Hardware to run this "build"job on, linux.12xlarge or linux.arm64.2xlarge.
description: Hardware to run this "build" job on, linux.12xlarge or linux.arm64.2xlarge.
timeout-minutes:
required: false
default: 210
@ -89,7 +94,7 @@ on:
jobs:
build:
runs-on: ${{ inputs.runs_on }}
runs-on: ${{ inputs.runner_prefix}}${{ inputs.runs_on }}
timeout-minutes: ${{ inputs.timeout-minutes }}
env:
PYTORCH_ROOT: ${{ inputs.PYTORCH_ROOT }}
@ -278,7 +283,7 @@ jobs:
# Ensure the working directory gets chowned back to the current user
docker run --rm -v "${RUNNER_TEMP}/artifacts:/v" -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" .
- uses: actions/upload-artifact@v3
- uses: actions/upload-artifact@v4.4.0
if: ${{ steps.filter.outputs.is-test-matrix-empty == 'False' }}
with:
name: ${{ inputs.build_name }}

View File

@ -59,6 +59,11 @@ on:
required: false
type: string
description: Desired python version
runner_prefix:
required: false
default: ""
type: string
description: prefix for runner label
runs_on:
required: true
type: string
@ -77,7 +82,7 @@ on:
jobs:
test:
runs-on: ${{ inputs.runs_on }}
runs-on: ${{ inputs.runner_prefix}}${{ inputs.runs_on }}
timeout-minutes: 240
env:
PYTORCH_ROOT: ${{ inputs.PYTORCH_ROOT }}
@ -205,7 +210,7 @@ jobs:
- name: Download Build Artifacts
if: ${{ steps.filter.outputs.is-test-matrix-empty == 'False' }}
uses: actions/download-artifact@v3
uses: actions/download-artifact@v4.1.7
with:
name: ${{ inputs.build_name }}
path: "${{ runner.temp }}/artifacts/"

View File

@ -126,7 +126,7 @@ jobs:
# NB: When the previous build job is skipped, there won't be any artifacts and
# this step will fail. Binary build jobs can only be skipped on CI, not nightly
continue-on-error: true
uses: actions/download-artifact@v3
uses: actions/download-artifact@v4.1.7
with:
name: ${{ inputs.build_name }}
path: "${{ runner.temp }}/artifacts/"

View File

@ -8,6 +8,11 @@ on:
type: string
description: |
A JSON description of what configs to run later on.
runner_prefix:
required: false
type: string
description: |
Prefix for runner label
defaults:
run:
@ -16,7 +21,7 @@ defaults:
jobs:
filter:
if: github.repository_owner == 'pytorch'
runs-on: [self-hosted, linux.large]
runs-on: [self-hosted, "${{ inputs.runner_prefix }}linux.large"]
outputs:
test-matrix: ${{ steps.filter.outputs.test-matrix }}
is-test-matrix-empty: ${{ steps.filter.outputs.is-test-matrix-empty }}
@ -59,7 +64,7 @@ jobs:
environment-file: .github/requirements/conda-env-${{ runner.os }}-${{ runner.arch }}
- name: Install Buck
uses: nick-fields/retry@3e91a01664abd3c5cd539100d10d33b9c5b68482
uses: nick-fields/retry@v3.0.0
with:
timeout_minutes: 10
max_attempts: 5
@ -69,7 +74,7 @@ jobs:
sudo apt install ./buck.2021.01.12.01_all.deb
- name: Download third party libraries and generate wrappers
uses: nick-fields/retry@3e91a01664abd3c5cd539100d10d33b9c5b68482
uses: nick-fields/retry@v3.0.0
with:
timeout_minutes: 10
max_attempts: 5

View File

@ -92,7 +92,7 @@ jobs:
fi
- name: Install brew dependencies
uses: nick-fields/retry@v2.8.2
uses: nick-fields/retry@v3.0.0
with:
timeout_minutes: 5
max_attempts: 3
@ -109,7 +109,7 @@ jobs:
pip-requirements-file: .github/requirements/pip-requirements-iOS.txt
- name: Setup Fastlane
uses: nick-fields/retry@v2.8.2
uses: nick-fields/retry@v3.0.0
with:
timeout_minutes: 5
max_attempts: 3
@ -292,7 +292,7 @@ jobs:
bundler-cache: true
- name: Download arm64 artifacts
uses: actions/download-artifact@v3
uses: actions/download-artifact@v4.1.7
with:
name: pytorch-ios-build-artifacts-arm64

View File

@ -34,6 +34,11 @@ on:
default: "5.2"
description: |
List of CUDA architectures CI build should target.
runner_prefix:
required: false
default: ""
type: string
description: Prefix for runner label
runner:
required: false
type: string
@ -77,6 +82,10 @@ on:
required: false
description: |
HF Auth token to avoid rate limits when downloading models or datasets from hub
SCRIBE_GRAPHQL_ACCESS_TOKEN:
required: false
description: |
FB app token to write to scribe endpoint
outputs:
@ -89,9 +98,10 @@ on:
jobs:
build:
environment: ${{ github.ref == 'refs/heads/main' && 'scribe-protected' || startsWith(github.ref, 'refs/heads/release/') && 'scribe-protected' || contains(github.event.pull_request.labels.*.name, 'ci-scribe') && 'scribe-pr' || '' }}
# Don't run on forked repos
if: github.repository_owner == 'pytorch'
runs-on: ${{ inputs.runner }}
runs-on: ${{ inputs.runner_prefix}}${{ inputs.runner }}
timeout-minutes: 240
outputs:
docker-image: ${{ steps.calculate-docker-image.outputs.docker-image }}
@ -99,6 +109,7 @@ jobs:
steps:
- name: Setup SSH (Click me for login details)
uses: pytorch/test-infra/.github/actions/setup-ssh@main
if: inputs.build-environment != 'linux-s390x-binary-manywheel'
with:
github-secret: ${{ secrets.GITHUB_TOKEN }}
@ -108,13 +119,16 @@ jobs:
# checkout. In other cases you should prefer a local checkout.
- name: Checkout PyTorch
uses: pytorch/pytorch/.github/actions/checkout-pytorch@main
with:
no-sudo: ${{ inputs.build-environment == 'linux-s390x-binary-manywheel' }}
- name: Setup Linux
uses: ./.github/actions/setup-linux
if: inputs.build-environment != 'linux-s390x-binary-manywheel'
- name: configure aws credentials
uses: aws-actions/configure-aws-credentials@v3
if: ${{ inputs.aws-role-to-assume != '' }}
if: ${{ inputs.aws-role-to-assume != '' && inputs.build-environment != 'linux-s390x-binary-manywheel' }}
with:
role-to-assume: ${{ inputs.aws-role-to-assume }}
role-session-name: gha-linux-build
@ -123,11 +137,13 @@ jobs:
- name: Calculate docker image
id: calculate-docker-image
uses: pytorch/test-infra/.github/actions/calculate-docker-image@main
if: inputs.build-environment != 'linux-s390x-binary-manywheel'
with:
docker-image-name: ${{ inputs.docker-image-name }}
- name: Use following to pull public copy of the image
id: print-ghcr-mirror
if: inputs.build-environment != 'linux-s390x-binary-manywheel'
env:
ECR_DOCKER_IMAGE: ${{ steps.calculate-docker-image.outputs.docker-image }}
shell: bash
@ -137,6 +153,7 @@ jobs:
- name: Pull docker image
uses: pytorch/test-infra/.github/actions/pull-docker-image@main
if: inputs.build-environment != 'linux-s390x-binary-manywheel'
with:
docker-image: ${{ steps.calculate-docker-image.outputs.docker-image }}
@ -164,6 +181,7 @@ jobs:
- name: Download pytest cache
uses: ./.github/actions/pytest-cache-download
continue-on-error: true
if: inputs.build-environment != 'linux-s390x-binary-manywheel'
with:
cache_dir: .pytest_cache
job_identifier: ${{ github.workflow }}_${{ inputs.build-environment }}
@ -185,13 +203,29 @@ jobs:
PR_LABELS: ${{ toJson(github.event.pull_request.labels.*.name) }}
TORCH_CUDA_ARCH_LIST: ${{ inputs.cuda-arch-list }}
DOCKER_IMAGE: ${{ steps.calculate-docker-image.outputs.docker-image }}
DOCKER_IMAGE_S390X: ${{ inputs.docker-image-name }}
XLA_CUDA: ${{ contains(inputs.build-environment, 'xla') && '0' || '' }}
DEBUG: ${{ inputs.build-with-debug && '1' || '0' }}
OUR_GITHUB_JOB_ID: ${{ steps.get-job-id.outputs.job-id }}
HUGGING_FACE_HUB_TOKEN: ${{ secrets.HUGGING_FACE_HUB_TOKEN }}
SCRIBE_GRAPHQL_ACCESS_TOKEN: ${{ secrets.SCRIBE_GRAPHQL_ACCESS_TOKEN }}
USE_SPLIT_BUILD: ${{ inputs.use_split_build }}
run: |
if [[ ${BUILD_ENVIRONMENT} == *"s390x"* ]]; then
JENKINS_USER=
USED_IMAGE="${DOCKER_IMAGE_S390X}"
# since some steps are skipped on s390x, if they are necessary, run them here
env | grep '^GITHUB' >> "/tmp/github_env_${GITHUB_RUN_ID}"
env | grep '^CI' >> "/tmp/github_env_${GITHUB_RUN_ID}"
else
JENKINS_USER="--user jenkins"
USED_IMAGE="${DOCKER_IMAGE}"
fi
# detached container should get cleaned up by teardown_ec2_linux
# Used for JENKINS_USER, which can be empty
# shellcheck disable=SC2086
container_name=$(docker run \
-e BUILD_ENVIRONMENT \
-e MAX_JOBS="$(nproc --ignore=2)" \
@ -214,10 +248,10 @@ jobs:
--cap-add=SYS_PTRACE \
--tty \
--detach \
--user jenkins \
${JENKINS_USER} \
-v "${GITHUB_WORKSPACE}:/var/lib/jenkins/workspace" \
-w /var/lib/jenkins/workspace \
"${DOCKER_IMAGE}"
"${USED_IMAGE}"
)
docker exec -t "${container_name}" sh -c '.ci/pytorch/build.sh'
@ -228,7 +262,7 @@ jobs:
- name: Store PyTorch Build Artifacts on S3
uses: seemethere/upload-artifact-s3@v5
if: inputs.build-generates-artifacts && steps.build.outcome != 'skipped' && !inputs.use_split_build
if: inputs.build-generates-artifacts && steps.build.outcome != 'skipped' && !inputs.use_split_build && inputs.build-environment != 'linux-s390x-binary-manywheel'
with:
name: ${{ inputs.build-environment }}
retention-days: 14
@ -238,7 +272,7 @@ jobs:
- name: Store PyTorch Build Artifacts on S3 for split build
uses: seemethere/upload-artifact-s3@v5
if: inputs.build-generates-artifacts && steps.build.outcome != 'skipped' && inputs.use_split_build
if: inputs.build-generates-artifacts && steps.build.outcome != 'skipped' && inputs.use_split_build && inputs.build-environment != 'linux-s390x-binary-manywheel'
with:
name: ${{ inputs.build-environment }}-experimental-split-build
retention-days: 14
@ -246,8 +280,26 @@ jobs:
path: artifacts.zip
s3-bucket: ${{ inputs.s3-bucket }}
- name: Store PyTorch Build Artifacts for s390x
uses: actions/upload-artifact@v3
if: inputs.build-generates-artifacts && steps.build.outcome != 'skipped' && !inputs.use_split_build && inputs.build-environment == 'linux-s390x-binary-manywheel'
with:
name: ${{ inputs.build-environment }}
retention-days: 14
if-no-files-found: error
path: artifacts.zip
- name: Store PyTorch Build Artifacts for s390x for split build
uses: actions/upload-artifact@v3
if: inputs.build-generates-artifacts && steps.build.outcome != 'skipped' && inputs.use_split_build && inputs.build-environment == 'linux-s390x-binary-manywheel'
with:
name: ${{ inputs.build-environment }}-experimental-split-build
retention-days: 14
if-no-files-found: error
path: artifacts.zip
- name: Upload sccache stats
if: steps.build.outcome != 'skipped'
if: steps.build.outcome != 'skipped' && inputs.build-environment != 'linux-s390x-binary-manywheel'
uses: seemethere/upload-artifact-s3@v5
with:
s3-prefix: |
@ -259,4 +311,13 @@ jobs:
- name: Teardown Linux
uses: pytorch/test-infra/.github/actions/teardown-linux@main
if: always()
if: always() && inputs.build-environment != 'linux-s390x-binary-manywheel'
- name: Cleanup docker
if: always() && inputs.build-environment == 'linux-s390x-binary-manywheel'
shell: bash
run: |
# on s390x stop the container for clean worker stop
# ignore expansion of "docker ps -q" since it could be empty
# shellcheck disable=SC2046
docker stop $(docker ps -q) || true

View File

@ -52,6 +52,10 @@ on:
required: false
description: |
HF Auth token to avoid rate limits when downloading models or datasets from hub
SCRIBE_GRAPHQL_ACCESS_TOKEN:
required: false
description: |
FB app token to write to scribe endpoint
env:
GIT_DEFAULT_BRANCH: ${{ github.event.repository.default_branch }}
@ -63,6 +67,7 @@ jobs:
strategy:
matrix: ${{ fromJSON(inputs.test-matrix) }}
fail-fast: false
environment: ${{ github.ref == 'refs/heads/main' && 'scribe-protected' || startsWith(github.ref, 'refs/heads/release/') && 'scribe-protected' || contains(github.event.pull_request.labels.*.name, 'ci-scribe') && 'scribe-pr' || '' }}
runs-on: ${{ matrix.runner }}
timeout-minutes: ${{ matrix.mem_leak_check == 'mem_leak_check' && 600 || inputs.timeout-minutes }}
steps:
@ -212,6 +217,7 @@ jobs:
PYTORCH_TEST_RERUN_DISABLED_TESTS: ${{ matrix.rerun_disabled_tests && '1' || '0' }}
DASHBOARD_TAG: ${{ inputs.dashboard-tag }}
HUGGING_FACE_HUB_TOKEN: ${{ secrets.HUGGING_FACE_HUB_TOKEN }}
SCRIBE_GRAPHQL_ACCESS_TOKEN: ${{ secrets.SCRIBE_GRAPHQL_ACCESS_TOKEN }}
run: |
set -x
@ -266,6 +272,7 @@ jobs:
-e PYTORCH_TEST_RERUN_DISABLED_TESTS \
-e SKIP_SCCACHE_INITIALIZATION=1 \
-e HUGGING_FACE_HUB_TOKEN \
-e SCRIBE_GRAPHQL_ACCESS_TOKEN \
-e DASHBOARD_TAG \
--env-file="/tmp/github_env_${GITHUB_RUN_ID}" \
--security-opt seccomp=unconfined \

Some files were not shown because too many files have changed in this diff Show More